## Abstract

The use of polychromatic X-ray sources in tomographic X-ray measurements leads to nonlinear X-ray transmission effects. As these nonlinearities are not normally taken into account in tomographic reconstruction, artefacts occur, which can be particularly severe when imaging objects with multiple materials of widely varying X-ray attenuation properties. In these settings, reconstruction algorithms based on a nonlinear X-ray transmission model become valuable. We here study the use of one such model and develop algorithms that impose additional non-convex constraints on the reconstruction. This allows us to reconstruct volumetric data even when limited measurements are available. We propose a nonlinear conjugate gradient iterative hard thresholding algorithm and show how many prior modelling assumptions can be imposed using a range of non-convex constraints.

## 1. Introduction

Tomographic X-ray imaging is used widely in areas as diverse as medical diagnostics and non-destructive material testing. Using sets of X-ray projection images taken from different directions, tomographic inverse algorithms reconstruct volumetric images of an object's internal structure. Most algorithms in use today rely on a linear X-ray attenuation model which assumes that the X-ray source produces mono-energetic X-ray photons. However, most laboratory sources produce polychromatic X-ray beams and so, approximate pre-processing algorithms are often used to correct for some of the nonlinear effects prior to volumetric reconstruction.

While these pre-processing techniques often work well when imaging objects in which X-ray attenuation does not vary too widely, in settings where there is a mix of highly attenuating regions and low-attenuating regions, severe artefacts are introduced in the image reconstruction. Examples include metal implants imaged in human tissue or carbon composite components with metal inclusions.

As a motivating example, consider the simulated phantom shown in figure 1, which is made up of four materials, two materials with low X-ray attenuation factors that differ very little between materials (an aluminium outer ring with a silicon inner disc) and two highly attenuating materials (silver and gold inclusions). Figure 1 shows the mass attenuation for these materials at 80 keV. To differentiate the different materials, the greyscale has been compressed to better show the difference in the high- and low-attenuating regions (see the colourbar).

We simulated two different X-ray acquisitions, one with a monochromatic source at 80 keV and one with a polychromatic source with a maximum energy of 130 keV. Seven hundred and twenty fan beam projections were calculated for each system and reconstructed with the filtered backprojection (FBP) algorithm (using Matlab's image processing toolbox functions) and the algebraic reconstruction technique (ART; using the randomized Kaczmarz algorithm from the AIRTools toolbox [1]). The results are shown in figure 2. While our compressed colour map shows reconstruction artefacts in the monochromatic reconstructions on the right, the left polychromatic reconstructions clearly show additional artefacts typical for polychromatic X-ray reconstructions. These artefacts can be seen as shadows between high-attenuating regions (the darker area in the centre of the reconstruction) and brighter streaks that tangentially connect the boundaries of different high-attenuating objects.

The realization that polychromatic X-rays lead to a nonlinear transmission models is typically known as beam-hardening [2]. Many different techniques for the correction of this effect have been proposed over the years, from physical filtering of the X-ray source spectrum to reduce the nonlinear relationship between path length and X-ray absorption [2] to direct inversions of the full nonlinear model (2.6).

For single material objects, intensity linearization can be performed [2] based on phantom scans [3], initial uncorrected reconstructions [3,4] or simulation studies [5]. Measurements are interpolated [6–9] and this process is often iterated [6,7,10]. Intensity corrections can also be calculated for multiple material objects [3,8,11–16]. The initial reconstruction is segmented which again provides initial path length estimates. This process can then also be iterated. For objects made up of two materials with known attenuation coefficients, it is also possible to instead estimate the volume fraction for each spatial location [6,17,18].

## 2. The nonlinear X-ray observation model

We will here propose an advanced reconstruction algorithm that explicitly models nonlinear X-ray attenuation. Furthermore, we will impose additional regularization constraints, which will not only be able to overcome the general ill-conditioning of the nonlinear inverse problem, but will also allow us to recover volumetric images when few X-ray measurements can be taken.

To a first approximation, ignoring X-ray scattering, the fraction of photons with energy *E* emitted from the X-ray source that travel towards a given detector element *I*_{0}(**r**,*E*) and the number of photons with energy *E* counted at the detector *I*(**r**,*E*) are a function of the mass attenuation coefficient *μ*(**z**,*E*) which describes the likelihood with which X-ray photons with energy *E* are absorbed or scattered at location **z**. We here use **r** to label the line between the X-ray source and the detector element. As most X-ray detectors cannot distinguish photons of different energies, the detector output integrates over all energy levels.

This leads to the following observation model used previously in [6,13,19–29] and [30]:
2.1
where we assume that *I*_{0}(**r**,*E*) is independent of **r** so that we write *I*_{0}(*E*) instead.^{1}

### (a) The monochromatic model

For future reference, if we had a monochromatic source or if we were able to measure X-ray intensities *I*(**r**,*E*) in narrow energy bands, then we would have a simple linear relationship
2.2
This model is the basis for most traditional tomographic reconstruction algorithms such as FBP and algebraic reconstruction.

Algebraic methods start with a discretization of (2.2). Using as the unknown coefficient vector that approximates *μ*(**z**,*E*) in a spatial basis and using the vector **a**(**r**_{i}) to describe the integrals of each individual basis function along the path **r**_{i}. Let
2.3
be the vector containing all our normalized and log-transformed projection measurements. If we furthermore stack all vectors **a**(**r**_{i})^{T} into a matrix **A**, then we have the linear model
2.4
The bar over the vector here reminds us that this is the linear model for monochromatic X-ray interaction.

### (b) The polychromatic model

To use the nonlinear model in (2.1) for reconstruction, we also discretize the photon energies *E* [21]
2.5
where the sum is now over *L* different energy bands *E*_{l}.

We again collect the normalized and log-transformed observations *I*(**r**_{i}) into a single vector
2.6
Using the normalized X-ray intensity vector , we thus have the measurement model
2.7
where the exponent is evaluated component wise and where we introduce the matrix notation .

It is important to note that the columns in **X** are related to X-ray absorption at a certain X-ray energy level and so are the different entries in **I**. To use this model for a fixed set of energies, we would thus either need to know the source intensities at these energy levels or estimate **I** at the same time as **X** from the measurements. If we did this, then we could incorporate additional knowledge about the X-ray absorption properties of materials known to be present in a sample. However, if we estimate **X** without any prior assumption on the materials' absorption properties, then quantization of the energy levels is to some extent arbitrary. Thus, instead of quantizing the energy levels, we might as well quantize X-ray intensities **I**, which in turn would correspond to an unknown quantization of the energy levels. We can then estimate the average X-ray intensity and detector efficiency using a scan without the object or from a region of the detector not in the shadow of the object. This estimate can then be used in the denominator of (2.6) to normalize the observed X-ray intensities before taking the logarithm.

### (c) Measurement errors

So far, we have modelled the X-ray physics using the Beer–Lambert law. While this model is motivated by the photon nature of X-rays, it does not directly model the quantized, probabilistic nature of X-ray generation and detection, which follow Poisson-like distributions. In certain settings, especially those in which few X-rays are measured along each X-ray path, the influence of the X-ray statistics becomes more important and it is then advantageous to directly model the statistical distribution to formulate the inverse problem. In other settings, where the number of measured X-ray photons becomes large, the central limit theory allows us to make Gaussian noise assumptions.

Additional uncertainty in the measurement comes from system noise and from discretization and quantization effects. Spatial discretization is further degraded as X-ray sources and detectors are not ideal points in space but cover extended areas, while X-ray intensities are stored digitally in quantized form. Scattering [31] is another significant process that can degrade X-ray projection images and lead to reconstruction artefacts.

In this paper, we will make Gaussian noise assumptions, though our approach is readily modified to use Poisson models when appropriate. We will start the development of our approach using a maximum-likelihood estimate, though again, extension to maximum *a posteriori* estimates follows similar lines. Assuming a fixed noise variance, the maximum-likelihood estimate for a Gaussian model with observations described by (2.7) is then equivalent to the following minimization problem:
2.8

To demonstrate the benefits of this nonlinear optimization problem over the linear model, we used the same simulated polychromatic X-ray measurements that were used on the left-hand side of figure 2 and optimized (2.8) using 200 iterations of a gradient descent algorithm with fixed step size. We initialized the matrix **X** as **x**(** ϕ**+

**)**

*θ*^{T}, where

**x**is the ART reconstruction shown in figure 2 and where

**and**

*ϕ***are the photoelectric effect and Compton scattering basis functions proposed in [25].**

*θ*The results are shown in figure 3, where we show the estimated mass attenuation averaged over energy. This average is calculated as 2.9 where the exponent and logarithm are again evaluated component wise.

As can be seen, the beam hardening artefacts have been reduced though the performance is not as good as those achieved with monochromatic measurements. Part of this is due to the slow convergence of the gradient descent algorithm. This convergence is also the reason why we had to choose a sensible starting point for the algorithm, using a linear reconstruction to initialize the spatial distribution and a simple model of the energy-dependent X-ray attenuation to model the energy distribution.

## 3. Regularization with non-convex constraints

When using model (2.6) additional assumptions are often imposed. Let us write the matrix **X** as a factorization **X**=**DS**, where **D** and **S** are matrices. The matrices **D** and **S** can now be used to model different assumptions. For example, the *i*th column in **D** could contain the density of material *i* within each of the voxels and the *i*th row in **S** could contain the discretized attenuation spectrum for this material. Thus, if we assume that there are no more than *J* materials, then the matrix **X** will have rank at most *J*. Alternatively, the formulation can be used to describe the model used in [23–26], where **S** has two rows, one representing a known basis function to model the photoelectric effect and one basis function to model Compton scattering [25].

Different model assumptions also often lead to additional constraints on **D**. For example, the assumption in [21] that each voxel contains a single material leads to the constraint that **D** must have only a single entry per row. On the other hand, for the model where **S** contains basis functions for the photoelectric effect and for Compton scattering, Stonestrom *et al*. [24] proposed that the two coefficients in each row of **D** could be modelled using a single parameter. A similar parametrization of **D** is also used in [27].

### (a) Non-convex regularization

The constraints on **X** discussed in the previous paragraph, namely the requirement that **X** should be low rank or that **D** should have one sparse columns can be enforced using non-convex regularization constraints. Regularization is a common approach used to stabilize tomographic reconstruction, preventing noise amplification and reducing artefacts due to non-ideal sampling. Traditional Tikhonov regularization [32] as well as now standard total variation (TV) regularization [33] typically lead to convex optimization problems for which increasingly efficient algorithms are being developed [34]. Non-convex constraints, on the other hand, while offering increased flexibility in the way prior information can be modelled and exploited for regularization, lead to non-convex and often combinatorial optimization problems, which pose significantly more challenges. However, the advent of compressed sensing has, over the last 10 years, shown that many non-convex optimization problems can nevertheless be solved efficiently using polynomial time algorithms [35,36]. Initially, compressed sensing was built around sparse, finite dimensional data models, though this has been extended to more general non-convex constraints and infinite dimensional models [37,38]. Another important extension has been the recent move away from the linear forward model of traditional compressed sensing to nonlinear forward models [39]. These recent innovations can now also be applied to X-ray tomographic reconstruction under nonlinear observation models, such as that of equation (2.7). Before we look at the computational approaches that can be used to solve this inverse problem, we first introduce a range of possible non-convex constraints that might be useful in the tomographic reconstruction setting. As we will see, our algorithm will require the computation of projections of matrices or vectors onto the closest element in the non-convex constraint set so that we will pay particular attention to how this projection can be done efficiently for the different constraints we introduce.

### (b) Sparsity

Sparsity is a powerful constraint. In its simplest form, sparsity constraints restrict model parameters to be predominantly zero so that the effective degrees of freedom of a model can be much smaller than the number of model parameters.

#### (i) Sparsity of a vector

The canonical sparse model is a vector model, where vector entries are restricted to contain few non-zero entries. More formally, consider the vector in Euclidean space. We can denote the number of non-zero entries in this vector by ∥**x**∥_{0}. A sparsity constraint could then be written as
3.1
for some *K*<*N*. This is a non-convex constraint. If we have two *K* sparse vectors **x**_{1} and **x**_{2}, then their convex combination λ**x**_{1}+(1−λ)**x**_{2} is not guaranteed to be *K*-sparse. However, if both **x**_{1} and **x**_{2} have their non-zero elements at the same locations, then the linear combination will also be *K*-sparse. A *K*-sparse constraint thus restricts elements to lie on one of several subspaces. The set of all *K*-sparse vectors is thus a union of subspaces.

For an arbitrary vector **x**, calculation of the *K*-sparse approximation with the smallest least-squares error can be done extremely efficiently. All that is required is to sort the entries in **x** by decreasing magnitude and set all but the *K* largest elements of **x** to zero. We will call the calculation of the best *K*-sparse approximation of a vector a *projection* of the vector onto the union of subspaces.

The power of sparsity constraints comes from the fact that they are not restricted to the canonical basis, but that we can restrict coefficients in any basis. By carefully choosing the right basis, we are thus often able to model and constrain different characteristics of **x**. For example, for our tomographic X-ray problem, wavelet bases can provide good sparse models. Let ** Φ** be a basis matrix and define

**s**such that

**x**=

*Φ***s**. We then say that

**x**is sparse in the basis

**if**

*Φ***s**is sparse. If

**is orthonormal, then it is again easy to commute the best**

*Φ**K*-sparse approximation of

**x**in the basis

**. This is done by calculating**

*Φ***s**=

*Φ*^{T}

**x**, which is then again thresholded by keeping only the

*K*largest entries in magnitude of

**s**. However, if

**is no longer orthonormal, or if**

*Φ***is over-complete, that is**

*Φ***has more than**

*Φ**N*columns, then the estimation of the best

*K*-sparse approximation in this basis becomes a combinatorial problem in general and suboptimal algorithms have to be used.

#### (ii) Sparsity in matrices

We are interested in regularizing the matrix valued optimization problem (2.8) and for matrices, sparsity constraints are even more flexible. Mirroring the vector sparsity model, the simplest constraint could be a restriction on the total number of non-zero entries; however the matrix structure reflects underlying data properties and so it makes sense to treat columns separately. We might for example restrict matrix rows or columns to be sparse. A *K*-row-sparse matrix is then a matrix where all but *K* rows contain only zero entries. Again, for an arbitrary matrix **X** to compute the best approximation with a *K*-row-sparse matrix in terms of the Frobenius norm, we simply order the rows by the size of their Euclidean norm. We then set all but the largest *K* rows to zero, thus again projecting the matrix onto the closest *K*-row-sparse matrix.

Extensions of this principle again allow for the use of basis matrices to capture row or column structure. For example, a matrix **X** is row sparse in a basis ** Φ** if

**X**=

*Φ***Z**with

**Z**being row sparse. Similarly, a matrix

**X**is column sparse in a basis

**if**

*Φ***X**=

**Z**

**with**

*Φ***Z**being column sparse. Again, for orthonormal basis

**, calculating the best column or row sparse matrix in**

*Φ***can be done efficiently.**

*Φ*#### (iii) Block and tree sparse models

Often, additional structure can be imposed on a sparse model, further reducing the size of the constraint set. Block sparse vector models group elements of a vector into either non-overlapping or overlapping blocks. A *K* sparse block model then restricts the number of blocks that can contain non-zero entries. For arbitrary vectors **x**, computing the best block sparse approximation is easy if the model has non-overlapping blocks, but there are no efficient methods for overlapping blocks.

A special subset of block sparse models is a tree sparse model where a tree structure defines the blocks. A *K*-tree-sparse vector is then a rooted subtree with no more than *K* nodes. Optimal tree sparse models are difficult to compute, but fast approximate algorithms are available [40].

### (c) Rank constraints

For matrix models, another powerful non-convex constraint is the rank constraint [41]. By restricting matrices to have a rank of at most *K*, we in effect constrain the sparsity of the singular values. The difference from sparse models is however that the left and right singular vectors can vary continuously. While low-rank matrices with the same left and right singular vectors again lie on the same subspace, there are infinitely many potential left and right singular vectors and so there are infinitely many of these subspaces. Low-rank models thus restrict matrices to lie in one of infinitely many subspaces. For a general matrix **X**, computing the best rank *K* approximation can be done using the singular value decomposition followed by a hard thresholding of the singular values based on their magnitude, which again computes the projection onto the union of subspaces model.

### (d) Non-negative matrix factorization constraint

An extension of a low-rank constraint is the non-negative matrix factorization constraint. For a non-negative matrix **X**, a non-negative matrix factorization is a decomposition of **X** into two positive factors and . For *K* smaller than *M* and *N*, this is again a low-rank factorization; however, the positivity constraint on **D** and **S** imposes additional constraints. Computing the best approximation of a positive matrix as a product of two positive factors is not a straightforward task, yet approximate solutions can be computed using non-negative matrix factorization algorithms [42].

## 4. Iterative hard thresholding and nonlinear iterative hard thresholding

There are two common approaches to deal with non-convex constraints such as those discussed above. The strategy advocated in the original compressed sensing literature is convex relaxation. Here, a convex constraint set is specified that is as close as possible to the non-convex constraint. Surprisingly, in several sparse inverse problems, solving the relaxed convex problems is guaranteed to find good solutions to the original non-convex constraint. However, not all non-convex constraints lend themselves to efficient convex relaxation. Furthermore, the resulting convex algorithms are in many cases slower than alternative approaches.

The alternative to convex relaxation are so-called greedy algorithms. These methods solve simpler optimization problems in an iterative fashion. For sparse models, greedy pursuits are a fast family of methods that build up the sparse representation one (or a few) elements at a time. However, once chosen, a non-zero element is not normally un-selected so that the algorithm is unable to correct errors made in previous iterations. Furthermore, greedy pursuits are restricted to a small set of non-convex constraints. An alternative set of greedy methods are projection-based methods. We have highlighted above the way in which we can compute the best approximation of an element with an element of our constraint set. There are now a range of algorithms that iteratively project elements back onto the non-convex constraint while also trying to reduce the error between the observed measurements and the model. For general non-convex constraints and linear models, the compressed sensing matching pursuit (CoSaMP) [38,43] and the iterative hard thresholding (IHT) algorithms [37,44] are two common choices.

The work reported here is based on the IHT algorithm [44], its extension to nonlinear problems suggested in [39] and the use of the conjugate gradient acceleration suggested in [45].

### (a) Iterative hard thresholding

The IHT algorithm was developed for the inversion of an underdetermined linear system under sparsity constraints. In particular, for given observations **y** and matrix ** Φ**, IHT tries to solve the non-convex optimization problem
4.1

This is done with the simple iteration
4.2
where *P*(⋅) calculates the best *K* sparse approximation of a vector. The step size *μ* has to be chosen carefully to prevent instability. Under certain conditions on ** Φ** and

*μ*, IHT is guaranteed to find near optimal solutions to the above optimization problem [44].

The extension of the IHT algorithm to nonlinear problems starts with the observation that *Φ*^{T}(**y**−*Φ***x**^{n}) in the above expression is proportional to the negative gradient of the unconstraint cost function ∥**y**−*Φ***x**∥_{2}. Thus, if we were trying to optimize a more general nonlinear cost function
4.3
then we simply replace the appropriate gradient into the algorithm and use the iteration
4.4
again, for a suitable choice of *μ* and under certain conditions on *f*(**x**), this algorithm is guaranteed to solve the non-convex optimization problem with a certain precision [39].

While the previous discussion of the IHT algorithm concentrated on sparse models, other non-convex constraints can be enforced. This is achieved through a suitable replacement of the projection operator *P*(⋅). For example, if we want to estimate a matrix with rank less than *K*, then *P*(**X**) is the operator that finds the best rank *K* approximation to **X**. As discussed above, this can be computed using the singular value decomposition.

### (b) Conjugate gradient-based iterative hard thresholding for nonlinear observation models

Several approaches have been suggested to increase convergence of IHT type algorithms [46–48]. For the nonlinear X-ray model, we use the algorithm proposed in [39] for nonlinear compressed sensing problems; however, to improve convergence, we implement an acceleration similar to that proposed by Blanchard *et al*. [45], who developed a conjugate gradient-based variation of the linear IHT algorithm [44].

Our algorithm replaces the gradient descent step with a more general directional descent step. The update
4.5
is replaced by the more general update
4.6
where **d**^{n} is a descend direction. The computation of the descend direction is inspired by the conjugate gradient methods used for nonlinear optimization. In the first iteration or during a conjugate gradient restart iteration, we use **d**^{n}=−∇_{f}(**x**^{n}). In other iterations, **d** is a combination of the current gradient and the previous descend direction
4.7
where *β* is computed as . We here use the projection operator which projects the gradient at the current and the previous estimate onto the subspace , which is the subspace in the union of subspace model in which the current estimate **x**^{n} lies. This is an important difference from the more standard conjugate gradient method. In effect, we assume that our optimization problem is restricted to the subspace . Whenever we change subspaces (that is, whenever the projection *P*(⋅) projects onto a different subspace), then our optimization problem changes so that we restart the conjugate gradient method by using *β*=0 in the next iteration.

In this form, the algorithm can be used for non-convex constraint sets that are unions of linear subspaces. In this setting, the algorithm relies on four operations: the calculation of a cost function , the evaluation of its gradient , the nonlinear projection operator *P*(⋅) and the linear projection operator . For our matrix-valued problem with , we here use
4.8
and
4.9
To better understand the projection operators *P*(⋅) and , assume we have a vector **x**=[−1 0.2 4 −2 0.9]^{T} and we use a 2-sparse non-convex constraint set. The projection of **x** onto this set is *P*(**x**)=[0 0 4 −2 0]^{T}. We can then also specify the subspace as all those vectors that have non-zero entries in the third and fourth position, that is, all vectors that can be written as
4.10
The projection that projects onto this subspace simply sets the first, second and fifth entry in a matrix to zero. Alternatively, for matrix factorization models, we use a similar approach. Assume **X**≈**DS**, where **DS** is the best approximation of **X** as a low-rank (or a non-negative low-rank) decomposition. We then use the matrix **D** to define the subspace onto which we can project any **X** using **D**(**D**^{T}**D**)^{−1}**D**^{T}**X**.

This leaves us with the problem of estimating a good step size *μ*. We here perform a line search in direction **d**. This is done using a quadratic approximation to the line search cost function
4.11
where we again only optimize within the subspace . The approximation of fits a quadratic function to match the magnitude and gradient of *f*(**x**^{n}) evaluated at the current estimate **x**^{n} as well as the value and gradient of for a given test step size *a*. Thus, the step size *μ* used is that which achieves the minimum of the quadratic function. In addition to the conjugate gradient restart due to the algorithm moving from one subspace to another, standard nonlinear conjugate gradient restart conditions are also monitored (such as those based on a gradient orthogonality condition or on maximum iteration counts). Note that for matrix factorization-based constraints, the subspace changes continuously between iterations so that a subspace-based restart would mean that we simply use gradient descend. In this case, it makes sense to only use gradient orthogonality conditions or maximum iteration counts to define the conjugate-gradient restarts.

The algorithm can thus be summarized as follows:

— Input:

**X**^{[0]},*P*,*f*, and*g*—

**p**^{[0]}=0— if ||

*X*^{[0]}||=0–

else

–

—

— : Iterate (

*n*=1,*n*++) until some stopping criterion is met:— if

*n*=1 or if the conjugate gradient restart condition is satisfied–

*β*=0else

–

—

**d**=−0.5*g*(**X**^{[n]})+*β****p**^{[n−1]}—

*a*= linesearch(—

**X**^{[n+1]}=*P*(**X**^{[n]}+*a***d**)— subspace(

**X**^{[n+1]})—

**p**^{[n]}=**d**.

### (c) Extension to mixed projections

The above algorithm assumes that *P* projects onto one of several linear subspaces. While many of the constraints we have discussed above fall into this category, other constraints or combinations of constraints will lead to more general projections. For example, we know that the X-ray attenuation coefficients are positive. It thus would make sense to include positivity constraints into our non-convex constraints. This raises two issues: firstly, a model with a positivity constraint is no longer a subspace model and secondly, projecting onto the union of a non-convex subspace model and the positive orthant is far from trivial in most cases. We here use what we call mixed projections in which we first project onto the subspace model and then project onto the second constraint. The second constraint does not have to be convex (as the positivity constraint) but can be a more general constraint. For example, we might want to constrain the reconstruction matrix to be sparse in the wavelet domain as well as have a low-rank structure. We do this by first projecting onto the sparse wavelet model using a simple wavelet thresholding approach. This is then followed by a low-rank approximation.

We found that inclusion of this additional projection generally improves performance. In the above algorithm, we insert this projection step into the algorithm as the second to last line, that is, after the projection onto the non-convex constraint and the estimation of the subspace. While this works well with a projection onto the positive orthant as well as the non-negative matrix decomposition, for the more general low-rank projection, we found that we were no longer optimizing within a subspace. In this setting, we thus used the conjugate gradient method to optimize in the full space, that is, we compute *β* as (∥*g*(**x**^{[n]})∥^{2})/(∥(*g*(**x**^{[n−1]})∥^{2}), that is, using the unprojected direction vectors. Note however that we still conduct the line search within the subspace defined by the subspace projection.

## 5. Numerical results

### (a) Artificial data

Our simulations are based on two phantoms. The disc phantom in figure 1 and the Shepp–Logan phantom. The disc phantom simulates a phantom with four materials, an aluminium ring with inner silicon disc, four silver inclusions in the aluminium ring and two gold inclusions in the silicon disc. For the Shepp–Logan phantom, we partitioned the original Shepp–Logan phantom into different areas, one for each of the different grey levels. Each of these areas was then assumed to be either a uniform mixture of three materials (aluminium, iron and silver), or was assumed to be one of these materials. For the first case, mixture coefficients are generated randomly, while in the second case, materials were assigned to regions randomly.

In all simulations, mass attenuation coefficients are taken from the NIST database. We model an X-ray source operating at 120 keV using a tungsten target. The detector is assumed to be caesium iodide with a linear response. Energy levels were quantized in bins with centre energies of 30, 35, 40, 45, 50, 52.5, 55, 57.5, 60, 62.5, 65, 67.5, 70, 77.5, 85, 92.5, 100, 112.5 and 115 keV. Non-linear X-ray projections were generated on a different spatial grid from the grid used to generate the system matrix **A** for the reconstruction algorithm. For the reconstruction, we used an energy quantization of 10 levels with centre energies of 30, 40, 50, 55, 60, 65, 70, 85, 100 and 115 keV.

The disc phantom was measured using 40 cone-beam projections from equally spaced angles, covering 360°. Each projection was measured with a linear X-ray detector array with 181 sensor elements. Both the original phantom and the reconstruction were on a 128×128 spatial grid, with the difference that the simulated measurements used a grid rotated by 45°. Measurements from the Shepp–Logan based phantom were generated similarly with the difference that we only used 32 projections. For the reconstruction, we assumed knowledge of *I*_{0} but did not use knowledge of the mass attenuation coefficients for the three materials. Observations are then calculated as
5.1
When simulating the Shepp–Logan phantom in which each homogeneous region consists of a mixture of the three materials, then is a matrix with repeated randomly generated rows (with entries drawn from a uniform distribution and normalized to sum to 1), with a different row for each grey level in the Shepp–Logan phantom. To simulate the Shepp–Logan phantom in which each homogeneous region consists of a single material, the matrix **D** has a single non-zero entry per row, which is set to one. For each homogeneous region, the location of the non-zero entry is generated randomly. In both cases, is a matrix with the quantized mass attenuation coefficients for the three materials at the 19 different energy levels. *I*_{0} is a vector that counts the average number of X-ray photons in a given energy band transmitted by the source and converted to visible light in the scintillator. We added Gaussian noise to the observations, normalized to give a signal to noise ratio (SNR) of 40 dB.

In order to be able to compare linear reconstruction methods which estimate a single X-ray attenuation image and nonlinear methods which estimate attenuation images for each energy level, we here compare the performance in terms of the algorithms' ability to estimate the average attenuation calculated as 5.2 The average attenuation can be compared directly to the estimates from the linear estimates, while for the nonlinear estimates, we also average the estimated attenuation coefficients over energy levels before comparison.

Results for the disc phantom are shown in figures 4 and 5 for the linear reconstruction and the nonlinear reconstructions, respectively. We here compared the FBP algorithm available through the Matlab image processing toolbox, the ART implemented in the AIRTools toolbox [1] (using the randomized Kaczmarz version of the algorithm) with our conjugate gradient algorithm that uses the linear model with two different constraints, wavelet sparsity and wavelet tree sparsity. We also used a total variation constrained inversion using the TVAL3 code (http://www.caam.rice.edu/~optimization/L1/TVAL3/). The algebraic method performs better than FBPs. Including additional constraints further improves performance, with the TV regularized results giving the highest SNR here, though the beam hardening artefacts are still very strong. Also, none of the methods is able to allow us to distinguish between the aluminium and silicon.

The reconstructions with the nonlinear model are shown in figure 5. We here use our conjugate gradient method with a range of constraints, positivity, low rank, wavelet row-sparsity, wavelet-tree sparsity, a combination of wavelet row-sparsity and low rank, a combination of wavelet row-sparsity and non-negative matrix factorization and non-negative matrix factorization on its own. While some of the results have reduced beam-hardening artefacts, the artefacts are still visible here. However, the use of the matrix models has led to better SNR figures, especially for the model that combines wavelet row-sparsity with a non-negative matrix factorization. For the same constraint, the nonlinear model leads here to worse SNR results compared with the linear model. This is due to the fact that the nonlinear model has significantly more parameters. But once we start to use some of the constraints that are only available for matrix models, such as the non-negative matrix decomposition constraint, we see improvements over the linear model.

The results for the Shepp–Logan phantom are summarized in figures 6 and 7, where we show box plots of the SNR values achieved with the different methods for 10 different random instances of the two problem settings.

Again, while the non-convexly constrained linear approaches do in general work as well as if not better than the approaches based on the nonlinear cost using the same constraints, using more advanced constraints such as the non-negative matrix factorization can preform much better.

We found that we consistently got the best results either with a combination of wavelet sparsity in the spatial domain combined with an additional projection onto a non-negative matrix factorization or with the non-negative matrix factorization constraint alone. Looking at these results in the light of the fact that we are unable to calculate exact projections onto the union of these two constraints and the fact that our algorithm is primarily designed for subspace models (which the non-negative factorization is not) shows that the two constraints are capturing important complementary properties of the data. This demonstrates that there is a need to study these joint constraints in more detail in the future.

The only linear cost function-based result that is not matched by the nonlinear methods is the linear approach with the TV constraint. This suggests that the TV constraint is much more powerful for our phantoms than the related wavelet constraints we use here. This suggests the use of similar constraints also with our nonlinear model, which can be easily done by adding a TV regularization term to the nonlinear cost function. Combining this with the non-negative matrix factorization constraint is likely to offer additional benefits, though this is still under investigation.

### (b) Real data

To demonstrate the applicability of the technique to real data, we applied it to two real datasets, comparing standard FBP to the best-performing approach in the previous experiment.

The first dataset was acquired at 360 kV. We used 32 projections and a line array detector. The scanned object was a model of an engine block made of aluminium. The longest dimension of the object was about 15 cm. Reconstruction of a two-dimensional cross section using FBPs is shown in figure 8. The reconstruction with a nonlinear model and a sparsity and non-negative matrix factorization constraint is shown in figure 9. Due to the extremely small number of projections, severe artefacts are visible in the FBP. Using our nonlinear model and a wavelet sparsity constraint, the streak artefacts have been removed but the wavelet constraint has introduced some smoothing.

The second set of X-ray projections was acquired with 200 kV. The test object consisted of two concentric tubes. The outer tube was acrylic and the inner tube aluminium. Within the inner tube, aluminium wires of different diameters were attached. We used 200 projections acquired with a flat panel detector, taking only the central slice for reconstruction. The FBP reconstruction is shown figure 10*a*, whereas our nonlinear reconstruction with a positivity constraint is shown in 10*b*. The red oval highlights beam-hardening artefacts visible in the FBP, which are not visible in the reconstruction achieved with our approach.

## 6. Conclusion

In this paper, we have proposed a projected conjugate gradient algorithm that is able to solve nonlinear inverse problems under a range of non-convex constraints. We have applied the method to the inversion of the nonlinear X-ray transmission model to solve X-ray computed tomography reconstruction. This not only allowed us to address the beam-hardening problem but also allowed us to reconstruct images from few projections. Extensive simulation results on synthetic data have shown that advanced non-convex constraints, such as a combination of wavelet sparsity and non-negative matrix factorization, can have significant benefits over standard positivity constraints alone. While these non-convex constraints do not guarantee globally optimal solutions, we could show that when we initialized our method with a good linear reconstruction, then the non-convexly constraint nonlinear reconstructions were often better. However, the improvement also comes at the cost of increased computational complexity. For real data, inverting the nonlinear model often took hundreds of iterations to achieve the best performance.

The fact that TV regularization outperforms even our best nonlinear model-based estimate shows the power to the TV constraint for objects with uniform material density as we simulated here. It is likely that the inclusion of such a TV regularization term in our nonlinear model will offer similar benefits, especially when combined with the non-negative matrix factorization constraint. Using joint constraints has also demonstrated clear benefits. While we enforced these here with consecutive projections, recent studies (not reported here) have shown that an approach that averages the projections onto the individual constraints tends to perform better than the approach used here, though a detailed study of how this works in the tomographic setting is still to be undertaken.

The main drawback of our approach currently is the slow convergence due to the complex nature of the non-convex cost function. This was here addressed with a sensible initialization approach, though additional benefits are likely if better optimization strategies are adopted. One possible improvement might be the use of a better line search strategy. The quadratic approximation does not always seem to give a good step size and other approaches are currently under investigation.

## Funding statement

T.B. acknowledges support from EPSRC grant nos. EP/K037102/1, EP/J005444/1, the CCPi (funded through EPSRC grant no. EP/J010456/1) and the TSB (grant no. 101804) as well as a University of Southampton, Faculty of Engineering and the Environment ‘New Frontiers Fellowship’.

## Footnotes

One contribution of 11 to a theme issue ‘X-ray tomographic reconstruction for materials science’.

↵1 In practice,

*I*_{0}(**r**,*E*) is a function or**r**; however, the detector is normally calibrated so that photon counts are scaled to compensate for this non-uniformity in the source.

- Accepted March 17, 2015.

- © 2015 The Author(s) Published by the Royal Society. All rights reserved.