Two-dimensional semi-parametric alignment of chromatograms.

Journal of Chromatography A, 1345 (2014) 193–199

Contents lists available at ScienceDirect

Journal of Chromatography A journal homepage: www.elsevier.com/locate/chroma

Two-dimensional semi-parametric alignment of chromatograms Wim P.H. de Boer ∗ , Jan Lankelma Dept. Molecular Cell Physiology, VU University, Amsterdam

a r t i c l e

i n f o

Article history: Received 7 December 2013 Received in revised form 7 April 2014 Accepted 10 April 2014 Available online 18 April 2014 Keywords: Comprehensive Two-dimensional Alignment Warp function

a b s t r a c t We present a comprehensive alignment algorithm that extends the semi-parametric approach to two dimensions. The algorithm is based on modeling shifts with a two-dimensional “warp function” such that the sample chromatogram – its shifts corrected with the warp function – is adjusted to the reference chromatogram by minimizing the squared intensity difference. A warp function approach has the advantage that overlapping peaks are easily dealt with compared to other proposed two-dimensional algorithms. Another advantage is that missing peaks are allowed if the absence of these peaks has little numerical effect on the warp function computation and if these peaks occur between existing peaks. Performance of the algorithm is demonstrated using GC × GC data from three batches of three diesel oil samples and LC-MS data from a mouse breast cancer data set. © 2014 Elsevier B.V. All rights reserved.

1. Introduction Various retention time alignment procedures for GC × GC have been proposed in the past that use a Correlation Optimized Warping (COW) type of algorithm in which the user has to choose some partition of the two-dimensional chromatogram domain [1–3]. Such type of algorithms has recently been assessed [4] and these authors have observed that such algorithms may distort peaks and change peak volume. They present a new, supervised, three-step algorithm, in which the user first selects a number of well-placed peaks to be aligned first. In the second step a distinction is made between first dimension displacements and second dimension displacements. First dimension displacements are linearly interpolated between the selected peaks of step 1. Second dimension displacements are calculated using a Voronoi decomposition of the chromatogram domain generated by the selected peaks of step 1 and the peak under consideration. The second dimension shift of that peak is then calculated as a weighted average of the second dimension shifts of nearby selected peaks. In the third step each signal value is re-interpolated on the basis of nearby signal values and is corrected for stretching and compression. The overall numeric-mathematical behavior of the algorithm is difficult to assess and the assumption of linear interpolation in the first dimension seems to be at variance with our experience that the warp function is often highly non-linear in the first dimension; see Section 4 below.

∗ Corresponding author. Jozef Israëlslaan 333, 2282 TJ Rijswijk (ZH), the Netherlands; Tel.: +31 70 3996929; fax: +31 20 5987229. E-mail address: [email protected] (W.P.H. de Boer). http://dx.doi.org/10.1016/j.chroma.2014.04.034 0021-9673/© 2014 Elsevier B.V. All rights reserved.

Other approaches typically use peak lists obtained after applying a peak-picking procedure onto the raw chromatogram data, and rely on well-defined, distinct peaks being present in both chromatograms [5–7]. Such algorithms have the benefit of being fast because the amount of data to be processed is greatly reduced by aligning only the peaks on the list. Overlapping peaks require special care and are problematic for closely eluted peaks [5] or are managed by approximating peaks as 2-D Gaussian [6]. Overlap is not an issue in our warp function approach. The multi-way decomposition method PARAFAC is a generalization of principle component analysis (PCA) [8,9]. It has the disadvantage that it requires the user to decide on the number of principal components, which is an essential decision for the method to work. Moreover it requires the elution profiles of these components to meet the condition that is the matrix holding the elution profiles in its columns preserves its “inner-product” structure from sample to sample. This condition is for instance satisfied if the elution profiles of all components are shifted by the same amount and if there is no peak broadening and the elution profile only exceeds the elution baseline on the inner chromatogram domain. The PARAFAC method is a linear approach to an essentially non-linear problem, and its performance degrades when different retention time shifts are present. In a “semi-parametric” approach proposed by Eilers [10,11] the retention time shifts along a time-axis are corrected with a “warp function” such that the sample chromatogram is adjusted, using the warp function, to correspond as much as possible with the reference chromatogram by minimizing the squared intensity difference. In other words, the focus is on modeling the warped time-axis. In this paper we extend the semi-parametric approach to

194

W.P.H. de Boer, J. Lankelma / J. Chromatogr. A 1345 (2014) 193–199

the simultaneous alignment of both retention time axes by using a two-dimensional warp function. We call our approach Curfit2D and we aim to have a comprehensive alignment method that is robust against missing peaks and concentration variation of compounds, and addresses the issues raised above. Previously the notion of a warp function has been extended to two dimensions in the context of image analysis [12]. In this approach the warp function is estimated to maximize a penalized likelihood functional, where the likelihood measures the similarity between reference and sample chromatogram images after warping and the penalty is a measure of distortion of the warp function. A conjugate gradient algorithm is used to estimate the warp function, which requires strategies to guard against becoming trapped in local sub-optima. It remains to be seen what this statistical approach has to offer in the context of GC × GC chromatograms where the warp function is generally not a twodimensional quadratic function. In our approach the warp function is by definition a continuous, once differentiable, function defined on a two-dimensional domain. Numerically this implies the need for continuity in the chromatogram profiles to be aligned. Step-like changes in the profile such as spikes therefore require some smoothing of the data. It is worth noting that the warp function varies much slower in time than the chromatogram profile such that the sampling density enough to construct the profile is certainly dense enough to construct the warp function according to the Nyquist-Shannon Sampling Theorem [13]. How similar the profiles of reference and sample chromatograms must be for the construction of the warp function is largely a matter of peak distribution across the chromatogram domain. Peaks act as weights in the computation of the warp function and the larger a peak is the more weight it contributes. Dominant peaks influence the computation most, and few of them, well spread-out, are sufficient to determine a global warp function. The remaining peaks, with a suitable number of splines chosen, contribute to a locally improved warp function. Missing peaks can be tolerated if their absence has little numerical effect on the warp function computation and if they occur between existing peaks. 2. Theory of algorithm Curfit2D A GC × GC chromatogram is a data matrix of intensity values measured against retention time x and retention time y. Given a reference chromatogram f(x, y) and a sample chromatogram g(x, y) on two-dimensional space (x, y), the warp function w(x, y) is defined as a continuous, once differentiable, function to minimize the measure of nearness defined by:

2

[f (x, y) − g(w(x, y))] d(x)d(y)

(1)

over a surface interval covering the space where analytes are present. The warp function is taken to be the Cartesian product of a univariant function u(x) on the x-axis and a univariant function v(y) on the y-axis: w(x, y) = (u × v)(x, y) = (u(x), v(y)) with function u(x) written as a series of cubic splines B3 ing to: u(x) − x =

[ak B3 k(x)],

with k = 1, ..., K,

(2) k (x) accord-

(3)

and function v(y) written as a series of cubic splines B3 l (y) according to:

v(y) − y =

[bl B3 l(y)],

with l = 1, ..., L.

(4)

Thus the warp function is a tensor product spline, expressed in terms of spline coefficients ak and bl . These coefficients are computed iteratively as follows: decompose each spline coefficient ak , bl into a known part ak , bl and an unknown part ak , bl , and thereby decompose the warp function into a known part w(x, y) and an unknown part w(x, y). Then solve the coupled set of quadratic programming problems [14]: Minimize (zaT x + ½xT Ha x) over x = (a1 , ..., aK )

(5)

and Minimize (zbT y + ½yT Ha y) over y = (b1 , ..., bL )

(6)

iteratively, with vectors za and zb and positive-definite matrices Ha and Hb in terms of known quantities: reference chromatogram f(x, y) and sample chromatogram g(w(x, y)) and its first derivatives. Add the linear constraints [ak − ak+1 ] ≤ ha ,

k = 1, ..., K − 1

(7)

and [bl − bl+1 ] ≤ hb ,

l = 1, ..., L − 1

(8)

to implement the condition of a monotonic non-decreasing warp function [15]; here ha and hb are the fixed sampling intervals of the knot sequences of the cubic splines. All spline coefficients are initially set to zero. The algorithm ends when the Root Mean Square (RMS) error, which is the square root of the measure of nearness normalized to the surface interval, has converged to a minimum relative error determined with an accuracy of at least four significant numerical digits. The number four of numerical digits is a compromise between the desired numerical accuracy of the alignment result and the associated number of iteration loops required. To summarize, the warp function is two-dimensional, being the direct product of two univariate warp functions, one for each of the two retention time axes. In each iteration step, both univariate warp functions are calculated in dependence of each other, and the final warp function is then the direct product of the two univariate warp functions after convergence of the iteration procedure. The quadratic programming problem is known to have a unique numerical solution. Therefore each step in the iteration scheme above is numerically stable, but whether the scheme has a well-convergent solution is not necessarily guaranteed. In fact, sometimes the iteration seems to oscillate between two limits, especially when the (chosen) number of splines is too high. The oscillation appears to be an intrinsic feature of the Curfit2D algorithm, because frequently the spline coefficients show a saw-tooth type of oscillation towards their limiting value. More study is required to understand the cause of this oscillation. For the time being, the user should try to achieve monotone convergence by running program Curfit2D with a different number of splines or in symmetrical mode (Appendix A, section 9). In case of a wellconvergent solution, however, the solution is likely to be correct for the chosen number of splines. The two numbers of splines are the only input parameters of the algorithm and are dependent upon the variability of the warp function that the user expects. Generally it is best to start with a low number of splines, such as four, for both retention-time axes and decide by inspection of the achieved RMS error and Matrix Correlation (MC) coefficient [16] on a higher number of splines. Too high a number manifests itself in the algorithm being slowly convergent, or not converging at all, and executing the (set) maximum number of 50 loops. The alignment result may, however, in this case still be useful. The best choice of spline number is in principle the one for which the RMS error is lowest and the MC coefficient is the highest. The Curfit2D run time depends, in order of importance, on the size of the GC × GC intensity matrix, on the number of iteration


Fig. 1. Flowchart of algorithm Curfit2D, showing the main steps of the alignment of the sample chromatogram to a reference chromatogram. The warp function is computed iteratively until the root mean square error between the two chromatograms converges to a relative error of four significant numerical digits.

loops needed to reach convergence, and on the number of splines chosen. The main steps in the algorithm are shown in a flowchart (Fig. 1). Please refer to Appendix A, for more technical detail and a full description of the two-dimensional alignment algorithm Curfit2D. 3. Materials and methods The Curfit2D algorithm has been implemented in a computer program with a graphical user interface under the name Curfit2D. It is a so-called multi-tab dialog box that runs in 32 bit Windows XP and 64 bit Windows 7 environment. It produces a number of output text files including the aligned intensity matrix of the sample chromatogram and a logbook type of summary of the alignment computations. It also offers the option to accelerate the iterative alignment computations by reducing the number of data points with the Component Detection Algorithm (CODA) algorithm [17], thereby also minimizing he contribution of noise. See Appendix B for further information. First the reference chromatogram is loaded, and then the sample chromatogram but limited to shared retention-time intervals with the reference chromatogram. Spurious signals due to column bleeding, for instance, could bias the alignment computation and should be avoided by a judicious choice of the reference chromatogram domain. Alternatively such signals may be given zero weight in the alignment computation by executing the weight option (Appendix A, section 5). Note that in principle this weight option can also be used to give selected peaks extra weight in the computation of the warp function. Program Curfit2D is fitted with two pre-processing algorithms: for baseline-correction and data-smoothing, respectively. Both algorithms are based on a moving window of minimal size, 5 × 5 in practice. The window sizes are specified in terms of data measuring points and must be odd-numbered in order to preserve the symmetry of peaks. The global quality of the alignment computation can be assessed from the MC coefficient [16] which measures the correlation

195

between the intensity matrices of the reference and the sample chromatogram. The global quality can also be assessed from the global shift between the reference and sample chromatogram still present after alignment. The global shift, which should be zero, is computed by cross-correlating the Total Ion Chromatogram (TIC) traces of these intensity matrices along both x-axis and yaxis (Appendix A, section 8). Together the MC coefficient and the remaining global shift give an impression of the quality of the alignment while running program Curfit2D. The final quality test is statistical analysis and visual inspection of the alignment results. Program Curfit2D allows the saving of the aligned sample chromatogram, the warp function, and the TIC traces of the intensity matrices as well as the heat map of their difference. All output forms can be visualized by using freeware visualization programs, such as Array Viewer [18] or Paraview [19]. Note that the TIC traces and the heat maps are displayed normalized with the sum of all ion intensity values of the sample chromatogram set equal to the sum of all ion intensity values of the reference chromatogram. To demonstrate the performance of the algorithm, we have selected two different data sets. The first data set comprises GC × GC data from three batches of three diesel oil samples, each batch from a different oil company. The second data set comprises LC-MS data taken from the Mouse Proteomic Technology Initiative (NCI, 2004). The following three factors are studied: two groups of mice (healthy and with breast cancer), one method of serum depletion (using a murine MARS column from Agilent, Palo Alto, USA, to remove three high-abundance proteins: albumin, IgG and transferrin), and two different laboratories. The diesel oil data files are in CSV format and hold a data matrix of 1600 rows of retention times ranging from 0 to 7.995 s and 1212 columns of retention times ranging from 8 to 9696 s. The data matrices have been pairwise aligned, to a total of 72 alignments, with 4 × 4 up to 8 × 4 splines. The remaining nine pairings, of a sample with itself, requires only a single iteration of the Curfit2D algorithm and yields the obvious 100% alignment. All alignments involved baseline correction and data smoothing, with a running window of 5 × 5 retention time measuring points. In addition it proved to be necessary to normalize the columns of the data matrix to reduce the dominant influence of a large (106 ) spiked peak over the other medium (103 ) peaks. This normalization serves the convergence of the Curfit2D algorithm and is reversed after computation of the warp function. The mouse breast cancer data set contains trypsin digested peptides, is highly complex, and shows high compound concentration variation. Shown here are the alignment results for four MARS serum depleted samples, two of them from healthy mice and two of them from mice with a tumor. The data files are in mzXML 2.1 format and have for this exercise for laboratory 1 (laboratory 2) been reduced to a data matrix of 274 (299) rows of retention times ranging from 1200 to 1500 s and the full unprocessed 26182 to 26396 (20193 to 20285) columns of m/z values ranging from 400 to 700 amu. An LC chromatogram may shift in time, but a mass spectrum does not; in fact, a mass spectrum should normally not require much alignment. So it was decided to run the Curfit2D algorithm in the one-dimensional mode for all 16 alignments, with 4 × 0 up to 8 × 0 splines, thus correcting only shifts along the LC-axis. In all alignments the CODA algorithm with an MCQ index value of 0.8 was used to reduce the number of MS data points by a factor of around five. Given the small number of LC data points no coarse sampling was applied on them. The benefit of running Curfit2D in one-dimensional mode over one-dimensional alignment algorithms is that the twodimensional chromatograms are kept in tact by aligning rows of the sample intensity matrix alongside the alignment of the x-axis. The distinction between different peaks is thereby maintained and the

196


Table 1 Matrix Correlation coefficients for 81 diesel oil sample alignments, after baseline correction.

1a 2a 3a 1b 2b 3b 1c 2c 3c *

1a

2a

3a

1b

2b

3b

1c

2c

3c

1 0.98 0.99 0.99 0.95 0.97 0.98 0.94 0.99

0.98 1 1.00 0.99 0.99 1.00* 1.00* 0.99 0.99*

0.99 1.00 1 1.00* 0.98 1.00 1.00* 0.98 1.00*

0.99 0.99 1.00 1 0.98 1.00 1.00 0.97 1.00*

0.95 0.99 0.98 0.98 1 0.99 0.99 1.00* 0.98

0.97 1.00* 1.00 1.00 0.99 1 1.00* 0.99 0.99*

0.98 1.00* 1.00* 1.00 0.99 1.00* 1 0.99 0.99

0.94 0.99 0.98 0.97 1.00* 0.99 0.99 1 0.97

0.99 0.99* 1.00* 1.00* 0.98 0.99* 0.99 0.97 1

No alignment required after baseline correction.

Table 2 Matrix Correlation coefficients for 81 diesel oil sample alignments, after baseline correction, data smoothing and alignment.

1a 2a 3a 1b 2b 3b 1c 2c 3c a

1a

2a

3a

1b

2b

3b

1c

2c

3c

1 0.99 0.99 1.00 0.96 0.99 0.99 0.97 1.00

1.00 1 1.00 1.00 1.00 1.00a 1.00a 0.99 0.99a

0.99 1.00 1 1.00a 1.00 1.00 1.00a 0.99 1.00a

1.00 1.00 1.00 1 1.00 1.00 1.00 0.99 1.00a

0.96 1.00 1.00 1.00 1 1.00 1.00 1.00a 0.99

0.99 1.00a 1.00 1.00 1.00 1 1.00a 0.99 0.99a

1.00 1.00a 1.00a 1.00 1.00 1.00a 1 1.00 1.00

0.98 0.99 0.99 0.99 1.00a 0.99 1.00 1 1.00

1.00 0.99a 1.00a 1.00a 0.99 0.99a 0.99 0.98 1

No alignment required after baseline correction.

discrete distribution of MS values retained. This option could easily be implemented into the computer code of algorithm Curfit2D (Appendix A, section 6) but should be regarded as a bonus: the real objective of program Curfit2D is full two-dimensional alignment. All computations were done on a 64-bit Z400 Workstation under Windows 7. 4. Results The diesel oil alignments are all of high quality, and virtually insensitive to which of the two chromatograms serves as reference chromatogram. The matrix correlations are presented in Table 1 after baseline correction and in Table 2 after subsequent processing: data smoothing and alignment. The matrix correlations are not very discriminative for the current set of diesel oil chromatograms. In fact, the matrix correlation does not change under alignment for 16 pairs, nor does the global shift for 7 of these pairs. The reason for this is that the current diesel oil chromatograms are very sparse in significant peaks. The major alignment correction proved to be the baseline correction: 19 pairs did not require any further alignment. The MC coefficients in Table 2 vary between 0.96 and 1 with a median value of 1, the relative RMS errors between 0.27 and 1 with a median value of 0.71, and the Curfit2D iteration cycles are between 3 and 14 with a median value of 8. Ten pairs were aligned with the “symmetry” option selected. The Curfit2D algorithm output is not necessarily symmetrical for pair-switching, but the algorithm has been fitted with an explicit symmetrical implementation option (see Appendix A, section 9). This basically doubles the algorithm’s execution time, but can lead to better and faster convergence. Five diesel oil pairs showed oscillatory behavior and were aligned with the “break” option selected (see Appendix B). This means that the alignment is interrupted when the RMS error has been determined with an accuracy of at least three significant numerical digits. Fig. 2 shows an Extracted Ion Chromatogram (EIC) of diesel oil pair 1a–2a both before any processing and after all processing: baseline correction, data-smoothing, and Curfit2D alignment. The data-smoothing applied here lowers the height of a peak and

Fig. 2. EIC traces for diesel oil pair 1a (black) and 2a (grey) along part of the 2nd axis at 0.1 s Retention Time 1, before any processing (above) and after baselinecorrection, smoothing and alignment (below). Note the absence of some peaks in diesel oil 2a.

broadens its base, but does not change its location and hardly its volume. Any change in location is the result of the Curfit2D alignment itself. Tests performed on single-peak model chromatograms support this observation; see also Appendix A, section 10. The baseline correction and data-smoothing is applied to both reference and sample chromatograms with equal-sized moving windows, in order to minimize the effect of these pre-processing steps on the computation of the warp function. The computed warp function for the current pair of chromatograms is shown in Fig. 3. The quality of the alignment can be judged from the heat map of reference and sample chromatogram. Here heat maps are shown for pair 1c–3c, one of the seven pairs mentioned above, for which neither the matrix correlation nor the global shift change under alignment. The heat map in Fig. 4 is shown after baseline correction, and the data have also been smoothed in order to compare with the aligned result in Fig. 5. Both heat maps have been produced using visualization program Array Viewer [18]. Because the heat map shows the difference of the chromatograms, the reference chromatogram has color red (positive values) and the sample chromatogram color blue (negative values). The color green signals values around zero. Observe how the elongated peaks before alignment become short after alignment, demonstrating good alignment. The fact that there is still some peak left is because the peak in the reference chromatogram differs in shape from the peak in the sample chromatogram. The heat maps shown here are part of the total heat maps and cover 400 rows of retention times ranging from 2.250 to 4.245 s and 750 columns of retention times ranging from 808 to 6800 s. How Curfit2D alignment changes peak volume has to do with the shape of the warp function. The warp function describes the local stretching and compression of the chromatogram domain needed to align the peaks of the sample chromatogram with those of the reference chromatogram. The discrete intensity sampling points of a peak are therefore interpolated over a larger


197

Fig. 5. Part of the Heat map for diesel oil pair 1c–3c after baseline correction, data smoothing and alignment. The red color represents the reference chromatogram and the blue color the sample chromatogram. The green color signals intensity values around zero.

Fig. 3. Warp function for diesel oil pair 1a–2a along 1st axis (above) and along 2nd axis (below).

Fig. 4. Part of the Heat map for diesel oil pair 1c–3c after baseline correction and data smoothing. The red color represents the reference chromatogram and the blue color the sample chromatogram. The green color signals intensity values around zero.

or smaller peak area and when considered to be the sampling points of a continuous peak, then peak volume is clearly not conserved under warp function alignment. When alternatively the peak’s signal strength is calculated by adding up all intensity values of the peak, and the sampling density of each of the two retention time axes is constant, one has a relative measure for comparing peak volumes across the chromatogram. An option to allow such a comparison has yet to be implemented into the Curfit2D algorithm. Once peaks have been identified after alignment, however, peak volumes are best determined from the unaligned chromatogram. The alignment outcome for the breast cancer mouse serum data is summarized in Table 3 showing the values of the MC coefficients for the alignment of the 16 pairings of the mouse samples obtained with the MARS depletion method. In addition the alignments of these sample pairs have relative RMS errors varying between 0.83 and 1.00 with a median value of 0.99 and the Curfit2D iteration cycles are between 1 and 21 with a median value of 2. All mice alignments were executed with the “symmetry” option selected. In addition all mice alignments were executed with both reference and sample intensity data smoothed with a 5 × 5 moving window to cope with the essentially discontinuous MS data. The smoothing of the MS data serves the iterative computation of the warp function while the warp function along the MS-axis is not updated. 5. Discussion The MC coefficient provides an independent criterion to judge the global quality of an alignment. However, the Curfit2D algorithm calculates it, like the relative RMS error, on the full intensity

Table 3 Matrix Correlation coefficients for 16 MARS sample alignments after alignment. MC

MARS1Normal

MARS1Tumor

MARS2Normal

MARS2Tumor

MARS1Normal MARS1Tumor MARS2Normal MARS2Tumor

1 0.65 0.04 0.03

0.64 1 0.04 0.03

0.04 0.04 1 0.5

0.02 0.02 0.53 1

Table 4 MS global shift for 16 MARS sample alignments, in number of MS data points. MS Shift (amu)

MARS1Normal

MARS1Tumor

MARS2Normal

MARS2Tumor

MARS1Normal MARS1Tumor MARS2Normal MARS2Tumor

– 0 −13276 −10353

0 – −13276 −10353

17256 17355 – 0

13404 13481 0 –

198


Fig. 6. EIC traces for the MARS serum depleted healthy (black) and tumor (grey) sample pair from laboratory 1 at m/z 587.55 amu before (above) and after (below) alignment. The composite structure of the tumor peak is evident after alignment.

matrices of the reference and sample chromatograms, so that it also includes all ‘noise’ traces. Therefore the MC coefficient will approach but not reach the value of 1.0 under normal circumstances. The diesel oil data are not noisy after the mild smoothing that has been applied. The cross-correlation of the TIC traces of the intensity matrices provides additional information on the global quality of the alignment. A good alignment should have zero global shifts along both axes. Such is the case for the intra-laboratory but not for the interlaboratory mouse sample pairs in Table 3. Indeed the latter samples differ by a large global shift in MS values: the shifts are about half of the total number of MS data points (Table 4). The alignments performed for the intra-laboratory sample pairs provide acceptable MC coefficient values and well aligned TIC traces and total mass spectra. The alignment for the MARS depleted normal and tumor sample pair from laboratory 1 (MARS1Normal-MARS1Tumor) for example yields a reasonable MC coefficient value of 0.64 and TIC traces and total mass spectra with correlation coefficient values of 0.70 and 0.99, respectively. The quality of the alignment is illustrated in Fig. 6 showing an Extracted Ion Chromatogram at 587.55 amu of this sample pair both before and after alignment. The suspected composite structure of the main Mars1Tumor peak becomes evident after alignment. The alignments for the inter-laboratory sample pairs have low MC coefficient values and involve considerable shift corrections primarily along the m/z axis. The MARS1Normal-MARS2Normal sample pair, for example, has a low MC coefficient value of 0.04 and TIC traces and total mass spectra with correlation values of 0.47 and 0.51, respectively. The global shift in retention time is actually equal to zero. The low MC coefficient value therefore has all to do with the mass measurement. Fig. 7 shows how dissimilar the profiles of the two chromatograms are.

Fig. 7. EIC traces for the MARS serum depleted healthy (black) and healthy (grey) sample pair from laboratory 1 and laboratory 2, respectively, at m/z 587.55 amu (above) and at m/z 658.30 amu (below) before any processing. Note how different the EIC traces from the two laboratories are.

The minor asymmetry in MC coefficient values when aligning the reference and sample chromatograms interchanged (cf. Tables 2 and 3) has partly to do with the order in which the chromatograms are loaded into Curfit2D: first the reference chromatogram, and then that part of the sample chromatogram that overlaps with the reference chromatogram as judged from overlapping entries in or intervals of the LC and MS recorded measuring points. The asymmetry is also partly due to the number of iteration loops executed by the Curfit2D algorithm and the round-off error in the associated convergence criterion. 6. Conclusion We have presented a new algorithm, Curfit2D, for the alignment of GC × GC data sets based on a two-dimensional formulation of the semi-parametric warp function. We have demonstrated the performance of the algorithm on a GC × GC diesel oil data set, and obtained excellent alignment results after applying baseline correction. We have also demonstrated the performance of Curfit2D on a complex mouse serum breast cancer data set from two different laboratories produced with similar type of LC-MS measuring equipment. While the intra-laboratory chromatograms have been properly aligned as indicated by their TIC traces and total mass spectra, the alignment of the inter-laboratory chromatograms is complicated by the obvious dissimilar profiles of the chromatograms. The comparison of LC-MS data files from different laboratories is an important challenge for the proteomics community similarly to the comparison of micro-array data files for the gene-expression community. More work is needed towards standardization of both measurement equipment and data analysis in both research areas.


Although it has not been demonstrated here, algorithm Curfit2D may also be useful for the alignment of LC × LC chromatograms. However, the loop sampling technique used in LC × LC experiments as opposed to the cryo-focusing technique used in GC × GC may result in an under-sampled first retention-time axis. The Curfit2D algorithm requires the reference and the sample chromatogram to share an intensity matrix of only three data rows and three data columns to generate a warp function. So the problem is not to generate a warp function, but to generate a warp function that is a fair representation of the retention time shifts between the two chromatograms. It is the intention to make the program freely available after acceptance of the publication, via the VU Molecular Cell Physiology website http://www.falw.vu/∼microb/mcf/research/tumcell.html. Conflict of Interest The authors declare that they have no conflict of interest. Acknowledgement The authors thank Paul H.C. Eilers (Erasmus MC) for his advice and the freely sharing of his MatLab files that motivated the development of the current algorithm. The first author is also grateful to Peter L. Horvatovich and Rainer P.H.Bischoff (RUG) for his introduction into and the cooperation on LC-MS alignment. Peter J. Schoenmakers and Gabriel Vivó-Truyols (UVA) are acknowledged for supplying the diesel oil data files.

199

Appendix A. Supplementary data Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.chroma.2014. 04.034. References [1] V.G. Mispelaar, A.C. Tas, A.K. Smilde, P.J. Schoenmakers, A.C. van Asten, J. Chromatogr. A 1019 (2003) 15. [2] K.M. Pierce, L.F. Wood, B.W. Wright, R.E. Synovec, Anal. Chem. 77 (2005) 7735. [3] D. Zhang, X. Huang, F.E. Regnier, M. Zhang, Anal. Chem. 80 (2008) 2664. [4] J. Gros, D. Nabi, P. Dimitriou-Christidis, R. Rutler, J.S. Arey, Anal. Chem. 84 (2012) 9033. [5] S. Peters, G. Vivo-Truyols, J. Marriott, P.J. Schoenmakers, J. Chromatogr. A 1156 (2007) 14. [6] F. Suits, J. Lepre, D. Peicheng, R. Bischoff, P. Horvatovich, Anal. Chem. 80 (2008) 3095. [7] S. Kim, I. Koo, A. Fang, X. Zhang, BMC Bioinformatics 12 (2011) 235. [8] H.A.L. Kiers, J.M.F. ten Berge, R. Bro, J. Chemometr. 13 (1999) 275. [9] R. Bro, C.A. Andersson, H.A.L. Kiers, J. Chemometr. 13 (1999) 295. [10] P.H.C. Eilers, Anal. Chem. 76 (2004) 404. [11] A.M. van Nederkassel, M. Daszykowski, P.H.C. Eilers, Y.J. vander Heyden, J. Chromatogr. A 1118 (2006) 199. [12] C.A. Glasbey, K.V. Mardia, J. Roy. Stat. Soc. B 63 (2001) 465. [13] C.E. Shannon, Proc Institute of Radio Engineers 37 (1949) 10. Reprinted as Classic Paper in Proc. IEEE 86 (1998) 447-457. [14] M.J.D. Powell, Math. Program. Stud. 25 (1985) 46. [15] V. Pravdova, B. Walczak, D.L. Massart, Anal. Chim. Acta 456 (2002) 77. [16] J.O. Ramsey, et al., Psychometrika 49 (1984) 403. [17] W. Windig, W.F. Smith, J. Chromatogr. A 1158 (2007) 251. [18] Array Viewer: http://www.intel.com/cd/software/products/asmo-na/eng/ compilers/226302.htm [19] ParaView. Parallel Visualization Application, Sandia Nat.Lab., http://www. paraview.org.

Pairwise alignment of chromatograms using an extended Fisher-Rao metric.

Semiparametric adjusted exposure-response curves.

Robust analysis of semiparametric renewal process models.

Bioautographic detection of mycotoxins on thin-layer chromatograms.

Semiparametric transformation models for semicompeting survival data.

Semiparametric location estimation under non-random sampling.

Regularized Semiparametric Estimation for Ordinary Differential Equations.

Inferring short-range linkage information from sequencing chromatograms.

Semiparametric M-quantile regression for count data.

Detection of haloacetyl and haloketone affinity labels on paper chromatograms.

Immunochemical detection of glycosphingolipids on thin-layer chromatograms.

Comparison of semiparametric receiver operating characteristic models on observer data.

SEMIPARAMETRIC ESTIMATION OF CONDITIONAL HETEROSCEDASTICITY VIA SINGLE-INDEX MODELING.

Semiparametric modeling of grouped current duration data with preferential reporting.

Semiparametric analysis of linear transformation models with covariate measurement errors.

High Dimensional Semiparametric Scale-Invariant Principal Component Analysis.

Bayesian semiparametric copula estimation with application to psychiatric genetics.

A Mass Balance-Based Semiparametric Approach to Evaluate Neonatal Erythropoiesis.

Causal estimation using semiparametric transformation models under prevalent sampling.

Semiparametric Bayes hierarchical models with mean and variance constraints.

Semiparametric methods for multistate survival models in randomised trials.

Semiparametric models for malaria rapid diagnosis test result.

Variable selection in strong hierarchical semiparametric models for longitudinal data.

Semiparametric Relative-risk Regression for Infectious Disease Transmission Data.