Recent advances in chemometric methods for plant metabolomics: A review

Lunzhao Yi, Naiping Dong, Yonghuan Yun, Baichuan Deng, Shao Liu, Yi Zhang, Yizeng Liang

PII: S0734-9750(14)00183-9
DOI: 10.1016/j.biotechadv.2014.11.008
Reference: JBA 6866

To appear in: Biotechnology Advances

Please cite this article as: Yi Lunzhao, Dong Naiping, Yun Yonghuan, Deng Baichuan, Liu Shao, Zhang Yi, Liang Yizeng, Recent advances in chemometric methods for plant metabolomics: A review, Biotechnology Advances (2014), doi: 10.1016/j.biotechadv.2014.11.008

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

ACCEPTED MANUSCRIPT

Recent advances in chemometric methods for plant metabolomics: a review

Lunzhao Yi a,*, Naiping Dong c, Yonghuan Yun b, Baichuan Deng e, Shao Liu d, Yi Zhang b, Yizeng Liang b

a Yunnan Food Safety Research Institute, Kunming University of Science and Technology, Kunming, 650500, China
b College of Chemistry and Chemical Engineering, Central South University, Changsha, 410083, China
c Department of Applied Biology and Chemical Technology, The Hong Kong Polytechnic University, Hong Kong, 999077, China
d Xiangya Hospital, Central South University, Changsha, 410008, China
e Department of Chemistry, University of Bergen, Bergen, N-5007, Norway

*Correspondence to: Lunzhao Yi, Yunnan Food Safety Research Institute, Kunming University of Science and Technology, Kunming, 650500, China. Tel.: +86 871 65920302. E-mail address: [email protected].

Abstract

This review focuses on recent and potential advances in currently available chemometric methods for data processing in plant metabolomics, especially for data generated by mass spectrometry (MS) techniques. Plant metabolomics has gradually come to be regarded as a valuable and promising biotechnology rather than an ambitious aspiration. We outline some significant developments in plant metabolomics, especially in the combination of modern chemical analysis techniques with dedicated statistical and chemometric data analysis strategies. Advanced techniques for the preprocessing of raw data, identification of metabolites, variable selection and modeling are illustrated. We believe that insights into these developments will help to narrow the knowledge gap between the molecular organization and the metabolic control of plants. We also discuss the limitations and perspectives in extracting information from high-throughput datasets.

Keywords: plant metabolomics; chemometrics; biomarker; identification of metabolites; data preprocessing; modeling

Contents

1. Introduction
2. Critique and discussion
2.1 Pre-processing of raw data
2.1.1 Noise filtering and baseline correction
2.1.2 Peak detection and deconvolution
2.1.3 Alignment
2.1.4 Normalization
2.2 Identification of metabolites
2.2.1 Standards for reporting metabolite identification results
2.2.2 Metabolite identification using GC-MS
2.2.3 Metabolite identification using LC-MS
2.3 Variable selection
2.3.1 Variable ranking
2.3.2 Variable subset selection
2.3.3 Variable selection considering the interaction effect among variables
2.4 Modeling of the data
2.4.1 Unsupervised methods
2.4.2 Supervised methods
2.4.3 Non-linear methods
2.4.4 Model tuning and model validation
2.5 One eye on the future
3. Conclusions

1. Introduction

Metabolomics refers to the comprehensive and quantitative analysis of metabolites and aims to gather as much metabolic information as possible from a biological system (Goodacre et al., 2004). It is attractive as a reproducible and efficient approach that can directly reflect biological events. Taking advantage of this approach, a number of applications to cell, human and plant systems have already been published or predicted (Bertrand et al., 2014, Deborde et al., 2011, Rasmussen et al., 2012, Toya and Shimizu, 2013). Recently, plant metabolomics has progressed rapidly from a promising concept to a widespread and valuable biotechnology (Cusido et al., 2014, Davey et al., 2005, Hall, 2006). The information gained from plant metabolomics reflects a biological endpoint in much more detail than that obtained from transcriptomics or proteomics. So far, it has been applied to the quality control of crop plants (Biais et al., 2012, Osorio et al., 2012), plant ecology (van Dam and Meijden, 2011), the study of stress biology in plants (Genga et al., 2011), natural product discovery (Kell, 2006), and other areas. Advances in plant metabolomics have increased exponentially in recent years (see Figure 1). At the same time, two modern analytical platforms, namely nuclear magnetic resonance (NMR) and mass spectrometry (MS), have become the methods of choice for metabolic analysis and have been generating massive amounts of data to answer the biological questions in plant metabolomics (Allwood and Goodacre, 2010, Allwood et al., 2012, Kim et al., 2011, Kueger et al., 2012).

Insert Figure 1

The raw data from metabolomics are the ultimate source of information and, in turn, of final knowledge (Goodacre, 2005). Turning messy metabolic information into valuable knowledge requires considerable data analysis, including data preprocessing, statistical analysis, and suitable data storage (Allwood et al., 2008, Goodacre, Vaidyanathan, 2004). Improvements in analytical technologies have made metabolomics datasets progressively larger and more intricate in their inner structure (Boccard and Rudaz, 2014). This makes the coverage of metabolomics more comprehensive but, as a result, increases the demand for chemometrics (van der Greef and Smilde, 2005). The bottlenecks of plant metabolomics lie not only in sample preparation and analytical platforms but also, even more importantly, in data analysis. Major changes in the dimensionality and complexity of datasets have led to a significant shift in knowledge discovery. To exploit metabolomics data to the fullest extent, chemometrics has become a crucial and dedicated tool for extracting valuable information from messy data (Boccard and Rudaz, 2014, Wolfender et al., 2013). Chemometrics offers theory and methodology for every step of metabolomics research, including sampling, experimental design, data pre-processing, metabolite identification, variable selection and modeling. It therefore matches the requirements of metabolomics research well and is one of the cornerstones of plant metabolomics. On the other hand, the complexity of metabolomics also poses substantial challenges for chemometrics in dealing with such massive high-dimensional data (van der Greef and Smilde, 2005).

Many review papers and guide books on plant metabolomics have been published (BaniMustafa and Hardy, 2012, Hall, 2011, Kim and Verpoorte, 2010, Villas-Bôas et al., 2005), providing informative and valuable guidance for researchers. Insights into experimental techniques in metabolomics, including sample preparation and metabolite analysis, have also been published this year (Ernst et al., 2014). Here, we focus on recent advances in chemometric methods for the data analysis of plant metabolomics. This review gives a brief but broad overview of the methods developed so far, the challenges remaining in the analysis of plant metabolomics data, specifically data generated by MS, and perspectives on this topic. Various aspects are discussed, including raw data pre-processing, metabolite identification, variable selection and modeling. The flowchart of data analysis in plant metabolomics is shown in Figure 2.

Insert Figure 2

2. Critique and discussion

2.1 Pre-processing of raw data

Analytical instruments do not provide clean and comparable lists of metabolites, so raw data must be processed in several ways to generate a usable data matrix, including noise filtering and baseline correction, peak detection and deconvolution, alignment, and normalization (Castillo et al., 2011). Data preprocessing is a very important step in metabolomics data analysis. The key task here is to eliminate variance and bias during data analysis, so as to reduce complexity and enhance metabolically significant signals (Smith et al., 2006). The development of algorithms and tools for data preprocessing is a critical issue in bioinformatics and chemometrics research. As a result, many algorithms have been developed, and multiple open-source programs are used to process raw MS data acquired by liquid chromatography-mass spectrometry (LC-MS) or gas chromatography-mass spectrometry (GC-MS). Among these tools, XCMS (https://xcmsonline.scripps.edu/) (Benton et al., 2008, Smith, Want, 2006), MZmine (http://sourceforge.net/projects/mzmine/) (Katajamaa et al., 2006, Pluskal et al., 2010), OpenMS (http://open-ms.sourceforge.net/) (Sturm et al., 2008) and MetAlign (http://www.metalign.nl) (De Vos et al., 2007, Keurentjes et al., 2006, Moco et al., 2006, Tikunov et al., 2005) have attracted particular attention owing to their practicability and effectiveness, and most of the metabolomics research community works with them. In addition, new programs have been steadily developed to increase the quality and efficiency of data preprocessing, such as MetSign (Wei et al., 2011), MSFACTs (Duran et al., 2003), TagFinder (Luedemann et al., 2008, Luedemann et al., 2012), MET-IDEA (Lei et al., 2012), MathDAMP (Baran et al., 2006), and MetaboliteDetector (Hiller et al., 2009). It should be pointed out that most of these tools, as well as others, are open-source and can be downloaded freely from the internet. Furthermore, it is convenient to exchange such algorithms and data within the community. Generally, tools for raw data preprocessing contain four basic modules, namely noise filtering and baseline correction, peak detection and deconvolution, alignment, and normalization. In the following, we introduce different chemometric algorithms and strategies for these modules.

2.1.1 Noise filtering and baseline correction

Noise filtering is designed to separate component signals from the background originating from the chemical matrix or instrumental interference, and to remove measurement noise and baseline errors (Katajamaa and Orešič, 2007). Conventionally, in baseline correction of one-way data (chromatographic or mass direction), the two ends of a signal peak are marked manually by the analyst, and a piecewise linear approximation is then applied to fit the baseline (Zhang et al., 2010). However, this process is manual and time-consuming, and its accuracy depends heavily on the user's skill (Jirasek et al., 2004). To solve this problem, a large number of algorithms have been developed for better estimation of the baseline. In 1977, Pearson proposed a classic baseline estimation algorithm (Pearson, 1977). It works iteratively and inspects the points in a specific interval, taking their standard deviation into account. For this method, the selection of parameters is extremely important; even a slight misjudgment can lead to unacceptably large deviations. To overcome this weakness, numerous modifications have been developed to make baseline correction faster, more robust and more automatic. The proposed algorithms include improved iterative polynomial fitting (Gan et al., 2006), wavelet transform for de-noising (Shao et al., 2003), a low-order polynomial smoothing filter based on the Savitzky-Golay algorithm (Wang et al., 2003), iterative asymmetric least-squares estimation (Eilers, 2004), and the elimination of background spectrum (EBS) method (Boelens et al., 2004). Most recently, two powerful algorithms, selective iteratively reweighted quantile regression (SirQR) (Liu et al., 2014a) and adaptive iteratively reweighted penalized least squares (airPLS) (Zhang, Chen, 2010), were developed by Liang's group. These algorithms can remove the baseline automatically and effectively, regardless of whether it is linear or non-linear. Furthermore, they require neither user intervention nor prior knowledge, such as peak detection, and they are fast and robust.
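The conventional procedure described above, in which the analyst marks peak-free points and a piecewise linear baseline is fitted through them, can be illustrated with a minimal sketch. The function names and the toy chromatogram are hypothetical and chosen only for illustration; this is not an implementation of airPLS, SirQR or any other cited algorithm.

```python
def piecewise_linear_baseline(signal, anchors):
    """Interpolate a baseline linearly between analyst-chosen anchor indices."""
    anchors = sorted(set(anchors))
    if len(anchors) < 2:
        raise ValueError("need at least two anchor points")
    # Points outside the anchored range keep the signal value (corrected to zero).
    baseline = list(signal)
    for left, right in zip(anchors, anchors[1:]):
        y0, y1 = signal[left], signal[right]
        for i in range(left, right + 1):
            frac = (i - left) / (right - left)
            baseline[i] = y0 + frac * (y1 - y0)
    return baseline


def subtract_baseline(signal, anchors):
    """Return the baseline-corrected signal."""
    baseline = piecewise_linear_baseline(signal, anchors)
    return [s - b for s, b in zip(signal, baseline)]


# Toy chromatogram (assumed data): linear drift plus one triangular peak.
drift = [0.5 * i for i in range(11)]
peak = [0, 0, 0, 2, 6, 10, 6, 2, 0, 0, 0]
signal = [d + p for d, p in zip(drift, peak)]

# The anchors are the peak-free points an analyst would mark by hand.
corrected = subtract_baseline(signal, anchors=[0, 2, 8, 10])
```

The result depends entirely on where the anchors are placed, which is the user dependence that the automatic methods aim to remove.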

For MS-based datasets, methods for removing random noise are typically implemented using traditional signal processing techniques from chemometrics, such as the moving average window (Radulovic et al., 2004) and the median filter (Hastings et al., 2002) in the chromatographic direction, and Savitzky-Golay local polynomial fitting (Wang, Zhou, 2003) and wavelet transformation (Li et al., 2005) in the m/z direction. Noise filtering of LC-MS data is more complicated than that of GC-MS data because both chemical noise and random noise are present. Chemical noise is typically induced by molecules in buffers and solvents and can be especially strong at the beginning and the end of the elution (Hilario et al., 2006), while random noise is mainly attributed to the detector. Chemical noise causes a shift in the baseline in the intermediate mass range of LC-MS spectra. To resolve this problem, many filtering methods have been proposed. For example, Haimi et al. fitted the baseline by first segmenting a spectrum and then performing a linear regression through the lowest points of the smoothed spectrum segments (Haimi et al., 2006). Baseline removal has also been approached by estimating the background from a two-dimensional intensity image and then removing it with two orthogonal (retention time and m/z) one-dimensional passes (Bellew et al., 2006).
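The two simplest filters mentioned above for the chromatographic direction, the moving average window and the median filter, can be sketched as follows. The window handling at the edges (shrinking the window) and the toy trace are assumptions made for this illustration.

```python
from statistics import median


def _windows(signal, window):
    """Yield the slice of `signal` centred on each point (shorter at the edges)."""
    if window % 2 == 0 or window < 1:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    for i in range(len(signal)):
        yield signal[max(0, i - half):min(len(signal), i + half + 1)]


def moving_average(signal, window=3):
    """Replace each point by the mean of its neighbourhood."""
    return [sum(w) / len(w) for w in _windows(signal, window)]


def median_filter(signal, window=3):
    """Replace each point by the median of its neighbourhood."""
    return [median(w) for w in _windows(signal, window)]


# Toy trace (assumed data): a flat signal with a single one-point spike.
noisy = [1.0, 1.0, 9.0, 1.0, 1.0, 1.0]
smoothed = moving_average(noisy)
despiked = median_filter(noisy)
```

On this trace the median filter removes the spike completely, whereas the moving average only spreads it over the neighbouring points, which is why the two are often used for different kinds of noise.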

2.1.2 Peak detection and deconvolution

The purpose of peak detection and deconvolution is to identify and quantify the signals corresponding to the molecules (e.g. the metabolites) in a sample (Castillo, Gopalacharyulu, 2011). It is the fundamental step for downstream data analysis, such as profile alignment and biomarker identification, and it reduces the complexity of the data (Katajamaa and Orešič, 2007). However, owing to the complexity of the signals and the multiple sources of noise in the data, automatic discrimination between noise and compound signals is very difficult. The threshold between noise and signal is hard to set, especially when detecting peaks with low response values. A good peak detection method should identify true signals correctly and avoid false positives. A major problem is that high response values do not always indicate real peaks, because some sources of noise can also produce high signals. Conversely, low peaks may correspond to real signals. Therefore, constraints on peak shape and criteria for minimal intensity, area or signal-to-noise ratio are widely applied to distinguish real peaks from noise. Several parameters generally need to be adjusted to match the characteristics of the MS-based data. Traditionally, peak detection algorithms have followed two strategies, based either on derivative techniques or on the matched filter response (Felinger, 1998). Derivative-based peak detection methods make use of the fact that the first derivative of a peak has a positive-to-negative zero-crossing at the local maximum of the peak (Vivó-Truyols et al., 2005). Derivative-based methods commonly require increasingly elaborate pre-processing to prevent compounding noise effects (Krishnan et al., 2012, Pierce and Mohler, 2012). A threshold on the slope is often imposed to avoid false positives.
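The derivative-based strategy can be sketched in a few lines: a peak is reported where the first difference changes sign from positive to negative, subject to a slope threshold and a minimum height. The function name, the parameter names and the toy trace are hypothetical; real implementations usually smooth the signal first.

```python
def detect_peaks(signal, min_height=0.0, min_slope=0.0):
    """Return indices where the first difference crosses zero from + to -.

    min_slope rejects shallow sign changes caused by noise;
    min_height rejects low-intensity maxima.
    """
    diff = [b - a for a, b in zip(signal, signal[1:])]
    peaks = []
    for i in range(1, len(diff)):
        rising = diff[i - 1] > min_slope
        falling = diff[i] < -min_slope
        if rising and falling and signal[i] >= min_height:
            peaks.append(i)
    return peaks


# Toy trace (assumed data): two real peaks and one small noise bump at index 7.
trace = [0, 1, 4, 9, 4, 1, 0, 0.2, 0.1, 0, 2, 7, 3, 0]
all_maxima = detect_peaks(trace)
filtered = detect_peaks(trace, min_height=1.0, min_slope=0.5)
```

Without thresholds the noise bump is reported as a peak; with the slope and height criteria only the two larger maxima remain, which illustrates why such parameters need to be tuned to the data.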

Matched filtering is achieved by applying a linear filter designed to detect the presence of a particular pulse event with a known structure embedded in additive noise (North, 1963). When applied to chromatographic data, a threshold can be set on the response function to locate chromatographic peaks, assuming a Gaussian peak shape (Danielsson et al., 2002). Matched filter methods are becoming progressively more sophisticated as data complexity increases. Several popular or open-source software packages are based on this approach, such as MapQuant (Leptos et al., 2006) and XCMS (Smith, Want, 2006). The XCMS procedure includes three steps: binning, signal determination and filtering. One weakness of the method initially proposed in XCMS is that a peak can sometimes be split between two adjacent m/z bins. One potential solution involves combining adjacent extracted ion chromatograms that represent the analytes of interest, but this algorithm cannot resolve pairs of co-eluting peaks that fall within half an m/z bin. The developers of XCMS therefore added another algorithm, called centWave, to solve this problem (Tautenhahn et al., 2008). The centWave algorithm uses the continuous wavelet transform (CWT) to detect chromatographic peaks with different widths and intensities. Every peak is checked against the maximum value of the centroid peak within the estimated peak boundaries. In addition, the CWT has been applied to build a robust pattern matching method for MS peak detection. The CWT is applied directly to the raw spectrum, and the information from the two-dimensional matrix of CWT coefficients is utilized. By identifying peaks and assigning signal-to-noise ratios in the wavelet space, the pattern matching problem is simplified. The issues surrounding baseline correction are removed at the same time, so preprocessing steps such as noise filtering and baseline correction are not required before peak detection (Du et al., 2006).

Selecting an optimal threshold for the two strategies mentioned above is an essential and difficult problem, which has been discussed thoroughly for various peak detection approaches (Hastings, Norton, 2002, Leptos, Sarracino, 2006, Vivó-Truyols, Torres-Lapasió, 2005) without a general consensus being reached. Recently, some algorithms based on Bayesian inference have been developed (Lopatka et al., 2014, Vivó-Truyols, 2012). These algorithms make use of chromatographic information (i.e. the expected width of a single peak and the standard deviation of the baseline noise), which is regarded as prior information. The probability that a signal is a peak is then estimated on the basis of theories or hypotheses such as statistical overlap theory (Lopatka, Vivó-Truyols, 2014).

In the high-throughput analysis of metabolites, overlapping peaks are unavoidable. This kind of problem can be resolved by mass spectral deconvolution or two-dimensional data resolution methods, which have been well developed by the chemometrics community using matrix computation combined with the characteristics of MS data (Hantao et al., 2012, Liang and Kvalheim, 2001, Ruckebusch and Blanchet, 2013). So far, many multivariate resolution methods, such as heuristic evolving latent projections (HELP) (Kvalheim and Liang, 1992, Liang et al., 1992), evolving factor analysis (EFA) (Maeder, 1987), subwindow factor analysis (Manne et al., 1999), alternative moving window factor analysis (AMWFA) (Zeng et al., 2006) and evolving window orthogonal projection (EWOP) (Xu et al., 1999), have been employed in different application fields. These methods provide a strong ability to resolve overlapped peaks, even embedded chromatographic peaks, into the pure chromatograms and mass spectra of the components in a mixture. With the help of these deconvolution methods, the coverage of metabolites in a single run can be enlarged with existing analytical instruments, and the identification accuracy of metabolites can be improved. For example, AMWFA has been successfully applied to resolve the overlapped peaks in the GC-MS analysis of Pericarpium Citri Reticulatae Viride (PCRV) and Pericarpium Citri Reticulatae (PCR) (Wang et al., 2008), as shown in Figure 3. More volatile components were identified in this study with the help of AMWFA, and the identification accuracy for the overlapped peaks was significantly increased. In addition to these resolution methods, the Automated Mass spectral Deconvolution and Identification System (AMDIS, NIST) and commercially available tools, such as Deconvolution Reporting Software (DRS, Agilent), AnalyzerPro (SpectralWorks) and ChromaTOF® (LECO), can also be used for this purpose.

Insert Figure 3

2.1.3 Alignment

Shifts in retention time or m/z are inevitable in the experimental analysis of metabolites. Experimental factors, including temperature, column, pH, and sample carryover or degradation, can lead to deviations that affect the overall signal. The alignment of detected features across samples aims to remove the shifts among samples for a given signal, which guarantees that useful information can be extracted by chemometric or informatics methods in the subsequent steps. Peak shifts have a strong impact on multivariate statistical analysis, such as principal component analysis (PCA) and partial least squares-discriminant analysis (PLS-DA). Inappropriate peak alignment can lead to entirely spurious classification and biomarker screening results. So far, several alignment techniques have been developed to minimize run-to-run shifts (Bylund et al., 2002, Prince and Marcotte, 2006, Tomasi et al., 2004). Chromatographic systems coupled with sophisticated detection instruments, e.g. LC-MS, yield large amounts of two-dimensional data in the analysis of metabolites.

If traditional peak alignment methods are used to process these data, the dimensionality needs to be reduced. This can be achieved by generating integrated peak areas or total ion chromatograms (TICs). For one-dimensional data (such as TICs), several time alignment procedures can be employed to tackle retention time shifts (Johnson et al., 2003), such as correlation optimized warping (COW) (Nielsen et al., 1998), dynamic time warping (DTW) (Pravdova et al., 2002) and recursive alignment by fast Fourier transform (RAFFT) (Wong et al., 2005). The original COW requires long execution times and large amounts of memory when dealing with huge hyphenated datasets. Artifacts often appear in fingerprints aligned by DTW, because it tends to over-warp signals recorded by a mono-channel detector. RAFFT accelerates the alignment procedure efficiently by fast Fourier transform cross-correlation. However, RAFFT risks distorting peak shapes, because it does not consider peak information when moving segments: it inserts and deletes data points only at the start and the end of segments, which may introduce artifacts and remove peak points. Note that nonlinear retention time shifts often exist in the chromatograms of real samples. To solve the nonlinear shift problem, algorithms such as nonlinear alignment by moving window fast Fourier transform (MWFFT) cross-correlation (Li et al., 2013b) and a multi-scale peak alignment (MSPA) approach (Zhang et al., 2012b) have been proposed. MSPA iteratively divides a chromatogram into small segments to handle nonlinear retention time shifts; FFT cross-correlation is then used to estimate candidate shifts and align peaks gradually, step by step. A simple example of the application of MSPA is shown in Figure 4, in which retention time shifts of GC-MS TICs in different samples are removed successfully. Other alternatives exist, such as kernel density (Smith, Want, 2006), component-resolving algorithms (Andreev et al., 2003) and progressive clustering (De Souza et al., 2006). Another approach to alignment is to integrate the peak areas. Although this is time-consuming and meticulous work, it also serves as a form of "data cleaning", because retention time shifts, noise and background shifts are removed at the same time.

Insert Figure 4
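The core operation shared by the cross-correlation based methods above is the estimation of the shift that best matches a sample chromatogram to a reference. The sketch below computes the cross-correlation directly for a range of candidate lags; RAFFT and MSPA obtain the same quantity more efficiently through the fast Fourier transform and apply it segment by segment. The function names, the edge padding and the toy TICs are assumptions made for illustration.

```python
def best_shift(reference, sample, max_shift):
    """Return the lag s that maximizes sum(reference[i] * sample[i + s])."""
    n = len(reference)
    best, best_score = 0, float("-inf")
    for s in range(-max_shift, max_shift + 1):
        score = sum(reference[i] * sample[i + s]
                    for i in range(n) if 0 <= i + s < n)
        if score > best_score:
            best, best_score = s, score
    return best


def apply_shift(sample, s):
    """Shift the sample by s points, repeating the end values at the edges."""
    n = len(sample)
    return [sample[min(max(i + s, 0), n - 1)] for i in range(n)]


# Toy TICs (assumed data): the sample elutes two points later than the reference.
reference = [0, 0, 1, 5, 1, 0, 0, 0, 0, 0]
sample = [0, 0, 0, 0, 1, 5, 1, 0, 0, 0]

lag = best_shift(reference, sample, max_shift=4)
aligned = apply_shift(sample, lag)
```

A single global lag such as this can only correct a linear shift; the segment-wise schemes described above exist precisely because real retention time shifts vary along the chromatogram.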

Loss of information is inevitable in the dimension reduction step. To handle this problem, another strategy is to model the high-dimensional data directly by multi-way analysis methods, so that the so-called two-dimensional advantage (e.g. the mass spectral information of metabolites) is retained. For example, the alignment method of Prakash et al. (Prakash et al., 2006) and the ChromAlign method (Sadygov et al., 2006) both use the raw multi-way data. These algorithms first compute a matrix of similarity scores between pairs of spectra. Dynamic programming is then applied to find an optimal path through the matrix and to define the mapping of paired spectra. In the method proposed by Pierce et al., a piecewise single-dimension retention time alignment algorithm is applied to align two-dimensional data (Pierce et al., 2005). In the continuous profile model (CPM) method, the two-dimensional data are divided into four m/z bins, as opposed to aligning only a single TIC (Listgarten et al., 2007). In addition, some algorithms have been extended to align retention time shifts in both dimensions more comprehensively, such as the algorithm using a novel indexing scheme (Pierce, Wood, 2005). This algorithm aligns the fingerprints in both dimensions at the same time, preserving the separation information in each dimension. It is suitable for correcting shifts in different kinds of two-dimensional separations, such as GC×GC, LC×LC, LC×CE and LC×GC. Finally, gap filling is often used after peak alignment to fill in missing values when not all peaks could be detected in every sample. This procedure is necessary to avoid the inclusion of too many zero values, which would have negative effects on subsequent data modeling.
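The dynamic programming step used by the spectrum-based alignment methods above (and by DTW) can be sketched as follows. For brevity, the local cost here is the absolute intensity difference between two one-dimensional traces; in the cited methods the cost would instead be derived from the similarity between pairs of mass spectra. The function name and the toy traces are hypothetical.

```python
def dtw_path(a, b):
    """Return (total cost, warping path) between two traces by dynamic programming."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j - 1],
                                     cost[i - 1][j],
                                     cost[i][j - 1])
    # Backtrack from the end to recover the optimal mapping of points.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(cost[i - 1][j - 1], i - 1, j - 1),
                 (cost[i - 1][j], i - 1, j),
                 (cost[i][j - 1], i, j - 1)]
        _, i, j = min(steps)
    path.reverse()
    return cost[n][m], path


# Toy traces (assumed data): b is a delayed copy of a.
a = [0, 1, 5, 1, 0]
b = [0, 0, 1, 5, 1, 0]
total, path = dtw_path(a, b)
```

The path pairs each point of one trace with one or more points of the other; the unconstrained form shown here is also what allows the over-warping mentioned earlier, so practical implementations usually restrict how far the path may deviate from the diagonal.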


2.1.4 Normalization

Normalization aims to remove confounding variation due to experimental sources, such as analytical noise and experimental bias, while retaining the relevant variation due to biological events (Castillo, Gopalacharyulu, 2011). If the signal of the majority of metabolites is stable, a simple and efficient normalization can be achieved by calculating the ratio of the abundance of each analyte to that of all other peaks, as in unit norm and median intensity normalization (Wang, Zhou, 2003). However, the assumption of negligible changes in overall concentration is hard to satisfy; the total concentration of analytes may change considerably owing to both laboratory system errors and differences between large-scale biological experiments. In this case, scaling based on the total chromatogram may seriously distort the data.
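A minimal sketch of such sample-wise normalization is given below, assuming a data matrix with one row per sample and one column per feature. Dividing each row by its total intensity gives a unit 1-norm; dividing by the row median is the median intensity variant. The function name and the data are hypothetical.

```python
from statistics import median


def normalize_rows(matrix, method="sum"):
    """Divide each sample (row) by its total or median intensity."""
    out = []
    for row in matrix:
        if method == "sum":
            factor = sum(row)
        elif method == "median":
            factor = median(row)
        else:
            raise ValueError("method must be 'sum' or 'median'")
        if factor == 0:
            raise ValueError("cannot normalize a sample with zero factor")
        out.append([v / factor for v in row])
    return out


# Toy data (assumed): the second sample is the first one injected at twice the amount.
samples = [[10.0, 20.0, 70.0],
           [20.0, 40.0, 140.0]]
by_sum = normalize_rows(samples, method="sum")
by_median = normalize_rows(samples, method="median")
```

Both variants remove the twofold dilution difference here because every feature changes by the same factor; when a few abundant metabolites change for biological reasons, the same division would distort the remaining features, as noted above.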

For analytical reasons, the measurement errors of minor metabolites with low concentrations are larger than those of primary metabolites with high abundance, because compounds with lower concentrations are more easily affected by analytical noise. To make different metabolites comparable, a scaling procedure is required to normalize the variances. Different scaling methods, such as autoscaling (1/SD) and Pareto scaling (1/sqrt(SD)), can be used. Autoscaling is the most popular normalization method in metabolomics; each variable is given equal (unit) variance by multiplying it by the inverse of its standard deviation (SD). Pareto scaling is softer than autoscaling: each variable is weighted by the inverse of sqrt(SD). It increases the importance of low-abundance compounds without significantly amplifying the noise.

During data analysis, many researchers tend to assume that the total variation originating from sampling, analytical measurements and biological events has equal standard deviations and is distributed symmetrically around zero (van den Berg et al., 2006). However, this assumption does not hold in many cases. Biological effects related to concentration changes can vary dramatically between metabolites. This metabolite-dependent variation is called heteroscedasticity, and it can be detrimental when observing a particular biological situation (van den Berg, Hoefsloot, 2006). A mathematical transformation helps to correct skewed data before modeling. The log transformation (Kvalheim et al., 1994) and the power transformation (Sokal and Rohlf, 1995) are two well-known methods that have been applied to correct heteroscedasticity. When the relative standard deviation is constant, a log transformation removes heteroscedasticity perfectly (Kvalheim, Brakstad, 1994). However, the log transformation has a drawback: the transformed values approach minus infinity as the original values approach zero. The power transformation does not suffer from this near-zero artifact and gives results similar to those of the log transformation.
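The scaling and transformation steps described above can be sketched as follows, for a data matrix with one row per sample and one column per variable. Mean-centering before scaling is an assumption of this sketch (it is common practice but is not stated above), as are the function names, the square-root exponent and the toy data.

```python
import math
from statistics import mean, stdev


def scale_columns(matrix, method="auto"):
    """Mean-centre each variable and divide by SD (auto) or sqrt(SD) (Pareto)."""
    scaled_columns = []
    for col in zip(*matrix):
        mu, sd = mean(col), stdev(col)
        if sd == 0:
            raise ValueError("constant variable cannot be scaled")
        divisor = sd if method == "auto" else math.sqrt(sd)
        scaled_columns.append([(v - mu) / divisor for v in col])
    return [list(row) for row in zip(*scaled_columns)]


def log_transform(matrix):
    """Base-10 log transformation; all values must be strictly positive."""
    return [[math.log10(v) for v in row] for row in matrix]


def power_transform(matrix, exponent=0.5):
    """Power transformation (square root by default); defined at zero."""
    return [[v ** exponent for v in row] for row in matrix]


# Toy data (assumed): a low-abundance and a high-abundance variable.
data = [[1.0, 100.0],
        [2.0, 200.0],
        [3.0, 300.0]]
auto = scale_columns(data, method="auto")
pareto = scale_columns(data, method="pareto")
logged = log_transform([[1.0, 100.0]])
rooted = power_transform([[0.0, 100.0]])
```

After autoscaling both variables span the same range, whereas after Pareto scaling the abundant variable still has the larger spread, which is the "softer" behaviour referred to above. The last line shows that the power transformation remains defined at zero, where the log transformation is not.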

Another sophisticated strategy for normalization relies on internal standards (ISs), e.g. isotopically labeled internal standards, and on quality control (QC) samples included in each data acquisition procedure (Gika et al., 2008b). Comprehensive and representative IS-based normalization rests on the key assumption that the variance exhibited by the ISs comes solely from systematic error. Unfortunately, this is not always the case. With insufficient chromatographic separation or ion suppression, a concentration change in one component can produce variance in the measurement of a different one. If such confounding between analytes and ISs occurs, direct normalization using the ISs may suppress the signal and lead to loss of information. As a principle, a representative IS used for normalization should be similar to the analyte, and the systematic error should affect both indiscriminately. The IS is often selected from specific regions of retention time (RT) or m/z. However, the selected RT or m/z cannot represent all matrix or chemical properties, which obscures data variation (Castillo, Gopalacharyulu, 2011). No single IS can estimate the systematic error of a complex biological matrix; therefore, multiple ISs work better in these cases. Furthermore, the use of ISs must minimize the risk of cross-contribution (CC). If the masses used to quantify the IS are carefully selected, this problem can be solved easily (Liu et al., 2002). However, this is nontrivial in metabolomics research because biological samples are too complex, and it is difficult to predict which ions will cross-interfere. CC effects can cause serious loss of information, especially when the interfering analytes are related to the factors of interest in metabolomics data sets. Redestig et al. presented an effective normalization algorithm that compensates for systematic CC effects and improves the normalization of mass spectrometry based metabolomics data (Redestig et al., 2009). On the other hand, in order to picture the global variability of a measurement system, checking the QC samples before normalization is recommended when visualizing the data by PCA. Generally, the QC sample is a pool of several individuals having similar characteristics. The studied samples are compared with the QCs to evaluate the variability. In multivariate statistical analysis, such as PCA, the QC samples should cluster closely on the scores plot, which indicates that the analytical system has good reproducibility (Gika et al., 2008a).
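As a minimal sketch of the two ideas above, the following example divides each sample by its internal-standard signal and then summarizes reproducibility as the relative standard deviation (RSD) of each feature across QC injections. The toy data assume a purely multiplicative drift that affects analytes and IS alike, which is exactly the idealized assumption discussed in the text; the function names and numbers are hypothetical.

```python
import numpy as np

def normalize_by_is(intensities, is_intensity):
    """Divide every feature of a sample by that sample's internal-standard signal."""
    intensities = np.asarray(intensities, dtype=float)
    is_intensity = np.asarray(is_intensity, dtype=float)
    return intensities / is_intensity[:, None]

def qc_rsd(qc_intensities):
    """Relative standard deviation (%) of each feature across QC injections."""
    qc = np.asarray(qc_intensities, dtype=float)
    return 100.0 * qc.std(axis=0, ddof=1) / qc.mean(axis=0)

# Toy data: 4 QC injections x 3 features with a multiplicative drift per injection.
true_profile = np.array([200.0, 50.0, 1000.0])
drift = np.array([1.0, 0.9, 1.2, 0.8])
qc_raw = drift[:, None] * true_profile
is_signal = drift * 500.0  # assumption: the IS experiences the same drift

qc_norm = normalize_by_is(qc_raw, is_signal)
print("RSD before normalization (%):", qc_rsd(qc_raw).round(1))
print("RSD after normalization  (%):", qc_rsd(qc_norm).round(1))
```

When the IS and the analytes do not share the same systematic error, for example under ion suppression or cross-contribution, the RSD after normalization would not collapse in this way, which is why the QC check is informative.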

2.2 Identification of metabolites

Confidently identifying metabolites from MS spectral data is generally recognized as a significant challenge in the plant metabolomics community, especially in untargeted analysis. Although investigation of this topic was initiated much earlier than that of protein and DNA sequencing, metabolite identification has reached a high-throughput and automated level only in recent years. The main factor behind this delayed breakthrough is the biochemical diversity of metabolites. However, benefitting from advanced computational techniques and methods, advanced mass spectrometry instrumentation, a wealth of knowledge on ion fragmentation, well-established databases and libraries, and especially the fruitful work of the past decade, metabolite identification can now cover unknowns with reasonable accuracy and can be performed in a high-throughput manner. A variety of overviews have been published on this topic, and comprehensive summaries of different identification strategies can be found in (Kind and Fiehn, 2010, Scheubert et al., 2013, Wishart, 2009), whereas instructions for practical use can be obtained from (Neumann et al., 2013, Watson, 2013), and a useful guide for beginners in mass spectrometry is given in (Holcapek et al., 2010). Thus, in this section we only briefly cover currently available algorithms and tools valuable for metabolite identification using MS.

2.2.1 Standards for reporting metabolite identification results

Since the outcome of metabolomics research depends strongly on biological conditions and experimental measurement, formulating a standard for reporting the outcome is of great importance for adopting common semantics in the literature, evaluating current work (both for readers and for journal peer reviewers), exchanging data and storing data in public repositories (Field and Sansone, 2006). The most frequently referenced standard for metabolite identification was proposed by the Chemical Analysis Working Group (CAWG), a part of the Metabolomics Standards Initiative (MSI) (Sumner et al., 2007), and classifies identifications into four levels:

LEVEL 1. Unambiguously identified compounds: require comparison with an authentic chemical standard using two or more orthogonal properties analyzed under identical analytical conditions;

LEVEL 2. Putatively identified compounds: based on comparison of physicochemical properties and/or spectral similarity with public or commercial spectral libraries, without an authentic chemical standard;

LEVEL 3. Putative identification of compound classes: based on comparison of physicochemical properties of a chemical class of compounds, or spectral similarity to known compounds of a chemical class;

LEVEL 4. Unknown compounds: unidentified and unclassified, but can still be differentiated and quantified by spectral or chromatographic data.

This standard, along with, for example, guidelines for hyphenated MS experiments and data preprocessing, provides a sound basis for metabolomics studies. Its drawback is also apparent: it cannot be characterized quantitatively, for example using probability, and is thus still arbitrary (Creek et al., 2014). For instance, in a high-throughput analysis it is impossible to collect all authentic metabolites to achieve LEVEL 1 identification; it is more practical to achieve LEVEL 2 using accurate mass with the assistance of isotopic distribution, or reference library search assisted by retention index. However, the latter can also yield unambiguous identifications under appropriate criteria, such as a false discovery rate of 5% (Kim and Zhang, 2014). A recent suggestion is to employ confidence levels to solve this problem (Schymanski et al., 2014), but much more effort is still required. In addition to this standard, some other guidelines may serve as references, for example EU Commission Decision 2002/657/EC (http://ec.europa.eu/food/food/chemicalsafety/residues/lab_analysis_en.htm), which uses an identification point system to score the identification results.

2.2.2 Metabolite identification using GC-MS

GC-MS has been routinely used in metabolomics and is considered a gold standard technique for high-throughput metabolomics studies (Fiehn and Spranger, 2003, Lisec et al., 2006). The main advantages of this technique are its robustness, high reproducibility in both the chromatographic and mass spectrometric dimensions, high sensitivity, and the existence of mature protocols for sample preparation and data processing. It is one of the oldest techniques in analytical science. Moreover, although GC-MS analysis requires the analytes to be volatile and thermally stable, the range can be considerably extended by chemical derivatization (Villas-Boas et al., 2005). Therefore, great effort has been made to interpret MS spectra from the electron impact (EI) ion source. Currently, the most frequently adopted and reliable method is library search. In this method, each experimental MS spectrum is compared with the reference MS spectra in a mass spectral library and similarity scores are calculated. The library compound with the highest similarity score is, in theory, considered to be the one that generated the experimental spectrum. The commonly adopted mass spectral libraries are listed in Table 1. The main factors that influence the search results include the quality of the experimental MS spectra, the size of the mass spectral library and the similarity score calculation algorithm (Koo et al., 2013). From the computational point of view, the method for calculating the similarity score is the most important factor, because the quality of MS spectra depends largely on the experiment, and the libraries are generally commercial products that cannot be freely configured by users and are still relatively small. A previous investigation showed that the most robust similarity score was the dot product computed on square-root-scaled mass spectral intensities (Stein and Scott, 1994). However, several other algorithms, such as cross-correlation (Eng et al., 2008, Powell and Hieftje, 1978), and probability-based algorithms, such as the probability based matching (PBM) system (McLafferty et al., 1974) and X-Rank (Mylonas et al., 2009), are also powerful. In addition, each MS workstation of a GC-MS instrument has its own algorithm for calculating the similarity between experimental and library MS spectra, and MassLib adopts the SISCOM system (Damen et al., 1978) to perform the library search.
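The library search described above can be sketched as follows. This is a simplified illustration that assumes spectra are stored as dictionaries of nominal m/z and intensity and that the score is a normalized dot product (cosine) on square-root-scaled intensities; the exact weighting used in the cited study and in commercial workstations may differ, and the toy library is hypothetical.

```python
import math

def sqrt_dot_product(query, reference):
    """Cosine similarity between two spectra after square-root intensity scaling.

    Spectra are dicts mapping nominal m/z (int) to intensity.
    """
    mzs = set(query) | set(reference)
    q = [math.sqrt(query.get(mz, 0.0)) for mz in mzs]
    r = [math.sqrt(reference.get(mz, 0.0)) for mz in mzs]
    num = sum(a * b for a, b in zip(q, r))
    den = math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in r))
    return num / den if den else 0.0

def library_search(query, library):
    """Return (name, score) pairs sorted by decreasing similarity (the hit list)."""
    hits = [(name, sqrt_dot_product(query, spec)) for name, spec in library.items()]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

# Hypothetical toy library and query spectrum.
library = {
    "compound_A": {73: 999, 147: 450, 217: 300},
    "compound_B": {73: 999, 129: 500, 204: 800},
}
query = {73: 950, 147: 400, 217: 350}

hitlist = library_search(query, library)
for name, score in hitlist:
    print(f"{name}: {score:.3f}")
```

The square-root scaling reduces the dominance of very intense peaks, so that smaller but structurally informative ions contribute more to the score.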

Insert Table 1

Owing to the complexity of metabolites and their EI-MS spectra, for example the existence of isomers and of mixed mass spectra generated by co-eluting components, a target compound does not always obtain the highest similarity score. More generally, it appears further down the hit list (e.g. at the second or third rank or lower). Thus, careful manual checking is always required. Consequently, taking other information into consideration, such as the retention index (RI, e.g. Kovats retention index) of the target compound, is very helpful (Dunn et al., 2011, Kopka, 2006). RI is a structurally and physicochemically specific indicator and can effectively differentiate compounds having similar mass spectra. In fact, this indicator together with the EI-MS spectrum makes up the mass spectral tag (MST), which is widely accepted in plant metabolomics and underlies the organization of the Golm Metabolome Database (GMD) (Kopka et al., 2005, Schauer et al., 2005, Wagner et al., 2003) and BinBase/FiehnLib (Kind et al., 2009). The NIST standard reference database also includes a large number of RI values.
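A brief sketch of how a retention index can be used to prune a hit list is given below. It assumes the linear, temperature-programmed form of the index, interpolated between bracketing n-alkanes, rather than the logarithmic isothermal Kovats definition; the alkane retention times, the hit list and the tolerance window are hypothetical.

```python
from bisect import bisect_right

def linear_retention_index(rt, alkane_rts):
    """Linear retention index interpolated between bracketing n-alkanes.

    alkane_rts maps carbon number to retention time and is assumed to
    increase monotonically with carbon number.
    """
    carbons = sorted(alkane_rts)
    times = [alkane_rts[c] for c in carbons]
    i = bisect_right(times, rt) - 1
    if i < 0 or i >= len(times) - 1:
        raise ValueError("retention time lies outside the alkane ladder")
    span = carbons[i + 1] - carbons[i]
    fraction = (rt - times[i]) / (times[i + 1] - times[i])
    return 100.0 * (carbons[i] + span * fraction)

def filter_by_ri(hits, observed_ri, window=20.0):
    """Keep hits (name, score, library_ri) whose library RI is within the window."""
    return [hit for hit in hits if abs(hit[2] - observed_ri) <= window]

# Hypothetical alkane ladder (carbon number -> minutes) and hit list.
alkanes = {10: 5.0, 11: 6.0, 12: 7.2}
ri = linear_retention_index(6.6, alkanes)

hits = [("isomer_2", 0.95, 1310.0), ("isomer_1", 0.93, 1152.0)]
kept = filter_by_ri(hits, ri)
print(round(ri, 1), kept)
```

In this toy case the isomer with the slightly higher spectral score is rejected because its library RI is far from the observed value, which mirrors the situation described above in which the correct compound is not ranked first by spectral similarity alone.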

Another improvement, especially in the case of co-elution, can be achieved by mass spectral deconvolution or two-dimensional data resolution methods, which resolve the overlapped peaks generated by co-eluting components into the pure chromatogram and mass spectrum of each component (see section 2.1.2).

Although mass spectral library search provides promising confidence for metabolite identification, the greatest drawback of this strategy is that currently released libraries are still far from covering all plant metabolites, so that the large number of metabolites absent from the libraries cannot be identified. This problem is faced by all scientific fields involving compound identification. Attempts to overcome this disadvantage triggered the development of one of the oldest artificial intelligence systems, the DENDRAL project (Lindsay et al., 1993), initiated in 1965 to study the relationships between mass spectra and compound structure. Unfortunately, this project ultimately failed. However, building on the pioneering work of the DENDRAL project and other early mass spectrum interpretation systems, numerous methods for compound identification independent of mass spectral libraries were developed. These methods can be divided into two series. One series learns structural features of compounds from their experimental mass spectra and then deduces an unknown structure from the features of a given spectrum according to the previously constructed learning models. There are two ways to achieve this. The first is to enumerate all possible isomers for the molecular mass extracted from the input MS spectrum with a structure generation module (e.g. MOLGEN (Benecke et al., 1995) and OMG (Peironcely et al., 2012)) and to retain the structures that best explain the spectrum according to fragmentation rules. Generally, machine learning algorithms are adopted in this procedure to determine whether a substructure is present in the unknown compound. This can filter out a large number of isomers that lack the identified substructures (Schymanski et al., 2008). MOLGEN-MS (Kerber et al., 2001) and MassLib have been developed in this manner. The web-based algorithm embedded in GMD, in contrast, employs decision trees to predict the 166 most common functional groups in plant metabolites after training on known metabolites in GMD using the corresponding mass spectral data and retention indices (Hummel et al., 2010), providing invaluable information for inferring the structures of unknown metabolites. The other way is based on library search, under the assumption that similar structures have similar spectra. Possible substructures of an unknown compound can then be deduced from the library compounds with the top similarity scores (Stein, 1995). The other series of methods predicts the mass spectrum of an input molecule directly. Based on the wealth of knowledge of ion fragmentation and aided by advanced computational technologies, accurate prediction of mass spectra has become feasible. Mass Frontier (Thermo Scientific), one of the most commonly adopted software packages for structure elucidation, uses the HighChem Fragmentation Library, which stores about 31,000 fragmentation mechanisms, to predict and interpret experimental mass spectra. The commercial software ACD/MS Fragmenter (ACD/Labs) is also very powerful in MS spectrum prediction and has gained popularity in the metabolomics community. Similarly, the freely available tool Mass Spectrum Interpreter released by NIST uses the thermochemical kinetics of general fragmentation reactions, summarized from known fragmentation rules, to predict mass spectra. A common difficulty among these powerful methods is that they cannot effectively distinguish the correct structure from its isomers, as has been pointed out in a comparison of different tools (Schymanski et al., 2009). However, improvements can be made by combining different tools (Schymanski et al., 2012). In addition to the above methods, unknown compounds can also be putatively identified by combining MS spectral characteristics with other information. For example, the combination of accurate mass-to-charge ratio (m/z) provided by chemical ionization, in silico predicted retention index and fragmentation pattern can effectively constrain the number of candidate compounds in the hit list (even to a single one) without requiring any mass spectral library (Fiehn et al., 2000, Kumari et al., 2011). A practical guide to small molecule structure elucidation using several strategies that differ from the above computational methods and do not require mass spectral libraries can be found in (Zhang et al., 2013b).

2.2.3 Metabolite identification using LC-MS

As mentioned in the previous section, GC-MS analysis is standardized, e.g. by fixing the ionization method (i.e. EI) at a fixed energy (i.e. 70 eV), which ensures that the mass spectra generated are robust and highly reproducible among instruments and laboratories. As a consequence, the reference mass spectral libraries are standardized and well quality-controlled, and the mechanisms of fragmentation during EI are now extensively known, making the identification of compounds highly practicable and the quality of results assessable. The biggest limitation of GC-MS, however, is the requirement for volatile and thermally stable analytes, or for an additional derivatization step to make polar and non-volatile species amenable to analysis (Villas-Boas, Mas, 2005). This dramatically reduces the range of analyzable species, since many more species, such as secondary metabolites, are non-volatile or have higher molecular weights. Further, derivatization complicates the interpretation of mass spectra. In contrast, LC-MS does not require the species to be volatile and can be used to analyze compounds with heat-labile functional groups, chemically unstable substructures, high molecular weights and so forth; it can therefore analyze a much wider range of plant metabolites than GC-MS. Moreover, sample preparation for LC-MS is simpler (Kim and Verpoorte, 2010, Wu et al., 2012). These great advantages of LC-MS, along with advanced instrumentation, for example the development of the electrospray ionization (ESI) technique (Fenn et al., 1989), which is well compatible with tandem mass spectrometry, and the continuing improvement of resolving power, make LC-MS the method of choice in 'omics' research, especially in high-throughput analysis.

While LC-MS has established itself as an indispensable technology, identifying metabolites from its MS spectra is not straightforward because of variation in experimental settings, for example chromatographic conditions and mass spectrometry parameters (Halket et al., 2005). This becomes even more serious when discovering unknowns in a large and complex metabolite space, for example in untargeted metabolomics analysis. Additionally, the fragmentation mechanisms during ionization on LC-MS platforms under various activation energies are still unclear, and their investigation lags far behind that of EI. These factors make the confident interpretation of MS spectra derived from different LC-MS and LC-MS/MS platforms a significant challenge. Fortunately, recent active studies have made remarkable advances in metabolite identification, and several tools and various databases are publicly available (see Table 1 and Table 2). In general, currently available tools are developed on the basis of two aspects of LC-MS data: accurate mass together with other information such as isotopic distribution, and MS/MS spectra.

Insert Table 2

2.2.3.1 Structure inference by accurate mass and other information

The ability to measure m/z accurately is one of the most important features of high resolution mass spectrometry and has greatly facilitated the whole MS data analysis workflow. For metabolite identification, using the accurate mass calculated from the measured m/z is generally the first step (Holcapek, Jirasko, 2010), as it is the simplest and most straightforward. Formula generation, large compound database search or metabolic network search can be adopted here. For formula generation, all combinations of predefined elements are enumerated under constraints on element numbers and mass range. A number of commercial or freely available tools have been developed to assist with this (see Table 2). As expected, a very large number of candidate formulas is generated, especially for relatively large molecular masses. This makes it impracticable to obtain a single formula assignment for each m/z based solely on the accurate mass. Defining rules to filter out these false positives is therefore necessary, but nontrivial.
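The formula generation step can be sketched as a brute-force enumeration. The example below is only an illustration restricted to C, H, N and O, with hypothetical upper limits on element counts and a hypothetical 5 ppm tolerance; it applies no chemical plausibility rules, so the candidates it returns are exactly the kind of unfiltered list discussed above.

```python
from itertools import product

# Monoisotopic masses (Da) of the most abundant isotopes.
MONO = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146}

def generate_formulas(mass, ppm=5.0, max_counts=None):
    """Enumerate CHNO compositions whose monoisotopic mass matches within ppm."""
    max_counts = max_counts or {"C": 20, "H": 40, "N": 5, "O": 10}
    tolerance = mass * ppm * 1e-6
    hits = []
    ranges = (range(max_counts[e] + 1) for e in ("C", "N", "O"))
    for c, n, o in product(*ranges):
        partial = c * MONO["C"] + n * MONO["N"] + o * MONO["O"]
        # For fixed C, N and O only the nearest hydrogen count can match.
        h = round((mass - partial) / MONO["H"])
        if 0 <= h <= max_counts["H"]:
            candidate_mass = partial + h * MONO["H"]
            if abs(candidate_mass - mass) <= tolerance:
                hits.append(({"C": c, "H": h, "N": n, "O": o}, candidate_mass))
    return sorted(hits, key=lambda hit: abs(hit[1] - mass))

# Neutral monoisotopic mass close to that of a hexose, C6H12O6.
candidates = generate_formulas(180.0634, ppm=5.0)
for formula, m in candidates:
    print(formula, round(m, 5))
```

Widening the tolerance or the allowed element counts quickly increases the number of candidates, which illustrates why accurate mass alone rarely yields a single assignment.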

Among all the rules developed, checking the similarity of the isotopic distribution is commonly accepted as the most critical criterion, and it has been demonstrated that most spurious formulas can be rejected by this check (Erve et al., 2009, Kind and Fiehn, 2007). Isotopes of an element are naturally stable variants of the element that differ in the number of neutrons as well as in natural abundance (expressed as the percentage of each isotope, e.g. the natural abundances of 12C and 13C are 98.93% and 1.07%, respectively). Different elements have distinct isotopic abundance distributions in nature. Therefore, in theory, each elemental composition or formula has a unique isotopic distribution. This is the basic principle of compound identification using isotopic distribution. Namely, by comparing the instrument-determined isotopic distribution with the simulated one, the formula candidates can be ranked, with the most similar at the top, via so-called spectral comparison (Wang and Cu, 2010), or rejected if the relative isotopic abundances (RIA) of the two distributions are unacceptably different. Efforts to simulate isotopic distributions precisely have been under way for decades, and several tools are now freely available (Valkenborg et al., 2012). If the resolution of an MS instrument is high enough, formulas can be identified exclusively using the RIA of a single element. For example, the number of carbon atoms could be estimated accurately by comparing the isotope peak generated by 13C alone with the monoisotopic peak, because in FT-ICR-MS it was well resolved from the other [M+1]+ peaks, the mass difference between the 13C- and 15N-substituted peaks being 0.00632 Da (Miura et al., 2010). Similarly, the number of N atoms or the presence of Cl, Br and S can be deduced from the fine structure of the isotopic distribution (Kaufmann, 2010). This strategy has now been extended and confirmed using higher resolution instruments for high-throughput metabolomics analysis (Nagao et al., 2014). However, recent investigations have shown that the accuracy of RIA measurements depends strongly on the type and resolution of the MS instrument, peak intensity, accurate mass and data handling methods, and that high RIA measurement errors appear for peaks with low signal-to-noise ratio (S/N), low m/z or co-eluting species (Knolhoff et al., 2014, Koch et al., 2007, Weber et al., 2011, Xu et al., 2010). These errors can seriously mislead the identification results (Koch, Dittmar, 2007). Unfortunately, no systematic evaluation of the influence of RIA measurement error on formula inference has been performed. Although setting a larger error tolerance during comparison has been suggested to reduce this influence (Weber, Southam, 2011), caution is still needed when using RIA to identify metabolites, and additional information is required.

The exhaustive enumeration of elemental compositions according to the input parameters, for example element types and accurate mass with allowed errors, tends to generate meaningless formulas that are unlikely to occur in plant metabolites. Many of these formulas cannot be rejected by RIA criteria alone. Hence it is necessary to check the formulas using other rules. The well-known "Seven Golden Rules" were defined after statistically analyzing formulas extracted from the Wiley and NIST02 mass spectral databases and the Dictionary of Natural Products (Kind and Fiehn, 2007) and have been demonstrated to be an efficient tool in metabolomics studies. An updated version of these rules was defined recently after analyzing a large number of formulas in the PubChem database (Lommen, 2014).
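The isotope-based reasoning discussed in this section can be illustrated with a short sketch that estimates the number of carbon atoms from the intensity ratio of the 13C isotope peak to the monoisotopic peak, and ranks candidate formulas by how well they reproduce that ratio. It assumes, as in the FT-ICR-MS case mentioned above, that the 13C peak is resolved from other [M+1] contributions; the intensities and candidate formulas are hypothetical.

```python
C12_ABUNDANCE = 0.9893
C13_ABUNDANCE = 0.0107

def expected_c13_ratio(n_carbons):
    """Expected intensity ratio of the single-13C peak to the monoisotopic peak."""
    return n_carbons * C13_ABUNDANCE / C12_ABUNDANCE

def estimate_carbon_count(mono_intensity, c13_intensity):
    """Carbon number that best explains the observed 13C/monoisotopic ratio."""
    observed = c13_intensity / mono_intensity
    return round(observed * C12_ABUNDANCE / C13_ABUNDANCE)

def rank_by_ria(candidates, mono_intensity, c13_intensity):
    """Sort candidate formulas by deviation from the observed 13C ratio."""
    observed = c13_intensity / mono_intensity
    return sorted(candidates, key=lambda f: abs(expected_c13_ratio(f["C"]) - observed))

# Hypothetical measurement and hypothetical candidate formulas.
mono, c13 = 100.0, 6.5
candidates = [
    {"C": 5, "H": 8, "O": 7},
    {"C": 6, "H": 12, "O": 6},
    {"C": 7, "H": 16, "O": 5},
]

print(estimate_carbon_count(mono, c13))
print(rank_by_ria(candidates, mono, c13)[0])
```

Because the expected ratio changes by only about one percentage point per carbon atom, a modest error in the measured relative isotopic abundance is enough to shift the estimate by one or more carbons, which is consistent with the cautions raised above.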

Once formulas are determined or ranked, they are subsequently decoded into known metabolites, typically by searching large databases (Little et al., 2012, Zhu et al., 2013). The databases frequently adopted in metabolomics are listed in Table 1. The candidate compounds can be further constrained by prior biological knowledge, for example lists of expected metabolites of the analyzed organism. Since the metabolites in a biological sample are biochemically connected (e.g. by chemical transformations) rather than randomly mixed (Breitling et al., 2006), it is beneficial to map the metabolite candidates onto metabolic networks to gain confident identifications (Gipson et al., 2008, Rogers et al., 2009, Weber and Viant, 2010). For example, MI-Pack maps mass spectral peaks onto the KEGG network database (Ogata et al., 1999) and uses a rigorously defined mass error surface for the mass differences between substrate-product pairs derived from the database for metabolite identification (Weber and Viant, 2010). A significant reduction of both false negatives and false positives is consequently obtained. This approach is advantageous not only for metabolite identification but also for mining related subnetworks that represent the activities or functions of the metabolites, as has been demonstrated in recent work (Doerfler et al., 2014, Li et al., 2013a).

LC-MS can detect a series of ions (so-called satellite ions) of a metabolite generated by fragmentation reactions during ionization, including neutral losses and ions with different adducts (Brown et al., 2009, Huang et al., 1999). Other types of fragments, for example artifact ions, background ions and multiply charged fragments, can also be generated by LC-MS (Keller et al., 2008). These fragments can give rise to a large number of false positives during metabolite identification using accurate mass. Several free tools, including PUTMEDID-LCMS (Brown et al., 2011), CAMERA (Kuhl et al., 2012), IDEOM (Creek et al., 2012), the MZedDB tool (Draper et al., 2009) and MAIT (Fernandez-Albert et al., 2014), can be applied to identify these fragments during data preprocessing. In addition to fragment interferences, another factor that can potentially disturb metabolite identification is the extraction of the accurate mass, since the algorithm that calculates it from the profile peak is hidden in the commercial workstation. Also, LC-MS instruments show systematic mass deviations during conversion of ion signals to the mass spectrum representation (Savitski et al., 2004) or under different experimental settings (Petyuk et al., 2008). In proteomics, the improvement of mass accuracy has been well studied via calibration by peptide MS/MS spectra (Egertson et al., 2012, Venable et al., 2006), correctly identified peptides (Petyuk et al., 2010) and background ions (Haas et al., 2006). The calibrated accurate mass provides superior discrimination between false and true identifications (Haas, Faherty, 2006). The situation is more complex in metabolomics, however, because more fragments are produced alongside a metabolite precursor ion and the fragmentation of metabolites is more complicated than that of peptides. In a recent approach, mass accuracy was improved via background ions (Scheltema et al., 2008). A great advance in mass accuracy has also been achieved in the commercial software MassWorks™ (Cerno Bioscience) by peak shape calibration with the aid of internal standards, which gives unit resolution instruments (e.g. ion trap instruments) the capability of compound identification using accurate mass and isotopic distribution, as on high resolution instruments (Kuehl and Wang, 2006). Correction of the automatic gain control system, calibrated by multiple linear regression, can also achieve mass accuracy down to the ppb (parts per billion) level (Williams and Muddiman, 2007). Nevertheless, more studies on this topic are still required for identifying metabolites, especially in untargeted analysis.
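As a simple illustration of why adduct ions must be recognized before accurate-mass identification, the sketch below annotates the peaks that can be explained as common positive-mode adducts of one neutral mass within a ppm tolerance. It is not the algorithm of any of the tools named above; the adduct list is deliberately short, and the peak list and tolerance are hypothetical.

```python
# Mass shifts (Da) added to the neutral monoisotopic mass for singly charged
# positive-mode adducts (electron mass already accounted for).
ADDUCT_SHIFTS = {
    "[M+H]+": 1.007276,
    "[M+NH4]+": 18.033823,
    "[M+Na]+": 22.989218,
    "[M+K]+": 38.963158,
}

def ppm_error(observed_mz, theoretical_mz):
    """Signed mass error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6

def annotate_adducts(peaks, neutral_mass, tol_ppm=5.0):
    """Map each peak explained by an adduct of neutral_mass to the adduct name."""
    annotations = {}
    for mz in peaks:
        for name, shift in ADDUCT_SHIFTS.items():
            if abs(ppm_error(mz, neutral_mass + shift)) <= tol_ppm:
                annotations[mz] = name
    return annotations

# Hypothetical peak list; 180.06339 Da is the neutral monoisotopic mass of C6H12O6.
peaks = [181.0707, 203.0526, 250.1234]
annotations = annotate_adducts(peaks, 180.06339)
print(annotations)
```

Without such grouping, the protonated and sodiated ions in this toy example would be searched as two independent unknowns, and at least one of them would produce false candidate formulas.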

2.2.3.2 Metabolite identification by MS/MS

MS/MS is one of the best techniques for structure elucidation and has been widely applied in analytical fields. As an indispensable part of an LC-MS system, ionized molecules selected by the instrument are dissociated into charged or neutral pieces by ion activation methods, for example collision induced dissociation (CID). Recording all the charged fragments and intact molecular ions forms the MS/MS spectrum. This MS/MS spectrum generation procedure implies that the structure of a molecule can, in principle, be deduced from its MS/MS spectrum and, moreover, that the strategies for interpreting GC-MS spectra, i.e. library search and de novo analysis, can be applied in this deduction. Therefore, several MS/MS spectral libraries as well as computational methods for spectral prediction or structure elucidation have been developed (see Table 1 and Table 2).

Because the experimental conditions (e.g. collision energy) in MS/MS analysis are not as standardized as in GC-MS analysis, because the currently constructed libraries are much smaller than whole metabolism or structure databases, and because of other factors (Stein, 2012, Werner et al., 2008), metabolite identification via spectral library search is not as popular as it is in GC-MS analysis. As a consequence, many more studies focus on developing computational methods that interpret MS/MS spectra without querying a spectral library. The algorithms employed in currently developed software for computational MS/MS can be categorized into three basic approaches, namely mass spectrum prediction, in silico fragmentation and de novo elucidation (Hufsky et al., 2014). Mass spectrum prediction has been well studied in the interpretation of EI spectra and is the basic and one of the most important modules for peptide identification in hypothesis-driven proteomics. Owing to the enormous diversity of small compounds, accurate prediction of MS/MS spectra is still a tough challenge. To predict the MS/MS spectrum of a given structure, Mass Frontier extracts the reactions that may occur during fragmentation of this structure from its fragmentation reaction library and uses them as rules to predict the fragments and intensities. ACD/MS Fragmenter handles spectrum prediction in a similar way, whereas MetISIS uses a machine learning algorithm to learn CID kinetics from experimental lipid MS/MS spectra in order to predict spectra for lipids in silico (Kangas et al., 2012).

Instead of directly predicting the mass spectrum, in silico fragmentation attempts to find, among all candidates, the structure that best explains the given MS/MS spectrum. This approach was first employed in EPIC, which uses a bond disconnection algorithm to enumerate all possible substructures of a molecule and compares the substructures with formulas inferred from the fragment ions; the relevant structures are then listed for user confirmation (Hill and Mortishire-Smith, 2005). Later, FiD (Heinonen et al., 2008) enumerated all substructures from the molecular graph using a depth-first graph traversal algorithm and matched them to fragment ions. All candidates were ranked according to the total bond dissociation energy (BDE) calculated from the bond cleavages in each candidate molecule, with the top-ranked candidate having the lowest BDE. A similar procedure was implemented in Mass-MetaSite (Bonn et al., 2010). MetFrag used a more complex procedure than the above algorithms to extract substructure-fragment pairs, additionally considering rearrangement reactions during molecule fragmentation, and scored each candidate using both matched fragments and BDE (Wolf et al., 2010). This algorithm was recently integrated into MetFusion (Gerlich and Neumann, 2013) as a complement to library search, and vice versa. An alternative procedure was implemented in FingerID, which calculates the likelihood between database metabolites and a given experimental MS/MS spectrum in a feature space of so-called fingerprints, using an SVM model trained on fingerprints extracted from the MassBank MS/MS spectral library (Heinonen et al., 2012). CFM, in turn, calculates the likelihood between database metabolites and given MS/MS spectra according to a competitive fragmentation process learned from a spectral library by the expectation maximization algorithm (Allen et al., 2014a).

De novo analysis, by contrast, infers the structure from the observed fragments in a given MS/MS spectrum. This approach first determines the formulas of the fragments from their high resolution m/z and then deduces the structure of the precursor ion using these formulas and the known fragmentation pathways that generate them. The most appropriate method employed for this deduction to date seems to be the fragmentation tree, with nodes being fragment formulas, edges being neutral losses and the root being the precursor (Bocker and Rasche, 2008, Rasche et al., 2011). With an appropriate scoring scheme, an experimental MS/MS spectrum can therefore be identified by extracting the optimal fragmentation tree as defined by the scores. However, this procedure was later shown to be extremely computationally intensive, even when the formula of the precursor has been determined (Hufsky et al., 2012). This obstacle can be partly overcome by heuristic methods (Rauf et al., 2012). The development of much more efficient methods for high-throughput analysis remains a challenge.

2.3 Variable selection

Biomarker screening (variable selection) plays an essential role in metabolomics because biomarker identification aims to convert metabolomic results into valuable biological knowledge. Variable selection has been developed for decades and is active in various research fields, such as statistical pattern recognition (Mitra et al., 2002), machine learning (Robnik-Šikonja and Kononenko, 2003), data mining (Liu and Motoda, 1998) and statistics (Miller, 2002). Moreover, it has proven effective, in both theory and practice, in improving learning efficiency, enhancing predictive accuracy and easing the interpretation of learned results (Yu and Liu, 2004, Yun et al., 2013). Nowadays, high-throughput chemical data generated by modern analytical platforms such as GC-MS, LC-MS and NMR usually contain a large number of data points (variables), while the number of samples is relatively small, the so-called "large p, small n" problem in statistical learning (Boccard et al., 2010). Variable selection is in fact an optimization problem: finding an optimal combination of variables from a very large pool. Solving this NP-hard problem is a great challenge. So far, chemometricians and statisticians have proposed a great many variable selection methods for this problem. Some are based on statistical features of variables, such as uninformative variable elimination (UVE) (Centner et al., 1996a), Monte Carlo based UVE (MC-UVE) (Cai et al., 2008), competitive adaptive reweighted sampling (CARS) (Li et al., 2009a, Zheng et al., 2012), iterative predictor weighting (IPW) (Forina et al., 1999), successive projection algorithm (Araújo et al., 2001) and Bayesian linear regression (BLR) (Chen and Martin, 2009). Others are based on optimization algorithms, such as rough set (Swiniarski and Skowron, 2003), particle swarm optimization (PSO) (Wang, Yi, 2008), stepwise selection (H Martens, 1989), forward selection (Blanchet et al., 2008, H Martens, 1989), backward elimination (H Martens, 1989, Sutter and Kalivas, 1993), genetic algorithm (GA) (Leardi, 2000, 2001, Yang and Honavar, 1998, Yun et al., 2014a) and simulated annealing (SA) (Kalivas et al., 1989). Here we divide them into two kinds: variable ranking and variable subset selection (Narsky and Porter, 2013).

2.3.1 Variable ranking

Variable ranking is mostly used to reveal informative metabolites or biomarkers. Ranking assigns each variable a measure of importance based on a given criterion. This measure is usually a nonnegative value indicating the importance of the variable. PLS is a basic tool of chemometrics, and many PLS-based criteria are frequently employed to assess and rank the importance of variables, especially when building a partial least squares-discriminant analysis (PLS-DA) classification model. The most popular criteria include PLS loading weights (LW) (Wold et al., 2002), variable importance on projection (VIP) scores (Favilla et al., 2013, Wold, Sjöström, 2002), regression coefficients (RC) (Wold et al., 2001a) and target projection (TP) (Kvalheim, 2010, Rajalahti et al., 2009b). The loading weights of the fitted PLS model provide a measure of variable importance for each component (latent variable), and different components generate different ranking results. VIP summarizes the importance of each variable as reflected by the loading weights of all latent variables of the PLS model. Some researchers have suggested that a variable should be retained if its VIP score is greater than 1 (Chong and Jun, 2005, Gosselin et al., 2010). However, because determining a threshold remains one of the most difficult steps in variable selection for metabolomics data, this criterion needs further verification. RC uses the regression coefficients of the PLS model, which measure the association between each single variable and the response. Variables with a small absolute regression coefficient can be eliminated as uninformative (Centner, Massart, 1996a). TP provides a projection of RC on the X matrix, so that the target-projected loadings are proportional to the product of RC and the covariance matrix (XᵀX) (Kvalheim and Karstang, 1989, Kvalheim et al., 2009). Selectivity ratio (SR) measures the ratio between the explained variance and the residual variance of each variable after TP, and it has been used quantitatively to select biomarker candidates (Rajalahti et al., 2009a, Rajalahti, Arneberg, 2009b). In addition, variable ranking can be based on statistical relationships between the variables and the class label, such as correlation (Hall, 1999), information gain (Ben-Bassat, 1982), Euclidean distance (Liang et al., 2008) and mutual information (Yu and Liu, 2004). Methods of this kind provide a measure of importance only for a single variable (i.e. a single metabolite), without considering the interaction effects among multiple variables.

2.3.2 Variable subset selection

The ranking of variables by their importance does not in itself tell us which variables, or how many, should be discarded. Although variable ranking is simple and fast, it is inefficient at identifying the optimal subset of variables. Subset selection seeks an optimal subset of all variables that satisfies some optimality criterion. Any variable ranking method can be turned into a variable subset selection algorithm by introducing a threshold on the variable importance values: variables with importance values above the threshold are kept, while those below it are eliminated. The threshold can be chosen subjectively or by a statistical method (Narsky and Porter, 2013). Usually, a trade-off between the prediction accuracy of the model and the number of selected variables is considered. The most straightforward proposal is to use a cross-validation (CV) procedure to determine the threshold, estimating the generalization error as a function of the number of variables and choosing the number that minimizes the prediction error. Once the variables have been ranked by some criterion, models are built by adding the sorted variables one by one until all of them are included, and the best variable subset is the one with the lowest CV error. For example, if the lowest CV prediction error appears when the fifth variable is added, the first five variables constitute the best variable subset. For subset selection, criteria related to the classification algorithm are employed. The objective function is a pattern classifier, which evaluates variable subsets by their predictive accuracy on test data obtained by statistical re-sampling or CV. Moreover, the optimization algorithm is usually combined with the classification algorithm. In brief, variable subset selection seeks the subset that is optimal or near-optimal with respect to an objective function. For example, genetic algorithm-partial least squares-discriminant analysis (GA-PLS-DA) combines the optimization algorithm GA with a classifier. GA is based on Darwin's classical rules of natural evolution: the best individuals, or the generation produced by mating the best individuals, yield better offspring that have a higher chance of surviving in their environment. This combination has been applied successfully to the classification of normal and pre-cancer tissue samples (Cao et al., 2012). In that work, PLS-DA served as the classifier built on each subset generated by GA. In addition, particle swarm optimization coupled with support vector machine (PSO-SVM) (Alba et al., 2007) was designed to select variable subsets as solutions in order to reduce the high dimensionality of the variables for subsequent classification. The SVM classifier is invoked whenever the fitness of a candidate variable subset needs to be evaluated. Compared with variable ranking, subset selection generally achieves better prediction accuracy, since it is tuned to the specific interactions between the classifier and the dataset and uses re-sampling or cross-validation estimates of prediction accuracy as a mechanism to avoid overfitting. However, it has to train a classifier for each variable subset, which makes it slow and computationally expensive. The solution also lacks generality, since it is tied to the bias of the classifier used in the fitness evaluation function.
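The forward addition of ranked variables described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure of any cited work: the nearest-centroid classifier, the use of leave-one-out CV and the function names (`loocv_error`, `best_ranked_subset`) are hypothetical stand-ins chosen to keep the example self-contained.

```python
from statistics import mean

def nearest_centroid_predict(train_X, train_y, x):
    """Assign x to the class whose centroid (mean vector) is closest."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [r for r, lab in zip(train_X, train_y) if lab == label]
        centroid = [mean(col) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def loocv_error(X, y, columns):
    """Leave-one-out CV error rate using only the given variable columns."""
    sub = [[row[j] for j in columns] for row in X]
    wrong = 0
    for i in range(len(sub)):
        train_X = sub[:i] + sub[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if nearest_centroid_predict(train_X, train_y, sub[i]) != y[i]:
            wrong += 1
    return wrong / len(sub)

def best_ranked_subset(X, y, ranked_columns):
    """Add ranked variables one by one; keep the prefix with the lowest CV error."""
    errors = [loocv_error(X, y, ranked_columns[:k])
              for k in range(1, len(ranked_columns) + 1)]
    best_k = errors.index(min(errors)) + 1  # first minimum gives the smallest subset
    return ranked_columns[:best_k], errors
```

Taking the first minimum favors the smallest subset among equally good ones, which is one simple way of expressing the trade-off between prediction accuracy and the number of selected variables.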

2.3.3 Variable selection considering the interaction effect among variables

In fact, finding an optimal subset or ranking of variables is not always satisfactory unless the interaction among multiple variables is considered. The combination effect among variables should be taken into account, since the joint performance of a set of variables can be better than the sum of the independent contributions of its members (Anastassiou, 2007). To address variable interaction in subset selection effectively and efficiently, Zhao and Liu introduced a variable subset selection method called INTERACT, which is based on inconsistency and symmetrical uncertainty measurements for finding interacting features (Zhao and Liu, 2009). They proposed that variable interactions could be handled implicitly with a carefully designed variable evaluation metric and a search strategy with a specially designed data structure, which together take the combination effects among variables into account during variable selection. The method proposed in Breiman's work (Breiman, 2001) takes the combination effect among numerous variables into account to some extent by means of random forest and a permutation test. The importance of a variable is assessed by the percentage increase in misclassification error when that variable is randomly permuted in the random forest. However, all variables are involved in the random forest model, so it is difficult to obtain a good reflection of the synergetic effect among multiple variables. Recently, Li and Liang proposed a new strategy for variable selection called model population analysis (MPA) (Li et al., 2010a), which provides a general framework for the development of data-analysis methods. Figure 5 illustrates the outline of the MPA idea. There are three steps in MPA: (1) Monte Carlo sampling (MCS) is employed to randomly produce N sub-datasets (e.g., 10,000); (2) a sub-model is built on each sub-dataset; and (3) statistical analysis is employed to evaluate an outcome of interest (e.g., prediction errors) over all N sub-models. With this approach, the variables are classified as informative, uninformative or interfering (Wang et al., 2011) based on the differences between case and control samples. Figure 6 illustrates the prediction error distributions of the three kinds of variables after permutation. The prediction error of informative variables increases after permutation, implying that they significantly improve the prediction performance of the classification model. For uninformative variables, no significant difference emerges before and after permutation; they behave like noise. For interfering variables, the prediction error decreases significantly after permutation, indicating that variables of this kind may have a negative impact on the model and distort the classification. Uninformative and interfering variables are useless, as they may harm the modeling. Thus, searching for the optimal variable subset or ranking among the informative variables can yield compelling results.

Insert Figure 5 Insert Figure 6
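The three MPA steps can be illustrated with a small, self-contained Python sketch. It is only a toy under stated assumptions: a single variable, two classes labeled 0 and 1, and a midpoint-threshold classifier standing in for the sub-model. The names `mpa_error_distribution` and `permute` are hypothetical and do not come from the cited works.

```python
import random
from statistics import mean

def mpa_error_distribution(x, y, n_sub=200, ratio=0.8, seed=0):
    """Test error of each sub-model built on a Monte Carlo sub-dataset.

    x: values of one variable; y: 0/1 class labels (both classes required).
    """
    n = len(x)
    n_train = int(ratio * n)
    if len(set(y)) < 2 or n_train < 2 or n_train >= n:
        raise ValueError("need two classes and a non-empty train/test split")
    rng = random.Random(seed)
    errors = []
    while len(errors) < n_sub:
        # Step 1: Monte Carlo sampling of a training sub-dataset.
        order = rng.sample(range(n), n)
        train, test = order[:n_train], order[n_train:]
        if len({y[i] for i in train}) < 2:
            continue  # the toy sub-model needs both classes; redraw
        # Step 2: toy sub-model, a midpoint threshold between the class means.
        m0 = mean([x[i] for i in train if y[i] == 0])
        m1 = mean([x[i] for i in train if y[i] == 1])
        threshold = (m0 + m1) / 2
        wrong = 0
        for i in test:
            predicted = 1 if (x[i] > threshold) == (m1 > m0) else 0
            wrong += predicted != y[i]
        errors.append(wrong / len(test))
    # Step 3 (statistical analysis) is left to the caller, e.g. mean(errors).
    return errors

def permute(values, seed=1):
    """Return a shuffled copy, which breaks the link between variable and labels."""
    shuffled = list(values)
    random.Random(seed).shuffle(shuffled)
    return shuffled
```

Comparing `mpa_error_distribution(x, y)` with `mpa_error_distribution(permute(x), y)` gives the two error distributions whose difference separates informative, uninformative and interfering variables in the sense described above.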

Subwindow permutation analysis (SPA) (Li et al., 2010b), a variable ranking method, combines the above-mentioned ideas with Monte Carlo sampling (MCS) and MPA. It assesses the importance of each variable based on the sub-models obtained by MCS. First, each sub-dataset, the so-called sub-window, is generated from the whole dataset through MCS in both sample space and variable space. Second, a sub-model is built on each sub-dataset and on each permutation of that sub-dataset. Third, for each variable, the prediction errors of the normal and the permuted sub-windows are compared. If a large number of MCS runs are performed, two distributions of prediction errors are obtained for each variable. Finally, informative variables are identified and ranked based on the P value of the Mann-Whitney U test (Mann and Whitney, 1947) on these two distributions. Margin influence analysis (MIA) (Li et al., 2011) is also based on the ideas of MCS and MPA. Although it was designed to work with SVM for identifying informative variables, this method likewise gives a measure for each variable according to the difference between the prediction errors obtained with and without that variable. However, the chance of being sampled by MCS is not the same for every variable. When some variables are selected more frequently than others, it is not appropriate to assess the importance of each variable using the strategy described above. To address this problem, a new sampling method in variable space, called binary matrix sampling (BMS) (Zhang et al., 2012a), was proposed. This method not only considers the synergetic effect among multiple variables, but also guarantees that each variable is selected with equal probability while generating a population of different variable combinations. Using this population of variable subsets, Yun et al. introduced a method called iteratively retains informative variables (IRIV) (Yun et al., 2014b), which employs the MPA strategy and finds the optimal subset of variables by observing the differences between the prediction errors obtained with and without each variable. Deng et al. developed an optimization algorithm called variable iterative space shrinkage approach (VISSA) to search for the optimal variable combinations (Deng et al., 2014). In VISSA, each variable is assigned a weight according to its importance in modeling. The weight of each variable accumulates through an iterative procedure, and variables are selected when their weights reach 1. Two rules are highlighted in the VISSA algorithm: first, the variable space shrinks smoothly in each step; second, the variable space is optimized in each step.
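The equal-probability property claimed for binary matrix sampling can be made concrete with a short sketch. It reflects one plausible reading of the idea (each column of a binary matrix holds the same number of ones and is shuffled independently) and is not taken from the cited implementation; the function name and the `ratio` parameter are assumptions made for illustration.

```python
import random

def binary_matrix_sampling(n_submodels, n_variables, ratio=0.5, seed=0):
    """Build a binary matrix whose rows are variable subsets.

    Entry [k][j] == 1 means variable j is included in sub-model k.
    Every column holds the same number of ones, so each variable is
    sampled equally often across the population of sub-models.
    """
    rng = random.Random(seed)
    n_ones = int(round(ratio * n_submodels))
    columns = []
    for _ in range(n_variables):
        column = [1] * n_ones + [0] * (n_submodels - n_ones)
        rng.shuffle(column)  # randomize which sub-models get this variable
        columns.append(column)
    # Transpose so that rows index sub-models and columns index variables.
    return [list(row) for row in zip(*columns)]
```

With plain Monte Carlo sampling the column sums would vary by chance; here they are fixed by construction. A row may still come out empty, so a practical implementation would need to redraw or skip such rows.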

Although SPA, MIA and IRIV consider the synergetic effect among multiple variables, they rarely investigate the complementary information between variables. Variable complementary network (VCN) is a method for visualizing the complementary relationships among biological variables (Li et al., 2012). It accumulates the information of many classification models obtained by MCS in variable space, quantitatively computes the complementary information among variables, and then effectively discovers biomarkers with the help of the mutual associations of metabolites. To summarize the methods mentioned above, Table 3 lists several variable selection methods according to whether or not they consider the interaction effects among variables.

PT

Insert Table 3

2.4 Modeling of the data

To explore high-dimensional metabolomics datasets and discover valuable information on biological events, a number of machine learning methods have been developed for modeling. Usually, analysis starts with a blind, unsupervised exploration of the data and continues with supervised analysis, in which a priori knowledge of the data structure is utilized. The main characteristics of the machine learning methods described below are summarized in Table 4, which gives the category, advantages and disadvantages of each method, together with some applications in metabolomics.

Insert Table 4

2.4.1 Unsupervised methods

Unsupervised methods are usually used to explore the overall structure of a dataset and to find trends and groupings within it. These methods provide an unbiased view of the data. Several unsupervised methods are available; among them, principal component analysis (PCA), hierarchical cluster analysis (HCA) and self-organizing maps (SOM) are the most frequently used in metabolomics.

PCA is one of the most popular multivariate statistical analysis methods in metabolomics (Pearson, 1901). The purpose of PCA is to obtain a linear transformation of the high-dimensional variables into a small number of factors, called principal components (PCs). The transformation is defined such that the first PC has the largest variance, and each subsequent PC has the largest possible variance under the constraint of being orthogonal to all preceding PCs. PCA produces two matrices, known as scores (i.e. PCs) and loadings. Scores give the new coordinates of the samples; loadings describe how the original variables are linearly combined into PCs. The distribution of samples can be visualized using a scores plot, which shows the projection of the samples onto the plane spanned by the first and second PCs. PCA is a suitable method for summarizing high-dimensional data. However, it is unable to find the direction or pattern of variables that best separates classes of objects. Yi et al. successfully employed PCA to represent the metabolic footprints of tangerine peels (Yi et al., 2009, Yi et al.). The idea and main results of volatile metabolic footprinting are shown in Figure 7 (Yi, Dong). In this study, characteristic secondary metabolites such as D-limonene and linalool were screened out based on the metabolic footprints of the tangerine peels. In addition, compounds such as 4-carene, 3-carene, β-pinene and γ-terpinene were identified as major contributors to the pungent smell of Pericarpium Citri Reticulatae Viride (PCRV), while geranyl acetate, farnesyl acetate and three alcohols (6-hepten-1-ol, 3-methyl-1-hexanol, 1-octanol) were responsible for the pleasant odor of Pericarpium Citri Reticulatae (PCR). The results indicated that plant metabolomics analysis focusing on the ripening process can be an effective strategy for revealing the chemical features of closely related herbal medicines such as PCR and PCRV, and is helpful for their quality control.

Insert Figure 7
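The relationship between loadings and scores in PCA can be shown with a short sketch that extracts only the first PC. Power iteration on the covariance matrix is used here purely to keep the example dependency-free; it is a swapped-in numerical technique, not the algorithm used in the studies cited above, and the function name is hypothetical. The sketch assumes the data are not constant and that the starting vector of ones is not orthogonal to the first PC.

```python
from statistics import mean

def first_principal_component(X, n_iter=200):
    """Return (loadings, scores) of the first PC via power iteration."""
    n, p = len(X), len(X[0])
    col_means = [mean(col) for col in zip(*X)]
    Xc = [[v - m for v, m in zip(row, col_means)] for row in X]  # mean-centering
    # Covariance matrix C = Xc^T Xc / (n - 1).
    C = [[sum(r[a] * r[b] for r in Xc) / (n - 1) for b in range(p)]
         for a in range(p)]
    w = [1.0] * p  # starting vector (assumed not orthogonal to PC1)
    for _ in range(n_iter):
        w = [sum(C[a][b] * w[b] for b in range(p)) for a in range(p)]
        norm = sum(v * v for v in w) ** 0.5
        w = [v / norm for v in w]
    # Scores are the coordinates of the centered samples along the loading vector.
    scores = [sum(v * l for v, l in zip(row, w)) for row in Xc]
    return w, scores
```

For two perfectly correlated variables, for example `[[1, 2], [2, 4], [3, 6], [4, 8]]`, the loading vector points along the line on which the samples lie, and the scores order the samples along that line.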

HCA is another unsupervised method widely used for modeling in metabolomics (Webb, 2003). HCA aims to group relatively similar samples into the same cluster and relatively dissimilar samples into different clusters. The input of HCA is a distance or dissimilarity matrix (e.g. based on Euclidean, Mahalanobis or Minkowski distances) that represents the dissimilarities among samples. The choice of distance measure has a significant influence on the clustering structure. Clearly, HCA works best only when the data have a hierarchical structure. HCA clusters the data into a tree called a dendrogram, which represents the similarities and differences among objects in a two-way structure. When HCA is used for classification, a similarity cut-off must first be decided; it separates the dendrogram into different clusters. HCA gives no information about why a certain clustering is obtained; that is, it cannot identify which metabolites are responsible for the differences between clusters. This is the main drawback of HCA.
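The role of the cut-off can be illustrated with a minimal agglomerative clustering sketch. Euclidean distance and single linkage are assumptions made for this example (the text does not prescribe a linkage rule), and the function name is hypothetical.

```python
import math

def single_linkage_clusters(points, cutoff):
    """Merge the closest clusters until their distance exceeds the cut-off."""
    clusters = [[i] for i in range(len(points))]

    def linkage(c1, c2):
        # Single linkage: distance between the two closest members.
        return min(math.dist(points[a], points[b]) for a in c1 for b in c2)

    while len(clusters) > 1:
        d, i, j = min((linkage(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        if d > cutoff:
            break  # cutting the dendrogram at this height
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]
```

Raising the cut-off merges more samples into fewer clusters, which corresponds to cutting the dendrogram closer to its root.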

SOM is a neural-network algorithm belonging to the unsupervised-learning category (Kohonen and Maps, 1995). For high-dimensional data, SOM forms a nonlinear projection onto a regular, low-dimensional grid, on which the clustering of the data space and the metric-topological relations of the data items are clearly visible.

2.4.2 Supervised methods

Supervised techniques use the a priori known structure of the data to learn patterns and rules, which are then used to predict new data. A wide range of supervised methods has been employed in metabolomics. Their advantage is that, during modeling, they provide information about the ability of variables to discriminate between two or more classes; they are therefore widely used in metabolomics for biomarker screening. Supervised techniques can be classified into linear methods, such as partial least squares-discriminant analysis (PLS-DA), linear discriminant analysis (LDA) and orthogonal projections to latent structures discriminant analysis (OPLS-DA), and non-linear methods, such as random forest (RF) and support vector machines (SVM).

LDA seeks a linear function of the original variables that maximizes the ratio of between-class variance to within-class variance (Webb, 2003). It is fast and powerful, and no parameter optimization is necessary. However, several limitations exist. LDA uses a single covariance matrix pooled over all classes; it is therefore not always appropriate if the variance structure differs between classes. In addition, the number of samples must be larger than the number of variables so that the covariance matrix can be inverted (Bishop, 2006). LDA is particularly suited to data in which multi-collinearity among compounds is not serious. In metabolomics datasets, the number of samples is commonly smaller than the number of variables, and in such cases LDA cannot be applied directly. A possible solution is to reduce the dimension of the variables before LDA. For example, an unsupervised method such as PCA can first be used for variable reduction, and LDA then applied to the relevant PCs. The number of PCs can be optimized by cross-validation (Stone, 1974, Wold, 1978).

The most widely used supervised method for classification is partial least squares-discriminant analysis (PLS-DA) (Barker and Rayens, 2003), which is a combination of partial least squares (PLS) regression (Wold et al., 2001b) and LDA. Like PCA, this technique is a latent variable extraction approach. The assumption is that the data can be well approximated in a lower-dimensional subspace by latent variables (LVs), which are assumed to be linear combinations of the original variables. Whereas the first PC (PC1) of PCA is obtained in the direction of highest variance of the data, the first LV (LV1) of PLS-DA lies in the direction that explains most of the between-class variation of the objects. A very important advantage of PLS-DA is that it can deal with highly collinear data; it is also suitable for spectrometric data. This method is built into most commercial chemometrics software but is still poorly understood by most users (Brereton and Lloyd, 2014). For instance, the projection plot of PLS-DA (scores plot) is very popular for classification in metabolomics because it separates the different classes, albeit from an overoptimistic view (Westerhuis et al., 2008). It must also be admitted that there are pitfalls when PLS-DA is used to model data with unequal class sizes (Brereton and Lloyd, 2014). Nevertheless, PLS-DA can provide excellent insight into the causes of discrimination via weights (Hoskuldsson, 2001), loadings, regression coefficients (Centner et al., 1996b), VIP (Wold et al., 1993) and selectivity ratio (SR) (Rajalahti et al., 2009c), and it has therefore become a useful tool for biomarker discovery.

A more recent modification of PLS-DA is orthogonal projections to latent structures discriminant analysis (OPLS-DA) (Trygg and Wold, 2002). The systematic variation in the data matrix X is split into two parts via the orthogonal signal correction (OSC) technique (Wold et al., 1998): one part is linearly related to the response, and the other is linearly orthogonal to it. In OPLS-DA, only the variation related to the response is used for modeling. It is important to note that OPLS-DA gives prediction results similar to those of PLS-DA (Tapp and Kemsley, 2009). However, OPLS-DA offers better visualization and interpretability, since fewer latent variables are required to explain the same variation in the data (Verron et al., 2004).

2.4.3 Non-linear methods

The supervised methods introduced above are well established for investigating linear relationships between variables and class labels. However, these approaches are not suitable for the strongly nonlinear datasets that may arise in intricate biological systems. Because complex interactions occur at different levels of biological organization, biological processes commonly follow a nonlinear response. In such cases, non-linear pattern recognition methods are required to characterize metabolomics data. Many non-linear techniques have been proposed in pattern recognition and machine learning research. Among them, kernel-PLS, random forest (RF) and support vector machines (SVM) are three popular methods used in metabolomics.

Kernel-based models transform the data via specific functions, the kernels. Through the kernel transformation, the original data are mapped into a higher-dimensional feature space, where the nonlinear problem becomes linear and can be solved easily. Various types of kernel function exist, and users can choose a kernel transformation suited to a given dataset. One requirement is that the kernel matrix be positive semi-definite, and many kernel functions satisfy it (Shawe-Taylor and Cristianini, 2004). The dot product is the simplest kernel function for the data matrix. The radial basis function is another frequently used kernel function, which requires tuning of a parameter related to the width of the Gaussian. Kernel-based classification methods such as kernel Fisher discriminant analysis (K-FDA) (Cao et al., 2011, Scholkopft and Mullert, 1999), kernel PLS (K-PLS) (Walczak and Massart, 1996) and kernel OPLS (KO-PLS) (Bylesjo et al., 2008) have been developed, and all exhibit clear advantages in solving nonlinear problems.

Support vector machine (SVM) is a powerful kernel-based classifier that makes use of a set of objects, named support vectors, to define decision boundaries separating different classes (Vapnik, 1998). SVM focuses on finding a hyperplane that splits two classes while maximizing the width of the margin, that is, the distance from the hyperplane to the closest data points of each class (Li et al., 2009b, Luts et al., 2010). If a point lies on the wrong side of the margin, the margin is maximized while that point is penalized, which allows overlapping classes to be split. Support vectors are the points that lie on the margin boundary or on the wrong side of the margin. When the classes are separated by a non-linear boundary, the kernel method is used to find the boundary. SVM is particularly suitable for data with a small sample size, and it can handle both linear and nonlinear classification problems by applying linear and non-linear kernels. The major disadvantage of SVM is that it does not provide a universal way of solving non-linear problems; hence, the kernel function should be selected with care (Burges, 1998).
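The two kernel functions mentioned above, the dot product and the radial basis function, can be written out explicitly. The sketch below is illustrative only; the parameter name `sigma` for the Gaussian width and the helper `kernel_matrix` are assumptions of this example rather than the notation of any cited work.

```python
import math

def linear_kernel(x, z):
    """Dot-product kernel: k(x, z) = <x, z>."""
    return sum(a * b for a, b in zip(x, z))

def rbf_kernel(x, z, sigma=1.0):
    """Radial basis function kernel: k(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))

def kernel_matrix(X, kernel):
    """Kernel (Gram) matrix with entries K[i][j] = kernel(X[i], X[j])."""
    return [[kernel(x, z) for z in X] for x in X]
```

A small `sigma` makes the RBF kernel matrix approach the identity (every sample is similar only to itself), whereas a large `sigma` makes all entries approach 1, which is why the width must be tuned for each dataset.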

Random forest (RF) (Breiman, 2001) is an ensemble learning method consisting of a large number of classification trees (Breiman et al., 1984). It is one of the most powerful classifiers for high-dimensional data (Scott et al., 2013). A bootstrap method is used to select samples with replacement from the original samples (the so-called bootstrap samples) for training the classification trees. All trees in the forest are grown to maximum size, without pruning. Two machine learning techniques, bagging and random feature selection, are employed in RF; both are powerful and efficient. In bagging, each tree is trained on a bootstrap sample of the training dataset, and predictions are obtained by majority vote of the trees. When RF constructs an individual tree, not all training samples are used, so a set of out-of-bag (OOB) samples exists, which can be used to obtain a validated estimate of classification accuracy. In RF, variable importance is measured by randomly permuting one variable while keeping all other variables fixed and computing the resulting loss of classification accuracy on the OOB samples. It is defined as the average accuracy loss over all trees and all samples in the forest. Because RF performs variable selection simultaneously with classification, it is suitable for high-dimensional data in which irrelevant variables exist. RF has been reported to perform much better than many classifiers, such as PLS-DA and OPLS-DA, under external validation. Unfortunately, it has not yet attracted enough attention in metabolomics (Scott, Lin, 2013).
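The bootstrap sampling and out-of-bag bookkeeping that underlie RF can be sketched without building any trees. The snippet below only shows how in-bag and OOB indices arise and how tree votes are combined; the function names are hypothetical, and tree construction and random feature selection are deliberately omitted.

```python
import random
from collections import Counter

def bootstrap_with_oob(n_samples, rng):
    """Draw n_samples indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n_samples) if i not in chosen]
    return in_bag, out_of_bag

def majority_vote(predictions):
    """Combine the class predictions of individual trees by majority vote."""
    return Counter(predictions).most_common(1)[0][0]
```

For example, `bootstrap_with_oob(100, random.Random(0))` returns 100 in-bag indices, some of them repeated, together with the samples that were never drawn. Those OOB samples are the ones on which the tree trained from that bootstrap sample can be validated.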

2.4.4 Model tuning and model validation

The tuning of parameters is of great importance when building a model. For example, one of the crucial steps in PCA and PLS is to optimize the number of components. Many methods can be used for model tuning, including Mallows' Cp statistic (Mallows, 1973), the Akaike information criterion (AIC) (Akaike, 1974), the Bayesian information criterion (BIC) (Schwarz, 1978) and cross-validation (CV) (Stone, 1974, Wold, 1978). Among them, CV is the most commonly used because it chooses a model from the point of view of prediction ability. The simplest CV method is leave-one-out CV (LOOCV), in which each sample in turn is left out for prediction while the others are used for training. However, LOOCV has been reported to select overly large models if only the prediction error is used (Shao, 1993). One solution is to partition the samples differently, as in K-fold CV (Geisser, 1975) and Monte Carlo CV (MCCV) (Shao, 1993).
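The three partitioning schemes differ only in how sample indices are divided into training and test sets, which the following sketch makes explicit. The generator names and the `test_ratio` parameter are illustrative assumptions.

```python
import random

def loocv_splits(n):
    """Leave-one-out: each sample is the test set exactly once."""
    for i in range(n):
        yield [j for j in range(n) if j != i], [i]

def kfold_splits(n, k, seed=0):
    """K-fold: shuffle once, cut into k folds, test on each fold in turn."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::k] for f in range(k)]
    for f in range(k):
        train = [i for g in range(k) if g != f for i in folds[g]]
        yield train, folds[f]

def mccv_splits(n, n_repeats, test_ratio=0.3, seed=0):
    """Monte Carlo CV: repeated random splits; test sets may overlap across repeats."""
    rng = random.Random(seed)
    n_test = max(1, int(round(test_ratio * n)))
    for _ in range(n_repeats):
        idx = rng.sample(range(n), n)
        yield idx[n_test:], idx[:n_test]
```

In LOOCV and K-fold CV every sample is predicted exactly once, whereas in MCCV the number of repeats and the size of the left-out part are chosen freely.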

Model validation is the process of deciding whether the results quantify the hypothesized relationships between the variables and the responses and provide an accurate estimate of the prediction ability of the model. Supervised machine learning methods such as PLS-DA have a strong tendency to over-fit, especially on high-dimensional data (Brereton, 2006, Li et al., 2010c). Careful model validation is therefore required. Figure 8 shows an example of CV for model validation. Four datasets with random values were simulated. Each dataset has 100 samples, and the number of variables is set to 5, 50, 500 and 5000, respectively. The class label of each sample is randomly assigned, and PLS-DA is used to classify the samples. In the PLS scores plots (Fig. 8 (A)), the two groups can be separated, and the separation becomes clearly better as the number of variables increases, indicating the presence of over-fitting. In the PLS CV plots (Fig. 8 (B)), by contrast, the two groups cannot be separated, showing that the model has no predictive ability and should not be used for prediction.

Insert Figure 8

Several criteria can be used to evaluate the prediction ability of a model, such as sensitivity, specificity, accuracy, the receiver operating characteristic (ROC) curve and Q². Sensitivity, i.e. the true positive rate, is the proportion of actual positive samples that are correctly identified as positive. Specificity, i.e. the true negative rate, is the proportion of actual negative samples that are correctly identified as negative. The accuracy of a classification model is the proportion of all objects that are correctly classified. A ROC curve is a plot of sensitivity versus 1-specificity at different classification boundaries, and thus describes the sensitivity and specificity of a classification model across those boundaries. For a perfect classification, sensitivity should be close to 1 and 1-specificity close to 0. The area under this curve (AUC) is used to quantify the performance of the method: the closer the AUC is to 1, the better the method performs.
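The definitions above translate directly into code. The following sketch assumes class labels coded 1 (positive) and 0 (negative) and a continuous classifier score where larger values indicate the positive class; the function names and the toy data are illustrative only.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for labels coded 1 (positive) and 0 (negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def classification_metrics(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return {
        "sensitivity": tp / (tp + fn),          # true positive rate
        "specificity": tn / (tn + fp),          # true negative rate
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }

def roc_curve(y_true, scores):
    """Points (1 - specificity, sensitivity), one per classification boundary."""
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        y_pred = [1 if s >= threshold else 0 for s in scores]
        tp, tn, fp, fn = confusion_counts(y_true, y_pred)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

if __name__ == "__main__":
    y_true = [1, 1, 1, 0, 1, 0, 0, 0]
    scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]
    y_pred = [1 if s >= 0.5 else 0 for s in scores]
    print(classification_metrics(y_true, y_pred))
    # sensitivity 1.0, specificity 0.75, accuracy 0.875
    print(auc(roc_curve(y_true, scores)))  # 0.9375
```

In the toy data one negative sample scores higher than one positive sample, so 15 of the 16 positive-negative pairs are ranked correctly and the AUC is 15/16.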

The prediction error measure Q² quantifies how well the class labels can be predicted for new data. It is defined as follows:

$$Q^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

where $\hat{y}_i$ denotes the predicted value of sample i and $\bar{y}$ denotes the mean value of y over all samples. If all the samples are predicted very well, Q² is close to 1.
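As a small worked example of the definition above, the sketch below computes Q² from observed and predicted responses. It assumes that the predictions were obtained for samples left out of model building (for instance by CV) and that class labels are coded numerically as +1/-1; the function name is hypothetical.

```python
def q2(y_true, y_pred):
    """Q2 = 1 - PRESS / TSS for observed responses and held-out predictions."""
    y_bar = sum(y_true) / len(y_true)
    press = sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred))
    tss = sum((y - y_bar) ** 2 for y in y_true)
    return 1.0 - press / tss

if __name__ == "__main__":
    y = [1, 1, -1, -1]
    print(q2(y, [0.8, 0.6, -0.9, -0.7]))  # close to 1: good prediction
    print(q2(y, [0.0, 0.0, 0.0, 0.0]))    # 0: no better than predicting the mean
```

A model that always predicts the mean response gives Q² = 0, and predictions worse than the mean give negative values, so Q² is not bounded below.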

In CV, model tuning and model validation are carried out simultaneously. Once the optimal model parameter has been determined, characteristics of prediction ability such as Q² are obtained from the same tuning run. However, this strategy may give over-optimistic estimates of the prediction ability of the model (Krstajic et al., 2014). A more systematic way is to use double CV (DCV) (Filzmoser et al., 2009, Stone, 1974). It consists of two loops: the inner loop is used for model tuning and the outer loop is used for model validation. The samples used for prediction in the outer loop do not participate in model tuning. DCV has been shown to estimate the error rate more accurately than 6-fold CV (Westerhuis, Hoefsloot, 2008).

The ideal situation for model validation is to use an independent test set, which is assumed to be representative and independent of the training data. A number of algorithms are available for dividing samples into a training set and a test set, including the Duplex algorithm (Snee, 1977), the Kennard-Stone (KS) algorithm (Kennard and Stone, 1969) and the SPXY algorithm (Galvao et al., 2005). However, this ideal situation is rarely met in practice, and the results obtained on the test set may therefore be biased.
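The double CV scheme described above can be sketched as two nested loops. To keep the example short and dependency-free, a k-nearest-neighbour classifier is swapped in for PLS-DA, with the number of neighbours playing the role of the tuned parameter (the analogue of the number of PLS components). The fold counts, the parameter grid and all function names are assumptions made for illustration.

```python
import random

def kfold(indices, k):
    """Split a list of indices into k interleaved folds."""
    return [indices[f::k] for f in range(k)]

def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k nearest training samples (labels +1/-1, odd k)."""
    order = sorted(range(len(train_X)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)))
    votes = sum(train_y[i] for i in order[:k])
    return 1 if votes > 0 else -1

def error_rate(X, y, train, test, k):
    train_X = [X[i] for i in train]
    train_y = [y[i] for i in train]
    wrong = sum(knn_predict(train_X, train_y, X[i], k) != y[i] for i in test)
    return wrong / len(test)

def double_cv(X, y, k_grid=(1, 3, 5), outer_k=5, inner_k=4, seed=0):
    """Return (mean outer-loop error rate, parameter chosen in each outer fold)."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    outer_errors, chosen = [], []
    for test in kfold(idx, outer_k):
        held_out = set(test)
        rest = [i for i in idx if i not in held_out]
        inner_folds = kfold(rest, inner_k)

        def inner_error(k):
            # Inner loop: tuning uses only samples outside the outer test fold.
            errs = []
            for val in inner_folds:
                val_set = set(val)
                tr = [i for i in rest if i not in val_set]
                errs.append(error_rate(X, y, tr, val, k))
            return sum(errs) / len(errs)

        best_k = min(k_grid, key=inner_error)
        # Outer loop: the tuned model is assessed on samples never used for tuning.
        outer_errors.append(error_rate(X, y, rest, test, best_k))
        chosen.append(best_k)
    return sum(outer_errors) / len(outer_errors), chosen
```

The key design point is that `best_k` is selected separately inside every outer fold, so the outer test samples influence neither the parameter choice nor the fitted model that predicts them.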

Permutation testing is another way of validating a model. The classification ability of the established model is compared with that of models built on randomly assigned class labels, to evaluate whether the former is significantly better than the latter (Golland et al., 2005). In a permutation test, the class labels of the samples are permuted. The rationale is that a model built with wrong class labels cannot predict the classes well. By repeating the permutation a large number of times, a group of "wrong" models is built, and the distributions of accuracy, Q² and AUC can be obtained. For a valid model, the difference between the "right" model and the "wrong" models should be significant, which can be assessed through statistical hypothesis testing. Permutation tests have many applications in metabolomics studies (Blaise et al., 2013, Huang et al., 2013).

Modeling of metabolomics data is a systematic task. For exploratory studies, unsupervised methods such as PCA provide an informative first look at the structure of the dataset and the relationships between groups. Supervised methods such as PLS-DA and OPLS-DA are then applied to classify the samples and to discover biomarkers. When these classifiers perform poorly, non-linear models such as SVM and RF are applied to explore non-linear relationships within the data. In addition, the parameters of each model should be well tuned, and the model should be validated with caution to ensure its prediction ability for future samples.
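The permutation test described above can be sketched as follows. For brevity a nearest-centroid classifier evaluated by LOOCV stands in for PLS-DA, and accuracy is used as the test statistic; the classifier, the number of permutations and the function names are assumptions of this sketch rather than the procedure of the cited studies.

```python
import random

def centroid(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def loocv_accuracy(X, y):
    """LOOCV accuracy of a nearest-centroid classifier (stand-in for PLS-DA)."""
    correct = 0
    for i in range(len(X)):
        groups = {}
        for j, (row, label) in enumerate(zip(X, y)):
            if j != i:
                groups.setdefault(label, []).append(row)
        cents = {label: centroid(rows) for label, rows in groups.items()}
        pred = min(cents,
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(X[i], cents[c])))
        correct += pred == y[i]
    return correct / len(X)

def permutation_test(X, y, n_perm=200, seed=0):
    """Return (observed accuracy, null accuracies, empirical p-value)."""
    rng = random.Random(seed)
    observed = loocv_accuracy(X, y)
    null = []
    for _ in range(n_perm):
        y_perm = list(y)
        rng.shuffle(y_perm)  # break the link between data and class labels
        null.append(loocv_accuracy(X, y_perm))
    # One-sided p-value: share of "wrong" models at least as good as the "right" one.
    p_value = (1 + sum(s >= observed for s in null)) / (1 + n_perm)
    return observed, null, p_value
```

Adding 1 to both numerator and denominator counts the observed labelling as one of the permutations, so the p-value can never be exactly zero; its smallest possible value is 1/(n_perm + 1).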


2.5. One eye on the future

Compared with animals, plants have an extremely wide diversity of metabolites. It is estimated that more than 200,000 metabolites are present in the plant kingdom (Oksman-Caldentey and Saito, 2005). Numerous authors have demonstrated that data analysis based on an individual dataset is limited in its ability to capture the chemical complexity of the plant metabolome. A large amount of data and information is being generated by numerous experimental platforms (e.g. NMR, GC-MS or LC-MS). Consequently, combining information becomes more and more necessary to extend metabolite coverage and to characterize a biological system (Boccard and Rudaz, 2014). The greatest future challenge is how to efficiently integrate the mass of information from various sources, i.e. the data fusion problem. Merging information from multiple datasets with different structural characteristics and extracting their common or distinctive features will unquestionably be a crucial element of a more comprehensive view of plant metabolomics. More and more papers discussing the problem of data fusion have been published since 2005 (Boccard and Rudaz, 2014, Smilde et al., 2005). A further level of fusion is integration with other "Omics" fields, such as genomics, transcriptomics and proteomics. These are all effective strategies for describing a whole biological system. However, care should be taken to avoid network discordance when metabolomics is integrated with other "Omics" (Fernie and Stitt, 2012).

3. Conclusions


In summary, metabolomics, as a fundamental biotechnology, plays an essential role in basic research for elucidating environmental effects and gene functions and for defining cellular processes. Nevertheless, data acquisition, processing and interpretation still need to be approached with caution because of the numerous limitations of data analysis in plant metabolomics. Here we emphasize four questions that are of great importance to the advance of data analysis in plant metabolomics. 1) Automatic and effective data preprocessing: this remains hard to achieve, especially for the detection, alignment and deconvolution of peaks with low responses. 2) The NP-hard problem in variable selection: addressing this question is an attractive but intractable task for all researchers. 3) Confident identification of an unknown metabolite from complex MS spectral data remains a great challenge. 4) Model validation: new, efficient model validation methods and indexes are urgently needed, and they should be carefully selected in practice to guarantee that the resulting models are fully validated and have good prediction ability for future real samples. All these questions, together with the high-dimensional nature of metabolomics datasets, pose many fundamental questions for chemometrics, which faces enormous challenges in developing robust and efficient methods to answer the various biological questions arising from metabolomics. We hope that this review can serve as a guide for practitioners of plant metabolomics and provide insights into the present use and applications of data analysis.

Conflicts of interest statement

The authors declare no conflicts of interest.

Acknowledgements

This work was supported financially by the National Nature Foundation Committee of P.R. China (No. 21465016, No. 21105129, No. 91127024 and No. 21473257) and the Science and Technological Program for Dongguan's Higher Education, Science and Research, and Health Care Institutions (2012108102032).

References

Akaike H. A new look at the statistical model identification. Automatic Control, IEEE Transactions on. 1974;19:716-23.

Alba E, Garcia-Nieto J, Jourdan L, Talbi E. Gene selection in cancer classification using PSO/SVM and GA/SVM hybrid algorithms. Evolutionary Computation, 2007 CEC 2007 IEEE Congress on. 2007. p. 284-90.

Allen F, Greiner R, Wishart D. Competitive fragmentation modeling of ESI-MS/MS spectra for putative metabolite identification. Metabolomics. 2014a:1-13.

Allen F, Pon A, Wilson M, Greiner R, Wishart D. CFM-ID: a web server for annotation, spectrum prediction and metabolite identification from tandem mass spectra. Nucleic Acids Res. 2014b;42:W94-9.

Allwood JW, Ellis DI, Goodacre R. Metabolomic technologies and their application to the study of plants and plant–host interactions. Physiologia Plantarum. 2008;132:117-35.

Allwood JW, Goodacre R. An introduction to liquid chromatography–mass spectrometry instrumentation applied in plant metabolomic analyses. Phytochemical analysis. 2010;21:33-47.

Allwood JW, Parker D, Beckmann M, Draper J, Goodacre R. Fourier transform ion cyclotron resonance mass spectrometry for plant metabolite profiling and metabolite identification. Plant Metabolomics: Springer; 2012. p. 157-76.

Anastassiou D. Computational analysis of the synergy among multiple interacting genes. Molecular Systems Biology. 2007;3:n/a-n/a.

Andreev VP, Rejtar T, Chen H-S, Moskovets EV, Ivanov AR, Karger BL. A universal denoising and peak picking algorithm for LC-MS based on matched filtration in the chromatographic time domain. Analytical chemistry. 2003;75:6314-26.

Araújo MCU, Saldanha TCB, Galvão RKH, Yoneyama T, Chame HC, Visani V. The successive projections algorithm for variable selection in spectroscopic multicomponent analysis. Chemometr Intell Lab. 2001;57:65-73.

BaniMustafa AH, Hardy NW. A Strategy for Selecting Data Mining Techniques in Metabolomics. Plant Metabolomics: Springer; 2012. p. 317-33.

Baran R, Kochi H, Saito N, Suematsu M, Soga T, Nishioka T, et al. MathDAMP: a package for differential analysis of metabolite profiles. BMC bioinformatics. 2006;7:530.

Barker M, Rayens W. Partial least squares for discrimination. J Chemometr. 2003;17:166-73.

Bellew M, Coram M, Fitzgibbon M, Igra M, Randolph T, Wang P, et al. A suite of algorithms for the comprehensive analysis of complex protein mixtures using high-resolution LC-MS. Bioinformatics. 2006;22:1902-9.

Ben-Bassat M. Pattern recognition and reduction of dimensionality. Handbook of Statistics. 1982;2:773-910.

Benecke C, Grund R, Hohberger R, Kerber A, Laue R, Wieland T. Molgen(+), a Generator of Connectivity Isomers and Stereoisomers for Molecular-Structure Elucidation. Analytica Chimica Acta. 1995;314:141-7.

Benton H, Wong D, Trauger S, Siuzdak G. XCMS2: processing tandem mass spectrometry data for metabolite identification and structural characterization. Analytical chemistry. 2008;80:6382-9.

Bertini I, Luchinat C, Miniati M, Monti S, Tenori L. Phenotyping COPD by 1H NMR metabolomics of exhaled breath condensate. Metabolomics. 2014;10:302-11.

Bertrand S, Bohni N, Schnee S, Schumpp O, Gindro K, Wolfender J-L. Metabolite induction via microorganism co-culture: A potential way to enhance chemical diversity for drug discovery. Biotechnology Advances. 2014;32:1180-204.

Biais B, Bernillon S, Deborde C, Cabasson C, Rolin D, Tadmor Y, et al. Precautions for harvest, sampling, storage, and transport of crop plant metabolomics samples. Plant Metabolomics: Springer; 2012. p. 51-63.

Bishop CM. Pattern recognition and machine learning: Springer New York; 2006.

Blaise BJ, Gouel-Cheron A, Floccard B, Monneret G, Allaouchiche B. Metabolic Phenotyping of Traumatized Patients Reveals a Susceptibility to Sepsis. Analytical Chemistry. 2013;85:10850-5.

Blanchet FG, Legendre P, Borcard D. Forward selection of explanatory variables. Ecology. 2008;89:2623-32.

Boccard J, Rudaz S. Harnessing the complexity of metabolomic data with chemometrics. Journal of Chemometrics. 2014;28:1-9.

Boccard J, Veuthey JL, Rudaz S. Knowledge discovery in metabolomics: an overview of MS data handling. Journal of separation science. 2010;33:290-304.

Bocker S, Letzel MC, Liptak Z, Pervukhin A. SIRIUS: decomposing isotope patterns for metabolite identification. Bioinformatics. 2009;25:218-24.

Bocker S, Rasche F. Towards de novo identification of metabolites by analyzing tandem mass spectra. Bioinformatics. 2008;24:i49-i55.

Boelens HF, Dijkstra RJ, Eilers PH, Fitzpatrick F, Westerhuis JA. New background correction method for liquid chromatography with diode array detection, infrared spectroscopic detection and Raman spectroscopic detection. Journal of Chromatography A. 2004;1057:21-30.

Bonn B, Leandersson C, Fontaine F, Zamora I. Enhanced metabolite identification with MS(E) and a semi-automated software for structural elucidation. Rapid Commun Mass Spectrom. 2010;24:3127-38.

Breiman L. Random forests. Mach Learn. 2001;45:5-32.

Breiman L, Friedman J, Stone CJ, Olshen RA. Classification and regression trees: CRC press; 1984.

Breitling R, Ritchie S, Goodenowe D, Stewart ML, Barrett MP. Ab initio prediction of metabolic networks using Fourier transform mass spectrometry data. Metabolomics. 2006;2:155-64.

Brereton RG. Consequences of sample size, variable selection, and model validation and optimisation, for predicting classification ability from analytical data. Trac-Trend Anal Chem. 2006;25:1103-11.

Brereton RG, Lloyd GR. Partial least squares discriminant analysis: taking the magic away. J Chemometr. 2014;28:213-25.

Brown M, Dunn WB, Dobson P, Patel Y, Winder CL, Francis-McIntyre S, et al. Mass spectrometry tools and metabolite-specific databases for molecular identification in metabolomics. Analyst. 2009;134:1322-32.

Brown M, Wedge DC, Goodacre R, Kell DB, Baker PN, Kenny LC, et al. Automated workflows for accurate mass-based putative metabolite identification in LC/MS-derived metabolomic datasets. Bioinformatics. 2011;27:1108-12.

Burges CJ. A tutorial on support vector machines for pattern recognition. Data mining and knowledge discovery. 1998;2:121-67.

Bylesjo M, Rantalainen M, Nicholson J, Holmes E, Trygg J. K-OPLS package: Kernel-based orthogonal projections to latent structures for prediction and interpretation in feature space. BMC Bioinformatics. 2008;9:106.

Bylund D, Danielsson R, Malmquist G, Markides KE. Chromatographic alignment by warping and dynamic programming as a pre-processing tool for PARAFAC modelling of liquid chromatography–mass spectrometry data. Journal of Chromatography A. 2002;961:237-44.

Cai W, Li Y, Shao X. A variable selection method based on uninformative variable elimination for multivariate calibration of near-infrared spectra. Chemometr Intell Lab. 2008;90:188-94.

Cao DS, Zeng MM, Yi LZ, Wang B, Xu QS, Hu QN, et al. A novel kernel Fisher discriminant analysis: constructing informative kernel by decision tree ensemble for metabolomics data analysis. Anal Chim Acta. 2011;706:97-104.

Cao MD, Sitter B, Bathen TF, Bofin A, Lønning PE, Lundgren S, et al. Predicting long-term survival and treatment response in breast cancer patients receiving neoadjuvant chemotherapy by MR metabolic profiling. NMR in Biomedicine. 2012;25:369-78.

Castillo S, Gopalacharyulu P, Yetukuri L, Orešič M. Algorithms and tools for the preprocessing of LC–MS metabolomics data. Chemometrics and Intelligent Laboratory Systems. 2011;108:23-32.

Centner V, Massart D-L, de Noord OE, de Jong S, Vandeginste BM, Sterna C. Elimination of Uninformative Variables for Multivariate Calibration. Anal Chem. 1996a;68:3851-8.

Centner V, Massart DL, de Noord OE, de Jong S, Vandeginste BM, Sterna C. Elimination of uninformative variables for multivariate calibration. Anal Chem. 1996b;68:3851-8.

Chan ECY, Koh PK, Mal M, Cheah PY, Eu KW, Backshall A, et al. Metabolic Profiling of Human Colorectal Cancer Using High-Resolution Magic Angle Spinning Nuclear Magnetic Resonance (HR-MAS NMR) Spectroscopy and Gas Chromatography Mass Spectrometry (GC/MS). J Proteome Res. 2009;8:352-61.

Chen T, Martin E. Bayesian linear regression and variable selection for spectroscopic calibration. Anal Chim Acta. 2009;631:13-21.

Chong IG, Jun CH. Performance of some variable selection methods when multicollinearity is present. Chemometrics and Intelligent Laboratory Systems. 2005;78:103-12.

Creek D, Dunn W, Fiehn O, Griffin J, Hall R, Lei Z, et al. Metabolite identification: are you sure? And how do your peers gauge your confidence? Metabolomics. 2014;10:350-3.

Creek DJ, Jankevics A, Burgess KE, Breitling R, Barrett MP. IDEOM: an Excel interface for analysis of LC-MS-based metabolomics data. Bioinformatics. 2012;28:1048-9.

Cusido RM, Onrubia M, Sabater-Jara AB, Moyano E, Bonfill M, Goossens A, et al. A rational approach to improving the biotechnological production of taxanes in plant cell cultures of Taxus spp. Biotechnology Advances. 2014;32:1157-67.

Damen H, Henneberg D, Weimann B. Siscom - a new library search system for mass spectra. Analytica Chimica Acta. 1978;103:289-302.

Danielsson R, Bylund D, Markides KE. Matched filtering with background suppression for improved quality of base peak chromatograms and mass spectra in liquid chromatography–mass spectrometry. Analytica Chimica Acta. 2002;454:167-84.

Davey MR, Anthony P, Power JB, Lowe KC. Plant protoplasts: status and biotechnological perspectives. Biotechnology Advances. 2005;23:131-71.

De Souza DP, Saunders EC, McConville MJ, Likić VA. Progressive peak clustering in GC-MS Metabolomic experiments applied to Leishmania parasites. Bioinformatics. 2006;22:1391-6.

De Vos RC, Moco S, Lommen A, Keurentjes JJ, Bino RJ, Hall RD. Untargeted large-scale plant metabolomics using liquid chromatography coupled to mass spectrometry. Nature protocols. 2007;2:778-91.

Deborde C, Erban A, Kopka J, Goodacre R, Hall RD. Plant metabolomics and its potential for systems biology research: Background concepts, technology, and methodology. Methods in Systems Biology. 2011;500:299.

Deng B-c, Yun Y-h, Liang Y-z, Yi L-z. A novel variable selection approach that iteratively optimizes variable space using weighted binary matrix sampling. Analyst. 2014;139:4836-45.

Doerfler H, Sun X, Wang L, Engelmeier D, Lyon D, Weckwerth W. mzGroupAnalyzer--predicting pathways and novel chemical structures from untargeted high-throughput metabolomics data. PLoS One. 2014;9:e96188.

Dong H, Zhang AH, Sun H, Wang HY, Lu X, Wang M, et al. Ingenuity pathways analysis of urine metabolomics phenotypes toxicity of Chuanwu in Wistar rats by UPLC-Q-TOF-HDMS coupled with pattern recognition methods. Mol Biosyst. 2012;8:1206-21.

Draisma HHM, Reijmers TH, Meulman JJ, van der Greef J, Hankemeier T, Boomsma DI. Hierarchical clustering analysis of blood plasma lipidomics profiles from mono- and dizygotic twin families. Eur J Hum Genet. 2013;21:95-101.

Draper J, Enot DP, Parker D, Beckmann M, Snowdon S, Lin W, et al. Metabolite signal identification in accurate mass metabolomics data with MZedDB, an interactive m/z annotation tool utilising predicted ionisation behaviour 'rules'. BMC Bioinformatics. 2009;10:227.

Du P, Kibbe WA, Lin SM. Improved peak detection in mass spectrum by incorporating continuous wavelet transform-based pattern matching. Bioinformatics. 2006;22:2059-65.

Dunn WB, Broadhurst D, Begley P, Zelena E, Francis-McIntyre S, Anderson N, et al. Procedures for large-scale metabolic profiling of serum and plasma using gas chromatography and liquid chromatography coupled to mass spectrometry. Nat Protoc. 2011;6:1060-83.

Duran AL, Yang J, Wang L, Sumner LW. Metabolomics spectral formatting, alignment and conversion tools (MSFACTs). Bioinformatics. 2003;19:2283-93.

Egertson JD, Eng JK, Bereman MS, Hsieh EJ, Merrihew GE, MacCoss MJ. De novo correction of mass measurement error in low resolution tandem MS spectra for shotgun proteomics. J Am Soc Mass Spectrom. 2012;23:2075-82.

Eilers PH. Parametric time warping. Analytical chemistry. 2004;76:404-11.

Eng JK, Fischer B, Grossmann J, MacCoss MJ. A Fast SEQUEST Cross Correlation Algorithm. J Proteome Res. 2008;7:4598-602.

Ernst M, Silva DB, Silva RR, Vêncio RZ, Lopes NP. Mass spectrometry in plant metabolomics strategies: from analytical platforms to data acquisition and processing. Natural product reports. 2014.

Erve JCL, Gu M, Wang YD, DeMaio W, Talaat RE. Spectral Accuracy of Molecular Ions in an LTQ/Orbitrap Mass Spectrometer and Implications for Elemental Composition Determination. J Am Soc Mass Spectr. 2009;20:2058-69.

Fan Y, Murphy TB, Byrne JC, Brennan L, Fitzpatrick JM, Watson RWG. Applying Random Forests To Identify Biomarker Panels in Serum 2D-DIGE Data for the Detection and Staging of Prostate Cancer. J Proteome Res. 2010;10:1361-73.

Favilla S, Durante C, Vigni ML, Cocchi M. Assessing feature relevance in NPLS models by VIP. Chemom Intell Lab Syst. 2013;129:76-86.

Felinger A. Data analysis and signal processing in chromatography: Elsevier; 1998.

Fenn JB, Mann M, Meng CK, Wong SF, Whitehouse CM. Electrospray Ionization for Mass Spectrometry of Large Biomolecules. Science. 1989;246:64-71.

Fernandez-Albert F, Llorach R, Andres-Lacueva C, Perera A. An R package to analyse LC/MS metabolomic data: MAIT (Metabolite Automatic Identification Toolkit). Bioinformatics. 2014;30:1937-9.

Fernie AR, Stitt M. On the discordance of metabolomics with proteomics and transcriptomics: coping with increasing complexity in logic, chemistry, and network interactions scientific correspondence. Plant Physiology. 2012;158:1139-45.

Fiehn O, Kopka J, Trethewey RN, Willmitzer L. Identification of uncommon plant metabolites based on calculation of elemental compositions using gas chromatography and quadrupole mass spectrometry. Anal Chem. 2000;72:3573-80.

Fiehn O, Spranger J. Use of Metabolomics to Discover Metabolic Patterns Associated with Human Diseases. In: Harrigan G, Goodacre R, editors. Metabolic Profiling: Its Role in Biomarker Discovery and Gene Function Analysis: Springer US; 2003. p. 199-215.

Field D, Sansone SA. A special issue on data standards. Omics-a Journal of Integrative Biology. 2006;10:84-93.

Filzmoser P, Liebmann B, Varmuza K. Repeated double cross validation. J Chemometr. 2009;23:160-71.

Forina M, Casolino C, Pizarro Millan C. Iterative predictor weighting (IPW) PLS: a technique for the elimination of useless predictors in regression problems. J Chemometr. 1999;13:165-84.

Galvao RKH, Araujo MCU, Jose GE, Pontes MJC, Silva EC, Saldanha TCB. A method for calibration and validation subset partitioning. Talanta. 2005;67:736-40.

Gan F, Ruan G, Mo J. Baseline correction by improved iterative polynomial fitting with automatic threshold. Chemometrics and Intelligent Laboratory Systems. 2006;82:59-65.

Geisser S. The predictive sample reuse method with applications. J Am Stat Assoc. 1975;70:320-8.

Genga A, Mattana M, Coraggio I, Locatelli F, Piffanelli P, Consonni R. Plant Metabolomics: A characterisation of plant responses to abiotic stresses. 2011.

Gerlich M, Neumann S. MetFusion: integration of compound identification strategies. Journal of Mass Spectrometry. 2013;48:291-8.

Gika HG, Macpherson E, Theodoridis GA, Wilson ID. Evaluation of the repeatability of ultra-performance liquid chromatography–TOF-MS for global metabolic profiling of human urine samples. Journal of Chromatography B. 2008a;871:299-305.

Gika HG, Theodoridis G, Extance J, Edge AM, Wilson ID. High temperature-ultra performance liquid chromatography–mass spectrometry for the metabonomic analysis of Zucker rat urine. Journal of Chromatography B. 2008b;871:279-87.

Gipson GT, Tatsuoka KS, Sokhansanj BA, Ball RJ, Connor SC. Assignment of MS-based metabolomic datasets via compound interaction pair mapping. Metabolomics. 2008;4:94-103.

Golland P, Liang F, Mukherjee S, Panchenko D. Permutation tests for classification. Learning Theory: Springer; 2005. p. 501-15.

Goodacre R. Making sense of the metabolome using evolutionary computation: seeing the wood with the trees. Journal of experimental botany. 2005;56:245-54.

Goodacre R, Vaidyanathan S, Dunn WB, Harrigan GG, Kell DB. Metabolomics by numbers: acquiring and understanding global metabolite data. Trends in biotechnology. 2004;22:245-52.

Gosselin R, Rodrigue D, Duchesne C. A Bootstrap-VIP approach for selecting wavelength intervals in spectral imaging applications. Chemometrics and Intelligent Laboratory Systems. 2010;100:12-21.

H Martens TN. Multivariate Calibration. New York: Wiley; 1989.

Haas W, Faherty BK, Gerber SA, Elias JE, Beausoleil SA, Bakalarski CE, et al. Optimization and use of peptide mass measurement accuracy in shotgun proteomics. Molecular & Cellular Proteomics. 2006;5:1326-37.

Haimi P, Uphoff A, Hermansson M, Somerharju P. Software tools for analysis of mass spectrometric lipidome data. Analytical chemistry. 2006;78:8324-31.

Halket JM, Waterman D, Przyborowska AM, Patel RKP, Fraser PD, Bramley PM. Chemical derivatization and mass spectral libraries in metabolic profiling by GC/MS and LC/MS/MS. Journal of Experimental Botany. 2005;56:219-43.

Hall MA. Correlation-based feature selection for machine learning: The University of Waikato; 1999.

Hall RD. Plant metabolomics: from holistic hope, to hype, to hot topic. New Phytologist. 2006;169:453-68.

Hall RD. Annual Plant Reviews, Biology of Plant Metabolomics: John Wiley & Sons; 2011.

Hantao LW, Aleme HG, Pedroso MP, Sabin GP, Poppi RJ, Augusto F. Multivariate curve resolution combined with gas chromatography to enhance analytical separation in complex samples: A review. Analytica Chimica Acta. 2012;731:11-23.

Hastings CA, Norton SM, Roy S. New algorithms for processing and peak detection in liquid chromatography/mass spectrometry data. Rapid communications in mass spectrometry. 2002;16:462-7.

Heinonen M, Rantanen A, Mielikainen T, Kokkonen J, Kiuru J, Ketola RA, et al. FiD: a software for ab initio structural identification of product ions from tandem mass spectrometric data. Rapid Commun Mass Spectrom. 2008;22:3043-52.

Heinonen M, Shen H, Zamboni N, Rousu J. Metabolite identification and molecular fingerprint prediction through machine learning. Bioinformatics. 2012;28:2333-41.

Hilario M, Kalousis A, Pellegrini C, Mueller M. Processing and classification of protein mass spectra. Mass spectrometry reviews. 2006;25:409-49.

Hill AW, Mortishire-Smith RJ. Automated assignment of high-resolution collisionally activated dissociation mass spectra using a systematic bond disconnection approach. Rapid Commun Mass Sp. 2005;19:3111-8.

Hiller K, Hangebrauk J, Jäger C, Spura J, Schreiber K, Schomburg D. MetaboliteDetector: comprehensive analysis tool for targeted and nontargeted GC/MS based metabolome analysis. Analytical chemistry. 2009;81:3429-39.

Holcapek M, Jirasko R, Lisa M. Basic rules for the interpretation of atmospheric pressure ionization mass spectra of small molecules. J Chromatogr A. 2010;1217:3908-21.

Holmes E, Loo RL, Stamler J, Bictash M, Yap IKS, Chan Q, et al. Human metabolic phenotype diversity and its association with diet and blood pressure. Nature. 2008;453:396-U50.

Hoskuldsson A. Variable and subset selection in PLS regression. Chemometr Intell Lab. 2001;55:23-38.

Huang N, Siegel MM, Kruppa GH, Laukien FH. Automation of a Fourier transform ion cyclotron resonance mass spectrometer for acquisition, analysis, and E-mailing of high-resolution exact-mass electrospray ionization mass spectral data. J Am Soc Mass Spectr. 1999;10:1166-73.

Huang ZZ, Chen YJ, Hang W, Gao Y, Lin L, Li DY, et al. Holistic metabonomic profiling of urine affords potential early diagnosis for bladder and kidney cancers. Metabolomics. 2013;9:119-29.

Hubert J, Nuzillard JM, Purson S, Hamzaoui M, Borie N, Reynaud R, et al. Identification of Natural Metabolites in Mixture: A Pattern Recognition Strategy Based on C-13 NMR. Analytical Chemistry. 2014;86:2955-62.

Hufsky F, Rempt M, Rasche F, Pohnert G, Böcker S. De novo analysis of electron impact mass spectra using fragmentation trees. Analytica Chimica Acta. 2012;739:67-76.

Hufsky F, Scheubert K, Bocker S. Computational mass spectrometry for small-molecule fragmentation. Trac-Trend Anal Chem. 2014;53:41-8.

Hummel J, Strehmel N, Selbig J, Walther D, Kopka J. Decision tree supported substructure prediction of metabolites from GC-MS profiles. Metabolomics. 2010;6:322-33.

Jirasek A, Schulze G, Yu M, Blades M, Turner R. Accuracy and precision of manual baseline determination. Applied spectroscopy. 2004;58:1488-99.

Johnson KJ, Wright BW, Jarman KH, Synovec RE. High-speed peak matching algorithm for retention time alignment of gas chromatographic data for chemometric analysis. Journal of Chromatography A. 2003;996:141-55.

Kalivas JH, Roberts N, Sutter JM. Global optimization by simulated annealing with wavelength selection for ultraviolet-visible spectrophotometry. Anal Chem. 1989;61:2024-30.

Kangas LJ, Metz TO, Isaac G, Schrom BT, Ginovska-Pangovska B, Wang L, et al. In silico identification software (ISIS): a machine learning approach to tandem mass spectral identification of lipids. Bioinformatics. 2012;28:1705-13.

Katajamaa M, Miettinen J, Orešič M. MZmine: toolbox for processing and visualization of mass spectrometry based molecular profile data. Bioinformatics. 2006;22:634-6.

Katajamaa M, Orešič M. Data processing for mass spectrometry-based metabolomics. Journal of Chromatography A. 2007;1158:318-28.

Kaufmann A. Strategy for the elucidation of elemental compositions of trace analytes based on a mass resolution of 100 000 full width at half maximum. Rapid Commun Mass Sp. 2010;24:2035-45.

Kell DB. Systems biology, metabolic modelling and metabolomics in drug discovery and development. Drug discovery today. 2006;11:1085-92.

Keller BO, Suj J, Young AB, Whittal RM. Interferences and contaminants encountered in modern mass spectrometry. Analytica Chimica Acta. 2008;627:71-81.

Kennard RW, Stone LA. Computer Aided Design of Experiments. Technometrics. 1969;11:137-48.

Kerber A, Laue R, Meringer M, Varmuza K. MOLGEN-MS: Evaluation of low resolution electron impact mass spectra with MS classification and exhaustive structure generation. In: Gelpi E, editor. Advances in Mass Spectrometry 15: Wiley; 2001. p. 939-40.

Keurentjes JJ, Fu J, De Vos CR, Lommen A, Hall RD, Bino RJ, et al. The genetics of plant metabolism. Nature genetics. 2006;38:842-9.

Kim HK, Choi YH, Verpoorte R. NMR-based plant metabolomics: where do we stand, where do we go? Trends in biotechnology. 2011;29:267-75.

Kim HK, Verpoorte R. Sample preparation for plant metabolomics. Phytochemical analysis. 2010;21:4-13.

Kim S, Zhang X. Discovery of false identification using similarity difference in GC–MS-based metabolomics. J Chemometr. 2014:doi: 10.1002/cem.2665.

Kind T, Fiehn O. Seven Golden Rules for heuristic filtering of molecular formulas obtained by accurate mass spectrometry. BMC Bioinformatics. 2007;8:105.

Kind T, Fiehn O. Advances in structure elucidation of small molecules using mass spectrometry. Bioanal Rev. 2010;2:23-60.

Kind T, Wohlgemuth G, Lee DY, Lu Y, Palazoglu M, Shahbaz S, et al. FiehnLib: Mass Spectral and Retention Index Libraries for Metabolomics Based on Quadrupole and Time-of-Flight Gas Chromatography/Mass Spectrometry. Anal Chem. 2009;81:10038-48.

Knolhoff A, Callahan J, Croley T. Mass Accuracy and Isotopic Abundance Measurements for HR-MS Instrumentation: Capabilities for Non-Targeted Analyses. J Am Soc Mass Spectr. 2014;25:1285-94.

Koch BP, Dittmar T, Witt M, Kattner G. Fundamentals of molecular formula assignment to ultrahigh resolution mass data of natural organic matter. Anal Chem. 2007;79:1758-63.

Koekemoer G, Dercksen M, Allison J, Santana L, Reinecke CJ. Concurrent class analysis identifies discriminatory variables from metabolomics data on isovaleric acidemia. Metabolomics. 2012;8:S17-S28.

Kohonen T, Maps S-O. Springer series in information sciences. Self-organizing maps. 1995;30.

Koo I, Kim S, Zhang X. Comparative analysis of mass spectral matching-based compound identification in gas chromatography-mass spectrometry. J Chromatogr A. 2013;1298:132-8.
Kopka J. Current challenges and developments in GC-MS based metabolite profiling technology. Journal of Biotechnology. 2006;124:312-22.
Kopka J, Schauer N, Krueger S, Birkemeyer C, Usadel B, Bergmuller E, et al. GMD@CSB.DB: the Golm Metabolome Database. Bioinformatics. 2005;21:1635-8.
Kriegel H-P, Kröger P, Zimek A. Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Transactions on Knowledge Discovery from Data (TKDD). 2009;3:1.
Krishnan S, Vogels JT, Coulier L, Bas RC, Hendriks MW, Hankemeier T, et al. Instrument and process independent binning and baseline correction methods for liquid chromatography–high resolution-mass spectrometry deconvolution. Analytica Chimica Acta. 2012;740:12-9.
Krooshof PWT, Ustun B, Postma GJ, Buydens LMC. Visualization and Recovery of the (Bio)chemical Interesting Variables in Data Analysis with Support Vector Machine Classification. Analytical Chemistry. 2010;82:7000-7.
Krstajic D, Buturovic L, Leahy D, Thomas S. Cross-validation pitfalls when selecting and assessing regression and classification models. J Cheminformatics. 2014;6:10.
Kueger S, Steinhauser D, Willmitzer L, Giavalisco P. High‐resolution plant metabolomics: from mass spectral features to metabolites and from whole‐cell analysis to subcellular metabolite distributions. The Plant Journal. 2012;70:39-50.

Kuehl D, Wang YD. Peak shape calibration method improves the mass accuracy of mass spectrometers. Biopharm Int. 2006;19:32-+.
Kuhl C, Tautenhahn R, Bottcher C, Larson TR, Neumann S. CAMERA: an integrated strategy for compound spectra extraction and annotation of liquid chromatography/mass spectrometry data sets. Anal Chem. 2012;84:283-9.
Kumari S, Stevens D, Kind T, Denkert C, Fiehn O. Applying In-Silico Retention Index and Mass Spectra Matching for Identification of Unknown Metabolites in Accurate Mass GC-TOF Mass Spectrometry. Anal Chem. 2011;83:5895-902.
Kvalheim OM. Interpretation of partial least squares regression models by means of target projection and selectivity ratio plots. Journal of Chemometrics. 2010;24:496-504.

Kvalheim OM, Brakstad F, Liang Y. Preprocessing of analytical profiles in the presence of homoscedastic or heteroscedastic noise. Analytical chemistry. 1994;66:43-51.

Kvalheim OM, Karstang TV. Interpretation of latent-variable regression models. Chemometr Intell Lab. 1989;7:39-51.

Kvalheim OM, Liang YZ. Heuristic evolving latent projections: resolving two-way multicomponent data. 1. Selectivity, latent-projective graph, datascope, local rank, and unique resolution. Anal Chem. 1992;64:936-46.

Kvalheim OM, Rajalahti T, Arneberg R. X-tended target projection (XTP)—comparison with orthogonal partial least squares (OPLS) and PLS post-processing by similarity transformation (PLS + ST). J Chemometr. 2009;23:49-55.
Leardi R. Application of genetic algorithm–PLS for feature selection in spectral data sets. J Chemometrics. 2000;14:643-55.

Leardi R. Genetic algorithms in chemometrics and chemistry: a review. J Chemometrics. 2001;15:559-69.

Lei Z, Li H, Chang J, Zhao PX, Sumner LW. MET-IDEA version 2.06; improved efficiency and additional functions for mass spectrometry-based metabolomics data processing. Metabolomics. 2012;8:105-10.
Leptos KC, Sarracino DA, Jaffe JD, Krastins B, Church GM. MapQuant: Open‐source software for large‐scale protein quantification. Proteomics. 2006;6:1770-82.
Li H-D, Liang Y-Z, Xu Q-S, Cao D-S. Key wavelengths screening using competitive adaptive reweighted sampling method for multivariate calibration. Anal Chim Acta. 2009a;648:77-84.
Li H-D, Liang Y-Z, Xu Q-S, Cao D-S. Model population analysis for variable selection. J Chemometr. 2010a;24:418-23.
Li H-D, Liang Y-Z, Xu Q-S, Cao D-S. Recipe for Uncovering Predictive Genes Using Support Vector Machines Based on Model Population Analysis. IEEE ACM T Comput Bi. 2011;8:1633-41.
Li H-D, Xu Q-S, Zhang W, Liang Y-Z. Variable complementary network: a novel approach for identifying biomarkers and their mutual associations. Metabolomics. 2012;8:1218-26.
Li H-D, Zeng M-M, Tan B-B, Liang Y-Z, Xu Q-S, Cao D-S. Recipe for revealing informative metabolites based on model population analysis. Metabolomics. 2010b;6:353-61.
Li HD, Liang YZ, Xu QS. Support vector machines and its applications in chemistry. Chemometr Intell Lab. 2009b;95:188-98.
Li HD, Liang YZ, Xu QS, Cao DS. Model population analysis for variable selection. J Chemometr. 2010c;24:418-23.
Li S, Park Y, Duraisingham S, Strobel FH, Khan N, Soltow QA, et al. Predicting Network Activity from High Throughput Metabolomics. PLoS Comput Biol. 2013a;9:e1003123.

Li X-j, Eugene CY, Kemp CJ, Zhang H, Aebersold R. A software suite for the generation and comparison of peptide arrays from sets of data collected by liquid chromatography-mass spectrometry. Molecular & cellular proteomics. 2005;4:1328-40.
Li Z, Wang JJ, Huang J, Zhang ZM, Lu HM, Zheng YB, et al. Nonlinear alignment of chromatograms by means of moving window fast Fourier transform cross‐correlation. Journal of separation science. 2013b;36:1677-84.

Liang J, Yang S, Winstanley A. Invariant optimal feature selection: A distance discriminant and feature ranking based solution. Pattern Recognition. 2008;41:1429-39.

Liang YZ, Kvalheim OM. Resolution of two-way data: theoretical background and practical problem-solving - Part 1: Theoretical background and methodology. Fresen J Anal Chem. 2001;370:694-704.
Liang YZ, Kvalheim OM, Keller HR, Massart DL, Kiechle P, Erni F. Heuristic evolving latent projections: resolving two-way multicomponent data. 2. Detection and resolution of minor constituents. Anal Chem. 1992;64:946-53.

Lin X, Wang Q, Yin P, Tang L, Tan Y, Li H, et al. A method for handling metabonomics data from liquid chromatography/mass spectrometry: combinational use of support vector machine recursive feature elimination, genetic algorithm and random forest for feature selection. Metabolomics. 2011;7:549-58. Lindsay RK, Buchanan BG, Feigenbaum EA, Lederberg J. Dendral - a Case-Study of the 1st Expert-System for Scientific Hypothesis Formation. Artif Intell. 1993;61:209-61.

Lisec J, Schauer N, Kopka J, Willmitzer L, Fernie AR. Gas chromatography mass spectrometry-based metabolite profiling in plants. Nature Protocols. 2006;1:387-96.
Listgarten J, Neal RM, Roweis ST, Wong P, Emili A. Difference detection in LC-MS data for protein biomarker discovery. Bioinformatics. 2007;23:e198-e204.
Little J, Williams A, Pshenichnov A, Tkachenko V. Identification of “Known Unknowns” Utilizing Accurate Mass Data and ChemSpider. J Am Soc Mass Spectr. 2012;23:179-85.
Liu H, Motoda H. Feature selection for knowledge discovery and data mining: Springer; 1998.
Liu R, Lin D, Chang W, Liu C, Tsay W, Li J, et al. Issues to address when isotopically labeled analogues of analytes are used as internal standards. Anal Chem. 2002;74:618A-26A.
Liu X, Zhang Z, Sousa PF, Chen C, Ouyang M, Wei Y, et al. Selective iteratively reweighted quantile regression for baseline correction. Analytical and bioanalytical chemistry. 2014a:1-14.
Liu Y, Hong Z, Tan G, Dong X, Yang G, Zhao L, et al. NMR and LC/MS-based global metabolomics to identify serum biomarkers differentiating hepatocellular carcinoma from liver cirrhosis. International Journal of Cancer. 2014b;135:658-68.
Lommen A. Ultrafast PubChem Searching Combined with Improved Filtering Rules for Elemental Composition Analysis. Anal Chem. 2014;86:5463-9.
Lopatka M, Vivó-Truyols G, Sjerps M. Probabilistic peak detection for first-order chromatographic data. Analytica Chimica Acta. 2014;817:9-16.
Luedemann A, Strassburg K, Erban A, Kopka J. TagFinder for the quantitative analysis of gas chromatography—mass spectrometry (GC-MS)-based metabolite profiling experiments. Bioinformatics. 2008;24:732-7.
Luedemann A, von Malotky L, Erban A, Kopka J. TagFinder: Preprocessing software for the fingerprinting and the profiling of gas chromatography–mass spectrometry based metabolome analyses. Plant Metabolomics: Springer; 2012. p. 255-86.
Luts J, Ojeda F, Van de Plas R, De Moor B, Van Huffel S, Suykens JAK. A tutorial on support vector machine-based methods for classification problems in chemometrics. Anal Chim Acta. 2010;665:129-45.
Maeder M. Evolving factor analysis for the resolution of overlapping chromatographic peaks. Analytical chemistry. 1987;59:527-30.

Mahadevan S, Shah SL, Marrie TJ, Slupsky CM. Analysis of metabolomic data using support vector machines. Analytical Chemistry. 2008;80:7562-70.

Makinen VP, Soininen P, Forsblom C, Parkkonen M, Ingman P, Kaski K, et al. (1)H NMR metabonomics approach to the disease continuum of diabetic complications and premature death. Mol Syst Biol. 2008;4.

Mallows CL. Some comments on Cp. Technometrics. 1973;15:661-75.
Mann HB, Whitney DR. On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other. Ann Math Statist. 1947;18:50-60.

Manne R, Shen HL, Liang YZ. Subwindow factor analysis. Chemometrics and Intelligent Laboratory Systems. 1999;45:171-6.

Mao Q, Bai M, Xu JD, Kong M, Zhu LY, Zhu H, et al. Discrimination of leaves of Panax ginseng and P. quinquefolius by ultra high performance liquid chromatography quadrupole/time-of-flight mass spectrometry based metabolomics approach. Journal of Pharmaceutical and Biomedical Analysis. 2014;97:129-40.

McLafferty FW, Hertel RH, Villwock RD. Computer Identification of Mass-Spectra .6. Probability Based Matching of Mass-Spectra - Rapid Identification of Specific Compounds in Mixtures. Org Mass Spectrom. 1974;9:690-702.
Miller A. Subset selection in regression: CRC Press; 2002.

Mitra P, Murthy C, Pal SK. Unsupervised feature selection using feature similarity. IEEE transactions on pattern analysis and machine intelligence. 2002;24:301-12.

Miura D, Tsuji Y, Takahashi K, Wariishi H, Saito K. A Strategy for the Determination of the Elemental Composition by Fourier Transform Ion Cyclotron Resonance Mass Spectrometry Based on Isotopic Peak Ratios. Anal Chem. 2010;82:5887-91.

Moco S, Bino RJ, Vorst O, Verhoeven HA, de Groot J, van Beek TA, et al. A liquid chromatography-mass spectrometry-based metabolome database for tomato. Plant Physiology. 2006;141:1205-18.
Mylonas R, Mauron Y, Masselot A, Binz PA, Budin N, Fathi M, et al. X-Rank: A Robust Algorithm for Small Molecule Identification Using Tandem Mass Spectrometry. Anal Chem. 2009;81:7604-10.
Nagao T, Yukihira D, Fujimura Y, Saito K, Takahashi K, Miura D, et al. Power of isotopic fine structure for unambiguous determination of metabolite elemental compositions: in silico evaluation and metabolomic application. Anal Chim Acta. 2014;813:70-6.
Narsky I, Porter FC. Methods for Variable Ranking and Selection. Statistical Analysis Techniques in Particle Physics: Wiley-VCH Verlag GmbH & Co. KGaA; 2013. p. 385-415.
Neumann S, Rasche F, Wolf S, Böcker S. Metabolite Identification and Computational Mass Spectrometry. The Handbook of Plant Metabolomics: Wiley-VCH Verlag GmbH & Co. KGaA; 2013. p. 289-303.
Nielsen N-PV, Carstensen JM, Smedsgaard J. Aligning of single and multiple wavelength chromatographic profiles for chemometric data analysis using correlation optimised warping. Journal of Chromatography A. 1998;805:17-35.
North DO. An analysis of the factors which determine signal/noise discrimination in pulsed-carrier systems. Proceedings of the IEEE. 1963;51:1016-27.

Ogata H, Goto S, Sato K, Fujibuchi W, Bono H, Kanehisa M. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 1999;27:29-34.
Oksman-Caldentey K-M, Saito K. Integrating genomics and metabolomics for engineering plant metabolic pathways. Current opinion in biotechnology. 2005;16:174-9.
Osorio S, Do PT, Fernie AR. Profiling primary metabolites of tomato fruit with gas chromatography/mass spectrometry. Plant Metabolomics: Springer; 2012. p. 101-9.

Patterson AD, Li H, Eichler GS, Krausz KW, Weinstein JN, Fornace AJ, et al. UPLC-ESI-TOFMS-Based Metabolomics and Gene Expression Dynamics Inspector Self-Organizing Metabolomic Maps as Tools for Understanding the Cellular Response to Ionizing Radiation. Analytical Chemistry. 2008;80:665-74.

Pearson GA. A general baseline-recognition and baseline-flattening algorithm. Journal of Magnetic Resonance (1969). 1977;27:265-72.
Pearson K. On lines and planes of closest fit to systems of points in space. Philosophical Magazine. 1901;2:559-72.
Peironcely JE, Rojas-Cherto M, Fichera D, Reijmers T, Coulier L, Faulon JL, et al. OMG: Open Molecule Generator. J Cheminform. 2012;4:21.

Petyuk VA, Jaitly N, Moore RJ, Ding J, Metz TO, Tang K, et al. Elimination of systematic mass measurement errors in liquid chromatography-mass spectrometry based proteomics using regression models and a Priori partial knowledge of the sample content. Anal Chem. 2008;80:693-706.
Petyuk VA, Mayampurath AM, Monroe ME, Polpitiya AD, Purvine SO, Anderson GA, et al. DtaRefinery, a Software Tool for Elimination of Systematic Errors from Parent Ion Mass Measurements in Tandem Mass Spectra Data Sets. Molecular & Cellular Proteomics. 2010;9:486-96.
Pierce KM, Mohler RE. A Review of chemometrics applied to comprehensive two-dimensional separations from 2008–2010. Separation & Purification Reviews. 2012;41:143-68.
Pierce KM, Wood LF, Wright BW, Synovec RE. A comprehensive two-dimensional retention time alignment algorithm to enhance chemometric analysis of comprehensive two-dimensional separation data. Analytical chemistry. 2005;77:7735-43.
Pluskal T, Castillo S, Villar-Briones A, Orešič M. MZmine 2: modular framework for processing, visualizing, and analyzing mass spectrometry-based molecular profile data. BMC bioinformatics. 2010;11:395.

Powell LA, Hieftje GM. Computer identification of infrared spectra by correlation-based file searching. Anal Chim Acta. 1978;100:313-27.
Prakash A, Mallick P, Whiteaker J, Zhang H, Paulovich A, Flory M, et al. Signal maps for mass spectrometry-based comparative proteomics. Molecular & cellular proteomics. 2006;5:423-32.
Pravdova V, Walczak B, Massart D. A comparison of two algorithms for warping of analytical signals. Analytica Chimica Acta. 2002;456:77-92.
Prince JT, Marcotte EM. Chromatographic alignment of ESI-LC-MS proteomics data sets by ordered bijective interpolated warping. Analytical chemistry. 2006;78:6140-52.
Radulovic D, Jelveh S, Ryu S, Hamilton TG, Foss E, Mao Y, et al. Informatics platform for global proteomic profiling and biomarker discovery using liquid chromatography-tandem mass spectrometry. Molecular & cellular proteomics. 2004;3:984-97.
Rago D, Mette K, Gurdeniz G, Marini F, Poulsen M, Dragsted LO. A LC-MS metabolomics approach to investigate the effect of raw apple intake in the rat plasma metabolome. Metabolomics. 2013;9:1202-15.
Rajalahti T, Arneberg R, Berven FS, Myhr K-M, Ulvik RJ, Kvalheim OM. Biomarker discovery in mass spectral profiles by means of selectivity ratio plot. Chemometr Intell Lab. 2009a;95:35-48.
Rajalahti T, Arneberg R, Kroksveen AC, Berle M, Myhr K-M, Kvalheim OM. Discriminating Variable Test and Selectivity Ratio Plot: Quantitative Tools for Interpretation and Variable (Biomarker) Selection in Complex Spectral or Chromatographic Profiles. Analytical Chemistry. 2009b;81:2581-90.
Rajalahti T, Arneberg R, Kroksveen AC, Berle M, Myhr KM, Kvalheim OM. Discriminating Variable Test and Selectivity Ratio Plot: Quantitative Tools for Interpretation and Variable (Biomarker) Selection in Complex Spectral or Chromatographic Profiles. Analytical Chemistry. 2009c;81:2581-90.
Rasche F, Svatos A, Maddula RK, Bottcher C, Bocker S. Computing fragmentation trees from tandem mass spectrometry data. Anal Chem. 2011;83:1243-51.

Rasmussen S, Lane GA, Mace W, Parsons AJ, Fraser K, Xue H. The use of genomics and metabolomics methods to quantify fungal endosymbionts and alkaloids in grasses. Plant Metabolomics: Springer; 2012. p. 213-26.
Rauf I, Rasche F, Nicolas F, Böcker S. Finding Maximum Colorful Subtrees in Practice. In: Chor B, editor. Lect N Bioinformat: Springer Berlin Heidelberg; 2012. p. 213-23.
Redestig H, Fukushima A, Stenlund H, Moritz T, Arita M, Saito K, et al. Compensation for Systematic Cross-Contribution Improves Normalization of Mass Spectrometry Based Metabolomics Data. Analytical chemistry. 2009;81:7974-80.
Robnik-Šikonja M, Kononenko I. Theoretical and empirical analysis of ReliefF and RReliefF. Machine learning. 2003;53:23-69.

Rogers S, Scheltema RA, Girolami M, Breitling R. Probabilistic assignment of formulas to mass peaks in metabolomics experiments. Bioinformatics. 2009;25:512-8.
Ruckebusch C, Blanchet L. Multivariate curve resolution: A review of advanced and tailored applications and challenges. Analytica Chimica Acta. 2013;765:28-36.
Sadygov RG, Martin Maroto F, Hühmer AF. ChromAlign: a two-step algorithmic procedure for time alignment of three-dimensional LC-MS chromatographic surfaces. Analytical chemistry. 2006;78:8207-17.
Savitski MM, Ivonin IA, Nielsen ML, Zubarev RA, Tsybin YO, Hakansson P. Shifted-basis technique improves accuracy of peak position determination in Fourier transform mass spectrometry. J Am Soc Mass Spectr. 2004;15:457-61.
Schauer N, Steinhauser D, Strelkov S, Schomburg D, Allison G, Moritz T, et al. GC-MS libraries for the rapid identification of metabolites in complex biological samples. Febs Letters. 2005;579:1332-7.
Scheltema RA, Kamleh A, Wildridge D, Ebikeme C, Watson DG, Barrett MR, et al. Increasing the mass accuracy of high-resolution LC-MS data using background ions - a case study on the LTQ-Orbitrap. Proteomics. 2008;8:4647-56.
Scheubert K, Hufsky F, Bocker S. Computational mass spectrometry for small molecules. J Cheminformatics. 2013;5.
Scholkopf B, Muller K-R. Fisher discriminant analysis with kernels. Neural networks for signal processing IX. 1999.
Schwarz G. Estimating the dimension of a model. The annals of statistics. 1978;6:461-4.
Schymanski EL, Gallampois CMJ, Krauss M, Meringer M, Neumann S, Schulze T, et al. Consensus Structure Elucidation Combining GC/EI-MS, Structure Generation, and Calculated Properties. Anal Chem. 2012;84:3287-95.
Schymanski EL, Jeon J, Gulde R, Fenner K, Ruff M, Singer HP, et al. Identifying small molecules via high resolution mass spectrometry: communicating confidence. Environ Sci Technol. 2014;48:2097-8.

Schymanski EL, Meinert C, Meringer M, Brack W. The use of MS classifiers and structure generation to assist in the identification of unknowns in effect-directed analysis. Analytica Chimica Acta. 2008;615:136-47.
Schymanski EL, Meringer M, Brack W. Matching Structures to Mass Spectra Using Fragmentation Patterns: Are the Results As Good As They Look? Anal Chem. 2009;81:3608-17.
Scott IM, Lin W, Liakata M, Wood JE, Vermeer CP, Allaway D, et al. Merits of random forests emerge in evaluation of chemometric classifiers by external validation. Analytica Chimica Acta. 2013;801:22-33.
Shao J. Linear Model Selection by Cross-validation. J Am Stat Assoc. 1993;88:486-94.
Shao X-G, Leung AK-M, Chau F-T. Wavelet: a new trend in chemistry. Accounts of chemical research. 2003;36:276-83.
Shawe-Taylor J, Cristianini N. Kernel methods for pattern analysis: Cambridge university press; 2004.
Smilde AK, van der Werf MJ, Bijlsma S, van der Werff-van-der Vat BJC, Jellema RH. Fusion of mass spectrometry-based metabolomics data. Analytical chemistry. 2005;77:6729-36.
Smith CA, Want EJ, O'Maille G, Abagyan R, Siuzdak G. XCMS: processing mass spectrometry data for metabolite profiling using nonlinear peak alignment, matching, and identification. Analytical chemistry. 2006;78:779-87.

Snee RD. Validation of regression models: methods and examples. Technometrics. 1977;19:415-28. Sokal R, Rohlf F. Assumptions of analysis of variance. Biometry: The Principles and Practice of Statistics in Biological Research 3rd ed New York: WH Freeman. 1995:396-406.

Solinas A, Chessa M, Culeddu N, Porcu M, Virgilio G, Arcadu F, et al. High resolution-magic angle spinning (HR-MAS) NMR-based metabolomic fingerprinting of early and recurrent hepatocellular carcinoma. Metabolomics. 2014;10:616-26.

Stein S. Mass Spectral Reference Libraries: An Ever-Expanding Resource for Chemical Identification. Anal Chem. 2012;84:7274-82.

Stein SE. Chemical substructure identification by mass spectral library searching. J Am Soc Mass Spectrom. 1995;6:644-55.

Stein SE. An integrated method for spectrum extraction and compound identification from gas chromatography/mass spectrometry data. J Am Soc Mass Spectr. 1999;10:770-81.
Stein SE, Scott DR. Optimization and testing of mass spectral library search algorithms for compound identification. J Am Soc Mass Spectrom. 1994;5:859-66.
Steinbeck C, Han YQ, Kuhn S, Horlacher O, Luttmann E, Willighagen E. The Chemistry Development Kit (CDK): An open-source Java library for chemo- and bioinformatics. J Chem Inf Comp Sci. 2003;43:493-500.
Stone M. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society Series B (Methodological). 1974:111-47.
Sturm M, Bertsch A, Gropl C, Hildebrandt A, Hussong R, Lange E, et al. OpenMS-An open-source software framework for mass spectrometry. BMC Bioinformatics. 2008;9.
Sumner LW, Amberg A, Barrett D, Beale MH, Beger R, Daykin CA, et al. Proposed minimum reporting standards for chemical analysis Chemical Analysis Working Group (CAWG) Metabolomics Standards Initiative (MSI). Metabolomics. 2007;3:211-21.
Sutter JM, Kalivas JH. Comparison of Forward Selection, Backward Elimination, and Generalized Simulated Annealing for Variable Selection. Microchem J. 1993;47:60-6.
Swiniarski RW, Skowron A. Rough set methods in feature selection and recognition. Pattern Recogn Lett. 2003;24:833-49.

Tapp HS, Kemsley EK. Notes on the practical utility of OPLS. Trac-Trend Anal Chem. 2009;28:1322-7.
Tautenhahn R, Böttcher C, Neumann S. Highly sensitive feature detection for high resolution LC/MS. BMC bioinformatics. 2008;9:504.
Tikunov Y, Lommen A, de Vos CR, Verhoeven HA, Bino RJ, Hall RD, et al. A novel approach for nontargeted data analysis for metabolomics. Large-scale profiling of tomato fruit volatiles. Plant Physiology. 2005;139:1125-37.

Tomasi G, van den Berg F, Andersson C. Correlation optimized warping and dynamic time warping as preprocessing methods for chromatographic data. Journal of Chemometrics. 2004;18:231-41.
Toya Y, Shimizu H. Flux analysis and metabolomics for systematic metabolic engineering of microorganisms. Biotechnology Advances. 2013;31:818-26.
Trygg J, Wold S. Orthogonal projections to latent structures (O-PLS). J Chemometr. 2002;16:119-28.
Uarrota VG, Moresco R, Coelho B, Nunes EdC, Peruch LAM, Neubert EdO, et al. Metabolomics combined with chemometric tools (PCA, HCA, PLS-DA and SVM) for screening cassava (Manihot esculenta Crantz) roots during postharvest physiological deterioration. Food Chem. 2014;161:67-78.
Valkenborg D, Mertens I, Lemiere F, Witters E, Burzykowski T. The isotopic distribution conundrum. Mass Spectrometry Reviews. 2012;31:96-109.

van Dam NM, Meijden E. A role for metabolomics in plant ecology. Annual Plant Reviews, Biology of Plant Metabolomics. 2011;43:87.

van den Berg RA, Hoefsloot HC, Westerhuis JA, Smilde AK, van der Werf MJ. Centering, scaling, and transformations: improving the biological information content of metabolomics data. BMC genomics. 2006;7:142.
van der Greef J, Smilde AK. Symbiosis of chemometrics and metabolomics: past, present, and future. Journal of Chemometrics. 2005;19:376-86.
Vapnik V. Statistical Learning Theory. New York: John Wiley & Sons; 1998.

Varghese RS, Cheema A, Cheema P, Bourbeau M, Tuli L, Zhou B, et al. Analysis of LC-MS Data for Characterizing the Metabolic Changes in Response to Radiation. J Proteome Res. 2010;9:2786-93.
Venable JD, Xu T, Cociorva D, Yates JR, 3rd. Cross-correlation algorithm for calculation of peptide molecular weight from tandem mass spectra. Anal Chem. 2006;78:1921-9.
Verron T, Sabatier R, Joffre R. Some theoretical properties of the O-PLS method. J Chemometr. 2004;18:62-8.
Villas‐Bôas SG, Mas S, Åkesson M, Smedsgaard J, Nielsen J. Mass spectrometry in metabolome analysis. Mass spectrometry reviews. 2005;24:613-46.
Vivó-Truyols G, Torres-Lapasió J, Van Nederkassel A, Vander Heyden Y, Massart D. Automatic program for peak detection and deconvolution of multi-overlapped chromatographic signals: Part I: Peak detection. Journal of Chromatography A. 2005;1096:133-45.
Vivó-Truyols G. Bayesian approach for peak detection in two-dimensional chromatography. Analytical chemistry. 2012;84:2622-30.
Wagner C, Sefkow M, Kopka J. Construction and application of a mass spectral and retention time index database generated from plant GC/EI-TOF-MS metabolite profiles. Phytochemistry. 2003;62:887-900.
Walczak B, Massart D. The radial basis functions—partial least squares approach as a flexible non-linear regression technique. Anal Chim Acta. 1996;331:177-85.

Wang JS, Reijmers T, Chen LJ, Van der Heijden R, Wang M, Peng SQ, et al. Systems toxicology study of doxorubicin on rats using ultra performance liquid chromatography coupled with mass spectrometry based metabolomics. Metabolomics. 2009;5:407-18.
Wang Q, Li H-D, Xu Q-S, Liang Y-Z. Noise incorporated subwindow permutation analysis for informative gene selection using support vector machines. Analyst. 2011;136:1456-63.
Wang W, Zhou H, Lin H, Roy S, Shaler TA, Hill LR, et al. Quantification of proteins and metabolites by mass spectrometry without isotopic labeling or spiked standards. Analytical chemistry. 2003;75:4818-26.
Wang Y, Yi L, Liang Y, Li H, Yuan D, Gao H, et al. Comparative analysis of essential oil components in Pericarpium Citri Reticulatae Viride and Pericarpium Citri Reticulatae by GC-MS combined with chemometric resolution method. Journal of Pharmaceutical and Biomedical Analysis. 2008;46:66-74.
Wang YD, Cu M. The Concept of Spectral Accuracy for MS. Anal Chem. 2010;82:7055-62.

Watson DG. A rough guide to metabolite identification using high resolution liquid chromatography mass spectrometry in metabolomic profiling in metazoans. Comput Struct Biotechnol J. 2013;4:e201301005.

Webb AR. Statistical pattern recognition: John Wiley & Sons; 2003. Weber RJM, Southam AD, Sommer U, Viant MR. Characterization of Isotopic Abundance Measurements in High Resolution FT-ICR and Orbitrap Mass Spectra for Improved Confidence of Metabolite Identification. Anal Chem. 2011;83:3737-43.

Weber RJM, Viant MR. MI-Pack: Increased confidence of metabolite identification in mass spectra by integrating accurate masses and metabolic pathways. Chemometr Intell Lab. 2010;104:75-82.
Wei X, Sun W, Shi X, Koo I, Wang B, Zhang J, et al. MetSign: A computational platform for high-resolution mass spectrometry-based metabolomics. Analytical chemistry. 2011;83:7668-75.
Werner E, Heilier JF, Ducruix C, Ezan E, Junot C, Tabet JC. Mass spectrometry for the identification of the discriminating signals from metabolomics: Current status and future trends. J Chromatogr B. 2008;871:143-63.

Westerhuis JA, Hoefsloot HCJ, Smit S, Vis DJ, Smilde AK, van Velzen EJJ, et al. Assessment of PLSDA cross validation. Metabolomics. 2008;4:81-9.
Williams DK, Muddiman DC. Parts-per-billion mass measurement accuracy achieved through the combination of multiple linear regression and automatic gain control in a Fourier transform ion cyclotron resonance mass spectrometer. Anal Chem. 2007;79:5058-63.
Wishart DS. Computational strategies for metabolite identification in metabolomics. Bioanalysis. 2009;1:1579-96.
Wold S. Cross-validatory estimation of the number of components in factor and principal components models. Technometrics. 1978;20:397-405.
Wold S, Antti H, Lindgren F, Öhman J. Orthogonal signal correction of near-infrared spectra. Chemometr Intell Lab. 1998;44:175-85.
Wold S, Johansson E, Cocchi M. PLS: Partial Least Squares Projections to Latent Structures, 3D QSAR in drug design. 1993. p. 523-50.
Wold S, Sjöström M, Eriksson L. PLS-regression: a basic tool of chemometrics. Chemom Intell Lab Syst. 2001a;58:109-30.
Wold S, Sjöström M, Eriksson L. Partial Least Squares Projections to Latent Structures (PLS) in Chemistry. Encyclopedia of Computational Chemistry: John Wiley & Sons, Ltd; 2002.

Wold S, Sjostrom M, Eriksson L. PLS-regression: a basic tool of chemometrics. Chemometr Intell Lab. 2001b;58:109-30.
Wolf S, Schmidt S, Muller-Hannemann M, Neumann S. In silico fragmentation for computer assisted identification of metabolite mass spectra. BMC Bioinformatics. 2010;11:148.
Wolfender J-L, Rudaz S, Hae Choi Y, Kyong Kim H. Plant metabolomics: from holistic data to relevant biomarkers. Current medicinal chemistry. 2013;20:1056-90.
Wong JW, Durante C, Cartwright HM. Application of fast Fourier transform cross-correlation for the alignment of large chromatographic and spectral datasets. Analytical chemistry. 2005;77:5655-61.
Wu AH, Gerona R, Armenian P, French D, Petrie M, Lynch KL. Role of liquid chromatography–high-resolution mass spectrometry (LC-HR/MS) in clinical toxicology. Clinical Toxicology. 2012;50:733-42.

Xu C-J, Jiang J-H, Liang Y-Z. Evolving window orthogonal projections method for two-way data resolution. Analyst. 1999;124:1471-6.

Xu Y, Heilier JF, Madalinski G, Genin E, Ezan E, Tabet JC, et al. Evaluation of Accurate Mass and Relative Isotopic Abundance Measurements in the LTQ-Orbitrap Mass Spectrometer for Further Metabolomics Database Building. Anal Chem. 2010;82:5490-501.

Yang J, Honavar V. Feature Subset Selection Using a Genetic Algorithm. In: Liu H, Motoda H, editors. Feature Extraction, Construction and Selection: Springer US; 1998. p. 117-36.
Yi L-z, Yuan D-l, Liang Y-z, Xie P-s, Zhao Y. Fingerprinting alterations of secondary metabolites of tangerine peels during growth by HPLC-DAD and chemometric methods. Analytica Chimica Acta. 2009;649:43-51.

Yi L, Dong N, Liu S, Yi Z, Zhang Y. Chemical features of Pericarpium Citri Reticulatae and Pericarpium Citri Reticulatae Viride revealed by GC–MS metabolomics analysis. Food Chemistry.

Yi L, Song C, Hu Z, Yang L, Xiao L, Yi B, et al. A metabolic discrimination model for nasopharyngeal carcinoma and its potential role in the therapeutic evaluation of radiotherapy. Metabolomics. 2014;10:697-708.

Yu L, Liu H. Efficient feature selection via analysis of relevance and redundancy. The Journal of Machine Learning Research. 2004;5:1205-24.

Yun Y-H, Cao D-S, Tan M-L, Yan J, Ren D-B, Xu Q-S, et al. A simple idea on applying large regression coefficient to improve the genetic algorithm-PLS for variable selection in multivariate calibration. Chemometr Intell Lab. 2014a;130:76-83.
Yun Y-H, Liang Y-Z, Xie G-X, Li H-D, Cao D-S, Xu Q-S. A perspective demonstration on the importance of variable selection in inverse calibration for complex analytical systems. Analyst. 2013;138:6412-21.
Yun Y-H, Wang W-T, Tan M-L, Liang Y-Z, Li H-D, Cao D-S, et al. A strategy that iteratively retains informative variables for selecting optimal variable subset in multivariate calibration. Anal Chim Acta. 2014b;807:36-43.
Zeng Z-D, Liang Y-Z, Wang Y-L, Li X-R, Liang L-M, Xu Q-S, et al. Alternative moving window factor analysis for comparison analysis between complex chromatographic data. Journal of Chromatography A. 2006;1107:273-85.
Zhang AH, Sun H, Han Y, Yan GL, Yuan Y, Song GC, et al. Ultraperformance Liquid Chromatography-Mass Spectrometry Based Comprehensive Metabolomics Combined with Pattern Recognition and Network Analysis Methods for Characterization of Metabolites and Metabolic Pathways from Biological Data Sets. Analytical Chemistry. 2013a;85:7606-12.
Zhang AH, Sun H, Yan GL, Yuan Y, Han Y, Wang XJ. Metabolomics study of type 2 diabetes using ultra-performance LC-ESI/quadrupole-TOF high-definition MS coupled with pattern recognition methods. Journal of Physiology and Biochemistry. 2014;70:117-28.
Zhang H, Wang H, Dai Z, Chen M-s, Yuan Z. Improving accuracy for cancer classification with a new algorithm for genes selection. BMC Bioinformatics. 2012a;13:1-20.
Zhang LX, Tang CL, Cao DS, Zeng YX, Tan BB, Zeng MM, et al. Strategies for structure elucidation of small molecules using gas chromatography-mass spectrometric data. Trac-Trend Anal Chem. 2013b;47:37-46.

Zhang Z-M, Chen S, Liang Y-Z. Baseline correction using adaptive iteratively reweighted penalized least squares. Analyst. 2010;135:1138-46.

Zhang Z-M, Liang Y-Z, Lu H-M, Tan B-B, Xu X-N, Ferro M. Multiscale peak alignment for chromatographic datasets. Journal of Chromatography A. 2012b;1223:93-106.

Zhao Z, Liu H. Searching for interacting features in subset selection. Intelligent Data Analysis. 2009;13:207-28.

Zheng K, Li Q, Wang J, Geng J, Cao P, Sui T, et al. Stability competitive adaptive reweighted sampling (SCARS) and its applications to multivariate calibration of NIR spectra. Chemometr Intell Lab. 2012;112:48-54.

Zhou B, Wang J, Ressom HW. MetaboSearch: tool for mass-based metabolite identification using multiple databases. PLoS One. 2012;7:e40096.

Zhu ZJ, Schultz AW, Wang JH, Johnson CH, Yannone SM, Patti GJ, et al. Liquid chromatography quadrupole time-of-flight mass spectrometry characterization of metabolites guided by the METLIN database. Nature Protocols. 2013;8:451-60.

Figure legend

Fig.1. A recent literature survey of the number of publications (A) and their citation counts (B), based on a Web of Science search (Sep 6th, 2014) with "plant metabolomics" as the keyword.

Fig.2. The flowchart of data analysis in plant metabolomics.

Fig.3. Deconvolution results obtained using alternative moving window factor analysis (AMWFA). (A) and (B) are the total ion chromatograms (TICs) of peak clusters I and II of Pericarpium Citri Reticulatae Viride (PCRV) and Pericarpium Citri Reticulatae (PCR) before deconvolution. (C) and (D) are the resolved chromatographic curves of peak clusters I and II, respectively. (Reprinted with permission from (Wang, Yi, 2008)).

Fig.4. GC-MS total ion chromatograms (TICs) of tangerine peels before (A) and after (B) peak alignment. Retention time shifts among samples were removed successfully by the multi-scale peak alignment (MSPA) approach.

Fig.5. The idea and outline of model population analysis (MPA).

Fig.6. The prediction error distributions of an informative, uninformative or interfering variable before (white) and after (gray) permutation, over 1000 runs. Random sampling is employed here. A. Informative variable: the prediction error increases after permutation. B. Uninformative variable: the prediction error shows no significant difference before and after permutation. C. Interfering variable: the prediction error may decrease after permutation.
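The permutation comparison described in the Fig.6 legend can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are synthetic, an ordinary least-squares linear classifier is swapped in as a simple stand-in for the PLS-DA sub-models used in model population analysis, and the function name `permutation_error_distributions` and its parameters are hypothetical.

```python
import numpy as np

def permutation_error_distributions(X, y, var, n_iter=1000, test_frac=0.3, seed=0):
    """Return (errors_before, errors_after) for variable `var` over random splits.

    y must be coded as -1/+1. Each iteration draws a random train/test split,
    fits a least-squares linear classifier (assumed stand-in for PLS-DA), and
    records the test error with the variable intact and with it permuted.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    n_test = max(1, int(round(test_frac * n)))
    before = np.empty(n_iter)
    after = np.empty(n_iter)
    for i in range(n_iter):
        idx = rng.permutation(n)
        test, train = idx[:n_test], idx[n_test:]
        design = np.column_stack([np.ones(train.size), X[train]])
        coef, *_ = np.linalg.lstsq(design, y[train], rcond=None)
        X_test = X[test]
        pred = np.sign(np.column_stack([np.ones(test.size), X_test]) @ coef)
        before[i] = np.mean(pred != y[test])
        # Permute only the variable under study; all other columns stay fixed.
        X_perm = X_test.copy()
        X_perm[:, var] = rng.permutation(X_perm[:, var])
        pred_perm = np.sign(np.column_stack([np.ones(test.size), X_perm]) @ coef)
        after[i] = np.mean(pred_perm != y[test])
    return before, after

# Synthetic demo (assumed data): variable 0 carries class information,
# variable 1 is pure noise.
rng = np.random.default_rng(1)
y = np.repeat([-1.0, 1.0], 50)
X = np.column_stack([1.5 * y + rng.normal(size=100), rng.normal(size=100)])
before0, after0 = permutation_error_distributions(X, y, var=0, n_iter=200)
before1, after1 = permutation_error_distributions(X, y, var=1, n_iter=200)
print("informative:  ", before0.mean(), "->", after0.mean())
print("uninformative:", before1.mean(), "->", after1.mean())
```

Comparing the two returned distributions (for example with a rank-based test) is what separates the informative, uninformative and interfering cases; the choice of test is left open here.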

Fig.7. The idea and main results of volatile metabolic footprinting of tangerine peels collected from July to December. A. Photographs of the tangerine peels. B. Metabolic footprints obtained by PCA. C. A heat map of the relative abundance levels of the volatile compounds during the ripening process. (Reprinted with permission from (Yi, Dong)).

Fig.8. Plots of PLS scores (A) and plots of cross-validated PLS scores (B) on simulated data. Four data sets of random values were simulated by computer. Each data set has 100 samples, and the number of variables is set to 5, 50, 500 and 5000, respectively. The class label of each sample is randomly assigned. For each data set, PLS-DA and cross-validated PLS-DA are implemented.

Table 1. Available databases and libraries for metabolite identification

| Name | Access^a | Current size | Website |
| --- | --- | --- | --- |
| MS Spectral Library | | | |
| NIST 14 | c | 276,248 (242,466) | http://www.nist.gov/srd/nist1a.cfm |
| Wiley Registry of Mass Spectral Data | c | 670,000 (570,000) | http://onlinelibrary.wiley.com/book/10.1002/9780470175217 |
| Golm Metabolome Database^RT | d | 26,587 | http://gmd.mpimp-golm.mpg.de/ |
| FiehnLib | d | 1000 | http://fiehnlab.ucdavis.edu/projects/FiehnLib/index_html |
| MassBank | d | 40,889 | http://www.massbank.jp/ |
| NIST MS/MS Library | c | 234,284 (45,298) | http://www.nist.gov/srd/nist1a.cfm |
| ReSpect | d | 9017 | http://spectra.psc.riken.jp/ |
| METLIN | w | 61,784 | http://metlin.scripps.edu |
| Chemical Substance Database | | | |
| PubChem Compound Database | d | >53 million | http://www.ncbi.nlm.nih.gov/pccompound |
| ChemSpider | w | >21 million | http://www.chemspider.com/ |
| Manchester Metabolomics Database | d | 42,553 | http://dbkgroup.org/MMD/ |
| BiGG Database | w | 2835 | http://bigg.ucsd.edu/bigg |
| BioCyc (MetaCyc) | | UNKNOWN | http://biocyc.org/ |
| CAS Registry | c | >89 million | http://www.cas.org/ |
| CSLS | w | UNKNOWN | http://cactus.nci.nih.gov/ |
| GDB databases | d | ~166 billion | http://www.gdb.unibe.ch/gdb/ |
| Dictionary of Natural Products | c | 240,007 | http://dnp.chemnetbase.com/dictionary-search.do?method=view&id=10799945&struct=start&props=&&si= |
| Beilstein database | c | >500 million | http://www.elsevier.com/online-tools/reaxys |
| KEGG ligand database | d | 17,282 | http://www.genome.jp/kegg/ligand.html |
| ChEBI | d | 40,211 | http://www.ebi.ac.uk/chebi/ |
| HMDB | d | 41,806 | http://www.hmdb.ca/ |
| KNApSAcK | d | 50,899 | http://kanaya.naist.jp/KNApSAcK/ |
| LIPID MAPS | d | 37,566 | http://www.lipidmaps.org/ |
| LipidBank | w | 7,009 | http://www.lipidbank.jp/ |
| METLIN | w | 240,501 | http://metlin.scripps.edu |
| SDBS | w | 34,000 | http://sdbs.db.aist.go.jp/sdbs/cgi-bin/cre_index.cgi |

^a Access right to the database: c, d and w denote commercial, downloadable and online access, respectively.
^RT Retention indices are included.

Table 2. Available metabolite identification tools and related tools assisting metabolite identification

| Name | Reference | Website | Access |
| --- | --- | --- | --- |
| GC-MS Spectrum Identification | | | |
| MassLib | | http://www.masslib.com/ | c |
| MOLGEN-MS | (Kerber, Laue, 2001) | http://molgen.de/?src=documents/molgenms.html | d,w |
| Mass Spectrum Interpreter | (Stein, 1995) | http://chemdata.nist.gov/mass-spc/interpreter/ | d |
| Accurate Mass | | | |
| MetWorks | | http://www.thermoscientific.com | c |
| MetabolitePilot | | http://www.absciex.com | c |
| Seven Golden Rules | (Kind and Fiehn, 2007) | http://fiehnlab.ucdavis.edu/projects/Seven_Golden_Rules/ | d |
| SIRIUS | (Bocker et al., 2009) | http://bio.informatik.uni-jena.de/sirius2/ | d |
| MI-Pack | (Weber and Viant, 2010) | http://www.biosciences-labs.bham.ac.uk/viant/mipack | d |
| MetaboSearch | (Zhou et al., 2012) | http://omics.georgetown.edu/metabosearch.html | d |
| MS/MS Spectrum Prediction | | | |
| Mass Frontier | | http://www.thermoscientific.com | c,g |
| ACD/MS Fragmenter | | http://www.acdlabs.com | c,g |
| MetISIS | (Kangas, Metz, 2012) | http://omics.pnl.gov/software/ | d |
| In silico Fragmentation | | | |
| FiD | (Heinonen, Rantanen, 2008) | http://www.cs.helsinki.fi/group/sysfys/software/fragid/ | d |
| Mass-MetaSite | (Bonn, Leandersson, 2010) | http://www.moldiscovery.com/software/massmetasite | c |
| MetFrag | (Wolf, Schmidt, 2010) | http://c-ruttkies.github.io/MetFrag/ | d,w |
| FingerID | (Heinonen, Shen, 2012) | https://github.com/icdishb/fingerid | d |
| MetFusion | (Gerlich and Neumann, 2013) | http://msbi.ipb-halle.de/MetFusion/ | w |
| CFM-ID | (Allen, Greiner, 2014a, Allen et al., 2014b) | http://cfmid.wishartlab.com/ | d,w |
| De Novo Analysis | | | |
| SIRIUS2 | (Bocker and Rasche, 2008, Rasche, Svatos, 2011) | http://bio.informatik.uni-jena.de/sirius2/ | d |
| Molecule Ion Annotation | | | |
| PUTMEDID-LCMS | (Brown, Dunn, 2009) | http://www.mcisb.org/resources/putmedid.html | d |
| CAMERA | (Kuhl, Tautenhahn, 2012) | http://metlin.scripps.edu/xcms/useful_links.php | d |
| IDEOM | (Creek, Jankevics, 2012) | http://mzmatch.sourceforge.net/ideom.php | d |
| MZedDB | (Draper, Enot, 2009) | http://maltese.dbs.aber.ac.uk:8888/hrmet/index.html | w |
| MAIT | | | |
| Mass Spectra Deconvolution | | | |
| AMDIS | (Stein, 1999) | http://chemdata.nist.gov/dokuwiki/doku.php?id=chemdata:amdis | d |
| Deconvolution Reporting Software | | http://www.chem.agilent.com/en-US/products-services/Software-Informatics/Deconvolution-Reporting-Software-%28DRS%29/Pages/default.aspx | c |
| AnalyzerPro | | http://www.spectralworks.com/analyzerpro.html | c |
| ChromaTOF® | | http://www.leco.com/products/separation-science/software-accessories/chromatof-software | c |
| Formula Generation | | | |
| OMG | (Peironcely, Rojas-Cherto, 2012) | http://sourceforge.net/projects/openmg/ | d |
| The Chemistry Development Kit | (Steinbeck et al., 2003) | http://sourceforge.net/projects/cdk/ | d |
| Formula To Mass To Formula | | http://www.ch.ic.ac.uk/java/applets/f2m2f/ | w |
| Molecular Formula finder | | http://www.chemcalc.org/mf_finder | w |
| HiRes | | http://hires.sourceforge.net/ | w,d |

c Commercially available. d Freely downloadable to the local site. w Freely accessed via web interface. g Also suitable for GC-MS spectrum.

Table 3. A taxonomy of variable selection techniques with the mentioned methods.

| Methods | Classifier | Interpretability | Considers the interaction effect among variables | Variable ranking or subset selection | Computation speed | Reference |
| --- | --- | --- | --- | --- | --- | --- |
| PLS-weights | PLS | Based on the loading weight matrices of PLS modeling. | NO | Ranking | High | (Wold, Sjöström, 2002) |
| PLS-VIP | PLS | Accumulates the importance of each variable as reflected by the loading weights from each latent variable of PLS. | NO | Ranking | High | (Favilla, Durante, 2013) |
| PLS-regression coefficient | PLS | A single measure of association between each variable and the response. | NO | Ranking | High | (Wold, Sjöström, 2001a) |
| Correlation | No classifier | Calculated directly between the variables and the classification label. | NO | Ranking | High | (Hall, 1999) |
| Information gain | No classifier | Calculated directly between the variables and the classification label. | NO | Ranking | High | (Ben-Bassat, 1982) |
| Euclidean distance | No classifier | Calculated directly between the variables and the classification label. | NO | Ranking | High | (Liang, Yang, 2008) |
| Mutual information | No classifier | Calculated directly between the variables and the classification label. | NO | Ranking | High | (Yu and Liu, 2004) |
| CARS | PLS | Realizes competitive feature selection based on the absolute regression coefficients. | NO | Subset selection | High | (Li, Liang, 2009a, Zheng, Li, 2012) |
| GA-PLS-DA | PLS-DA | GA is used as the optimization algorithm to find the optimal subset with a PLS-DA classifier. | NO | Subset selection | Low | (Cao, Sitter, 2012) |
| PSO-SVM | SVM | PSO is used as the optimization algorithm to find the optimal subset with the SVM classification method. | NO | Subset selection | Medium | (Alba, Garcia-Nieto, 2007) |
| Random Forest | Decision Tree | Ranks the variables by the percent increase in misclassification error when the variable is permuted randomly. | YES | Ranking | Medium | (Breiman, 2001) |
| SPA | PLS-DA | Identifies and ranks the informative variables based on the difference between the prediction errors of the normal and permuted subwindow for each variable. | YES | Ranking | Medium | (Li, Zeng, 2010b) |
| MIA | SVM | Gives a measure, using the margin of SVM, based on the difference between the prediction errors obtained when each variable is included and excluded. | YES | Ranking | Medium | (Li, Liang, 2011) |
| INTERACT | No classifier | Based on inconsistency and symmetrical uncertainty measurements for finding interacting features. | YES | Subset selection | High | (Zhao and Liu, 2009) |
| VCN | PLS-DA | Computes the complementary information between variables and then discovers biomarkers with the help of the mutual associations of metabolites. | YES | Ranking | Medium | (Li, Xu, 2012) |
| IRIV | PLS | Finds the optimal subset of variables by observing the difference between the prediction errors obtained when each variable is included and excluded. | YES | Subset selection | Medium | (Yun, Wang, 2014b) |
| VISSA | PLS | Searches for the optimal variable combinations by shrinking the variable space smoothly. | YES | Subset selection | Medium | (Deng, Yun, 2014) |
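As an illustration of the PLS-VIP entry in Table 3, the sketch below accumulates the squared loading weights of each latent variable, weighted by the response variance that the latent variable explains. It is a minimal example under stated assumptions, not the cited implementation: it uses a single-response PLS (PLS1) fitted by NIPALS on mean-centred synthetic data, the commonly used VIP formula, and a hypothetical function name `pls1_vip`.

```python
import numpy as np

def pls1_vip(X, y, n_comp=2):
    """Return VIP scores from a PLS1 (NIPALS) fit on mean-centred data."""
    E = X - X.mean(axis=0)        # X residual matrix
    f = y - y.mean()              # y residual vector
    n_vars = X.shape[1]
    weighted = np.zeros(n_vars)   # sum over components of SS_a * w_ja^2
    ss_total = 0.0                # sum over components of SS_a
    for _ in range(n_comp):
        w = E.T @ f
        w /= np.linalg.norm(w)    # unit-length loading weights
        t = E @ w                 # scores
        tt = t @ t
        p = E.T @ t / tt          # X loadings
        c = (f @ t) / tt          # y loading
        ss = c ** 2 * tt          # y sum of squares explained by this component
        weighted += ss * w ** 2
        ss_total += ss
        E = E - np.outer(t, p)    # deflate X
        f = f - c * t             # deflate y
    return np.sqrt(n_vars * weighted / ss_total)

# Synthetic demo (assumed data): only variable 0 drives the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=60)
vip = pls1_vip(X, y)
ranking = np.argsort(vip)[::-1]   # variables ordered from most to least important
print(ranking)
```

Because the loading weights are normalised to unit length, the squared VIP scores sum to the number of variables, which is why a threshold of 1 is often used as a rule of thumb for ranking.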

Table 4. An overview of multivariate analysis methods for modeling

| Method | Category | Advantage | Disadvantage |
| --- | --- | --- | --- |
| PCA | unsupervised | Suited to providing an overview of a large dataset. | Class information is not considered. |
| HCA | unsupervised | Suited to providing an overview of the clusters of samples. | Class information is not considered. Variable importance is not obtained. |
| SOM | unsupervised | Accounts for non-linearity in the data. | Class information is not considered. |
| LDA | supervised | Easy and fast. Suited to linear and low-dimensional data. | Not suited to high-dimensional data. |
| PLS-DA | supervised | Particularly suited to linear and co-linear data. | Not suited to unbalanced data. |
| OPLS-DA | supervised | Particularly suited to linear and co-linear data. Good visualization and interpretation ability. | Not suited to unbalanced data. |
| SVM | supervised | Suited to linear and nonlinear problems. High flexibility in modeling non-linear data. | Model tuning is complex. |
| RF | supervised | Suited to linear and nonlinear problems. Resistant to outliers. | Relatively low computation speed. |

Applications in metabolomics: (Mao et al., 2014) (Zhang et al., 2014) (Koekemoer et al., 2012) (Wang et al., 2009) (Draisma et al., 2013) (Hubert et al., 2014) (Kriegel et al., 2009) (Makinen et al., 2008) (Patterson et al., 2008) (Vaclavik et al., 2012) (Yi et al., 2014) (Dong et al., 2012) (Rago et al., 2013) (Varghese et al., 2010) (Solinas et al., 2014) (Zhang et al., 2013a) (Chan et al., 2009) (Holmes et al., 2008) (Krooshof et al., 2010) (Lin et al., 2011) (Mahadevan et al., 2008) (Uarrota et al., 2014) (Bertini et al., 2014) (Fan et al., 2010) (Lin, Wang, 2011) (Liu et al., 2014b)

Figure 1

Figure 2

Figure 3

Figure 4

Figure 5

Figure 6

Figure 7

Figure 8
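The simulation described in the Fig.8 legend (random data, random class labels, 100 samples, 5 to 5000 variables) can be sketched as follows. This is a minimal sketch under stated assumptions rather than the authors' code: PLS-DA is approximated by a single-response PLS (NIPALS) regression on -1/+1 labels, the number of latent variables (2) and the 5-fold cross-validation scheme are assumptions, and classification accuracy is reported in place of the score plots. All function names are hypothetical.

```python
import numpy as np

def pls1_fit(X, y, n_comp=2):
    """Fit PLS1 by NIPALS on centred data; return (x_mean, y_mean, coef)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E, f = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_comp):
        w = E.T @ f
        w /= np.linalg.norm(w)
        t = E @ w
        tt = t @ t
        p = E.T @ t / tt
        c = (f @ t) / tt
        E = E - np.outer(t, p)
        f = f - c * t
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)   # regression vector B = W (P'W)^-1 q
    return x_mean, y_mean, coef

def predict_class(model, X):
    """Assign -1/+1 by thresholding the PLS prediction at zero."""
    x_mean, y_mean, coef = model
    return np.where((X - x_mean) @ coef + y_mean >= 0, 1.0, -1.0)

def fitted_and_cv_accuracy(X, y, n_comp=2, n_folds=5, seed=0):
    """Accuracy on the training data itself versus k-fold cross-validated accuracy."""
    fitted = np.mean(predict_class(pls1_fit(X, y, n_comp), X) == y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    hits = 0
    for test in folds:
        train = np.setdiff1d(np.arange(len(y)), test)
        model = pls1_fit(X[train], y[train], n_comp)
        hits += np.sum(predict_class(model, X[test]) == y[test])
    return fitted, hits / len(y)

# Four random data sets with randomly assigned class labels, as in the legend.
rng = np.random.default_rng(42)
results = {}
for n_vars in (5, 50, 500, 5000):
    X = rng.normal(size=(100, n_vars))
    y = rng.choice([-1.0, 1.0], size=100)
    results[n_vars] = fitted_and_cv_accuracy(X, y)
    print(n_vars, results[n_vars])
```

With many more variables than samples, the fitted accuracy approaches perfect separation even though the labels are random, while the cross-validated accuracy stays near chance; that contrast is the point of panels (A) and (B).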

Recent advances in chemometric methods for plant metabolomics: A review.

This article has been withdrawn at the request of the author(s) and/or editor. The Publisher apologizes for any inconvenience this may cause. The full...