Biometrical Journal 56 (2014) 5, 761–763

DOI: 10.1002/bimj.201300198


Discussion

What about sparsity when data are curves?

Frédéric Ferraty∗

Toulouse Mathematics Institute, University of Toulouse, 118 route de Narbonne, 31400 Toulouse, France

Received 17 September 2013; revised 3 February 2014; accepted 8 February 2014

This is a discussion of the paper “Overview of object oriented data analysis” by J. Steve Marron and Andrés M. Alonso.

Keywords: Functional data; Nonparametric regression; Sparsity; Variable selection.

Object-oriented data analysis (OODA) is a very exciting modern field of research in Statistics. The overview by Marron and Alonso (2014) points out interesting issues and emphasizes numerous challenging open problems. OODA is clearly an extension of what people usually call high-dimensional data analysis. The recent literature on high-dimensional data involves essentially three main aspects: a small sample of high-dimensional observations, linear modeling, and the sparsity approach. More precisely, a small sample of high-dimensional observations corresponds most of the time to the situation where one has to analyze a collection of n random vectors of dimension p (i.e., n observations of p scalar variables) with p much greater than n; sparsity refers to the notion of redundant information and assumes that only a few of the p variables have to be considered, but we never know which ones in advance. There is a considerable amount of literature around this topic (see, for instance, Tibshirani, 1996; Hastie et al., 2009; Bühlmann and van de Geer, 2011), mainly related to the linear regression model. Indeed, this linear framework allows fast computation, accomplished theoretical developments, and interpretable outputs. However, in order to better understand the underlying structure of the data, statisticians have to face challenging issues. For instance, the following important questions arise: How can the linear assumption be relaxed? How can the sparsity approach be extended to more complex data? When handling high-dimensional datasets (typically an n-sample of high-dimensional vectors {xi = (xi1, . . . , xip)}i=1,...,n) in a regression setting (i.e., one observes corresponding responses {yi}i=1,...,n), a first step has been achieved by Ferraty and Hall (2014), who propose a new algorithm for nonlinearly selecting the few most predictive variables (more details about this method are available online at http://www.math.univ-toulouse.fr/~ferraty/online-resources.html).
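To fix ideas, the following is a minimal sketch of the general principle behind such methods, not the Ferraty and Hall (2014) algorithm itself: variables are added greedily, scored by a leave-one-out nearest-neighbour regression error, so that the selection is genuinely nonlinear. All function names and the synthetic data are illustrative assumptions.

```python
import numpy as np

def loo_knn_error(X, y, k=5):
    # leave-one-out k-nearest-neighbour regression error on the chosen columns
    dist = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)  # L1 distances
    np.fill_diagonal(dist, np.inf)                            # exclude each point itself
    neighbours = np.argsort(dist, axis=1)[:, :k]
    pred = y[neighbours].mean(axis=1)
    return np.mean((y - pred) ** 2)

def greedy_select(X, y, max_vars=3, k=5):
    # forward selection: repeatedly add the variable that most reduces the error
    selected, remaining = [], list(range(X.shape[1]))
    best_err = np.inf
    while remaining and len(selected) < max_vars:
        err, j = min((loo_knn_error(X[:, selected + [c]], y, k), c)
                     for c in remaining)
        if err >= best_err:          # no candidate improves prediction: stop
            break
        best_err = err
        selected.append(j)
        remaining.remove(j)
    return selected, best_err

# synthetic example: only columns 0 and 2 of ten candidates carry signal
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 10))
y = np.sin(3 * X[:, 0]) + 2 * X[:, 2] ** 2
sel, err = greedy_select(X, y, max_vars=2)
```

Because the score is a nonparametric prediction error, nonlinear links such as sin(3·x0) are detected even though no linear model would fit them; this is the sense in which the selection is nonlinear.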
Now, let us consider the situation where the observed n high-dimensional vectors represent n discretized curves at p design points λ1, . . . , λp: xi1 = xi(λ1), . . . , xip = xi(λp). Standard examples of such datasets come from near-infrared spectrometry (see, for instance, Borggaard and Thodberg, 1992), a nondestructive technology that allows one to measure chemical compounds in a wide variety of products. By definition, the underlying mechanism is continuous, and a classical interpolation/approximation allows one to represent the observations as continuous spectrometric curves. This is shown in Fig. 1A, which displays 60 curves derived from the approximation of 60 first-derivative spectra x1(λ), . . . , x60(λ) of 60 gasoline samples in the range 900–1700 nm (i.e., λ ∈ [900; 1700]). The original data can be found in the R (see R Development Core Team, 2013) package fds (Shang and Hyndman, 2013). Each spectrum is sampled at 401 equally spaced wavelengths λ1 = 900 nm, . . . , λ401 = 1700 nm, and Fig. 1B zooms in on the gray area of panel (A); the vertical lines identify the design points (wavelengths) λ131 = 1160 nm, λ132 = 1162 nm, . . . , λ171 = 1240 nm. For each of these 60 gasoline samples, one observes the octane number (y1 = 85.3, . . . , y60 = 87.1), and the statistical aim is to predict the octane number from a given spectrum. In this example, the variability of the approximated first derivatives of the spectra seems very small, but surprisingly it is sufficient to predict the corresponding octane numbers well.

∗ Corresponding author: e-mail: [email protected]

© 2014 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim

Figure 1 Differentiated spectra of 60 gasoline samples: (A) continuous paths; (B) vertical lines identify design points.

When one has at hand a collection of curves sampled at design points, the notion of sparsity is not so clear. If one considers the observed discretized curves as vectors, one can try to select a few design points; if the observations are viewed as continuous curves, one would prefer to select a few continuous subintervals of wavelengths instead of particular points. From a practical point of view, the strategy used in the algorithm developed by Ferraty and Hall (2014) has been adapted in order to select the best predictive continuous portions of wavelengths. The gray areas in Fig. 2A identify the best predictive continuous portions of wavelengths when the observed curves are considered as continuous functions; the vertical lines in Fig.
2B locate the particular wavelengths selected when the discretized curves are processed as vectors. The point-wise selection method retains four wavelengths (1234, 1386, 1388, and 1482 nm) and the interval-wise technique selects two portions ([1228; 1286] and [1380; 1438]). It is worth noting that the selected intervals contain the first three wavelengths retained by the point-wise method. The predictive power (assessed via a standard Monte Carlo scheme) of both methods is quite similar, even though the two selection techniques express the sparsity of such a dataset in two different ways: point-wise sparsity and continuous-wise sparsity. Based on this example, one can extend the notion of sparsity to more complex datasets as soon as one has at hand some measure μ acting on the space of design “points” (in our example the wavelengths live on the real line, but one can encounter numerous situations where the design points belong to multidimensional spaces). Then, a “point-wise” selection of design points amounts to retaining a subset S of design points with μ(S) = 0, whereas a continuous-wise selection provides a family of subsets S1, S2, . . . with μ(Sk) ≠ 0 for every k. However, the underlying mechanisms generating the data are often subtle. So, it might be advantageous to mix point-wise and continuous-wise selection for much more complex data; this will certainly be an interesting idea for developing useful new statistical methodologies.

Figure 2 Once-differentiated spectra of 60 gasoline samples: (A) selected continuous portions; (B) vertical lines identify the four selected wavelengths.

Conflict of interest
The author has declared no conflict of interest.
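As a small numerical check of the measure-based distinction between the two notions of sparsity, the following sketch uses the wavelengths and intervals reported for the gasoline example, with μ taken as Lebesgue measure on the wavelength axis; the helper name `lebesgue_measure` is ours, not part of any package.

```python
# wavelengths (nm) from the gasoline example above
point_wise = [1234.0, 1386.0, 1388.0, 1482.0]          # isolated design points
interval_wise = [(1228.0, 1286.0), (1380.0, 1438.0)]   # continuous portions

def lebesgue_measure(points=(), intervals=()):
    # mu(S): any finite set of points has Lebesgue measure 0,
    # while each interval [a, b] contributes its length b - a
    return float(sum(b - a for a, b in intervals))

mu_points = lebesgue_measure(points=point_wise)        # point-wise: mu(S) = 0
mu_intervals = lebesgue_measure(intervals=interval_wise)

# the selected intervals contain the first three point-wise wavelengths
covered = [any(a <= w <= b for a, b in interval_wise) for w in point_wise]
```

Here the point-wise selection has measure zero while the two intervals carry a total measure of 116 nm, and the coverage check confirms that only the fourth point-wise wavelength (1482 nm) falls outside the selected portions.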

References

Borggaard, C. and Thodberg, H. H. (1992). Optimal minimal neural interpretation of spectra. Analytical Chemistry 64, 545–551.

Bühlmann, P. and van de Geer, S. (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Series in Statistics, Springer, New York, NY.

Ferraty, F. and Hall, P. (2014). An algorithm for nonlinear, nonparametric model choice and prediction. ArXiv:1401.8097.

Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd edn.). Springer, New York, NY.

Marron, J. S. and Alonso, A. M. (2014). Overview of object oriented data analysis. Biometrical Journal 56, 732–753.

R Development Core Team (2013). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. Available at: http://www.R-project.org/.

Shang, H. L. and Hyndman, R. J. (2013). fds: Functional data sets. R package version 1.7. Available at: http://CRAN.R-project.org/package=fds.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B 58, 267–288.


www.biometrical-journal.com
