psychometrika doi: 10.1007/s11336-015-9478-5

A UNIFIED APPROACH TO FUNCTIONAL PRINCIPAL COMPONENT ANALYSIS AND FUNCTIONAL MULTIPLE-SET CANONICAL CORRELATION

Ji Yeh Choi, Heungsun Hwang MCGILL UNIVERSITY

Michio Yamamoto KYOTO UNIVERSITY

Kwanghee Jung UNIVERSITY OF TEXAS HEALTH SCIENCE CENTER

Todd S. Woodward
UNIVERSITY OF BRITISH COLUMBIA AND BRITISH COLUMBIA MENTAL HEALTH AND ADDICTION RESEARCH INSTITUTE

Functional principal component analysis (FPCA) and functional multiple-set canonical correlation analysis (FMCCA) are data reduction techniques for functional data that are collected in the form of smooth curves or functions over a continuum such as time or space. In FPCA, low-dimensional components are extracted from a single functional dataset such that they explain the most variance of the dataset, whereas in FMCCA, low-dimensional components are obtained from each of multiple functional datasets in such a way that the associations among the components are maximized across the different sets. In this paper, we propose a unified approach to FPCA and FMCCA. The proposed approach subsumes both techniques as special cases. Furthermore, it permits a compromise between the techniques, such that components are obtained from each set of functional data to maximize their associations across different datasets, while accounting for the variance of the data well. We propose a single optimization criterion for the proposed approach, and develop an alternating regularized least squares algorithm to minimize the criterion in combination with basis function approximations to functions. We conduct a simulation study to investigate the performance of the proposed approach based on synthetic data. We also apply the approach to the analysis of multiple-subject functional magnetic resonance imaging data to obtain low-dimensional components of blood-oxygen level-dependent signal changes of the brain over time, which are highly correlated across the subjects as well as representative of the data. The extracted components are used to identify networks of neural activity that are commonly activated across the subjects while carrying out a working memory task.

Key words: functional data, functional principal component analysis, functional multiple-set canonical correlation analysis, alternating regularized least squares algorithm.

Correspondence should be made to Ji Yeh Choi, Department of Psychology, McGill University, 1205 Dr. Penfield Avenue, Montreal, QC H3A 1B1, Canada. Email: [email protected]

© 2016 The Psychometric Society

Functional data consist of samples of curves or surfaces, each of which is assumed to arise from an underlying smooth function that varies over a continuum (e.g., time, space, wavelength, probability) (Ramsay & Silverman, 2005). Data of this type have become prevalent in many areas, owing to the emergence of advanced measurement tools that allow collecting repeated measurements intensively over a short interval. A few examples of functional data in psychology include neuroimaging data, where activations of the entire brain are scanned as a function of time (e.g., Viviani, Grön, & Spitzer, 2005; Hwang, Jung, Takane, & Woodward, 2012a), eye-tracking data, where changes in gaze position or pupil size are captured over a period of time (e.g., Jackson & Sirois, 2009), and music perception data, where variations in melodic perception are measured over time (e.g., Vines, Wanderley, Krumhansl, Nuzzo, & Levitin, 2004; Vines, Nuzzo, & Levitin, 2005).

A range of classical multivariate data analyses have been generalized for the analysis of such functional data. Functional principal component analysis (FPCA) (Ramsay & Dalzell, 1991; Rice & Silverman, 1991) and functional multiple-set canonical correlation analysis (FMCCA) (Hwang et al., 2012a) represent functional extensions of PCA and MCCA for dimension reduction of functional data. The aim of FPCA is to approximate a single set of functional data with low-dimensional representations, or components, in such a way that the components are mutually orthogonal and successively explain the maximum variance of the data. FPCA has been widely combined with other statistical models for examining directional relationships between predictor and dependent variables, including multilevel (e.g., Di, Crainiceanu, Caffo, & Punjabi, 2009) and regression models (e.g., Escabias, Aguilera, & Valderrama, 2004; Hwang, Suk, Lee, & Moskowitz, 2012b; Tan, Choi, & Hwang, 2015), and has been applied to various types of data such as neuroimaging or gene expression data (e.g., Kneip & Utikal, 2001; Viviani et al., 2005; Yao, Müller, & Wang, 2005). FMCCA involves multiple sets of functional data and attempts to investigate associations among them. It extracts low-dimensional components, often called canonical variates, from each set of functional data in such a way that the components are maximally correlated across different sets of functional data but uncorrelated within the same set. FMCCA subsumes functional canonical correlation analysis (e.g., Leurgans, Moyeed, & Silverman, 1993; He, Müller, & Wang, 2003) as a special case in which the number of functional datasets is two.

In this paper, we propose a unified approach to FPCA and FMCCA. The proposed approach aims to provide a general framework for reducing dimensionality in functional data, so that it not only subsumes both FPCA and FMCCA as special cases, but also allows a compromise between the two techniques by controlling the contribution of each technique to the final solution. This enables researchers to determine the degree to which each technique contributes to the solution. Moreover, the proposed approach can address a potential issue inherent to FMCCA, namely that its components may not account for the variance of their data sufficiently well (e.g., Mishra, 2009; Hwang et al., 2013). This is because, as stated earlier, FMCCA focuses only on maximizing the associations among the components across different sets of functional data, rather than on explaining the variance of the data. By applying the proposed approach, we can obtain components that are highly correlated across multiple sets of functional data and, at the same time, explain the variance of the data well. Technically, the proposed approach is a functional extension of Hwang et al.'s (2013) approach that integrates PCA and MCCA into a single framework.

The paper is organized as follows. In Section 1, the technical underpinnings of the proposed approach are described, including model development and parameter estimation. An alternating regularized least squares algorithm is developed to estimate parameters in combination with basis function approximations to functions.
In Section 2, we conduct a simulation study to examine the accuracy of parameter recovery of the proposed approach based on synthetic data. In Section 3, the usefulness of the proposed approach is illustrated by applying it to an empirical data set. We analyze fMRI data in which each functional brain scan is regarded as a snapshot of a function varying across different time points or scans. In Section 4, we summarize implications of the proposed approach and discuss potential topics for future research.

1. Method

In functional data analysis, the basic entity of analysis is a smooth function that is assumed to generate discrete observations or records for an argument. The proposed approach deals with


multiple sets of such functions that are available concurrently. For simplicity, we focus on the case of extracting a single component for the moment. Let $z_{ik}$ denote the $i$th smooth function in the $k$th dataset ($i = 1, \ldots, I$; $k = 1, \ldots, K$), with values $z_{ik}(t_k)$ for any argument $t_k$ ($t_k \in T_k$). The argument $t_k$ can represent time, spatial location, frequency, or any other continuum. Let $w_k$ denote a canonical weight function assigned to the $k$th dataset, with values $w_k(t_k)$, leading to a single component for the dataset. Let $f_i$ denote the $i$th object score obtained based on the leading components that are extracted from the $K$ datasets. Let $a_k$ denote a loading function connecting the object scores to the $k$th dataset, with values $a_k(t_k)$. In the proposed approach, we seek to minimize the following optimization criterion

$$
\begin{aligned}
\phi ={}& \alpha \sum_{i=1}^{I}\sum_{k=1}^{K}\int_{T_k}\bigl(z_{ik}(t_k) - f_i\,a_k(t_k)\bigr)^2\,dt_k
+ (1-\alpha)\sum_{i=1}^{I}\sum_{k=1}^{K}\Bigl(f_i - \int_{T_k} z_{ik}(t_k)\,w_k(t_k)\,dt_k\Bigr)^2 \\
&+ \alpha\lambda \sum_{k=1}^{K}\int_{T_k}\bigl[D^2 a_k(t_k)\bigr]^2\,dt_k
+ (1-\alpha)\rho \sum_{k=1}^{K}\int_{T_k}\bigl[D^2 w_k(t_k)\bigr]^2\,dt_k,
\end{aligned}\tag{1}
$$

with respect to $f_i$, $a_k(t_k)$, and $w_k(t_k)$, subject to $\sum_{i=1}^{I} f_i^2 = 1$ and $\alpha \in [0, 1]$, where $\lambda$ and $\rho$ denote non-negative smoothing parameters, and $\sum_{k=1}^{K}\int_{T_k}[D^2 s(t_k)]^2\,dt_k$ is called a roughness penalty, which refers to the sum over the $K$ datasets of the integrated squared second derivative of a function $s$. In (1), $f_i$ is a scalar because it is obtained based on inner products of two functions ($z_{ik}$ and $w_k$) across the $K$ datasets. The optimization criterion has two subparts: the first two terms represent least squares criteria for FPCA and FMCCA, respectively, and the last two terms are roughness penalty terms that reflect the overall amount of smoothness of the loading and canonical weight functions, as will be discussed below. This criterion reduces to the criterion for FPCA when $\alpha = 1$, whereas it becomes equivalent to that for FMCCA when $\alpha = 0$. In addition, criterion (1) permits a compromise between the two techniques when $0 < \alpha < 1$. For example, we may put a greater emphasis on FPCA than on FMCCA by setting $\alpha > .5$ when our research objective focuses more on explaining the variance of the data than on investigating associations among multiple datasets. This suggests that the value of $\alpha$ should be selected depending on the situation, based on research objectives. An example of determining the value of $\alpha$ is given in Section 3.

In (1), $\int [D^2 s(t)]^2\,dt$ is used as a measure of the roughness or curvature of a function, because the second derivative of a straight line is equal to zero, and the second derivative is expected to increase in magnitude as a function becomes more variable. Moreover, the smoothing parameters ($\lambda$ and $\rho$) indicate the relative importance of the roughness penalty terms in estimating the loading and weight functions. For example, when $\lambda = \rho = 0$, (1) reduces to an ordinary least squares optimization criterion. As the smoothing parameters become larger, more emphasis is placed on the penalty terms in (1), leading the estimated functions to vary more smoothly. The values of $\lambda$ and $\rho$ may be determined by $G$-fold cross validation (Hastie, Tibshirani, & Friedman, 2009), which will be discussed later in this section.

For parameter estimation, we first apply basis function expansions to approximate all sets of functional data, loading functions, and canonical weight functions. Let $\boldsymbol{\psi}_k(t_k)$ denote an $M_k$ by 1 vector of basis functions evaluated at an argument $t_k$ ($t_k \in T_k$). Let $\mathbf{c}_{ik}$, $\mathbf{s}_k$, and $\mathbf{b}_k$ denote $M_k$ by 1 vectors of coefficients for the basis functions for the $i$th case's smooth function, a loading function, and a canonical weight function, respectively. Each set of functional data and the parameter functions are approximated using the basis function expansions, as follows.


$$
z_{ik}(t_k) = \sum_{m=1}^{M_k} c_{imk}\,\psi_{mk}(t_k) = \mathbf{c}_{ik}'\boldsymbol{\psi}_k(t_k), \qquad
a_k(t_k) = \sum_{m=1}^{M_k} s_{mk}\,\psi_{mk}(t_k) = \mathbf{s}_k'\boldsymbol{\psi}_k(t_k), \qquad
w_k(t_k) = \sum_{m=1}^{M_k} b_{mk}\,\psi_{mk}(t_k) = \boldsymbol{\psi}_k(t_k)'\mathbf{b}_k. \tag{2}
$$
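To make this step concrete, the following Python sketch (ours, not part of the paper) builds an order-four B-spline basis on an equally spaced grid and approximates, by Riemann sums on a fine grid, the two matrices that these expansions give rise to in the derivation below: the Gram matrix $\int \boldsymbol{\psi}_k \boldsymbol{\psi}_k'\,dt_k$ and the roughness penalty matrix $\int D^2\boldsymbol{\psi}_k\,D^2\boldsymbol{\psi}_k'\,dt_k$ (denoted $\mathbf{J}_k$ and $\mathbf{R}_k$ in (3)). The basis size, interval, and grid resolution are illustrative assumptions.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis(n_basis, t_min, t_max, order=4):
    """Return a function evaluating n_basis B-spline basis functions
    of the given order (degree = order - 1) on [t_min, t_max]."""
    degree = order - 1
    n_interior = n_basis - order
    interior = np.linspace(t_min, t_max, n_interior + 2)[1:-1]
    knots = np.concatenate([[t_min] * order, interior, [t_max] * order])
    def evaluate(t, deriv=0):
        # Column m holds the m-th basis function (or its derivative) at t.
        cols = []
        for m in range(n_basis):
            coef = np.zeros(n_basis)
            coef[m] = 1.0
            spl = BSpline(knots, coef, degree)
            if deriv:
                spl = spl.derivative(deriv)
            cols.append(spl(t))
        return np.column_stack(cols)
    return evaluate

# Illustrative sizes: M_k = 12 basis functions over T_k = [0, 100].
basis = bspline_basis(n_basis=12, t_min=0.0, t_max=100.0)
grid = np.linspace(0.0, 100.0, 2001)
dt = grid[1] - grid[0]
Psi = basis(grid)            # 2001 x 12 matrix of basis values
Psi2 = basis(grid, deriv=2)  # second derivatives D^2 psi

J = Psi.T @ Psi * dt    # approximates J_k = int psi psi' dt
R = Psi2.T @ Psi2 * dt  # approximates R_k = int D2psi D2psi' dt
```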

By combining (2) into (1), the optimization criterion can be rewritten as

$$
\begin{aligned}
\phi ={}& \alpha \sum_{i=1}^{I}\sum_{k=1}^{K}\int_{T_k}\bigl(\mathbf{c}_{ik}'\boldsymbol{\psi}_k(t_k) - f_i\,\mathbf{s}_k'\boldsymbol{\psi}_k(t_k)\bigr)^2\,dt_k
+ (1-\alpha)\sum_{i=1}^{I}\sum_{k=1}^{K}\Bigl(f_i - \mathbf{c}_{ik}'\Bigl[\int_{T_k}\boldsymbol{\psi}_k(t_k)\boldsymbol{\psi}_k(t_k)'\,dt_k\Bigr]\mathbf{b}_k\Bigr)^2 \\
&+ \alpha\lambda \sum_{k=1}^{K}\mathbf{s}_k'\Bigl[\int_{T_k} D^2\boldsymbol{\psi}_k(t_k)\,D^2\boldsymbol{\psi}_k(t_k)'\,dt_k\Bigr]\mathbf{s}_k
+ (1-\alpha)\rho \sum_{k=1}^{K}\mathbf{b}_k'\Bigl[\int_{T_k} D^2\boldsymbol{\psi}_k(t_k)\,D^2\boldsymbol{\psi}_k(t_k)'\,dt_k\Bigr]\mathbf{b}_k \\
={}& \alpha \sum_{i=1}^{I}\sum_{k=1}^{K}\bigl(\mathbf{c}_{ik} - f_i\mathbf{s}_k\bigr)'\mathbf{J}_k\bigl(\mathbf{c}_{ik} - f_i\mathbf{s}_k\bigr)
+ (1-\alpha)\sum_{i=1}^{I}\sum_{k=1}^{K}\bigl(f_i - \mathbf{c}_{ik}'\mathbf{J}_k\mathbf{b}_k\bigr)^2
+ \alpha\lambda \sum_{k=1}^{K}\mathbf{s}_k'\mathbf{R}_k\mathbf{s}_k
+ (1-\alpha)\rho \sum_{k=1}^{K}\mathbf{b}_k'\mathbf{R}_k\mathbf{b}_k,
\end{aligned}\tag{3}
$$

where $\mathbf{J}_k = \int_{T_k}\boldsymbol{\psi}_k(t_k)\boldsymbol{\psi}_k(t_k)'\,dt_k$ and $\mathbf{R}_k = \int_{T_k} D^2\boldsymbol{\psi}_k(t_k)\,D^2\boldsymbol{\psi}_k(t_k)'\,dt_k$. Note that both $\mathbf{J}_k$ and $\mathbf{R}_k$ are symmetric. Let $\mathbf{f} = [f_1, \ldots, f_I]'$ denote an $I$ by 1 vector consisting of all object scores of the leading components, and let $\mathbf{C}_k = [\mathbf{c}_{1k}, \ldots, \mathbf{c}_{Ik}]'$ denote an $I$ by $M_k$ matrix of the coefficients for the basis functions for the functional data. Then, (3) can be re-expressed as

$$
\phi = \alpha \sum_{k=1}^{K}\mathrm{SS}\bigl(\mathbf{C}_k - \mathbf{f}\mathbf{s}_k'\bigr)_{\mathbf{J}_k}
+ (1-\alpha)\sum_{k=1}^{K}\mathrm{SS}\bigl(\mathbf{f} - \mathbf{C}_k\mathbf{J}_k\mathbf{b}_k\bigr)
+ \alpha\lambda \sum_{k=1}^{K}\mathbf{s}_k'\mathbf{R}_k\mathbf{s}_k
+ (1-\alpha)\rho \sum_{k=1}^{K}\mathbf{b}_k'\mathbf{R}_k\mathbf{b}_k, \tag{4}
$$

where $\mathrm{SS}(\mathbf{A}) = \mathrm{tr}(\mathbf{A}'\mathbf{A})$ and $\mathrm{SS}(\mathbf{A})_{\mathbf{H}} = \mathrm{tr}(\mathbf{A}'\mathbf{A}\mathbf{H})$.
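As a concrete restatement of the matrix form, the following Python sketch (ours, not part of the paper) evaluates the criterion from the basis-coefficient quantities. It is written for the general $Q$-component case, which coincides with (4) when $Q = 1$ and with (12) below otherwise; all variable names are our own.

```python
import numpy as np

def criterion(F, S, B, C, J, R, alpha, lam, rho):
    """Evaluate the criterion in (4) (or (12) when Q > 1).
    F: I x Q object scores; S[k], B[k]: M_k x Q coefficient matrices;
    C[k]: I x M_k data coefficients; J[k], R[k]: M_k x M_k matrices."""
    phi = 0.0
    for k in range(len(C)):
        E1 = C[k] - F @ S[k].T                     # FPCA residual, I x M_k
        phi += alpha * np.trace(E1.T @ E1 @ J[k])  # SS(C_k - F S_k')_{J_k}
        E2 = F - C[k] @ J[k] @ B[k]                # FMCCA residual, I x Q
        phi += (1 - alpha) * np.sum(E2 ** 2)       # SS(F - C_k J_k B_k)
        phi += alpha * lam * np.trace(S[k].T @ R[k] @ S[k])
        phi += (1 - alpha) * rho * np.trace(B[k].T @ R[k] @ B[k])
    return phi
```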


An alternating regularized least squares algorithm is developed to minimize (4) with respect to $\mathbf{f}$, $\mathbf{s}_k$, and $\mathbf{b}_k$, subject to $\mathbf{f}'\mathbf{f} = 1$. The algorithm consists of three steps, each of which updates one parameter estimate while fixing the others. Specifically, the algorithm begins by assigning random initial values to the parameters, and repeats the following steps until convergence.

Step 1. Update $\mathbf{f}$ for fixed $\mathbf{s}_k$ and $\mathbf{b}_k$. This is equivalent to minimizing the following criterion

$$
\phi_{\mathbf{f}} = \alpha \sum_{k=1}^{K}\mathrm{tr}\bigl[(\mathbf{C}_k - \mathbf{f}\mathbf{s}_k')'(\mathbf{C}_k - \mathbf{f}\mathbf{s}_k')\,\mathbf{J}_k\bigr]
+ (1-\alpha)\sum_{k=1}^{K}(\mathbf{f} - \mathbf{C}_k\mathbf{J}_k\mathbf{b}_k)'(\mathbf{f} - \mathbf{C}_k\mathbf{J}_k\mathbf{b}_k), \tag{5}
$$

with respect to $\mathbf{f}$. By solving $\partial\phi_{\mathbf{f}}/\partial\mathbf{f} = 0$, $\mathbf{f}$ is updated by

$$
\hat{\mathbf{f}} = \Bigl[\alpha \sum_{k=1}^{K}\mathbf{s}_k'\mathbf{J}_k\mathbf{s}_k + (1-\alpha)K\Bigr]^{-1}\sum_{k=1}^{K}\bigl[\alpha\,\mathbf{C}_k\mathbf{J}_k\mathbf{s}_k + (1-\alpha)\,\mathbf{C}_k\mathbf{J}_k\mathbf{b}_k\bigr]. \tag{6}
$$

The estimate of $\mathbf{f}$ is subsequently normalized to satisfy the constraint $\mathbf{f}'\mathbf{f} = 1$.

Step 2. Update $\mathbf{s}_k$ for fixed $\mathbf{f}$ and $\mathbf{b}_k$. This is equivalent to minimizing the following criterion

$$
\phi_{\mathbf{s}} = \alpha \sum_{k=1}^{K}\mathrm{tr}\bigl[(\mathbf{C}_k - \mathbf{f}\mathbf{s}_k')'(\mathbf{C}_k - \mathbf{f}\mathbf{s}_k')\,\mathbf{J}_k\bigr] + \alpha\lambda \sum_{k=1}^{K}\mathbf{s}_k'\mathbf{R}_k\mathbf{s}_k, \tag{7}
$$

with respect to $\mathbf{s}_k$. By solving $\partial\phi_{\mathbf{s}}/\partial\mathbf{s}_k = 0$, we have

$$
\bigl(\mathbf{J}_k \otimes \mathbf{f}'\mathbf{f}\bigr)\mathbf{s}_k + \bigl(\mathbf{R}_k \otimes \lambda\mathbf{I}\bigr)\mathbf{s}_k = \mathrm{vec}\bigl(\mathbf{f}'\mathbf{C}_k\mathbf{J}_k\bigr), \tag{8}
$$

where $\mathrm{vec}(\mathbf{A})$ indicates a supervector formed by stacking all columns of $\mathbf{A}$ one below another, and $\otimes$ indicates the Kronecker product. Then, $\mathbf{s}_k$ is updated by

$$
\hat{\mathbf{s}}_k = \bigl[\mathbf{J}_k \otimes \mathbf{f}'\mathbf{f} + \mathbf{R}_k \otimes \lambda\mathbf{I}\bigr]^{-1}\mathrm{vec}\bigl(\mathbf{f}'\mathbf{C}_k\mathbf{J}_k\bigr). \tag{9}
$$

Step 3. Update $\mathbf{b}_k$ for fixed $\mathbf{f}$ and $\mathbf{s}_k$. This is equivalent to minimizing

$$
\begin{aligned}
\phi_{\mathbf{b}} &= (1-\alpha)\sum_{k=1}^{K}\mathrm{tr}\bigl[-2\,\mathbf{f}'\mathbf{C}_k\mathbf{J}_k\mathbf{b}_k + \mathbf{b}_k'\mathbf{J}_k\mathbf{C}_k'\mathbf{C}_k\mathbf{J}_k\mathbf{b}_k\bigr] + (1-\alpha)\rho\sum_{k=1}^{K}\mathbf{b}_k'\mathbf{R}_k\mathbf{b}_k \\
&= (1-\alpha)\sum_{k=1}^{K}\mathrm{tr}\bigl[-2\,\mathbf{f}'\mathbf{X}_k\mathbf{b}_k + \mathbf{b}_k'\mathbf{X}_k'\mathbf{X}_k\mathbf{b}_k\bigr] + (1-\alpha)\rho\sum_{k=1}^{K}\mathbf{b}_k'\mathbf{R}_k\mathbf{b}_k,
\end{aligned}\tag{10}
$$

with respect to $\mathbf{b}_k$, where $\mathbf{X}_k = \mathbf{C}_k\mathbf{J}_k$. By solving $\partial\phi_{\mathbf{b}}/\partial\mathbf{b}_k = 0$, $\mathbf{b}_k$ is updated by

$$
\hat{\mathbf{b}}_k = \bigl(\mathbf{X}_k'\mathbf{X}_k + \rho\mathbf{R}_k\bigr)^{-1}\mathbf{X}_k'\mathbf{f}. \tag{11}
$$
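The three steps can be assembled into a compact loop. Below is a minimal Python sketch (ours, not part of the paper) of the single-component algorithm; because $\mathbf{f}$ is renormalized so that $\mathbf{f}'\mathbf{f} = 1$, the update (9) reduces to the ridge-type solve $(\mathbf{J}_k + \lambda\mathbf{R}_k)\mathbf{s}_k = \mathbf{J}_k\mathbf{C}_k'\mathbf{f}$. Initialization, tolerance, and the iteration cap are illustrative choices.

```python
import numpy as np

def als_single_component(C, J, R, alpha, lam, rho,
                         max_iter=500, tol=1e-8, seed=0):
    """Alternating regularized least squares for one component.
    C[k]: I x M_k data coefficients; J[k], R[k]: M_k x M_k.
    Returns the object scores f and coefficient vectors s[k], b[k]."""
    rng = np.random.default_rng(seed)
    K, I_n = len(C), C[0].shape[0]
    f = rng.standard_normal(I_n)
    f /= np.linalg.norm(f)                       # enforce f'f = 1
    s = [rng.standard_normal(Ck.shape[1]) for Ck in C]
    b = [rng.standard_normal(Ck.shape[1]) for Ck in C]
    f_old = f.copy()
    for _ in range(max_iter):
        # Step 1: update f as in (6), then renormalize.
        denom = alpha * sum(s[k] @ J[k] @ s[k] for k in range(K)) \
                + (1 - alpha) * K
        f = sum(alpha * C[k] @ J[k] @ s[k]
                + (1 - alpha) * C[k] @ J[k] @ b[k] for k in range(K)) / denom
        f /= np.linalg.norm(f)
        for k in range(K):
            # Step 2: update s_k as in (9); with f'f = 1 this is
            # (J_k + lam * R_k) s_k = J_k C_k' f.
            s[k] = np.linalg.solve(J[k] + lam * R[k], J[k] @ (C[k].T @ f))
            # Step 3: update b_k as in (11), with X_k = C_k J_k.
            Xk = C[k] @ J[k]
            b[k] = np.linalg.solve(Xk.T @ Xk + rho * R[k], Xk.T @ f)
        if 1.0 - abs(f @ f_old) < tol:           # f's sign is indeterminate
            break
        f_old = f.copy()
    return f, s, b
```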


It is straightforward to extend the criterion to the case of extracting more than one component. Let $Q$ denote the number of components to be extracted. Let $\mathbf{F}$ denote an $I$ by $Q$ matrix of object scores. Let $\mathbf{S}_k$ and $\mathbf{B}_k$ denote $M_k$ by $Q$ matrices of basis function coefficients for the loading and weight functions, respectively. When $Q > 1$, (4) is expressed as

$$
\phi = \alpha \sum_{k=1}^{K}\mathrm{SS}\bigl(\mathbf{C}_k - \mathbf{F}\mathbf{S}_k'\bigr)_{\mathbf{J}_k}
+ (1-\alpha)\sum_{k=1}^{K}\mathrm{SS}\bigl(\mathbf{F} - \mathbf{C}_k\mathbf{J}_k\mathbf{B}_k\bigr)
+ \alpha\lambda \sum_{k=1}^{K}\mathrm{tr}\bigl(\mathbf{S}_k'\mathbf{R}_k\mathbf{S}_k\bigr)
+ (1-\alpha)\rho \sum_{k=1}^{K}\mathrm{tr}\bigl(\mathbf{B}_k'\mathbf{R}_k\mathbf{B}_k\bigr), \tag{12}
$$

subject to the orthonormalization constraint $\mathbf{F}'\mathbf{F} = \mathbf{I}$, where $\mathbf{I}$ is an identity matrix of size $Q$. In the first step of the algorithm, $\mathbf{F}$ is updated by $\hat{\mathbf{F}} = \sum_{k=1}^{K}\bigl[\alpha\,\mathbf{C}_k\mathbf{J}_k\mathbf{S}_k + (1-\alpha)\,\mathbf{C}_k\mathbf{J}_k\mathbf{B}_k\bigr]\bigl[\alpha\sum_{k=1}^{K}\mathbf{S}_k'\mathbf{J}_k\mathbf{S}_k + (1-\alpha)K\mathbf{I}\bigr]^{-1}$, followed by orthonormalization to satisfy the constraint $\mathbf{F}'\mathbf{F} = \mathbf{I}$. In the second step, $\mathbf{S}_k$ can be reconstructed from $\mathrm{vec}(\hat{\mathbf{S}}_k) = \bigl[\mathbf{J}_k \otimes \mathbf{F}'\mathbf{F} + \mathbf{R}_k \otimes \lambda\mathbf{I}\bigr]^{-1}\mathrm{vec}\bigl(\mathbf{F}'\mathbf{C}_k\mathbf{J}_k\bigr)$. In the third step, $\mathbf{B}_k$ is updated by $\hat{\mathbf{B}}_k = \bigl(\mathbf{X}_k'\mathbf{X}_k + \rho\mathbf{R}_k\bigr)^{-1}\mathbf{X}_k'\mathbf{F}$.

The values of $\lambda$ and $\rho$ may be chosen automatically by applying $G$-fold cross validation (Hastie et al., 2009, p. 241). In $G$-fold cross validation, we divide the entire dataset into $G$ sub-samples. We use one sub-sample as a test set, while using the remaining $G - 1$ sub-samples as a training set. The training set is used for parameter estimation, and the resulting estimates are applied to the test set to compute a prediction error, which is given by

$$
PE(\lambda, \rho) = \alpha \sum_{n=1}^{I_g}\sum_{k=1}^{K}\int_{T_k}\Bigl(z_{nk}^{(g)}(t_k) - f_n^{(g)}\,\hat{a}_k(t_k)\Bigr)^2 dt_k
+ (1-\alpha)\sum_{n=1}^{I_g}\sum_{k=1}^{K}\Bigl(f_n^{(g)} - \int_{T_k} z_{nk}^{(g)}(t_k)\,\hat{w}_k(t_k)\,dt_k\Bigr)^2, \tag{13}
$$

where $\hat{a}_k(t_k)$ and $\hat{w}_k(t_k)$ denote the estimates obtained from the $g$th training set ($g = 1, \ldots, G$), and $z_{nk}^{(g)}(t_k)$ and $f_n^{(g)}$ denote the $n$th smooth function and object score in the corresponding test set, respectively ($n = 1, \ldots, I_g$). This process is repeated $G$ times, using each sub-sample consecutively and only once as a test set. We then calculate the cross-validation estimate of prediction error by averaging the prediction errors over all $G$ test sets. The values of $\lambda$ and $\rho$ associated with the smallest cross-validation estimate of prediction error are selected as the final ones. Although the choice of $G$ is largely arbitrary, in practice $G = 5$ or 10 is recommended (Hastie et al., 2009, p. 243). If the size of the entire dataset ($I$) is small, we can consider dividing the dataset into $I$ sub-samples (i.e., $G = I$), each of which contains one observation only. This special case is called leave-one-out cross validation.
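Below is a Python sketch (ours, not part of the paper) of the $G$-fold cross-validation search, assuming a common smoothing parameter ($\lambda = \rho$) and reusing the hypothetical als_single_component helper above. The prediction error follows (13) rewritten in basis coefficients; the paper leaves implicit how test-set object scores are formed, so the Step 1 formula is used here as an assumption.

```python
import numpy as np

def select_smoothing(C, J, R, alpha, grid, G=5, seed=1):
    """Choose a common smoothing parameter (lambda = rho) by G-fold CV.
    In basis coefficients, int (z - f a)^2 dt becomes
    (c - f s)' J (c - f s), and int z w dt becomes c' J b."""
    I_n, K = C[0].shape[0], len(C)
    folds = np.array_split(np.random.default_rng(seed).permutation(I_n), G)
    pe = {}
    for lam in grid:
        total = 0.0
        for test in folds:
            train = np.setdiff1d(np.arange(I_n), test)
            _, s, b = als_single_component([Ck[train] for Ck in C],
                                           J, R, alpha, lam, lam)
            # Object scores for the test cases via the Step 1 formula
            # (an assumption; the paper does not spell this step out).
            denom = alpha * sum(s[k] @ J[k] @ s[k] for k in range(K)) \
                    + (1 - alpha) * K
            f_te = sum(alpha * C[k][test] @ J[k] @ s[k]
                       + (1 - alpha) * C[k][test] @ J[k] @ b[k]
                       for k in range(K)) / denom
            for k in range(K):
                E = C[k][test] - np.outer(f_te, s[k])     # rows: c - f s
                total += alpha * np.sum((E @ J[k]) * E)
                total += (1 - alpha) * np.sum(
                    (f_te - C[k][test] @ J[k] @ b[k]) ** 2)
        pe[lam] = total / G                # average over the G test sets
    return min(pe, key=pe.get), pe
```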

2. A Simulation Study

We conducted a simulation study to evaluate how well the proposed approach recovered the parameter values of loading and weight functions. In this study, we considered two factors:


sample size and error standard deviation. We generated datasets based on prescribed components and parameter functions. Specifically, we set the numbers of functional datasets and components to four and two (i.e., $K = 4$ and $Q = 2$), respectively. In addition, we fixed $\alpha = .5$. Each of the data, loading, and weight functions was assumed to vary over 100 equally spaced time points ($T_k = 100$). The loading functions over $[1, 100]$ were given as

$$
\begin{aligned}
a_1^{(1)}(t_k) &= \cos\Bigl(\tfrac{2\pi}{40}(t_k + 2)\Bigr) + 4\sin\Bigl(\tfrac{2\pi}{70}(t_k + 4)\Bigr), &
a_1^{(2)}(t_k) &= \tfrac{11}{2}\cos\Bigl(\tfrac{2\pi}{40}(t_k + 2)\Bigr) + 4\sin\Bigl(\tfrac{2\pi}{70}(t_k + 4)\Bigr), & (14) \\
a_2^{(1)}(t_k) &= 7\cos\Bigl(\tfrac{2\pi}{50}(t_k + 3)\Bigr) + 5\sin\Bigl(\tfrac{2\pi}{60}(t_k + 4)\Bigr), &
a_2^{(2)}(t_k) &= 2\cos\Bigl(\tfrac{2\pi}{50}(t_k + 1)\Bigr) + 5\sin\Bigl(\tfrac{2\pi}{60}(t_k + 4)\Bigr), & (15) \\
a_3^{(1)}(t_k) &= 4\cos\Bigl(\tfrac{2\pi}{50}(t_k + 2)\Bigr) + 3\sin\Bigl(\tfrac{2\pi}{80}(t_k + 5)\Bigr), &
a_3^{(2)}(t_k) &= \cos\Bigl(\tfrac{2\pi}{50}(t_k + 2)\Bigr) + 5\sin\Bigl(\tfrac{2\pi}{80}(t_k + 5)\Bigr), & (16) \\
a_4^{(1)}(t_k) &= 3\cos\Bigl(\tfrac{2\pi}{50}(t_k + 2)\Bigr) + 6\sin\Bigl(\tfrac{2\pi}{90}(t_k + 2)\Bigr), &
a_4^{(2)}(t_k) &= 6\cos\Bigl(\tfrac{2\pi}{70}(t_k + 4)\Bigr) + 6\sin\Bigl(\tfrac{2\pi}{100}(t_k + 4)\Bigr), & (17)
\end{aligned}
$$

where the superscript in parentheses indicates the $q$th component ($q = 1, 2$) and the subscript indicates the $k$th dataset ($k = 1, \ldots, 4$). The loading functions for the $k$th dataset were stored in a $Q$ by $T_k$ matrix, denoted by $\mathbf{A}_k$. Similarly, the weight functions defined over $[1, 100]$ were specified as

$$
\begin{aligned}
w_1^{(1)}(t_k) &= 2\cos\Bigl(\tfrac{t_k\pi}{100} - \tfrac{\pi}{2}\Bigr) + 2, &
w_1^{(2)}(t_k) &= -2\cos\Bigl(\tfrac{t_k\pi}{100} - \tfrac{\pi}{2}\Bigr) - 2, & (18) \\
w_2^{(1)}(t_k) &= 1.5\sin\Bigl(\tfrac{t_k\pi}{100} - \tfrac{\pi}{2}\Bigr), &
w_2^{(2)}(t_k) &= -1.5\sin\Bigl(\tfrac{t_k\pi}{100} - \tfrac{\pi}{2}\Bigr), & (19) \\
w_3^{(1)}(t_k) &= \cos\Bigl(\tfrac{2\pi}{40}(t_k + 1)\Bigr) - 4\sin\Bigl(\tfrac{2\pi}{70}(t_k + 2)\Bigr), &
w_3^{(2)}(t_k) &= \cos\Bigl(\tfrac{2\pi}{40}(t_k + 3)\Bigr) + 4\sin\Bigl(\tfrac{2\pi}{70}(t_k + 4)\Bigr), & (20) \\
w_4^{(1)}(t_k) &= 3.45\sin\Bigl(\tfrac{t_k\pi}{100} - \tfrac{\pi}{2}\Bigr) + 3, &
w_4^{(2)}(t_k) &= 3\cos\Bigl(\tfrac{t_k\pi}{100} - \tfrac{\pi}{2}\Bigr) + 4. & (21)
\end{aligned}
$$

For the $k$th dataset, the weight functions were stored in a $T_k$ by $Q$ matrix, denoted by $\mathbf{W}_k$.

Table 1. The mean square errors (MSE) of the estimates of loading functions (a) and weight functions (w), averaged over 100 time points.

Sample size   Standard deviation   MSE(a)    MSE(w)
I = 50        σ = .5               0.00284   0.00370
              σ = 1                0.00336   0.00428
              σ = 2                0.00421   0.00561
I = 100       σ = .5               0.00290   0.00363
              σ = 1                0.00307   0.00480
              σ = 2                0.00339   0.00572
I = 200       σ = .5               0.00292   0.00327
              σ = 1                0.00381   0.00510
              σ = 2                0.00450   0.00585
I = 500       σ = .5               0.00298   0.00457
              σ = 1                0.00329   0.00525
              σ = 2                0.00492   0.00557

We considered four levels of sample size ($I$ = 50, 100, 200, and 500, matching Table 1) and three levels of error standard deviation ($\sigma$ = .5, 1, and 2). Crossing the three levels of error standard deviation with each sample size yielded a total of 12 conditions. For each condition, we first drew an $I$ by $Q$ matrix of object scores ($\mathbf{F}$) from the standard normal distribution and orthonormalized it to satisfy the constraint $\mathbf{F}'\mathbf{F} = \mathbf{I}$. We also drew an $I$ by $(Q + T_k)$ matrix of random errors, denoted by $\mathbf{E}_k$, from a normal distribution with mean 0 and standard deviation $\sigma$. Then, we generated an $I$ by $T_k$ matrix of each functional dataset $\mathbf{Z}_k$ by $\mathbf{Z}_k = [\mathbf{Y}_k + \mathbf{E}_k]\,\mathbf{V}_k'(\mathbf{V}_k\mathbf{V}_k')^{-1}$, where $\mathbf{V}_k = [\alpha\mathbf{I}, -(1-\alpha)\mathbf{W}_k]$ and $\mathbf{Y}_k = [\alpha\mathbf{F}\mathbf{A}_k, -(1-\alpha)\mathbf{F}]$. This way of generating $\mathbf{Z}_k$ was derived by rewriting the first two terms in (1) in matrix form as $\alpha\sum_{k=1}^{K}\mathrm{SS}(\mathbf{Z}_k - \mathbf{F}\mathbf{A}_k) + (1-\alpha)\sum_{k=1}^{K}\mathrm{SS}(\mathbf{F} - \mathbf{Z}_k\mathbf{W}_k)$, or equivalently as minimizing the sum of squares of $\mathbf{E}_k = [\alpha\mathbf{Z}_k, (1-\alpha)\mathbf{F}] - [\alpha\mathbf{F}\mathbf{A}_k, (1-\alpha)\mathbf{Z}_k\mathbf{W}_k] = \mathbf{Z}_k[\alpha\mathbf{I}, -(1-\alpha)\mathbf{W}_k] - [\alpha\mathbf{F}\mathbf{A}_k, -(1-\alpha)\mathbf{F}] = \mathbf{Z}_k\mathbf{V}_k - \mathbf{Y}_k$ (refer to Hwang et al., 2013).

We generated 500 simulated datasets for each condition. We used 42 B-spline basis functions to approximate the data, loading, and weight functions (i.e., $M_k = 42$ in (2)). The number of basis functions was chosen to be sufficiently large to represent these functions, which vary over a large number of time points (100). To reduce computational burden, we set the smoothing parameters for the loading and weight functions to be equal (i.e., $\lambda = \rho$). We applied fivefold cross validation (i.e., $G = 5$) to determine the common smoothing parameter, considering 10 alternative values that varied from $10^1$ to $10^{10}$ by a factor of 10. To further lessen the computational burden, at each condition the cross-validation procedure was applied to the first 10 datasets, and the smoothing parameter value chosen most frequently by the cross validation over those 10 datasets was employed for the remaining 490 datasets.

We examined the mean square errors (MSE) of the estimated loading and weight functions to evaluate the parameter recovery of the proposed approach. Table 1 provides the MSE of the estimates of the loading and weight functions averaged over 100 time points. As shown in the table, the MSE values of the estimates of both loading and weight functions were quite close to zero in all conditions. Within each sample size, as the error standard deviation became larger, the MSE of the estimates tended to be larger, although the differences were rather small. Moreover, the MSE of the estimates did not seem to change much across sample sizes. This may be because they were already quite small even at $I = 50$. Thus, the study showed that, overall, the proposed approach was able to recover the parameters sufficiently well, even when sample size was small. Moreover, although using the same value for the two smoothing parameters was restrictive, it still resulted in an acceptable level of parameter recovery in the study.
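As an illustration of the data-generating scheme above, the following Python sketch (ours, not part of the paper) generates one replicate of $\mathbf{Z}_k$ via $\mathbf{Z}_k = [\mathbf{Y}_k + \mathbf{E}_k]\,\mathbf{V}_k'(\mathbf{V}_k\mathbf{V}_k')^{-1}$. The loading and weight curves in the usage example are simplified stand-ins, not the exact functions (14)-(21).

```python
import numpy as np

def generate_dataset(F, A, W, alpha, sigma, seed=0):
    """Generate one set of Z_k = (Y_k + E_k) V_k'(V_k V_k')^{-1},
    following the scheme in Section 2 with beta = 1 - alpha.
    F: I x Q object scores (F'F = I); A[k]: Q x T_k loading values;
    W[k]: T_k x Q weight values."""
    rng = np.random.default_rng(seed)
    beta = 1.0 - alpha
    Z = []
    for Ak, Wk in zip(A, W):
        Tk = Ak.shape[1]
        Vk = np.hstack([alpha * np.eye(Tk), -beta * Wk])  # T_k x (T_k + Q)
        Yk = np.hstack([alpha * F @ Ak, -beta * F])       # I x (T_k + Q)
        Ek = sigma * rng.standard_normal(Yk.shape)
        Z.append((Yk + Ek) @ Vk.T @ np.linalg.inv(Vk @ Vk.T))
    return Z

# Illustrative use with random smooth stand-ins for A_k and W_k:
I_n, Q, Tk, K = 50, 2, 100, 4
F, _ = np.linalg.qr(np.random.default_rng(1).standard_normal((I_n, Q)))
t = np.arange(1, Tk + 1)
A = [np.vstack([np.cos(2 * np.pi * t / 40), np.sin(2 * np.pi * t / 70)])
     for _ in range(K)]
W = [np.column_stack([np.sin(t * np.pi / 100), np.cos(t * np.pi / 100)])
     for _ in range(K)]
Z = generate_dataset(F, A, W, alpha=0.5, sigma=0.5)
```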

3. An Empirical Application to Functional Neuroimaging Data

The present example is a subset of functional magnetic resonance imaging (fMRI) data collected in a verbal working memory study (Cairo, Woodward, & Ngan, 2006). The fMRI scanner records variation in the blood-oxygenation level dependent (BOLD) signal, based on the fact that the signal is intensified by changes in cerebral blood flow and blood oxygenation at sites of increasing and decreasing brain activity. A higher proportion of oxygen-rich hemoglobin relative to oxygen-deprived hemoglobin leads to BOLD signal change, and the change in BOLD is referred to as the hemodynamic response (HR) in fMRI. fMRI data consist of a set of BOLD time series over many voxels; for the dataset of interest here, each voxel is a three-dimensional rectangular 4 × 4 × 4 mm cube of brain tissue with its own BOLD time series. A voxel serves as the basic spatial unit of analysis. fMRI images are taken repeatedly; for the dataset of interest here, the time resolution was 3 s, defined as the repetition time (TR), i.e., the time required for a full scan of the brain. BOLD signal changes are considered functional data that reflect a smooth variation in brain activity over multiple scans or time points (Tian, 2010).

3.1. Description of the Data

Four female subjects who were right-handed, healthy, and native English speakers (age range 18-35) participated in a working memory task. The task was a modified version of the Sternberg Item Recognition Task (SIRT), and the memory load variable was manipulated by varying the number of letters subjects saw. The task consisted of three phases: encoding, maintenance, and retrieval. In the encoding phase, one of four conditions (2, 4, 6, or 8 uppercase consonants) was displayed for 4 s. Subjects were then requested to remember the consonants they had seen over a short delay of 6 s (the maintenance phase). During the retrieval phase, a single lowercase consonant was shown for the item recognition test, and subjects decided whether the lowercase letter had been presented in the encoding phase. The completion of an item recognition test constituted a single trial, and another trial began after an intertrial interval (ITI). The ITI was jittered (3, 4, or 5 s), and during the ITI a "+" was displayed, starting 1 s after the test letter and ending 1 s before the encoding letters. Eighteen additional blank trials (20% of total trials) were also included, during which the word RELAX was presented. In total, each subject received 36 trials of the item recognition tests (i.e., 9 occurrences of each memory load condition).

In this example, we had four sets of functional data, one per subject, each containing the BOLD signal changes of 23,621 brain voxels recorded repeatedly over 214 time points/scans. Prior to applying the proposed approach, all BOLD signals were realigned, spatially normalized, and smoothed using statistical parametric mapping (SPM2) (Wellcome Department of Cognitive Neurology, University College London, UK, http://www.fil.ion.ucl.ac.uk/spm/software). To relate experimental conditions directly to brain activity, we constructed a design matrix that predicted the timing of brain activity according to the timing of stimulus presentation.
The design matrix was modeled using finite impulse response (FIR) filters (Goutte, Nielsen, & Hansen, 2000), whose hemodynamic responses were measured from the first to the eighth scan following a stimulus presentation (also referred to as poststimulus scans). The design matrix had all measured scans (i.e., 214 scans) in rows and the 8 poststimulus scans for each load condition in columns, with a 1 assigned where the HR was expected to increase at a specific poststimulus scan and 0 otherwise. Refer to Metzak et al. (2011a, b) for more details on constructing the design matrix.
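To illustrate the structure of such an FIR design matrix, here is a small Python sketch (ours, not part of the paper); the onset scans are hypothetical placeholders, since the actual timing comes from the experiment's stimulus schedule. The predictor weights discussed in Section 3.2 are then the regression coefficients of each estimated weight function, evaluated at the 214 scans, on this matrix.

```python
import numpy as np

def fir_design_matrix(n_scans, onsets, n_post=8):
    """Build an FIR design matrix: one column per (condition,
    poststimulus scan) pair, with 1 where the hemodynamic response
    is modeled at that lag and 0 otherwise.
    onsets: dict mapping condition label -> list of onset scans."""
    conditions = sorted(onsets)
    X = np.zeros((n_scans, len(conditions) * n_post))
    for j, cond in enumerate(conditions):
        for onset in onsets[cond]:
            for lag in range(n_post):
                scan = onset + lag
                if scan < n_scans:
                    X[scan, j * n_post + lag] = 1.0
    return X

# Hypothetical onset scans for the four memory load conditions:
onsets = {"load2": [5, 60, 120], "load4": [15, 75, 135],
          "load6": [25, 90, 150], "load8": [35, 105, 165]}
X = fir_design_matrix(n_scans=214, onsets=onsets)   # 214 x 32

# Predictor weights for one subject (assumption: w_hat holds the
# estimated weight function evaluated at the 214 scans):
# pw, *_ = np.linalg.lstsq(X, w_hat, rcond=None)
```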

Table 2. The cross-validation estimates of prediction error for different values of λ and ρ.

Values of λ   Values of ρ   Estimate of prediction error
λ = 10^1      ρ = 10^1      1.6596
              ρ = 10^2      1.2743
              ρ = 10^3      1.3474
              ρ = 10^4      1.3594
λ = 10^2      ρ = 10^1      1.4416
              ρ = 10^2      1.2736
              ρ = 10^3      1.3492
              ρ = 10^4      1.3618
λ = 10^3      ρ = 10^1      1.4341
              ρ = 10^2      1.3510
              ρ = 10^3      1.3846
              ρ = 10^4      1.3632
λ = 10^4      ρ = 10^1      1.4352
              ρ = 10^2      1.3981
              ρ = 10^3      1.3871
              ρ = 10^4      1.3643
3.2. Analysis and Results

The proposed approach was applied to investigate which brain networks, depicted by components, were likely to be activated commonly across the four subjects while they performed the same working memory task. More specifically, the approach was used to integrate the BOLD signal changes from the four subjects into highly correlated low-dimensional representations, or components, which also explained the BOLD signal changes well. Given the non-periodic changes in the BOLD signals, we used a B-spline basis system to approximate the data, canonical weight, and loading functions. Two elements must be pre-determined for a B-spline basis system: the number of knots that divide the entire interval of a function into subintervals, and the order of the polynomial defining the spline function on each subinterval. We created order-four B-spline basis functions with 72 equally spaced knots over the interval [0, 214], using MATLAB code provided by Dr. James Ramsay (available at http://www.psych.mcgill.ca/misc/fda).

The values of the smoothing parameters (λ and ρ) were chosen via 10-fold cross validation. G = 10 was chosen because the sample size was quite large, so that each of the 10 sub-samples would still have enough observations. Table 2 displays the cross-validation estimates of prediction error for different values of λ and ρ, varying each from 10^1 to 10^4 by a factor of 10. The final value was chosen as λ = ρ = 10^2, at which the minimum cross-validation estimate was achieved.

Given these values of λ and ρ, we first applied the proposed approach to the data by imposing equal contributions of FPCA and FMCCA (α = .5) on the final solution. However, it turned out that the optimization function value was determined heavily by the value of the FPCA term. That is, the value of the FPCA term accounted for about 99% of the optimization function value, whereas the value of the FMCCA term constituted only 1%. This indicates that using the proposed approach under equal weighting would be essentially identical to applying FPCA only, ignoring FMCCA. However, as stated above, our main objective was to integrate the BOLD signal changes from the four subjects into highly correlated low-dimensional components, as in FMCCA, while having these components also explain the variance of the BOLD signal changes well, as in FPCA.

Figure 1. The mean predictor weights obtained from the proposed approach under α = .01. A: the mean predictor weights for each load condition on the first component. B: the mean predictor weights for each load condition on the second component. In both panels, the horizontal axis is time (seconds, 0-25), the vertical axis is the mean predictor weight (×10^-3), and separate curves distinguish the 2-, 4-, 6-, and 8-letter load conditions.

To obtain a fair compromise between the two techniques, we needed to downgrade the influence of the FPCA term on the solution dramatically. Thus, we decided to set α = .01, by which the influences of the FPCA and FMCCA terms on the final solution could be well balanced.

It is well known that BOLD signals can reflect many sources of biological activity that have nothing to do with neural activity, including physiological reactions of the respiratory or cardiovascular system (Huettel, Song, & McCarthy, 2009). Thus, we computed so-called predictor weights (Hunter & Takane, 2002) to check whether the components extracted by the proposed approach were biologically plausible and relevant to the experimental manipulations in the working memory task (Metzak et al., 2011a, b). The predictor weights are obtained by regressing each estimated canonical weight function on the design matrix. According to the design matrix constructed for this study, the HR function reflecting changes in neural activity was expected to peak between the second and fifth poststimulus scans (Metzak et al., 2011a). Given that TR was 3 s in the study, the peak HR should occur between 8 and 15 s in the 24-s cycle of a trial for each memory load condition.

Panels A and B in Figure 1 display the mean predictor weight values (averaged over the subjects) of the first and second components, respectively. The mean predictor weights in panel A show the expected pattern of changes in HR: for each memory load condition, the mean predictor weights peaked between 8 and 15 s after trial onset, suggesting that the BOLD signal changes captured by the first component were related to the experimental manipulations. Conversely, the mean predictor weights in panel B did not exhibit the expected pattern, indicating that the BOLD signal changes from the second component were not directly attributable to the experimental conditions. This was also the case for the subsequent components (i.e., Q > 2). Therefore, we concentrate on interpreting the first-dimensional solution.

Figure 2. Five slice images constructed from the object scores of voxels obtained based on the first-dimensional components. The dominant 5% of the object scores are displayed, with positive scores in red and yellow and negative scores in blue and green (color figure online).

Figure 2 presents five slice images constructed from the dominant 5% of the first-dimensional object scores of the entire set of voxels. The sagittal slice at the right end of the figure shows which axial plane of section was chosen for display. These images represent the functional brain network, displaying functionally connected activated regions common to the four subjects while performing the working memory task. Previous studies on working memory have reported evidence for two different connected sets of brain regions, also known as brain networks: a task-positive network, which involves incremental activity in fronto-parietal brain regions, and a task-negative network, which involves suppressed activity in ventro-medial brain regions (Fox et al., 2005; Metzak et al., 2011a). In Figure 2, positive object scores are displayed in red to yellow, whereas negative scores are in blue to green. Although the task-negative network is more prominently identified among the four subjects, the result supports the co-existence of both task-positive and task-negative networks, as hypothesized. The task-positive network involves increased activation in regions around the anterior cingulate and parietal lobe. The task-negative network involves decreased activation in brain regions such as the medial frontal cortex, bilateral middle/inferior temporal cortices, ventral anterior cingulate cortex, right fusiform gyrus, posterior cingulate cortex, and bilateral angular gyri. The regions identified by our analysis were consistent with those found in previous studies of working memory (Metzak et al., 2011a).

For comparison, we also analyzed the data by applying the proposed approach under α = 1. This case is equivalent to applying FPCA alone to extract components that explain the most variance of the functional data. Panels A and B in Figure 3 display the mean predictor weights for the first and second components obtained in this case.

Figure 3. The mean predictor weights obtained from the proposed approach under α = 1. A: the mean predictor weights for each load condition on the first component. B: the mean predictor weights for each load condition on the second component. In both panels, the horizontal axis is time (seconds, 0-25), the vertical axis is the mean predictor weight (×10^-3), and separate curves distinguish the 2-, 4-, 6-, and 8-letter load conditions.

The mean predictor weights for both the first and second components did not show the expected HR patterns over the poststimulus time points, suggesting that neither component was directly related to the experimental manipulations. Thus, interpreting these components from a substantive perspective is unlikely to be meaningful. This shows that, in this example, applying FPCA and FMCCA simultaneously within the framework of the proposed approach resulted in solutions that were substantively more interpretable.

4. Concluding Remarks

We proposed a unified approach to two data reduction techniques for functional data: FPCA and FMCCA. This approach permits using FPCA and FMCCA in an integrated and complementary manner, so that the components obtained from each set of functional data are maximally correlated


as well as explain the variance of the data well. The empirical usefulness of the approach was demonstrated through analyses of both simulated and real data. The simulation study showed that the approach recovered parameters sufficiently well. The proposed approach was also applied to the analysis of fMRI data, where it was able to identify brain regions, or functional brain networks, that were commonly activated across the four subjects during a working memory task. These activated regions signified both task-positive and task-negative networks, which is consistent with previous studies.

Despite these technical and empirical implications, the proposed approach can be further improved in two directions. First, we may combine it with a clustering method, such as non-overlapping k-means (MacQueen, 1967; Hartigan & Wong, 1979) or fuzzy c-means (Bezdek, 1974; Dunn, 1974), to investigate whether heterogeneous subgroups of a population exist that show distinct characteristics. By combining the approach with a clustering method, we can estimate cluster memberships of individuals as well as cluster-specific parameter functions, relaxing the assumption that all individuals are sampled from a single homogeneous population. Such an extension may be further generalized by allowing cluster memberships to be functional, especially when researchers are interested in examining whether the membership of an individual changes between clusters over a continuum. Second, the proposed approach may be extended for modeling functions that vary over two different continua, such as time and space (e.g., Ruiz-Medina, 2012). At present, the proposed approach assumes that each sample of functions varies over a single argument, i.e., time. As a result, the present application to fMRI data required a pre-processing step for dealing with spatial variation in the data through SPM2, independently of the proposed approach. An extension of the approach to multidimensional arguments may allow finding a smooth underlying function that takes spatial and temporal dependence into account simultaneously. A few methods have been developed to smooth data over spatial dimensions, using bivariate thin-plate splines (e.g., Sibson & Stone, 1991; Hancock & Hutchinson, 2006), kriging (e.g., Giraldo, Delicado, & Mateu, 2011), or finite element L-splines (Ramsay, 2002). Moreover, Marra, Miller, and Zanin (2012) introduced so-called tensor product smoothing, which combines spline basis functions for both temporal and spatial dimensions. We may adopt a similar strategy for the proposed approach.

References

Bezdek, J. C. (1974). Numerical taxonomy with fuzzy sets. Journal of Mathematical Biology, 1(1), 57-71.
Cairo, T. A., Woodward, T. S., & Ngan, E. T. C. (2006). Decreased encoding efficiency in schizophrenia. Biological Psychiatry, 59(8), 740-746.
Di, C.-Z., Crainiceanu, C. M., Caffo, B. S., & Punjabi, N. M. (2009). Multilevel functional principal component analysis. The Annals of Applied Statistics, 3(1), 458.
Dunn, J. C. (1974). A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. Journal of Cybernetics, 3, 32-57.
Escabias, M., Aguilera, A. M., & Valderrama, M. J. (2004). Principal component estimation of functional logistic regression: Discussion of two different approaches. Journal of Nonparametric Statistics, 16(3-4), 365-384.
Fox, M. D., Snyder, A. Z., Vincent, J. L., Corbetta, M., Van Essen, D. C., & Raichle, M. E. (2005). The human brain is intrinsically organized into dynamic, anticorrelated functional networks. Proceedings of the National Academy of Sciences of the United States of America, 102, 9673-9678.
Giraldo, R., Delicado, P., & Mateu, J. (2011). Ordinary kriging for function-valued spatial data. Environmental and Ecological Statistics, 18(3), 411-426.
Goutte, C., Nielsen, F. A., & Hansen, L. K. (2000). Modeling the hemodynamic response in fMRI using smooth FIR filters. IEEE Transactions on Medical Imaging, 19, 1188-1201.
Hancock, P. A., & Hutchinson, M. F. (2006). Spatial interpolation of large climate data sets using bivariate thin plate smoothing splines. Environmental Modelling and Software, 21(12), 1684-1694.
Hartigan, J. A., & Wong, M. A. (1979). Algorithm AS 136: A k-means clustering algorithm. Applied Statistics, 28(1), 100-108.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: Data mining, inference, and prediction (2nd ed.). New York: Springer.
He, G., Müller, H. G., & Wang, J. L. (2003). Functional canonical analysis for square integrable stochastic processes. Journal of Multivariate Analysis, 85(1), 54-77.
Huettel, S. A., Song, A. W., & McCarthy, G. (2009). Functional magnetic resonance imaging. Massachusetts: Sinauer.
Hunter, M. A., & Takane, Y. (2002). Constrained principal component analysis: Various applications. Journal of Educational and Behavioral Statistics, 27(2), 105-145.
Hwang, H., Jung, K., Takane, Y., & Woodward, T. S. (2012a). Functional multiple-set canonical correlation analysis. Psychometrika, 77(1), 48-64.
Hwang, H., Jung, K., Takane, Y., & Woodward, T. S. (2013). A unified approach to multiple-set canonical correlation analysis and principal components analysis. British Journal of Mathematical and Statistical Psychology, 66(2), 308-321.
Hwang, H., Suk, H. W., Lee, J.-H., Moskowitz, D. S., & Lim, J. (2012b). Functional extended redundancy analysis. Psychometrika, 77(3), 524-542.
Jackson, I., & Sirois, S. (2009). Infant cognition: Going full factorial with pupil dilation. Developmental Science, 12(4), 670-679.
Kneip, A., & Utikal, K. J. (2001). Inference for density families using functional principal component analysis. Journal of the American Statistical Association, 96(454), 519-542.
Leurgans, S. E., Moyeed, R. A., & Silverman, B. W. (1993). Canonical correlation analysis when the data are curves. Journal of the Royal Statistical Society, Series B (Methodological), 55, 725-740.
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability (pp. 281-297).
Marra, G., Miller, D. L., & Zanin, L. (2012). Modelling the spatiotemporal distribution of the incidence of resident foreign population. Statistica Neerlandica, 66(2), 133-160.
Metzak, P. D., Feredoes, E., Takane, Y., Wang, L., Weinstein, S., Cairo, T. A., et al. (2011a). Constrained principal component analysis reveals functionally connected load-dependent networks involved in multiple stages of working memory. Human Brain Mapping, 32(6), 856-871.
Metzak, P. D., Riley, J. D., Wang, L., Whitman, J. C., Ngan, E. T. C., & Woodward, T. S. (2011b). Decreased efficiency of task-positive and task-negative networks during working memory in schizophrenia. Schizophrenia Bulletin, 38(4), 803-813.
Mishra, S. K. (2009). Representation-constrained canonical correlation analysis: A hybridization of canonical correlation and principal component analyses. Journal of Applied Economic Sciences (JAES), 7, 115-124.
Ramsay, J. O., & Dalzell, C. J. (1991). Some tools for functional data analysis. Journal of the Royal Statistical Society, Series B (Methodological), 53, 539-572.
Ramsay, J. O., & Silverman, B. W. (2005). Functional data analysis (2nd ed.). New York: Springer.
Ramsay, T. (2002). Spline smoothing over difficult regions. Journal of the Royal Statistical Society, 64(2), 307-319.
Rice, J. A., & Silverman, B. W. (1991). Estimating the mean and covariance structure nonparametrically when the data are curves. Journal of the Royal Statistical Society, 53, 233-243.
Ruiz-Medina, M. D. (2012). New challenges in spatial and spatiotemporal functional statistics for high-dimensional data. Spatial Statistics, 1, 82-91.
Sibson, R., & Stone, G. (1991). Computation of thin-plate splines. SIAM Journal on Scientific and Statistical Computing, 12(6), 1304-1313.
Tan, T., Choi, J. Y., & Hwang, H. (2015). Fuzzy clusterwise functional extended redundancy analysis. Behaviormetrika, 42(1), 37-62.
Tian, T. S. (2010). Functional data analysis in brain imaging studies. Frontiers in Psychology, 1, 1-11.
Vines, B. W., Nuzzo, R. L., & Levitin, D. J. (2005). Analyzing temporal dynamics in music: Differential calculus, physics, and functional data analysis techniques. Music Perception, 23, 137-152.
Vines, B. W., Wanderley, M. M., Krumhansl, C. L., Nuzzo, R. L., & Levitin, D. J. (2004). Performance gestures of musicians: What structural and emotional information do they convey? In Gesture-based communication in human-computer interaction (pp. 468-478). Berlin: Springer.
Viviani, R., Grön, G., & Spitzer, M. (2005). Functional principal component analysis of fMRI data. Human Brain Mapping, 24(2), 109-129.
Yao, F., Müller, H.-G., & Wang, J.-L. (2005). Functional data analysis for sparse longitudinal data. Journal of the American Statistical Association, 100(470), 577-590.

Manuscript Received: 5 MAR 2015
