Development of ocean color algorithms for estimating chlorophyll-a concentrations and inherent optical properties using gene expression programming (GEP) Chih-Hua Chang* Department of Environmental Engineering, National Cheng Kung University, Tainan 701, Taiwan * [email protected]

Abstract: This paper proposes new inversion algorithms for the estimation of Chlorophyll-a concentration (Chla) and the ocean’s inherent optical properties (IOPs) from the measurement of remote sensing reflectance (Rrs). With in situ data from the NASA bio-optical marine algorithm data set (NOMAD), inversion algorithms were developed by the novel gene expression programming (GEP) approach, which creates, manipulates and selects the most appropriate tree-structured functions based on evolutionary computing. The limitations and validity of the proposed algorithms are evaluated by simulated Rrs spectra with respect to NOMAD, and a closure test for IOPs obtained at a single reference wavelength. The application of GEP-derived algorithms is validated against in situ, synthetic and satellite match-up data sets compiled by NASA and the International Ocean Color Coordinate Group (IOCCG). The new algorithms are able to provide Chla and IOPs retrievals to those derived by other state-of-the-art regression approaches and obtained with the semi- and quasi-analytical algorithms, respectively. In practice, there are no significant differences between GEP, support vector regression, and multilayer perceptron model in terms of the overall performance. The GEP-derived algorithms are successfully applied in processing the images taken by the Sea Wide Field-of-view Sensor (SeaWiFS), generate Chla and IOPs maps which show better details of developing algal blooms, and give more information on the distribution of water constituents between different water bodies. ©2015 Optical Society of America OCIS codes: (010.4450) Oceanic optics; (100.3190) Inverse problems; (010.0280) Remote sensing and sensors.

References and links 1. 2. 3. 4. 5. 6. 7.

T. Platt, “Primary production of the ocean water column as a function of surface light intensity: algorithms for remote sensing,” Deep Sea Res. Part I Oceanogr. Res. Pap. 33(2), 149–163 (1986). Z. P. Lee, K. L. Carder, R. G. Steward, T. G. Peacock, C. O. Davis, and J. L. Mueller, “Remote sensing reflectance and inherent optical properties of oceanic waters derived from above-water measurements,” Proc. SPIE 2963, 160–166 (1997). A. Morel, “Optical modeling of the upper ocean in relation to its biogenous matter content (case I waters).” J. Geophys. Res.-Oceans 93(C9), 10749–10768 (1988). S. Sathyendranath, Remote Sensing of Ocean Colour in Coastal, and Other Optically-Complex, Waters (IOCCG, 2000) Z. P. Lee, K. L. Carder, and R. A. Arnone, “Deriving inherent optical properties from water color: a multiband quasi-analytical algorithm for optically deep waters,” Appl. Opt. 41(27), 5755–5772 (2002). K. L. Carder, F. R. Chen, Z. P. Lee, S. K. Hawes, and D. Kamykowski, “Semianalytic Moderate-Resolution Imaging Spectrometer algorithms for chlorophyll a and absorption with bio-optical domains based on nitratedepletion temperatures,” J. Geophys. Res.- Oceans 104(C3), 5403–5421 (1999). S. A. Garver and D. A. Siegel, “Inherent optical property inversion of ocean color spectra and its biogeochemical interpretation. 1. Time series from the Sargasso Sea,” J. Geophys. Res.- Oceans 102(C8), 18607–18625 (1997).

#227036 - $15.00 USD (C) 2015 OSA

Received 18 Nov 2014; revised 15 Feb 2015; accepted 16 Feb 2015; published 23 Feb 2015 9 Mar 2015 | Vol. 23, No. 5 | DOI:10.1364/OE.23.005417 | OPTICS EXPRESS 5417

8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31. 32. 33. 34. 35. 36. 37. 38.

C. C. Liu and R. L. Miller, “A spectrum matching method for estimating the chlorophyll-a concentration, CDOM ratio and backscatter fraction from remote sensing of ocean color,” Can. J. Rem. Sens. 34(4), 343–355 (2008). F. Melin, “Global distribution of the random uncertainty associated with satellite-derived Chla,” IEEE Geosci. Remote Sens. 7(1), 220–224 (2010). Z. Lee, R. Arnone, C. Hu, P. J. Werdell, and B. Lubac, “Uncertainties of optical parameters and their propagations in an analytical ocean color inversion algorithm,” Appl. Opt. 49(3), 369–381 (2010). D. A. Toole, D. A. Siegel, D. W. Menzies, M. J. Neumann, and R. C. Smith, “Remote-sensing reflectance determinations in the coastal ocean environment: impact of instrumental characteristics and environmental variability,” Appl. Opt. 39(3), 456–469 (2000). J. Chen, W. Quan, T. Cui, Q. Song, and C. Lin, “Remote sensing of absorption and scattering coefficient using neural network model: Development, validation, and application,” Remote Sens. Environ. 149, 213–226 (2014). P. Cipollini, G. Corsini, M. Diani, and R. Grasso, “Retrieval of sea water optically active parameters from hyperspectral data by means of generalized radial basis function neural networks,” IEEE Trans. Geosci. Remote 39(7), 1508–1524 (2001). D. D’Alimonte and G. Zibordi, “Phytoplankton determination in an optically complex coastal region using a multilayer perceptron neural network,” IEEE Trans. Geosci. Remote 41(12), 2861–2868 (2003). I. Ioannou, A. Gilerson, B. Gross, F. Moshary, and S. Ahmed, “Neural network approach to retrieve the inherent optical properties of the ocean from observations of MODIS,” Appl. Opt. 50(19), 3168–3186 (2011). C. Jamet, H. Loisel, and D. Dessailly, “Retrieval of the spectral diffuse attenuation coefficient K-d(λ) in open and coastal ocean waters using a neural network inversion,” J. Geophys. Res.- Oceans 117(C10), C10023 (2012). L. E. Keiner and X. H. Yan, “A neural network model for estimating sea surface chlorophyll and sediments from thematic mapper imagery,” Remote Sens. Environ. 66(2), 153–165 (1998). G. Camps-Valls, L. Gomez-Chova, J. Munoz-Mari, J. Vila-Frances, J. Amoros-Lopez, and J. Calpe-Maravilla, “Retrieval of oceanic chlorophyll concentration with relevance vector machines,” Remote Sens. Environ. 105(1), 23–33 (2006). H. G. Zhan, P. Shi, and C. Q. Chen, “Retrieval of oceanic chlorophyll concentration using support vector machines,” IEEE Trans. Geosci. Remote 41(12), 2947–2951 (2003). M. S. Salama and F. Shen, “Stochastic inversion of ocean color data using the cross-entropy method,” Opt. Express 18(2), 479–499 (2010). H. T. Shahraiyni, S. B. Shouraki, F. Fell, M. Schaale, J. Fischer, A. Tavakoli, R. Preusker, M. Tajrishy, M. Vatandoust, and H. Khodaparast, “Application of the active learning method to the retrieval of pigment from spectral remote sensing reflectance data,” Int. J. Remote Sens. 30(4), 1045–1065 (2009). C. H. Chang, C. C. Liu, and C. G. Wen, “Integrating semianalytical and genetic algorithms to retrieve the constituents of water bodies from remote sensing of ocean color,” Opt. Express 15(2), 252–265 (2007). H. G. Zhan, Z. P. Lee, P. Shi, C. Q. Chen, and K. L. Carder, “Retrieval of water optical properties for optically deep waters using genetic algorithms,” IEEE Trans. Geosci. Remote 41(5), 1123–1128 (2003). D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning (Addison-Wesley, 1989). K. Miettinen, P. Neittaanmaki, M. M. Makela, and J. Periaux, Evolutionary Algorithms in Engineering and Computer Sciences (John Wiley & Sons, 1999). J. R. Koza, Genetic Programming: On the Programming of Computers by Means of Natural Selection (MIT Press, 1992). C. Fonlupt, “Solving the ocean color problem using a genetic programming approach,” Appl. Soft Comput. 1(1), 63–72 (2001). S. Tang, C. Michel, and P. Larouche, “Development of an explicit algorithm for remote sensing estimation of chlorophyll a using symbolic regression,” Opt. Lett. 37(15), 3165–3167 (2012). C. Ferreira, “Gene expression programming: a new adaptive algorithm for solving problems,” Complex Syst. 13, 87–129 (2001). M. O'Neill and C. Ryan, “Grammatical Evolution: A steady state approach,” in Late Breaking Papers at the Genetic Programming 1998 Conference, J. R. Koza, ed., (Stanford University Bookstore, 1998), pp. 180–185. P. J. Werdell and S. W. Bailey, “An improved in-situ bio-optical data set for ocean color algorithm development and satellite data product validation,” Remote Sens. Environ. 98(1), 122–140 (2005). Z. P. Lee, Remote Sensing of Inherent Optical Properties: Fundamentals, Tests of Algorithms, and Applications (IOCCG, 2006). D. Odermatt, A. Gitelson, V. E. Brando, and M. Schaepman, “Review of constituent retrieval in optically deep and complex waters from satellite imagery,” Remote Sens. Environ. 118, 116–126 (2012). Z. P. Lee, “Models, parameters, and approaches that used to generate wide range of absorption and backscattering spectra,” http://www.ioccg.org/groups/lee_data.pdf, accessed 2/16/15. S. W. Bailey and P. J. Werdell, “A multi-sensor approach for the on-orbit validation of ocean color satellite data products,” Remote Sens. Environ. 102(1-2), 12–23 (2006). M. Z. Hashmi, A. Y. Shamseldin, and B. W. Melville, “Statistical downscaling of watershed precipitation using Gene Expression Programming (GEP),” Environ. Model. Softw. 26(12), 1639–1646 (2011). A. Efstratiadis and D. Koutsoyiannis, “One decade of multi-objective calibration approaches in hydrological modelling: a review,” Hydrol. Sci. J. 55(1), 58–78 (2010). E. G. Bekele and J. W. Nicklow, “Multi-objective automatic calibration of SWAT using NSGA-II,” J. Hydrol. (Amst.) 341(3-4), 165–176 (2007).

#227036 - $15.00 USD (C) 2015 OSA

Received 18 Nov 2014; revised 15 Feb 2015; accepted 16 Feb 2015; published 23 Feb 2015 9 Mar 2015 | Vol. 23, No. 5 | DOI:10.1364/OE.23.005417 | OPTICS EXPRESS 5418

39. A. H. Barnard, J. R. V. Zaneveld, and W. S. Pegau, “In situ determination of the remotely sensed reflectance and the absorption coefficient: closure and inversion,” Appl. Opt. 38(24), 5108–5117 (1999). 40. Z. P. Lee, K. L. Carder, R. G. Steward, T. G. Peacock, C. O. Davis, and J. S. Patch, “An empirical algorithm for light absorption by ocean water based on color,” J. Geophys. Res.- Oceans 103(C12), 27967–27978 (1998). 41. J. E. O’Reilly, S. Maritorena, B. G. Mitchell, D. A. Siegel, K. L. Carder, S. A. Garver, M. Kahru, and C. McClain, “Ocean color chlorophyll algorithms for SeaWiFS,” J. Geophys. Res.- Oceans 103(C11), 24937–24953 (1998). 42. T. S. Moore, J. W. Campbell, and M. D. Dowell, “A class-based approach to characterizing and mapping the uncertainty of the MODIS ocean chlorophyll product,” Remote Sens. Environ. 113(11), 2424–2430 (2009). 43. Gepsoft, “Getting Started with Regression,” http://www.gepsoft.com/tutorials.htm, assessed 2/15/15. 44. R. Doerffer and H. Schiller, “The MERIS case 2 water algorithm,” Int. J. Remote Sens. 28(3-4), 517–535 (2007). 45. F. E. Hoge and P. E. Lyon, “Satellite retrieval of inherent optical properties by linear matrix inversion of oceanic radiance models: An analysis of model and radiance measurement errors,” J. Geophys. Res.- Oceans 101(C7), 16631–16648 (1996). 46. S. Maritorena, D. A. Siegel, and A. R. Peterson, “Optimization of a semianalytical ocean color model for globalscale applications,” Appl. Opt. 41(15), 2705–2714 (2002). 47. S. Sathyendranath and T. Platt, “Analytic model of ocean color,” Appl. Opt. 36(12), 2620–2629 (1997). 48. C. Hu, Z. Lee, and B. Franz, “Chlorophyll a algorithms for oligotrophic oceans: A novel approach based on three-band reflectance difference,” J. Geophys. Res.- Oceans 117(C1), C01011 (2012). 49. W. Shi and M. H. Wang, “Detection of turbid waters and absorbing aerosols for the MODIS ocean color data processing,” Remote Sens. Environ. 110(2), 149–161 (2007). 50. R. M. Pope and E. S. Fry, “Absorption spectrum (380-700 nm) of pure water. II. Integrating cavity measurements,” Appl. Opt. 36(33), 8710–8723 (1997). 51. G. C. Feldman, “Feature of the Day (May 4, 2002), Ocean Color Image Gallery,” http://oceancolor.gsfc.nasa.gov/cgi/image_archive.cgi, assessed 2/15/15. 52. G. Zibordi, F. Melin, and J. F. Berthon, “Comparison of SeaWiFS, MODIS and MERIS radiometric products at a coastal site,” Geophys. Res. Lett. 33(6), L06617 (2006). 53. M. Kotanchek, G. Smits, and E. Vladislavleva, “Trustable symbolic regression models: using ensembles, interval arithmetic and pareto fronts to develop robust and trust-aware models,” in Genetic Programming Theory and Practice V, R. Riolo, T. Soule, and B. Worzel, eds. (Springer, 2008), pp. 201–220.

1. Introduction The application of ocean-color remote sensing is promising with regard to its ability to allow for a high spatial and temporal assessment of important physical, chemical, and biological parameters of the global ocean from space [1]. Deriving optical properties associated with these essential parameters from remotely sensed radiance above the sea surface, is known as the “inverse problem” in the field of ocean optics. Retrieval algorithms or inversion algorithms/models are applied to solve the inverse problem, which use inputs representing “ocean-color” [2] (Rrs, the ratio between the water-leaving radiance and downwelling irradiance, sr−1) measured by a satellite sensor at a limited number of spectral bands in the visible range. Typical outputs are the total phytoplankton content in terms of Chlorophyll-a concentration (Chla); and ocean inherent optical properties (IOPs) [3], such as the absorption and back-scattering characteristics for phytoplankton, detritus (non-algal particles) and colored dissolved organic matters (CDOM or gelbstoff). Estimation of the bio-optical properties through solving the inverse problem is of importance in understanding the related environmental dynamics at both local and global scales. The various inversion algorithms can generally be classified into two different approaches based on whether a forward-model that physically describes the bio-optical relationships and radiative-transfer processes is or is not involved in the inversion process [4]. First, modelbased approaches solve the inverse problem by deriving algebraic results [5], or approximating suitable solutions using optimization or spectral matching methods, such as the Levenberg-Marquart approach [6], and nonlinear least-square schemes [7, 8]. If a representative model is available, the major benefit obtained from such approaches is that few training data are required to develop the algorithm. In order to obtain algebraic solutions or improve the efficiency of optimization, it is commonly necessary to simplify the forwardmodel, such as by the use of semi-analytical and quasi-analytical approximations [5, 6]. However, simplification may create unrepresentative models and lead to incorrect results, particularly when dealing with optically complex waters [4]. Second, without the use of a #227036 - $15.00 USD (C) 2015 OSA

Received 18 Nov 2014; revised 15 Feb 2015; accepted 16 Feb 2015; published 23 Feb 2015 9 Mar 2015 | Vol. 23, No. 5 | DOI:10.1364/OE.23.005417 | OPTICS EXPRESS 5419

predefined model, empirical or regression approaches can directly relate the inputs to the outputs by training with a set of known samples. Although it is indicated that the model-based approach may have a higher level of complexity and perhaps better accuracy for Case-2 waters [4], current operational algorithms used for mapping global ocean-color products are developed using a data-driven or empirical approach, mainly due to higher model transferability and lower computational load. The IOP products were mostly derived by model-based approaches or empirical algorithms developed by data generated from mechanistic models. The performance of empirical methods is highly dependent on the level of representativeness and complexity of the training data. Because the interactions between different IOPs are commonly not linear [4], bio-optical models are not globally the same [9], and bio-optical measurements often return imprecise and uncertain data [10, 11], the concepts of a universally applicable model or representative data are problematic in real oceans. This has led to the use of soft-computing techniques in the ocean-color community, to develop more reliable and robust algorithms to deal with such nonlinear systems. For example, recent work has applied artificial neural networks (ANNs) [12–17], support vector regression (SVR) [18, 19], cross-entropy method [20], active learning method [21], and evolutionary algorithms (EAs) [22, 23]. EAs are inspired by natural processes that drive evolution [24], such as inheritance, variation and the Darwinian struggle for survival. EAs have been widely used in many disciplines to solve complex problems, and are effective, robust and easy-to-implement methods that can search for global solutions to optimization problems [25]. In general, EAs encompass all adaptive and computational models of natural evolutionary systems [25], such as genetic algorithms (GAs) [24] and genetic programming (GP) techniques [26]. GAs are genotype-algorithms, in which the individuals are fixed-length linear strings and function as genomes (chromosomes). A GA works as a nonlinear optimization method in the modelbased approach for solving the inverse problem [22, 23], in which candidate solutions are encoded as chromosomes and manipulated by natural selection operators, e.g., reproduction, crossover and mutation, and an individual’s fitness for survival is evaluated in decoded forms (using real values) by the predefined forward model and objective function. In contrast, a GP is a phenotype-algorithm, where the phenotypes of individuals are represented and manipulated as nonlinear entities of different sizes and shapes, commonly a tree-structured function [26]. Solving the inverse problem with GP is also known as symbolic regression [27, 28], in which the most appropriate backward model for linking inputs and outputs is developed by directly optimizing a population of tree-structured functions through the related evolutionary processes. GPs have had significant success in creating innovative (i.e., commonly improved) functions and equations for industrial applications. However, in a Kozastyle GP [26], both the searching operators and the fitness functions are based on the phenotype. Therefore, without the use of a genotype, illegal or invalid tree-structures may often be created in the selection process, which limits the optimization efficiency and algorithm transferability [29]. Several genotype-phenotype GP techniques which can manipulate individuals genetically as GA and can maintain the functional complexity of the phenotype, have recently been proposed, such as the grammatical evolution [30] and gene expression programming (GEP) techniques [29]. A genotype-phenotype system makes a separation between an organism’s genotype and phenotype, which means that it is more compatible with how genetics actually operates in nature, and is more likely to provide reasonable, efficient and simple algorithms to solve the inverse problem. In GEP, the individuals are encoded as GAs, which are then translated into tree-structured functions, the gene expression trees (ETs), by a functional relationship between chromosomes and phenotype expressions. With the unequivocal translation system [29], any modification made in the genome can always result in correct ETs. However to date no research has been performed using the GEP to develop ocean-color retrieval algorithms.

#227036 - $15.00 USD (C) 2015 OSA

Received 18 Nov 2014; revised 15 Feb 2015; accepted 16 Feb 2015; published 23 Feb 2015 9 Mar 2015 | Vol. 23, No. 5 | DOI:10.1364/OE.23.005417 | OPTICS EXPRESS 5420

With an attempt to gain the computational advantages from both the evolutionary algorithm and empirical approaches, this work develops new inversion algorithms for the retrieval of Chla and IOPs from remote sensing of ocean color using GEP. First, version 2 of the NASA bio-Optical Marine Algorithm Data set (NOMAD) [31] is examined to summarize all the available data points of Rrs(λ), Chla and IOPs. Second, available data pairs are sampled into two subsets: a training data set for the creation of inversion models, and a validation data set that is used to check the performance of the model. Third, the GEP technique is applied to develop six ocean-color inversion algorithms for the estimation of (1) Chla (mg/m3); (2) absorption coefficients at 443nm (m−1), including phytoplankton [aph(443)], detritus-gelbstoff [adg(443)], particulate matters [ap(443)], and total optically-active-substances [a(443)]; and (3) the particle backscattering coefficients at 555nm (m−1) [bbp(555)]. Finally, the application of these new models is validated against the in situ, synthetic, and Sea Wide Field-of-view Sensor (SeaWiFS) match-up data sets compiled by NASA and International Ocean Color Coordinate Group (IOCCG) [32]. Most of the in situ data used in this study comes from locations relatively close to the coast (some are Case 2 waters), and the IOCCG synthetic data set comprises a wide range of parameters characterizing the global ocean (Case 1 waters). Compared to existing empirical- and model-based approaches, the inversion algorithms proposed in this study are clearly expressed functions which have higher transferability and can be readily applied, and they are capable of: (1) providing better retrievals for in situ data; (2) being used to process the SeaWiFS data at regional and global scales; and (3) generating ocean-color products which give more information on the distribution of water constituents. 2. Data sets NOMAD is a collection of in situ coincident observation data of radiometric, phytoplankton pigment concentrations and bio-optical properties from NASA’s SeaWiFS Bio-optical Archive and Storage System (SeaBASS). NOMAD is also the most popular and representative publicly available data set used to support algorithm development and satellite product validation, having been cited more than 100 times in the ISI Web of Knowledge since 2005. The latest version of the NOMAD data set (updated on 18 July, 2008), referred as NOMADv2a, was used in this study for the development of inversion models through GEP. The IOP data with coincident ground measured values of Rrs(λi) available at the first five SeaWiFS bands (λi = 411, 443, 489, 510 and 555nm) are used in this work, resulting in the following numbers of corresponding data points: HPLC measured Chla is 1,210, aph(443) is 897, adg(443) is 788, ap(443) is 932, a(443) is 799 and bbp(555) is 332. Although studies have shown that the Rrs measured at 670nm is very important for inversion of coastal waters [33], it was not used in this study when developing the algorithm, because there is an insufficient number of accurate measurements in NOMADv2a. The frequency distributions of the six selected data pairs are shown in Fig. 1. The NOMADv2a data set contains some near coastal samples, but the majority of the data is distributed in the lower end of each histogram, representing clear waters. The six data pairs were sorted by increasing the levels of Chla and IOPs, and afterwards the odd- and even-numbered data were used separately as training and validation samples for the development of GEP inversion models. A limited number of bbp(555) observations were available, therefore an additional 56 data points were allocated to the training samples from the validation set, and thus two-thirds of the data were used for training. As shown in Table 1, there were 222 to 605 data points available to train the GEP algorithms. The IOCCG prepared a synthetic and an in situ data set (hereafter referred as IOCCG synthetic and in situ data set) containing IOPs and the corresponding apparent optical properties (AOPs) for its Report Number 5 [32], in order to evaluate and compare the performance of a variety of retrieval algorithms commonly used in remote sensing practices. Note that the uncertainties in IOP measurements were designed into the IOCCG synthetic data by mapping one Chla to 25 different sets of IOPs based on theory and field data [34],

#227036 - $15.00 USD (C) 2015 OSA

Received 18 Nov 2014; revised 15 Feb 2015; accepted 16 Feb 2015; published 23 Feb 2015 9 Mar 2015 | Vol. 23, No. 5 | DOI:10.1364/OE.23.005417 | OPTICS EXPRESS 5421

resulting in 500 IOPs that are generated from 20 levels of Chla in the range of 0.03–30 mg/m3, and the corresponding Rrs(λ) simulated by Hydrolight for the sun at 30° and 60° from zenith, respectively. The IOCCG in situ data set comprises 656 cases including Chla, a(443), adg(443), aph(443) and Rrs(λ), which also originated from the SeaBASS data.

Fig. 1. The frequency distributions of six selected IOP data points from the NOMADv2a data set. The variations of coincident observed Rrs(λ) are illustrated as boxplots and labeled on subaxes. Table 1. Statistical summary of the bio-optical data for training and validation of the GEP-derived inversion models Y Chla aph(443) adg(443) ap(443) a(443) bbp(555)*

Subset Training Validation Training Validation Training Validation Training Validation Training Validation Training Validation

N 605 605 446 445 394 394 466 466 400 399 222 110

Range 0.017–65.57 0.019–70.21 0.002–1.48 0.002–1.46 0.002-1.164 0.002-1.493 0.003–1.679 0.003–1.878 0.012–2.605 0.013–2.255 0.0012–0.0087 0.0013–0.0088

Average 2.472 2.531 0.074 0.073 0.105 0.107 0.098 0.100 0.208 0.205 0.0035 0.0036

s 5.94 6.25 0.142 0.136 0.17 0.18 0.190 0.199 0.341 0.328 1.602 1.553

Another subset of NOMAD, released earlier as version 1.3 on 19 Sept 2005 (referred to as NOMADv13), consists of coincident observations of Rrs(λ) from SeaWiFS imagery, and was used to ensure the inversion models are also able to process real satellite data. Due to difficulties in meeting the criteria of satellite “match-up” [35], the sample size of NOMADv13 is relatively small, where 268, 123, 111, 123,111 and 28 cases are available for on-orbit validation of the derived Chla, aph(443), adg(443), ap(443), a(443) and bbp(555), respectively. Overall, among the four data sets used, only the training samples (Table 1) in NOMADv2a are used to develop the inversion models, with the remainder used for validation. The IOCCG synthetic and in situ data sets are applied to evaluate and compare the performance of our inversion models and existing algorithms. NOMADv13 was used as satellite-matchup data for the validation of retrievals generated by GEP-derived inversion

#227036 - $15.00 USD (C) 2015 OSA

Received 18 Nov 2014; revised 15 Feb 2015; accepted 16 Feb 2015; published 23 Feb 2015 9 Mar 2015 | Vol. 23, No. 5 | DOI:10.1364/OE.23.005417 | OPTICS EXPRESS 5422

models, in which the inputs are the satellite observed Rrs(λ), rather than the ground measurements in other data sets. 3. Symbolic regression The ocean color inverse problem can be regarded as an issue of symbolic regression [26], which requires finding a function in symbolic form that fits the relationships among given data points that consist of Rrs(λ) observed at a number of wavebands as the predictor variables, and the corresponding measurements of an ocean-color product as the variable to be predicted. Symbolic regression is also known as error-driven-evolution [26]. In an oceancolor inverse problem, where the evolution of solutions identified by symbolic regression is driven by a quantifiable objective, namely fitness. The fitness evaluation function is commonly a measure of the deviation between the simulated and observed values. Compared to other conventional linear or nonlinear regression techniques, the major difference with symbolic regression is that no a priori or pre-defined functions/models are specified. In symbolic regression, the computational constraint is that all of the mathematical operators or logical expressions used in the function should be chosen from a pre-defined pool of mathematical expressions or symbols. In the first part of this section, the general procedures for solving the symbolic regression problem with GEP will be introduced. Particular attention will be given to demonstrate how the unequivocal translation system works. The steps and settings for the development of ocean-color inversion algorithms using the GEP symbolic regression approach will then be described in the second part. 3.1 GEP based symbolic regression GEP is a genotype-phenotype genetic algorithm, in which the genotype of each individual is represented as a linear symbolic string (chromosome) with one or more fixed-length genes, as shown in Fig. 2(a); and the inferred phenotype (ET) is expressed nonlinearly in diagram form, as shown in Fig. 2(b). The computational strength of GEP is that not only can the ET be read straightforwardly from the “easily manipulated” linear string, but that it can also represent a real mathematic function [Fig. 2(c)] without losing the computational complexity. The process of transferring genetic information from a gene into an ET is carried out by the socalled translation system, in which several rules are applied to prevent the forming of illegal ETs and ensure that all possible messages or changes made in the gene can be directly translated into an ET. The rules that govern this are briefly described, as follows: 1. The head region of a gene [Fig. 2(a)] can contain symbols that represent both the functional and terminal nodes in the inferred ET [Fig. 2(b)], whereas the tail region contains only terminals. For instance, the symbols “Q”, “*”, “M” and “+” in the head part [Fig. 2(a)] represent the square root, multiplication, maximum of the two, and addition function, respectively, and these are functional nodes. The symbols “a” and “b” represent the predictor variables or numerical constants, and these are terminal nodes [Fig. 2(a)]. 2. The length of the head region (h) in a chromosome is freely chosen, and the length of the tail (t) is then calculated as follows: t = h ( n − 1) + 1.

(1)

Where n is the maximum number of arguments that a predefined function or operator can take. In Fig. 2, h is 6 and n equals the number of arguments used in the “maximum of two” function (represented as “M”), and hence the tail length is 7. Consequently, the size and shape of an ET, as well as the created real function, depend on the number/type of functional symbols used in the head region. In this case, when the head region is full of mathematical expressions, each of which has an

#227036 - $15.00 USD (C) 2015 OSA

Received 18 Nov 2014; revised 15 Feb 2015; accepted 16 Feb 2015; published 23 Feb 2015 9 Mar 2015 | Vol. 23, No. 5 | DOI:10.1364/OE.23.005417 | OPTICS EXPRESS 5423

argument number of 2, then all of the noncoding region will be coded, and the size of ET reaches the maximum available number of 13. 3. Multigenic chromosomes are often used in GEP to increase the diversity of individuals, in which each gene of the chromosome represents a particular subunit of the related ET. Note that both the fitness values of sub-ETs and multi-subunit ETs will be evaluated during the evolutionary process. This is useful when both the size (complexity) and performance (fitness) of the identified function are of interest to developers. 4. Reproduction of the genotype of an individual with best performing fitness through genetic operators, i.e., replication, mutation, transposition and recombination (crossover), is an essential part of evolution. With the exception of replication, the common genetic operators used in GEP are illustrated in Fig. 3. Because of the linear-string structure, possible modifications made on phenotypes by the genetic operators can be directly translated into the phenotype without using special rules, as shown in Figs. 3(a) and 3(b).

Fig. 2. Example of different representations of an individual in GEP, (a) an example of GEP coded linear chromosome with one 13-digit gene, (b) gene expression tree (ET) of the example chromosome, and (c) the corresponding algebraic expression.

Fig. 3. Example of performing different genetic operations to the genotype of individuals with GEP and the corresponding changes in the phenotype, (a) an example of GEP one = point mutation, (b) an example of genetic transposition operation, and (c) the GEP recombination operation.

To solve symbolic regression problems, the GEP creates and optimizes regression models to the highest fitness value following the steps below:

#227036 - $15.00 USD (C) 2015 OSA

Received 18 Nov 2014; revised 15 Feb 2015; accepted 16 Feb 2015; published 23 Feb 2015 9 Mar 2015 | Vol. 23, No. 5 | DOI:10.1364/OE.23.005417 | OPTICS EXPRESS 5424

1. The GEP process begins by randomly generating a population of individuals (genotype). 2. Each individual, which is translated to one possible regression model (phenotype) by the translation system, can be any combination of predefined mathematical operators, logical expressions, predictor variables and numerical constants. 3. The probability for each individual to survive in the evolutionary process is assessed by a fitness evaluation function. 4. The second and subsequent generations are created through a reproduction process. In this process, an individual with a high fitness value has a greater opportunity of being selected and passing its genetic characteristics on to its offspring by duplicating or processing genetic operations. 5. In the final generation, the optimum regression model linking the predictor and predictand variables can be obtained. 3.2 Development of ocean-color retrieval algorithms using GEP The symbolic regression function for solving the ocean-color inverse problem can be defined as the following form: Yˆ = f ( d 0 , d1 , ..., d 4 ).

(2)

Where Yˆ is the (water constituent) bio-optical property predicted by the measured ocean color, and di (i = 0 to 4) is the Rrs(λi) measured at the first five SeaWiFS bands in the visible range, i.e., λi = 411, 443, 489, 510 and 555nm. By using the GEP approach, the process of finding the optimum function, as shown in Eq. (2), consists of the following steps: First, formulate the fitness functions to guide and constrain the direction of evolution. The most widely used evaluation functions in symbolic regression are the coefficient of determination (R2) and root-mean-square-error (RMSE) [29, 36], which tend to match several peak and higher values rather than fitting the overall distribution patterns [37]. In the contrast, the log-transformed RMSE (logRMSE), which places more emphasis on fitting the lower observed values [5, 38], can cover a greater portion of IOP data in NOMADv2a [Fig. 1], which is formulated as follows:  [log(Yˆ ) − log(Y )]  logRMSE =  N  

1 2

N

i

i =1

i

2

   .  

(3)

The evaluation of fitness is also subject to “constraints” to guide the individual towards making not only accurate but also sensible predictions. For example, the predicted IOP should always be a positive value. Therefore, the multi-objective fitness function used is formulated as: 1    − punish.  1 + logRMSE 

fitness = 1000 

(4)

Where punish is the number of predicted values that failed to fall in the range of observed Y (Table 1, calculated from both the training and validation samples). Second, set up the pre-defined pool of mathematical expressions or symbols used to form the regression models. A total of 13 mathematical functions are selected (Table 2) based on the recommendation from Ferreira [29] for symbolic regression problems, and a review of existing empirical algorithms [39–41]. Moore, et al. [42] indicated that the use of the Rrs(λ)

#227036 - $15.00 USD (C) 2015 OSA

Received 18 Nov 2014; revised 15 Feb 2015; accepted 16 Feb 2015; published 23 Feb 2015 9 Mar 2015 | Vol. 23, No. 5 | DOI:10.1364/OE.23.005417 | OPTICS EXPRESS 5425

pre-classification technique can better characterize the uncertainty of satellite data products, and it is also applicable to screening an appropriate algorithm for a specific optical water type. To implement this concept of pre-classification in the proposed inversion model using GEP, another six logical expressions are introduced in this study, as shown in Table 2, including the “maximum of two”, “maximum of three”, and four different “Greater Or Equal To” (GOE2) function sets. Third, create the genetic diversity that is needed for individuals to evolve, thus making it more likely that the fittest individual will be found. In GEP, the level of genetic diversification is controlled by various parameters describing the probability of specific genetic operations occurring, e.g., the rate of mutation and crossover. The initial settings of the genetic operation parameters are given in Table 2, and were also recommended by Ferreira [29] for optimum evolution. Finally, other settings for planning the regression model architecture, e.g., the head size, available number of numerical constants, number of genes (sub-ETs) per chromosome, and function linking sub-ETs, are based on the discussions in the forums hosted by Gepsoft [43], and determined by trial and error tests. All of the GEP settings are summarized in Table 2, where (i) x, y and z cannot only represent a predictor or a numerical constant, but also a functional node. For example, the addition function set “x+y” can be the addition of two predictors “d0+d1”, one predictor and one function set “d0+d2/d1”, and two function sets, and (ii) the function sets consist of only one variable x indicates that the number of arguments the function can take (arity) is 1. For example, the number of arities for x + y and max3(x,y,z) is 2 and 3, respectively. Table 2. Summary of the GEP symbolic regression settings for the development of oceancolor inversion models

Procedure

Function set

Settings Mathematical expressions: x + y, x-y, xy, x/y, 1/x, |x|, xy, 10x, ex, ln(x),

x , x

1 3

, x2

Logical expression: max2(x, y, z); max3(x, y, z);

x ≥ y , then x, else y; GOE2C: if x ≥ y , then x + y, else x-y; GOE2D: if x ≥ y , then xy, else x/y; GOE2E: if x ≥ y , then x + y, else xy.

GOE2A: if

Model architecture

Head size: 8 Max number of numerical constants in a gene (sub-ET): 8 Number of genes in a chromosome: 3 Function linking all genes (sub-ETs): Addition

Genetic operators

For one gene: Mutation rate: 0.00138 Inversion rate: 0.00546 Transposition rate: 0.00546 Recombination rate: 0.00277

General

Number of individuals in a generation: 30 Stop condition: no change has occurred within the past 3,000,000 generations

For a multigenic chromosome: Gene recombination rate: 0.00277 Gene transposition rate: 0.00277

4. Performance and model-closure evaluation To evaluate and compare the performance of the GEP derived inversion models, measurements of (1) model rationality, in terms of the number of valid retrievals (n); (2) coefficient of determination in log phase (R2); (3) logRMSE; (4) mean absolute percentage deviation (MAPD); and (5) Type II regression results (the intercept and slope of major axis regression) are used in this study.

#227036 - $15.00 USD (C) 2015 OSA

Received 18 Nov 2014; revised 15 Feb 2015; accepted 16 Feb 2015; published 23 Feb 2015 9 Mar 2015 | Vol. 23, No. 5 | DOI:10.1364/OE.23.005417 | OPTICS EXPRESS 5426

First, the performance of trained GEP models was assessed when applied to the overall NOMADv2a data set by comparing their retrievals to those from (1) NASA’s operational near-surface Chla algorithm for SeaWiFS (OC4v6); (2) a recently published Chla retrieval algorithm developed using GP based symbolic regression (GPSR) [28]; and (3) the latest version of the Quasi-Analytical Algorithm for IOPs (QAAv5) [5]. Note that the GPSR model was also developed based on NOMADv2a, but with three log-10 band reflectance ratios as inputs. In order to independently evaluate the strength of GEP between other data-driven empirical approaches, three inversion models were developed using the same training data as GEP, and also included in the benchmark comparison: (1) the OC4-type band-ratio polynomial regression model (OC4x); (2) the Multilayer Perceptron model (MLP), which has one hidden layer (6 neurons) and a logistic activation function for both hidden and output layers; and (3) the ε-insensitive SVR model (SVR) equipped with the Radial Basis Function kernels [18]. The MLP and SVR were both developed and optimally tuned using the DTREG commercial software, version 10.7.18. Secondly, performance indicators for the GEP inversion models are calculated with the same procedures and data sets utilized in the IOCCG Report Number 5 [32], and are evaluated by comparing them with the indicator values obtained from various algorithms tested in this earlier report. The following inversion algorithms were selected for comparison: (1) the QAAv4, (2) the MERIS neural network algorithm (NN) [44], (3) the linear matrix inversion algorithm (LMI) [45], (4) the GSM SA bio-optical model (GSM) [46] and (5) the SA reflectance model based inversion algorithm (SARA) [47]. Inversion algorithms for IOPs and diffuse attenuation coefficients are mostly based on physical models or a bio-optical data set generated by “unbiased” and “closure-achieved” forward models. Since the GEP-derived IOP algorithms were fully based on in situ observations to define regression functions, they may be affected by errors within the data set. Additionally, the application of symbolic regression for IOP is limited to only one dependent variable, e.g., aph(443), and each IOP retrieval algorithm is developed independently and might have difficulties in achieving closure when combined with each other. Therefore, the validity of proposed algorithms should be evaluated by confirming that the IOPs obtained by independent algorithms are capable of being linked by bio-optical models forming IOPs spectra, and then achieving optical closure when a radiative-transfer model is applied. As shown on Fig. 4, a model-closure evaluation for the proposed algorithms is conducted by comparing the input Rrs(λ) (IOCCG synthetic data) of the inversion algorithms with the Rrs(λ) simulated by the forward models and corresponding IOP retrievals. The models, parameters, and approaches used to generate a(λ) and bb(λ) with IOPs derived at specific wavelengths, and the settings of HydroLight radiative transfer model used to simulate Rrs(λ) with the generated IOP spectra, are given in Lee (2006) [34]. Slight changes were made by shifting the reference wavelengths of absorption and scattering from 440 and 550nm to 443 and 555nm, respectively, and combining ad(λ) and ag(λ) in Lee’s approach to give adg(λ). The q-index developed by Doerffer and Schiller (2007) [44] to determine the degree of Rrs(λ) mismatch (mismatch when q ≥ 4) was used for closure evaluation.

Fig. 4. The procedure of closure evaluation for IOPs derived from aph(443), adg(443) and bbp(555) algorithms.

#227036 - $15.00 USD (C) 2015 OSA

Received 18 Nov 2014; revised 15 Feb 2015; accepted 16 Feb 2015; published 23 Feb 2015 9 Mar 2015 | Vol. 23, No. 5 | DOI:10.1364/OE.23.005417 | OPTICS EXPRESS 5427

5. Results and discussion 5.1 The GEP-derived ocean-color inversion models The inversion models developed by GEP were in the form of ETs, which were translated to the following algebraic equations, based on the procedure described in Section 3.1. 2

1  f ( d d , 0.881 − d )    f (d , d )  +0.226, Chla =  + Abs  f    f (d , d )  d     0.09 3

0

4

3

4

3

4

3

2

3

1

{

3

} ,

+ f 2 f3 [ f 4 ( max(d 3 ,d1 ,d 0 ),d 3 ) ,d1 + d 4 ] , f3  f3 ( d 4 ,0.369 ) , d 2 (

a ph (443) = d 0

d3 d4

) + 0.8423 − ( d 0

d3

)

2

 d   ), 0.0092 ) +  0.2475    d    2

(

+ f3 max 2(f3 ( d 2 , d 0 ) , d 0 2

2

(5)

0

(

3

d 1 + 1.3014 )

2

, (6)

3

2



adg (443)= f 4  d 3 − 0.9756,



 (d 

d4 d0

0



f 4 (d 4 , d 2 )



d0

− 0.0798 − f 2 (d1 , d 2 ) ) + f3 



,d 4 − 0.0002 − d 3 

 (7)

+ f3 [ d1 + f 4 ( d 2 − 2 d 3 ,-1.0691d 3 ) ,d 4 ] ,

(

d -0.016 10 0

a p (443) = d 2 + f 4 0.0048

(

+ f3 0.5034, d 4



2



, d3

)  125 +

) − max 3 ( f [f (0.0236,d ), f (d ,d )] ,d ,-0.2639 )

d0 d2

2

2

4

3

4

0

1

0.2262 



d2

, 

2

(8)

2

  + f [ 0.0162 + 0.8085d , f ( 2d , f (d ,d ) )] max 3( d − 0.005, 0.37 57 d , d )   0.4344 d 4

a (443) = 

4

0

4

0

4

3

2

2

0

2

and (9)  d   + f  f  , max 2( d , 0.0068)  , f ( d , d ) + max 3( d , d , 0.0071)  ,  d   2

4

4

1

2

2

3

2

4

2

0



d



 d3

bb p (555) = d 4 − f 4  d1 , f3 

0



2

, f3 ( d 2 , d 4 )   + f 4 [ f 2 (0.0043, d1 ), f 2 ( d 3 , 0.003) ]

+ f3 max 3( d1 , d 3 , 0.0035),10

2

(10)



d1

 + f [ f ( ln(d )+2.3932,0.0021) , - 0.6994d − d + d ] . 4

4

4

4

0

1

Where f1, f2, f3, and f4 represent the “Greater Or Equal 2” function of GOE2A, GOE2C, GOE2D and GOE2E (Table 2); max2 is the maximum of two function; and max3 is the maximum of three function. Interestingly, the GOE2 functions are the most frequently chosen by the above equations. It is evident that logical expressions play an important role in GEPderived inversion models by varying the functional relationships between variables according to the input value of Rrs(λ). Although similar pre-classification strategies have been widely used in existing algorithms, such as the max3 function in OCx algorithms [41] and the color index (CI) threshold in the CI-based algorithm [48], this study is the first to develop an optimum inversion model derived using the GOE2 functions. 5.2 The limitation of GEP-derived algorithms A synthetic Rrs(λ) data set was developed in this study to evaluate the robustness of the proposed algorithms. The test spectra were generated by taking 100,000 random Latinhypercube samples of the Rrs(λ) spectrum from a multivariate lognormal distribution with a mean and covariance matrix obtained by all available measurements (n = 3,079) in the NOMADv2a data set. The evaluation aimed to be comprehensive and independent, as a major

#227036 - $15.00 USD (C) 2015 OSA

Received 18 Nov 2014; revised 15 Feb 2015; accepted 16 Feb 2015; published 23 Feb 2015 9 Mar 2015 | Vol. 23, No. 5 | DOI:10.1364/OE.23.005417 | OPTICS EXPRESS 5428

portion of the measurements was not included in the training data, and the testing spectra were simulated to sufficiently cover possible spectral values and patterns in the NOMADv2a data set. However, the simulated data may contain biases from the measurement errors within the in situ data set. The evaluation results are shown on Table 3. Although for the bbp(555) retrievals are relatively less robust, more than 99% of the retrieved values from other algorithms are within the scope of training/validation data set. The proposed algorithms were robust even when Rrs(411) was at a relatively low level of 0.00014 sr−1. However, an exceptional condition may occur while processing ocean-color imagery over coastal regions, where the Rrs(411) and Rrs(443) are often biased lower than 0.00014 sr−1 or become negative values due to the atmospheric correction errors [49]. Further analysis using different SeaWiFS scenes is required to ensure the validity of GEP-derived algorithms. Table 3. Statistical analysis results of Chla and IOPs retrievals obtained by the GEPderived algorithms

Rrs (λ, range) (nm, sr−1)

99% interval of retrieved values estimated by a lognormal distribution (percentage valid retrievals) aph(443) adg(443) ap(443) a(443) bbp(555) Chla (mg/m3) (m−1) (m−1) (m−1) (m−1) (m−1)

411, 1.4E-4–1.2E-1 443, 2.5E-4–6.4E-2 489, 4.5E-4–4.4E-2 510, 4.5E-4–3.0E-2 555, 1.1E-4–7.4E-2

0.015– 22.01 (99.9%)

0.0020– 0.5423 (100.0%)

0.0037– 0.5591 (99.8%)

0.0025– 0.7581 (99.2%)

0.0093– 1.0920 (99.8%)

0.0005– 0.0175 (95.0%)

5.3 Performance of GEP models for the NOMADv2a in situ bio-optical data set The overall performance of the GEP-derived algorithms is shown in Table 4 and Fig. 5, by comparing their retrievals to the known Chla and IOPs in both training and validation data sets. The performance indicators for GEP, MLP, and SVR in training and validation stages are both reported in Table 4. The performance for all data is generally the average value of both stages. There are no significant differences among all empirical algorithms with regard to the type II regression. Because of a greater size of training data, the OC4v6 developed using all of the available samples in the NOMADv2a data set (n = 1,243) performs better than the OC4x based on the same training data used in this study (n = 605). Nevertheless, the other four empirical algorithms derived by the state-of-the-art regression approaches with less training data, e.g., GEP, GPSR, MLP and SVR, provide tighter and stronger relationships with the known values in terms of having lower errors (logRMSE and MAPD) and higher R2 values than OC4v6. Based on the same training data, although the SVR performs the best across GEP and MLP for all and the major portion of cases (within the range of 0.25 and 3 mg/m3), the GEP performs the best for oligotrophic waters, where the levels of Chla were commonly lower than 0.25 mg/m3 [48], and well-balanced its performance across different levels of Chla. This advantage can help to process an ocean-color image with a higher dynamic range of Chla distribution. As shown in Fig. 5(a), because of the relatively few samples taken from eutrophic waters, a similar trend of under-estimation was found in both GEP and OC4v6 algorithms when the observations were higher than 10 mg/m3. The band-ratio algorithm derived by symbolic regression (GPSR) has a simpler architecture than our GEP-derived model [Eq. (5)]. However, the increased complexity of the GEP algorithm with additional GOE2 functions can obtain more accurate results, as seen in the fact that all of the performance indicators are better. In practice, there are no significant differences between GEP, SVR, and MLP in terms of the overall performance. The GEP-derived algorithm has the potential to be improved by using more function sets and adding more sub-ETs. The SVR and MLP can also be improved by tuning the network structures until fully optimized. The key strengths of the GEP-derived algorithm are its greater computational efficiency, which has been shown by GP in Tang et al. (2012) [28], and its greater algorithm transferability, due to its explicit nature (compared to SVR and MLP, which are implicit, black-box approaches). #227036 - $15.00 USD (C) 2015 OSA

Received 18 Nov 2014; revised 15 Feb 2015; accepted 16 Feb 2015; published 23 Feb 2015 9 Mar 2015 | Vol. 23, No. 5 | DOI:10.1364/OE.23.005417 | OPTICS EXPRESS 5429

Comparisons were also made to the IOP retrievals from the model-based QAAv5 algorithm using the same data set, as shown in Table 4 and Fig. 5. The empirically-based GEP algorithms perform better than the model-based QAAv5 in terms of all performance metrics, for the NOMADv2a in situ data. For the category of a(443) or bbp(555), which combine the contributions from both phytoplankton and detritus-gelbstoff, two algorithms are able to attain good retrievals. However, without the use of the additional measurements of Rrs(667), it is evident that QAAv5 has difficulty in separating the contributions of aph(443) and adg(443) from a(443), particularly when aph(443) is lower than 0.1 m−1. With the advantage of using an independent algorithm for each inversion [Eqs. (6) and (7)], the GEP has better retrievals of both aph(443) and adg(443), which can help in accurately deriving the concentration of the specified water constituent. Table 4. Performance of the GEP-derived and other compared inversion algorithms for Chla and IOPs, using the NOMADv2a in situ data set

Product

Tran./Val. R2 0.907/0.882

Tran./Val. logRMSE 0.219/0.248

OC4v6

0.868

0.261

GPSR OC4x MLP SVR GEP OC4v6 GPSR OC4x MLP SVR GEP OC4v6 GPSR OC4x MLP SVR GEP QAAv5 GEP QAAv5 GEP QAAv5 GEP QAAv5

1210

891 887 788 746 799 799 332 328

0.880 0.859/0.854 0.889/0.883 0.909/0.889 0.704/0.597 0.618 0.597 0.593/0.500 0.660/0.603 0.683/0.579 0.627/0.561 0.548 0.550 0.550/0.504 0.598/0.549 0.624/0.579 0.863 0.029 0.769 0.617 0.896 0.866 0.801 0.488

GEP

932

0.888

Model

n

GEP Chla

Chla ≤ 0.25

0.25 < Chla ≤3

aph(443) adg(443) a(443) bbp(555) ap(443)

#227036 - $15.00 USD (C) 2015 OSA

402

607

Intercept

Slope

−0.012

0.959

Tran./Val. MAPD 41.7/47.8

−0.017

0.960

50.2

0.250 0.269/0.276 0.238/0.246 0.217/0.241 0.154/0.173 0.173 0.182 0.196/0.199 0.173/0.176 0.163/0.182 0.236/0.260 0.275 0.261 0.259/0.263 0.238/0.251 0.215/0.233 0.186 0.751 0.249 0.615 0.146 0.170 0.088 0.196

0.010 −0.032 −0.026 −0.030 0.100 0.161 0.204 0.193 0.192 0.136 0.031 0.059 0.086 0.032 0.027 0.009 −0.088 −0.745 −0.172 0.265 −0.077 −0.114 −0.324 1.488

0.962 0.915 0.942 0.926 1.054 1.132 1.168 1.132 1.151 1.081 1.401 1.509 1.359 1.371 1.336 1.232 0.941 0.128 0.866 1.523 0.922 0.922 0.873 1.587

51.4 51.1/49.4 45.6/45.9 41.3/44.2 32.8/36.9 36.0 39.2 44.1/44.9 36.2/38.0 35.3/40.2 48.5/55.4 60.7 61.0 56.3/52.5 50.8/50.2 43.5/45.6 37.1 641 48.0 58.8 24.7 24.4 15.0 28.0

0.171

−0.069

0.948

32.1

Received 18 Nov 2014; revised 15 Feb 2015; accepted 16 Feb 2015; published 23 Feb 2015 9 Mar 2015 | Vol. 23, No. 5 | DOI:10.1364/OE.23.005417 | OPTICS EXPRESS 5430

Fig. 5. Comparing the retrievals obtained by GEP and selected algorithms to the known Chla and IOPs data from the NOMADv2a data set. (a) GEP- and OC4v6-derived Chla, (b) GEPand QAAv5-derived aph(443), (c) GEP- and QAAv5-derived adg(443), (d) GEP-derived ap(443), (e) GEP- and QAAv5-derived a(443), and (f) GEP- and QAAv5-derived bbp(555).

The four- and three-component bio-optical model is summarized in the following equation: a ( λ ) = a ( λ ) + a ( λ ) + a ( λ ) + a ( λ ) = a ( λ ) + a ( λ ) + a ( λ ). w

ph

d

g

w

ph

dg

(11)

Where aw(λ) is the known absorption coefficient of pure water [50]. Based in Eq. (11), the adg(443) can be calculated either by Eq. (7) alone, or by subtracting aph(443) and aw(443) from a(443) by utilizing the Eqs. (6)-(9). Although these equations were developed independently by GEP, and appear to be very different in their formulation, a comparison between the adg(443) retrievals obtained in two different ways [Fig. 6(c)] shows that both approaches are capable of carrying out the inversion, and hence the algorithms are generally interoperable. Therefore, specific information about gelbstoff and detritus in Eq. (11), i.e., ag(443) and ad(443), is obtainable from an “allocation” procedure. The ag(443) can be obtained by subtracting ap(443) and aw(443) from a(443), and then used to derive ad(443) from adg(443). The performances of deriving ag(443) and ad(443) using Eqs. (6)-(9) and following the above procedure are shown in Figs. 6(a) and 6(b), respectively. Although a similar overestimation is found for both adg(443) and ag(443) in the lower values, all of the ag(443) retrievals are valid (n/N = 100% and R2 = 0.65) by comparing them to the available cases in NOVADv2a (with coincident measurements of ag(443) and Rrs(λ) at the first five SeaWiFS visible bands). On the other hand, the ad(443) retrievals from GEP algorithms and the “allocation” procedure appear to be reasonable when higher than 0.01 m−1, but most are underestimated, with invalid estimates ( 0.25 mg/m3) and for oligotrophic waters (Chla ≤ 0.25 mg/m3), respectively. However, it is concluded that there are no significant differences between GEP, SVR, and MLP in terms of the overall performance, since these state-of-the-art data interpolation approaches all have the potential to be improved. The GEP algorithms proposed in their current forms are recommended for ocean color data processing because of the following advantages: (1) well-balanced performance across different Chla levels; (2) higher model transferability owing to their explicit nature; (3) appropriate function size compared to other GP derived algorithms, which commonly appear in the form of treestructure functions or source code for computing; and (4) higher flexibility of model improvement, due to a wide choice of function sets and model architectures. For IOPs retrievals, the GEP algorithms are robust and their performance is comparable to those of existing model-based approaches regarding the model-synthesized IOCCG data with added noise. Analytical or semi-analytical algorithms enable opportunities to understand the basics of ocean optics. On the other hand, simple empirical relationships extend the potential application of ocean-color products due to the easy and efficient processing ability. It might be the fundamental limitation for algorithms capable to perform complex data interpolations is that they are only valid at the data interval used for the model development. The general “out-of-range” limitation is applicable for these approaches to characterize the quality of each retrieved pixel value in ocean-color products. In this study, the statistical analysis using 100,000 simulated Rrs spectra shows that the GEP-derived algorithms are very reliable and robust at conditions similar to the NOMADv2a data set. The results of image processing show that some exceptional cases may occur in very turbid coastal regions where the Rrs(411) or Rrs(443) are negative. For those within the parameter region, but with questionable retrievals, the results of the closure evaluation provide some support for the GEP-derived IOP algorithms. The success of closure evaluation depends strongly upon the accurate field measurements in the NOMADv2a data set. However, unlike the MLP approach, which has the advantage of building bi-directional networks to derive pixel-by-pixel q-indexes as a threshold to identify not-valid values, the GEP symbolic regression is limited to only one dependent variable. To the best of my knowledge, there are no established procedures for estimating confidence or prediction intervals for symbolic regression models to prevent questionable retrievals. The strategy of assembling a set of diverse models to provide a confidence prediction [53] is thus suggested in any future GEP scheme. As an empirically-based approach, the GEP-derived algorithms are easily utilized and fast at processing the satellite imagery. After processing by the GEP-derived algorithms, the SeaWiFS GAC image shows more details of developing algal blooms in the Black Sea, and gives more information on the distribution of particles and CDOM. The GEP algorithms proposed in this study fully rely on training data to define the symbolic regression model, and thus it is recommended that a further study include more data for algorithm development, which can be a data set simulated by mechanistic models or a collection of accurate in situ data, particularly for higher bbp samples. Acknowledgments The authors thank the Ministry of Science and Technology, Taiwan ROC, for the financial support through grants NSC-101-2221-E-006-153 and MOST-103-2221-E-006-014.

#227036 - $15.00 USD (C) 2015 OSA

Received 18 Nov 2014; revised 15 Feb 2015; accepted 16 Feb 2015; published 23 Feb 2015 9 Mar 2015 | Vol. 23, No. 5 | DOI:10.1364/OE.23.005417 | OPTICS EXPRESS 5437

Development of ocean color algorithms for estimating chlorophyll-a concentrations and inherent optical properties using gene expression programming (GEP).

This paper proposes new inversion algorithms for the estimation of Chlorophyll-a concentration (Chla) and the ocean's inherent optical properties (IOP...
4MB Sizes 2 Downloads 7 Views