In this study, two functional logistic regression models, one with a functional principal component basis (FPCA) and one with a functional partial least squares basis (FPLS), were developed to distinguish precancerous adenomatous polyps from hyperplastic polyps for the purposes of classification and interpretation. The classification performance of the two functional models was compared with that of two widely used multivariate methods, principal component discriminant analysis (PCDA) and partial least squares discriminant analysis (PLSDA). The FPCA and FPLS models outperformed the PCDA and PLSDA models while using only a small number of functional basis components. Because they substantially reduce model complexity and improve classification accuracy, the functional models are particularly helpful for interpreting the complex spectral features related to precancerous colon polyps.
Near-infrared (NIR) spectra of biomedical objects consist of many overlapping absorption bands representing the different modes of vibration of a large number of molecular constituents in the compounds, which are sensitive to the physical and chemical states of the compounds [
Traditional spectral data analysis involves multivariate statistical techniques such as multiple linear regression, principal components regression (PCR) [
Functional data analysis (FDA) proposed by Ramsay and Silverman [
In this study of functional logistic regression (FLR) [
A colon dataset is presented here to illustrate the application of functional data analysis to a classification problem arising in colonoscopy. The evolution of colon cancer starts with colon polyps. There are two types of colon polyps: hyperplastic and adenomatous. Hyperplasias are benign polyps that are known not to evolve into cancer. Adenomas have a strong tendency to develop into cancer and therefore have to be excised immediately. Polyps are often found during endoscopy of the colon (colonoscopy). A method that reliably differentiates adenomas from hyperplasias during a preventive colonoscopy is highly desirable.
Visible-NIR spectra covering the range from 320 to 920 nm were collected from 64 human colonic biopsy sites using an optical probe in vivo during colonoscopy which was undertaken to assess patients for precancerous adenomatous polyps. Repeated measurements, on average 6 spectra per site, were taken from each biopsy site. In total 363 spectra were analyzed, including 63 spectra from 11 hyperplastic polyps and 300 spectra from 53 adenomatous polyps.
Before constructing a classification model, standard data pre-processing was carried out on the spectra to improve signal quality. This involved spectral smoothing using the Savitzky-Golay method (Savitzky and Golay 1964) [
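As a rough illustration of the smoothing step, the sketch below implements a Savitzky-Golay smoother with NumPy. The study's computations were done in R; the use of Python, the function names, the window half-width and the polynomial order are all assumptions made for illustration and are not the settings used in the study.

```python
import numpy as np


def savgol_coeffs(half_width, poly_order):
    """Least-squares weights that return the fitted value at the window centre."""
    offsets = np.arange(-half_width, half_width + 1, dtype=float)
    # Columns are offsets**0, offsets**1, ..., offsets**poly_order.
    design = np.vander(offsets, poly_order + 1, increasing=True)
    # Row 0 of the pseudo-inverse maps the window values to the constant term,
    # which is the local polynomial evaluated at offset 0.
    return np.linalg.pinv(design)[0]


def savgol_smooth(spectrum, half_width=5, poly_order=2):
    """Smooth a 1-D spectrum; the edges are handled by repeating the end values."""
    spectrum = np.asarray(spectrum, dtype=float)
    coeffs = savgol_coeffs(half_width, poly_order)
    padded = np.pad(spectrum, half_width, mode="edge")
    # np.convolve flips its second argument, so reverse the weights first.
    return np.convolve(padded, coeffs[::-1], mode="valid")
```

Away from the edges, a spectrum that is already a polynomial of degree at most `poly_order` is returned unchanged, which is a convenient sanity check for the filter.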
Multivariate statistical methods including principal component analysis (PCA), partial least squares (PLS) and linear discriminant analysis (LDA) [
Although multivariate PCA and PLS are powerful exploratory methods for reducing the dimension of the original spectral variables, in a spectroscopic setting it is more informative to describe each spectrum as a smooth function rather than as a set of discrete points, so that the underlying spectral information is taken into account.
To solve the problems of high dimensionality and multicollinearity encountered in functional regression models, PCA and PLS have been generalized to the infinite-dimensional case, where observations of the predictor variables are curves or functions (functional data) instead of vectors as in the multivariate case. Functional PCA [
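The idea of treating a spectrum as a smooth function can be sketched as a basis expansion fitted by least squares. The Legendre polynomial basis, the function name and the number of basis functions below are assumptions chosen to keep the example short; they are not the basis used in the study.

```python
import numpy as np
from numpy.polynomial import legendre


def fit_basis_coefficients(wavelengths, spectrum, n_basis):
    """Represent a discretized spectrum by n_basis Legendre coefficients."""
    wavelengths = np.asarray(wavelengths, dtype=float)
    lo, hi = wavelengths.min(), wavelengths.max()
    # Legendre polynomials are defined on [-1, 1], so rescale the wavelengths.
    scaled = 2.0 * (wavelengths - lo) / (hi - lo) - 1.0
    # Column j holds the j-th Legendre polynomial evaluated at each wavelength.
    basis = legendre.legvander(scaled, n_basis - 1)
    coeffs, *_ = np.linalg.lstsq(basis, spectrum, rcond=None)
    return coeffs, basis
```

The smooth reconstruction is `basis @ coeffs`, so a spectrum sampled at hundreds of wavelengths is summarized by a handful of coefficients.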
Let
Functional PC and functional PLS components are uncorrelated components obtained as generalized linear combinations of the functional predictor variables. The jth component is defined by
where
For functional PCA, we seek weight functions
subject to
For functional PLS, we seek weight functions
In binary classification problems, a functional logistic regression (FLR) model is used to model the relationship between the binary response and the functional predictor, whose observations are functions instead of vectors as in the multivariate case. The FLR model is given by
where
with
Taking the logit transformation gives the following equivalent expression for the FLR model:
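For reference, one common way of writing the FLR model and its logit form is sketched here. The symbols are assumptions chosen for illustration and may differ from the notation used elsewhere in this paper: x_i(t) is the ith sample curve on a domain T, \pi_i is the probability that the response equals 1, \alpha is an intercept and \beta(t) is the coefficient function.

```latex
\begin{align}
\pi_i &= P\left(Y = 1 \mid x_i(t)\right)
       = \frac{\exp\left\{\alpha + \int_T x_i(t)\,\beta(t)\,dt\right\}}
              {1 + \exp\left\{\alpha + \int_T x_i(t)\,\beta(t)\,dt\right\}},
       \qquad i = 1, \dots, n, \\
l_i &= \ln\left(\frac{\pi_i}{1 - \pi_i}\right)
     = \alpha + \int_T x_i(t)\,\beta(t)\,dt .
\end{align}
```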
The use of least squares criteria to estimate the parameters of the functional linear regression model yields an ill-posed problem due to the infinite dimension of the predictor space. In addition, sample path
where
where
The main problem is to approximate the basis coefficients of sample curves from their discrete observations and to select an appropriate basis by taking into account the characteristics of the observed sample curves. Due to the high dependence structure of the design matrix
FPCA and FPLS were carried out with k, the number of component basis functions, ranging from 2 to 30, and k was optimized by leave-one-site-out cross-validation as mentioned in Section 2.3.1.
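The selection of k can be sketched as follows. This is a minimal illustration, assuming that all spectra from one biopsy site are held out together and that k is chosen by overall cross-validated accuracy; the function names and the `fit_predict` callback are hypothetical and stand in for whichever model is being tuned.

```python
from collections import defaultdict


def leave_one_site_out(site_ids):
    """Yield (site, train_idx, test_idx), holding out every spectrum of one site."""
    by_site = defaultdict(list)
    for idx, site in enumerate(site_ids):
        by_site[site].append(idx)
    for site, test_idx in by_site.items():
        held_out = set(test_idx)
        train_idx = [i for i in range(len(site_ids)) if i not in held_out]
        yield site, train_idx, test_idx


def select_n_components(site_ids, labels, candidate_ks, fit_predict):
    """Return (best_k, accuracy) over candidate_ks.

    fit_predict(k, train_idx, test_idx) is a hypothetical callback that fits a
    model with k components on train_idx and returns predictions for test_idx.
    """
    best_k, best_acc = None, -1.0
    for k in candidate_ks:
        correct = 0
        for _, train_idx, test_idx in leave_one_site_out(site_ids):
            preds = fit_predict(k, train_idx, test_idx)
            correct += sum(int(p == labels[i]) for p, i in zip(preds, test_idx))
        acc = correct / len(labels)
        if acc > best_acc:  # ties keep the smaller k seen first
            best_k, best_acc = k, acc
    return best_k, best_acc
```

Holding out whole sites rather than single spectra keeps repeated measurements from the same polyp out of the training set for that fold.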
All the algorithms for computations and analyses were implemented in R statistical programming language [
The mean spectral patterns from hyperplastic and adenomatous polyps after standard pre-treatment are shown in
For PCDA and PLSDA models, the discrimination results of cross-validation were used to optimize the number of PCs or PLS components. In
taking into account the relationship between the spectral variables and the response variable for latent variable design, the optimal PLSDA model achieved a leave-one-out cross-validation discrimination accuracy with the same sensitivity and specificity of 81% and 76% respectively, but using only seven PLS components in constructing the canonical variable.
By combining the loadings from the PCA or PLS and LDA, the PCDA loading and PLSDA loading can be found to show the contribution at each wavelength to the linear discrimination rule and thus can be related easily to the spectral features for interpretation purpose.
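The combination of loadings described above can be sketched for the PCDA case: the PCA loadings map spectra to scores, a two-class linear discriminant gives weights on those scores, and their product is a single coefficient per wavelength. The function name, the 0/1 label coding and the pooled-covariance form of LDA are assumptions made for this illustration.

```python
import numpy as np


def pcda_loading(X, y, n_components):
    """Return (loading per wavelength, centring vector) for a two-class PCDA."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    center = X.mean(axis=0)
    Xc = X - center
    # Rows of Vt are the principal axes; keep the first n_components as columns.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    pca_loadings = Vt[:n_components].T            # (n_wavelengths, k)
    scores = Xc @ pca_loadings                    # (n_samples, k)
    s0, s1 = scores[y == 0], scores[y == 1]
    d0, d1 = s0 - s0.mean(axis=0), s1 - s1.mean(axis=0)
    pooled = (d0.T @ d0 + d1.T @ d1) / (len(y) - 2)
    # Fisher's discriminant direction in the score space.
    lda_weights = np.linalg.solve(pooled, s1.mean(axis=0) - s0.mean(axis=0))
    # One coefficient per wavelength: PCA loadings times LDA weights.
    return pca_loadings @ lda_weights, center
```

The discriminant score of a spectrum `x` is `(x - center) @ loading`, so plotting `loading` against wavelength shows which spectral regions push a sample towards either class.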
However, in
Though the PLSDA model used fewer constructed variables for classification than the PCDA model, the classification performance and interpretability of both models are not satisfactory. The FPCA and FPLS logistic regression models showed improvements over the multivariate methods.
Models | Parameter | Sensitivity | Specificity |
---|---|---|---|
PCDA | | 81% | 76% |
PLSDA | | 81% | 76% |
FPCA | | 88% | 76% |
FPLS | | 85% | 76% |
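For readers who want to reproduce figures of this kind, sensitivity and specificity can be computed from true and predicted labels as sketched here. Treating adenoma as the positive class is an assumption of this example, as are the function name and label strings.

```python
def sensitivity_specificity(true_labels, predicted_labels, positive="adenoma"):
    """Return (sensitivity, specificity) with `positive` as the positive class."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == positive:
            if pred == positive:
                tp += 1   # positive correctly detected
            else:
                fn += 1   # positive missed
        else:
            if pred == positive:
                fp += 1   # negative wrongly flagged
            else:
                tn += 1   # negative correctly rejected
    return tp / (tp + fn), tn / (tn + fp)
```

With this coding, sensitivity is the fraction of adenoma cases that are detected and specificity is the fraction of hyperplastic cases that are correctly left unflagged.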
The cross-validated discrimination results were used to optimize the number of PC or PLS component basis functions. In
Using a small number of PC or PLS functional basis components, the coefficients of the FPCA and FPLS models show a simpler structure. This allows the discrimination coefficient vectors to be visualized and shows clearly how they contribute to correct classification. In
We further explored the structures of the FPCA and FPLS components. The first four functional PCs were considered since they account for more than 95% of the total variation in the colon spectral data. As shown in
more and more cycles. The functional PLS loadings seem to capture more informative spectral features, even for the dominant component.
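The choice of how many components to inspect, based on the share of total variation they explain, can be sketched as below. This example uses ordinary PCA on discretized spectra as a stand-in for functional PCA, which is an assumption; the function name and the 95% default threshold are illustrative.

```python
import numpy as np


def n_components_for_variance(X, threshold=0.95):
    """Return (k, cumulative ratios): the smallest k reaching `threshold`."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    singular_values = np.linalg.svd(Xc, compute_uv=False)
    # Squared singular values are proportional to the component variances.
    ratios = singular_values ** 2 / np.sum(singular_values ** 2)
    cumulative = np.cumsum(ratios)
    # First position where the cumulative share reaches the threshold.
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(cumulative)), cumulative
```

The returned `cumulative` array can also be plotted against the component index to see how quickly the explained variation levels off.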
Functional data analysis has been widely used for spectroscopic data, where the absorbance spectrum is a functional variable whose observations are functions of wavelength. In this study, two functional logistic regression models, with a functional PC basis and a functional PLS basis, were developed to distinguish adenomatous polyps from hyperplastic polyps during endoscopy of the colon. The results showed that the functional logistic regression models outperformed the PCDA and PLSDA models while using a small number of components.
The commonly used multivariate models PCDA and PLSDA rely on more discriminant components, which may introduce noise that hinders classification performance and makes the spectral features harder to interpret. By taking into account the functional form of spectroscopic data, the FPCA and FPLS logistic regression models improved classification accuracy. The functional representation of the spectra combines dimension reduction and smoothing in one step. Both the FPCA and FPLS models gave better classification performance than the PCDA and PLSDA models and used a reduced number of functional basis components. In particular, FPLS used fewer latent variables than FPCA.
The most important contribution of the FPCA and FPLS models is that the discrimination coefficients contributing to the correct classification of hyperplastic and adenomatous polyps provide insight into the complex spectral features related to the different types of colon polyps. With a substantial reduction in the number of spectral components, the functional logistic regression model is also a potentially accurate, fast and robust tool for distinguishing adenomatous polyps from hyperplastic polyps during endoscopy of the colon, so that precancerous polyps can be detected and removed before they turn into cancer. This is crucial for real-time clinical diagnostic application. In the future, it will be of interest to test the models on a larger number of colonic polyp samples to give a more reliable diagnostic result. Spectral feature selection in functional data classification and interpretation of the classification loadings will also be developed further in future work.
This research is funded by Academic Research Funds (AcRF: RI 6/14 ZY) of National Institute of Education, Nanyang Technological University, Singapore. The author thanks the National Medical Laser Centre of University College London for providing the vis-NIR spectroscopic data.
Zhu, Y. (2017) Functional Data Analysis of Spectroscopic Data with Application to Classification of Colon Polyps. American Journal of Analytical Chemistry, 8, 294-305. https://doi.org/10.4236/ajac.2017.84022