There are a variety of classification techniques, such as neural networks, decision trees, support vector machines and logistic regression. The curse of dimensionality is pertinent to many learning algorithms: it denotes the drastic rise in computational complexity as the number of features grows, which is why dimensionality reduction methods are needed. These methods include principal component analysis (PCA) and locality preserving projection (LPP). In many real-world classification problems the local structure is more important than the global structure, yet many dimensionality reduction techniques preserve the global structure while ignoring the local one. The objectives are to compare PCA and LPP in terms of accuracy, to develop appropriate representations of complex data by reducing their dimensionality, and to explain the importance of using LPP with logistic regression. The results of this paper show that the LPP approach provides a better representation and higher accuracy than the PCA approach.
Data mining is the extraction and retrieval of useful data; it also involves the retrieval and analysis of data stored in a data warehouse. Some of the major techniques of data mining are classification, association and clustering. Data mining is an emerging research area for solving various problems, and classification is one of the main problems in the field [
In order to evaluate a prediction method, it is necessary to have different data sets for training and testing. Five datasets are used here; the algorithms principal component analysis (PCA) and locality preserving projections (LPP) are applied to reduce their dimensions using the dimensionality reduction toolbox (drtoolbox) in MATLAB. After the input space is reduced to a lower dimension by one of the two methods, 10-fold cross-validation is applied to the new reduced feature space to evaluate the model, and logistic regression is then used to classify the reduced data. All the performance measures are computed: accuracy, sensitivity, specificity, F-score, precision and the ROC curve. The ROC analysis is plotted after each cross-validation for the two methods using SPSS to compute the area under the curve.
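As a minimal sketch of the reduction step, assuming the drtoolbox is on the MATLAB path and using its documented `compute_mapping` entry point; the file name `heart.mat`, the variable names and the target dimensionality are illustrative, not taken from the paper:

```matlab
% Load one dataset as a numeric matrix X (rows = samples, columns = features)
% with a binary 0/1 label vector y. The file and variable names are assumed.
load('heart.mat', 'X', 'y');

no_dims = 2;   % target dimensionality (illustrative choice)

% PCA preserves the global variance structure of the data.
[X_pca, mapping_pca] = compute_mapping(X, 'PCA', no_dims);

% LPP preserves the local neighborhood structure of the data.
[X_lpp, mapping_lpp] = compute_mapping(X, 'LPP', no_dims);
```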
Cross-validation is a model evaluation method that is better than residuals. The problem with residual evaluations is that they give no indication of how well the learner will perform when asked to make predictions for data it has not already seen. One way to overcome this problem is not to use the entire data set when training a learner: some of the data is removed before training begins, and when training is done, the removed data can be used to test the performance of the learned model on “new” data. This is the basic idea behind a whole class of model evaluation methods called cross-validation [
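A compact sketch of the 10-fold procedure used in this paper, with logistic regression as the learner; the variable names `X_red` (the reduced feature matrix from PCA or LPP) and `y` are assumptions carried over from the reduction step:

```matlab
% 10-fold cross-validation of logistic regression on the reduced data.
k = 10;
cv = cvpartition(length(y), 'KFold', k);
acc = zeros(k, 1);
for i = 1:k
    tr = training(cv, i);  te = test(cv, i);
    % Fit a logistic regression model on the training folds.
    b = glmfit(X_red(tr, :), y(tr), 'binomial', 'link', 'logit');
    % Predict probabilities on the held-out fold and threshold at 0.5.
    p = glmval(b, X_red(te, :), 'logit');
    acc(i) = mean((p > 0.5) == y(te));
end
fprintf('Mean 10-fold accuracy: %.4f\n', mean(acc));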
The measures used in this paper are computed from the confusion matrix, shown in Table 1, where:
TP: true positives (predicted positive, actual positive),
TN: true negatives (predicted negative, actual negative),
FP: false positives (predicted positive, actual negative),
FN: false negatives (predicted negative, actual positive) [
・ Accuracy:
Accuracy is the proportion of true results (both true positives and true negatives) in the population [
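In terms of the confusion matrix entries of Table 1:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{TP + TN}{N}$$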
・ Sensitivity or Recall:
Proportion of actual positives which are predicted positive [
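In symbols:

$$\text{Sensitivity} = \frac{TP}{TP + FN} = \frac{TP}{AP}$$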
・ Specificity:
Proportion of actual negatives which are predicted negative [
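In symbols:

$$\text{Specificity} = \frac{TN}{TN + FP} = \frac{TN}{AN}$$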
・ Positive predictive value (PPV):
Proportion of predicted positives which are actual positive [
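In symbols:

$$\text{PPV} = \frac{TP}{TP + FP} = \frac{TP}{PP}$$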
・ F Score:
Harmonic mean of precision and recall [
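Equivalently:

$$F = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2TP}{2TP + FP + FN}$$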
・ ROC analysis:
Receiver Operating Characteristic (ROC) graphs are a useful technique for organizing classifiers and visualizing their performance [
・ Area under curve (AUC):
The area under the ROC curve lies between 0 and 1 and is increasingly recognized as a better measure for evaluating algorithm performance than accuracy. A bigger AUC value implies a better ranking performance for a classifier [

Table 1. The confusion matrix.

| | Predicted positive | Predicted negative | Total |
|---|---|---|---|
| Actual positive | TP | FN | AP |
| Actual negative | FP | TN | AN |
| Total | PP | PN | N |
The estimated probability is used to construct the ROC analysis after each cross-validation for the two dimensionality reduction methods.
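The paper computes the ROC curve and AUC in SPSS; as an equivalent sketch in MATLAB, `perfcurve` can produce the same curve and area from the out-of-fold probabilities (the variable names `probs` and `labels` are assumptions):

```matlab
% ROC curve and AUC from cross-validated probabilities.
% probs: out-of-fold predicted probabilities; labels: true 0/1 classes.
[fpr, tpr, ~, auc] = perfcurve(labels, probs, 1);   % 1 = positive class
plot(fpr, tpr);
xlabel('1 - Specificity (false positive rate)');
ylabel('Sensitivity (true positive rate)');
title(sprintf('ROC curve (AUC = %.3f)', auc));
```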
Two experiments on five datasets have been systematically performed. These experiments reveal a number of interesting points:
1) In all datasets, the locality preserving projection approach performed better than principal component analysis.
2) The datasets were downloaded from the UCI repository. They were selected because most papers apply LPP to face recognition datasets and there is no study using ordinary datasets with LR and LPP; it was therefore necessary to compare LPP with another algorithm, such as PCA, to show the difference between them and to demonstrate that LPP outperforms PCA.
3) The ROC curves of PCA and LPP for all datasets are shown in Figures 2-5, and LPP appears to be the best one.
4) Compared to the PCA method, which preserves the global structure, the LPP method preserves the local structure, which is more important for several reasons: it is important to maintain the intrinsic information of high-dimensional data when it is transformed to a low-dimensional space for analysis; a single characterization, either global or local, may be insufficient to represent the underlying structures of real-world data; and the local geometric structure of the data can be seen as a data-dependent regularization of the transformation matrix, which helps to avoid overfitting, especially when training samples are scarce (a minimal sketch of the LPP computation is given below).
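For concreteness, here is a minimal, self-contained sketch of the LPP computation itself (the experiments use the drtoolbox implementation; `k`, `t` and the use of `pdist2` from the Statistics Toolbox are illustrative choices, and solving the generalized eigenproblem assumes X'DX is nonsingular):

```matlab
function Y = lpp_sketch(X, no_dims, k, t)
    % Locality preserving projection: build a k-NN graph with heat-kernel
    % weights, then solve the generalized eigenproblem X'LX a = lambda X'DX a.
    n = size(X, 1);
    D2 = pdist2(X, X).^2;                    % pairwise squared distances
    W = zeros(n);
    [~, idx] = sort(D2, 2);                 % neighbors by distance, per row
    for i = 1:n
        nb = idx(i, 2:k+1);                  % skip self (column 1)
        W(i, nb) = exp(-D2(i, nb) / t);      % heat-kernel weights
    end
    W = max(W, W');                          % symmetrize the graph
    D = diag(sum(W, 2));
    L = D - W;                               % graph Laplacian
    % Keep the eigenvectors with the smallest eigenvalues.
    [V, E] = eig(X' * L * X, X' * D * X);
    [~, ord] = sort(real(diag(E)), 'ascend');
    A = V(:, ord(1:no_dims));
    Y = X * A;                               % low-dimensional embedding
end
```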
This paper applies a dimensionality reduction algorithm called LPP and compares it with another dimensionality reduction approach, principal component analysis (PCA), for LR classification. The three tables below report the per-dataset performance measures for the compared configurations.
| Datasets | Accuracy | Specificity | Sensitivity | Precision | F-score |
|---|---|---|---|---|---|
| Climate model simulation crashes (540 × 18) | 0.9259 | 0.7500 | 0.9400 | 0.9792 | 0.9592 |
| Heart (270 × 13) | 0.7778 | 0.7273 | 0.8125 | 0.8125 | 0.8125 |
| Spambase (4601 × 57) | 0.9067 | 0.9333 | 0.8571 | 0.8734 | 0.8652 |
| Phishing websites (2456 × 30) | 0.9283 | 0.9219 | 0.9350 | 0.9200 | 0.9274 |
| Musk (version 1) (476 × 186) | 0.7358 | 0.7500 | 0.7143 | 0.6522 | 0.6818 |

| Datasets | Accuracy | Specificity | Sensitivity | Precision | F-score |
|---|---|---|---|---|---|
| Climate model simulation crashes (540 × 18) | 0.9444 | 0.5000 | 0.9800 | 0.9608 | 0.9703 |
| Heart (270 × 13) | 0.8148 | 0.8333 | 0.8200 | 0.8571 | 0.8276 |
| Spambase (4601 × 57) | 0.9197 | 0.9296 | 0.9040 | 0.8889 | 0.8964 |
| Phishing websites (2456 × 30) | 0.9323 | 0.9329 | 0.9314 | 0.9048 | 0.9179 |
| Musk (version 1) (476 × 186) | 0.8113 | 0.8438 | 0.7619 | 0.7619 | 0.7619 |

| Datasets | Accuracy | Specificity | Sensitivity | Precision | F-score |
|---|---|---|---|---|---|
| Climate model simulation crashes (540 × 18) | 0.9815 | 0.8000 | 0.9900 | 0.9800 | 0.9899 |
| Heart (270 × 13) | 0.8889 | 0.8462 | 0.9286 | 0.8667 | 0.8966 |
| Spambase (4601 × 57) | 0.9284 | 0.9368 | 0.9167 | 0.9119 | 0.9143 |
| Phishing websites (2456 × 30) | 0.9602 | 0.9632 | 0.9565 | 0.9565 | 0.9565 |
| Musk (version 1) (476 × 186) | 0.8491 | 0.8571 | 0.8333 | 0.7500 | 0.7895 |
The comparison includes several performance measures, which yields a valid and reliable conclusion. The performance of the two approaches is evaluated in terms of accuracy, sensitivity, specificity, F-score, precision, AUC and ROC analysis, through experiments conducted on datasets of various types and sizes. The comparison shows that LPP gives relatively good results in feature reduction and computational complexity when the training data size is relatively large compared to the number of features. In LR, the features are required to be uncorrelated but need not be independent; when PCA and LPP are applied to datasets whose number of features is much larger than the number of samples, the dimension must be reduced to a very low value, which results in a greater loss of information. LR has proven to be a powerful classifier for high-dimensional datasets, and it also performs efficiently when combined with dimensionality reduction methods. A previous study found that LPP performed better in face recognition; in this paper, LPP also performed better on ordinary datasets by preserving the local structure rather than the global structure.
| Datasets | PCA (AUC) | LPP (AUC) |
|---|---|---|
| Climate model simulation crashes | 0.912 | 0.948 |
| Heart | 0.812 | 0.904 |
| Spambase | 0.847 | 0.966 |
| Phishing websites | 0.873 | 0.987 |
| Musk (version 1) | 0.792 | 0.899 |
This work can be extended to other data mining techniques such as clustering and association; it can also be extended to other classification algorithms such as neural networks, decision trees and support vector machines, and many more datasets should be examined. Moreover, this paper recommends using more than one mathematical model to obtain the best results.