Climate change is a controversial topic of debate, especially in the US, where many do not believe in anthropogenic climate change. Because its predicted consequences are dire, such as a mass ocean extinction and frequent extreme weather events, it is important to learn what causes the warming in order to better combat it. In this study, the first challenge is how to construct reliable statistical models based on massive climate data spanning 800,000 years and accurately capture the relationship between temperature and potential factors such as concentrations of carbon dioxide (CO2), nitrous oxide (N2O), and methane (CH4). We compared the performance of several mainstream machine learning algorithms on our data, including linear regression, lasso, support vector regression, and random forest, to build a state-of-the-art model that verifies the warming of the Earth and identifies the factors contributing to global warming. We found that random forest outperforms the other algorithms in creating accurate climate models that use features such as the concentrations of different greenhouse gases to precisely forecast global temperature. The other challenge, identifying factor importance, can be met by the feature-importance capability of the ensemble tree-based random forest algorithm. It was found that CO2 is the largest contributor to temperature change, followed by CH4, then by N2O. All three had some impact, however, meaning their release into the atmosphere should be controlled to help restrain temperature increase and help prevent climate change’s potential ramifications.
The general scientific consensus is that the Earth is warming. Over the past century, the temperature has already climbed 0.5˚C [
One argument against anthropogenic global warming attributes the Earth’s warming to increased solar activity in the past few years. This point is moot: the Sun goes through an eleven-year cycle of solar activity, yet the Earth has been warming continuously for the past decade, which the solar cycle cannot explain. As others have shown, solar activity has no correlation with global temperature [
According to the Intergovernmental Panel on Climate Change’s (IPCC) latest report, the main driving force of global warming is the increase in concentration of carbon dioxide in the atmosphere [
So why should we worry about increased CO2 concentration and global warming? The consequences of global warming can be catastrophic. The increased CO2 concentration in the air also leads to an increase in CO2 absorbed by the ocean, which means the ocean will become more acidic. The pH of the ocean has already decreased by 0.1 [
In addition, many studies outline dire consequences of global warming. According to the IPCC report, both the number and the intensity of hurricanes will increase due to the warming ocean water, putting coastal states at risk [
However, as more studies point out, there are many important factors that contribute to global warming besides the concentration of CO2. Gases such as CH4 have much stronger global warming effects than CO2 [
In order to determine the next steps to help mitigate climate change, the main factors that drive it should be investigated to establish how significant each factor is. This study focuses on these factors, as well as other factors with the potential to cause differences in global temperature. A previous study examined temperature over the past 1000 years, considering solar activity, volcanic activity, and greenhouse gas (GHG) concentration [
Meanwhile, machine learning has been applied ever more widely to environmental protection problems and has achieved promising results. Chen et al. [
The theory of variable fuzzy sets and the fuzzy binary comparison method have been applied to assessing water quality in [
In this paper, our first aim is to validate global warming based on the collected public data. Then, machine learning algorithms are employed to investigate the effects different factors have on global temperature. Finally, we analyze the plots generated by the algorithms and draw conclusions from them.
The paper proceeds as follows: Section 2 is about the dataset we have. Section 3 is about how the data was used in conjunction with different machine learning algorithms and what the algorithms are. Section 4 is about the results from the machine learning analysis of the data. Section 5 summarizes the results and includes how the findings from this paper can be used in future projects.
Data from the past 800,000 years will be compiled from a variety of public databases, such as the National Oceanic and Atmospheric Administration and the United States Environmental Protection Agency. The data used will include: CO2 in parts per million (PPM) [
The data collected over the 800,000 years are not aligned with each other. For example, there may be CO2 and a corresponding temperature in year 1900, but may lack the corresponding N2O and CH4 concentration at that time. To prepare the data for machine learning, we use linear interpolation to align the data, since machine learning algorithms cannot handle missing data points effectively.
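The alignment step above can be sketched as follows. This is a minimal illustration on made-up numbers, not the study's actual records: `n2o_years` and `n2o_vals` are hypothetical placeholders for years in which N2O was measured.

```python
import numpy as np

# Hypothetical illustration: temperature and CO2 are known every year,
# but N2O was only measured in some years, leaving gaps in the record.
years = np.arange(1900, 1910)                   # target time axis
n2o_years = np.array([1900, 1903, 1906, 1909])  # years with N2O data
n2o_vals = np.array([276.0, 276.9, 277.8, 278.7])

# np.interp places each missing year on the straight line joining its
# two nearest measured neighbours (linear interpolation).
n2o_aligned = np.interp(years, n2o_years, n2o_vals)
```

After this step every year on the target axis has a value for every gas, so the rows can be fed to the learning algorithms.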
The global temperature change over the past 100 years will be visualized based on the public data provided by Lawrence Berkeley National Lab. The trend of global warming can be observed in the plotted average global temperature over the past 70 years. The coefficient of determination (R2) between global temperature and time is also computed, which can further validate statistically the increase of global temperatures along with time.
To investigate the possible factors that contribute to the global temperature increase, we need to conduct factor analysis on potential factors such as CO2 concentration. Many research works have been conducted to show there is a strong relationship between temperature and CO2. The common technique to analyze potential factors includes visual check and statistical correlation computation. In this work, we first visualize the variations of temperature and CO2, and we also compute the R2 to validate the correlation observed statistically.
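The R² computation described above can be sketched as below. The series here are synthetic stand-ins (a trending CO2 series and a temperature series that tracks it plus noise), not the study's real data; for a simple linear fit, R² equals the square of the Pearson correlation coefficient.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the real records: an upward-trending CO2
# series and a temperature series that follows it plus small noise.
co2 = np.linspace(310.0, 410.0, 70)   # ppm, 70 yearly values
temp = 0.01 * (co2 - 310.0) + rng.normal(0.0, 0.05, co2.size)

# For a simple linear fit, R^2 is the squared Pearson correlation.
r = np.corrcoef(co2, temp)[0, 1]
r_squared = r ** 2
```

A value of R² close to 1 indicates that the linear relationship explains most of the variance in temperature.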
Machine learning is a collection of statistical methods to analyze trends, find relationships, and develop models to predict things based on data sets. The machine learning algorithms we explore for this global warming study are random forest, support vector regression (SVR), lasso, and linear regression.
Random forest is an algorithm that uses trees as building blocks to construct more powerful prediction models. The algorithm builds an ensemble of decision trees. When building each tree, the splits are chosen from a random subset of predictors smaller than the full set. By restricting the predictors available at each split, strong predictors do not drown out weaker ones, and the trees become decorrelated from any one predictor. The final result, the average of the individual trees' predictions, therefore has lower variance than any single tree and is more reliable than if every tree had used all predictors.
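A minimal random forest sketch follows, using scikit-learn on synthetic proxy data rather than the study's records; the feature names and coefficients are hypothetical. Setting `max_features` below the number of features implements the predictor-restriction idea described above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n = 500

# Synthetic proxy data (not the study's records): a target driven
# mostly by the first feature, weakly by the second and third.
X = rng.normal(size=(n, 3))   # hypothetical columns: co2, ch4, n2o
y = 1.0 * X[:, 0] + 0.4 * X[:, 1] + 0.1 * X[:, 2] + rng.normal(0, 0.1, n)

# max_features < n_features restricts each split to a random subset
# of predictors, decorrelating the trees as described above.
model = RandomForestRegressor(n_estimators=300, max_features=2,
                              min_samples_leaf=1, random_state=0)
model.fit(X, y)
importances = model.feature_importances_   # normalized to sum to 1
```

The `feature_importances_` attribute is what later allows the relative contribution of each gas to be ranked.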
Support vector machines, or SVM, are algorithms that use hyperplanes (the generalization of a line or plane to higher dimensions) to create regressions. Essentially, the algorithm tries to separate the different types of data using the hyperplane with the largest margin between the groups in a multi-dimensional space. If a data point falls outside the margin, a penalty is incurred that affects whether the hyperplane really is the optimal choice. SVM can use different kernels, or different ways of finding the hyperplane in a high-dimensional space. Support vector regression (SVR) is an extension of this, creating a regression from the principles of SVM. Like other regressions, SVR has a loss function, but loss accrues only when the residuals exceed a certain constant.
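A minimal SVR sketch on synthetic data follows, assuming scikit-learn; the data here is an arbitrary noisy sine curve, not the climate records. The `epsilon` parameter is the constant mentioned above: residuals smaller than it contribute nothing to the loss.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(2)

# Arbitrary synthetic regression problem: a noisy sine curve.
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 200)

# epsilon sets the half-width of the "tube": residuals inside it add
# nothing to the loss; C scales the penalty for points outside it.
model = SVR(kernel="rbf", C=2.0, epsilon=0.1)
model.fit(X, y)
pred = model.predict(X)
```

The radial basis function kernel lets the flat hyperplane in the transformed space correspond to a smooth nonlinear curve in the original space.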
Lasso, or least absolute shrinkage and selection operator, is an algorithm that uses shrinkage, in which coefficient estimates are shrunk toward a point such as zero. The algorithm uses L1 regularization, which adds a penalty based on the sum of the absolute values of the coefficients and will shrink some coefficients to exactly zero if they play no role. This prevents the model from overfitting and yields a more general model. At the same time, lasso minimizes the residual sum of squares of the data.
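The coefficient-zeroing behavior can be seen in a small sketch, again on synthetic data with hypothetical features (scikit-learn's `alpha` scales the L1 penalty):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n = 300

# Synthetic data: only the first two of four features matter;
# the other two are pure noise.
X = rng.normal(size=(n, 4))
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(0, 0.1, n)

# The L1 penalty (scaled by alpha) drives the irrelevant
# coefficients to (near) zero while shrinking the useful ones.
model = Lasso(alpha=0.1)
model.fit(X, y)
coefs = model.coef_
```

Inspecting `coefs` shows the noise features suppressed toward zero while the informative coefficients remain large, which is the variable-selection property that distinguishes lasso from plain linear regression.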
With the results, many conclusions can be drawn, since random forest outputs numeric feature importances. The analysis will be run multiple times and averaged to obtain as accurate a result as possible.
Data about the temperature and the CO2 concentration over the past 70 years were plotted on a graph (
the concentration of carbon dioxide also correlates with the temperature graph, suggesting that they are related and that it may be a large cause of the warming of the Earth.
To further verify the relationship between carbon dioxide and temperature, as shown in
The data collected over the past 800,000 years was randomly split into two even samples, one for training and one for testing. We further employed 8-fold cross-validation during training to search for suitable hyperparameters and prevent the models from overfitting. Then, three different machine learning algorithms were compared: random forest, lasso, and support vector regression. For each algorithm, the hyperparameters were tuned to fit the data and generate accurate training results. The visual results are shown in Figures 3-14. The key hyperparameters, selected from the search ranges by 8-fold cross-validation, are as follows. For random forest: 300 trees, a maximum of 2 features per split, and a minimum of 1 sample per leaf node. For SVR: a penalty C of 2.0 for the error term and a radial basis function kernel, based on cross-validation results. For lasso: a regularization coefficient of 1.0.
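The cross-validated search procedure can be sketched as below, assuming scikit-learn's `GridSearchCV`. The grid values here are a small hypothetical subset, not the full ranges the study searched, and the data is synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(4)

# Synthetic stand-in data: three features, one target.
X = rng.normal(size=(200, 3))
y = X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 0.1, 200)

# Hypothetical search grid; the study searched analogous ranges.
grid = {"n_estimators": [100, 300], "max_features": [1, 2]}

# 8-fold cross-validation scores each combination on held-out folds,
# guarding against hyperparameters that merely memorize the training set.
search = GridSearchCV(RandomForestRegressor(random_state=0), grid,
                      cv=8, scoring="neg_mean_squared_error")
search.fit(X, y)
best = search.best_params_
```

`best_params_` then holds the combination with the lowest cross-validated MSE, which is used to fit the final model on the full training split.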
The resulting predictions were then graphed against the values from the data set. The plots for the two most accurate algorithms are shown below.
We aim to use the trained model to predict temperature given different potential factor values, so our problem is a regression problem. Mean squared error (MSE) measures the average of the squared errors between our model's predictions and the real data and is well suited to regression problems. Other scoring criteria, such as mean absolute error, could also be used, but they yielded no better-fitting models for our problem, so we use MSE to quantify the accuracy of the models employed. The training and testing MSE results for the compared algorithms are shown in
Algorithm | Training MSE | Testing MSE |
---|---|---|
Random Forest | 0.1289 | 0.9557 |
Lasso | 1.8819 | 2.2088 |
SVR | 1.3740 | 1.5267 |
Linear Regression | 1.8796 | 2.2689 |
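For reference, the MSE criterion used in the table above is simply the average squared gap between prediction and truth:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average squared difference between
    the predicted and observed values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))
```

Because the errors are squared, large misses are penalized much more heavily than small ones, which is why the gap between random forest and the linear models in the table is so pronounced.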
Feature | Relative Importance |
---|---|
CO2 PPM | 0.6598 |
CH4 PPB | 0.2795 |
N2O PPB | 0.0607 |
As visible from the feature importance chart in
Carbon dioxide is a major factor in determining air temperature. This means that the carbon dioxide humans (and not nature) are putting into the air is contributing substantially to the changes in temperature [
We showed that increases in the concentrations of these three gases contribute to global warming. As the IPCC noted, the effects of global warming can be catastrophic [
As evident from the first part of the results, there is an upward trend in temperature, which correlates with the upward trend in CO2 concentration. The correlation analysis between the concentration of CO2 and the temperature further supports that the increase in CO2 concentration is driving the temperature rise.
Afterward, we compared different machine learning algorithms in predicting the temperature from the concentrations of three gases: CO2, CH4, and N2O. It is apparent that random forest is by far the most accurate of the algorithms tested. With more features and more training data, it can become even more accurate and serve as a useful model of temperature change. This means that, given predicted future levels of CO2, CH4, N2O, and any other features it is trained on, random forest can accurately predict the temperature.
The feature importance data gathered from random forest also tells an important story. In our study, we show that CO2 dominates the global temperature changes, but it is important to note that the unit of CH4 and N2O is ppb while the unit of CO2 is ppm, which indicates that the effect of CH4 and N2O should never be underestimated.
In the current work, only three factors are considered; other contributing factors to temperature change, such as atmospheric circulation, ocean currents, and biodiversity, can be considered in the future. We compared four machine learning algorithms that have been proven to provide satisfactory performance in many cases. However, other machine learning algorithms, especially ensemble-based algorithms such as XGBoost, as well as neural networks, can also be investigated in future work in search of better models.
The authors declare no conflicts of interest regarding the publication of this paper.
Zheng, H. (2018) Analysis of Global Warming Using Machine Learning. Computational Water, Energy, and Environmental Engineering, 7, 127-141. https://doi.org/10.4236/cweee.2018.73009