This research applies network structuring theories to the aviation domain and predicts aviation network growth, considering a flight connection between airports as a link between nodes. Our link prediction approach is based on network structure information, and to improve prediction accuracy, it is necessary to estimate the mechanism of aviation network growth. This research critically evaluates the prediction accuracy of two methods: the receiver operating characteristic curve method (ROC) and the logistic regression method. We propose a four-step method to evaluate the relative predictive accuracy among different link prediction methods. A case study of US aviation networks indicated that the ROC method provided better prediction accuracy compared with the logistic regression method. This result suggests that tuning of the prediction distribution and the regression model coefficients can further improve the accuracy of the logistic regression method.
In recent years, the number of air passengers has been increasing, with the worldwide annual number of passengers up by approximately 34% from 2010 to 2014 [
Our focus in this study is to predict how the aviation network will change in the future to accommodate increased demand. In addition to being of importance for the industrial aspects of aviation, predictive tools for the evolution of aviation are also critical to improve our understanding of environmental and social impacts. In terms of advantages to industry, accurate and concrete forecasting of flight demands, including passenger fluctuation, will allow airlines to efficiently plan the frequency of flights, select the appropriate size of aircraft, and optimize flight plans for each airport. Additionally, a forecast of this type may assist aircraft manufacturers in the design of future development plans [
Many studies have been conducted to forecast the aviation network based on estimates of the demand for a given air route and to evaluate the impacts caused by change in demand [
Quantitative studies have, however, been conducted to investigate the characteristics of aviation networks and to consider network changes in terms of the network structure. In these studies, airports are regarded as nodes, and airlines and the number of flights or the number of passengers above a threshold are regarded as links.
Analysis of the global aviation network structure by Guimera et al. determined that the network is scale-free and small-world, and that community structure is best explained from the point of view of geopolitical considerations in cities that have airports [
Bonnefoy and Hansman predicted the influence of very light jets (VLJs) by considering the overlaps in performance and capability between light jets and VLJs. The authors proposed a method for network structure analysis and a resultant network growth model [
Conversely, other works have suggested that the existing prediction methods are unable to sufficiently explain the real-world variation in aviation networks and have proposed other predictive methods based on complex network analysis. Kotegawa et al. attempted to predict future aviation networks utilizing three prediction methods and prediction measures based on the network structure [
The results of these studies imply that the structural characteristics and the growth process in the current aviation network affect the future network. For example, according to Bonnefoy and Hansman, the scale-free characteristics and the growth limits of hub airports provide an estimate of the future structure of the aviation network. The network structure described by Sawai and Sato and the robustness optimization described by Wei et al. will likely affect the reconstruction of a next-generation aviation network. Furthermore, the methods and measures proposed by Kotegawa et al. provide directly useful estimates of the variation of future aviation networks.
However, the work of Bonnefoy and Hansman relied on a single network growth model that was neither compared with other models nor validated. The other studies mentioned above have difficulties of their own: the method of Sawai and Sato is somewhat unfeasible, and it is uncertain whether the method of Wei et al., which employed a relatively small network of sixteen nodes, can be extended to larger networks. Furthermore, the predictive power of the approach of Kotegawa et al. is limited by low accuracy because multi-year data were not employed.
Based on the above discussion, in the current study, we aim to improve the prediction accuracy of the future aviation network by the method of link prediction coupled with predictive measures calculated from the network structure.
First, link prediction was conducted according to two methods that utilize the measures introduced in Section 2.3. These measures are calculated from the network structure to identify missing links, and to determine which measure achieves the best prediction accuracy and the highest contribution. Next, the growth mechanism was estimated based on the following hypothesis: network growth depends on the measures that have high contributions. Furthermore, the factors that change the network structure were analyzed.
To compare results, we applied a four-step method that is popular in traffic engineering. This method incorporates population, income, and other statistical data as measures.
In the current study, we analyzed the annual variation of the aviation network in the US.
The data supplied by the Bureau of Transportation Statistics (BTS) [
The data set consists of scheduled departures, performed departures, passenger numbers, the origin and destination, including the distance between these locations, and the aircraft type, year, and class. (Although additional data are also available, we utilized these annual data.) The names of airports are according to the 3 letter abbreviations provided by the International Air Transport Association.
Departures scheduled | Departures performed | Passengers | Distance | Origin | Destination | Aircraft type | Year | Class |
---|---|---|---|---|---|---|---|---|
0 | 2 | 0 | 210 | ATK | SCC | 556 | 2014 | P |
0 | 1 | 0 | 677 | DQH | ENA | 556 | 2014 | P |
0 | 2 | 0 | 59 | DQH | SCC | 556 | 2014 | P |
0 | 1 | 0 | 59 | ENA | ANC | 556 | 2014 | P |
0 | 1 | 0 | 203 | FAI | HUS | 556 | 2014 | P |
0 | 1 | 0 | 277 | FVQ | ANC | 556 | 2014 | P |
0 | 2 | 0 | 329 | GAL | ANC | 556 | 2014 | P |
0 | 1 | 0 | 362 | HUS | ANC | 556 | 2014 | P |
0 | 1 | 0 | 268 | OTZ | AIN | 556 | 2014 | P |
0 | 1 | 0 | 466 | PHO | SCC | 556 | 2014 | P |
0 | 1 | 0 | 269 | SCC | AIN | 556 | 2014 | P |
0 | 1 | 0 | 627 | SCC | ANC | 556 | 2014 | P |
0 | 3 | 0 | 210 | SCC | ATK | 556 | 2014 | P |
0 | 2 | 0 | 59 | SCC | DQH | 556 | 2014 | P |
0 | 1 | 0 | 466 | SCC | PHO | 556 | 2014 | P |
0 | 1 | 0 | 765 | STG | ANC | 556 | 2014 | P |
0 | 1 | 15 | 151 | ABI | MAF | 676 | 2014 | F |
0 | 1 | 48 | 1062 | ABQ | CMI | 631 | 2014 | F |
0 | 8 | 290 | 285 | AEX | DFW | 676 | 2014 | F |
For example, the first record (
In total, 436,559 records were obtained for 2014.
To begin, we collected the data for the number of flights performed and the number of passengers between any two given airports per year. We combined data from different aircraft types or class, which are otherwise separated in the records.
Next, we created an adjacency matrix of the airports. Thresholds were defined for the number of flights and the number of passengers. In this study, two airports are linked if there are flights in both directions with a passenger count above the threshold. That is to say, the absence of a link between a node pair (two airports) indicates either that the number of flights is below the threshold or that there are no flights. A predicted link indicates that the number of flights is expected to be above the threshold according to the prediction.
Links were not weighted in the current work. Because a link requires traffic in both directions, the subject network is an undirected network, and link prediction is applied to this network.
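The network construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the threshold is applied to performed departures in each direction, and the function name `build_links`, the toy records, and the threshold value are hypothetical.

```python
from collections import defaultdict

def build_links(records, flight_threshold):
    """Return undirected links as a set of frozenset({airport_a, airport_b}).

    records: iterable of (origin, destination, departures_performed) tuples.
    """
    # Sum departures per directed airport pair, merging aircraft types and classes.
    flights = defaultdict(int)
    for origin, destination, departures in records:
        flights[(origin, destination)] += departures

    links = set()
    for (origin, destination), count in flights.items():
        if origin == destination:
            continue
        reverse = flights.get((destination, origin), 0)
        # Assumption: both directions must exceed the threshold.
        if count > flight_threshold and reverse > flight_threshold:
            links.add(frozenset((origin, destination)))
    return links


# Toy records (not BTS data); a threshold of 3 is used only for illustration.
records = [
    ("SCC", "ATK", 3), ("SCC", "ATK", 2),  # two aircraft types on one route
    ("ATK", "SCC", 4),
    ("SCC", "ANC", 9),                     # no return flights recorded
]
links = build_links(records, flight_threshold=3)
```

Storing each link as a `frozenset` makes the pair order irrelevant, which matches the undirected, unweighted network used in this study.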
In this study, according to Zhou et al. [
The similarity between node $x$ and node $y$ is represented by the score $s_{xy}$. For example, PA, one of the measures, is calculated as $s_{xy} = k_x k_y$, where $k_x$ is the degree of node $x$ and $k_y$ is the degree of node $y$. A higher score indicates a greater possibility that a link exists between node $x$ and node $y$.
In this study, eleven measures were used to analyze the network: JI, PA, CN, SP, Sal, Sør, HPI, HDI, LHN, AA, and RA.
1) Shortest Path (SP)
The SP measure is defined as the inverse of $L_{xy}$, where $L_{xy}$ is the vertex distance between node $x$ and node $y$, as shown in Equation (1). When no connection exists, $s_{xy}$ is 0. This measure is based on the hypothesis that two airports are more likely to be connected when as few intermediate (hub) airports as possible lie between them.
2) Common Neighbors (CN)
The CN measure is defined as the number of common nodes between node x and node y as shown in Equation (2). This measure is created based on the hypothesis that the more airports two nodes have in common, the more likely they are to be connected. Here for node x, let Γ(x) denote the set of neighbors of x.
3) Salton Index (Sal)
The Sal measure is defined as the score obtained by dividing the CN measure by the geometric mean of the node pair degrees as shown in Equation (3).
4) Jaccard Index (JI)
The JI measure is defined as the score obtained by normalizing the CN measure by the union of the adjacent nodes to the node pair as shown in Equation (4).
5) Sørensen Index (Sør)
The Sør measure is defined as the score obtained by normalizing the CN measure by the arithmetic mean of the node pair degrees as shown in Equation (5).
6) Hub Promoted Index (HPI)
The HPI measure is defined as the score obtained by dividing the CN measure by the lower of the two node degrees as shown in Equation (6). When the link is adjacent to a hub node, the score tends to be higher because the denominator is determined by the lower degree only. The name of this measure is derived from this attribute.
7) Hub Depressed Index (HDI)
The HDI measure is similar to the HPI. In contrast to the HPI, the higher of the two node degrees is used as the denominator as shown in Equation (7). The score tends to be lower when links are adjacent to a hub node.
8) Leicht-Holme-Newman Index (LHN)
The LHN measure is defined as the score obtained by dividing the CN measure by the product of the node pair degrees as shown in Equation (8). Although this measure appears similar to the Sal, the score tends to be lower when both degrees of the node pair are higher even if all adjacent nodes are common.
9) Preferential Attachment (PA)
The PA measure is defined as the product of node pair degrees as shown in Equation (9). This measure is created based on the hypothesis that the more airports connected to a given airport, the more likely it is to be connected to other airports.
10) Adamic-Adar Index (AA)
The AA measure is defined as the sum of the inverse of the logarithm of the common node degrees. That is to say, the lower degree node of the common adjacent nodes exerts a higher impact on the score as shown in Equation (10). Here, z indicates the common adjacent node of the node pair x and y.
11) Resource Allocation Index (RA)
The RA measure was proposed by Zhou et al. [
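The eleven measures listed above can be computed for a node pair as in the following sketch. It follows the verbal definitions in this section and the standard forms of these indices. The function names, the convention of returning 0 when a denominator is zero, and the toy network are assumptions made for illustration, and the pair is assumed to consist of two distinct nodes.

```python
import math
from collections import deque

def vertex_distance(adj, x, y):
    """Number of links on a shortest path from x to y (None if unreachable)."""
    seen, queue = {x}, deque([(x, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == y:
            return dist
        for neighbor in adj[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None

def safe_div(numerator, denominator):
    # Convention assumed here: a zero denominator gives a score of 0.
    return numerator / denominator if denominator else 0.0

def similarity_scores(adj, x, y):
    """adj maps each node to the set of its neighbors; x and y must differ."""
    kx, ky = len(adj[x]), len(adj[y])
    common = adj[x] & adj[y]
    cn = len(common)
    dist = vertex_distance(adj, x, y)
    return {
        "SP": 1.0 / dist if dist else 0.0,
        "CN": cn,
        "Sal": safe_div(cn, math.sqrt(kx * ky)),
        "JI": safe_div(cn, len(adj[x] | adj[y])),
        "Sør": safe_div(2 * cn, kx + ky),
        "HPI": safe_div(cn, min(kx, ky)),
        "HDI": safe_div(cn, max(kx, ky)),
        "LHN": safe_div(cn, kx * ky),
        "PA": kx * ky,
        # A common neighbor z of two distinct nodes has degree >= 2, so log(k_z) > 0.
        "AA": sum(1.0 / math.log(len(adj[z])) for z in common),
        "RA": sum(1.0 / len(adj[z]) for z in common),
    }

# Toy network with links A-B, A-C, B-C, and C-D.
adj = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
scores = similarity_scores(adj, "A", "D")
```

For the unconnected pair (A, D) in this toy network, the only common neighbor is C, which has degree 3, so AA and RA depend on that single degree while PA depends only on the degrees of A and D.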
1) ROC curve method
The method of Zhou et al. [
The ROC curve was then calculated based on the data presented in
Node 1 | Node 2 | Measure (PA) | Link existence |
---|---|---|---|
3 | 5 | 9 | T |
2 | 5 | 9 | T |
2 | 3 | 9 | F |
5 | 6 | 6 | T |
3 | 6 | 6 | T |
・・・ | ・・・ | ・・・ | ・・・ |
・・・ | ・・・ | ・・・ | ・・・ |
4 | 6 | 2 | F |
4 | 1 | 2 | F |
A threshold is set on the score of the measure, and the corresponding (X, Y) coordinates are plotted. The ROC curve is the set of points obtained by varying this threshold.
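The plotting step can be sketched as follows, under two assumptions that the text does not spell out: a node pair is predicted to be linked when its score is at or above the threshold, and the axes follow the usual ROC convention (X is the false positive rate, Y is the true positive rate). The function name `roc_points` is hypothetical, and the input uses only the rows visible in the example table above, with the elided rows omitted.

```python
def roc_points(scored_pairs):
    """scored_pairs: list of (score, link_exists) tuples.

    Returns the (X, Y) points of the ROC curve, one per distinct threshold,
    starting from (0, 0). Assumes at least one positive and one negative.
    """
    positives = sum(1 for _, linked in scored_pairs if linked)
    negatives = len(scored_pairs) - positives
    points = [(0.0, 0.0)]
    # Sweep the threshold from the highest score to the lowest.
    for threshold in sorted({score for score, _ in scored_pairs}, reverse=True):
        tp = sum(1 for s, linked in scored_pairs if s >= threshold and linked)
        fp = sum(1 for s, linked in scored_pairs if s >= threshold and not linked)
        points.append((fp / negatives, tp / positives))
    return points

# (PA score, link existence) for the visible rows of the example table.
scored = [(9, True), (9, True), (9, False), (6, True), (6, True),
          (2, False), (2, False)]
points = roc_points(scored)
```

Each distinct score contributes one point, so tied scores such as the three pairs with PA = 9 move the curve diagonally in a single step.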
A perpendicular is dropped to the X-axis from a given point on the ROC curve, and the area under the curve to the right of this perpendicular is calculated.
Point [A] in
2) Logistic regression and measures method
Kotegawa et al. conducted link prediction by logistic regression, which takes into account node degrees, cluster coefficients, weights, and the difference between weights as explanatory variables [
a) Detect the node pairs without connection in each year’s network.
b) For these node pairs, calculate the values of the eleven prediction measures, which are explanatory variables.
c) Place the link status of these node pairs for the upcoming year in the objective variable y. If the status of the node pair is T, y is equal to 1 and if the status is F, y is equal to 0.
d) Build the logistic regression model and conduct the link prediction.
The following provides additional details to explain step d). $x_i$ denotes the eleven prediction measures and $x$ is the set of explanatory variables.
and p gives the probability of the objective variable y being equal to 1, and is defined as
The above equation is transformed to the linear regression model by a logit transformation as shown below.
A logistic regression model is built for each year using data from the previous year to predict the link status. For example, for the link prediction from 2009 to 2010, x gives the eleven prediction measures of the node pairs that were not connected in 2008 and y gives the statuses of those node pairs in 2009. The logistic regression model is then built based on these values, and the 2010 link status is predicted by applying the eleven measures of the node pairs that were not connected in 2009 to this model.
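The following sketch illustrates two pieces of the procedure above: the labeling of unconnected node pairs with next year's link status (steps a and c), and the relationship between the logistic function and the logit transformation. It is an illustration under stated assumptions, not the authors' code. The function names are hypothetical, and the standard forms $p = 1/(1 + e^{-z})$ and $\mathrm{logit}(p) = \ln(p/(1-p))$ are assumed, where $z$ is the linear combination of the explanatory variables.

```python
import math
from itertools import combinations

def logistic(z):
    """Probability p that the objective variable y equals 1."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Inverse of the logistic function; linear in the explanatory variables."""
    return math.log(p / (1.0 - p))

def label_unconnected_pairs(nodes, links_now, links_next):
    """Map each pair unconnected this year to y = 1 (linked next year) or 0."""
    labels = {}
    for x, y in combinations(sorted(nodes), 2):
        pair = frozenset((x, y))
        if pair not in links_now:
            labels[pair] = 1 if pair in links_next else 0
    return labels

# Toy example: A-B exists this year; A-C appears next year; B-C never appears.
nodes = {"A", "B", "C"}
links_now = {frozenset(("A", "B"))}
links_next = {frozenset(("A", "B")), frozenset(("A", "C"))}
labels = label_unconnected_pairs(nodes, links_now, links_next)
```

Pairs that are already connected this year are excluded from the training data, which mirrors step a) and explains why most labels are 0 when few new links appear in a year.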
3) Utilization of the four-step method
The four-step method regards the traffic flow as the movement between zones and predicts the future Origin-Destination (OD) matrix from the current OD matrix based on the two amounts of traffic, which are described as the traffic moving AWAY from a certain zone (generated traffic amount) and the traffic moving INTO a certain zone (attracted traffic amount) [
We show the main procedure for the four-step method below and refer to the original paper where it is described [
a) Traffic is sorted according to the origin and the destination to create the OD matrix.
b) The future generated and attracted traffic amounts are predicted through application of the linear regression model.
c) The calculation is repeated using the Frater method until convergence occurs.
Step b) can be further explained as follows.
Here, $G_i$ and $A_i$ are the generated and attracted traffic amounts in zone $i$, respectively. For each zone, $G_i$ and $A_i$ are calculated according to the linear regression model shown below.
In this paper, the following four explanatory variables, $X_{mi}$, were employed:
$X_{1i}$: The employed population in the state to which airport $i$ belongs.
$X_{2i}$: The per-capita disposable income in the state to which airport $i$ belongs.
$X_{3i}$: The GDP of the state to which airport $i$ belongs.
$X_{4i}$: Whether a Northeast Corridor station exists in the state to which airport $i$ belongs.
$X_{4i}$ is a dummy variable that equals 1 when a Northeast Corridor station exists in the state to which airport $i$ belongs. The Northeast Corridor is a railway line along the east coast of the United States. A majority of east coast states have a station on this line. We take this variable into consideration because it is assumed that the option of traveling by rail affects the utility of air travel.
The OD matrix for future flights was generated by the four-step method, where the number of flights was regarded in terms of traffic flow. Next, we considered airport pairs as pairs that will connect if the number of flights was above the threshold in both directions.
When we consider the presence of links as events, link prediction becomes a binary classification problem. The F-value is generally used to measure accuracy for this type of problem, and therefore we also employ it in the current research.
Depending on whether the prediction is positive or negative and whether the fact is positive or negative, each prediction result is classified into one of four groups (TP: true positive, FP: false positive, FN: false negative, and TN: true negative), as shown in
In this paper, positive and negative values indicate that node pairs are either connected or not connected, respectively. Therefore, TP, FP, FN, and TN describe the following scenarios:
TP: Node pair is predicted to be connected and is actually connected.
FP: Node pair is predicted to be connected but is actually not connected.
FN: Node pair is predicted to remain unconnected but actually becomes connected.
TN: Node pair is predicted to remain unconnected and in fact remains unconnected.
Here, we define a as the number of TP, b as the number of FP, c as the number of FN, and d as the number of TN. Additionally, PAG represents Precision, and POD represents Recall. PAG and POD are defined according to Equations (18) and (19), respectively.
Furthermore, F as defined in Equation (20) gives the F-value, which indicates the harmonic mean of Precision and Recall.
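As a worked illustration of these three quantities, the sketch below assumes the standard definitions PAG = a / (a + b), POD = a / (a + c), and F = 2 · PAG · POD / (PAG + POD). The function name, the convention of returning 0 when a denominator is zero, and the example counts are assumptions made for illustration and are not results from this study.

```python
def precision_recall_f(a, b, c):
    """a = TP, b = FP, c = FN (TN is not needed for these three quantities)."""
    precision = a / (a + b) if (a + b) else 0.0  # PAG
    recall = a / (a + c) if (a + c) else 0.0     # POD
    total = precision + recall
    # The F-value is the harmonic mean of Precision and Recall.
    f_value = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_value

# Hypothetical counts, chosen only for illustration.
pag, pod, f_value = precision_recall_f(a=8, b=92, c=4)
```

Because d (the number of TN) does not enter any of the three formulas, a large number of correctly predicted unconnected pairs does not raise the F-value, which is why this measure suits link prediction on sparse networks.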
The network was based on the method described in Section 2.2. Here, links are composed of node pairs (two airports) with more than 3650 flights. An example diagram of the network is presented in
For the case study employed in this work, when the number of flights per year between two given airports is below 3650, we consider that no connection exists, and a predicted link indicates that the number of flights exceeds the threshold of 3650.

Prediction/Fact | True (1) | False (0) |
---|---|---|
True (1) | TP | FP |
False (0) | FN | TN |
1) Prediction results based on ROC curve
In terms of the average F-values, the CN value is the highest, and the values of PA, AA, and RA are also high. Additionally, the F-value of the CN measure remains the highest throughout the years. This indicates that it is possible to predict whether two airports will be connected in the future by investigating whether or not they share many common airports. Furthermore, because the F-values of AA and RA are high, it is assumed that low degrees of the common airports make it more likely that two airports will be connected in the future.
In terms of the average Precision, the CN measure has the highest score, followed by PA, AA, and RA in the order listed, similar to the previously mentioned trend in the F-value. On the other hand, the Recall values for the SP and LHN measures are higher than those for the other measures. The average Recall for SP is 0.974. This result means that, for most of the airport pairs that will be connected in the future, the vertex distance is 2. Equivalently, the two airports share a common neighboring airport to which each is linked by more than 3650 flights. As LHN is defined as the score obtained by dividing the number of common nodes by the product of the node pair degrees, LHN is high when the product of the node pair degrees is low. In that case, neither airport of the pair is a hub airport with a high degree. That is to say, two airports are likely to become connected when both of them have low degrees.

Measure / Year | 08_09 | 09_10 | 10_11 | 11_12 | 12_13 | 13_14 | Average |
---|---|---|---|---|---|---|---|
JI | 0.01986755 | 0.019933555 | 0.013303769 | 0.013003901 | 0.018592297 | 0.026785714 | 0.018581131 |
PA | 0.044897959 | 0.039312039 | 0.030769231 | 0.027548209 | 0.062111801 | 0.055384615 | 0.043337309 |
CN | 0.080808081 | 0.080924855 | 0.064864865 | 0.045454545 | 0.075 | 0.096256684 | 0.073884839 |
SP | 0.013832853 | 0.013221154 | 0.006810443 | 0.011420414 | 0.023880597 | 0.016793893 | 0.014326559 |
Sal | 0.017424976 | 0.014522822 | 0.009960159 | 0.007168459 | 0.017094017 | 0.025940337 | 0.015351795 |
Sør | 0.01986755 | 0.019933555 | 0.013303769 | 0.013003901 | 0.018592297 | 0.026785714 | 0.018581131 |
HPI | 0.01659751 | 0.012311902 | 0.005191434 | 0.011744966 | 0.011173184 | 0.019555556 | 0.012762425 |
HDI | 0.018120045 | 0.017094017 | 0.013559322 | 0.014577259 | 0.02173913 | 0.022764228 | 0.017975667 |
LHN | 0.014449127 | 0.013767209 | 0.00702165 | 0.011851852 | 0.02143951 | 0.017120623 | 0.014274995 |
AA | 0.055276382 | 0.052023121 | 0.033707865 | 0.023121387 | 0.056782334 | 0.055214724 | 0.046020969 |
RA | 0.047930283 | 0.046875 | 0.03 | 0.022988506 | 0.038781163 | 0.062695925 | 0.041545146 |
2) Results from logistic regression method
The accuracy of the logistic regression prediction of TP (i.e., a node pair that is predicted to be connected and is actually connected) is low, the reason for which will be explained below. As an example,
The data used in this paper are imbalanced: the number of airport pairs whose flight count exceeds the threshold (3650) is much smaller than the number of pairs below the threshold. It is known that when the logistic regression model is applied to such imbalanced data, the data suggesting that the node pairs will remain unconnected strongly affect the result. This is likely the main reason for the low accuracy of TP. In such cases, weighting or other adjustments are generally applied. However, in this paper, we do not apply such adjustments because the method of Kotegawa et al., the basis of the current method, did not apply any adjustments.

Prediction/Fact | True (1) | False (0) |
---|---|---|
True (1) | 0 | 0 |
False (0) | 12 | 3338 |

On the other hand, we can estimate which measures might have impacts on the prediction from the partial regression coefficients in the table below. For example, the partial regression coefficient of the CN measure is high. We examine this result in the next section.

JI | PA | CN | SP | Sal | Sør | HPI | HDI | LHN | AA | RA |
---|---|---|---|---|---|---|---|---|---|---|
−0.1220 | −0.1732 | 0.4830 | 0.3005 | −0.3473 | −0.2362 | 0.0571 | 0.3686 | 0.2599 | 0.1927 | 0.1087 |
3) Results from the four-step method
In general, the four-step method predicts traffic based on traffic engineering. As mentioned earlier, the four-step method does not use the network structure. The link prediction goal in this work is to predict whether the number of flights will be above the threshold. Therefore, we can regard our target as a prediction of the traffic amount, and compare our results with the four-step method prediction.
The distribution of predictions obtained by the four-step method is shown in
We first compare the result of the ROC curve method with the logistic regression model method. Although imbalanced data was applied in the ROC curve method, the impact is expected to be small. Specifically, it is noteworthy that Precision is high.
On the other hand, the CN value contributes to the TP in the ROC curve method and contributes to the TN (i.e., a correct prediction that an unconnected node pair will remain unconnected) in the logistic regression model. Based on these results, we examine the prediction of the aviation network from the CN, PA, and AA values, which have high accuracy in the case of the ROC curve, and obtain the following insights:
A node pair (two airports) that will connect in the future (i.e., the number of flights will increase) possesses three main characteristics in accordance with the definition of the CN, PA, and AA.
1) The product of node pair degrees is high.
2) The node pair has many common nodes.
3) The common nodes have low degrees.
Prediction/Fact | True (1) | False (0) |
---|---|---|
True (1) | 0 | 99 |
False (0) | 9 | 2320 |
Traffic amount | Coefficient of determination |
---|---|
G | 0.000963 |
A | 0.002016 |
Next, we compare the result of the ROC curve method with that of the four-step method.
The four-step method needs statistical measures, such as population or income per airport, to generate the OD matrix and linear regression model. In this paper, we use relatively accessible data, such as the employed population, per-capita disposable income, the GDP, and whether a northeast corridor rail station exists or not in the state where the airport is located. The result may differ depending on the data selected. Therefore, we cannot conclude that the prediction by the four-step method is significantly inferior to the other methods in this study.
However, the ROC curve method can predict links without such data. The result of this study indicates that the ROC curve method has an advantage because it only requires measures from the network structure to predict links.
In this study, we highlight that the accurate prediction of future aviation networks is important because of industrial, environmental, and social aspects. Additionally, we considered aviation networks as network structures and tried to forecast their future development using link prediction, with measures based on the network structure.
First, we defined the prediction measures based on the similarity of the node pairs calculated only from the network structure. Then, we created two methods to utilize those measures: the ROC curve method and the logistic regression model method.
Next, we calculated the measures that achieve the highest prediction accuracy and contribution, and determined the growth mechanism of the aviation networks based on these measures.
As a case study, we applied our prediction method to the aviation networks in the US by creating a network of the number of flights.
In the link prediction for this aviation network, the accuracies of the CN, PA, and AA values were high in the ROC curve method. The CN measure contributed to the TN (unconnected node pair predicted to remain unconnected) in the logistic regression model method. We identified three characteristics of a node pair (two airports) between which flights increase: 1) the product of node pair degrees is high, 2) the node pair has many common nodes, and 3) the common nodes have low degrees.
Furthermore, we determined that the ROC curve method has advantages compared with the result of link prediction based on the four-step method.
We describe what we consider to be the novelty and utility of our work below.
The basis of the link prediction of the ROC curve method is the same as that employed by Zhou et al. to locate missing links. However, the purpose of that study was to find the missing links, whereas we employed this method to generate predictions and interpret their relevance. The novelty in the current approach is that we demonstrate that it is possible to predict links and to connect this prediction to the growth mechanism of the network.
The link prediction based on the logistic regression model, as was employed in this work, is an expansion of the method of Kotegawa et al. While the explanatory variables used by Kotegawa et al. were unable to explain the mechanism of network generation, the explanatory variables selected in the current work were able to do so, suggesting that our method is advantageous in this respect.
A comparison of the link prediction results by the ROC curve and the four-step method indicates that the ROC curve method is better since it only requires measures from the network structure to predict the links and it achieves a certain level of accuracy.
In the future, we plan on further exploring the following aspects:
・ To narrow down the number of predictions to achieve improved Precision in the ROC curve method.
・ To create a way to evaluate the accuracy for imbalanced data, in which the number of positives and the number of negatives are significantly different.
・ To extend the models to predict the disappearance of links.
Takahashi, Y., Osawa, R. and Shirayama, S. (2017) A Basic Study of the Forecast of Air Transportation Networks Using Different Forecasting Methods. Journal of Data Analysis and Information Processing, 5, 49-66. https://doi.org/10.4236/jdaip.2017.52004