Nowadays, one of the IT challenges faced by many enterprises is the maintenance of their legacy systems and the migration of those systems to modern and flexible platforms. In this paper, we study the network properties of software call graphs and use network theory to understand the business logic of legacy systems. The call graphs turn out to exhibit approximately scale-free and small-world network properties. This finding provides new insight into the business logic of legacy systems: the methods in a program can be naturally partitioned into a business methods group and a supportive methods group. Moreover, the result is also very helpful in reusing valuable functionality and identifying which services should be exposed in the migration from legacy systems to a modern SOA context.

In today’s Internet-driven economy, one of the IT challenges faced by many enterprises is the maintenance of their legacy systems and the migration of those systems to modern and flexible platforms [

A call-graph is one kind of internal graph structure of a software program and reflects the essential function and behavior of the program. It is a directed graph G = (N, E), which may contain loops, where N is the set of nodes, which represent methods, and E is the set of edges, which represent invocation relations between methods. For every node

・ They both have the same static structure.

In software programs, the caller method invokes the callee method and the callee method is invoked by the caller method; in the network, the source web page links to the target web page and the target web page is linked to by the source web page. Both are directed connection relationships within different systems.

・ They both perform their functions through dynamic connection

In software programs, different modules invoke one another step by step to provide computing capability; in the network, different web pages are dynamically linked together to provide an information service. Both reveal their essential function through their runtime characteristics. This characteristic has already been used in Google web crawling, which analyzes the link relationships among web pages.

The most interesting observation in the empirical analyses is that software call graphs contain a few key nodes whose in-degree or out-degree is above average. Re-checking our test cases, we found that the nodes with above-average out-degree correspond to methods that provide high-level business functions, and the nodes with above-average in-degree correspond to methods that provide low-level supportive functions. For example, the method init() has an above-average out-degree and performs the business function “initialization” for the whole system; the methods initDB(), initCache() and buildConc(), which are invoked by init(), have high in-degree and provide supportive functions such as initializing the database, clearing the cache and building socket connections. These key methods provide new insight into the business logic of legacy systems: the methods in a program can be naturally partitioned into a business methods group and a supportive methods group. Moreover, the result is also very helpful in reusing valuable functionality and identifying which services should be exposed in the migration from legacy systems to an SOA context.
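The partition described above can be illustrated with a small sketch. It is not the Spotglitter implementation; it only assumes that a call graph is available as a list of (caller, callee) pairs. Apart from init(), initDB(), initCache() and buildConc(), every method name below is hypothetical and was invented so that the supportive methods have more than one caller.

```python
from collections import defaultdict

# Hypothetical call edges (caller, callee). Only init, initDB, initCache and
# buildConc come from the text; the remaining names are made up.
CALLS = [
    ("init", "initDB"), ("init", "initCache"), ("init", "buildConc"),
    ("init", "loadConfig"),
    ("reload", "initDB"), ("reload", "initCache"),
    ("reconnect", "buildConc"), ("reconnect", "initDB"),
    ("main", "init"),
]

def degrees(edges):
    """Return the node set plus in-degree and out-degree maps."""
    in_deg, out_deg, nodes = defaultdict(int), defaultdict(int), set()
    for caller, callee in edges:
        out_deg[caller] += 1
        in_deg[callee] += 1
        nodes.update((caller, callee))
    return nodes, in_deg, out_deg

def classify(edges):
    """Split methods into business (high out-degree) and supportive (high in-degree)."""
    nodes, in_deg, out_deg = degrees(edges)
    # Every edge adds one to some in-degree and one to some out-degree,
    # so both averages equal |E| / |N|.
    avg = len(edges) / len(nodes)
    business = sorted(n for n in nodes if out_deg[n] > avg)
    supportive = sorted(n for n in nodes if in_deg[n] > avg)
    return business, supportive
```

On this toy graph, `classify(CALLS)` places init, reconnect and reload in the business group and buildConc, initCache and initDB in the supportive group.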

The rest of this paper is structured as follows: Section II explains our approach; Section IV presents three empirical analyses that investigate the network properties of software call graphs; Section V presents the findings as lessons learned; Section VI concludes the paper with some potential future work.

We generate and analyze call-graph by static program analysis [

We use the Java programming language as the target language and analyze ten widely used Java programs, whose code is publicly available and can be downloaded from open-source websites. They are listed as follows

We implement our analysis tool “Spotglitter” as a plugin for Eclipse. The tool is based on T.J. Watson Libraries for Analysis (WALA) [

Because a call-graph is a directed graph, where an invocation relationship corresponds to a directed link pointing from the caller method to the callee method, in this empirical analysis we explored the in-degree distribution and the out-degree distribution separately in order to give an exact analysis of the invocation relationships underlying software programs. The results are illustrated in

From the results, we observed that both the in-degree distribution and the out-degree distribution can be approximately characterized by the following algebraic scaling behavior:

$$P(k) \sim k^{-\gamma}$$

where k is the variable that measures the number of links at different nodes and γ is the scaling exponent. We calculate the mathematical expectation and variance for the ten programs, the scaling exponent γ in in-degree distribution (in

Programs | Comments | Number of methods in the program |
---|---|---|
JBPM [ | A powerful workflow and BPM engine to create and analyze business processes. | 1418 |
SableCC [ | An object-oriented framework to generate compilers and interpreters in Java. | 2191 |
JUNG [ | A software library that provides a common and extensible language for the modeling, analysis, and visualization of data. | 1973 |
JGraph [ | A powerful graph component available for Java. | 1278 |
Azureus [ | A Java BitTorrent client. | 12,942 |
Apache James [ | Java SMTP and POP3 mail server and NNTP news server. | 2127 |
Java PetStore [ | A sample application that demonstrates how to use the J2EE 1.3 platform. | 1894 |
Damls-Matcher [ | An ontology toolkit providing semantic matchmaking for web services based on DAML-S. | 337 |
JTB [ | A syntax tree builder to be used with the JavaCC parser generator. | 1126 |
LGMA [ | A grid network environment demo. | 298 |

The aim of this experiment is to analyze the nodes distribution in call-graph based on the result in empirical analysis 1. The result is shown in

We observe in

In this empirical analysis, we analyze the degree of clustering and the degree of separation of the call-graph by computing the clustering coefficient and the characteristic path length. The characteristic path length L is defined as the number of links in the shortest path connecting two nodes, averaged over all pairs of nodes in the call-graph; it measures the typical separation between two nodes in the network (a global property). The characteristic path length L can be computed with the Dijkstra algorithm [

Suppose that a node v has k_{v} neighbors; then the clustering coefficient C_{v} of node v is given by the ratio of the existing links E_{v} between its k_{v} first neighbors to the potential number of such ties, k_{v}(k_{v} − 1)/2. By averaging C_{v} over all nodes, one arrives at the clustering coefficient C of the call-graph. We also compare these values to those of random networks with the same number of nodes N. For a given N and μ, where μ is the average number of links per node, the values of the clustering coefficient C and the characteristic path length L of a random network are very small. In particular, for N → ∞ and μ fixed, the characteristic path length in the largest connected component approaches the logarithmic behavior of a Moore graph,

and the clustering coefficient approaches zero,

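The result is listed in the table reporting N, μ, C, C_{rand}, L and L_{rand} for each program. We observe that the clustering coefficient satisfies C ≫ C_{rand}, and the characteristic path length, L ≈ L_{rand}, where C_{rand} and L_{rand} are the respective statistical quantities for a random network with the same parameters N and μ.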
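The two quantities used in this analysis can be computed as in the following sketch. It makes two assumptions that the text does not confirm: the call-graph is treated as undirected, and shortest paths are found by breadth-first search, which gives the same distances as the Dijkstra algorithm when every link has unit length. The function names are illustrative only.

```python
from collections import deque
from itertools import combinations

def undirected_adjacency(edges):
    """Build neighbor sets from (caller, callee) pairs, ignoring self-calls."""
    adj = {}
    for a, b in edges:
        if a == b:
            continue
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def clustering_coefficient(adj):
    """Average over all nodes of C_v = 2 E_v / (k_v (k_v - 1))."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue  # C_v is taken as 0 when a node has fewer than two neighbors
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)

def characteristic_path_length(adj):
    """Mean shortest-path length over all reachable ordered pairs of nodes."""
    total, pairs = 0, 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0
```

For a triangle a, b, c with one extra node d attached to c, the node coefficients are 1, 1, 1/3 and 0, so C = 7/12, and the six pairwise distances sum to 8, so L = 4/3.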

From our preliminary empirical analysis, we propose that the call-graphs generated from software programs show the properties of both scale-free networks [

・ Scale-free network characteristics

Scale-free networks, including the Internet, are characterized by an uneven distribution of connectedness. Instead of having a random pattern of connections, some nodes in these networks act as “very connected” hubs, a fact that dramatically influences the way the network operates. Scale-free networks are characterized by a power-law distribution of node degree (i.e., the number of a node’s nearest neighbors). From empirical analysis 1, we observed that the in-degree and out-degree distributions of call graphs can be approximated by a power law, where the scaling exponent γ of the in-degree distribution is 1.6 and the scaling exponent γ of the out-degree distribution is 2.1 ± 0.1. In other scale-free networks, such as the WWW, social networks, cellular networks and phone call networks, the scaling exponent lies between 2.0 and 3.0. The power-law distribution implies that the structure and dynamics of a scale-free network are strongly affected by a few nodes that account for a great number of connections. This is supported by empirical analysis 2: nearly 20% of the nodes have an out-degree above the average out-degree, and their outgoing edges cover over 70% of all outgoing edges; nearly 13% of the nodes have an in-degree above the average in-degree, and their incoming edges cover over 50% of all incoming edges. Compared with the Internet, the methods with high out-degree are very similar to hub nodes, i.e., pages with many links to authority pages, based on only the links between web pages [

・ Small-world network characteristics

Roughly speaking, small-world networks are those with highly clustered subsets of nodes that are only a few steps away from each other. More precisely, the defining properties of a small-world network rest on two structural properties: clustering and separation. In terms of network topology, clustering, a local property, measures the probability that two neighbors of one node are themselves connected, and is expressed by the clustering coefficient. Separation, a global property measured by the characteristic path length, evaluates the degree of separation between two nodes in the network. In a small-world network, the characteristic path length is comparable to that of a random network with the same number of edges, whilst the clustering coefficient of its nodes can be orders of magnitude larger on average. Watts [ ... large clustering coefficient, C ≫ C_{rand}, and small characteristic path length, L ≈ L_{rand}. Therefore, we conclude that the call-graph of a software program can be described as a small-world network. The localization attribute of small-world networks also helps to explain the separation, seen in the second empirical analysis, between the nodes with above-average in-degree and the nodes with above-average out-degree. Because of the large clustering coefficient and the small average shortest path, the nodes in the call-graph are concentrated in several local areas with large numbers of edges. These local areas are composed of the key nodes with above-average in-degree or out-degree and the nodes that are directly connected to them.
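For the scale-free characteristic discussed above, the scaling exponent γ of a degree sequence can be estimated roughly as follows. This sketch uses an ordinary least-squares fit on the log-log degree histogram, which is an assumption made for illustration; the text does not state how γ was fitted, and maximum-likelihood estimators are generally more reliable on real data. The function name is hypothetical.

```python
import math
from collections import Counter

def estimate_gamma(degree_sequence):
    """Estimate gamma in P(k) ~ k^(-gamma) from a list of node degrees."""
    # Zero degrees are dropped because log(0) is undefined.
    counts = Counter(d for d in degree_sequence if d > 0)
    if len(counts) < 2:
        raise ValueError("need at least two distinct positive degrees")
    n = sum(counts.values())
    xs = [math.log(k) for k in counts]
    ys = [math.log(c / n) for c in counts.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # The fitted slope of log P(k) against log k is -gamma.
    return -sxy / sxx
```

A synthetic sequence with 64, 16, 4 and 1 nodes of degree 1, 2, 4 and 8 follows P(k) ∝ k^{-2} exactly, so the estimate for it is 2.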

Based on scale-free network and small-world network theories, we conclude that the key methods in a software program account for about 20% of all its methods. This result provides valuable information for reusing existing applications. In an SOA context, most services should be mapped to business functions. We therefore believe that the methods with above-average out-degree, about 20% of all methods, should be extracted as services first. The other methods, those with large in-degree, should also be taken into consideration because they can be used as atomic services from which complex services are composed. Based on this conclusion, developers can quickly “search” existing programs and reuse suitable methods according to the degree distribution.

Moreover, as we explained in Section 1, the connectivity properties reflect the essential function and behavior of programs. The analysis of the degree distribution may therefore yield an appropriate measurement of “reusage quality”. Similar to the Pareto Principle (also known as the 20 - 80 rule), we can assume that the methods with many connections are quite likely to be connected again in the future, which means these methods are more useful than the others. We can then conclude that these methods have a higher reusage quality than the other methods.

In addition, most of the test cases are well known and regarded as well-designed programs. The out-degree distribution of a few programs cannot be identified with a scale-free regime, which is due to the limited size of the sample and some fine differences among programming models. So what is the best distribution model for the invocation relationship in software programs? How can we use the degree distribution of the call-graph as a criterion to evaluate the design of software programs? Can the power-law distribution or the ratio of key methods be used as an indicator of software quality? These are interesting and critical problems in the software engineering area.
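The share of key methods and the share of edges they cover can be recomputed for any call graph along the following lines. This is only a sketch: it assumes the degrees are supplied as a mapping from method name to degree, and that methods with degree zero are included so that the average is taken over all methods. The function name and the sample values are hypothetical.

```python
def above_average_coverage(degree_by_node):
    """Return (share of nodes above the average degree, share of edges they cover)."""
    n = len(degree_by_node)
    total = sum(degree_by_node.values())
    if n == 0 or total == 0:
        return 0.0, 0.0
    avg = total / n
    hub_degrees = [d for d in degree_by_node.values() if d > avg]
    return len(hub_degrees) / n, sum(hub_degrees) / total

# Hypothetical out-degrees for five methods.
node_share, edge_share = above_average_coverage(
    {"a": 7, "b": 1, "c": 1, "d": 1, "e": 0}
)
```

In this made-up example one method out of five lies above the average out-degree of 2 and accounts for 7 of the 10 outgoing edges, so the two shares are 0.2 and 0.7.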

Program | N | μ | C | C_{rand} | L | L_{rand} |
---|---|---|---|---|---|---|
JBPM | 1418 | 1.70 | 0.107 | 0.002 | 4.03 | 3.66 |
SableCC | 2191 | 1.64 | 0.139 | 0.003 | 1.85 | 3.77 |
JUNG | 1973 | 2.71 | 0.082 | 0.001 | 2.71 | 3.74 |
JGraph | 1278 | 1.55 | 0.075 | 0.002 | 9.13 | 3.63 |
Azureus | 12942 | 1.58 | 0.025 | 0.001 | 5.78 | 3.86 |
Apache James | 2127 | 1.70 | 0.138 | 0.001 | 2.99 | 3.75 |
Java Pet Store | 1894 | 2.81 | 0.104 | 0.001 | 2.29 | 3.73 |
Damls-Matcher | 337 | 2.02 | 0.010 | 0.005 | 4.40 | 3.30 |
JTB | 1126 | 2.11 | 0.217 | 0.004 | 10.91 | 3.60 |
LGMA | 298 | 2.36 | 0.036 | 0.009 | 3.29 | 3.27 |

With quickly changing requirements and the ever-growing cost of software programs, how to reuse legacy system assets and extend the current software lifecycle has become an urgent problem in the IT field. SOA technology emerges as a promising approach. But a basic problem for SOA is how to find similar functions and evaluate the “reusage quality” of these functions rapidly. Because the invocation relationship reflects the essential function and behavior of programs, in this paper we investigate the properties of this relationship in order to evaluate the reusable functions in existing software programs. We use the Java programming language as the test language and make use of call-graph analysis, which is a new application of traditional static program analysis techniques. From the empirical analysis, we have found that the call-graphs generated from software programs exhibit the properties of both scale-free and small-world networks: the distributions of in-degree and out-degree follow a power law; a few nodes cover most of the connections; and the call-graph shows large clustering and a small characteristic path length. According to scale-free and small-world network theories, we can differentiate the business methods and the supportive methods in software programs. More precisely, the methods with high out-degree provide high-level business functions, and those with high in-degree provide low-level supportive functions. Based on this conclusion, developers can select appropriate functions to reuse. In an SOA context especially, the methods with high out-degree, about 20% of all methods, should be extracted as services first, and the methods with high in-degree, about 13% of all methods, should also be extracted as atomic services from which complex services are composed. Further, this connectivity may also be used as a measurement of the reusage quality of different methods, which provides strong supporting information for the reuse of existing programs.

We plan to continue studying why software programs exhibit the properties of scale-free and small-world networks. We also want to explore how to expose and package the methods with strong connectivity as reusable services, because as programming paradigms change, these services need to be exposed in new form factors too. We would also like to understand whether these properties can be used to measure the quality of the design of software programs.