In many cases randomness in community detection algorithms has been avoided due to issues with stability. Indeed, replacing random ordering with centrality rankings has improved the performance of some techniques such as Label Propagation Algorithms (LPA). This study evaluates the effects of such orderings on the Speaker-listener Label Propagation Algorithm (SLPA), a modification of LPA which has already been stabilized through alternate means. It demonstrates that in cases where stability has been achieved without eliminating randomness, the result of removing random ordering is overfitting and bias. The results of testing seven measures of centrality in conjunction with SLPA across five social network graphs indicate that while certain measures outperform random orderings on certain graphs, random orderings have the highest overall accuracy. This is particularly true when strict orderings are used in each run. These results indicate that the more evenly distributed solution space which results from completely random ordering is more valuable than the more targeted search that results from centrality orderings.

Many real-world systems and networks can be represented by graphs of edges and nodes. These systems include such diverse areas of study as social networks, HTML structure, and highway systems. One machine learning task which is often performed on these graphs is community detection, in which algorithms attempt to find groups of nodes which have a significant difference in density between intragroup edges and intergroup edges, otherwise known as communities. These communities often provide useful information about the elements represented by the nodes of a graph. For example, communities in social network graphs likely define distinct social groups or subgroups. Similarly, communities in an HTML graph might represent pages on the same domain or the same topic. A variety of techniques have been developed to find good communities in graphs; however, many of these methods are suitable only for finding discrete communities, that is, communities with disjoint sets of nodes. This unfortunately is not how true communities form in many networks. In social networks, it is quite common for an individual or node to belong to multiple friend groups or communities. Similarly, in a co-authorship network it would be expected that authors who are focused on interdisciplinary studies might belong in roughly equal parts to two or more of the communities for the disciplines in which they are involved. This problem has largely been addressed through modifications to existing discrete community detection algorithms.

One of the best known and simplest community detection algorithms is LPA or Label Propagation Algorithm [

It has been demonstrated that centrality functions can improve the community detection results of standard LPA. Therefore, in this paper we combine SLPA with a variety of centrality functions on an assortment of networks with varied structures in order to determine the effects of centrality functions used in conjunction with SLPA. This study includes, among others, degree, betweenness, and closeness centrality functions. The community detection quality of SLPA for each centrality function and graph combination is given in Section 3.4. Prior to performing these tests, however, it was necessary to determine the convergence rate of SLPA on each of the chosen network graphs, since SLPA, unlike LPA, requires an input parameter that sets the number of iterations of label propagation that will be performed. These results are summarized in Section 3.1. The social networks, centrality functions, and evaluation metrics used in this study are described in detail in the following section.

This study makes use of four social networks: karate, pilgrim, dolphins, and high school. Karate represents the social structure of a karate club from the 1970s and is composed of thirty-four individuals and seventy-eight connections; it is perhaps the most commonly referenced social network [

In order to evaluate the benefits of applying centrality to the ordering of nodes for propagation, seven different centrality functions were selected. These include degree centrality, subgraph centrality, closeness centrality, betweenness centrality, alpha centrality, leadership quality, and PageRank. Degree centrality was the first and simplest measure of centrality. In undirected graphs such as those used in this study, the degree centrality of a node is merely its degree. Subgraph centrality is based on the number and size of all closed walks within the graph that contain each node [

of shortest paths passing through them [

In the original Label Propagation Algorithm (LPA), each node is initially assigned a unique label. During each iteration the nodes are visited in a random order, and each node, when visited, assigns itself the label most common amongst its neighbors. In the case of a tie, one label is selected randomly from the set of maximal labels. This process continues until each node's label is a most common label amongst its neighbors. Each node is then assigned to a community based on the label it holds after the final iteration. This technique is very effective for its simplicity; unfortunately, it can produce disconnected communities and is rather unstable due to its uncertain termination condition.
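The procedure described above can be illustrated with a minimal Python sketch. It is not the implementation used in this study; the adjacency-dictionary representation, the function names, and the `max_sweeps` safety cap (added because the termination condition is not guaranteed to be reached) are assumptions made for illustration.

```python
import random
from collections import Counter


def most_common_labels(node, adj, labels):
    """Return the set of labels tied for most frequent among node's neighbours."""
    counts = Counter(labels[u] for u in adj[node])
    top = max(counts.values())
    return {lab for lab, c in counts.items() if c == top}


def label_propagation(adj, seed=None, max_sweeps=100):
    """adj maps each node to a list of its neighbours (undirected graph)."""
    rng = random.Random(seed)
    labels = {v: v for v in adj}          # every node starts with a unique label
    nodes = [v for v in adj if adj[v]]    # isolated nodes simply keep their label
    for _ in range(max_sweeps):           # max_sweeps is an assumed safety cap
        rng.shuffle(nodes)                # visit nodes in a fresh random order
        for v in nodes:
            # ties are broken uniformly at random among the maximal labels
            labels[v] = rng.choice(sorted(most_common_labels(v, adj, labels)))
        # stop once every node holds a most common label among its neighbours
        if all(labels[v] in most_common_labels(v, adj, labels) for v in nodes):
            break
    return labels


# Two disjoint triangles: nodes 0-2 and nodes 3-5.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
communities = label_propagation(adj, seed=1)
```

Because both the visiting order and the tie-breaking are random, different seeds can yield different partitions on less clear-cut graphs, which is the instability noted above.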

The Speaker-Listener Label Propagation Algorithm or SLPA is an extension of the standard label propagation algorithm which attempts to imitate the natural process of human communication for information dissemination [

Ordering nodes according to centrality was performed as a two-step process. First centrality values were calculated for each node using a selected centrality function. Then nodes were selected for ordering with probability proportional to their centrality value. This resulted in high centrality nodes appearing more frequently early in the ordering and low centrality nodes usually occurring later in the ordering while still maintaining some level of randomness. The second step of this process was repeated at each iteration resulting in a new ordering each time.
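One way to read the second step is as weighted sampling without replacement: nodes are drawn one at a time with probability proportional to their centrality among the nodes not yet placed. The Python sketch below follows that reading; the function name, the dictionary input, and the uniform fallback for the case where all remaining centralities are zero are assumptions, not details taken from the study's Java implementation.

```python
import random


def centrality_ordering(centrality, rng=None):
    """Draw a node ordering in which each next node is chosen with
    probability proportional to its centrality among the remaining nodes.

    centrality maps node -> non-negative centrality value.
    """
    rng = rng or random.Random()
    remaining = dict(centrality)
    order = []
    while remaining:
        nodes = list(remaining)
        weights = [remaining[v] for v in nodes]
        if sum(weights) <= 0:
            # Assumed fallback: if only zero-centrality nodes are left,
            # place them in uniformly random order.
            rng.shuffle(nodes)
            order.extend(nodes)
            break
        pick = rng.choices(nodes, weights=weights, k=1)[0]
        order.append(pick)
        del remaining[pick]
    return order


# Centrality values are computed once; a new ordering is drawn per iteration.
degree = {"a": 3, "b": 1, "c": 1, "d": 1}   # illustrative values only
rng = random.Random(0)
orderings = [centrality_ordering(degree, rng) for _ in range(3)]
```

Under this reading a high-centrality node is likely, but not certain, to appear early, which preserves some of the randomness of the original ordering.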

There are several ways to measure the quality of detected communities in a graph. By far the most popular methods for this task on discrete community partitions are normalized mutual information and modularity. Modularity evaluates the goodness of a community structure by looking at intercommunity edges and intracommunity edges within a graph [

All testing was done on a collection of identical machines running Linux Mint 17. For the purposes of this study the clock speed and available RAM of these machines are irrelevant, as convergence and efficiency are measured in number of runs and number of iterations per run, and all processing takes place on JVM version 1.7.0_79. SLPA was implemented in Java and received centrality values from the built-in centrality functions included in the igraph package for R. Random number generation was handled by Java's built-in SecureRandom class.

In order to determine how quickly SLPA converged on an accurate community partition for each graph, SLPA was run with the iterations parameter varied from five to one hundred. SLPA was run twenty-five times at each number of iterations, and the median value was kept to better gauge how a typical run at that iteration count would perform. This was repeated for each of the seven centrality functions, each time ordering nodes for label propagation based on their ranking from the selected centrality function using the process described in the previous section. The median overlapping modularity values on every graph for each number of iterations and centrality function are shown in Figures 6-9. It is clear, although perhaps surprising, that on these small networks convergence occurs for most algorithms after only five or ten iterations. For this reason all subsequent accuracy tests were run with only twenty-five iterations to minimize runtime without compromising accuracy.
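The sweep described above can be outlined with a short Python sketch. The `run_once` callable stands in for a single SLPA run that returns a quality score, and the step of five between iteration counts is an assumption, since only the endpoints of the range are stated; `toy_run` is a placeholder that produces no real SLPA results.

```python
import random
import statistics


def median_quality_by_iterations(run_once, iteration_counts, repeats=25, seed=None):
    """For each iteration count, call run_once(iterations, rng) `repeats`
    times and keep the median score."""
    rng = random.Random(seed)
    return {
        t: statistics.median(run_once(t, rng) for _ in range(repeats))
        for t in iteration_counts
    }


def toy_run(iterations, rng):
    # Placeholder score that saturates after ten iterations, plus small noise.
    # It only imitates the shape of a convergence curve; it is NOT SLPA output.
    return min(iterations, 10) / 10 + rng.uniform(-0.01, 0.01)


# Assumed step of five between the stated endpoints of 5 and 100.
curve = median_quality_by_iterations(toy_run, range(5, 101, 5), repeats=25, seed=0)
```

Taking the median rather than the mean keeps a few unusually poor or unusually good runs from dominating the value reported for an iteration count.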

Each centrality function was used in running SLPA one hundred times on each graph. The results of these runs were recorded and evaluated based on three metrics: normalized mutual information, modularity, and overlap modularity. The median value for each of these metrics was selected and presented in Figures 9-13 to provide a clear picture of the typical performance of SLPA using each centrality function. It is quickly apparent that few of the centrality functions have a significant effect on community detection quality. In fact, the only clearly significant centrality function is betweenness, which drastically reduces the quality of community partitions. This is likely due to this function's emphasis on shortest paths, which causes it to identify bridge nodes between communities. If these nodes are allowed to propagate first, labels can flow between communities more easily than they otherwise might. This is likely the cause of the merged communities and poor community structure in these runs of SLPA. Several other functions regularly outperformed random ordering on some graphs and underperformed on others. This may indicate that different centrality functions are more valuable on certain graphs, and it may be a sign of overfitting results towards a subset of graphs with certain characteristics. For this reason it appears that completely random ordering is optimal for SLPA, since its more evenly distributed solution space can account for all possible graph structures.

The results of this testing show that for label propagation algorithms which have already been stabilized, ordering nodes for label propagation based on centrality functions does not improve predictive quality. In fact, in most cases it slightly decreases performance when compared across a variety of different social network structures.

This is especially true of betweenness centrality, which significantly reduces performance in almost all cases. The reason for this becomes apparent when one considers how reordering can affect community detection. Since betweenness tends to give priority to bridge nodes which border multiple communities, allowing these nodes to propagate first increases the chances of a label overflowing its community bounds and skewing propagation results. Other centrality functions may also cause this bias on graphs where bridge nodes have other qualities such as high degree or closeness centrality. This demonstrates that the primary value of ordering label propagation based on centrality is in its stabilizing effect; however, other methods such as those employed by SLPA may prove more effective, since they do not negatively affect community partitioning as a consequence. For this reason we assert that in the case of SLPA random node ordering is the optimal ordering when testing across different graph structures.

Further research on this topic could focus on the application of centrality functions to other versions of label propagation which have not yet achieved stable termination. Centrality-based ordering has already been shown to have a stabilizing effect on standard label propagation, and this study demonstrates that for most centrality functions such ordering has little or no negative effect on final community detection. Centrality ordering could therefore enhance the stability of other label propagation algorithms without reducing the effectiveness of their clustering.

We would like to thank the Summer Research Institute at Houghton College for providing us with this research opportunity.

Brian Dickinson, Wei Hu (2015) The Effects of Centrality Ordering in Label Propagation for Community Detection. Social Networking, 04, 103-111. doi: 10.4236/sn.2015.44012