With the rapid development of big data, the scale of real-world networks is increasing continually. To reduce the network scale, coarse-graining methods have been proposed to transform large-scale networks into mesoscale networks. In this paper, a new coarse-graining method based on hierarchical clustering (HCCG) on complex networks is proposed. The network nodes are grouped by the hierarchical clustering method, and the weights of the edges between clusters are then updated to extract the coarse-grained network. A large number of simulation experiments on several typical complex networks show that the HCCG method can effectively reduce the network scale while maintaining the synchronizability of the original network well. Furthermore, the method is more suitable for networks with an obvious clustering structure, and the size of the coarse-grained network can be chosen freely.
Our life is full of all kinds of networks, for example metabolic networks, large-scale power networks, paper citation networks, global transportation networks, scientific research cooperation networks and so on. These networks share obvious characteristics: massive numbers of nodes and complex interactions. Networks with complex topological properties are called complex networks [
Existing research methods for complex networks are mainly designed for mesoscale networks. Coarse-graining is an effective way to study large networks, since it reduces the complexity of a network by merging similar nodes. However, coarse-graining goes far beyond the requirements of clustering, because the coarse-grained network is required to keep the initial network's topological properties or dynamic characteristics, such as the degree distribution, clustering coefficient, correlation properties [
Coarse-graining is an important technique for studying large-scale complex networks. However, almost every coarse-graining method has some shortcomings. For example, in the SCG method, computing the eigenvalues of the network's Laplacian matrix takes a lot of time, so the method is difficult to apply to large-scale real networks; furthermore, the SCG method cannot accurately control the size of the coarse-grained network. The K-means clustering coarse-graining method requires an objective function to be defined, and may become trapped in local minima or be sensitive to the choice of initial points. This paper proposes a new coarse-graining method based on hierarchical clustering (HCCG) on complex networks. The distance and similarity used by the method are easy to define, and the size of the reduced network can be chosen freely. Moreover, the method does not need an objective function and does not suffer from the problem of selecting initial points, so it makes up for some shortcomings of the above-mentioned methods. In the HCCG method, we use hierarchical clustering to cluster the network nodes, and update the weights of the edges between clusters to extract the coarse-grained network. We then apply the HCCG method to WS small-world networks, ER random networks and BA scale-free networks. Simulation experiments show that the HCCG method keeps the synchronizability of the initial network well, and that the method is more suitable for networks with a more obvious clustering structure.
The rest of the paper is organised as follows. In Section 2, the mathematical basis of the HCCG method is introduced. The steps of the HCCG method are presented in Section 3. In Section 4, a large number of numerical simulations on several typical networks verify the feasibility and effectiveness of the proposed method. Finally, conclusions and discussion are drawn in Section 5.
Clustering is an unsupervised learning process. After clustering, similarity is as large as possible within a cluster and as small as possible between different clusters. The fundamental difference between classification and clustering is that classification must know in advance the data characteristics on which it is based, whereas clustering is used to discover those characteristics. Therefore, clustering analysis is often used as a data preprocessing step in many fields, serving as the basis for further data analysis and processing [
Hierarchical clustering methods merge or split data objects recursively until some termination condition is met. According to the order of hierarchical decomposition, they can be divided into bottom-up and top-down algorithms. This paper adopts the bottom-up method, i.e. agglomerative hierarchical clustering. The algorithm starts with every individual node as its own cluster, and then repeatedly searches for the two nearest clusters and merges them into one. Each merge reduces the total number of clusters by one, and merging continues until the required number of clusters or the distance threshold is reached. In this paper, the Jaccard distance is used to calculate the distance between two different nodes. Methods of calculating the distance between two different clusters include Single Linkage, Complete Linkage, Average Linkage and so on. In this paper, we use Average Linkage in the HCCG method.
$$\mathrm{dist}_{avg}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{p \in C_i,\, p' \in C_j} |p - p'| \tag{1}$$
where $|p - p'|$ is the distance between two nodes, $i, j$ represent two different nodes of the initial network, the coarse-grained network has $\tilde{N}$ clusters, labeled $C = \tilde{1}, \tilde{2}, \cdots, \tilde{N}$, $C_i$ is the label of the cluster of node $i$, $|C_i|$ is the cardinality of the set $C_i$, and $n_i = |C_i|$ (that is, $n_i$ is the number of nodes in the cluster $C_i$).
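As a minimal sketch of Equation (1), the average-linkage distance can be computed by averaging all cross-cluster pairwise distances. The function name `average_linkage` and the one-dimensional sample points are illustrative assumptions, not part of the original method description.

```python
from fractions import Fraction
from itertools import product


def average_linkage(cluster_a, cluster_b, dist):
    """Mean of dist(p, q) over all p in cluster_a and q in cluster_b (Equation (1))."""
    total = sum(Fraction(dist(p, q)) for p, q in product(cluster_a, cluster_b))
    return total / (len(cluster_a) * len(cluster_b))


# Hypothetical example: points on a line, with |p - p'| as the pairwise distance.
example = average_linkage([0, 1], [4, 6], lambda p, q: abs(p - q))
print(example)  # 9/2, i.e. (4 + 6 + 3 + 5) / (2 * 2)
```

Exact fractions are used here only so that small hand-checked examples compare without rounding error; floating-point arithmetic would work equally well for large networks.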
Consider an unweighted and undirected network with $N$ nodes. $A = (a_{ij})_{N \times N}$ is the adjacency matrix of the network, where $a_{ii} = 0$ and $a_{ij} = a_{ji} = 1$ if node $i$ is connected to node $j$ ($i \neq j$), otherwise $a_{ij} = a_{ji} = 0$. $D = \mathrm{diag}(k_1, k_2, \cdots, k_N)$ is a diagonal matrix, where $k_i$ is the degree of the $i$-th node. $L = D - A$, with $L = (L_{ij})_{N \times N} \in \mathbb{R}^{N \times N}$, is the Laplacian matrix of the network. $L$ satisfies the dissipative coupling condition $\sum_{i=1}^{N} L_{ij} = 0$. $L$ is a symmetric matrix with nonnegative eigenvalues and, since the network is connected, they can be ordered as $0 = \lambda_1 < \lambda_2 \le \cdots \le \lambda_N$.
Generally, according to their synchronization regions, network systems can be divided into four types: 1) Type-I networks correspond to an unbounded region $(\alpha_1, \infty)$; 2) Type-II networks correspond to a bounded region $(\alpha_1, \alpha_2)$; 3) Type-III networks correspond to a union of several bounded regions $\cup_i (\alpha_{1i}, \alpha_{2i})$; 4) Type-IV networks correspond to the empty set. Networks of types 3) and 4) are difficult or impossible to synchronize. Fortunately, most networks belong to types 1) and 2). In Type-I networks, the synchronizability can be characterized by the minimum non-zero eigenvalue $\lambda_2$: the larger the value of $\lambda_2$, the smaller the coupling strength needed to achieve synchronization, and the stronger the synchronizability. In Type-II networks, the synchronizability can be characterized by the eigenvalue ratio $R = \lambda_2 / \lambda_N$: the network can achieve synchronization only when $\lambda_2 / \lambda_N > \alpha_1 / \alpha_2$, so the larger the value of $R$, the stronger the synchronizability. Therefore, if $\lambda_2$ or $\lambda_2 / \lambda_N$ is unchanged in the process of coarse-graining, we can deem that the synchronizability of the network is maintained [
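The two synchronizability indicators can be read directly off the Laplacian spectrum. The following is a small sketch assuming NumPy is available; the helper names `laplacian` and `sync_indices` and the three-node path graph are illustrative choices, not taken from the original experiments.

```python
import numpy as np


def laplacian(adj):
    """L = D - A for a symmetric 0/1 adjacency matrix."""
    adj = np.asarray(adj, dtype=float)
    return np.diag(adj.sum(axis=1)) - adj


def sync_indices(lap):
    """Return (lambda_2, lambda_2 / lambda_N) from the sorted Laplacian spectrum."""
    eig = np.linalg.eigvalsh(lap)  # ascending order; valid because lap is symmetric
    return eig[1], eig[1] / eig[-1]


# Hypothetical example: a path on three nodes, whose Laplacian eigenvalues are 0, 1, 3.
path_adj = [[0, 1, 0],
            [1, 0, 1],
            [0, 1, 0]]
path_lap = laplacian(path_adj)
lam2, ratio = sync_indices(path_lap)
```

For this path graph, `lam2` is 1 and `ratio` is 1/3 up to floating-point error. `eigvalsh` is used because the Laplacian of an undirected network is symmetric; a weighted, asymmetric coarse-grained Laplacian would need a general eigenvalue routine instead.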
There are three main steps in the HCCG method. The first step calculates the Jaccard distance between every two nodes to obtain the distance matrix of the initial network, and constructs $N$ single-member clusters (one per node). The second step calculates the distance between different clusters and uses the hierarchical clustering method to cluster the network nodes. The last step updates the weights of the links between different clusters to extract the coarse-grained network. This section introduces the HCCG scheme following these three steps.
Suppose $\vartheta(i) = \{ j \mid a_{ij} \neq 0 \}$ is the set of neighbor nodes of node $i$, and $|\vartheta(i)|$ is the cardinality of the set $\vartheta(i)$, so that $|\vartheta(i)| = k_i$, where $k_i$ is the degree of node $i$. The Jaccard coefficient is defined as follows:
$$J_{ij} = \frac{|\vartheta(i) \cap \vartheta(j)|}{|\vartheta(i) \cup \vartheta(j)|} = \frac{|\vartheta(i) \cap \vartheta(j)|}{|\vartheta(i)| + |\vartheta(j)| - |\vartheta(i) \cap \vartheta(j)|} \tag{2}$$
Here, $i, j$ are two different nodes, $\vartheta(i) \cap \vartheta(j)$ is the set of common neighbor nodes of nodes $i$ and $j$, and $\vartheta(i) \cup \vartheta(j)$ is the union of the neighbor nodes of nodes $i$ and $j$. When $|\vartheta(i)| = 0$ and $|\vartheta(j)| = 0$, $J_{ij} = 1$. The Jaccard distance is an index related to the Jaccard coefficient: the lower the similarity of two nodes, the larger their Jaccard distance. The Jaccard distance is defined as follows:
$$d_{ij} = 1 - J_{ij} = \frac{|\vartheta(i) \cup \vartheta(j)| - |\vartheta(i) \cap \vartheta(j)|}{|\vartheta(i) \cup \vartheta(j)|} \tag{3}$$
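Equations (2) and (3) translate directly into set operations on neighbor lists. The sketch below uses an assumed six-node edge list: it is a reconstruction chosen so that its Jaccard distances agree with the values quoted for the toy example in this section (for instance $d_{1,2} = 1/3$), since the original figure is not reproduced here. The function names are likewise illustrative.

```python
from fractions import Fraction


def neighbor_sets(edges):
    """Map each node to the set of its neighbors in an undirected edge list."""
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)
    return nbrs


def jaccard_distance(nbrs, i, j):
    """d_ij = 1 - |N(i) & N(j)| / |N(i) | N(j)| (Equations (2) and (3))."""
    a, b = nbrs.get(i, set()), nbrs.get(j, set())
    union = a | b
    if not union:  # both nodes isolated: J_ij = 1 by convention, so d_ij = 0
        return Fraction(0)
    return 1 - Fraction(len(a & b), len(union))


# Assumed edge list (a reconstruction, not taken from the original figure).
toy_edges = [(1, 3), (1, 4), (2, 3), (2, 4), (2, 5), (5, 6)]
nbrs = neighbor_sets(toy_edges)
d12 = jaccard_distance(nbrs, 1, 2)  # neighbors {3, 4} vs {3, 4, 5}
d34 = jaccard_distance(nbrs, 3, 4)  # identical neighbor sets {1, 2}
d26 = jaccard_distance(nbrs, 2, 6)  # neighbors {3, 4, 5} vs {5}
```

Nodes with identical neighbor sets, such as nodes 3 and 4 here, are at distance 0 and are therefore the first candidates for merging.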
A toy network is used to illustrate how to calculate the distance between two different nodes; using the Jaccard distance method, we obtain the
In
The basic idea of the hierarchical clustering method is to calculate the similarity between nodes by some similarity index, rank the node pairs by similarity from high to low, and then merge the nodes step by step. The main steps are as follows:
1) Obtain the adjacency matrix $A$ of the network;
2) Use the Jaccard distance to calculate the distance between every two different nodes and obtain the distance matrix of the network; construct $N$ single-member clusters (one per node), each with height 0;
3) Use the Average Linkage method to calculate the distance between every two different clusters, find the nearest clusters $C_i$, $C_j$ and merge them, which reduces the number of clusters by 1; take the distance between the two merged clusters as the height of the new upper layer;
4) Calculate the distance between the newly generated cluster and the other clusters in this layer. If the termination condition is satisfied, the algorithm ends; otherwise it returns to step 3).
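The steps above can be sketched as a short agglomerative loop. This is an illustrative, unoptimized version under stated assumptions: the termination condition is only a target number of clusters (a distance threshold is not implemented), the names `agglomerate` and `TOY_DIST` are hypothetical, and the pairwise distances are the Jaccard distances listed for the six-node toy example in this section.

```python
from fractions import Fraction as F
from itertools import combinations, product

# Upper-triangular Jaccard distances of the six-node toy example.
TOY_DIST = {
    (1, 2): F(1, 3), (1, 3): F(1), (1, 4): F(1), (1, 5): F(1), (1, 6): F(1),
    (2, 3): F(1), (2, 4): F(1), (2, 5): F(1), (2, 6): F(2, 3),
    (3, 4): F(0), (3, 5): F(2, 3), (3, 6): F(1),
    (4, 5): F(2, 3), (4, 6): F(1),
    (5, 6): F(1),
}


def node_dist(dist, p, q):
    """Look up a symmetric pairwise distance stored under either key order."""
    return dist[(p, q)] if (p, q) in dist else dist[(q, p)]


def average_linkage(dist, ca, cb):
    """Equation (1): mean pairwise distance between clusters ca and cb."""
    total = sum(node_dist(dist, p, q) for p, q in product(ca, cb))
    return total / (len(ca) * len(cb))


def agglomerate(nodes, dist, n_clusters):
    """Merge the two nearest clusters until n_clusters remain.

    Returns the final clusters and the height (linkage distance) of each merge.
    """
    clusters = [frozenset([v]) for v in nodes]  # step 2: single-member clusters
    heights = []
    while len(clusters) > n_clusters:
        # Step 3: find the nearest pair under average linkage and merge it.
        ca, cb = min(combinations(clusters, 2),
                     key=lambda pair: average_linkage(dist, *pair))
        heights.append(average_linkage(dist, ca, cb))
        clusters = [c for c in clusters if c not in (ca, cb)] + [ca | cb]
    return clusters, heights


clusters, heights = agglomerate(range(1, 7), TOY_DIST, n_clusters=3)
```

With three target clusters the toy example ends with the groups {1, 2}, {3, 4, 5} and {6}, merged at heights 0, 1/3 and 2/3. Recomputing every linkage from scratch keeps the sketch short but is costly for large networks, where cached cluster distances would be preferable.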
To better explain the steps of clustering, we use the example in
In
and node 4 are grouped together; based on Equation (3), we know $d_{1,2} = \frac{1}{3}$ is the minimum distance on the
| $d_{ij}$ | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| 1 | | 1/3 | 1 | 1 | 1 | 1 |
| 2 | | | 1 | 1 | 1 | 2/3 |
| 3 | | | | 0 | 2/3 | 1 |
| 4 | | | | | 2/3 | 1 |
| 5 | | | | | | 1 |
| 6 | | | | | | |

$$\mathrm{dist}_{avg}(C_{1,2}, C_{3,4}) = \frac{1}{2 \times 2}(1 + 1 + 1 + 1) = 1, \quad \mathrm{dist}_{avg}(C_{1,2}, C_5) = \frac{1}{2 \times 1}(1 + 1) = 1,$$
$$\mathrm{dist}_{avg}(C_{1,2}, C_6) = \frac{1}{2 \times 1}\left(1 + \frac{2}{3}\right) = \frac{5}{6}, \quad \mathrm{dist}_{avg}(C_{3,4}, C_5) = \frac{1}{2 \times 1}\left(\frac{2}{3} + \frac{2}{3}\right) = \frac{2}{3},$$
$$\mathrm{dist}_{avg}(C_{3,4}, C_6) = \frac{1}{2 \times 1}(1 + 1) = 1, \quad \mathrm{dist}_{avg}(C_5, C_6) = \frac{1}{1 \times 1}(1) = 1,$$
the nearest distance is $\mathrm{dist}_{avg}(C_{3,4}, C_5) = \frac{2}{3}$ on the
$C_{3,4}$ and node 5 are grouped together; based on Equation (1),
$$\mathrm{dist}_{avg}(C_{1,2}, C_{3,4,5}) = \frac{1}{2 \times 3}(1 + 1 + 1 + 1 + 1 + 1) = 1,$$
$$\mathrm{dist}_{avg}(C_{1,2}, C_6) = \frac{1}{2 \times 1}\left(1 + \frac{2}{3}\right) = \frac{5}{6}, \quad \mathrm{dist}_{avg}(C_{3,4,5}, C_6) = \frac{1}{3 \times 1}(1 + 1 + 1) = 1,$$
the nearest distance is $\mathrm{dist}_{avg}(C_{1,2}, C_6) = \frac{5}{6}$ on the
$C_{1,2}$ and node 6 are grouped together; according to Equation (1), the nearest distance is
$$\mathrm{dist}_{avg}(C_{1,2,6}, C_{3,4,5}) = \frac{1}{3 \times 3}(1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1) = 1$$
on the
Clusters are obtained according to the clustering steps in Section 3.2; the edge weights between different clusters are then updated according to Equation (4) to extract the coarse-grained network. Let $C_s$, $C_t$ be two different clusters. The weight from $C_s$ to $C_t$ is defined as:
$$W_{C_s \to C_t} = \frac{\sum_{x \in C_s,\, y \in C_t} a_{xy}}{|C_t|}, \quad s, t = 1, 2, \cdots, \tilde{N},\ s \neq t \tag{4}$$
In
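$$W_{\tilde{1} \to \tilde{2}} = \frac{\sum_{x \in \tilde{1},\, y \in \tilde{2}} a_{xy}}{|\tilde{2}|} = \frac{1 + 1 + 1 + 1 + 1}{3} = \frac{5}{3}, \quad W_{\tilde{2} \to \tilde{1}} = \frac{\sum_{x \in \tilde{2},\, y \in \tilde{1}} a_{xy}}{|\tilde{1}|} = \frac{1 + 1 + 1 + 1 + 1}{2} = \frac{5}{2},$$
$$W_{\tilde{2} \to \tilde{3}} = \frac{\sum_{x \in \tilde{2},\, y \in \tilde{3}} a_{xy}}{|\tilde{3}|} = \frac{1}{1} = 1, \quad W_{\tilde{3} \to \tilde{2}} = \frac{\sum_{x \in \tilde{3},\, y \in \tilde{2}} a_{xy}}{|\tilde{2}|} = \frac{1}{3}.$$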
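The weight update of Equation (4) amounts to counting the edges that run between two clusters and dividing by the size of the target cluster. The sketch below assumes a six-node edge list and a three-cluster partition ({1, 2}, {3, 4, 5}, {6}) that were reconstructed to be consistent with the worked values above; neither is taken from the original figure, and the function name `coarse_weights` is hypothetical.

```python
from fractions import Fraction

# Assumed toy network and partition (reconstructions, see the lead-in).
toy_edges = [(1, 3), (1, 4), (2, 3), (2, 4), (2, 5), (5, 6)]
toy_clusters = {1: {1, 2}, 2: {3, 4, 5}, 3: {6}}


def coarse_weights(edges, clusters):
    """W[(s, t)] = (number of edges between C_s and C_t) / |C_t| for s != t."""
    label = {v: s for s, members in clusters.items() for v in members}
    counts = {}
    for u, v in edges:
        s, t = label[u], label[v]
        if s != t:  # edges inside a cluster do not contribute
            counts[(s, t)] = counts.get((s, t), 0) + 1
            counts[(t, s)] = counts.get((t, s), 0) + 1
    return {(s, t): Fraction(c, len(clusters[t])) for (s, t), c in counts.items()}


W = coarse_weights(toy_edges, toy_clusters)
```

Because the count is divided by the size of the target cluster, the coarse-grained network is generally weighted and directed: here `W[(1, 2)]` is 5/3 while `W[(2, 1)]` is 5/2.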
In this section, we apply the HCCG method to several typical complex networks: WS small-world networks, ER random networks and BA scale-free networks. $N$ is the size of the initial network, $\tilde{N}$ is the size of the coarse-grained network, $\tilde{\lambda}_2$ is the minimum non-zero eigenvalue of the coarse-grained network, and $\tilde{R} = \tilde{\lambda}_2 / \tilde{\lambda}_N$ is the eigenvalue ratio of the coarse-grained network. The closer the values of $\tilde{\lambda}_2$ and $\lambda_2$ (or $R$ and $\tilde{R}$), the better the HCCG method keeps the synchronizability of the original network.
In 1998, Watts and Strogatz [
As shown in Figures 5(a)-(c), in Type-I networks, when $p = 0.1, 0.2$ and $\langle k \rangle = 4, 6, 8$, the synchronizability of the original networks is maintained well in the coarse-graining process. However, as $p$ increases, the effect becomes worse. When $p$ is fixed, the smaller $\langle k \rangle$ is, the better the HCCG method performs. In the literature [
As shown in Figures 5(d)-(f), in Type-II networks, the evolution curves of $\tilde{R}$ are consistent with those of $\tilde{\lambda}_2$, for the same reason as in the case of Type-I networks. The difference is that when $\langle k \rangle$ and $p$ are quite large, the HCCG method performs better in Type-II networks than in Type-I networks.
When $p$ increases gradually, the minimum non-zero eigenvalue $\tilde{\lambda}_2$ and the maximum eigenvalue $\tilde{\lambda}_N$ increase simultaneously, but $\tilde{\lambda}_2$ increases much faster than $\tilde{\lambda}_N$, which leads to the rise in $\tilde{R} = \tilde{\lambda}_2 / \tilde{\lambda}_N$ [
$\langle k \rangle$ and $p$, the clustering structure of the network becomes weaker, and the synchronizability of the original network is maintained less well. The HCCG method is therefore more suitable for networks with an obvious clustering structure.
In 1959, Erdős and Rényi, two Hungarian mathematicians, proposed an algorithm for the formation of random graphs [
HCCG method in ER random networks with initial sizes $N = 1000, 2000, 3000$ respectively, and Figures 6(d)-(f) show the evolution curves of $\tilde{R}$ obtained by using the HCCG method in ER random networks with initial sizes $N = 1000, 2000, 3000$ respectively. For each case, the connection probability is $p = 0.26, 0.35, 0.47, 0.67, 0.88$. Each simulation result was obtained from 5 independent repeated experiments.
The eigenvalue spectral distribution of ER random networks is symmetrical, and as the connection probability $p$ increases, the range of the eigenvalue distribution shrinks rapidly while $\tilde{\lambda}_2$ and $\tilde{\lambda}_N$ both increase. The main reason is that as $p$ increases, the number of isolated clusters decreases and the maximum degree $d_{\max}$ increases, so $\tilde{\lambda}_N$ increases. However, $\tilde{\lambda}_2$ grows faster than $\tilde{\lambda}_N$, which leads to an increase of $\tilde{R} = \tilde{\lambda}_2 / \tilde{\lambda}_N$. Therefore, from a statistical perspective, the synchronizability gradually increases as the connection probability $p$ increases [
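The trend described above can be checked numerically on small ER graphs. The following sketch assumes NumPy is available and uses a much smaller network ($N = 200$) and a fixed random seed than the experiments reported here, so it only illustrates the qualitative behaviour of $\lambda_2$ and $\lambda_2 / \lambda_N$ for uncoarsened networks; the function names are illustrative.

```python
import numpy as np


def er_laplacian(n, p, rng):
    """Laplacian of one G(n, p) sample: each node pair is linked with probability p."""
    upper = np.triu(rng.random((n, n)) < p, k=1)  # sample each pair once
    adj = (upper | upper.T).astype(float)         # symmetrize, no self-loops
    return np.diag(adj.sum(axis=1)) - adj


def sync_indices(lap):
    """Return (lambda_2, lambda_2 / lambda_N) of a symmetric Laplacian."""
    eig = np.linalg.eigvalsh(lap)
    return eig[1], eig[1] / eig[-1]


rng = np.random.default_rng(0)
probabilities = (0.26, 0.47, 0.88)
results = [(p, *sync_indices(er_laplacian(200, p, rng))) for p in probabilities]

for p, lam2, ratio in results:
    print(f"p={p:.2f}  lambda_2={lam2:.2f}  R={ratio:.3f}")
```

Both indicators grow with $p$ in this small experiment, in line with the statement that denser ER networks are easier to synchronize.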
In 1999, Barabási and Albert established the BA model to explain the scale-free property of complex networks [
As shown in Figures 7(a)-(c), in the case of Type-I networks, when the size of the coarse-grained network is at least 60 percent of the original size, the HCCG method maintains the synchronizability of the original network well. As shown in Figures 7(d)-(f), in the case of Type-II networks, when the size of the coarse-grained network is larger than 70 percent of the original size, the HCCG method also maintains the synchronizability of the original network well; as the network is reduced further, however, there is a sharp increase in $\tilde{R}$, at which point the average degree of the corresponding network increases suddenly. This is consistent with the conclusion in the literature [
networks, as the average degree increases, the synchronizability of the network remains basically unchanged. ER random networks have some emergent properties: for any given connection probability $p$, either almost every graph $G(N, p)$ has a certain property $Q$, or almost every graph $G(N, p)$ does not have this property $Q$. Whether the sudden increase of $\tilde{R}$ means that certain emergent properties arise in BA scale-free networks remains to be studied.
Coarse-graining is an important technique for studying large-scale complex networks. In the HCCG method, the distance and similarity are simple to define: the Jaccard distance is used to obtain the distance matrix of the network, and the Average Linkage method is used to find the nearest clusters. We use the hierarchical clustering method to cluster the network nodes, and update the weights of the edges between clusters to extract the coarse-grained network. Extensive simulation experiments show that the HCCG method keeps the synchronizability of the original network well, and that the method is more suitable for networks with an obvious clustering structure. Furthermore, the size of the coarse-grained network can be chosen freely in the HCCG method. However, whether the sudden increase of $\tilde{R}$ on BA scale-free networks means that certain emergent properties arise remains to be studied. In addition, hierarchical clustering is a greedy algorithm, so the clustering result is a local optimum and may not be the global optimum. This problem may be addressed by adding random effects, which is also a direction of our future research.
This project was supported by National Natural Science Foundation of China (Nos. 61563013, 61663006) and the Natural Science Foundation of Guangxi (No. 2018GXNSFAA138095).
The authors declare no conflicts of interest regarding the publication of this paper.
Liao, L., Jia, Z. and Deng, Y. (2019) Coarse-Graining Method Based on Hierarchical Clustering on Complex Networks. Communications and Network, 11, 21-34. https://doi.org/10.4236/cn.2019.111003