Advances in Pure Mathematics
Vol. 8, No. 4 (2018), Article ID: 83815, 20 pages
DOI: 10.4236/apm.2018.84021

Facebook Dynamics: Modelling and Statistical Testing

José Bavio1, Melina Guardiola1, Gonzalo Perera2

1Departamento de Matemática, Universidad Nacional del Sur, Buenos Aires, Argentina

2Universidad de la República, Montevideo, Uruguay

Copyright © 2018 by authors and Scientific Research Publishing Inc.

This work is licensed under the Creative Commons Attribution International License (CC BY 4.0).

http://creativecommons.org/licenses/by/4.0/

Received: December 26, 2017; Accepted: April 15, 2018; Published: April 18, 2018

ABSTRACT

In this work we study the virtual social network Facebook, which is used by millions of people worldwide and combines virtual elements with real-world components. We propose a probabilistic model that describes the long-term behavior of Facebook, including the different ways friendship connections between profiles arise, directly or by suggestion. Owing to the web's high level of interactivity, we simplify the model by assuming Markovian dynamics. Once the model is established, we introduce the concept of Complete Transversality (CT), a communication pattern in which interaction between profiles is homogeneous, and we derive estimators that measure this interaction. We then introduce a weaker version of CT named Segmented Transversality (ST). Within this framework we develop estimators that allow hypothesis testing of CT and ST. Finally, in the ST context, we propose performance measures to assess the quality of a priori segmentations.

Keywords:

Markov Chains, Random Graphs, Social Networks

1. Introduction

Social networks have emerged as communication tools of unexpected impact. Frequent contact between people through these networks gives rise to virtual relationships that develop according to their interests. Our approach follows [1], which places emphasis not only on individual behaviour but also on social interaction within the network.

One of today's challenges is to develop accurate tools to identify influential users by sector and market, and to understand the dynamics of information flow. At the same time, it is difficult to observe a social network completely; statistical problems and probabilistic modelling are therefore important issues (see [2] [3] and [4] for surveys on this subject). Social networks also provide examples of models with unidentifiable or censored data, which makes them particularly interesting.

These concerns lead us to study the phenomenon of virtual social networks, and we restrict our analysis to Facebook. We develop a model which, although it does not describe the network in full, is useful as an approach to the study of its dynamics. Naturally this model leads to graph theory, which relates to the work in [5] and [6].

We will use mathematical and statistical tools from the field of stochastic processes [7] [8] [9], trying to answer questions such as whether transversal communication exists, or how to decide whether a proposed network segmentation is optimal with respect to a certain communication behaviour between segments.

This paper is structured as follows. In Section 2 we model Facebook using Markov chain tools, introduce the concept of complete transversality in communication and, in this context, derive the distribution of the random functions involved in the model.

Section 3 proposes two statistical hypothesis tests, one for the network's CT and the other for CT between network segments, using U-statistics theory and the asymptotic convergence theorems presented in [10]. We end this section by defining a useful performance index to measure segmentation quality.

Section 4 is devoted to conclusions and acknowledgment.

Section 5 is an appendix with proofs of the results obtained in the preceding sections.

Further related work on this subject can be found in one of the authors' PhD theses [11].

2. Model Description

In this section we propose a model to describe Facebook dynamics.

Consider $t \in \mathbb{N}$. Let $\mathcal{P}_t$ denote the set of all internet users at instant $t$ and $\mathcal{F}_t$ the set of Facebook profiles at time $t$. Of course $\mathcal{P}_t \subseteq \mathcal{P}_{t+1}$ and we will also suppose that once a profile is created it cannot be eliminated. (In fact a profile can be eliminated, but doing so is tedious and difficult, so we disregard this behaviour.) Then $\mathcal{F}_t \subseteq \mathcal{F}_{t+1}$.

We also write $\mathcal{P} = \bigcup_{t=0}^{\infty} \mathcal{P}_t$ and $\mathcal{F} = \bigcup_{t=0}^{\infty} \mathcal{F}_t$.

For each instant t, we will model the network with a random graph where the nodes represent profiles and the edges represent friendship links.

Definition 1. Let $f, g \in \mathcal{F}_t$. We define the random "friendship" function at time $t$ as the function $\alpha_t : \mathcal{F}_t \times \mathcal{F}_t \to \{0,1\}$ such that, if $f \neq g$,

$$\alpha_t(f,g) = \begin{cases} 1, & \text{if } f \text{ and } g \text{ are friends at instant } t,\\ 0, & \text{if } f \text{ and } g \text{ are not friends at instant } t,\end{cases}$$

and

$$\alpha_t(f,f) = \begin{cases} 1, & \text{for all } f \in \mathcal{F}_t,\\ 0, & \text{if } f \in \mathcal{F} \setminus \mathcal{F}_t.\end{cases}$$

Remark. The "friendship" relation is symmetric, so the function $\alpha_t$ is also symmetric. The random graph determined by $\alpha_t$ is therefore undirected, and we can define its adjacency matrix as follows.

Definition 2. Let $M$ be the cardinal of $\mathcal{F}$ and let $\mathcal{M}_{M \times M}$ be the set of binary symmetric matrices of order $M$. We define the random "friendship" matrix at time $t$ as the matrix $A_t \in \mathcal{M}_{M \times M}$ whose entries are the values taken by $\alpha_t$ at each pair of profiles $(f,g) \in \mathcal{F} \times \mathcal{F}$.

Due to the high level of interactivity on social networks, it is natural to suppose that $\{A_t\}$ is a Markov chain with state space $\mathcal{M}_{M \times M}$. Given two states $A$ and $B$ of $\{A_t\}$, we denote by $p_{t,t+1}^{A,B}$ the one-step transition probability from $A$ to $B$.

Because of the symmetry of $A_t$, $p_{t,t+1}^{A,B}$ is determined by the one-step transition probability of the subdiagonal entries of $A_t$. If we say that $f_i$ precedes $f_j$ when $i < j$, written $f_i \prec f_j$, and, given $A \in \mathcal{M}_{M \times M}$, denote by $SD(A) = \{A_{i,j} : i > j\}$ the set of subdiagonal entries of $A$, we have:

$$p_{t,t+1}^{A,B} = P\big[SD(A_{t+1}) = SD(B) \,\big|\, SD(A_t) = SD(A)\big]. \quad (1)$$
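As an illustration, the friendship matrix of Definition 2 and the subdiagonal map $SD$ can be sketched in a few lines of code. The function names and the use of independent Bernoulli entries here are our own illustrative assumptions, not part of the model:

```python
import numpy as np

def random_friendship_matrix(M, p, rng):
    """Draw a symmetric binary 'friendship' matrix (a hypothetical
    stand-in for A_t): subdiagonal entries are independent
    Bernoulli(p), mirrored above the diagonal, with A[f, f] = 1."""
    lower = rng.random((M, M)) < p          # i.i.d. Bernoulli(p) draws
    A = np.tril(lower, k=-1).astype(int)    # keep the strict lower triangle
    A = A + A.T                             # mirror: friendship is symmetric
    np.fill_diagonal(A, 1)                  # alpha_t(f, f) = 1 for active profiles
    return A

def SD(A):
    """Subdiagonal entries {A[i, j] : i > j}, which determine the chain state."""
    i, j = np.tril_indices(A.shape[0], k=-1)
    return A[i, j]

rng = np.random.default_rng(0)
A = random_friendship_matrix(5, 0.4, rng)
```

Because the matrix is symmetric with a fixed diagonal, the $M(M-1)/2$ values returned by `SD` carry all the information of the state, exactly as used in (1).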

To calculate (1), we describe through events the profiles' actions that have an impact on the transition. These events will be linked to certain indices, constructed to measure affinities between profiles.

We will use the following notation. For any set $A$, we denote by

$$A \times A \setminus D(A \times A) = \{(a, a') : a, a' \in A,\ a \neq a'\}$$

the set of pairs of distinct elements of $A$, and by

$$A \times A \times A \setminus D(A \times A \times A) = \{(a, a', a'') : a, a', a'' \in A,\ a \neq a',\ a \neq a'',\ a' \neq a''\}$$

the set of pairwise-distinct triples of elements of $A$.

Definition 3. Let $p, p' \in \mathcal{P}$, with $p \neq p'$. We define the "image index" of $p$ over $p'$ as a function $X : \mathcal{P} \times \mathcal{P} \setminus D(\mathcal{P} \times \mathcal{P}) \to \mathbb{R}$ such that

$$\begin{cases} X(p, p') > 0, & \text{if } p \text{ has a positive image of } p',\\ X(p, p') = 0, & \text{if } p \text{ is indifferent to } p',\\ X(p, p') < 0, & \text{if } p \text{ has a negative image of } p'.\end{cases}$$

Remark. We will suppose that the network is in a steady state; this implies that the image between profiles has also evolved to a steady state, so the function $X$ does not depend on the time $t$. Moreover, $X$ is non-symmetric, unobservable and monotonic.

Definition 4. Let $f, g \in \mathcal{F}$, with $f \neq g$. We define the "image index" of $f$ over $g$ as the function $Y_t : \mathcal{F} \times \mathcal{F} \setminus D(\mathcal{F} \times \mathcal{F}) \to \mathbb{R}$ given by

$$Y_t(f,g) = \frac{1}{ml} \sum_{i=1}^{m} \sum_{j=1}^{l} \Phi\big(X(p_i, p'_j)\big) + \varepsilon_t(f,g), \quad (2)$$

where $\{p_1, \ldots, p_m\}$ and $\{p'_1, \ldots, p'_l\}$ are the sets of users that administer the profiles $f$ and $g$ respectively, $\Phi : \mathbb{R} \to \mathbb{R}$ is a monotonic regression function, and $\{\varepsilon_t(f,g)\}$ are independent and identically distributed random variables with $E(\varepsilon_t) = 0$ and $E(\varepsilon_t^2) = \sigma^2 > 0$, for every pair $(f,g) \in \mathcal{F} \times \mathcal{F}$, $f \neq g$.

For ordered triples of users and ordered triples of profiles we define the following indices.

Definition 5. Let $p, p', p'' \in \mathcal{P}$ be pairwise distinct. For the ordered triple of users $(p, p', p'')$ we define an "image index" as the function

$$U : \mathcal{P} \times \mathcal{P} \times \mathcal{P} \setminus D(\mathcal{P} \times \mathcal{P} \times \mathcal{P}) \to \mathbb{R}$$

that assigns to each triple a real number representing the level of acceptance of the action "$p$ suggests to $p'$ to become a friend of $p''$".

Remark. As with $X$, $U$ does not depend on $t$, and is non-symmetric and unobservable.

Definition 6. Let $f, g, h \in \mathcal{F}$ be pairwise distinct. For the ordered triple of profiles $(f,g,h)$ we define the "image index" as the function

$$W_t : \mathcal{F} \times \mathcal{F} \times \mathcal{F} \setminus D(\mathcal{F} \times \mathcal{F} \times \mathcal{F}) \to \mathbb{R}$$

given by

$$W_t(f,g,h) = \frac{1}{mkl} \sum_{i=1}^{m} \sum_{j=1}^{k} \sum_{r=1}^{l} \Psi\big(U(p_i, p'_j, p''_r)\big) + \eta_t(f,g,h), \quad (3)$$

where $\{p_1, \ldots, p_m\}$, $\{p'_1, \ldots, p'_k\}$ and $\{p''_1, \ldots, p''_l\}$ are the sets of users that administer the profiles $f$, $g$ and $h$ respectively, $\Psi : \mathbb{R} \to \mathbb{R}$ is a monotonic regression function, and $\{\eta_t(f,g,h)\}$ are independent and identically distributed random variables with $E(\eta_t) = 0$ and $E(\eta_t^2) = \tau^2 > 0$, for every pairwise-distinct triple $(f,g,h) \in \mathcal{F} \times \mathcal{F} \times \mathcal{F}$.

Then, given $f, g, h \in \mathcal{F}$, we will suppose that there exist $\delta_B > 0$, $\delta_R > 0$ and $\delta_I > 0$ such that:

i) "$f$ breaks friendship with $g$" $\Leftrightarrow \{Y_t(f,g) < \delta_B\}$.

(The breakdown of friendship may be due to $f$ deciding to eliminate $g$, or vice versa.)

ii) "$f$ successfully requests friendship of $g$" $\Leftrightarrow \{Y_t(f,g) > \delta_R\}$.

(A successful friendship request arises when $f$ asks $g$ for friendship and $g$ accepts $f$ as a friend.)

iii) "$f$ successfully suggests $h$'s friendship to $g$" $\Leftrightarrow \{W_t(f,g,h) > \delta_I\}$.

(A successful friendship suggestion occurs when $f$ suggests $h$ to $g$, $g$ requests $h$'s friendship and $h$ accepts.)

Then, if we denote by $I_t(f)$ the interventions of $f$, regardless of their effects, in the transition from one instant to the next from $A$ to $B$, these can be described as a union of the following events:

$$D_t^1(f) = \text{"}f\text{'s breaks"} = \bigcap_{g \neq f:\ A_{f,g}=1,\ B_{f,g}=0} \{Y_t(f,g) < \delta_B\},$$

$$D_t^2(f) = \text{"}f\text{'s non-breaks"} = \bigcap_{g \neq f:\ A_{f,g}=1,\ B_{f,g}=1} \{Y_t(f,g) \geq \delta_B\},$$

$$D_t^3(f) = \text{"}f\text{'s requests"} = \bigcap_{g \neq f:\ A_{f,g}=0,\ B_{f,g}=1} \{Y_t(f,g) > \delta_R\} \;\cap \bigcap_{g \neq f:\ A_{f,g}=0,\ B_{f,g}=0} \{Y_t(f,g) \leq \delta_R\},$$

$$D_t^4(f) = \text{"}f\text{'s suggestions"} = \bigcap_{g \neq f:\ A_{f,g}=1} \Bigg( \bigcap_{h \neq f:\ A_{f,h}=1,\ B_{g,h}=1} \{W_t(f,g,h) > \delta_I\} \;\cap \bigcap_{h \neq f:\ A_{f,h}=1,\ B_{g,h}=0} \{W_t(f,g,h) \leq \delta_I\} \Bigg),$$

that is, $I_t(f) = \bigcup_{i=1}^{4} D_t^i(f)$, and the one-step transition probability from $A$ to $B$ can be expressed by the following formula:

$$p_{t,t+1}^{A,B} = P\Bigg( \bigcap_{f \in \mathcal{F}} I_t(f) \Bigg). \quad (4)$$

Proposition 1. The one-step transition probability from $A$ to $B$ of the Markov chain $\{A_t\}$ is given by:

$$p_{t,t+1}^{A,B} = \prod_{f \in \mathcal{F}} \Bigg( \sum_{i=1}^{4} P(D_t^i(f)) - \sum_{i=1}^{3} \sum_{j=i+1}^{4} P(D_t^i(f))\,P(D_t^j(f)) + \sum_{i=1}^{2} \sum_{j=i+1}^{3} \sum_{k=j+1}^{4} P(D_t^i(f))\,P(D_t^j(f))\,P(D_t^k(f)) - \prod_{i=1}^{4} P(D_t^i(f)) \Bigg),$$

with

$$P(D_t^1(f)) = \prod_{g \neq f:\ A_{f,g}=1,\ B_{f,g}=0} P\big(Y_t(f,g) < \delta_B\big),$$

$$P(D_t^2(f)) = \prod_{g \neq f:\ A_{f,g}=1,\ B_{f,g}=1} P\big(Y_t(f,g) \geq \delta_B\big),$$

$$P(D_t^3(f)) = \prod_{g \neq f:\ A_{f,g}=0,\ B_{f,g}=1} \big[1 - P\big(Y_t(f,g) \leq \delta_R\big)\big] \prod_{g \neq f:\ A_{f,g}=0,\ B_{f,g}=0} \big[1 - P\big(Y_t(f,g) > \delta_R\big)\big]$$

and

$$P(D_t^4(f)) = \prod_{g \neq f:\ A_{f,g}=1} \Bigg[ \prod_{h \neq f:\ A_{f,h}=1,\ B_{g,h}=1} \big(1 - P\big(W_t(f,g,h) \leq \delta_I\big)\big) \prod_{h \neq f:\ A_{f,h}=1,\ B_{g,h}=0} \big(1 - P\big(W_t(f,g,h) > \delta_I\big)\big) \Bigg].$$
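The per-profile factor in Proposition 1 is the inclusion-exclusion expansion of $P\big(D_t^1(f) \cup \cdots \cup D_t^4(f)\big)$ for independent events. A small numerical sketch, with hypothetical event probabilities, checks that the four-term expansion agrees with the complement shortcut $1 - \prod_i \big(1 - P(D_t^i(f))\big)$:

```python
from itertools import combinations

def p_prod(ps):
    """Product of a collection of probabilities."""
    out = 1.0
    for p in ps:
        out *= p
    return out

def union_prob_inclusion_exclusion(probs):
    """Inclusion-exclusion for P(D1 u ... u Dn) with independent events,
    as in the per-profile factor of Proposition 1: singles - pairs
    + triples - ... (alternating signs)."""
    total = 0.0
    for r in range(1, len(probs) + 1):
        sign = (-1) ** (r + 1)
        total += sign * sum(p_prod(c) for c in combinations(probs, r))
    return total

# hypothetical probabilities of the four events D_t^i(f)
probs = [0.2, 0.5, 0.1, 0.3]
lhs = union_prob_inclusion_exclusion(probs)
rhs = 1.0 - p_prod([1 - p for p in probs])  # complement form of the union
```

Both expressions give the same value, which is why the four-term alternating sum inside the product over $f$ is exactly the probability that profile $f$ performs at least one of its four kinds of intervention.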

Complete Transversality

Complete Transversality (CT) in a social network is associated with a specific communication behavior: the probability of a relationship between two profiles is always the same, no matter which profiles they are.

CT arises from social-scientific theories [12] positing that the massification of social networks would bring horizontality to communication and transversality to connections between people, overcoming social, economic, ethnographic and other differences. Virtual social networks would thus bring balance and democracy to the connections between all people.

In terms of the model, this is reflected in the "image index" $X$ of one user $p$ over another $p'$ and in the "image index" $U$ of the pairwise-distinct ordered triple of users $(p, p', p'')$: the averages of the regression functions $\Phi(X)$ and $\Psi(U)$ take the same value in (2) and (3), say $C_1$ for all pairs of distinct users and $C_2$ for all pairwise-distinct ordered triples of users. Under these conditions, the "image indices" given by (2) and (3) reduce to $C_1 + \varepsilon_t(f,g)$ and $C_2 + \eta_t(f,g,h)$.

When the network follows this behaviour, the following results can be established.

Theorem 1. (Homogeneity)

Under the CT context, $\{A_t\}$ is a time-homogeneous Markov chain.

Theorem 2. (Ergodicity)

Under the CT context, the Markov chain $\{A_t\}$ is ergodic and, when the initial distribution is the ergodic one, the friendship indicator of a pair of profiles $(f,g) \in \mathcal{F} \times \mathcal{F}$ with $f \neq g$, denoted $\alpha(f,g)$, is a Bernoulli random variable with parameter $p$, $0 < p < 1$, identically distributed and independent of the friendship indicator of any other pair of distinct profiles.

From the former theorem we conclude that, for any state $A$, the ergodic distribution under CT is

$$\pi_A = \prod_{f \in \mathcal{F}} \prod_{\substack{g \in \mathcal{F} \\ g \prec f}} p^{A_{f,g}} (1-p)^{1 - A_{f,g}}, \quad 0 < p < 1.$$
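A quick sanity check of this product form: for a toy network of 3 profiles (so 3 subdiagonal entries), the probabilities $\pi_A$ over all $2^3$ states should sum to one. The function name `pi_A` and the toy parameters are illustrative assumptions:

```python
import numpy as np
from itertools import product

def pi_A(sd_entries, p):
    """Product-form ergodic probability of a state described by its
    subdiagonal entries, each an independent Bernoulli(p) indicator."""
    sd = np.asarray(sd_entries)
    return float(np.prod(p ** sd * (1 - p) ** (1 - sd)))

M, p = 3, 0.3                       # tiny network: 3 profiles
n_edges = M * (M - 1) // 2          # 3 subdiagonal entries
states = list(product([0, 1], repeat=n_edges))
total = sum(pi_A(s, p) for s in states)
```

The total equals one because the distribution factorises into independent Bernoulli($p$) coordinates, one per distinct pair of profiles.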

3. Test and Estimation

We aim to discuss the validity of CT in the social network Facebook. To this end, we propose two statistics based on samples of $N$ profiles and study their asymptotic distribution under CT as $N$ increases. We then present the CT tests.

Further, in this section we introduce a weaker version of CT called Segmented Transversality (ST), relative to a given segmentation, which allows us to elaborate a segmentation quality index.

3.1. Average Communication between Profiles

Let

$$E_N = \frac{1}{C_N^2} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \alpha(f_i, f_j), \quad (5)$$

where $C_N^2 = \binom{N}{2}$. This statistic averages the proportion of friends among the profiles of the sample and therefore measures the "sample's communication average".

Focusing on the long-term dynamics, we have seen that under CT the random variables $\alpha(f_i,f_j)$, $i \neq j$, are independent Bernoulli with parameter $p$. Then one can verify that $E(E_N) = p$ and $Var(E_N) = \frac{2p(1-p)}{N(N-1)}$.
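These moment formulas can be checked by simulation under CT, drawing the pair indicators as i.i.d. Bernoulli($p$). This is only a sketch with arbitrary toy parameters:

```python
import numpy as np

def E_N(alpha):
    """Sample communication average (5): mean of alpha over the
    C(N, 2) distinct pairs, for an N x N symmetric 0/1 matrix."""
    i, j = np.triu_indices(alpha.shape[0], k=1)
    return alpha[i, j].mean()

rng = np.random.default_rng(1)
N, p, reps = 30, 0.2, 4000
vals = []
for _ in range(reps):
    U = rng.random((N, N)) < p
    A = np.triu(U, k=1)
    A = (A + A.T).astype(int)       # CT: i.i.d. Bernoulli(p) pair indicators
    vals.append(E_N(A))
vals = np.array(vals)

mean_hat, var_hat = vals.mean(), vals.var()
var_theory = 2 * p * (1 - p) / (N * (N - 1))
```

Over many replications, `mean_hat` approaches $p$ and `var_hat` approaches $2p(1-p)/(N(N-1))$, as stated above.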

We want to find the asymptotic distribution of $E_N$, so we study the asymptotic behaviour of

$$E_N - p = \frac{1}{C_N^2} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \big(\alpha(f_i,f_j) - p\big) \quad (6)$$

as the sample size $N$ grows to infinity.

If we write $Y_N^i = \frac{1}{C_N^2} \sum_{j=i+1}^{N} \big(\alpha(f_i,f_j) - p\big)$, we have $E_N - p = \sum_{i=1}^{N-1} Y_N^i$, where $\{Y_N^i : i = 1,2,\ldots,N-1\}$ forms a triangular array.

Theorem 3. If the triangular array $\{Y_N^i : i = 1,2,\ldots,N-1\}$ satisfies the following conditions:

i) for each $N$, the variables $\{Y_N^i : i = 1,2,\ldots,N-1\}$ are independent;

ii) $E(Y_N^i) \to 0$ as $N \to \infty$;

iii) $s_N^2 = \sum_{i=1}^{N-1} Var(Y_N^i) < \infty$;

iv) there exists $\delta > 0$ such that

$$E\big[(Y_N^i)^{2+\delta}\big] < \infty, \quad \text{for all } N \text{ and all } i,$$

and the Lyapunov condition is met, that is,

$$L(N,\delta) = \frac{1}{s_N^{2+\delta}} \sum_{i=1}^{N-1} E\big[(Y_N^i)^{2+\delta}\big] \to 0, \quad \text{as } N \to \infty, \quad (7)$$

then

$$\frac{1}{s_N} \sum_{i=1}^{N-1} Y_N^i \xrightarrow{w} N(0,1), \quad \text{as } N \to \infty. \quad (8)$$

Proof 1. Hypothesis i) is immediate because, for $i \neq i'$ and fixed $N$, $\sum_{j=i+1}^{N} \big(\alpha(f_i,f_j) - p\big)$ is independent of $\sum_{j=i'+1}^{N} \big(\alpha(f_{i'},f_j) - p\big)$.

Besides, $E(Y_N^i) = \frac{1}{C_N^2}(N-i)\,E\big(\alpha(f_i,f_j) - p\big) = 0$, and

$$s_N^2 = \sum_{i=1}^{N-1} Var(Y_N^i) = \frac{1}{(C_N^2)^2} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} Var\big(\alpha(f_i,f_j)\big) = \frac{1}{C_N^2}\, p(1-p) < \infty. \quad (9)$$

Hence hypotheses ii) and iii) are met.

Let $\delta = 2$. We will see that iv) holds. To this end we calculate

$$E\big((Y_N^i)^4\big) = \frac{1}{(C_N^2)^4}\, E\Bigg[\Big(\sum_{j=i+1}^{N} \big(\alpha(f_i,f_j)-p\big)\Big)^4\Bigg] = \frac{1}{(C_N^2)^4} \Bigg[\sum_{j=i+1}^{N} E\big(\alpha(f_i,f_j)-p\big)^4 + C_4^2 \sum_{j=i+1}^{N-1} \sum_{k=j+1}^{N} E\big(\alpha(f_i,f_j)-p\big)^2\, E\big(\alpha(f_i,f_k)-p\big)^2\Bigg] = \frac{1}{(C_N^2)^4} \big[(N-i)\,p^* + (N-i)(N-i-1)\,\tilde{p}\big],$$

with $p^* = E\big(\alpha(f_i,f_j)-p\big)^4 = p - 4p^2 + 6p^3 - 3p^4 \leq p$, $\tilde{p} = 3p^2(1-p)^2$, $N-i \leq N-1$ and $N-i-1 \leq N-2$.

Then,

$$E\big((Y_N^i)^4\big) \leq \frac{1}{(C_N^2)^4} \big[(N-1)p^* + (N-1)(N-2)\tilde{p}\big] = \frac{1}{(C_N^2)^4} \Big\{N^2\tilde{p} - \big[(3N-2)\tilde{p} - (N-1)p^*\big]\Big\} \leq \frac{1}{(C_N^2)^4}\, N^2 \tilde{p} < \infty, \quad \text{for all } N \text{ and all } i,$$

and the Lyapunov condition (7) can be verified for $\delta = 2$:

$$L(N,2) = \frac{1}{s_N^4} \sum_{i=1}^{N-1} E\big((Y_N^i)^4\big) \leq \frac{1}{s_N^4}\, \frac{1}{(C_N^2)^4}\,(N-1)\,N^2\,\tilde{p} = \frac{N^2(N-1)}{(C_N^2)^2}\, \frac{\tilde{p}}{p^2(1-p)^2} \sim \frac{4}{N}\, \frac{\tilde{p}}{p^2(1-p)^2} \to 0^+, \quad \text{as } N \to \infty.$$

Consequently hypotheses i)-iv) hold and we conclude that

$$\frac{1}{s_N} \sum_{i=1}^{N-1} Y_N^i \xrightarrow{w} N(0,1), \quad \text{as } N \to \infty.$$

Returning to the centered statistic $E_N - p = \sum_{i=1}^{N-1} Y_N^i$ and to the expression for $s_N$ given by (9), it results that

$$\frac{1}{s_N} \sum_{i=1}^{N-1} Y_N^i = \sqrt{\frac{N(N-1)}{2p(1-p)}}\,(E_N - p) \xrightarrow{w} N(0,1), \quad \text{as } N \to \infty,$$

that is,

$$N\,(E_N - p) \xrightarrow{w} N\big(0,\ 2p(1-p)\big), \quad \text{as } N \to \infty. \quad (10)$$
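The scaling in (10) can be checked empirically: under CT, $N(E_N - p)$ should have variance close to $2p(1-p)$ for moderately large $N$, since $E_N$ averages $C_N^2$ independent indicators. A Monte Carlo sketch with illustrative parameters:

```python
import numpy as np

rng = np.random.default_rng(2)
N, p, reps = 60, 0.3, 3000
z = []
for _ in range(reps):
    # under CT the C(N, 2) pair indicators are i.i.d. Bernoulli(p)
    pairs = rng.random(N * (N - 1) // 2) < p
    e_n = pairs.mean()              # the statistic E_N of (5)
    z.append(N * (e_n - p))         # the scaling from (10)
z = np.array(z)

emp_var = z.var()
theory_var = 2 * p * (1 - p)        # asymptotic variance in (10)
```

The empirical variance of `z` is close to `theory_var` (up to a factor $N/(N-1)$ and Monte Carlo noise), consistent with the normal limit.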

Since $E_N$ estimates the communication between profiles, if we select different profile samples under CT we should not detect differences among the estimates of $p$.

To study this, we propose the following hypothesis test to compare communication averages.

Given $f_1, \ldots, f_N$ and $g_1, \ldots, g_N$, independent profile samples from $\mathcal{F}$ satisfying $\{f_1,\ldots,f_N\} \cap \{g_1,\ldots,g_N\} = \emptyset$, we build the statistics:

$$E_N = \frac{1}{C_N^2} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \alpha(f_i,f_j) \quad \text{and} \quad E_N^* = \frac{1}{C_N^2} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \alpha(g_i,g_j),$$

and we have that

$$\mu = E(E_N) = E(E_N^*) = p \in (0,1)$$

and, from (10), the common asymptotic variance

$$\sigma^2 = 2p(1-p),$$

with $\sigma = \sigma(\mu) = \sqrt{2\mu(1-\mu)}$; that is, mean and variance are linked and $\sigma(\mu)$ is differentiable. So, if we take

$$g(\mu) = \sqrt{2}\,\arcsin(\sqrt{\mu}),$$

then $g'(\mu) = \frac{1}{\sigma(\mu)}$ and, by the delta method,

$$N\big(g(E_N) - g(p)\big) \xrightarrow{w} N\big(0,\ \sigma^2(p)\,g'(p)^2\big) = N(0,1), \quad \text{as } N \to \infty, \quad (11)$$

and

$$N\big(g(E_N^*) - g(p)\big) \xrightarrow{w} N\big(0,\ \sigma^2(p)\,g'(p)^2\big) = N(0,1), \quad \text{as } N \to \infty. \quad (12)$$

Consequently, to test the average communication, we can subtract (12) from (11). Under the CT assumption the resulting statistic satisfies

$$N\big(g(E_N) - g(E_N^*)\big) \xrightarrow{w} N(0,2), \quad \text{as } N \to \infty, \quad (13)$$

and, given a significance level $\alpha$, we obtain the following critical region:

$$R_\alpha = \left\{ \frac{N}{\sqrt{2}}\,\big|g(E_N) - g(E_N^*)\big| \geq z_{\alpha/2} \right\}.$$
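The resulting test can be sketched as follows; `ct_test` is a hypothetical helper name, $z_{\alpha/2} = 1.96$ corresponds to $\alpha = 0.05$, and the input proportions are illustrative only (the paper's real data are confidential):

```python
import numpy as np

def g(mu):
    """Variance-stabilising transform for E_N: g'(mu) = 1 / sqrt(2 mu (1 - mu))."""
    return np.sqrt(2.0) * np.arcsin(np.sqrt(mu))

def ct_test(e_n, e_n_star, N, z_crit=1.96):
    """Two-sample CT test based on (13): reject CT when
    (N / sqrt(2)) |g(E_N) - g(E_N*)| exceeds z_{alpha/2}."""
    stat = N / np.sqrt(2.0) * abs(g(e_n) - g(e_n_star))
    return stat, stat >= z_crit

# identical sample averages never reject; very different ones do
stat_eq, reject_eq = ct_test(0.30, 0.30, N=75)
stat_far, reject_far = ct_test(0.20, 0.50, N=75)
```

The helper takes the two observed communication averages and the common sample size, mirroring the critical region $R_\alpha$ above.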

Real Data Testing

We performed such an experiment on a real profile network whose owner gave the authors permission to sample it. For confidentiality reasons we cannot release any of the data used in the calculations. We can state that we took two independent and disjoint samples of size $N = 75$. The value of the statistic was $\frac{N}{\sqrt{2}}\,|g(E_N) - g(E_N^*)| = 2.1$. Therefore, at the $\alpha = 0.05$ significance level, since $2.1 > z_{0.025} = 1.96$, we reject the CT hypothesis.

This result indicates that the social network Facebook is a platform in which communication between people or groups of people is NOT transversal.

3.2. Mean Square Deviation of the Communication between Profiles

Let

$$T_N = \frac{1}{C_N^2} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \Bigg[\alpha(f_i,f_j) - \frac{1}{C_N^2} \sum_{k=1}^{N-1} \sum_{l=k+1}^{N} \alpha(f_k,f_l)\Bigg]^2 = \frac{1}{C_N^2} \Bigg[\sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \alpha(f_i,f_j)^2 - \frac{1}{C_N^2} \Bigg(\sum_{k=1}^{N-1} \sum_{l=k+1}^{N} \alpha(f_k,f_l)\Bigg)^2\Bigg] = \frac{1}{(C_N^2)^2} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \sum_{k=1}^{N-1} \sum_{l=k+1}^{N} \frac{\big(\alpha(f_i,f_j) - \alpha(f_k,f_l)\big)^2}{2}, \quad (14)$$

be a statistic that estimates the mean square deviation of the communication between profiles with respect to its mean.
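The three expressions in (14) are algebraically identical (the usual variance decomposition plus the pairwise-difference form), which can be verified numerically on simulated pair indicators:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 12
# the C(N, 2) pair indicators alpha(f_i, f_j), as a flat vector
x = (rng.random(N * (N - 1) // 2) < 0.4).astype(float)
C = len(x)                                  # C(N, 2)

t1 = ((x - x.mean()) ** 2).mean()           # first form of (14)
t2 = (x ** 2).mean() - x.mean() ** 2        # second form
diff = x[:, None] - x[None, :]              # all pairwise differences
t3 = (diff ** 2).sum() / (2 * C ** 2)       # pairwise (U-statistic) form
```

The pairwise form is the one that exhibits $T_N$ as a U-statistic, which is exploited next.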

In order to find the asymptotic distribution of $T_N$ we use properties of U-statistics, introduced in [13].

Suppose that $X_1, \ldots, X_N$ are independent and identically distributed random variables and that $h : \mathbb{R}^r \to \mathbb{R}$, $1 \leq r \leq N$, is a function symmetric with respect to permutations of its arguments.

Definition 7. A U-statistic of order $r$ with kernel $h$ is defined as

$$U = U_N = \frac{1}{C_N^r} \sum_{1 \leq i_1 < i_2 < \cdots < i_r \leq N} h(X_{i_1}, \ldots, X_{i_r}).$$
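Definition 7 can be implemented generically. As a hypothetical example we use the kernel $h(x,y) = (x-y)^2/2$, whose order-2 U-statistic is the unbiased sample variance:

```python
import math
from itertools import combinations

def u_statistic(xs, h, r):
    """U-statistic of order r with symmetric kernel h (Definition 7):
    the average of h over all C(N, r) index subsets."""
    n = len(xs)
    total = sum(h(*(xs[i] for i in idx)) for idx in combinations(range(n), r))
    return total / math.comb(n, r)

# kernel (x - y)^2 / 2: its U-statistic of order 2 is the unbiased variance
xs = [1.0, 2.0, 4.0, 7.0]
u = u_statistic(xs, lambda x, y: (x - y) ** 2 / 2, r=2)
```

For these four values the U-statistic equals 7.0, matching the unbiased sample variance of `xs`.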

We state the following theorem, whose proof can be found in [10].

Theorem 4. Let $U_N$ be a U-statistic of order $r$ with kernel $h$. Suppose that $E\big(h^2(X_1,\ldots,X_r)\big) < \infty$ and that

i) $X_1, X_2, \ldots, X_r, Y_2, \ldots, Y_r$ are independent and identically distributed random variables,

ii) $\theta = E\big(h(X_1,\ldots,X_r)\big)$ and

iii) $\zeta_1 = Cov\big(h(X_1, X_2, \ldots, X_r),\ h(X_1, Y_2, \ldots, Y_r)\big)$.

Then, if $0 < \zeta_1 < \infty$,

$$\sqrt{N}\,(U_N - \theta) \xrightarrow{w} N\big(0,\ r^2 \zeta_1\big), \quad \text{as } N \to \infty. \quad (15)$$

Then $T_N$ is a U-statistic of order 2 with kernel

$$h\big(v(f_i), v(f_k)\big) = \frac{\big(\alpha(f_i,f_j) - \alpha(f_k,f_l)\big)^2}{2},$$

where $v(f_i) = \big(\alpha(f_i,f_h)\big)_{h \in \mathcal{F}}$ and $v(f_k) = \big(\alpha(f_k,f_h)\big)_{h \in \mathcal{F}}$. Thus, under CT, $\theta = p(1-p)$ and $\zeta_1 = \frac{1}{4}p(1-p) - p^2(1-p)^2$.

Remark. In the proposed model for Facebook, $p$ is the long-term friendship probability of any pair of profiles and, because of the large size of the network, $p$ is probably less than 0.5. Thus $p \neq 0$, $p \neq 1$ and $p \neq 0.5$, and therefore $\zeta_1 \neq 0$.

As $0 < \zeta_1 < \infty$, by Theorem 4,

$$\sqrt{N}\,\big(T_N - p(1-p)\big) \xrightarrow{w} N\big(0,\ p(1-p) - 4p^2(1-p)^2\big), \quad \text{as } N \to \infty. \quad (16)$$

With $\mu = p(1-p)$ and $\sigma = \sqrt{p(1-p) - 4p^2(1-p)^2} = \sqrt{\mu - 4\mu^2}$, we see that $\sigma = \sigma(\mu)$ and $\sigma(\mu)$ is differentiable. Taking

$$g(\mu) = \arcsin(2\sqrt{\mu}),$$

$g$ satisfies $g'(\mu) = \frac{1}{\sigma(\mu)}$ and the limit expression (16) becomes

$$\sqrt{N}\,\Big(g(T_N) - g\big(p(1-p)\big)\Big) \xrightarrow{w} N\Big(0,\ \sigma^2\big(p(1-p)\big)\,g'\big(p(1-p)\big)^2\Big) = N(0,1), \quad (17)$$

as $N \to \infty$.

We can test CT by comparing the mean square deviations of two independent populations of profiles. For this we take two independent samples of $N$ profiles from $\mathcal{F}$, $f_1, \ldots, f_N$ and $g_1, \ldots, g_N$, such that $\{f_1,\ldots,f_N\} \cap \{g_1,\ldots,g_N\} = \emptyset$, and construct the statistics

$$T_N = \frac{1}{C_N^2} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \Bigg[\alpha(f_i,f_j) - \frac{1}{C_N^2} \sum_{k=1}^{N-1} \sum_{l=k+1}^{N} \alpha(f_k,f_l)\Bigg]^2$$

and

$$T_N^* = \frac{1}{C_N^2} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \Bigg[\alpha(g_i,g_j) - \frac{1}{C_N^2} \sum_{k=1}^{N-1} \sum_{l=k+1}^{N} \alpha(g_k,g_l)\Bigg]^2.$$

Then, by (17), we obtain the asymptotic distribution of the statistic for the comparison of mean square deviations under CT, that is,

$$\sqrt{N}\,\big(g(T_N) - g(T_N^*)\big) \xrightarrow{w} N(0,2), \quad \text{as } N \to \infty,$$

the critical region at significance level $\alpha$ being

$$R_\alpha = \left\{ \sqrt{\frac{N}{2}}\,\big|g(T_N) - g(T_N^*)\big| \geq z_{\alpha/2} \right\}.$$
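The variance-stabilising property of $g(\mu) = \arcsin(2\sqrt{\mu})$, namely $g'(\mu) = 1/\sigma(\mu)$ with $\sigma(\mu) = \sqrt{\mu - 4\mu^2}$, can be checked numerically; note that $\mu = p(1-p) \leq 1/4$, so $2\sqrt{\mu} \leq 1$ and the arcsine is well defined. A sketch with an illustrative value of $p$:

```python
import numpy as np

def g2(mu):
    """Variance-stabilising transform for T_N: g2(mu) = arcsin(2 sqrt(mu))."""
    return np.arcsin(2.0 * np.sqrt(mu))

def sigma(mu):
    """Asymptotic sd of T_N as a function of its mean mu = p(1 - p)."""
    return np.sqrt(mu - 4.0 * mu ** 2)

# check g2'(mu) = 1 / sigma(mu) by a central difference at mu = p(1-p), p = 0.2
p = 0.2
mu, h = p * (1 - p), 1e-6
deriv = (g2(mu + h) - g2(mu - h)) / (2 * h)
```

At $p = 0.2$ we have $\mu = 0.16$, $\sigma(\mu) = 0.24$, and the numerical derivative agrees with $1/\sigma(\mu)$, which is what makes the limiting variance in (17) equal to one.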

As with the test comparing the average proportion of communication between profiles, we used the same real profile and, on its network of friends, took two independent and disjoint samples, calculated the statistic and the critical region, and again concluded that the hypothesis of CT is rejected.

3.3. Segmented Transversality

A context of CT in a social network like Facebook is far from reality, as evidenced by the findings of the two tests we made. It is reasonable to think that profiles tend to cluster in different segments according to social criteria such as political ideology, economic interests, musical taste, age, etc., and that these segments are also related to each other.

We introduce the concept of Segmented Transversality (ST), that is, CT between segments. Given an a priori segmentation of the network, we will introduce a statistic representing the communication between pairs of segments and test CT between the profiles of the segments.

Let $S_1, \ldots, S_k$ be a partition of $\mathcal{F}$ into segments. We denote by $a_i = \frac{card\{S_i\}}{card\{\mathcal{F}\}}$ the proportion of profiles in $S_i$, $i = 1,2,\ldots,k$, with $a_i > 0$ and $\sum_{i=1}^{k} a_i = 1$, and we make a stratified random sampling by segments as follows: $f_1, \ldots, f_{[a_1 N]}$ are chosen at random inside $S_1$; $f_{[a_1 N]+1}, \ldots, f_{[(a_1+a_2)N]}$ are chosen at random inside $S_2$; and so on, until $f_{[(\sum_{i=1}^{k-1} a_i)N]+1}, \ldots, f_{(\sum_{i=1}^{k} a_i)N}$ are chosen at random inside $S_k$, where $[x]$ denotes the integer part of $x$, that is, $[x] = \max\{k \in \mathbb{N} : k \leq x\}$ for $x > 0$.

Given the index sets $I_1 = \{1, \ldots, [a_1 N]\}$, $I_2 = \{[a_1 N]+1, \ldots, [(a_1+a_2)N]\}$, $\ldots$, $I_k = \{[(\sum_{i=1}^{k-1} a_i)N]+1, \ldots, (\sum_{i=1}^{k} a_i)N\}$, and given two segments $S_r$ and $S_t$ of the partition of $\mathcal{F}$ into $k$ segments, for $f_i \in S_r$ and $f_j \in S_t$, with $i \in I_r$ and $j \in I_t$, we denote by $q_{ij}$ the probability of friendship between $f_i$ and $f_j$, that is, $q_{ij} := P\big(\alpha(f_i,f_j) = 1\big) = q_{ji}$.

Remark. The random friendship functions $\big(\alpha(f_i,f_j)\big)_{i<j}$ are still independent random variables with Bernoulli distribution, but now the parameter depends on the intensity of the relationship between the pair of profiles considered.

Definition 8. Given two segments of profiles $S_r$ and $S_t$, we say that there is CT between them if, for $f_i \in S_r$ and $f_j \in S_t$ with $i \in I_r$ and $j \in I_t$, $\alpha(f_i,f_j)$ has Bernoulli distribution with parameter $q_{rt}$, for all $f_i \in S_r$ and all $f_j \in S_t$.

That is, CT between segments means a distinctive homogeneous behavior in the communication between them.

So, let

$$E^{(r,t)} = \frac{1}{card\{I_r\}} \sum_{i \in I_r} \frac{1}{card\{I_t\}} \sum_{j \in I_t} \alpha(f_i,f_j) \quad (18)$$

be the average proportion of friends that the profiles of $S_r$ have in the segment $S_t$. Then, under CT between $S_r$ and $S_t$, we have $E\big(E^{(r,t)}\big) = q_{rt}$ and

$$Var\big(E^{(r,t)}\big) = \frac{q_{rt}(1-q_{rt})}{card\{I_r\}\, card\{I_t\}}.$$

In the same way as we obtained the asymptotic distribution of the centered estimator $E_N - p$ in (6), we can obtain the asymptotic distribution of

$$E^{(r,t)} - q_{rt} = \sum_{i \in I_r} \frac{1}{card\{I_r\}} \sum_{j \in I_t} \frac{\alpha(f_i,f_j) - q_{rt}}{card\{I_t\}}, \quad (19)$$

resulting in

$$\sqrt{card\{I_r\}\, card\{I_t\}}\,\big(E^{(r,t)} - q_{rt}\big) \xrightarrow{w} N\big(0,\ q_{rt}(1-q_{rt})\big), \quad \text{as } N \to \infty. \quad (20)$$
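The segment estimator (18) can be sketched as follows, simulating the cross-segment indicators as i.i.d. Bernoulli($q_{rt}$) under ST; the segment sizes and the value of $q_{rt}$ are illustrative assumptions:

```python
import numpy as np

def E_rt(alpha, I_r, I_t):
    """Average friendship proportion (18) between the sampled segments:
    mean of alpha(f_i, f_j) over i in I_r and j in I_t."""
    return float(np.mean([[alpha[i, j] for j in I_t] for i in I_r]))

rng = np.random.default_rng(4)
n_r, n_t, q_rt = 40, 50, 0.15
n = n_r + n_t

# under ST, cross-segment indicators are i.i.d. Bernoulli(q_rt)
alpha = (rng.random((n, n)) < q_rt).astype(int)
I_r, I_t = range(n_r), range(n_r, n)
est = E_rt(alpha, I_r, I_t)
```

With $card\{I_r\} \cdot card\{I_t\} = 2000$ indicators averaged, the estimate concentrates around $q_{rt}$, with standard deviation $\sqrt{q_{rt}(1-q_{rt})/2000}$ as in the variance formula above.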

If we want to test CT between a pair of segments $S_r$ and $S_t$, we make a stratified random sampling independent of the previous one, in which $g_1, \ldots, g_{[a_1 N]}$ are chosen at random inside $S_1$, $g_{[a_1 N]+1}, \ldots, g_{[(a_1+a_2)N]}$ are chosen at random inside $S_2$, and so on until $g_{[(\sum_{i=1}^{k-1} a_i)N]+1}, \ldots, g_{(\sum_{i=1}^{k} a_i)N}$ are chosen at random inside $S_k$, with

$$\{f_1, \ldots, f_{[a_1 N]}\} \cap \{g_1, \ldots, g_{[a_1 N]}\} = \emptyset,$$

$$\{f_{[a_1 N]+1}, \ldots, f_{[(a_1+a_2)N]}\} \cap \{g_{[a_1 N]+1}, \ldots, g_{[(a_1+a_2)N]}\} = \emptyset,$$

$$\vdots$$

$$\{f_{[(\sum_{i=1}^{k-1} a_i)N]+1}, \ldots, f_{(\sum_{i=1}^{k} a_i)N}\} \cap \{g_{[(\sum_{i=1}^{k-1} a_i)N]+1}, \ldots, g_{(\sum_{i=1}^{k} a_i)N}\} = \emptyset,$$

and we construct the statistic

$$E^{*(r,t)} = \frac{1}{card\{I_r\}} \sum_{i \in I_r} \frac{1}{card\{I_t\}} \sum_{j \in I_t} \alpha(g_i,g_j). \quad (21)$$

Then, under CT between $S_r$ and $S_t$, $E\big(E^{*(r,t)}\big) = q_{rt}$, $Var\big(E^{*(r,t)}\big) = \frac{q_{rt}(1-q_{rt})}{card\{I_r\}\, card\{I_t\}}$ and

$$\sqrt{card\{I_r\}\, card\{I_t\}}\,\big(E^{*(r,t)} - q_{rt}\big) \xrightarrow{w} N\big(0,\ q_{rt}(1-q_{rt})\big), \quad \text{as } N \to \infty. \quad (22)$$

For $E^{(r,t)}$ and $E^{*(r,t)}$ the asymptotic standard deviation $\sigma$ is a function of the expected value $\mu$, namely $\sigma(\mu) = \sqrt{\mu(1-\mu)}$ with $\mu = q_{rt}$, and $\sigma(\mu)$ is differentiable. Then, taking

$$g(\mu) = 2\arcsin(\sqrt{\mu}),$$

we have $g'(\mu) = \frac{1}{\sigma(\mu)}$ and, as in the previous section, we can conclude that

$$\sqrt{card\{I_r\}\, card\{I_t\}}\,\big(g(E^{(r,t)}) - g(E^{*(r,t)})\big) \xrightarrow{w} N(0,2), \quad \text{as } N \to \infty,$$

the critical region at significance level $\alpha$ being

$$R_\alpha = \left\{ \sqrt{\frac{card\{I_r\}\, card\{I_t\}}{2}}\,\big|g(E^{(r,t)}) - g(E^{*(r,t)})\big| \geq z_{\alpha/2} \right\}.$$

Therefore, if the test leads to rejection of the hypothesis of CT between the segments $S_r$ and $S_t$, we say, with error probability $\alpha$, that these segments do not have a distinctive homogeneous communication behavior.

3.4. Quality on Segmentation

If we divide the network into $k$ disjoint segments, we can take all $C_k^2$ possible pairs of those segments and perform a total of $C_k^2$ tests, one for each pair, to check whether the segments of that pair have a distinctive homogeneous communication behavior. We can represent these $C_k^2$ tests by a binary symmetric matrix $\mathcal{S}$ of order $k$, in which $\mathcal{S}_{ij}$ equals one if the segments $S_i$ and $S_j$ were not homogeneous in terms of communication, that is, if the corresponding test rejects the hypothesis of CT, and $\mathcal{S}_{ij}$ equals zero otherwise.

Then, denoting the number of ones in the subdiagonal of $\mathcal{S}$ by

$$g(d) = card\{1\text{'s in } SD(\mathcal{S})\},$$

we can define the following performance index to measure the quality of the segmentation:

$$C_p = \frac{g(d)}{C_k^2} \cdot 100\%. \quad (23)$$

If we keep the segmentation, make a stratified random sampling by segments $m$ times, with $m$ sufficiently large, and calculate the index defined in (23) $m$ times, obtaining $C_p^i$, $i = 1, \ldots, m$, we can observe the histogram representing the distribution of segmentation quality. If most of the time this measure results, for example, greater than the mean of the observations, we continue segmenting according to the criteria used so far; otherwise it is advisable to modify the segmentation criteria.
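The index (23) can be computed directly from the test-outcome matrix $\mathcal{S}$; the matrix below is a hypothetical outcome for $k = 4$ segments, chosen only to exercise the function:

```python
import numpy as np

def quality_index(S):
    """Segmentation quality (23): percentage of ones among the
    C(k, 2) subdiagonal entries of the test-outcome matrix S."""
    k = S.shape[0]
    i, j = np.tril_indices(k, k=-1)
    g_d = int(S[i, j].sum())                 # number of rejected pairwise tests
    return 100.0 * g_d / (k * (k - 1) / 2)   # divide by C(k, 2), as percentage

# hypothetical outcomes: S[i, j] = 1 when the CT test between
# segments i and j rejected (only the subdiagonal matters)
S = np.array([[0, 0, 0, 0],
              [1, 0, 0, 0],
              [0, 1, 0, 0],
              [1, 0, 0, 0]])
cp = quality_index(S)
```

Here 3 of the 6 pairwise tests reject, giving an index of 50%.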

Let us illustrate this with an example. Suppose we segment people by age (over 15 and under 20, between 20 and 30, over 30) and gender (M, F), obtaining $k = 6$ segments.

Given this segmentation, suppose we conduct the 15 hypothesis tests for segmented transversality and obtain a segment adjacency matrix $\mathcal{S}$ (not reproduced here).

A one in this matrix means that segment $S_i$ and segment $S_j$ behave distinctly in the sense of transversal communication. Zeros in the upper triangle of the matrix carry no meaning. Each segment compared with itself is homogeneous (zeros on the diagonal), except segment 6. This could mean that further segmentation within it might improve the audience segmentation. Nevertheless, segment 6 is distinguished from the other segments anyway.

If we sum the ones under the diagonal and divide by the number of segment pairs (see (23)), we obtain the performance of this segmentation, which is 40%. The more zeros, the lower the performance index.

Following this framework, we can improve this segmentation by:

1) merging homogeneous segments;

2) exploring intra-segmentation in the cases where there was a one in the diagonal.

For actions of the first kind, we see that in this toy example $S_1$ and $S_2$ do not differ in communication behaviour, so we could consider grouping them into one segment. We could summarise this by saying that gender does not segment among young people (under 20 years). We then recalculate the matrix with this merged segment and recompute the performance. Occam's razor leads us to select the least segmented partition when two or more have the same performance.

For segment 6, that is, females over thirty, we find that it is inhomogeneous with itself, so we could try a sub-segmentation by education level or by motherhood. We then repeat the five tests between each new segment and the previous ones and calculate the performance of the new matrix, which has one or more additional rows.

Of course, this iterative method implies significant work on estimation, data collection, data volume, independent sampling, etc., which, although possibly of extreme relevance, is beyond the scope of pure mathematics and poses a great source of scientific challenge and interdisciplinary work.

4. Conclusions

In this work we analyzed the Facebook social network, modelling it with a Markov chain and several random variables representing profile friendships. We proposed a communication behaviour between all profiles, called complete transversality, which assumes no bias in the profiles' willingness to connect as friends. This CT behaviour leads to estimators that allowed us to carry out a hypothesis test based on the mean square deviation and to reject the CT assumption. This might seem an obvious conclusion (because people behave within Facebook as they behave in real contexts), but it is backed by the full hypothesis-testing machinery, which gives it rigour.

The next step in our work was to weaken CT; for this we introduced ST (Segmented Transversality): given a determined network segmentation, each profile of one segment connects to any profile of another segment with the same probability. (Of course this probability can change for different pairs of segments.)

In the ST scenario we were able to compare two entire segments and determine whether they behave in the same way or differently. If we find no differences, we can safely merge the two segments in order to improve the original segmentation. For a given segmentation we test within each segment and between every pair of different segments, and define a performance index reflecting the percentage of comparisons in which segments behave differently. With this result, we join similar segments (those for which the null hypothesis is not rejected) and recalculate the tests and the performance index for the new segmentation.

This iterative process of grouping homogeneous segments, or leaving different ones apart, leads to a hierarchy of segmentations according to their quality. This information can be used to decide whether the segmentation rules are adequate to distinguish segments; of course, the comparison is made in the sense of this communication behaviour.

Cite this paper

Bavio, J., Guardiola, M. and Perera, G. (2018) Facebook Dynamics: Modelling and Statistical Testing. Advances in Pure Mathematics, 8, 380-399. https://doi.org/10.4236/apm.2018.84021

References

  1. Freeman, L.C. (2006) The Development of Social Network Analysis: A Study in the Sociology of Science. Empirical Press, Vancouver.

  2. Rao, A.R., Bandyopadhyay, S. and Sinha, B.K. (2011) Models of Social Networks with Statistical Applications. SAGE Publications Ltd., Thousand Oaks.

  3. Furht, B. (2010) Handbook of Social Network Technologies and Applications. Springer, Berlin.
    https://doi.org/10.1007/978-1-4419-7142-5

  4. Scott, J., Carrington, P. and Wasserman, S. (2005) Models and Methods in Social Network Analysis. Cambridge University Press, Cambridge.

  5. Ralescu, D.A., Chen, L. and Peng, J. (2017) Uncertain Vertex Coloring Problem. Soft Computing, 21, 1-10.

  6. Rosyida, I., Peng, J., Chen, L., Widodo, W., Indrati, Ch.R. and Sugeng, K.A. (2016) An Uncertain Chromatic Number of an Uncertain Graph Based on Alpha-Cut Coloring. Fuzzy Optimization and Decision Making, 17, 103-123.

  7. Meyn, S.P. and Tweedie, R.L. (1993) Markov Chains and Stochastic Stability. Springer, New York.
    https://doi.org/10.1007/978-1-4471-3267-7

  8. De Meer, H., Bolch, G., Greiner, S. and Trivedi, K. (2006) Queueing Networks and Markov Chains: Modeling and Performance Evaluation with Computer Science Applications. Wiley and Sons, Hoboken.

  9. DasGupta, A. (2008) Asymptotic Theory of Statistics and Probability. Springer, Berlin.

  10. Serfling, R.J. (1980) Approximation Theorems of Mathematical Statistics. John Wiley and Sons, New York.
    https://doi.org/10.1002/9780470316481

  11. Guardiola, M. (2013) Estadística de procesos estocásticos y aplicaciones a las redes sociales. Doctoral Thesis, Universidad Nacional del Sur.
    http://repositoriodigital.uns.edu.ar/bitstream/123456789/446/1/TESIS%20DOCTORAL%20Melina%20Guardiola.pdf

  12. De Ugarte, D. (2011) El poder de las redes. Manual ilustrado para ciberactivistas. Colección Biblioteca de las Indias Electrónicas.

  13. Hoeffding, W. (1948) A Class of Statistics with Asymptotically Normal Distribution. Annals of Mathematical Statistics, 19, 293-325.
    https://doi.org/10.1214/aoms/1177730196

Appendix

1.1. Proof of Proposition 1

Let $f, f' \in \mathcal{F}$, $f \neq f'$, and $I_t(f) = \bigcup_{i=1}^{4} D_t^i(f)$. We will see that $I_t(f)$ is independent of $I_t(f')$. For this, it is enough to prove that $D_t^i(f)$ is independent of $D_t^i(f')$ for $i = 1, 2, 3, 4$. Indeed, looking at $D_t^i(f)$, the first coordinate of both indices involved, $Y_t(f,g)$ for $i = 1, 2, 3$ and $W_t(f,g,h)$ for $i = 4$, is fixed, and it is also the first coordinate of the random variables $\varepsilon_t(f,g)$ and $\eta_t(f,g,h)$ attached to those indices. Then $Y_t(f,g)$ is independent of $Y_t(f',g)$ and $W_t(f,g,h)$ is independent of $W_t(f',g,h)$; hence $D_t^i(f)$ is independent of $D_t^i(f')$ for $i = 1, 2, 3, 4$. Thus,

$$p_{t,t+1}^{A,B} = P\Big(\bigcap_{f \in \mathcal{F}} I_t(f)\Big) = \prod_{f \in \mathcal{F}} P(I_t(f)).$$

Applying the inclusion-exclusion principle to the events $D_t^i(f)$, $i = 1, 2, 3, 4$, we have

$$p_{t,t+1}^{A,B} = \prod_{f \in \mathcal{F}} \Bigg( \sum_{i=1}^{4} P(D_t^i(f)) - \sum_{i=1}^{3} \sum_{j=i+1}^{4} P\big(D_t^i(f) \cap D_t^j(f)\big) + \sum_{i=1}^{2} \sum_{j=i+1}^{3} \sum_{k=j+1}^{4} P\big(D_t^i(f) \cap D_t^j(f) \cap D_t^k(f)\big) - P\Big(\bigcap_{i=1}^{4} D_t^i(f)\Big) \Bigg) \quad (24)$$

We have to check that the events $D_t^i(f)$, $i = 1, 2, 3, 4$, are mutually independent for the same profile $f$. $D_t^4(f)$ is independent of $D_t^1(f)$, $D_t^2(f)$ and $D_t^3(f)$, because $\eta_t(f,g,h)$, which corresponds to the image index $W_t(f,g,h)$ involved in $D_t^4(f)$, is independent of $\varepsilon_t(f,g)$, which corresponds to the index $Y_t(f,g)$ appearing in $D_t^1(f)$, $D_t^2(f)$ and $D_t^3(f)$.

$D_t^1(f)$ is independent of $D_t^2(f)$ and of $D_t^3(f)$, because if a profile $g \in \mathcal{F}$ appears in the intersections of $D_t^1(f)$, it appears neither in the intersections of $D_t^2(f)$ nor in the unions of $D_t^3(f)$. The same argument shows that $D_t^2(f)$ is independent of $D_t^3(f)$.

Therefore, the transition probability (24) is given by

$$p_{t,t+1}^{A,B} = \prod_{f \in \mathcal{F}} \Bigg( \sum_{i=1}^{4} P(D_t^i(f)) - \sum_{i=1}^{3} \sum_{j=i+1}^{4} P(D_t^i(f)) P(D_t^j(f)) + \sum_{i=1}^{2} \sum_{j=i+1}^{3} \sum_{k=j+1}^{4} P(D_t^i(f)) P(D_t^j(f)) P(D_t^k(f)) - \prod_{i=1}^{4} P(D_t^i(f)) \Bigg).$$
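Since the events are independent, the bracketed alternating sum is simply $P\big(\bigcup_{i=1}^{4} D_t^i(f)\big) = 1 - \prod_{i=1}^{4}\big(1 - P(D_t^i(f))\big)$. A quick numerical check of this equivalence (illustrative only, with arbitrary probabilities):

```python
from itertools import combinations
import math

def union_by_inclusion_exclusion(ps):
    """P(union of independent events) via the alternating sum in (24):
    under independence, P of any intersection is the product of the ps."""
    total = 0.0
    for r in range(1, len(ps) + 1):
        for idx in combinations(range(len(ps)), r):
            total += (-1) ** (r + 1) * math.prod(ps[i] for i in idx)
    return total

ps = [0.2, 0.5, 0.1, 0.7]               # four event probabilities (arbitrary)
lhs = union_by_inclusion_exclusion(ps)
rhs = 1 - math.prod(1 - p for p in ps)   # complement of "no event occurs"
# lhs and rhs agree up to floating point
```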

We will now see how each $P(D_t^i(f))$, $i = 1, 2, 3, 4$, is finally written.

If $g, g' \in \mathcal{F}$, with $g \neq f$, $g' \neq f$, $g \neq g'$, then $\varepsilon_t(f,g)$ is independent of $\varepsilon_t(f,g')$. Hence the following pairs of events are independent: $\{Y_t(f,g) < \delta_B\}$ and $\{Y_t(f,g') < \delta_B\}$; $\{Y_t(f,g) \geq \delta_B\}$ and $\{Y_t(f,g') \geq \delta_B\}$; $\{Y_t(f,g) \leq \delta_R\}$ and $\{Y_t(f,g') \leq \delta_R\}$; and $\{Y_t(f,g) > \delta_R\}$ and $\{Y_t(f,g') > \delta_R\}$. Thus,

$$P(D_t^1(f)) = P\Big(\bigcap_{g \neq f:\, A_{f,g}=1,\, B_{f,g}=0} \{Y_t(f,g) < \delta_B\}\Big) = \prod_{g \neq f:\, A_{f,g}=1,\, B_{f,g}=0} P\big(Y_t(f,g) < \delta_B\big),$$

$$P(D_t^2(f)) = P\Big(\bigcap_{g \neq f:\, A_{f,g}=1,\, B_{f,g}=1} \{Y_t(f,g) \geq \delta_B\}\Big) = \prod_{g \neq f:\, A_{f,g}=1,\, B_{f,g}=1} P\big(Y_t(f,g) \geq \delta_B\big),$$

and, applying elementary set operations, we have that

$$\begin{aligned} P(D_t^3(f)) &= P\Big(\bigcup_{g \neq f:\, A_{f,g}=0,\, B_{f,g}=1} \{Y_t(f,g) > \delta_R\} \;\cup \bigcup_{g \neq f:\, A_{f,g}=0,\, B_{f,g}=0} \{Y_t(f,g) \leq \delta_R\}\Big) \\ &= 1 - P\Big(\bigcap_{g \neq f:\, A_{f,g}=0,\, B_{f,g}=1} \{Y_t(f,g) \leq \delta_R\} \;\cap \bigcap_{g \neq f:\, A_{f,g}=0,\, B_{f,g}=0} \{Y_t(f,g) > \delta_R\}\Big) \\ &= 1 - \prod_{g \neq f:\, A_{f,g}=0,\, B_{f,g}=1} P\big(Y_t(f,g) \leq \delta_R\big) \prod_{g \neq f:\, A_{f,g}=0,\, B_{f,g}=0} P\big(Y_t(f,g) > \delta_R\big), \end{aligned}$$

since if $g \in \{g \neq f : A_{f,g} = 0, B_{f,g} = 1\}$ then $g \notin \{g \neq f : A_{f,g} = 0, B_{f,g} = 0\}$, and accordingly $\bigcap_{g \neq f:\, A_{f,g}=0,\, B_{f,g}=1} \{Y_t(f,g) \leq \delta_R\}$ and $\bigcap_{g \neq f:\, A_{f,g}=0,\, B_{f,g}=0} \{Y_t(f,g) > \delta_R\}$ are independent.

Finally,

$$\begin{aligned} P(D_t^4(f)) &= P\Bigg(\bigcap_{g \neq f:\, A_{f,g}=1} \Big( \bigcup_{h \neq f:\, A_{f,h}=1,\, B_{g,h}=1} \{W_t(f,g,h) > \delta_I\} \;\cup \bigcup_{h \neq f:\, A_{f,h}=1,\, B_{g,h}=0} \{W_t(f,g,h) \leq \delta_I\} \Big)\Bigg) \\ &= \prod_{g \neq f:\, A_{f,g}=1} \Bigg( 1 - P\Big( \bigcap_{h \neq f:\, A_{f,h}=1,\, B_{g,h}=1} \{W_t(f,g,h) \leq \delta_I\} \;\cap \bigcap_{h \neq f:\, A_{f,h}=1,\, B_{g,h}=0} \{W_t(f,g,h) > \delta_I\} \Big)\Bigg), \end{aligned}$$

and, for fixed $f, g \in \mathcal{F}$ and $h, h' \in \mathcal{F}$ with $h \neq h'$, $h \neq f$, $h' \neq f$, $\eta_t(f,g,h)$ is independent of $\eta_t(f,g,h')$. So $\{W_t(f,g,h) \leq \delta_I\}$ and $\{W_t(f,g,h') \leq \delta_I\}$ are independent, as are $\{W_t(f,g,h) > \delta_I\}$ and $\{W_t(f,g,h') > \delta_I\}$. Besides, $\bigcap_{h \neq f:\, A_{f,h}=1,\, B_{g,h}=1} \{W_t(f,g,h) \leq \delta_I\}$ and $\bigcap_{h \neq f:\, A_{f,h}=1,\, B_{g,h}=0} \{W_t(f,g,h) > \delta_I\}$ are independent, since if $h \in \{h \neq f : A_{f,h} = 1, B_{g,h} = 1\}$ then $h \notin \{h \neq f : A_{f,h} = 1, B_{g,h} = 0\}$. Therefore,

$$P(D_t^4(f)) = \prod_{g \neq f:\, A_{f,g}=1} \Bigg( 1 - \prod_{h \neq f:\, A_{f,h}=1,\, B_{g,h}=1} P\big(W_t(f,g,h) \leq \delta_I\big) \prod_{h \neq f:\, A_{f,h}=1,\, B_{g,h}=0} P\big(W_t(f,g,h) > \delta_I\big) \Bigg).$$

1.2. Proof of Theorem 1

Suppose we are in a CT context. To prove the time homogeneity of the Markov chain $\{A_t\}$, we will show that the probabilities involved in $p_{t,t+1}^{A,B}$, namely $P(D_t^i(f))$, $i = 1, 2, 3, 4$, do not depend on time. To do this we introduce some useful notation.

Let $F$ and $G$ be the distribution functions of the random variables $\varepsilon_t(f,g)$ and $\eta_t(f,g,h)$ associated with the image indices $Y_t(f,g)$ and $W_t(f,g,h)$, respectively.

The CT hypothesis implies that $F$ and $G$ depend neither on time nor on the profiles. That is, $P(\varepsilon_t(f,g) \leq x) = F(x)$ and $P(\eta_t(f,g,h) \leq x) = G(x)$, for all $t = 0, 1, \ldots$, for all $f, g \in \mathcal{F}$ and every ordered triple $f, g, h \in \mathcal{F}$, with $F$ and $G$ continuous functions.

We also denote by $c_{A,B}^{i,j}(f)$ and $d_{A,B}^{1,j}(f,g)$ the following cardinalities:

$$c_{A,B}^{i,j}(f) = \operatorname{card}\{g \neq f : A_{f,g} = i,\ B_{f,g} = j\}, \quad i, j \in \{0,1\},$$

and

$$d_{A,B}^{1,j}(f,g) = \operatorname{card}\{h \neq f : A_{f,h} = 1,\ B_{h,g} = j\}, \quad j \in \{0,1\}.$$

Consequently, under the CT hypothesis we have

$$P(D_t^1(f)) = \prod_{g \neq f:\, A_{f,g}=1,\, B_{f,g}=0} P\big(C_1 + \varepsilon_t(f,g) < \delta_B\big) = F(\delta_B - C_1)^{c_{A,B}^{1,0}(f)},$$

$$P(D_t^2(f)) = \prod_{g \neq f:\, A_{f,g}=1,\, B_{f,g}=1} P\big(C_1 + \varepsilon_t(f,g) \geq \delta_B\big) = \big(1 - F(\delta_B - C_1)\big)^{c_{A,B}^{1,1}(f)},$$

$$\begin{aligned} P(D_t^3(f)) &= 1 - \prod_{g \neq f:\, A_{f,g}=0,\, B_{f,g}=1} P\big(C_1 + \varepsilon_t(f,g) \leq \delta_R\big) \prod_{g \neq f:\, A_{f,g}=0,\, B_{f,g}=0} P\big(C_1 + \varepsilon_t(f,g) > \delta_R\big) \\ &= 1 - F(\delta_R - C_1)^{c_{A,B}^{0,1}(f)} \big(1 - F(\delta_R - C_1)\big)^{c_{A,B}^{0,0}(f)} \end{aligned}$$

and

$$\begin{aligned} P(D_t^4(f)) &= \prod_{g \neq f:\, A_{f,g}=1} \Bigg( 1 - \prod_{h \neq f:\, A_{f,h}=1,\, B_{g,h}=1} P\big(C_2 + \eta_t(f,g,h) \leq \delta_I\big) \prod_{h \neq f:\, A_{f,h}=1,\, B_{g,h}=0} P\big(C_2 + \eta_t(f,g,h) > \delta_I\big) \Bigg) \\ &= \Big( 1 - G(\delta_I - C_2)^{d_{A,B}^{1,1}(f,g)} \big(1 - G(\delta_I - C_2)\big)^{d_{A,B}^{1,0}(f,g)} \Big)^{c_{A,B}^{1,1}(f) + c_{A,B}^{1,0}(f)}. \end{aligned}$$

Then $P(D_t^i(f))$, $i = 1, 2, 3, 4$, and therefore $p_{t,t+1}^{A,B}$, do not depend on $t$. Consequently, $\{A_t\}$ is homogeneous in time.

1.3. Proof of Theorem 2

Let $S = \mathcal{M}_{M \times M}$, the set of $M \times M$ matrices with entries in $\{0,1\}$, be the state space of $\{A_t\}$, and suppose that we are in a CT context.

We denote by $C = \{A \in S : A_{i,i} = 1 \text{ for all } i = 1, \ldots, M\}$ the set of states with all ones on the diagonal.

Lemma 1. $C$ is a closed, irreducible and aperiodic communicating class.

Proof 1. Let $A' \in S$ be such that $A' \notin C$. Then, for some $i = 1, \ldots, M$, $A'_{i,i} = 0$; that is, at time $t$ the friendship state $A'$ indicates that the profile $f_i$ does not exist on Facebook. Since we have assumed that once a profile is created it cannot be deleted, if $A \in C$ then $p_{t,t+1}^{A,A'} = 0$; that is, a state outside $C$ is not accessible from a state inside $C$. Hence $C$ is closed.

Also, given $A, B \in C$, in one step any off-diagonal entry equal to one can turn into zero, and any off-diagonal entry equal to zero can turn into one; that is, $A$ and $B$ communicate and are aperiodic, so $C$ is irreducible and aperiodic.

As we are studying the chain's behaviour at the steady state of the network, we suppose that all profiles have been created. Then, if $A \notin C$ at time $t$, in a finite number of steps the state will be attracted by $C$. Therefore, in the long term, every state of $S$ is attracted in a finite number of steps by the closed, irreducible and aperiodic class $C$. As $C$ is a finite set, all of its states are positive recurrent. Then the chain $\{A_t\}$ restricted to $C$ is closed, irreducible and aperiodic, and every state is positive recurrent; hence it is ergodic.
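The ergodic convergence invoked here can be illustrated on a toy chain (a 3-state transition matrix standing in for $\{A_t\}$ restricted to $C$; the entries are arbitrary):

```python
import numpy as np

# A small irreducible, aperiodic transition matrix (rows sum to 1).
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

# Ergodicity: P^t converges to a matrix with identical rows, the limit
# distribution pi, regardless of the initial state.
Pt = np.linalg.matrix_power(P, 50)
pi = Pt[0]
```

All rows of `Pt` coincide with `pi` up to floating point, and `pi @ P` equals `pi`: the limit distribution is stationary, which is what the ergodic property guarantees for the class $C$.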

The ergodic property ensures the existence of a limit distribution. In this case, if $A \in C$,

$$\pi_A(\infty) = \lim_{t \to \infty} P(A_t = A) = \lim_{t \to \infty} P\Bigg[ \bigcap_{f \in \mathcal{F}} \Big( \{\alpha_t(f,f) = 1\} \cap \bigcap_{g \in \mathcal{F},\, g \neq f} \{\alpha_t(f,g) = A_{f,g}\} \Big) \Bigg].$$

Moreover, the friendship functions $\alpha_t$ between profiles $f$ and $g$ of $\mathcal{F}_t$ were defined as dichotomous variables taking the value one or zero according to whether or not $f$ and $g$ are Facebook friends at time $t$. Then, under the ergodic distribution, as time tends to infinity, the random variable $\alpha_\infty(f,g)$, with $f \neq g$, has a Bernoulli distribution.

Under the CT hypothesis, each profile of $\mathcal{F}$ engages in friendship with any other profile with the same probability, say $p$, so that the random variables $\alpha_\infty(f,g)$, with $f \neq g$, $f, g \in \mathcal{F}$, are identically distributed.

These variables also keep a direct relation with the "image index" $Y_t$ between pairs of profiles. Under CT this index reduces to a constant plus a random variable. These random variables are independent for all $f, g \in \mathcal{F}$ with $f \neq g$, so the image indices between distinct pairs of profiles are also independent. Consequently, the variables $\alpha_\infty(f,g)$, with $f \neq g$, $f, g \in \mathcal{F}$, are independent.