Journal of Software Engineering and Applications, 2013, 6, 26-31
doi:10.4236/jsea.2013.65B006 Published Online May 2013 (http://www.scirp.org/journal/jsea)
Privacy Preserving Two-Party Hierarchical Clustering
Over Vertically Partitioned Dataset
Animesh Tripathy1, Ipsa De2
1School of Computer Engineering, KIIT University, Bhubaneswar, India; 2School of Computer Engineering, KIIT University, Bhubaneswar, India.
Email: animesh_tripathy@kiit.ac.in; ipsade@gmail.com
Received 2013
ABSTRACT
Data mining has been a popular research area for more than a decade, and among the several problems associated with it, clustering is one of the most interesting. The problem becomes more challenging when the dataset is distributed between different parties that do not want to share their data. In this paper we propose a privacy preserving two-party hierarchical clustering algorithm for a vertically partitioned dataset. Each site learns only the final cluster centers, and nothing about the other party's individual data.
Keywords: Data Mining; Privacy; Hierarchical; Clustering
1. Introduction
Data mining has been a popular research area for more than a decade because of its ability to efficiently extract statistics and trends from large sets of data. Basically, data mining refers to extracting, or mining, knowledge from large amounts of data [1]. It is the process of working through large volumes of data (usually stored in a database), searching for patterns and relationships within that data. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use.
Clustering of data is a well known approach towards this goal, widely used in financial affairs, marketing, insurance, medicine, chemistry, machine learning, data mining, etc. [3,4,6-8]. However, when dealing with sensitive information, privacy becomes a major concern, as any leakage or compromise of data may result in potential harm to individuals or financial losses to the corporation.
So, the main privacy preserving clustering problem is the following: n data objects are distributed over physically separate, partitioned databases, and those objects should be grouped into k clusters based on their similarity without compromising the privacy of the databases. The existing clustering algorithms [1,4,6] are quite straightforward, but when the data is distributed among many parties [9] who want to preserve their private data, the solution becomes quite challenging, and various approaches are possible depending on the exact setting.
These kinds of problems have been studied in the field of privacy-preserving data mining, or PPDM for short [1,8-10]. However, most clustering solutions for vertically partitioned data sets are based on the k-means algorithm; only [13] gives a secure version of BIRCH over vertically partitioned data. In this paper a secure hierarchical clustering approach over vertically partitioned data is provided. Hierarchical clustering is more efficient than the k-means clustering algorithm in identifying cluster centers.
The next section presents preliminaries. A secure algorithm for computing clusters over a vertically partitioned dataset is given in Section 3. The efficiency and privacy of the algorithm are discussed in Section 4. Experimental results are shown in Section 5. We conclude in Section 6 with some discussion and directions for future research.
2. Preliminaries
In this section, we present some preliminaries. We first introduce how the data set can be partitioned, and then explain some basic privacy preserving techniques that can be used in data mining algorithms to enforce privacy.
2.1. Distributed Data Mining
As stated earlier, in this model data sources are assumed to be distributed across multiple sites. A simple approach to mining data over multiple sites is to
securely combine the data into a single site and mine the combined data there. The data can be partitioned in two ways, horizontally or vertically. A brief overview of partitioned data sets is given below:
Horizontal Partitioning: In horizontal partitioning, different sites collect the same set of information, but about different entities. Figure 1 illustrates horizontal partitioning with the credit card databases of two different (local) credit unions. Taken together, one may find that fraudulent customers often have similar transaction histories, etc. [2]
Vertical Partitioning: Vertical partitioning of data implies that though different sites gather information about the same set of entities, they collect different feature sets. An illustrative example of vertical partitioning is given in Figure 2. The figure describes two databases: one contains medical records of people, while the other contains cell phone information for the same set of people. Mining the joint global database might reveal information like "cell phones with Li/Ion batteries lead to brain tumors in diabetics" [2].
2.2. Cluster Analysis
Clustering finds sub-families among a large collection of data. More formally: given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that (1) data points in one cluster are more similar to one another, and (2) data points in separate clusters are less similar to one another [1]. Many clustering algorithms exist in the literature; among them, hierarchical clustering is one of the most important.
Figure 1. Horizontally partitioned data set.
Figure 2. Vertically partitioned data set.
Hierarchical clustering algorithms: A hierarchical clustering algorithm creates a hierarchical decomposition of the given set of data objects. It can be either agglomerative or divisive, based on how the hierarchical decomposition is formed (shown in Figure 3). The agglomerative (bottom-up) approach starts with each object forming a separate group. It successively merges the objects or groups that are close to one another, until all of the groups are merged into one or until a termination condition holds. The divisive (top-down) approach starts with all of the objects in the same cluster. In each successive iteration, a cluster is split into smaller clusters, until eventually each object is in its own cluster, or until a termination condition holds. Hierarchical methods suffer from the fact that once a step (merge or split) is done, it can never be undone. The work in this paper is mainly based on the agglomerative approach [1].
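As a concrete illustration of the agglomerative approach, the following Python sketch repeatedly merges the two closest clusters (by centroid distance) until k clusters remain. The sample data, the centroid-based linkage and the helper names are illustrative assumptions, not part of the protocol proposed below.

    # Minimal agglomerative clustering sketch (illustration only):
    # merge the two closest clusters until k remain.
    import math

    def centroid(cluster):
        # Mean of the points in a cluster.
        dim = len(cluster[0])
        return [sum(p[d] for p in cluster) / len(cluster) for d in range(dim)]

    def dist(a, b):
        # Euclidean distance between two points.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def agglomerative(points, k):
        clusters = [[p] for p in points]          # each object starts alone
        while len(clusters) > k:                  # termination condition
            best = None
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    d = dist(centroid(clusters[i]), centroid(clusters[j]))
                    if best is None or d < best[0]:
                        best = (d, i, j)
            _, i, j = best
            clusters[i] = clusters[i] + clusters[j]   # merge; never undone
            del clusters[j]
        return clusters

    print(agglomerative([(0, 0), (0, 1), (5, 5), (5, 6), (9, 9)], k=2))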
2.3. Some Basic Privacy Preserving Techniques
In this section we introduce some existing cryptographic protocols that we use in this paper. The main goal of privacy preserving data mining is to reduce the risk of data misuse, and it has been studied extensively in recent years. A number of effective data mining methods have been proposed that take the privacy factor into account; in most cases, a slight modification of the main mining algorithm achieves security without hampering the flavor of the mining. Some basic concepts used to achieve privacy are given below:
Random Share: This is a popular method in current privacy preserving data mining studies. Each computed intermediate value is split into uniformly distributed random values, with each party holding one of these values; their sum is the actual intermediate value. The method essentially adds noise to the individual data values so that they cannot be recovered, while the aggregation of the randomized values still yields the final output, which is made available to all parties [5,7].
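For instance, an additive random share of a value v over a public modulus splits v into two uniformly random values that sum to v. A minimal Python sketch follows; the modulus and the value 42 are assumptions for illustration only.

    # Additive random sharing sketch: a value v is split into two uniformly
    # random shares whose sum (mod M) equals v. Each party sees only its share.
    import secrets

    M = 2 ** 64          # public modulus (illustrative assumption)

    def share(v):
        r = secrets.randbelow(M)     # one party's share: uniformly random
        return r, (v - r) % M        # the other share completes the sum

    def reconstruct(sa, sb):
        return (sa + sb) % M

    sa, sb = share(42)
    assert reconstruct(sa, sb) == 42   # aggregation recovers the value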
Figure 3. Hierarchical clustering on data objects (a, b, c, d, e).
Homomorphic Encryption: Encryption mainly addresses the problem of several parties jointly conducting mining tasks on the private inputs they provide; it ensures that the transformed data is exact and secure. Homomorphic encryption schemes allow certain computations on encrypted values. In particular, an encryption scheme is additively homomorphic if there is some operation × on encryptions such that for all cleartext values a and b, E(a) × E(b) = E(a + b).
Let (G, E, D, M) be a homomorphic encryption scheme, where G is an algorithm for generating keys, E and D are the encryption and decryption functions, and M is the message space. It has the following properties:
• (G, E, D) is semantically secure;
• E(a1) × E(a2) = E(a1 + a2), for all a1, a2 ∈ M;
• E(a1)^α = E(α · a1), for all a1 ∈ M and α ∈ N. [4,7,12]
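A standard example of such a scheme is the Paillier cryptosystem. The toy Python sketch below, with demonstration-only primes that are far too small to be secure, exhibits both properties:

    # Toy Paillier cryptosystem (illustration only; real keys use 1024+ bit
    # primes). Multiplying ciphertexts adds the underlying plaintexts.
    import math
    import secrets

    p, q = 293, 433                  # demo primes (illustrative assumption)
    n, n2 = p * q, (p * q) ** 2
    lam = math.lcm(p - 1, q - 1)     # Carmichael function of n
    g = n + 1                        # standard simple generator choice
    mu = pow((pow(g, lam, n2) - 1) // n, -1, n)   # decryption factor

    def E(m):
        r = secrets.randbelow(n - 1) + 1
        return (pow(g, m, n2) * pow(r, n, n2)) % n2

    def D(c):
        return ((pow(c, lam, n2) - 1) // n * mu) % n

    c1, c2 = E(17), E(25)
    assert D((c1 * c2) % n2) == 42   # E(a1) × E(a2) = E(a1 + a2)
    assert D(pow(c1, 3, n2)) == 51   # E(a1)^α = E(α · a1)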
Secure Scalar Product Protocol: In this scheme the scalar (dot) product of two vectors is computed securely, meaning that the actual values of the vectors are not revealed; only the result is. Let X = (x1, ..., xn) be a vector held by Alice and Y = (y1, ..., yn) a vector held by Bob. They need to securely compute the scalar product X · Y = Σ_{i=1}^{n} xi·yi. At the end of the protocol, Alice knows only X · Y and nothing about Y, and similarly for Bob [3,7,11].
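Given an additively homomorphic scheme, one common semi-honest construction has Alice encrypt her vector, Bob combine the ciphertexts with his values, and Alice decrypt only the final sum. The sketch below reuses E, D and n2 from the toy Paillier code above; it illustrates the idea and is not necessarily the exact protocol of [3,7,11].

    # Semi-honest secure scalar product sketch on top of the toy Paillier
    # functions above. Alice learns only X·Y; Bob sees only ciphertexts.

    def alice_encrypt_vector(x):
        return [E(xi) for xi in x]                # Alice -> Bob

    def bob_combine(enc_x, y):
        acc = E(0)                                # running encrypted sum
        for ci, yi in zip(enc_x, y):
            acc = (acc * pow(ci, yi, n2)) % n2    # adds xi*yi under encryption
        return acc                                # Bob -> Alice

    X, Y = [1, 2, 3], [4, 5, 6]
    assert D(bob_combine(alice_encrypt_vector(X), Y)) == 32   # 4 + 10 + 18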
Secure Add and Compare: This primitive is based on homomorphic encryption. It builds a circuit that takes two inputs from each party, sums the first inputs of both parties and the second inputs of both parties, and returns the result of comparing the two sums. It is a semantically secure approach. Let P1 and P2 be two parties such that P1 has numbers a1 and b1, and P2 has numbers a2 and b2. The scheme securely determines whether a1 + a2 < b1 + b2 without revealing either of the following two pieces of information to the other party: (1) the numbers possessed by each party; and (2) the difference between a1 + a2 and b1 + b2 [7].
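The sketch below shows only the functionality (the inputs and the single output bit), with a plain function standing in for the secure circuit; an actual instantiation would realize this with homomorphic encryption or a garbled circuit as in [7] and is considerably more involved.

    # Functionality mock for secure add-and-compare: P1 contributes (a1, b1),
    # P2 contributes (a2, b2), and both learn only whether a1 + a2 < b1 + b2.
    # This oracle stands in for the secure circuit; it is NOT itself secure.

    def add_and_compare(p1_inputs, p2_inputs):
        a1, b1 = p1_inputs
        a2, b2 = p2_inputs
        return (a1 + a2) < (b1 + b2)     # only this single bit is revealed

    print(add_and_compare((3, 10), (4, 1)))   # True, since 7 < 11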
3. Privacy Preserving Clustering Algorithm
3.1. Problem Definition
Assume that a data set D consists of n instances with m attributes and is vertically partitioned between Alice and Bob: Alice holds her own database with the first m1 attributes and Bob holds his own database with the last m2 attributes, where m1 + m2 = m. Alice and Bob want to find the final k clusters over the total partitioned data while protecting their own privacy, so both parties learn the final k clusters and nothing else.
In this paper a novel approach to hierarchical clustering is given for a vertically partitioned data set, one which also takes the privacy factor into account. The assumptions made during the process are:
1. Both parties hold the same number of data records.
2. Alice holds some of the attributes for all records; Bob holds the remaining attributes.
3.2. Clustering on Vertically Partitioned Data Set
This algorithm runs in a divide, conquer and merge strategy. In the first step each party computes k clusters on its own data set. In the next stage both parties compute the distance between each record and each of their k cluster centers, so each party obtains an n × k distance matrix. Up to this stage there is no privacy issue, as all the computations are done by the parties on their private databases. In the third stage both parties send their distance matrices to the other party; along with the matrix, the k cluster centers are also sent in randomized form. Next, each party forms all possible combinations of cluster centers from the total 2k clusters, so that finally k² cluster centers are formed. Now each party has information about these k² clusters, and both compute the distance between each point and the k² cluster centers. The closest cluster is chosen for each of the n data points, and the resulting clusters are finally merged down to k clusters. This merging is done by choosing a best pair of clusters Ci and Cj and replacing them with Ci ∪ Cj; a best pair of clusters is one with least error. The error between two clusters can be computed using the following formula. Let C1 and C2 be two clusters being considered for a merge, let C.weight denote the number of objects in cluster C, and let dist²(C1, C2) be the distance between the clusters C1 and C2 [6]; then

Error(C1 ∪ C2) = C1.weight × C2.weight × dist²(C1, C2)    (1)
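A small sketch of this merge criterion: among all candidate pairs, pick the one minimizing Equation (1). Representing each cluster as a (center, weight) pair and reading dist² as the squared Euclidean distance between centers are illustrative assumptions.

    # Merge-criterion sketch for Equation (1): the best pair of clusters
    # minimizes C1.weight × C2.weight × dist²(C1, C2).

    def merge_error(c1, c2):
        (x1, w1), (x2, w2) = c1, c2          # (center, weight) pairs
        d2 = sum((a - b) ** 2 for a, b in zip(x1, x2))
        return w1 * w2 * d2

    def best_pair(clusters):
        pairs = ((merge_error(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        _, i, j = min(pairs)                 # least-error pair wins
        return i, j

    clusters = [([0.0, 0.0], 3), ([0.5, 0.0], 2), ([9.0, 9.0], 4)]
    print(best_pair(clusters))   # (0, 1): the two nearby clusters merge first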
3.3. Secure Hierarchical Clustering on Vertically
Partitioned Data Set
The approach of this paper was given in brief in the section above; here the algorithm and its privacy are analyzed in detail. Alice and Bob first locally compute k clusters each from their own part of the data. Next, each party computes its n × k distance matrix. Alice computes her corresponding distance matrix MAlice:
MAlice =
    a1,1  a1,2  ...  a1,k
    a2,1  a2,2  ...  a2,k
    ...
    an,1  an,2  ...  an,k
In the above matrix, each row holds the distances between one instance and all k cluster centers, i.e., ai,j represents the distance between the i-th instance and the j-th cluster center. Similarly, Bob computes the corresponding matrix MBob based on his data set:
MBob =
    b1,1  b1,2  ...  b1,k
    b2,1  b2,2  ...  b2,k
    ...
    bn,1  bn,2  ...  bn,k
Now both parties hold two pieces of information: the k cluster centers and the distance matrix. They randomly share this information using the permute share algorithm given below. Here Alice has a vector of clusters C of the form (C1, ..., Ck), and the permute share algorithm helps Bob obtain a permuted random share of the vector C. Similarly, Bob has a vector of clusters C of the form (Ck+1, ..., C2k), and Alice receives a permuted random share of that vector. At the beginning of the protocol, Alice and Bob each choose an individual encryption key. Each party encrypts its corresponding vector C and sends it to the other party. Next, each party forms all possible combinations of cluster centers from the received encrypted cluster center information and its own cluster center information. So, finally both parties hold the same k² cluster centers. Then each party computes the distance between each of the n instances and all the k² cluster centers using the two distance matrices, and for each instance the cluster with minimum distance is chosen.
Algorithm 1: Permute Share Algorithm
Input: Alice has a vector of cluster centers C of the form (C1, C2, ..., Ck); Bob has a vector of cluster centers C of the form (Ck+1, Ck+2, ..., C2k).
Output: Alice obtains the encrypted values of Bob's k cluster centers and Bob obtains the encrypted values of Alice's k cluster centers.
1. Alice and Bob choose their individual encryption keys.
2. Alice computes the encryption of her vector C as (E(C1), ..., E(Ck)) and sends it to Bob.
3. Bob computes the encryption of his vector C as (E(Ck+1), ..., E(C2k)) and sends it to Alice.
4. Alice obtains her share (C1, C2, ..., Ck, E(Ck+1), ..., E(C2k)) and Bob obtains his share (E(C1), ..., E(Ck), Ck+1, ..., C2k).
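The message flow of Algorithm 1 can be sketched as follows, with cluster centers modeled as opaque values and enc_alice/enc_bob standing in for encryption under each party's own key (for example, the homomorphic scheme above); the tagging lambda is a placeholder, not a real cipher.

    # Message-flow sketch of Algorithm 1 (Permute Share).

    def permute_share(alice_centers, bob_centers, enc_alice, enc_bob):
        to_bob = [enc_alice(c) for c in alice_centers]    # step 2
        to_alice = [enc_bob(c) for c in bob_centers]      # step 3
        # Step 4: each share mixes the party's own plaintext centers with
        # the other party's encrypted centers.
        alice_share = list(alice_centers) + to_alice
        bob_share = to_bob + list(bob_centers)
        return alice_share, bob_share

    enc = lambda tag: (lambda c: (tag, c))    # placeholder "encryption"
    a_share, b_share = permute_share([1, 2], [3, 4], enc("EA"), enc("EB"))
    print(a_share)   # [1, 2, ('EB', 3), ('EB', 4)]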
So, after this stage every point is assigned to its closest cluster, but there are k² clusters. These k² clusters are then merged using Equation (1), and finally k clusters are formed. At the end of the merging process both parties have information about the final k cluster centers, but neither party learns anything about the other party's data set. This approach is efficient provided that k² is not too large.
Algorithm 2: Privacy Preserving Hierarchical k-Clustering
Input: Alice's k cluster centers, Bob's k cluster centers, Alice's distance matrix and Bob's distance matrix.
Output: Assignment of cluster numbers to objects.
1. Alice computes k cluster centers (C1, C2, ..., Ck) from the first m1 attributes and Bob computes k cluster centers (Ck+1, Ck+2, ..., C2k) from the remaining m2 attributes.
2. Alice and Bob compute the distance matrices MAlice and MBob.
3. Alice and Bob randomly share the cluster centers (using the Permute Share algorithm) and the distance matrices with each other.
4. Alice and Bob form all possible cluster centers from the existing clusters' information, i.e., k² clusters are formed.
5. Call the Closest-cluster procedure (Algorithm 3) to obtain the matrix X.
6. Find the minimum value in each row of the matrix X to find the closest cluster for each instance, i.e., if the i-th column holds the minimum value of the j-th row, then the j-th instance is closest to the i-th cluster.
7. Place the n instances into their closest clusters.
8. Merge the k² clusters to form the final k clusters.
Algorithm 3: Closest-cluster
Input: Alice's distance matrix MAlice (n × k) and Bob's distance matrix MBob (n × k).
Output: Closest cluster assignment for the n instances; a matrix X (n × k²) that holds the distance between each of the n points and each of the k² cluster centers.
1. for p = 1 to n
2.   l = 0
3.   for q = 1 to k
4.     for r = 1 to k
5.       l = l + 1
6.       Xp,l = ap,q + bp,r
7.     end for
8.   end for
9. end for
10. Return X
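The same procedure in runnable Python (0-indexed, with the distance matrices as lists of lists, an assumption about the data layout); the last step also performs the row-minimum assignment of step 6 of Algorithm 2:

    # Runnable version of Algorithm 3: X[p][q*k + r] = a[p][q] + b[p][r] is
    # the distance from instance p to the combined center built from Alice's
    # q-th and Bob's r-th local centers.

    def closest_cluster(m_alice, m_bob, k):
        n = len(m_alice)
        X = [[m_alice[p][q] + m_bob[p][r]
              for q in range(k) for r in range(k)]
             for p in range(n)]
        # Assign each instance to the combined center at minimum distance.
        assignment = [min(range(k * k), key=lambda c: X[p][c])
                      for p in range(n)]
        return X, assignment

    m_alice = [[1.0, 4.0], [3.0, 0.5]]   # n = 2 instances, k = 2 centers
    m_bob = [[2.0, 5.0], [1.0, 6.0]]
    X, assign = closest_cluster(m_alice, m_bob, k=2)
    print(assign)   # closest combined-center index for each instance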
4. Efficiency and Privacy Analysis
Alice and Bob compute k clusters on their own data sets and then share them with each other after encrypting the values. The encryption is done through the Permute Share algorithm, which takes O(k) time for each party. Next, the computational complexity of computing the distance matrix is O(nk) for each party. Step 4 of the privacy preserving hierarchical k-clustering algorithm takes O(k²) time. The computational complexity of the Closest-cluster procedure is O(nk²). Step 6 runs n times, once per instance, and takes O(k²) time per instance, so its total complexity is O(nk²). The total computational complexity of the algorithm is therefore O(nk²).
For communication complexity, Alice and Bob send k encrypted values to each other. If we assume that it takes c bits to represent each encryption, then this step's communication cost is O(kc). Sending the distance matrices takes O(nk) communication, so the total communication cost is O(nk).
Both parties compute k clusters independently from the data objects they have. This information is then shared between the two, but in encrypted form, so it does not reveal the intermediate cluster centers. Next, the distance matrices are shared; these hold only the distances between cluster centers and instances, and no private information, so they leak nothing private. After merging, the final k cluster centers are revealed to each party according to its share. Hence the privacy preserving hierarchical k-clustering algorithm is secure and leaks no information beyond the final cluster centers.
5. Experimental Results
All of our experiments were conducted on a computer with a 2.60 GHz Intel Core i5 CPU and 4 GB of main memory, running Microsoft Windows 7. The algorithms are implemented in the NetBeans IDE 7.1.
We performed our experiments on the UCI data sets Ionosphere, Iris and Weather [14], which we assume are vertically partitioned between Alice and Bob. The description of the data sets is presented in Table 1: n denotes the number of instances, m the number of attributes, Alice holds the first m1 attributes, Bob holds the last m2 attributes, and k is the number of clusters. We report the running times in Table 2; they are also visualized in Figure 4, which shows how the secure hierarchical clustering protocol scales with the size of the data set.
Additionally, we have implemented three other existing approaches alongside our proposed work. A comparison among these four clustering algorithms is shown in Table 3; all of these experiments were done on the Iris database. First, k-means was implemented for a horizontally partitioned data set (HPD) and a vertically partitioned data set (VPD), and then the hierarchical clustering algorithm was implemented for HPD and VPD. From the comparison we can state that hierarchical clustering is better than k-means with respect to running time and the number of database scans, so hierarchical clustering is the better approach. Previously there was no approach in which privacy preserving hierarchical clustering was considered for VPD; this paper is the first to give the above idea, and Table 3 shows that it reduces the running time for VPD compared to the existing approaches.
Table 1. Description of data sets.

Data Set   n     m    m1   m2   k
IONO       351   34   17   17   2
Iris       150   5    2    3    3
Weather    14    5    2    3    2
Table 2. Running time on IONO, Iris and Weather.
Data Set Runtime (Sec)
IONO 0.541
Iris 0.451
Weather 0.326
Figure 4. Execution time of the secure algorithm on IONO data sets of various sizes.
Table 3. Comparison among different clusterers.

Clusterer                          No. of database scans   Running Time (Sec)
HPD with K-Means Clustering        More than one           2.6305
HPD with Hierarchical Clustering   One                     0.304
VPD with K-Means Clustering        More than one           2.8795
VPD with Hierarchical Clustering   One                     0.453
6. Conclusions and Future Research Work
In this paper the privacy preserving clustering problem has been analyzed for a vertically partitioned data set. There are various approaches to solving this problem, such as adding noise or encrypting data values. Here, a novel secure hierarchical clustering approach is given to cluster data objects that are vertically partitioned between two parties.
Future research could seek a solution for hierarchical clustering over a data set that is vertically partitioned among multiple parties. Also, most of the solutions developed are valid within the semi-honest model of computation, so some exploration may be done toward a fully malicious model without giving up efficiency.
REFERENCES
[1] J. W. Han and M. Kamber, "Data Mining: Concepts and Techniques," 2nd Edition, China Machine Press, Beijing, 2006.
[2] J. S. Vaidya, “Privacy Preserving Data Mining over Ver-
tically Partitioned Data,” Ph.D Thesis, Purdue University,
2004, pp. 1-149.
[3] J. Vaidya and C. Clifton, “Privacy Preserving K-Means
Clustering over Vertically Partitioned Data,” Proceedings
of the ninth ACM SIGKDD international conference on
Knowledge discovery and data mining, Washington DC,
USA, 2003, pp. 206-215. doi:10.1145/956750.956776
[4] T. K. Yu, D. T. Lee, S.-M. Chang and J. Zhan, "Multi-Party k-Means Clustering with Privacy Consideration," International Symposium on Parallel and Distributed Processing with Applications, IEEE Computer Society, 2010, pp. 200-207.
[5] G. Jagannathan and R. N. Wright, “Privacy Preserving
Distributed k-Means Clustering over Arbitrarily Parti-
tioned Data,” Proceedings of the 11th ACM, SIGKDD In-
ternational Conference on Knowledge Discovery and
Data Mining, USA, 2005, pp. 1-7.
[6] G. Jagannathan, K. Pillaipakkamnatt and R. Wright, “A
New Privacy Preserving Distributed k-Clustering Algo-
rithm,” In Proceeding of the 6th SIAM International Con-
ference on Data Mining, 2006, pp. 492-496.
[7] G. Jagannathan, K. Pillaipakkamnatt, R. N. Wright and D. Umano, "Communication-Efficient Privacy Preserving Clustering," Transactions on Data Privacy, Vol. 3, No. 1, 2010, pp. 1-25.
[8] V. Estivill-Castro, “Private Representative-Based Clus-
tering for Vertically Partitioned Data,” Proceedings of the
Fifth Mexican International Conference in Computer
Science (ENC’04), IEEE Computer Society, 2004, pp.1-8.
[9] Y. Lindell and B. Pinkas, “Privacy Preserving Data Min-
ing”, In Advances in Cryptology (Crypto 2000), 2000, pp.
36-54.
[10] R. Agrawal and R. Srikant, “Privacy-preserving data min-
ing,” In Proceedings of the 2000 ACM SIGMOD Confer-
ence on Management of Data, 2000, pp. 439-450.
doi:10.1145/342009.335438
[11] P. Bunn and R. Ostrovsky, “Secure Two-Party k-Means
Clustering,” In Proceedings of the 14th ACM Conference
on Computer and Communications Security, 2007, pp.
486-497. doi:10.1145/1315245.1315306
[12] S. Jha, L. Kruger and P. McDaniel, “Privacy Preserving
Clustering,” 10th European Symp on Research in Com-
puter Security, 2005, pp. 397-417.
[13] P. K. Prasad and C. P. Rangan, "Privacy Preserving BIRCH Algorithm for Clustering over Vertically Partitioned Databases," SDM 2006, LNCS 4165, Springer, Berlin, Heidelberg, 2006, pp. 84-99.
[14] A. Asuncion and D. J. Newman, "UCI Machine Learning Repository," 2007. [http://www.ics.uci.edu/~mlearn/MLRepository.html]