Privacy-preserving data mining (PPDM) has become increasingly important because it allows privacy-sensitive data to be shared for analytical purposes. A large number of privacy techniques have been developed, most of them based on the k-anonymity property, which has several shortcomings; other privacy techniques were therefore introduced (l-diversity, p-sensitive k-anonymity, (α, k)-anonymity, t-closeness, etc.). While these techniques differ in their methods and in the quality of their results, they all focus first on masking the data and only then on protecting its quality. This paper is concerned with providing an enhanced privacy technique that combines several anonymity techniques to maintain both privacy and data utility. The sensitivity of the attribute values in a query is captured by sensitivity weights that take utility-based anonymization into account; only queries whose sensitive attribute values exceed a threshold are changed using generalization boundaries, while the other queries can be published directly. The threshold value is calculated from the weights assigned to the individual attributes, which reflect the utility of each attribute. Experimental results obtained with the UTD Anonymization Toolbox on the real Adult data set from the UCI Machine Learning Repository show that the proposed technique preserves privacy while also maintaining the utility of the published data.
Many organizations, such as hospitals, credit card companies, real estate companies, and search engines, collect and hold very large volumes of data. They would like to publish the data for data mining purposes. Data mining is a technique for automatically and intelligently extracting information or knowledge from very large amounts of data [
1) A PPDM algorithm should prevent the discovery of sensitive information.
2) It should be resistant to the various data mining techniques.
3) It should not compromise access to and use of non-sensitive data.
4) It should not have an exponential computational complexity.
A number of effective techniques for PPDM have been proposed. Most techniques apply some form of transformation to the original data in order to preserve privacy. The transformed dataset should be available for mining and must satisfy the privacy requirements without sacrificing the benefits of mining.
In recent years, a lot of techniques have been proposed for implementing k-anonymity [
K-anonymity classifies the attributes of tables into four different classes [
Suppose that we have the two tables below,
Let IM be the initial microdata and MM be the released (masked) microdata [
The identifier attribute is Name; Race, Birth, Sex, and Zip code form the quasi-identifier; Disease is the sensitive attribute.

| Name | Race | Birth | Sex | Zip code | Disease |
| --- | --- | --- | --- | --- | --- |
| Alice | Black | 1965 | M | 02141 | Flu |
| Bob | Black | 1965 | M | 02142 | Cancer |
| David | Black | 1966 | M | 02135 | Obesity |
| Helen | Black | 1966 | M | 02137 | Gastritis |
| Jane | White | 1968 | F | 02139 | HIV |
| Paul | White | 1968 | F | 02138 | Cancer |
Race, Birth, Sex, and Zip code form the quasi-identifier; Disease is the sensitive attribute.

| Race | Birth | Sex | Zip code | Disease |
| --- | --- | --- | --- | --- |
| Black | 1965 | M | 0214* | Flu |
| Black | 1965 | M | 0214* | Cancer |
| Black | 1966 | M | 0213* | Obesity |
| Black | 1966 | M | 0213* | Gastritis |
| White | 1968 | F | 0213* | HIV |
| White | 1968 | F | 0213* | Cancer |
These classes include identifier attributes, quasi-identifier attributes, and sensitive or confidential attributes, as shown in
Definition 1 (QI-Cluster): A QI-cluster consists of all the records with an identical combination of quasi-identifier attribute values in MM [
Definition 2 (K-Anonymity Property): The k-anonymity property for an MM is satisfied if every QI-cluster from MM contains k or more tuples (records) [
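Definition 2 can be sketched as a small executable check: group the records by their quasi-identifier combination and verify that every QI-cluster holds at least k records. The helper name `is_k_anonymous` and the tuple encoding of the masked table are our own illustration, not part of any cited toolbox:

```python
from collections import Counter

def is_k_anonymous(records, qi_indices, k):
    """k-anonymity test (Definition 2): every QI-cluster, i.e. every
    group of records sharing the same quasi-identifier values, must
    contain at least k records."""
    clusters = Counter(tuple(r[i] for i in qi_indices) for r in records)
    return all(count >= k for count in clusters.values())

# A masked microdata table in the spirit of the example above
# (quasi-identifiers: Race, Birth, Sex, Zip code).
mm = [
    ("Black", 1965, "M", "0214*", "Flu"),
    ("Black", 1965, "M", "0214*", "Cancer"),
    ("Black", 1966, "M", "0213*", "Obesity"),
    ("Black", 1966, "M", "0213*", "Gastritis"),
    ("White", 1968, "F", "0213*", "HIV"),
    ("White", 1968, "F", "0213*", "Cancer"),
]
print(is_k_anonymous(mm, qi_indices=(0, 1, 2, 3), k=2))  # True: 2-anonymous
```

With k = 3 the same table fails the test, since each QI-cluster holds only two records.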
There are two ways to achieve generalization: global recoding and local recoding. Global recoding is also called domain generalization because the generalization happens at the domain level: when an attribute value is generalized, every occurrence of the value is replaced by the new generalized value, so a global recoding method may over-generalize a table. In contrast, a local recoding method generalizes attribute values at the cell level: occurrences of the same value of an attribute may be recoded to different values. Local recoding therefore does not over-generalize a table and may minimize the distortion of an anonymized view.
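The contrast between the two recoding styles can be sketched as follows; `global_recode` and `local_recode` are hypothetical helper names used only for illustration:

```python
def global_recode(records, col, mapping):
    """Global recoding (domain generalization): every occurrence of a
    value in the column is replaced by the same generalized value."""
    return [r[:col] + (mapping.get(r[col], r[col]),) + r[col + 1:]
            for r in records]

def local_recode(records, col, cell_mapping):
    """Local recoding: generalization at the cell level, so occurrences
    of the same value may be recoded to different generalized values
    (here the cells to change are keyed by row index)."""
    return [r[:col] + (cell_mapping.get(i, r[col]),) + r[col + 1:]
            for i, r in enumerate(records)]

rows = [("02141",), ("02142",), ("02135",)]
zip_map = {"02141": "0214*", "02142": "0214*", "02135": "0213*"}
print(global_recode(rows, 0, zip_map))       # every cell is generalized
print(local_recode(rows, 0, {0: "0214*"}))   # only the first cell changes
```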
Second, suppression involves not releasing a value at all [
This technique [
Definition 3: (Maximum Allowed Generalization Value): Let Q be a quasi-identifier attribute (numerical or categorical), and HQ its predefined value generalization hierarchy [
・ For any released microdata, the value v is permitted to be generalized only up to MAGVal(v) and
・ When different MAGVals exist on the path between v and the hierarchy root, then the MAGVal(v) is the first MAGVal that is reached when following the path from v to the root node.
From
The second restriction in the MAGVal’s definition specifies that the hierarchy path between a leaf value v and MAGVal(v) can contain no node other than MAGVal(v) that is a maximum allowed generalization value. This restriction is forced in order to avoid any ambiguity about the MAGVals of the leaf values in a sensitive attribute hierarchy. Note that several MAGVals may exist on a path between a leaf and the root as a result of defining MAGVals for other leaves within that hierarchy.
Definition 4: (Maximum Allowed Generalization Set): The set of all MAGVals for attribute Q is called Q’s maximum allowed generalization set, and it is denoted by MAGSet(Q) = {MAGVal(v) | ∀v ∈ leaves(HQ)} (the notation leaves(HQ) represents all the leaves of the HQ value generalization hierarchy).
From
Definition 5: (Constraint Violation): The masked microdata MM has a constraint violation if one QI-value, v, in IM is generalized in one record in MM beyond its specific maximum allowed generalization value, MAGVal(v).
Definition 6: (Constrained K-Anonymity): The masked microdata MM satisfies the constrained k-anonymity property if it satisfies k-anonymity and it does not have any constraint violation.
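Definitions 3-5 can be illustrated with a small sketch. The disease hierarchy, the set of marked MAGVals, and all helper names below are hypothetical examples, not taken from the paper's data:

```python
# Hypothetical value generalization hierarchy H_Q, as child -> parent edges.
PARENT = {
    "Flu": "Respiratory", "Pneumonia": "Respiratory", "Respiratory": "Disease",
    "HIV": "Viral", "Hepatitis": "Viral", "Viral": "Disease",
}
# Nodes the data owner marks as maximum allowed generalization values.
MAGVALS = {"Respiratory", "Viral"}

def path_to_root(v):
    """Nodes on the hierarchy path from v up to the root, inclusive."""
    nodes = [v]
    while nodes[-1] in PARENT:
        nodes.append(PARENT[nodes[-1]])
    return nodes

def magval(v):
    """MAGVal(v): the first marked node reached when following the path
    from leaf v toward the root (Definition 3)."""
    for node in path_to_root(v):
        if node in MAGVALS:
            return node
    return None  # no MAGVal defined on this path

def violates_constraint(original, released):
    """Constraint violation (Definition 5): the released value lies
    strictly above MAGVal(original) on the hierarchy path."""
    m = magval(original)
    p = path_to_root(original)
    return m is not None and released in p and p.index(released) > p.index(m)

print(magval("Flu"))                              # Respiratory
print(violates_constraint("Flu", "Disease"))      # True: beyond the MAGVal
print(violates_constraint("Flu", "Respiratory"))  # False: allowed
```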
Automatic detection of sensitive attribute in PPDM [
1) The client’s query is analyzed, taking into account the sensitive values in sensitive queries.
2) Only queries having sensitive values are scrambled, depending on the threshold value.
3) The threshold value is calculated using the different weights assigned to the individual attributes.
4) The attributes whose total weights exceed the threshold value are scrambled using swapping techniques.
5) Queries whose total weights are under the threshold value can be directly released.
6) The data owner is responsible for predetermining the threshold limit.
Now we introduce the mechanism steps:
1) The number of weights assigned to attributes in a given database is equal to the total number of attributes.
2) The weights of attributes are allocated according to how much a particular attribute is involved in revealing the identity of an individual.
3) The attribute which can directly identify an individual and leads to privacy violation is assigned the highest weight.
4) Similar weights can be assigned to some attributes based on their level of priority.
How we apply the mechanism steps:
1) The highest weights are assigned to the attributes which directly identify an individual, such as name, personal identification number, social security number, etc.
2) The sensitive attribute or attributes are identified by summing up the weights of the attributes submitted in the client’s query.
3) If the total weight exceeds the threshold limit, the attribute values are modified using a modification technique in PPDM.
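A minimal sketch of the weight-and-threshold mechanism above, with hypothetical weights and threshold chosen for illustration:

```python
# Hypothetical sensitivity weights assigned by the data owner.
WEIGHTS = {"SSNo": 7, "Name": 6, "Location": 5, "Age": 4, "Salary": 3, "Gender": 2}
THRESHOLD = 6  # predetermined by the data owner

def query_weight(attributes):
    """Total sensitivity weight of the attributes in a client query."""
    return sum(WEIGHTS.get(a, 0) for a in attributes)

def must_modify(attributes, threshold=THRESHOLD):
    """True when the query's total weight exceeds the threshold, so its
    values must be modified before release; otherwise the query can be
    published directly."""
    return query_weight(attributes) > threshold

print(must_modify(["Age", "Gender"]))    # 4 + 2 = 6: not above the threshold
print(must_modify(["Age", "Location"]))  # 4 + 5 = 9: must be modified
```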
The Architecture for sensitive attribute identification in PPDM [
In this technique the authors focus on determining the sensitive attributes before releasing the data requested by the client. To determine the sensitive attributes, weights are assigned to each attribute depending on facts such as [
1) Whether a single attribute could disclose the identity of an individual.
2) Whether there is a possibility that a group of attributes can indirectly disclose the identity of an individual.
According to the above facts, weights are given to each attribute, and only the sensitive data available in the database is modified, for those attributes or groups of attributes whose values exceed the threshold limit of sensitiveness.
Anonymized data is often used for the purposes of analysis and data mining [
1) To achieve k-anonymity, suppose that we can generalize a five-digit full zip code to a four-digit prefix (e.g., from 53712 to 5371*).
2) Alternatively, we can also generalize the age attribute to age groups (e.g., from to [
3) In many cases the age information is critical to disease analysis, while the information loss on the exact location is often acceptable (a four-digit prefix still identifies a relatively local region).
4) Therefore the age attribute has more utility than the zip code attribute, and should be kept as accurate as possible in the anonymization process.
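The two generalizations mentioned above can be sketched as simple helpers (the function names are ours):

```python
def generalize_zip(zipcode, keep=4):
    """Generalize a five-digit zip code to a shorter prefix,
    e.g. 53712 -> 5371*."""
    return zipcode[:keep] + "*" * (len(zipcode) - keep)

def generalize_age(age, width=10):
    """Generalize an exact age into a fixed-width age group."""
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"

print(generalize_zip("53712"))  # 5371*
print(generalize_age(26))       # 20-29
```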
Could we make the anonymization utility aware? Previous anonymization techniques have not considered attribute utility, so the researchers focus their attention on this important issue in the current paper, to benefit from it in determining the attribute weights.
This toolbox is designed to anonymize datasets [
The toolbox only supports unstructured text files, represented as ASCII files in which each line is a record consisting of attribute values separated by some character sequence (e.g., comma, tab, or semicolon) specified in the configuration file or passed as an argument to the toolbox. All descriptive information regarding attribute types and indices is collected through the configuration file, so a header with such a description is not necessary.
SSNo weight = 7 | Name weight = 6 | Age weight = 4 | Location weight = 5 | Salary weight = 3 | Gender weight = 2 |
---|---|---|---|---|---|
ASD1 | Ali | 26 | KSA | 26 K | M |
QDF4 | Soha | 33 | Bahrain | 33 K | F |
GTR3 | Alaa | 45 | Qatar | 66 k | F |
AEF4 | Moustafa | 33 | Egypt | 34 k | M |
The output format is like the input format; the only difference is caused by the anonymization of quasi-identifier attributes, in which specific values are replaced by their generalized values. This output format is referred to as genVals in the toolbox. The toolbox also permits releasing anonymized records with additional information, as described in several studies; the first approach, using anonymized data for classification, was proposed in [
There are two global utility measures presented here which capture differences in the distributions of the original and masked data. The first is the Cluster Analysis Measure [
Cluster analysis, a form of unsupervised machine learning, divides records into groups according to their having similar values for selected variables. A cluster can be considered a group of objects which are similar to one another and dissimilar to the objects belonging to other clusters. When the proportion of observations from the original or masked data is constant for each cluster, we can say that the two data sets have the same distribution. The distributions of the original and masked data are compared by assigning the observations in the pooled data to clusters and then computing, for each cluster, the difference between the number of observations from the original and masked data. Let g be the number of clusters. Then cluster utility is defined by [
where ni is the total number of observations grouped in i-th cluster,
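One common form of this cluster utility measure, which the sketch below assumes, averages over the g clusters the squared deviation between each cluster's share of original observations and the overall share c; the function name and the exact normalization are our assumptions:

```python
def cluster_utility(orig_counts, masked_counts):
    """Cluster-analysis utility: after clustering the pooled
    (original + masked) data, average over the g clusters the squared
    deviation of each cluster's share of original observations from the
    overall share c = N_orig / N_total.  A value near zero means the
    original and masked data are spread over the clusters similarly."""
    g = len(orig_counts)
    n_orig = sum(orig_counts)
    c = n_orig / (n_orig + sum(masked_counts))
    return sum(((o / (o + m)) - c) ** 2
               for o, m in zip(orig_counts, masked_counts)) / g

print(cluster_utility([10, 10], [10, 10]))  # 0.0: identical spread
print(cluster_utility([20, 0], [0, 20]))    # 0.25: completely separated
```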
These measures completely describe the probability function and test the differences between the empirical distribution functions obtained from the original and masked data, which makes them appropriate for measuring data utility. Let SX and SY be the empirical distributions obtained from the original data, X, and the masked data, Y, respectively. When X has dimension Nx × d, we have [
where xij equals the value of the j-th variable for the i-th observation, and I(·) equals one when the condition inside the parentheses is true and zero otherwise.
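A sketch of the empirical distribution function S(t) = (1/N) Σᵢ I(xᵢ ≤ t) and of a maximum-gap comparison between the original and masked data (the helper names are ours):

```python
def ecdf(sample):
    """Empirical distribution function S(t) = (1/N) * sum_i I(x_i <= t)."""
    n = len(sample)
    return lambda t: sum(1 for x in sample if x <= t) / n

def max_ecdf_distance(x, y):
    """Largest gap between the empirical distributions of the original
    data x and the masked data y (a Kolmogorov-Smirnov-style distance)."""
    sx, sy = ecdf(x), ecdf(y)
    return max(abs(sx(t) - sy(t)) for t in sorted(set(x) | set(y)))

print(max_ecdf_distance([1, 2, 3], [1, 2, 3]))  # 0.0: same distribution
print(max_ecdf_distance([1, 2, 3], [4, 5, 6]))  # 1.0: disjoint supports
```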
In this paper the researchers propose utility-based anonymization using generalization boundaries to protect sensitive attributes depending on attribute sensitivity weights. In this technique the researchers start by considering the sensitivity of the values in queries; only the queries having sensitive values are then generalized using generalization boundaries (taking utility-based anonymization into account), while the other queries, which do not have sensitive values, can be published directly.
The objective of the proposed model is to achieve privacy while maintaining data utility with minimum information loss, so as to enable efficient statistical analysis using data mining, as follows [
・ Data utility―The goal is to eliminate privacy violations (how much an adversary learns from the released data) and to increase the utility (accuracy of the data mining task) of the published database. This is achieved by generalizing the quasi-identifiers of only those attributes having highly sensitive attribute values, taking utility-based anonymization into account as mentioned before.
・ Privacy―To provide individual data privacy through generalization, in such a way that re-identification of the data is not possible.
・ Minimum information loss―The loss of information is minimized by assigning a sensitivity level to sensitive attribute values; only attributes with a high sensitivity level are generalized, and the rest of the attributes are released as they are.
The proposed approach is illustrated by an algorithm in Section 3.2.
The operation of this technique is as follows:
First: we use automatic detection of sensitive attributes in PPDM [
Second: we use utility-based anonymization [
Third: we use constrained k-anonymity [
The Algorithm of proposed technique is explained in
1) The client submits a query to the data miner.
2) The data miner identifies the various data owners who are ready to share their databases.
3) The data miner sends the client’s query to the data owners.
4) The data owner accepts the client’s query and calculates the total weight assigned to its attributes, i.e., as shown in
・ Identifier attributes, such as Name and SSN, that can be used to identify a record are given weights equal to the threshold so that they are suppressed first.
・ Quasi-identifier attributes with higher utility needs, like Age, take large weights near the threshold so that they are the last to be generalized if generalization is needed; the others take different weights according to their utility-based needs.
・ Sensitive or confidential attributes do not take any weights, so that they are kept without generalization for statistical analysis.
5) If the total weight is under the predefined threshold limit of sensitiveness, the data owner releases the data requested in the client’s query.
6) If the total weight is above the predefined threshold limit of sensitiveness, the related attributes are considered sensitive attributes and the data values under those attributes are modified using generalization
Attribute | Distinct values | Generalizations | Height | Weights |
---|---|---|---|---|
Name | --------- | ------------ | ----- | 6 |
Age | 100 | 10-, 20-, 30-, …. | 4 | 5 |
Marital status | 8 | taxonomy tree | 2 | 3 |
Race | 5 | taxonomy tree | 2 | 4 |
Sex | 2 | Person | 1 | 2 |
Salary | 2 | sensitive attribute | 1 | none |
boundaries and suppression as follows:
・ Start by suppressing the identifier attributes, which have a weight equal to the threshold.
・ Then generalize the quasi-identifier attributes using generalization boundaries as in Figures 5-8, starting with the attribute that has the lowest weight; then perform the k-anonymity test, and if anonymity does not hold, continue with the attributes of the next higher weight until the suggested k-anonymity is reached while maintaining l-diversity.
7) The modified database portion is transferred to the data miner to perform the various data mining operations as per the client’s request.
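The steps above can be sketched as follows; the weights, the generalization boundaries, and the helper names are all hypothetical, and the sketch generalizes whole columns (global recoding) for simplicity:

```python
from collections import Counter

def is_k_anonymous(records, qi, k):
    """k-anonymity test: every QI-cluster must hold at least k records."""
    clusters = Counter(tuple(r[a] for a in qi) for r in records)
    return all(c >= k for c in clusters.values())

def anonymize(records, weights, threshold, k, generalizers):
    """Suppress identifiers (weight equal to the threshold), then
    generalize quasi-identifiers in ascending weight order, stopping as
    soon as the k-anonymity test passes."""
    records = [dict(r) for r in records]
    identifiers = [a for a, w in weights.items() if w == threshold]
    qi = [a for a in weights if a in generalizers]
    for r in records:
        for a in identifiers:
            r[a] = "*"                           # suppression
    for a in sorted(qi, key=weights.get):        # lowest weight first
        if is_k_anonymous(records, qi, k):
            break                                # anonymity reached: stop
        for r in records:
            r[a] = generalizers[a](r[a])         # generalize this column
    return records

# Toy run with hypothetical weights and generalization boundaries.
weights = {"Name": 6, "Age": 5, "Sex": 2}
gens = {"Sex": lambda v: "Person",
        "Age": lambda v: f"{(v // 10) * 10}-{(v // 10) * 10 + 9}"}
data = [{"Name": "Alice", "Age": 32, "Sex": "M"},
        {"Name": "Bob", "Age": 30, "Sex": "F"},
        {"Name": "Carol", "Age": 35, "Sex": "M"},
        {"Name": "Dave", "Age": 38, "Sex": "F"}]
out = anonymize(data, weights, threshold=6, k=2, generalizers=gens)
```

Here Name is suppressed first, Sex (weight 2) is generalized before Age (weight 5), and the loop stops as soon as the k-anonymity test passes.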
Now we consider a sample query as an example and show how the researchers treat it according to the proposed model.
Example: Query 1: Select * From Sample database Table.
From query 1 we find that:
・ According to the weights of the six attributes, one attribute is equal to the threshold, Name = 6, so we suppress it first.
・ The other five attributes are below the threshold limit because all of them are under 6, so we use only generalization boundaries according to Figures 5-8.
・ We start generalization with the Sex attribute, which has the lowest weight, 2, using
・ The next attribute is Marital-status, with the next weight, 3, using
・ The next attribute is Race, with the next weight, 4, using
・ The next attribute is Age, with the next weight, 5, using
The researchers compare three models as follows:
・ First: the Original Data Model, based on the real Adult data set from the UCI Machine Learning Repository. The Adult database contains 32561 records from US Census data. After preprocessing the data and removing records containing missing values, 30162 records are selected. This database contains 42 attributes, of which only 5 attributes are used, as
・ Second: the Modified Data Model, according to the utility-based anonymization using generalization boundaries to protect sensitive attributes depending on attribute sensitivity weights, developed in this paper.
・ Third: the K-anonymity Data Model, according to automatic detection of sensitive attributes without using generalization boundaries and without determining attribute weights; we refer to it as the old model in the next sections.
Name weight = 6 | Marital_status weight = 3 | Age weight = 5 | Race weight = 4 | Sex weight = 2 | Salary |
---|---|---|---|---|---|
Alice | Never_married | 32 | White | M | 50000+ |
Lelyan | Divorced | 30 | Black | F | −50000 |
Charley | Married-spouse_absent | 42 | Amer_Indian_Aleut_or_Eskimo | M | 50000+ |
Dave | Married-civilian_spouse_present | 40 | Asian_or_Pacific_Islander | M | −50000 |
John | Never_married | 20 | Other | M | −50000 |
Casey | Widowed | 25 | Asian_or_Pacific_Islander | F | 50000+ |
Name weight = 6 | Marital_status weight = 3 | Age weight = 5 | Race weight = 4 | Sex weight = 2 | Salary |
---|---|---|---|---|---|
* | Not_married | 30 - 40 | Colored | Person | 50000+ |
* | Not_married | 30 - 40 | Colored | Person | −50000 |
* | Married | 40 - 50 | Others | Person | 50000+ |
* | Married | 40 - 50 | Others | Person | −50000 |
* | Not_married | 20 - 30 | Others | Person | −50000 |
* | Not_married | 20 - 30 | Others | Person | 50000+ |
A cluster can be regarded as a collection of objects which are similar to one another and dissimilar to the objects belonging to other clusters.
The researchers concentrate their experiments here on numerical data, such as Age; the results can then be generalized to all numerical and categorical data.
From
These measures assess the differences between the empirical distribution functions obtained from the original and masked data.
The researchers perform three kinds of test analysis comparing the original data (c1) and the proposed technique’s data (c1mm):
the ks-test (Kolmogorov-Smirnov test), the t-test (Welch two-sample t-test), and the var-test (F-test to compare two variances).
・ ks-test (Kolmogorov-Smirnov test):
The Kolmogorov Goodness-of-Fit Test (Kolmogorov-Smirnov one-sample test) [
a) A test for goodness of fit usually involves examining a random sample from some unknown distribution in order to test the null hypothesis that the unknown distribution function is in fact a known, specified function.
b) We usually use Kolmogorov-Smirnov test to check the normality assumption in Analysis of Variance.
c) A random sample X1, X2, ..., Xn is drawn from some population and is compared with F*(x) in some way to see if it is reasonable to say that F*(x) is the true distribution function of the random sample.
d) One logical way of comparing the random sample with F*(x) is by means of the empirical distribution function S(x).
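A minimal sketch of the one-sample Kolmogorov statistic D = sup |S(x) − F*(x)| described above; in practice one would use a library routine such as R's ks.test, and the helper name here is ours:

```python
def ks_statistic(sample, cdf):
    """One-sample Kolmogorov statistic D = sup_x |S(x) - F*(x)|, where
    S is the empirical distribution of the sample and F* (given as the
    callable `cdf`) is the hypothesized distribution function."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # S jumps from i/n to (i+1)/n at x, so check both sides of the jump.
        d = max(d, abs((i + 1) / n - f), abs(f - i / n))
    return d

# Example: test a small sample against the Uniform(0, 1) CDF F*(x) = x.
u = [0.1, 0.2, 0.3, 0.4, 0.5]
print(ks_statistic(u, lambda x: min(max(x, 0.0), 1.0)))  # 0.5
```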
・ t-test (Welch two sample t-test) [
a) t-test performs t-tests on the equality of means. In the first form, t-test tests that varname has a mean of #.
b) In the second form, t-test tests that varname has the same mean within the two groups defined by groupvar.
c) In the third form, t-test tests that varname 1 and varname 2 have the same mean, assuming unpaired data.
d) In the fourth form, t-test tests that varname 1 and varname 2 have the same mean, assuming paired data.
e) ttesti is the immediate form of ttest. For the equivalent of a two-sample t-test with sampling weights (pweights), use the svy: mean command with the over() option, and then use lincom.
f) The t-test with the Welch (1936, 1938) correction was compared to parametric and permutational t-tests for two types of data distributions and for equal and unequal population variances. The result of a t-test is identical to that of an ANOVA computed for two groups; the t-statistic is the square root of the F-statistic used in the ANOVA.
g) The Welch correction was designed to provide a valid t-test in the presence of unequal population variances. It consists of using a corrected number of degrees of freedom to assess the significance of the t-statistic computed as usual; n is the next smaller integer of the value obtained.
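A sketch of the Welch t-statistic and the Welch-Satterthwaite corrected degrees of freedom described above; in practice one would use a library routine, and the helper name is ours:

```python
def welch_t(x, y):
    """Welch two-sample t-statistic and Welch-Satterthwaite degrees of
    freedom.  The correction uses the separate sample variances, so
    equal population variances are not assumed."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    se2 = vx / nx + vy / ny
    t = (mx - my) / se2 ** 0.5
    df = se2 ** 2 / ((vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1))
    return t, df

t, df = welch_t([1, 2, 3], [1, 2, 3])
print(t, df)  # t = 0.0 for equal means; df is approximately 4 here
```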
・ Var-test (F-test to compare two variances):
a) Performs an F-test to compare the variances of two samples from normal populations.
b) In R the function var.test allows for the comparison of two variances using an F-test. Although it is possible to compare the values of s² for two samples, there is no capability within R for comparing the variance of a sample, s², to the variance of a population, σ². There are two ways to interpret the results provided by R.
c) First, the p-value provides the smallest value of α for which the F-ratio is significantly different from the hypothesized value. If this value is larger than the desired α, then there is insufficient evidence to reject the null hypothesis; otherwise, the null hypothesis is rejected.
d) Second, R provides the desired confidence interval for the F-ratio; if the calculated value falls within the confidence interval, then the null hypothesis is retained.
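A sketch of the F-ratio underlying the variance comparison; computing the p-value itself requires the F distribution (as provided by R's var.test), and the helper name is ours:

```python
def f_ratio(x, y):
    """F-statistic for comparing two sample variances, F = s_x^2 / s_y^2,
    with (len(x) - 1, len(y) - 1) degrees of freedom."""
    def s2(s):
        m = sum(s) / len(s)
        return sum((v - m) ** 2 for v in s) / (len(s) - 1)
    return s2(x) / s2(y), len(x) - 1, len(y) - 1

f, d1, d2 = f_ratio([1, 2, 3, 4], [2, 4, 6, 8])
print(f, d1, d2)  # ratio is about 0.25 with (3, 3) degrees of freedom
```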
The ks-test shows that p-value = 0.0138 < 0.05 (significance level); therefore the distributions are different from one another, but not in their means and variances, as shown by the t-test and F-test, so the difference between the two distributions is negligible.
The t-test shows that the two distributions have the same mean, because p-value = 0.7278 > 0.05.
The F-test shows that the two distributions have the same variance, because p-value = 0.9857 > 0.05.
The ks-test thus indicates that the two distributions are indeed significantly different from one another (p-value < 0.05), but not in their means and variances. Therefore the researchers conclude that using generalization boundaries maintains data utility and does not materially affect the data.
The researchers perform here three kinds of test analysis comparing the original data (c1) and the old model’s data (D), as follows:
The t-test shows that the two distributions have different means.
The ks-test shows that the two distributions are much more significantly different from one another, with p-value < 2.2e−16 ≈ 0, and that there are differences between the means and variances, as indicated by the t-test and F-test with the same p-value.
The researchers perform an analysis test to show the difference between the three tests according to the age intervals and record the output in
・ The ks.test results show that the difference becomes more significant as the age interval increases.
・ The t.test and f.test results show that the two distributions differ insignificantly in their means and variances, with high p-values for the 10-year age interval.
・ From the above table we conclude that as the age interval decreases, the test results become less significant, i.e., better. So the more we restrict generalization according to our boundaries, the more utility we retain.
There are many threats in traditional k-anonymity privacy-preserving algorithms, which consider all sensitive
Interval | ks.test (p-value) | t.test (p-value) | f.test (p-value) |
---|---|---|---|
10 years | 0.0138 | 0.7278 | 0.9857 |
20 years | 4.814e−07 | 0.5555 | 0.8743 |
30 years | 3.331e−16 | 0.193 | 0.6996 |
40 years | <2.2e−16 | 0.2965 | 0.7264 |
50 years | <2.2e−16 | 0.2081 | 0.6054 |
0 < +100 (dummy) | <2.2e−16 | 2.312e−08 | <2.2e−16 |
attribute values at the same level and apply generalization to all of them; this leads to issues concerning information loss, data utility, and privacy.
So there is a need to develop a method which provides privacy with minimum information loss and maximum data utility. As shown in the related work, many techniques have been developed to automatically determine which part of a database needs scrambling; others scramble the whole database using generalization and suppression, which leads to very high information loss; still others scramble the data using generalization boundaries. In this paper, the researchers proposed a new utility-based anonymity model based on the sensitivity of the attributes in queries, which determines which part of the database needs changes according to the sensitivity weights of the attributes; only queries having sensitive values are then changed, depending on the threshold value. According to these weights, we can determine which attributes must be generalized first using generalization boundaries; information loss is thereby reduced and data utility increased, because the sensitive attributes in queries are not generalized fully but only within the generalization boundaries. Thus both individual privacy and data utility are preserved. The proposed model combines all the previous techniques, starting from the automatic detection of sensitive attributes by calculating their sensitivity weights, and using generalization boundaries while taking utility-based anonymization into account. Utility-based anonymization reduces information loss by starting generalization from the low weights and proceeding in ascending weight order. After finishing each attribute’s generalization, a test for anonymity decides whether to stop or to continue with the attribute of the next weight. Thus our technique addresses the three issues together in one model, reducing information loss, increasing data utility, and maintaining privacy, as the practical experiments prove.
The researchers conclude that all privacy techniques should not only pay attention to privacy but also take into account information loss and data utility, to avoid the trade-offs between them. There are many issues and ideas that can be tackled in the future, most of which relate to enhancing privacy techniques by combining several anonymity techniques, taking advantage of their strengths and avoiding their weaknesses, to maintain both privacy and data utility for useful knowledge extraction.