4.16. Anonymization examples#

4.16.1. Table of Contents#

  1. Basic Example

  2. k-Anonymity via MDAV

  3. k-Anonymity via Microaggregation

  4. k-t-closeness via Microaggregation

  5. Differential privacy via MDAV

  6. Comparative study

4.16.2. 1. Basic example#

4.16.2.1. Anonymizing a toy data set that includes all available attribute types by applying k-Anonymity via MDAV#

4.16.2.2. 1.1 First, it is necessary to install the anonymization library in the active environment#

# !pip install nltk
# !pip install ipywidgets
# !pip install -i https://test.pypi.org/simple/ anonymization-crisesURV==0.0.13
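
A quick sanity check that the library is importable in the active environment (a minimal sketch; the import name matches the package used in the cells below):

# Sanity check: the package should be importable once installed
try:
    import anonymization
    print("anonymization package available")
except ModuleNotFoundError:
    print("package not found; run the pip commands above first")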

4.16.2.3. 1.2 Import classes and methods#

from anonymization.entities.dataset_CSV import Dataset_CSV
from anonymization.entities.dataset_DataFrame import Dataset_DataFrame
from anonymization.algorithms.anonymization_scheme import Anonymization_scheme
from anonymization.algorithms.k_anonymity import K_anonymity
from anonymization.algorithms.mdav import Mdav
from anonymization.utils import utils

4.16.2.3.1. 1.3 Next, the paths to the CSV file containing the data set and to the XML file describing its attributes are indicated. The XML file itself includes a detailed description of how to fill it in so that the different attribute types in the data set are properly configured#

path_csv = "https://raw.github.com/CrisesUrv/SoBigDataTraining/master/anonymization/input_datasets/toy_all_types.csv"
path_settings = "https://raw.github.com/CrisesUrv/SoBigDataTraining/master/anonymization/input_datasets/metadata_toy_all_types.xml"
# path_csv = "anonymization/input_datasets/toy_all_types.csv"
# path_settings = "anonymization/input_datasets/metadata_toy_all_types.xml"
data_frame = utils.read_dataframe_from_csv(path_csv)
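
Since the metadata file is plain XML, it can be fetched and inspected directly before loading the data set; a minimal sketch using only the Python standard library:

from urllib.request import urlopen

# Print the beginning of the metadata XML; the file itself documents
# how to declare each attribute's type and sensitivity
with urlopen(path_settings) as response:
    print(response.read().decode("utf-8")[:500])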

4.16.2.3.2. 1.4 The data set is loaded from a DataFrame passed as a parameter#

dataset = Dataset_DataFrame(data_frame, path_settings)
dataset.description()
Loading dataset
Dataset loaded: https://raw.github.com/CrisesUrv/SoBigDataTraining/master/anonymization/input_datasets/toy_all_types.csv
Records loaded: 10
Dataset: https://raw.github.com/CrisesUrv/SoBigDataTraining/master/anonymization/input_datasets/toy_all_types.csv
Dataset head:
hours-per-week age income date occupation native-country location datetime
0 40 39.5 23546 1/11/2016 clerk United_States 43.8430139:10.507994 2011-02-03 08:34:04
1 13 50.3 10230 6/12/2015 executive United_States 43.54427:10.32615 2011-02-03 09:34:04
2 40 38.0 0 19/7/2015 cleaner United_States 43.70853:10.4036 2011-02-03 10:34:04
3 40 53.1 152000 25/7/2015 cleaner United_States 43.77925:11.24626 2011-02-04 10:34:04
4 40 28.8 54120 10/8/2016 specialist Cuba 43.8430139:10.507994 2011-02-04 08:34:04
Dataset description:
Data set: https://raw.github.com/CrisesUrv/SoBigDataTraining/master/anonymization/input_datasets/toy_all_types.csv
Records: 10
Attributes:
Name Attribute_type Sensitivity_type
0 hours-per-week numerical_discrete quasi_identifier
1 age numerical_continuous quasi_identifier
2 income numerical_discrete confidential
3 date date quasi_identifier
4 occupation plain_categorical quasi_identifier
5 native-country plain_categorical quasi_identifier
6 location coordinate quasi_identifier
7 datetime datetime quasi_identifier

4.16.2.3.3. 1.5 The data set is anonymized, in this case by applying k-anonymity via MDAV with a privacy level of k=3#

k = 3
anonymization_scheme = K_anonymity(dataset, k)
algorithm = Mdav()
anonymization_scheme.calculate_anonymization(algorithm)
Anonymizing k-Anonymity, k = 3 via MDAV
Anonymization runtime: 0:00:00
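
For intuition about what MDAV (Maximum Distance to Average Vector) does, the following is a minimal numeric-only sketch, not the library's implementation: clusters of at least k records are built around the most extreme remaining points, and each record is replaced by its cluster centroid.

import numpy as np

def mdav_numeric(X, k):
    # Illustrative MDAV sketch for purely numeric data. Each record is
    # replaced by its cluster centroid, so every output value appears
    # at least k times over these attributes.
    X = np.asarray(X, dtype=float)
    remaining = list(range(len(X)))
    clusters = []
    while len(remaining) >= 3 * k:
        pts = X[remaining]
        centroid = pts.mean(axis=0)
        # r: record farthest from the centroid; s: record farthest from r
        r = remaining[int(np.argmax(np.linalg.norm(pts - centroid, axis=1)))]
        s = remaining[int(np.argmax(np.linalg.norm(pts - X[r], axis=1)))]
        for ref in (r, s):
            d = np.linalg.norm(X[remaining] - X[ref], axis=1)
            group = [remaining[i] for i in np.argsort(d)[:k]]
            clusters.append(group)
            remaining = [i for i in remaining if i not in group]
    clusters.append(remaining)  # between k and 3k-1 leftover records
    X_anon = X.copy()
    for group in clusters:
        X_anon[group] = X[group].mean(axis=0)
    return X_anon

# e.g. mdav_numeric(data_frame[["hours-per-week", "age"]].to_numpy(), k=3)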

4.16.2.3.4. 1.6 Information loss metrics are calculated by comparing the original and anonymized data sets#

Metrics calculated: Sum of Squared Errors (SSE) and, for each attribute, the mean and variance

information_loss = Anonymization_scheme.calculate_information_loss(dataset, anonymization_scheme.anonymized_dataset)
information_loss.description()
Calculating information loss metrics

Information loss metrics:
SSE: 0.892
Name Original mean Anonymized mean Original variance Anonymized variance
0 hours-per-week 36.4 36.3 1.302000e+02 5.550000e+01
1 age 42.33 42.34 6.654810e+01 2.932440e+01
2 income 35771.4 35771.4 1.687522e+09 1.687522e+09
3 date 18/2/2016 18/2/2016 3.245814e+14 7.524886e+13
4 occupation executive specialist 6.000000e-01 6.000000e-01
5 native-country United_States United_States 2.000000e-01 0.000000e+00
6 location 43.70666917:10.495949199999998 43.70666917:10.495949200000002 8.156504e-02 1.272151e-02
7 datetime 2011-02-04 09:46:04 2011-02-04 09:46:04 4.491418e+09 1.714414e+09
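
As a rough illustration of an SSE-style metric (the library's exact formula may differ), squared differences between original and anonymized values can be accumulated over standardized numeric attributes:

import numpy as np

def sse_numeric(df_orig, df_anon, columns):
    # Illustrative SSE; scaling by each attribute's original standard
    # deviation makes attributes on different scales comparable
    total = 0.0
    for col in columns:
        sigma = df_orig[col].std()
        diff = (df_orig[col].to_numpy() - df_anon[col].to_numpy()) / sigma
        total += float(np.sum(diff ** 2))
    return total

# e.g. sse_numeric(data_frame, anonymization_scheme.anonymized_dataset_to_dataframe(),
#                  ["hours-per-week", "age", "income"])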

4.16.2.3.5. 1.7 Disclosure risk is calculated via record linkage between the anonymized and original data sets#

The disclosure risk estimates the percentage of anonymized records that can be correctly matched to their original counterparts

disclosure_risk = Anonymization_scheme.calculate_record_linkage(dataset, anonymization_scheme.anonymized_dataset)
disclosure_risk.description()
Calculating record linkage (disclosure risk)
Disclosure risk: 3.000 (30.00%)
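
For intuition, distance-based record linkage can be sketched as a nearest-neighbour search over numeric quasi-identifiers: an anonymized record counts as re-identified when its nearest original record is its true source. This is an illustrative simplification; the library's linkage model may be more elaborate.

import numpy as np

def linkage_rate(orig, anon):
    # Assumes row i of `anon` was derived from row i of `orig`
    orig = np.asarray(orig, dtype=float)
    anon = np.asarray(anon, dtype=float)
    matches = 0
    for i in range(len(anon)):
        nearest = int(np.argmin(np.linalg.norm(orig - anon[i], axis=1)))
        matches += int(nearest == i)
    return matches / len(anon)

# e.g. linkage_rate(data_frame[["hours-per-week", "age"]],
#                   anonymization_scheme.anonymized_dataset_to_dataframe()[["hours-per-week", "age"]])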

4.16.2.3.6. 1.8 The anonymized data set can be saved to a CSV-formatted file#

anonymization_scheme.save_anonymized_dataset("toy_all_types_anom.csv")
'Dataset saved: toy_all_types_anom.csv'

4.16.2.3.7. 1.9 The anonymized data set can be converted to a DataFrame#

df_anonymized = anonymization_scheme.anonymized_dataset_to_dataframe()
df_anonymized.head()
hours-per-week age income date occupation native-country location datetime
0 40 38.4 23546 28/4/2016 clerk United_States 43.75335796666667:10.438398000000001 2011-02-03 17:34:04
1 25 50.6 10230 18/9/2015 executive United_States 43.643851299999994:10.386764666666666 2011-02-04 10:34:04
2 40 38.4 0 28/4/2016 clerk United_States 43.75335796666667:10.438398000000001 2011-02-03 17:34:04
3 42 39.1 152000 21/4/2016 specialist United_States 43.718765975:10.621001 2011-02-04 21:19:04
4 42 39.1 54120 21/4/2016 specialist United_States 43.718765975:10.621001 2011-02-04 21:19:04
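
With both frames in pandas, the original and anonymized values can be inspected side by side; a minimal sketch, assuming both frames keep the same index and columns (pandas >= 1.1):

# Cells that differ are shown as (self, other) pairs; identical cells are omitted
data_frame.compare(df_anonymized)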

4.16.2.3.8. 1.10 The previously saved anonymized and original data sets can be loaded in order to calculate the privacy metrics a posteriori#

4.16.2.3.8.1. The original data set is loaded#
path_csv = "https://raw.github.com/CrisesUrv/SoBigDataTraining/master/anonymization/input_datasets/toy_all_types.csv"
path_settings = "https://raw.github.com/CrisesUrv/SoBigDataTraining/master/anonymization/input_datasets/metadata_toy_all_types.xml"
df = utils.read_dataframe_from_csv(path_csv)
dataset_original = Dataset_DataFrame(df, path_settings)
dataset_original.description()
Loading dataset
Dataset loaded: https://raw.github.com/CrisesUrv/SoBigDataTraining/master/anonymization/input_datasets/toy_all_types.csv
Records loaded: 10
Dataset: https://raw.github.com/CrisesUrv/SoBigDataTraining/master/anonymization/input_datasets/toy_all_types.csv
Dataset head:
hours-per-week age income date occupation native-country location datetime
0 40 39.5 23546 1/11/2016 clerk United_States 43.8430139:10.507994 2011-02-03 08:34:04
1 13 50.3 10230 6/12/2015 executive United_States 43.54427:10.32615 2011-02-03 09:34:04
2 40 38.0 0 19/7/2015 cleaner United_States 43.70853:10.4036 2011-02-03 10:34:04
3 40 53.1 152000 25/7/2015 cleaner United_States 43.77925:11.24626 2011-02-04 10:34:04
4 40 28.8 54120 10/8/2016 specialist Cuba 43.8430139:10.507994 2011-02-04 08:34:04
Dataset description:
Data set: https://raw.github.com/CrisesUrv/SoBigDataTraining/master/anonymization/input_datasets/toy_all_types.csv
Records: 10
Attributes:
Name Attribute_type Sensitivity_type
0 hours-per-week numerical_discrete quasi_identifier
1 age numerical_continuous quasi_identifier
2 income numerical_discrete confidential
3 date date quasi_identifier
4 occupation plain_categorical quasi_identifier
5 native-country plain_categorical quasi_identifier
6 location coordinate quasi_identifier
7 datetime datetime quasi_identifier

4.16.2.3.8.2. The anonymized data set is loaded (from the previously saved data)#
path_csv = "toy_all_types_anom.csv"
path_settings = "https://raw.github.com/CrisesUrv/SoBigDataTraining/master/anonymization/input_datasets/metadata_toy_all_types.xml"
df = utils.read_dataframe_from_csv(path_csv)
dataset_anonymized = Dataset_DataFrame(df, path_settings)
dataset_anonymized.description()
Loading dataset
Dataset loaded: toy_all_types_anom.csv
Records loaded: 10
Dataset: toy_all_types_anom.csv
Dataset head:
hours-per-week age income date occupation native-country location datetime
0 40 38.4 23546 28/4/2016 clerk United_States 43.75335796666667:10.438398000000001 2011-02-03 17:34:04
1 25 50.6 10230 18/9/2015 executive United_States 43.643851299999994:10.386764666666666 2011-02-04 10:34:04
2 40 38.4 0 28/4/2016 clerk United_States 43.75335796666667:10.438398000000001 2011-02-03 17:34:04
3 42 39.1 152000 21/4/2016 specialist United_States 43.718765975:10.621001 2011-02-04 21:19:04
4 42 39.1 54120 21/4/2016 specialist United_States 43.718765975:10.621001 2011-02-04 21:19:04
Dataset description:
Data set: toy_all_types_anom.csv
Records: 10
Attributes:
Name Attribute_type Sensitivity_type
0 hours-per-week numerical_discrete quasi_identifier
1 age numerical_continuous quasi_identifier
2 income numerical_discrete confidential
3 date date quasi_identifier
4 occupation plain_categorical quasi_identifier
5 native-country plain_categorical quasi_identifier
6 location coordinate quasi_identifier
7 datetime datetime quasi_identifier

4.16.2.3.8.3. Information loss and disclosure risk metrics are calculated#
information_loss = Anonymization_scheme.calculate_information_loss(dataset_original, dataset_anonymized)
information_loss.description()
disclosure_risk = Anonymization_scheme.calculate_record_linkage(dataset_original, dataset_anonymized)
disclosure_risk.description()
Calculating information loss metrics

Information loss metrics:
SSE: 0.892
Name Original mean Anonymized mean Original variance Anonymized variance
0 hours-per-week 36.4 36.3 1.302000e+02 5.550000e+01
1 age 42.33 42.34 6.654810e+01 2.932440e+01
2 income 35771.4 35771.4 1.687522e+09 1.687522e+09
3 date 18/2/2016 18/2/2016 3.245814e+14 7.524886e+13
4 occupation executive specialist 6.000000e-01 6.000000e-01
5 native-country United_States United_States 2.000000e-01 0.000000e+00
6 location 43.70666917:10.495949199999998 43.70666917:10.495949200000002 8.156504e-02 1.272151e-02
7 datetime 2011-02-04 09:46:04 2011-02-04 09:46:04 4.491418e+09 1.714414e+09
Calculating record linkage (disclosure risk)
Disclosure risk: 3.000 (30.00%)

4.16.2.4. References#

  1. Josep Domingo-Ferrer and Vicenç Torra, “Ordinal, continuous and heterogeneous k-anonymity through microaggregation”, Data Mining and Knowledge Discovery, Vol. 11, pp. 195-212, Sep 2005. DOI: https://doi.org/10.1007/s10618-005-0007-5

  2. Jordi Soria-Comas, Josep Domingo-Ferrer, David Sánchez and Sergio Martínez, “t-Closeness through microaggregation: strict privacy with enhanced utility preservation”, IEEE Transactions on Knowledge and Data Engineering, Vol. 27, no. 11, pp. 3098-3110, Oct 2015. DOI: https://doi.org/10.1109/TKDE.2015.2435777

  3. Jordi Soria-Comas, Josep Domingo-Ferrer, David Sánchez and Sergio Martínez, “Enhancing data utility in differential privacy via microaggregation-based k-anonymity”, The VLDB Journal, Vol. 23, no. 5, pp. 771-794, Sep 2014. DOI: https://doi.org/10.1007/s00778-014-0351-4

  4. Josep Domingo-Ferrer and Vicenç Torra, “Disclosure risk assessment in statistical data protection”, Journal of Computational and Applied Mathematics, Vol. 164, pp. 285-293, Mar 2004. DOI: https://doi.org/10.1016/S0377-0427(03)00643-5