4.16. Anonymization examples#
4.16.2. 1. Basic example#
4.16.2.1. Anonymizing a toy data set that includes all available attribute types by applying k-Anonymity via MDAV#
4.16.2.2. 1.1 First, install the anonymization library in the active environment#
# !pip install nltk
# !pip install ipywidgets
# !pip install -i https://test.pypi.org/simple/ anonymization-crisesURV==0.0.13
4.16.2.3. 1.2 Import classes and methods#
from anonymization.entities.dataset_CSV import Dataset_CSV
from anonymization.entities.dataset_DataFrame import Dataset_DataFrame
from anonymization.algorithms.anonymization_scheme import Anonymization_scheme
from anonymization.algorithms.k_anonymity import K_anonymity
from anonymization.algorithms.mdav import Mdav
from anonymization.utils import utils
4.16.2.3.1. 1.3 Next, indicate the path to the CSV file containing the data set and the path to the XML file describing its attributes. The XML file itself contains a detailed description of how to fill it in so that the different attribute types in the data set are properly configured#
path_csv = "https://raw.github.com/CrisesUrv/SoBigDataTraining/master/anonymization/input_datasets/toy_all_types.csv"
path_settings = "https://raw.github.com/CrisesUrv/SoBigDataTraining/master/anonymization/input_datasets/metadata_toy_all_types.xml"
# path_csv = "anonymization/input_datasets/toy_all_types.csv"
# path_settings = "anonymization/input_datasets/metadata_toy_all_types.xml"
data_frame = utils.read_dataframe_from_csv(path_csv)
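To see how the attribute types are declared, the metadata file can be fetched and inspected directly. The snippet below is a small helper, not part of the library; it simply downloads the XML referenced by path_settings above and prints its opening lines:

```python
from urllib.request import urlopen

# Fetch the metadata XML (URL set in path_settings above) and print its
# first lines; the file documents how to declare each attribute type
# and sensitivity type.
with urlopen(path_settings) as response:
    xml_text = response.read().decode("utf-8")
print("\n".join(xml_text.splitlines()[:20]))
```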
4.16.2.3.2. 1.4 The data set is loaded from a DataFrame passed as a parameter#
dataset = Dataset_DataFrame(data_frame, path_settings)
dataset.description()
Loading dataset
Dataset loaded: https://raw.github.com/CrisesUrv/SoBigDataTraining/master/anonymization/input_datasets/toy_all_types.csv
Records loaded: 10
Dataset: https://raw.github.com/CrisesUrv/SoBigDataTraining/master/anonymization/input_datasets/toy_all_types.csv
Dataset head:
| | hours-per-week | age | income | date | occupation | native-country | location | datetime |
|---|---|---|---|---|---|---|---|---|
| 0 | 40 | 39.5 | 23546 | 1/11/2016 | clerk | United_States | 43.8430139:10.507994 | 2011-02-03 08:34:04 |
| 1 | 13 | 50.3 | 10230 | 6/12/2015 | executive | United_States | 43.54427:10.32615 | 2011-02-03 09:34:04 |
| 2 | 40 | 38.0 | 0 | 19/7/2015 | cleaner | United_States | 43.70853:10.4036 | 2011-02-03 10:34:04 |
| 3 | 40 | 53.1 | 152000 | 25/7/2015 | cleaner | United_States | 43.77925:11.24626 | 2011-02-04 10:34:04 |
| 4 | 40 | 28.8 | 54120 | 10/8/2016 | specialist | Cuba | 43.8430139:10.507994 | 2011-02-04 08:34:04 |
Dataset description:
Data set: https://raw.github.com/CrisesUrv/SoBigDataTraining/master/anonymization/input_datasets/toy_all_types.csv
Records: 10
Attributes:
| | Name | Attribute_type | Sensitivity_type |
|---|---|---|---|
| 0 | hours-per-week | numerical_discrete | quasi_identifier |
| 1 | age | numerical_continuous | quasi_identifier |
| 2 | income | numerical_discrete | confidential |
| 3 | date | date | quasi_identifier |
| 4 | occupation | plain_categorical | quasi_identifier |
| 5 | native-country | plain_categorical | quasi_identifier |
| 6 | location | coordinate | quasi_identifier |
| 7 | datetime | datetime | quasi_identifier |
4.16.2.3.3. 1.5 The data set is anonymized, in this case applying k-anonymity via MDAV with a privacy level of k=3#
k = 3
anonymization_scheme = K_anonymity(dataset, k)
algorithm = Mdav()
anonymization_scheme.calculate_anonymization(algorithm)
Anonymizing k-Anonymity, k = 3 via MDAV
Anonymization runtime: 0:00:00
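For intuition about what the library just did: MDAV (Maximum Distance to Average Vector, Domingo-Ferrer and Torra 2005, see References) repeatedly builds two groups of k records, one around the record farthest from the centroid and one around the record farthest from that record, while at least 3k records remain; leftover records form the final group(s), and each group's values are then replaced by the group centroid. The sketch below is a didactic simplification for purely numerical data, not the library's implementation, which also handles dates, categories and coordinates:

```python
import numpy as np

def mdav_sketch(X, k):
    """Didactic MDAV on a numeric matrix X of shape (n_records, n_attrs).
    Returns one integer cluster label per record; every cluster has >= k
    members (assuming n_records >= k)."""
    def k_nearest(pool, ref):
        # indices (within pool) of the k records closest to X[ref]
        d = np.linalg.norm(X[pool] - X[ref], axis=1)
        return pool[np.argsort(d)[:k]]

    idx = np.arange(len(X))
    labels = np.empty(len(X), dtype=int)
    cluster = 0
    while len(idx) >= 3 * k:
        centroid = X[idx].mean(axis=0)
        # r: record farthest from the centroid; group it with its k-1 neighbours
        r = idx[np.argmax(np.linalg.norm(X[idx] - centroid, axis=1))]
        group = k_nearest(idx, r)
        labels[group] = cluster
        cluster += 1
        idx = np.setdiff1d(idx, group)
        # s: record farthest from r; same treatment
        s = idx[np.argmax(np.linalg.norm(X[idx] - X[r], axis=1))]
        group = k_nearest(idx, s)
        labels[group] = cluster
        cluster += 1
        idx = np.setdiff1d(idx, group)
    if len(idx) >= 2 * k:
        # One more group of k around the record farthest from the centroid
        centroid = X[idx].mean(axis=0)
        r = idx[np.argmax(np.linalg.norm(X[idx] - centroid, axis=1))]
        group = k_nearest(idx, r)
        labels[group] = cluster
        cluster += 1
        idx = np.setdiff1d(idx, group)
    labels[idx] = cluster  # leftover records form the last cluster
    return labels

# Example on the two numeric quasi-identifiers of the toy data set:
Z = data_frame[["hours-per-week", "age"]].to_numpy(dtype=float)
print(mdav_sketch(Z, 3))
```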
4.16.2.3.4. 1.6 Information loss metrics are calculated by comparing the original and anonymized data sets#
Metrics calculated: the Sum of Squared Errors (SSE) and, for each attribute, the mean and variance
information_loss = Anonymization_scheme.calculate_information_loss(dataset, anonymization_scheme.anonymized_dataset)
information_loss.description()
Calculating information loss metrics
Information loss metrics:
SSE: 0.892
| | Name | Original mean | Anonymized mean | Original variance | Anonymized variance |
|---|---|---|---|---|---|
| 0 | hours-per-week | 36.4 | 36.3 | 1.302000e+02 | 5.550000e+01 |
| 1 | age | 42.33 | 42.34 | 6.654810e+01 | 2.932440e+01 |
| 2 | income | 35771.4 | 35771.4 | 1.687522e+09 | 1.687522e+09 |
| 3 | date | 18/2/2016 | 18/2/2016 | 3.245814e+14 | 7.524886e+13 |
| 4 | occupation | executive | specialist | 6.000000e-01 | 6.000000e-01 |
| 5 | native-country | United_States | United_States | 2.000000e-01 | 0.000000e+00 |
| 6 | location | 43.70666917:10.495949199999998 | 43.70666917:10.495949200000002 | 8.156504e-02 | 1.272151e-02 |
| 7 | datetime | 2011-02-04 09:46:04 | 2011-02-04 09:46:04 | 4.491418e+09 | 1.714414e+09 |
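For intuition about the SSE figure: on purely numerical attributes it amounts to summing squared differences between each original record and its anonymized counterpart, after scaling so that attributes are comparable. Below is a minimal sketch on the two numeric quasi-identifiers only, assuming standardization by the original standard deviation; the library's metric also folds dates, categories and coordinates into the SSE, so this number will not match the 0.892 above:

```python
import numpy as np

# SSE sketch on the numeric quasi-identifiers only. Scaling by the
# original standard deviation is an assumption about the normalization.
num_cols = ["hours-per-week", "age"]
orig = data_frame[num_cols].to_numpy(dtype=float)
anon = anonymization_scheme.anonymized_dataset_to_dataframe()[num_cols].to_numpy(dtype=float)
scale = orig.std(axis=0)
sse = float((((orig - anon) / scale) ** 2).sum())
print(f"SSE over numeric attributes: {sse:.3f}")
```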
4.16.2.3.5. 1.7 Disclosure risk is calculated via record linkage between the anonymized and original data sets#
The disclosure risk estimates the percentage of anonymized records that can be correctly matched with their original counterparts
disclosure_risk = Anonymization_scheme.calculate_record_linkage(dataset, anonymization_scheme.anonymized_dataset)
disclosure_risk.description()
Calculating record linkage (disclosure risk)
Disclosure risk: 3.000 (30.00%)
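The 30% figure can be read as follows: an attacker links each anonymized record to its nearest original record on the quasi-identifiers, and the risk is the fraction of links that hit the true source record. Below is a minimal sketch on the numeric attributes only, assuming rows keep their original order; the library's linkage uses all quasi-identifier types, so its estimate may differ:

```python
import numpy as np

# Distance-based record linkage on the numeric quasi-identifiers only.
# Assumes row i of the anonymized data stems from row i of the original.
num_cols = ["hours-per-week", "age"]
orig = data_frame[num_cols].to_numpy(dtype=float)
anon = anonymization_scheme.anonymized_dataset_to_dataframe()[num_cols].to_numpy(dtype=float)
scale = orig.std(axis=0)
# Distance from every anonymized record to every original record
dists = np.linalg.norm((anon[:, None, :] - orig[None, :, :]) / scale, axis=2)
linked = dists.argmin(axis=1)                   # nearest original per anonymized record
risk = (linked == np.arange(len(anon))).mean()  # fraction of correct re-identifications
print(f"Estimated disclosure risk: {risk:.2%}")
```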
4.16.2.3.6. 1.8 The anonymized data set can be saved to a CSV-formatted file#
anonymization_scheme.save_anonymized_dataset("toy_all_types_anom.csv")
'Dataset saved: toy_all_types_anom.csv'
4.16.2.3.7. 1.9 The anonymized data set can be converted to a DataFrame#
df_anonymized = anonymization_scheme.anonymized_dataset_to_dataframe()
df_anonymized.head()
| | hours-per-week | age | income | date | occupation | native-country | location | datetime |
|---|---|---|---|---|---|---|---|---|
| 0 | 40 | 38.4 | 23546 | 28/4/2016 | clerk | United_States | 43.75335796666667:10.438398000000001 | 2011-02-03 17:34:04 |
| 1 | 25 | 50.6 | 10230 | 18/9/2015 | executive | United_States | 43.643851299999994:10.386764666666666 | 2011-02-04 10:34:04 |
| 2 | 40 | 38.4 | 0 | 28/4/2016 | clerk | United_States | 43.75335796666667:10.438398000000001 | 2011-02-03 17:34:04 |
| 3 | 42 | 39.1 | 152000 | 21/4/2016 | specialist | United_States | 43.718765975:10.621001 | 2011-02-04 21:19:04 |
| 4 | 42 | 39.1 | 54120 | 21/4/2016 | specialist | United_States | 43.718765975:10.621001 | 2011-02-04 21:19:04 |
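A quick sanity check on the result: in a k-anonymous data set every combination of quasi-identifier values must occur at least k times, because MDAV replaces each group's quasi-identifiers with the group centroid. This can be verified directly on the DataFrame (the column list is taken from the attribute description above):

```python
# Every combination of quasi-identifier values must appear at least k times
qi_cols = ["hours-per-week", "age", "date", "occupation",
           "native-country", "location", "datetime"]
group_sizes = df_anonymized.groupby(qi_cols).size()
print("k-anonymous:", bool(group_sizes.min() >= k))
```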
4.16.2.3.8. 1.10 The previously saved anonymized data set and the original data set can be reloaded in order to calculate the privacy metrics a posteriori#
4.16.2.3.8.1. The original data set is loaded#
path_csv = "https://raw.github.com/CrisesUrv/SoBigDataTraining/master/anonymization/input_datasets/toy_all_types.csv"
path_settings = "https://raw.github.com/CrisesUrv/SoBigDataTraining/master/anonymization/input_datasets/metadata_toy_all_types.xml"
df = utils.read_dataframe_from_csv(path_csv)
dataset_original = Dataset_DataFrame(df, path_settings)
dataset_original.description()
Loading dataset
Dataset loaded: https://raw.github.com/CrisesUrv/SoBigDataTraining/master/anonymization/input_datasets/toy_all_types.csv
Records loaded: 10
Dataset: https://raw.github.com/CrisesUrv/SoBigDataTraining/master/anonymization/input_datasets/toy_all_types.csv
Dataset head:
| | hours-per-week | age | income | date | occupation | native-country | location | datetime |
|---|---|---|---|---|---|---|---|---|
| 0 | 40 | 39.5 | 23546 | 1/11/2016 | clerk | United_States | 43.8430139:10.507994 | 2011-02-03 08:34:04 |
| 1 | 13 | 50.3 | 10230 | 6/12/2015 | executive | United_States | 43.54427:10.32615 | 2011-02-03 09:34:04 |
| 2 | 40 | 38.0 | 0 | 19/7/2015 | cleaner | United_States | 43.70853:10.4036 | 2011-02-03 10:34:04 |
| 3 | 40 | 53.1 | 152000 | 25/7/2015 | cleaner | United_States | 43.77925:11.24626 | 2011-02-04 10:34:04 |
| 4 | 40 | 28.8 | 54120 | 10/8/2016 | specialist | Cuba | 43.8430139:10.507994 | 2011-02-04 08:34:04 |
Dataset description:
Data set: https://raw.github.com/CrisesUrv/SoBigDataTraining/master/anonymization/input_datasets/toy_all_types.csv
Records: 10
Attributes:
| | Name | Attribute_type | Sensitivity_type |
|---|---|---|---|
| 0 | hours-per-week | numerical_discrete | quasi_identifier |
| 1 | age | numerical_continuous | quasi_identifier |
| 2 | income | numerical_discrete | confidential |
| 3 | date | date | quasi_identifier |
| 4 | occupation | plain_categorical | quasi_identifier |
| 5 | native-country | plain_categorical | quasi_identifier |
| 6 | location | coordinate | quasi_identifier |
| 7 | datetime | datetime | quasi_identifier |
4.16.2.3.8.2. The anonymized data set is loaded (from the previously saved data)#
path_csv = "toy_all_types_anom.csv"
path_settings = "https://raw.github.com/CrisesUrv/SoBigDataTraining/master/anonymization/input_datasets/metadata_toy_all_types.xml"
df = utils.read_dataframe_from_csv(path_csv)
dataset_anonymized = Dataset_DataFrame(df, path_settings)
dataset_anonymized.description()
Loading dataset
Dataset loaded: toy_all_types_anom.csv
Records loaded: 10
Dataset: toy_all_types_anom.csv
Dataset head:
| | hours-per-week | age | income | date | occupation | native-country | location | datetime |
|---|---|---|---|---|---|---|---|---|
| 0 | 40 | 38.4 | 23546 | 28/4/2016 | clerk | United_States | 43.75335796666667:10.438398000000001 | 2011-02-03 17:34:04 |
| 1 | 25 | 50.6 | 10230 | 18/9/2015 | executive | United_States | 43.643851299999994:10.386764666666666 | 2011-02-04 10:34:04 |
| 2 | 40 | 38.4 | 0 | 28/4/2016 | clerk | United_States | 43.75335796666667:10.438398000000001 | 2011-02-03 17:34:04 |
| 3 | 42 | 39.1 | 152000 | 21/4/2016 | specialist | United_States | 43.718765975:10.621001 | 2011-02-04 21:19:04 |
| 4 | 42 | 39.1 | 54120 | 21/4/2016 | specialist | United_States | 43.718765975:10.621001 | 2011-02-04 21:19:04 |
Dataset description:
Data set: toy_all_types_anom.csv
Records: 10
Attributes:
| | Name | Attribute_type | Sensitivity_type |
|---|---|---|---|
| 0 | hours-per-week | numerical_discrete | quasi_identifier |
| 1 | age | numerical_continuous | quasi_identifier |
| 2 | income | numerical_discrete | confidential |
| 3 | date | date | quasi_identifier |
| 4 | occupation | plain_categorical | quasi_identifier |
| 5 | native-country | plain_categorical | quasi_identifier |
| 6 | location | coordinate | quasi_identifier |
| 7 | datetime | datetime | quasi_identifier |
4.16.2.3.8.3. Information loss and disclosure risk metrics are calculated#
information_loss = Anonymization_scheme.calculate_information_loss(dataset_original, dataset_anonymized)
information_loss.description()
disclosure_risk = Anonymization_scheme.calculate_record_linkage(dataset_original, dataset_anonymized)
disclosure_risk.description()
Calculating information loss metrics
Information loss metrics:
SSE: 0.892
| | Name | Original mean | Anonymized mean | Original variance | Anonymized variance |
|---|---|---|---|---|---|
| 0 | hours-per-week | 36.4 | 36.3 | 1.302000e+02 | 5.550000e+01 |
| 1 | age | 42.33 | 42.34 | 6.654810e+01 | 2.932440e+01 |
| 2 | income | 35771.4 | 35771.4 | 1.687522e+09 | 1.687522e+09 |
| 3 | date | 18/2/2016 | 18/2/2016 | 3.245814e+14 | 7.524886e+13 |
| 4 | occupation | executive | specialist | 6.000000e-01 | 6.000000e-01 |
| 5 | native-country | United_States | United_States | 2.000000e-01 | 0.000000e+00 |
| 6 | location | 43.70666917:10.495949199999998 | 43.70666917:10.495949200000002 | 8.156504e-02 | 1.272151e-02 |
| 7 | datetime | 2011-02-04 09:46:04 | 2011-02-04 09:46:04 | 4.491418e+09 | 1.714414e+09 |
Calculating record linkage (disclosure risk)
Disclosure risk: 3.000 (30.00%)
4.16.2.4. References#
Josep Domingo-Ferrer and Vicenç Torra, “Ordinal, continuous and heterogeneous k-anonymity through microaggregation”, Data Mining and Knowledge Discovery, Vol. 11, pp. 195-212, Sep 2005. DOI: https://doi.org/10.1007/s10618-005-0007-5
Jordi Soria-Comas, Josep Domingo-Ferrer, David Sánchez and Sergio Martínez, “t-Closeness through microaggregation: strict privacy with enhanced utility preservation”, IEEE Transactions on Knowledge and Data Engineering, Vol. 27, no. 11, pp. 3098-3110, Oct 2015. DOI: https://doi.org/10.1109/TKDE.2015.2435777
Jordi Soria-Comas, Josep Domingo-Ferrer, David Sánchez and Sergio Martínez, “Enhancing data utility in differential privacy via microaggregation-based k-anonymity”, The VLDB Journal, Vol. 23, no. 5, pp. 771-794, Sep 2014. DOI: https://doi.org/10.1007/s00778-014-0351-4
Josep Domingo-Ferrer and Vicenç Torra, “Disclosure risk assessment in statistical data protection”, Journal of Computational and Applied Mathematics, Vol. 164, pp. 285-293, Mar 2004. DOI: https://doi.org/10.1016/S0377-0427(03)00643-5