Data anonymization plays a pivotal role in Artificial Intelligence (AI) and Machine Learning (ML): it protects individual privacy while still permitting data-driven insight. Large datasets in healthcare, finance, and many other industries carry highly sensitive information, making privacy a central concern. Presidio, Microsoft's open-source tool, detects and anonymizes personally identifiable information (PII) in both structured and unstructured data. This article examines the trade-off between privacy and model performance, showing how Presidio's ML-friendly anonymization techniques, such as token replacement, masking, and data perturbation, protect sensitive data while preserving its utility. It also discusses how Presidio helps organizations meet the requirements of privacy regulations such as GDPR, HIPAA, and CCPA, and do so responsibly. Key challenges of data anonymization, including loss of model accuracy and re-identification risk, are examined alongside the ways Presidio mitigates them. Effective anonymization with Presidio thus enables organizations to build privacy-compliant, high-performing AI systems that handle data responsibly.
Keywords: Data Anonymization, AI Privacy, Machine Learning, Presidio, Personally Identifiable Information (PII)
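To make the masking technique mentioned above concrete, the sketch below shows PII masking in plain Python. It is an illustration only, not Presidio's API: it substitutes simple regexes for Presidio's NER-based recognizers, and the function name and patterns are assumptions introduced here for clarity.

```python
import re

def mask_pii(text: str, masking_char: str = "*") -> str:
    """Mask simple PII patterns (emails, US-style phone numbers) with a fixed character.

    A minimal stand-in for a masking operator; a real deployment would use
    Presidio's recognizers (which combine regexes, checksums, and NER models)
    rather than these two illustrative patterns.
    """
    patterns = [
        r"[\w.+-]+@[\w-]+\.[\w.]+",      # email addresses
        r"\b\d{3}[-.]\d{3}[-.]\d{4}\b",  # phone numbers like 212-555-5555
    ]
    for pattern in patterns:
        # Replace each match with a run of masking characters of equal length,
        # so the document layout and token lengths are preserved.
        text = re.sub(pattern, lambda m: masking_char * len(m.group()), text)
    return text

print(mask_pii("Contact John at john.doe@example.com or 212-555-5555."))
```

Note that the bare name "John" survives this regex-only pass; detecting free-text names is exactly where NER-based tools such as Presidio earn their keep over hand-written patterns.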
IRE Journals:
Surya Gangadhar Patchipala, "Data Anonymization in AI and ML Engineering: Balancing Privacy and Model Performance Using Presidio," Iconic Research And Engineering Journals, vol. 6, no. 10, pp. 992-1004, 2023.