SIDA - Sensitive Data Detection


Background

The organization Sida needs to publish information, documents and reports in an open database in order to comply with the transparency guarantee. Before publication, the data needs to be reviewed to make sure that no sensitive personal data is published. What counts as sensitive can change over time, depending on current national legislation and the situation in the countries where Sida sponsors aid projects.

 

Challenge

First - Decrease the amount of time a document reviewer at Sida spends reviewing and prioritizing documents in order to report and set an initial risk factor on each document.

Second - Learn how data can be transferred from the open database to the review tool, and how the review tool can access new documents that are up for review.

 

Solution

The assignment started with Sida and Cybercom agreeing on a definition of sensitive personal data, referred to below as “data”. Using machine learning, Cybercom wanted to detect data in published documents and identify where in the documents the data could be found.

To automatically detect relevant combinations of data, the machine learning techniques “Named entity recognition” and “Entity linking” were chosen. A simple user interface was built where the result could be presented to and handled by the reviewer.
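The models used in the project are not described here. As a rough illustration of what the two techniques produce, the sketch below replaces a trained model with a hand-written lookup table: recognition finds mentions in the text, and linking maps different spellings of the same entity to one canonical ID. The table contents, the IDs and the `link_entities` function are all hypothetical and exist only for this example.

```python
import re
from dataclasses import dataclass

# Hypothetical knowledge base: surface form -> (entity type, canonical ID).
# In a real system a trained NER model would replace this lookup table.
KNOWLEDGE_BASE = {
    "jane doe": ("PERSON", "person:001"),
    "j. doe": ("PERSON", "person:001"),
    "example aid org": ("ORG", "org:042"),
}


@dataclass(frozen=True)
class Mention:
    text: str         # the text as it appears in the document
    start: int        # character offset of the mention
    entity_type: str  # coarse type, e.g. PERSON or ORG
    entity_id: str    # canonical ID shared by all spellings of one entity


def link_entities(text: str) -> list[Mention]:
    """Find known surface forms in the text and link each to a canonical ID."""
    mentions = []
    for surface, (entity_type, entity_id) in KNOWLEDGE_BASE.items():
        for match in re.finditer(re.escape(surface), text, flags=re.IGNORECASE):
            mentions.append(
                Mention(match.group(0), match.start(), entity_type, entity_id)
            )
    # Return mentions in reading order.
    return sorted(mentions, key=lambda m: m.start)


sample = "Jane Doe met Example Aid Org staff. Later, J. Doe signed the report."
for mention in link_entities(sample):
    print(mention)
```

In this toy example, “Jane Doe” and “J. Doe” resolve to the same ID, which is what makes it possible to reason about combinations of data that refer to the same person.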

The solution uses the machine learning model to detect data, tag the pages in a document where the data can be found, and set a sensitivity score on the document. The solution can also scan a set of documents and create a list prioritized by sensitivity score.
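The flow of detecting, tagging pages, scoring and prioritizing can be sketched as follows. This is a minimal illustration, not the project's implementation: two regular expressions stand in for the machine learning model, and the pattern names, weights and function names are assumptions made for this example.

```python
import re
from dataclasses import dataclass, field

# Stand-in detector (assumption): simple regexes instead of the ML model.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}

# Hypothetical weights: how much each kind of finding adds to the score.
WEIGHTS = {"email": 1.0, "phone": 2.0}


@dataclass
class Review:
    name: str
    tagged_pages: dict[int, list[str]] = field(default_factory=dict)
    score: float = 0.0


def review_document(name: str, pages: list[str]) -> Review:
    """Tag each page that contains findings and sum a document score."""
    review = Review(name)
    for page_no, text in enumerate(pages, start=1):
        # One label per match, so repeated findings raise the score.
        hits = [
            label
            for label, pattern in PATTERNS.items()
            for _ in pattern.finditer(text)
        ]
        if hits:
            review.tagged_pages[page_no] = hits
            review.score += sum(WEIGHTS[label] for label in hits)
    return review


def prioritize(documents: dict[str, list[str]]) -> list[Review]:
    """Review every document and return them with the highest score first."""
    reviews = [review_document(name, pages) for name, pages in documents.items()]
    return sorted(reviews, key=lambda r: r.score, reverse=True)


documents = {
    "report_a": ["Project summary, no contacts.", "Contact: anna@example.org"],
    "report_b": ["Call +46 70 123 45 67 or mail bo@example.org", "Budget table."],
    "report_c": ["Nothing sensitive here."],
}

for review in prioritize(documents):
    print(review.name, review.score, review.tagged_pages)
```

A reviewer working from such a list would open the highest-scoring document first and jump straight to the tagged pages.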

 

Effect

This prototype allowed Sida to understand the potential of machine learning and how parts of the review process can be automated. The result can be used as a tool whenever Sida wants to speed up the review process.

Irfan Khalid

Head of Innovation Cybercom Group
