Privacy issues in machine learning and artificial intelligence – a task for a data science consultant

Machine learning and artificial intelligence increasingly influence many of our decisions. This article is part of a series on website categorization.

Many people rely on digital assistants every day for numerous tasks, be it Cortana on Windows or Siri on mobile phones.

Then there is Amazon's Alexa, and Google has its own offering. All of these programs are driven by enormous amounts of data.

Data has become the new “oil” of this economy.

Some of the data involved is not data about humans. In industry, for example, one may want to know when a machine is likely to fail, so that one can order a replacement or prepare to repair the current one. For this purpose, one may install all kinds of sensors on the machines and use the sensor data as input to machine learning models that try to predict future failures.

This area is called predictive maintenance, and it does not involve any personal data, only data about machines.

On the other hand, for many of the decisions that machine learning and AI models make about us, the data involved is of course personal.

Machine learning models need our data for two purposes:

– to learn, i.e. to train the model

– to make predictions or inferences

In the first case, the amount of data required can be very large. There is an old rule of thumb in machine learning: usually, the more data, the better the performance, as measured by metrics such as accuracy, precision, recall, F1 score or ROC AUC. If we want to create a tool for crypto sentiment analysis of posts about blockchain assets, we would first have to train a model on a large number of labelled social media posts.
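To make the metrics mentioned above concrete, here is a minimal sketch that computes accuracy, precision, recall and F1 score for a binary classifier. The function name `classification_metrics` and the example labels are made up for illustration; they do not come from any real sentiment dataset.

```python
def classification_metrics(y_true, y_pred):
    """Compute basic metrics for binary labels, where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels: 1 = positive sentiment, 0 = negative sentiment.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In practice one would usually take these metrics from an established library rather than compute them by hand; the sketch only shows what the numbers mean.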

In the second case, making predictions or inferences, the data required to make a decision about us is our personal data or data about entities related to us.

Data privacy breaches in machine learning

So where can privacy violations occur in machine learning model usage?

One possibility is that the data is somehow retained in the weights of the machine learning model. If the model is made widely available after training, this may be a problem.

A second possibility is that personal data is not directly visible in the weights of the model but can be extracted by querying the model repeatedly.

A third possibility is that the machine learning model serves as a partial source of information, and personal information can be deduced by combining it with other, external sources.

This is an interesting paper on so-called model inversion:

How can we prevent privacy breaches in machine learning?

If we want to prevent privacy breaches when training and applying machine learning models, we need appropriate approaches. Data science consultant Alpha Quantum uses several best-practice approaches for this: differential privacy, federated learning and secure multi-party computation.

The first approach is simply to obfuscate the data that we use.

If we did a computer vision project, e.g. face detection, we could add noise to each image by random pixelization. This could involve randomly changing the color of pixels in the image or adding random pixels to the images of faces.
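The pixel noise described above could look roughly like the following sketch. It assumes an image stored as a plain list of rows of (R, G, B) tuples; the function name `add_pixel_noise` and the `fraction` parameter are hypothetical choices for this example. Note that this kind of ad hoc obfuscation does not by itself come with a formal privacy guarantee.

```python
import random


def add_pixel_noise(image, fraction=0.2, rng=None):
    """Return a copy of `image` where roughly `fraction` of the pixels
    are replaced by a random color. The input image is not modified."""
    if rng is None:
        rng = random.Random()
    noisy = []
    for row in image:
        new_row = []
        for pixel in row:
            if rng.random() < fraction:
                # Replace this pixel with a uniformly random RGB color.
                new_row.append((rng.randrange(256),
                                rng.randrange(256),
                                rng.randrange(256)))
            else:
                new_row.append(pixel)
        noisy.append(new_row)
    return noisy


# A tiny 4x4 uniformly gray "image" standing in for a face photo.
image = [[(128, 128, 128)] * 4 for _ in range(4)]
noisy_image = add_pixel_noise(image, fraction=0.5, rng=random.Random(0))
changed = sum(1 for r_old, r_new in zip(image, noisy_image)
              for a, b in zip(r_old, r_new) if a != b)
print(f"{changed} of 16 pixels were altered")
```

A larger `fraction` hides more of the original image but also makes it less useful for training, which is the same trade-off that appears in every noise-based approach.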

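The approach in which we randomly alter the data in a controlled way is known as differential privacy. The random noise is calibrated so that the result changes very little whether or not any single person's data is included.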

We can control how much we change the personal data with a parameter called epsilon: the smaller the epsilon, the more noise is added and the stronger the privacy protection.
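As an illustration of the role of epsilon, here is a minimal sketch of the Laplace mechanism, one common way to achieve differential privacy for numeric queries. It answers a counting query ("how many people are 40 or older?") with noise of scale 1/epsilon, since adding or removing one person changes a count by at most 1. The helper names and the list of ages are invented for this example.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise as the difference of two
    exponential samples that each have mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(values, predicate, epsilon, rng=None):
    """Return a noisy count of the values that satisfy `predicate`.
    A counting query has sensitivity 1, so the noise scale is 1/epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if rng is None:
        rng = random.Random()
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical personal data: ages of eight people.
ages = [23, 35, 41, 29, 52, 61, 38, 47]
rng = random.Random(42)
for eps in (0.1, 1.0, 10.0):
    answer = private_count(ages, lambda a: a >= 40, eps, rng)
    print(f"epsilon={eps}: noisy count = {answer:.2f}")
```

With a small epsilon the answers scatter widely around the true count of 4; with a large epsilon they stay close to it, which means better accuracy but weaker privacy.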

One problem with differential privacy is that the original data can still be reconstructed by running the machine learning model repeatedly: each repetition averages out more of the randomness, so the original data can be recovered after a sufficient number of iterations.
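The averaging effect described above can be demonstrated with a short simulation. The sketch assumes a query whose true answer is 4 and which is answered with fresh Laplace noise every time; `average_of_queries` is a hypothetical helper that plays the role of an attacker asking the same question over and over.

```python
import random
import statistics


def noisy_answer(true_value, epsilon, rng):
    """Answer a sensitivity-1 query with fresh Laplace(0, 1/epsilon) noise."""
    scale = 1.0 / epsilon
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_value + noise


def average_of_queries(true_value, epsilon, n_queries, rng):
    """Ask the same query `n_queries` times and average the noisy answers."""
    answers = [noisy_answer(true_value, epsilon, rng) for _ in range(n_queries)]
    return statistics.mean(answers)


rng = random.Random(7)
for n in (1, 100, 10_000):
    estimate = average_of_queries(true_value=4, epsilon=1.0, n_queries=n, rng=rng)
    print(f"{n:>6} queries: average = {estimate:.3f}")
```

Because the noise has zero mean, the average drifts toward the true value as the number of queries grows, so unlimited repeated queries erode the protection.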

That is why differential privacy implementations use a so-called privacy budget: every run consumes part of the budget, so the machine learning model can only be run a limited number of times.
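A privacy budget can be sketched as a small accountant object that refuses further queries once the budget is spent. The sketch below assumes the simplest accounting rule, in which the epsilons of successive queries add up; the class `PrivacyBudget` and the function `budgeted_count` are hypothetical names, not the API of any particular library.

```python
import random


class PrivacyBudget:
    """Track how much of a total epsilon budget has been spent."""

    def __init__(self, total_epsilon):
        self.remaining = total_epsilon

    def spend(self, epsilon):
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        # Small tolerance so floating-point rounding does not block a valid query.
        if epsilon > self.remaining + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.remaining -= epsilon


def budgeted_count(values, predicate, epsilon, budget, rng):
    """Noisy count that first charges `epsilon` against the budget."""
    budget.spend(epsilon)  # raises if the budget does not cover this query
    scale = 1.0 / epsilon
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return sum(1 for v in values if predicate(v)) + noise


ages = [23, 35, 41, 29, 52, 61, 38, 47]  # hypothetical personal data
budget = PrivacyBudget(total_epsilon=1.0)
rng = random.Random(3)
for attempt in range(3):
    try:
        result = budgeted_count(ages, lambda a: a >= 40, 0.5, budget, rng)
        print(f"query {attempt + 1}: {result:.2f}")
    except RuntimeError as err:
        print(f"query {attempt + 1}: refused ({err})")
```

Here the first two queries succeed and the third is refused, because two queries at epsilon 0.5 use up the total budget of 1.0.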

Not all fields have a problem with data privacy. Consider, for example, recent results in AI content generation obtained with models such as GPT-2 and BERT (Bidirectional Encoder Representations from Transformers): no large amount of personal data is inherently involved in those AI content generation tools. Care should only be taken with the data input to those models, namely that any personal information is removed from it.

AI content writing tools have greatly improved in the last couple of years. Traditional content writing services may stay relevant for technical fields such as machine learning, as these topics are still a bit complicated, but content writing for non-technical fields may become interesting for AI tools in the next few years.