Data hunt for better machine learning 


Artificial intelligence, and machine learning in particular, has experienced an enormous upswing over the last ten years. Many industries are now investing heavily in solutions based on machine learning, and the demand for qualified specialists has increased by leaps and bounds.

Several universities worldwide offer degrees focused on data science or artificial intelligence, and these topics are also gaining importance at German universities. Universities, however, concentrate primarily on mathematical and theoretical concepts, while the skills and knowledge required to train machine learning models on real-world problems can look very different.

Availability of the necessary data

In most cases, the availability of data determines whether or not machine learning can be used to solve a particular problem. Before starting a new project, the question arises: will a model trained on this data provide the correct answers most of the time?

This question applies regardless of the model, library, or language chosen for the ML experiment. And there are other decisive criteria: a model is only as good as the data that is fed into it. It is therefore important to clarify the following points (see the sketch after this list):

  • Is there enough data to train a good model? As long as it stays within the hardware budget, using more data is almost always the right choice.
  • Are the labels reliable enough for a supervised learning process? Is the model being fed the correct information?
  • Is this data an accurate representation of the real distribution? Are there enough variations in the samples to cover the problem area?
  • Is there constant access to a steady stream of new data that can be used to update the model and keep it current?
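
Some of these points, data volume and label distribution in particular, can be probed with very little code. The following is a minimal sketch using pandas, with entirely hypothetical column names and values:

```python
import pandas as pd

# Hypothetical training table: "label" marks the class of each sample.
df = pd.DataFrame({
    "feature": [0.1, 0.4, 0.35, 0.8, 0.05, 0.9],
    "label":   ["benign", "malware", "benign", "malware", "benign", "benign"],
})

# Enough data? A raw sample count is the first, crudest indicator.
print(f"samples: {len(df)}")

# Representative distribution? A heavily skewed label split is an early
# warning that the samples may not cover the problem area.
print(df["label"].value_counts(normalize=True))
```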

 

Compiling the data

The data required to create a data set for an ML solution is often distributed across multiple sources: different parts of a sample are collected by different products and managed by different teams on different platforms. The next step in the process is therefore usually to combine all of this data into a single format and store it where it is easily accessible.
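
What this aggregation step looks like depends entirely on the infrastructure. The sketch below assumes two hypothetical exports, a CSV of telemetry events and a JSON file of analyst labels, joined on an assumed shared identifier and stored as Parquet:

```python
import pandas as pd

# Hypothetical sources: each team exports its part of a sample differently.
events = pd.read_csv("telemetry/events.csv")    # product A, CSV export
labels = pd.read_json("analyst/labels.json")    # product B, JSON export

# Join on a shared sample identifier (assumed here to be "sample_id")
# so that every row carries both the raw features and the label.
dataset = events.merge(labels, on="sample_id", how="inner")

# Store in one easily accessible, columnar format for training.
dataset.to_parquet("datastore/training_set.parquet", index=False)
```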

More challenges and a curse

With the data collected and aggregated, one would think that the fabulous new ML algorithm could now be unleashed. However, further steps are still necessary, because several obstacles inevitably have to be overcome:

Missing data

Sometimes valid values may not be available for all observations. Data can be corrupted during collection, storage, or transmission, so it is important to find these missing data points and, if necessary, delete them from the data set.
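
With pandas, locating and removing such gaps takes only a few lines; the file path and column names below are again hypothetical:

```python
import pandas as pd

df = pd.read_parquet("datastore/training_set.parquet")  # assumed path

# Count missing values per column: a spike in a single column often
# points to a broken collection or transmission step, not random noise.
print(df.isna().sum())

# Drop rows whose essential fields are missing; imputation is an
# alternative when deletion would discard too many samples.
df = df.dropna(subset=["feature", "label"])  # hypothetical column names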

Duplicate data

While this is not a particularly alarming problem in terms of model performance, duplicate data should be removed from the data store to make the model training process more efficient and possibly avoid overfitting.
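
Removing exact duplicates is the most mechanical of these steps; a sketch along these lines (path assumed) is usually enough:

```python
import pandas as pd

df = pd.read_parquet("datastore/training_set.parquet")  # assumed path

# Exact duplicates inflate training time and can push the model towards
# overfitting on repeated samples; keep only the first occurrence.
before = len(df)
df = df.drop_duplicates()
print(f"removed {before - len(df)} duplicate rows")
```

Near duplicates, such as the same text with different whitespace, survive this step and have to be handled during text normalization.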

Different normalization schemes

Slight differences in the way data is processed and stored can become a major headache when training a model. For example, different products may truncate the same free text field to different lengths or anonymize data differently, which leads to inconsistencies in the data set. If one of these sources contains predominantly malware and another predominantly benign samples, the ML model can learn to tell them apart based on, for example, the truncation length rather than anything meaningful.
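
One way to defuse this particular leak is to enforce a common truncation length across all sources before training. In the sketch below, MAX_LEN is an assumed value, not a universal recommendation:

```python
MAX_LEN = 256  # assumed common truncation length across all sources

def normalize_length(text: str) -> str:
    """Cap every free text field at the same length so the model cannot
    separate data sources by their original truncation length alone."""
    return text[:MAX_LEN]

# Source A originally cut at 512 characters, source B at 128: after
# this step neither exceeds the common cap.
sample_a = normalize_length("x" * 512)
sample_b = normalize_length("y" * 128)
assert len(sample_a) <= MAX_LEN and len(sample_b) <= MAX_LEN
```

The same reasoning applies to anonymization: whatever scheme is chosen, it has to be applied uniformly across all sources.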

Free text field data

This actually deserves a category of its own because it can be so difficult to deal with. Free text fields are the data engineer's bane: typing errors, slang, near duplicates, variations in upper and lower case, spacing, punctuation, and a host of other inconsistencies all have to be handled.
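
There is no single correct cleanup pipeline, but a first pass often looks something like the following sketch, which unifies Unicode form, case, punctuation, and whitespace:

```python
import re
import unicodedata

def clean_text(raw: str) -> str:
    """One possible normalization pass for free text fields."""
    text = unicodedata.normalize("NFKC", raw)  # unify Unicode variants
    text = text.lower()                        # case variations
    text = re.sub(r"[^\w\s]", " ", text)       # punctuation -> space
    text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace
    return text

print(clean_text("  TroJan.Win32 --  DropPer!!  "))  # "trojan win32 dropper"
```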

Constant updating

Finally, data drift is an important problem to address when designing an ML system. Once a model is trained, it becomes more and more imprecise over time as the distribution of the incoming data changes. Regular model updates should therefore be established to ensure that performance stays within expected limits.
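
How drift is detected is a design decision in itself. One simple option, sketched here, is a two-sample Kolmogorov-Smirnov test on an individual feature, comparing the training distribution with the live stream (simulated below with a shifted distribution):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Feature values seen at training time vs. the current incoming stream
# (the shift in the mean mimics drift for this illustration).
train_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)
live_feature = rng.normal(loc=0.4, scale=1.0, size=5_000)

# Two-sample Kolmogorov-Smirnov test: a small p-value signals that the
# live distribution no longer matches the training distribution.
stat, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.01:
    print(f"drift detected (KS={stat:.3f}); schedule a model retrain")
```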

In security, for example, we see considerable volatility as threat actors change their exploits and behavior over time, and as vulnerabilities are discovered and patched.

This was a brief summary of the typical steps required to select, collect, and clean data for an ML solution. Once all of them have been carried out, a reasonably clean data set should be available. Let the experiment begin.
