Data hunt for better machine learning 


Artificial intelligence, and machine learning in particular, has experienced an enormous upswing over the last ten years. Many industries are now investing heavily in solutions based on machine learning, and the demand for qualified specialists has increased by leaps and bounds.

Several universities worldwide offer degrees focused on data science or artificial intelligence, and these topics are also gaining importance at German universities. Universities, however, concentrate primarily on mathematical and theoretical concepts, while the skills and knowledge required to train machine learning models on real-world problems can look very different.

Availability of the necessary data

In most cases, the availability of data determines whether or not machine learning can be used to solve a particular problem. Before starting a new project, the question arises: will a model trained on this data provide the correct answers most of the time?

This question applies regardless of the model, library, or language chosen for the ML experiment. And there are other decisive criteria: a model is only as good as the data that is fed into it. It is therefore important to clarify the following points (see the sketch after this list):

  • Is there enough data to train a good model? As long as it stays within the hardware budget, using more data is almost always the right choice.
  • Are the labels reliable enough for a supervised learning process? Is the model being fed the correct information?
  • Is this data an accurate representation of the real distribution? Are there enough variations in the samples to cover the problem area?
  • Is there constant access to a steady stream of new data that can be used to update the model and keep it current?
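
Some of these points, data volume and label distribution in particular, can be probed with very little code. The following is a minimal sketch using pandas, with entirely hypothetical column names and values:

```python
import pandas as pd

# Hypothetical training table: "label" marks the class of each sample.
df = pd.DataFrame({
    "feature": [0.1, 0.4, 0.35, 0.8, 0.05, 0.9],
    "label":   ["benign", "malware", "benign", "malware", "benign", "benign"],
})

# Enough data? A raw sample count is the first, crudest indicator.
print(f"samples: {len(df)}")

# Representative distribution? A heavily skewed label split is an early
# warning that the samples may not cover the problem area.
print(df["label"].value_counts(normalize=True))
```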

 

Compiling the data

The data required to create a data set for an ML solution is often distributed across multiple sources: different parts of a sample are collected by different products and managed by different teams on different platforms. The next step in the process is therefore usually to combine all of this data into a single format and store it where it is easily accessible.
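
What this aggregation step looks like depends entirely on the infrastructure. The sketch below assumes two hypothetical exports, a CSV of telemetry events and a JSON file of analyst labels, joined on an assumed shared identifier and stored as Parquet:

```python
import pandas as pd

# Hypothetical sources: each team exports its part of a sample differently.
events = pd.read_csv("telemetry/events.csv")    # product A, CSV export
labels = pd.read_json("analyst/labels.json")    # product B, JSON export

# Join on a shared sample identifier (assumed here to be "sample_id")
# so that every row carries both the raw features and the label.
dataset = events.merge(labels, on="sample_id", how="inner")

# Store in one easily accessible, columnar format for training.
dataset.to_parquet("datastore/training_set.parquet", index=False)
```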

More challenges and a curse

With the data collected and aggregated, one would think that the fabulous new ML algorithm could now be unleashed. However, further steps are still necessary, because several obstacles inevitably have to be overcome:

Missing data

Sometimes valid values may not be available for all observations. Data can be corrupted during collection, storage, or transmission, so it is important to find these missing data points and, if necessary, delete them from the data set.
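
With pandas, locating and removing such gaps takes only a few lines; the file path and column names below are again hypothetical:

```python
import pandas as pd

df = pd.read_parquet("datastore/training_set.parquet")  # assumed path

# Count missing values per column: a spike in a single column often
# points to a broken collection or transmission step, not random noise.
print(df.isna().sum())

# Drop rows whose essential fields are missing; imputation is an
# alternative when deletion would discard too many samples.
df = df.dropna(subset=["feature", "label"])  # hypothetical column names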

Duplicate data

While this is not a particularly alarming problem in terms of model performance, duplicate data should be removed from the data store to make the model training process more efficient and possibly avoid overfitting.
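
Removing exact duplicates is the most mechanical of these steps; a sketch along these lines (path assumed) is usually enough:

```python
import pandas as pd

df = pd.read_parquet("datastore/training_set.parquet")  # assumed path

# Exact duplicates inflate training time and can push the model towards
# overfitting on repeated samples; keep only the first occurrence.
before = len(df)
df = df.drop_duplicates()
print(f"removed {before - len(df)} duplicate rows")
```

Near duplicates, such as the same text with different whitespace, survive this step and have to be handled during text normalization.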

Different normalization schemes

Slight differences in the way data is processed and stored can become a major headache when training a model. For example, different products may truncate the same free text field to different lengths or anonymize data differently, which leads to inconsistencies in the data set. If one of these sources contains predominantly malware and another predominantly benign samples, the ML model can learn to tell them apart based on, for example, the truncation length rather than anything meaningful.
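
One way to defuse this particular leak is to enforce a common truncation length across all sources before training. In the sketch below, MAX_LEN is an assumed value, not a universal recommendation:

```python
MAX_LEN = 256  # assumed common truncation length across all sources

def normalize_length(text: str) -> str:
    """Cap every free text field at the same length so the model cannot
    separate data sources by their original truncation length alone."""
    return text[:MAX_LEN]

# Source A originally cut at 512 characters, source B at 128: after
# this step neither exceeds the common cap.
sample_a = normalize_length("x" * 512)
sample_b = normalize_length("y" * 128)
assert len(sample_a) <= MAX_LEN and len(sample_b) <= MAX_LEN
```

The same reasoning applies to anonymization: whatever scheme is chosen, it has to be applied uniformly across all sources.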

Free text field data

This actually deserves a category of its own because it can be so difficult to deal with. Free text fields are the data engineer's bane: typing errors, slang, near duplicates, variations in upper and lower case, spacing, punctuation, and a host of other inconsistencies all have to be handled.
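
There is no single correct cleanup pipeline, but a first pass often looks something like the following sketch, which unifies Unicode form, case, punctuation, and whitespace:

```python
import re
import unicodedata

def clean_text(raw: str) -> str:
    """One possible normalization pass for free text fields."""
    text = unicodedata.normalize("NFKC", raw)  # unify Unicode variants
    text = text.lower()                        # case variations
    text = re.sub(r"[^\w\s]", " ", text)       # punctuation -> space
    text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace
    return text

print(clean_text("  TroJan.Win32 --  DropPer!!  "))  # "trojan win32 dropper"
```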

Constant updating

Finally, data drift is an important problem to address when designing an ML system. Once a model is trained, it becomes more and more imprecise over time as the distribution of the incoming data changes. Regular model updates should therefore be established to ensure that performance stays within expected limits.
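
How drift is detected is a design decision in itself. One simple option, sketched here, is a two-sample Kolmogorov-Smirnov test on an individual feature, comparing the training distribution with the live stream (simulated below with a shifted distribution):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Feature values seen at training time vs. the current incoming stream
# (the shift in the mean mimics drift for this illustration).
train_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)
live_feature = rng.normal(loc=0.4, scale=1.0, size=5_000)

# Two-sample Kolmogorov-Smirnov test: a small p-value signals that the
# live distribution no longer matches the training distribution.
stat, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.01:
    print(f"drift detected (KS={stat:.3f}); schedule a model retrain")
```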

In security, for example, we see considerable volatility as threat actors change their exploits and behavior over time, and as vulnerabilities are discovered and patched.

This was a brief summary of the typical steps required to select, collect, and clean data for an ML solution. Once all of them have been carried out, a reasonably clean data set should be available. Let the experiment begin.
