In this explainer, we delve into the machine learning process. We'll focus on the pre-training steps involved in building a good machine learning model: Data Preparation, Data Representation and Model Selection. We aim to give you an overview of each step so that you can understand what to expect when creating your own machine learning models. We'll use examples in later sections and courses to illustrate each step.
This explainer accompanies a talk, so below we summarise the key takeaway points.
Data Preparation for Machine Learning
Data preparation is a huge part of any machine learning project, and at digiLab we estimate it can take up to 60% of a project's time when working with new data. This makes it essential to develop efficient data workflows that allow us to quickly and accurately clean, organise, and use relevant data during our machine learning projects; it is the only way to ensure we can reliably and quickly make use of data in our models.
In this explainer, we summarise different types of data issues and talk about high-level strategies for resolving them. These include:

- Non-alignment of Data - This is really common in multi-channel time series data. Resolving it involves syncing data to the same timestamps, which naturally involves a process called interpolation (see the first sketch after this list).
- Missing or Partial Data - A really common problem; I've seen everything from blanks and "-99999" to "gone home" in my datasets. How do we deal with this in a sensible way? Filling in missing values (imputation) can be a machine learning task in itself (see the second sketch after this list).
- Noisy Data or Outliers - When is data noisy, rather than an outlier? Any real-world data is noisy. Many techniques are available; the best known are filtering-based methods (see the third sketch after this list).
- Imbalanced Data - This is a really common problem in classification tasks when looking at rare events, for example, pictures of skin cancer or buildings falling down. These cases cause real difficulty, and we talk through techniques for handling them by both over- and under-sampling (see the final sketch after this list).
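To make the first point concrete, here is a minimal sketch of aligning two time series channels onto a shared time grid using pandas. The channel names, timestamps and grid spacing are all illustrative.

```python
import pandas as pd

# Two channels sampled at different (non-aligned) timestamps
temperature = pd.Series(
    [20.1, 20.4, 21.0],
    index=pd.to_datetime(["2023-01-01 00:00", "2023-01-01 00:07",
                          "2023-01-01 00:13"]),
)
pressure = pd.Series(
    [1.01, 1.03, 1.02],
    index=pd.to_datetime(["2023-01-01 00:02", "2023-01-01 00:08",
                          "2023-01-01 00:15"]),
)

def align(series, grid):
    """Interpolate a channel onto a common time grid."""
    union = series.reindex(series.index.union(grid))
    return union.interpolate(method="time").reindex(grid)

# Common 5-minute grid; points outside a channel's observed
# range are not extrapolated and remain NaN
grid = pd.date_range("2023-01-01 00:00", periods=4, freq="5min")
aligned = pd.DataFrame({"temperature": align(temperature, grid),
                        "pressure": align(pressure, grid)})
print(aligned)
```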
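For missing or partial data, a sensible first step is to normalise all the oddities (blanks, sentinel values, free text) to NaN, and only then impute. Here is a minimal sketch; the column names are made up, and scikit-learn's SimpleImputer is just one of many options.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Raw data containing blanks, sentinel values and free text
df = pd.DataFrame({"sensor_a": [1.2, -99999, 3.1, None],
                   "sensor_b": [0.5, 0.7, "gone home", 0.9]})

# Normalise everything non-numeric (or sentinel) to NaN
df = df.apply(pd.to_numeric, errors="coerce").replace(-99999, np.nan)

# Simple baseline: fill each column with its median. Model-based
# imputers (e.g. KNN or iterative imputation) treat this step as
# a learning task in its own right.
imputer = SimpleImputer(strategy="median")
clean = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(clean)
```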
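For noisy data, here is a sketch of two filtering-based methods on a synthetic signal: a moving average smooths measurement noise but smears an outlier into its neighbours, while a median filter suppresses the outlier as well.

```python
import numpy as np
from scipy.signal import medfilt

rng = np.random.default_rng(0)
t = np.linspace(0, 10, 200)
signal = np.sin(t) + rng.normal(scale=0.1, size=t.size)  # noisy measurements
signal[50] = 8.0                                         # a single gross outlier

# Moving average: smooths noise, but spreads the outlier around
smoothed = np.convolve(signal, np.ones(5) / 5, mode="same")

# Median filter: robust to the isolated outlier
filtered = medfilt(signal, kernel_size=5)
```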
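Finally, for imbalanced data, here is a minimal sketch of random over-sampling of the minority class (under-sampling the majority class is the mirror image). The class proportions and features are toy values.

```python
import numpy as np
from sklearn.utils import resample

# Toy imbalanced dataset: 95 common examples, 5 rare-event examples
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(95, 2)), rng.normal(size=(5, 2)) + 3])
y = np.array([0] * 95 + [1] * 5)

# Over-sample the minority class (with replacement) to match the majority
X_min_up, y_min_up = resample(X[y == 1], y[y == 1], replace=True,
                              n_samples=(y == 0).sum(), random_state=0)

X_balanced = np.vstack([X[y == 0], X_min_up])
y_balanced = np.concatenate([y[y == 0], y_min_up])
print(np.bincount(y_balanced))  # now [95, 95]
```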
Starting with the end in mind
It's important to remember that building a good dataset needs to start right at the beginning of the project. Many issues related to data preparation can be resolved if you plan ahead. It also pays off to think about how your data will eventually be used in a machine learning model, and plan accordingly. Finally, it's useful to store more information than just the data itself, such as metadata about how it was collected. Following the FAIR principles of Findability, Accessibility, Interoperability and Reusability can help you create a data management plan that will serve you well in the future. The Turing Way is a great resource to get started with understanding these principles - here is the link: Turing Way
Data Representation
When designing a machine learning model, it is essential to consider carefully which features of the data should be included and which should be left out. With large datasets, it can be tempting to allow the algorithm to do all of the work; however, this approach can be costly and inefficient. A concept known as the "curse of dimensionality" indicates that the data requirements expand exponentially with the size of the input space. Unsupervised learning techniques are often applied to reduce the dimension of the input, whilst preserving as much information as possible. These techniques form an integral part of developing a successful machine learning workflow.
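As an example of such a technique, here is a minimal sketch of principal component analysis (PCA), probably the best-known unsupervised dimensionality-reduction method. The dataset is synthetic: 50 observed features driven by 3 underlying factors, so PCA can compress it heavily while preserving most of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))                        # 3 underlying factors
mixing = rng.normal(size=(3, 50))
X = latent @ mixing + 0.05 * rng.normal(size=(500, 50))   # 50 observed features

# Keep enough components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)
print("variance explained:", pca.explained_variance_ratio_.sum())
```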
Model Selection
Model selection requires trying a variety of approaches in order to select the best model for a particular dataset, task and desired outcome. It's important to understand the different options available in order to develop an effective machine learning workflow. twinLab is one platform that can assist with this process, by ranking models based on their score (which takes into account both lack of fit and model complexity). This helps to ensure that the most suitable model is selected for the job.
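The details of twinLab's score are its own, but to illustrate the general idea of balancing lack of fit against model complexity, here is a sketch using the Bayesian information criterion (BIC) to rank polynomial models of increasing degree. Everything here is a stand-in for illustration, not twinLab's actual method.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(scale=0.1, size=x.size)

# Score = lack-of-fit term + complexity penalty (BIC, Gaussian errors)
def bic(y, y_hat, n_params):
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    return n * np.log(rss / n) + n_params * np.log(n)

for degree in range(1, 6):
    coeffs = np.polyfit(x, y, degree)
    y_hat = np.polyval(coeffs, x)
    print(f"degree {degree}: BIC = {bic(y, y_hat, degree + 1):.1f}")
# The lowest BIC picks out the quadratic: the model that fits well
# without unnecessary complexity.
```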