Many people new to machine learning have no idea that labeling data is a problem they need to think about.
To be clear: in the real world, data rarely comes labeled.
Here is an excellent approach to getting past this problem:
I've found that getting access to lots of data is often not the issue.
Getting access to labeled data usually is.
Sometimes you can solve this by throwing people at the problem, but this is not always a solution.
Looking for cats in images? Labeling will be relatively cheap.
Looking for cancer in x-rays? Good luck finding enough specialists to work with you.
Even worse: imagine that "labeling" means having to drill a hole to determine whether there's oil underground.
If you don't have reliable labels, you can't use Supervised Learning techniques.
How do you move past this?
Let's talk about Active Learning.
Active Learning is a semi-supervised learning method.
Bottom line: we can start building a model with a few labeled samples.
Not zero, but not the entire dataset.
We will start training with a few samples and interactively ask for new labeled data as we need it.
The critical idea here:
We will only need to label the most informative samples.
Instead of needing 10,000 labels in a 10,000-sample dataset, we will only need a few of them: the essential ones that will maximize the learning of our model.
Here is a rough, hypothetical sketch of how it works:
• Start with 1,000 labeled samples
• Train a model
• Use that model to predict the remaining 9,000 unlabeled samples
• Label the 10% of those samples with the worst predictions
• Repeat
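To make the loop above concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the `oracle` function stands in for the expensive human labeler, the "model" is a toy 1-D threshold rather than a real classifier, and the numbers are scaled down to 1,000 samples with 10 seed labels. "Worst predictions" is interpreted here as the samples closest to the current decision boundary.

```python
def oracle(x):
    """Hypothetical stand-in for the human labeler (true boundary at 0.6)."""
    return int(x > 0.6)


def train(labeled):
    """Toy 1-D model: put the threshold midway between the two classes.

    Assumes the labeled set contains at least one sample of each class.
    """
    highest_neg = max(x for x, y in labeled if y == 0)
    lowest_pos = min(x for x, y in labeled if y == 1)
    return (highest_neg + lowest_pos) / 2


def active_learning(pool, seed_every=100, rounds=3, fraction=0.10):
    # Start with a small, evenly spaced labeled seed set.
    labeled = [(x, oracle(x)) for x in pool[::seed_every]]
    unlabeled = [x for i, x in enumerate(pool) if i % seed_every != 0]

    for _ in range(rounds):
        threshold = train(labeled)
        # Rank unlabeled samples: closest to the boundary = least certain.
        unlabeled.sort(key=lambda x: abs(x - threshold))
        k = max(1, int(len(unlabeled) * fraction))
        queried, unlabeled = unlabeled[:k], unlabeled[k:]
        # Only now do we pay for labels, and only for the queried samples.
        labeled += [(x, oracle(x)) for x in queried]

    return train(labeled), len(labeled)


pool = [i / 1000 for i in range(1000)]
threshold, n_labeled = active_learning(pool)
print(f"boundary ~ {threshold:.3f} using {n_labeled} of {len(pool)} labels")
```

In a real project you would swap `train` for an actual classifier and `oracle` for whatever process produces your labels. The shape of the loop stays the same.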
The "worst predictions" I mentioned above are the samples that will be the most informative.
But how do you measure that?
You can use multiple heuristics to determine which samples to label next.
Here are some existing methods that you can use:
• Least Confidence Uncertainty
• Smallest Margin Uncertainty
• Entropy Reduction
If you are interested, do some research on each of these.
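As a starting point, here is a rough sketch of how the three heuristics above could be scored from a model's predicted class probabilities. The function names are my own, and I'm assuming "Entropy Reduction" refers to ranking samples by the Shannon entropy of the predicted distribution.

```python
import math


def least_confidence(probs):
    """1 minus the top class probability. Higher = more uncertain."""
    return 1.0 - max(probs)


def smallest_margin(probs):
    """Gap between the two most likely classes. Lower = more uncertain."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second


def entropy(probs):
    """Shannon entropy in bits. Higher = more uncertain."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Hypothetical predictions for three unlabeled samples (3 classes each).
predictions = {
    "confident": [0.90, 0.05, 0.05],
    "torn": [0.50, 0.45, 0.05],
    "clueless": [0.34, 0.33, 0.33],
}

for name, probs in predictions.items():
    print(
        name,
        round(least_confidence(probs), 3),
        round(smallest_margin(probs), 3),
        round(entropy(probs), 3),
    )
```

Whichever score you pick, the idea is the same: sort the unlabeled samples by it and send the most uncertain ones to be labeled first.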
To recap:
We can use Active Learning to train a model while minimizing the amount of labeled data.
This translates into significant savings. Sometimes, this is the difference that makes your solution viable.
This post is a good reflection of the content I post weekly.
Follow me @svpino for practical tips and stories as I deconstruct my experience as an engineer focused on applied machine learning.