Splitting your data

Every machine learning course talks about splitting your data. Surprisingly, many people don't understand how to use each set properly. Let's talk about some of the things you should know about splitting your data:
We usually split the data into three different sets:

1. Train set
2. Validation set
3. Test set

The first thing you should do at this point: forget that your test set exists.
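A three-way split can be sketched in a few lines of plain Python. The 70/15/15 ratios, the seed, and the function name are illustrative choices, not a prescription:

```python
import random

def split_data(rows, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle and split rows into train/validation/test sets.

    The 70/15/15 ratios are illustrative; pick proportions that fit
    your dataset. Whatever remains after train and validation
    becomes the test set.
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # deterministic shuffle
    n = len(rows)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = rows[:n_train]
    val = rows[n_train:n_train + n_val]
    test = rows[n_train + n_val:]  # set this aside and forget it exists
    return train, val, test

train, val, test = split_data(range(1000))
print(len(train), len(val), len(test))  # 700 150 150
```

In practice you'd also want a stratified split for imbalanced labels; this sketch only shows the basic mechanics.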
Train set: The data you'll use to train your model. This set is your entire world. No other data exists outside of this train set. Going forward, you'll use this data for every analysis, transformation, and decision.
Validation set: As you experiment, you'll use this data to compute your model's performance and decide how to improve it. The validation set gives you feedback. You can use this feedback to improve your model.
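That feedback loop can be sketched with a toy example: a threshold classifier where the validation set decides which candidate setting to keep. The data, the candidate thresholds, and the scoring function are all illustrative:

```python
# Toy data: (feature, label) pairs. Entirely made up for illustration.
train_set = [(0.2, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
val_set = [(0.3, 0), (0.7, 1), (0.8, 1)]

def accuracy(threshold, data):
    """Fraction of examples the rule `x > threshold` classifies correctly."""
    return sum((x > threshold) == bool(y) for x, y in data) / len(data)

# Use validation accuracy as the feedback signal to pick a model.
candidates = [0.3, 0.5, 0.7]
best = max(candidates, key=lambda t: accuracy(t, val_set))
```

The key point is that only the validation set scores the candidates; the test set plays no role in this loop.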
Here is the iterative process we follow:

• Train a model
• Evaluate it with your validation set
• Improve the model
• Evaluate it with your validation set
• Improve the model

Something happens because of this:
Inevitably, your model will start overfitting to the validation set after some time. Your model will become good at predicting the validation data, which won't be helpful anymore. That sucks, but here is how you fix it:
After several iterations, fold your validation set back into your train set and draw a fresh validation set. If you don't have more data, rely on k-fold cross-validation instead.
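K-fold cross-validation sidesteps the problem by rotating the validation role across the data. A minimal sketch of the index generation, assuming k=5 folds (the function name and defaults are mine):

```python
def kfold_indices(n, k=5):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation.

    Each of the k folds serves once as the validation set while the
    remaining folds form the train set. k=5 is a common default.
    """
    # Distribute any remainder across the first n % k folds.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n) if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size

folds = list(kfold_indices(10, k=5))
print([len(v) for _, v in folds])  # [2, 2, 2, 2, 2]
```

You would train a fresh model on each train split and average the k validation scores, which gives a more stable estimate than any single validation set.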
Test set: Until the very end, you never look at your test data. You never use it to do any analysis or transformations. Never make decisions that affect your model using the test data. You treat your test data as if it doesn't exist.
The goal of your test set: to provide a final, unbiased estimate of your model's performance. A good test set will give you a performance estimate close to what you'll see on production data.
Many people run their model on their test set and discover that their model is not good. They go back and make changes to the model until the performance improves. There's nothing wrong with that, except when they use the same test set again!
Use your test data once. After that, merge it into your train set and find new test data.
The effectiveness of your test set decreases every time you use it. Soon, the test set will no longer be an accurate measure of how good your model is.
Let me TL;DR this quickly:

1. After a few iterations, rotate your validation data.
2. Don't use your test set more than once.
This post is a good reflection of the content I post weekly. Follow me @svpino for practical tips and stories as I deconstruct my experience as an engineer focused on applied machine learning.
Santiago

@svpino
