Thirty days of machine learning with Kaggle

Last month, I received an email from Kaggle inviting me to participate in a beginner-friendly “30 Days of Machine Learning” challenge. It was a timely reminder to continue on my data science learning journey, especially since I hadn’t made any progress in quite some time.

Source: Kaggle

It was scheduled to start on Monday 2nd August, but there was just one small problem — it was right smack in the middle of Olympics season. Given that this Olympics would be held in Tokyo, conveniently within the Asian timezone, I had already planned to catch as many games as possible.

When I got the first email, which laid out the curriculum for the first week, I was relieved: it basically covered the Kaggle Learn introductory Python course, which I had already completed late last year. So I was able to fully enjoy the Olympics guilt-free, especially the many exciting badminton, volleyball, sport climbing and archery competitions.

The daily emails from Kaggle arrived one after another, and since I was used to ignoring them, I continued ignoring them until early this week, when I saw the latest email with the subject: “Welcome to the final week of the 30 Days of ML program!”

Oh, crap! It looks like I have a lot of catching up to do, and not a lot of time.

Curriculum

I decided to systematically go through all the past emails and dutifully do all the tutorials and exercises. Thankfully they were quite concise and I was able to finish all of them in a few days. The links to each tutorial have been added for ease of reference.

Day 1: Level up to Contributor
Day 2: Hello, Python (Python Lesson 1)
Day 3: Functions and Getting Help (Python Lesson 2)
Day 4: Booleans and Conditionals (Python Lesson 3)
Day 5: Lists (Python Lesson 4); Loops and List Comprehensions (Python Lesson 5)
Day 6: Strings and Dictionaries (Python Lesson 6)
Day 7: Working with External Libraries (Python Lesson 7)
Day 8: How Models Work (Intro to ML Lesson 1); Basic Data Exploration (Intro to ML Lesson 2)
Day 9: Your First Machine Learning Model (Intro to ML Lesson 3); Model Validation (Intro to ML Lesson 4)
Day 10: Underfitting and Overfitting (Intro to ML Lesson 5); Random Forests (Intro to ML Lesson 6)
Day 11: Machine Learning Competitions (Intro to ML Lesson 7)
Day 12: Introduction (Intermediate ML Lesson 1); Missing Values (Intermediate ML Lesson 2); Categorical Variables (Intermediate ML Lesson 3)
Day 13: Pipelines (Intermediate ML Lesson 4); Cross-Validation (Intermediate ML Lesson 5)
Day 14: XGBoost (Intermediate ML Lesson 6); Data Leakage (Intermediate ML Lesson 7)
Days 15 to 30: Participate in the “30 Days of ML” competition
Source: Kaggle

I had already earned the certificate of completion for the introductory Python course, and promptly added the new “Intro to Machine Learning” and “Intermediate Machine Learning” certificates to my collection.

The intermediate machine learning course was quite useful and provided examples of several important concepts and techniques.

Pipelines

Perhaps the most practical one was the idea of using Pipelines to combine data preprocessing and model specification into one easy-to-manage process. Below is a code snippet from a Kaggle-hosted notebook that gives a concrete example of how a simple pipeline is coded.

Source: Kaggle
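
For reference, a bare-bones version of such a pipeline looks something like this. It's a minimal sketch rather than the notebook's exact code, and it assumes the usual setup of X_train, y_train, X_valid, y_valid splits plus numerical_cols and categorical_cols lists of column names:

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Assumed to exist already: X_train, y_train, X_valid, y_valid,
# plus numerical_cols and categorical_cols (lists of column names).

# Preprocessing for numerical data: fill in missing values
numerical_transformer = SimpleImputer(strategy='constant')

# Preprocessing for categorical data: impute, then one-hot encode
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle both preprocessing steps
preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_transformer, numerical_cols),
    ('cat', categorical_transformer, categorical_cols)
])

# Combine preprocessing and model specification in a single pipeline
my_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', RandomForestRegressor(n_estimators=100, random_state=0))
])

# One call preprocesses the data and fits the model; another predicts
my_pipeline.fit(X_train, y_train)
preds = my_pipeline.predict(X_valid)
print('MAE:', mean_absolute_error(y_valid, preds))
```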

Cross Validation

Cross Validation is a common technique for obtaining more representative error metrics across the entire training dataset, instead of relying on a single static train/validation split, and the scikit-learn (sklearn) package makes it ridiculously easy to implement.

Source: Kaggle
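
With a pipeline like the one sketched above, it really is a one-liner around cross_val_score(). Again, this assumes my_pipeline from the earlier sketch and that X and y hold the full training features and target:

```python
from sklearn.model_selection import cross_val_score

# Assumed to exist: my_pipeline (from the sketch above), X, y.
# sklearn scorers follow a "higher is better" convention, so the
# negative MAE is flipped back into a positive error metric.
scores = -1 * cross_val_score(my_pipeline, X, y,
                              cv=5,
                              scoring='neg_mean_absolute_error')

print('MAE scores:', scores)
print('Average MAE:', scores.mean())
```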

Hyperparameter Optimisation

The combination of pipelines and cross-validation, together with a user-defined function, enables Hyperparameter Optimisation to be implemented in an efficient manner.

In the code snippets below, the user-defined function get_score() takes in one parameter n_estimators, which is used to set the “number of trees” hyperparameter for the random forest model.

The function then defines a pipeline that uses the hyperparameter, and calculates an average error metric using (three-fold) cross-validation across the entire training dataset.

Source: Kaggle
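
The function looks something like this. In this sketch I'm assuming numeric-only features in X with the target in y, and a plain SimpleImputer standing in for the preprocessing step:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

def get_score(n_estimators):
    """Average MAE over 3-fold cross-validation for a random forest
    with the given number of trees."""
    my_pipeline = Pipeline(steps=[
        ('preprocessor', SimpleImputer()),
        ('model', RandomForestRegressor(n_estimators=n_estimators,
                                        random_state=0))
    ])
    scores = -1 * cross_val_score(my_pipeline, X, y,
                                  cv=3,
                                  scoring='neg_mean_absolute_error')
    return scores.mean()
```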

The get_score() user-defined function is then called eight times, each time with a different n_estimators value, and the respective error metrics are stored in the results dictionary. Notice how compact the code is when a dict comprehension is used to call the function!

Source: Kaggle
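
The call itself is a single dict comprehension; the candidate values below (50 to 400 trees in steps of 50) are illustrative:

```python
# Try eight settings for n_estimators and keep the average MAE for each
results = {n_estimators: get_score(n_estimators)
           for n_estimators in range(50, 450, 50)}
```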

The results are plotted in a simple matplotlib line chart, where it’s clear that the hyperparameter value of n_estimators=200 gives the lowest error metric, and hence the best-performing model among the eight settings.

Source: Kaggle
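
The plotting code is equally short:

```python
import matplotlib.pyplot as plt

# Average cross-validation MAE for each n_estimators setting
plt.plot(list(results.keys()), list(results.values()))
plt.xlabel('n_estimators')
plt.ylabel('Average MAE (3-fold CV)')
plt.show()
```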

XGBoost

The course also introduces the very popular and powerful XGBoost algorithm and shows how easy it is to implement (using default settings) in just three lines of code. There are clearly many more moving parts in the model, and it is definitely worth a separate deep dive.

Source: Kaggle
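
Those three lines boil down to define, fit and predict, reusing the assumed X_train/y_train/X_valid splits from earlier:

```python
from xgboost import XGBRegressor

my_model = XGBRegressor()                  # 1. define the model (default settings)
my_model.fit(X_train, y_train)             # 2. fit it on the training data
predictions = my_model.predict(X_valid)    # 3. predict on the validation data
```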

Competition

With the necessary foundation in Python and machine learning in place, days 15 to 30 of the programme are dedicated to a customised InClass competition. The regression problem requires target values to be predicted from a set of categorical and numerical features.

The first competition submission is trivial, as Kaggle provides a notebook with the data preprocessing and model building already coded. All that’s needed is to run the entire notebook and submit the predictions for the test set, and you’re immediately placed on the leaderboard.

The default notebook uses an OrdinalEncoder() for the categorical features and a RandomForestRegressor() with default settings for the model, and doesn’t implement a pipeline. For my second submission, I made a simple change of the model to XGBoost, which improved my score and moved me up the rankings.
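
In spirit, the change is as simple as swapping the estimator; a rough sketch with placeholder variable names (not the notebook’s actual ones) would be:

```python
from sklearn.preprocessing import OrdinalEncoder
from xgboost import XGBRegressor

# Ordinal-encode the categorical features, as the sample notebook does
# (this sketch assumes the validation set has no unseen categories)
ordinal_encoder = OrdinalEncoder()
X_train[categorical_cols] = ordinal_encoder.fit_transform(X_train[categorical_cols])
X_valid[categorical_cols] = ordinal_encoder.transform(X_valid[categorical_cols])

# Swap the default RandomForestRegressor() for XGBoost, still with defaults
model = XGBRegressor(random_state=0)
model.fit(X_train, y_train)
preds = model.predict(X_valid)
```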

To practise what I had learnt in the earlier course, I decided to set up a preprocessing and modeling pipeline, which required a fair amount of changes to the sample notebook. I commented out the affected portions and appended new cells with code implementing a simple pipeline.

I also made some tweaks to the XGBoost model, specifically by increasing the n_estimators and reducing the learning_rate parameters, which should theoretically improve the fit of the model at the additional cost of compute time. There was also a chance of overfitting, but since I didn’t push the parameters too far, the risk should have been minimal.

Source: Kaggle
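
A sketch of that kind of tweak, reusing the preprocessor from the earlier pipeline example (the parameter values here are illustrative, not the exact ones I submitted), would look like this:

```python
from sklearn.pipeline import Pipeline
from xgboost import XGBRegressor

# Illustrative values: more trees and a smaller learning rate than the defaults
xgb_model = XGBRegressor(n_estimators=1000, learning_rate=0.05, random_state=0)

my_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),   # ColumnTransformer from the pipeline sketch above
    ('model', xgb_model)
])

my_pipeline.fit(X_train, y_train)
test_preds = my_pipeline.predict(X_test)   # predictions for the competition submission
```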

Adding the pipeline shouldn’t affect the score since it was more about refactoring the code to improve the plumbing, but changing the model should theoretically improve the score, which it did. Not bad for very quick and minor changes.

There are still a few days left before the end of the competition, but I decided to call it a day. Reviewing the Python course was useful and the two machine learning courses were quite informative and provided practical code examples.

Now that my data science learning engine has restarted, I’ll continue with additional Kaggle Learn courses and try out other InClass competitions. Hopefully the momentum doesn’t fizzle out too fast this time, and I’ll be able to continue picking up additional knowledge.
