Checkpoint Delta: Basic machine learning toolbox

Given the recent hype about machine learning and artificial intelligence, you might think that it’s a new and upcoming field.

But in reality, it’s been around for decades, and if you take a broader definition, you could even say that it’s been around for centuries. The first artificial neural networks appeared in the 1940s and linear regression analysis has been around since the 1800s.

Algorithms

There are many ways to categorise the numerous machine learning algorithms. Two common approaches are to label them as Supervised vs Unsupervised, or as Prediction vs Classification.

For my needs, I just want a simple way to list the algorithms that I’ll be picking up incrementally, so I’ll go with Basic vs Advanced.

Using the 80/20 rule, the Basic algorithms would probably be sufficient to handle 80% of data science problems and should take 20% of my effort to understand, while the Advanced ones would be the inverse.

Here’s my initial list of Basic algorithms:

Linear regression
Logistic regression
Decision trees
K-means clustering

Learning Approach

I’ll approach this by incrementally building a toolbox of algorithms that can be applied to problem solving, starting with Basic ones like linear regression and decision trees and working up to more Advanced algorithms and techniques.

For each algorithm, I’ll create a cheatsheet following a consistent framework which can be used for future reference and application:

Name of algorithm e.g. Linear regression.
What does it do? e.g. prediction, classification.
When should it be used? i.e. identify key problem characteristics that would make this a suitable tool, provide some use cases.
How does it do it? i.e. describe the core concepts without diving too deep and getting lost in the details.
What are similar algorithms? including comparisons across different choices.
How to build the model? e.g. calibration of parameters, setting of hyper-parameters.
How to evaluate the model? e.g. what performance metrics to use, what is considered good or bad.
How to visualise and interpret results? i.e. find the best way to present the results to an end audience that will use them to make decisions and take action.
How to implement in code? using Python and identifying appropriate open source packages and functions.
Build a working example i.e. find a sample dataset and use case, then create a Jupyter notebook with Python code that includes EDA, model building, results analysis and visualisation.
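To keep every cheatsheet consistent, the framework above could be captured as a small Python template. This is only a rough sketch: the class name Cheatsheet, its field names and the missing_sections helper are my own placeholders rather than part of any library, and listing scikit-learn as the package for linear regression is an assumption about what I'll likely end up using.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class Cheatsheet:
    """One toolbox entry; each field mirrors one item in the framework.

    All names here are hypothetical placeholders chosen for illustration.
    """
    name: str                                   # Name of algorithm
    what_it_does: str = ""                      # e.g. prediction, classification
    when_to_use: str = ""                       # problem characteristics, use cases
    how_it_works: str = ""                      # core concepts only
    similar_algorithms: List[str] = field(default_factory=list)
    how_to_build: str = ""                      # parameters and hyper-parameters
    how_to_evaluate: str = ""                   # performance metrics
    how_to_visualise: str = ""                  # presenting results to the audience
    python_packages: List[str] = field(default_factory=list)
    example_notebook: str = ""                  # path to the Jupyter notebook

    def missing_sections(self) -> List[str]:
        """Return the names of the sections that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# A partially completed entry for the first Basic algorithm.
linear_regression = Cheatsheet(
    name="Linear regression",
    what_it_does="Predicts a continuous value from one or more input features.",
    python_packages=["scikit-learn"],  # assumed choice of package
)

# Shows which parts of the cheatsheet still need to be written.
print(linear_regression.missing_sections())
```

Running missing_sections after each study session would give a quick to-do list for that algorithm, and adding or renaming a field changes the template for every cheatsheet at once.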

Building the toolbox framework using the Basic algorithms first would give me a few iterations to refine the structure and content.

After all, it’s easier to eat an elephant one bite at a time.

Featured image credit: scikit-learn.org