While strengthening my theoretical foundation, I’ll start familiarising myself with the basic tools to run data science projects. The good news here is that most of these are open source, freely available on the internet and well used by the global data science community.
That means they are battle-tested, with good support on sites like Stack Overflow. And as always, Google is your friend. For any given problem, it's almost certain that someone else out there has faced it before and found a solution.
Python Programming
The programming languages commonly used for data science appear to be either Python or R, with both having good open source packages for analytics and visualisation.
Having dabbled briefly in both, I find Python easier to pick up and hence will start with that. I might get to R at a later stage, but one language is sufficient for now.
I’m very rusty at programming and learned how to code using Pascal. Yes, that Pascal. Actually, come to think of it, that wasn’t my first programming language. I recall using Logo way, way back to create simple patterns on the Apple IIe. It was fun, but there’s really not much you can do with it.
During my undergraduate days, I had to self-learn C to control robotic arms and then pick up Matlab/C++ during my master’s program to price financial derivatives. Since then I’ve not done any serious programming.
My approach to coding has always been to focus on getting the job done as quickly as possible. That doesn’t make for particularly elegant or efficient code, but I blame it on my engineering background.
Since I have some programming background, I’ve decided not to sign up for an online course and instead pick up Python by self-study and practice.
Having skimmed through the very useful “Python Data Science Handbook” by Jake VanderPlas, I find the chapters on Python programming a good quick-start guide.
I’ll start with this first, and decide if I need additional references. Depending on how it goes, I might even change my mind and sign up for an online course after all.
Jupyter Notebook
I suppose you could go hardcore dev mode and write code in a text editor and execute it via the command line, but I’ve decided to use an integrated development environment (IDE) like most mere mortals.
After narrowing it down to two options – either Spyder or Jupyter Notebook, both available in Anaconda – I’m going with Jupyter given its ability to integrate code and output in a story-like narrative.
Spyder is a more traditional IDE and has the benefit of being similar to RStudio, but Jupyter gets me closer to the end product of any data science analysis. Once I get Anaconda running on my Windows 10 laptop, I’m good to go.
Open Source Packages
One benefit of using Anaconda is the ease of managing open source packages. The open source community has already done so much that there’s really no need to re-invent the wheel and write raw code for data handling, analytics and visualisation.
I suppose once you’re a data scientist with years and years of experience, there will come a time when there’s no existing code for what you want to do. But for me, learning how to use pandas, NumPy, scikit-learn, Keras, etc. is more than sufficient at this time.
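As a quick sanity check once Anaconda is installed, something like the sketch below could confirm that these packages are actually available in the current Python environment. It uses only the standard library; the `PACKAGES` mapping and the `check_packages` helper are names I made up for illustration, and the only real assumption is that scikit-learn is imported under the name `sklearn`.

```python
from importlib.util import find_spec

# Display name -> import name for the packages mentioned above.
# Note that scikit-learn is imported as "sklearn".
PACKAGES = {
    "pandas": "pandas",
    "numpy": "numpy",
    "scikit-learn": "sklearn",
    "keras": "keras",
}


def check_packages(packages):
    """Return a dict mapping each display name to True if its module can be found."""
    return {name: find_spec(module) is not None for name, module in packages.items()}


# Report what is available without actually importing anything heavy.
status = check_packages(PACKAGES)
for name, installed in status.items():
    print(f"{name}: {'installed' if installed else 'missing'}")
```

Running this in a Jupyter cell prints one line per package, which makes it easy to see what still needs installing before getting started.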
When, not if but when, I hit a problem, the answer will either be somewhere in the respective documentation or on the internet.
Repeat after me, Google is your friend.
And that’s it really. I don’t think there’s much else I’d need to start doing some damage. Easy peasy.
But ask me again when I’m knee-deep hunting down bugs, and I’ll share some interesting Hokkien words with you.