Lewis Carroll

If you don’t know where you’re going, any road will get you there

I’ve always been a firm believer in knowing where you want to end up before starting on any meaningful journey. One of the things I’d like to achieve is to become a practising data scientist: someone who uses data and analytics to answer questions and solve problems.

In order to reach my objective, there are a few key skills that I believe I would need:

- Mathematics
- Statistics
- Computer science
- Data visualisation
- Machine learning
- Communication
- Domain expertise

Why these specific skills?

The first book I’m reading to get started is “Doing Data Science: Straight Talk from the Frontline” by Cathy O’Neil and Rachel Schutt.

Cathy O’Neil is a senior data scientist at Johnson Research Labs. She earned her Ph.D. in math from Harvard, and was a postdoc in the math department at MIT and a professor at Barnard College.

Rachel Schutt is an SVP of data science at News Corp, an adjunct professor of statistics at Columbia University, and a founding member of CU’s Education Committee for the Institute of Data Sciences and Engineering.

Both co-authors have strong credentials in academia and business, and both have been in the thick of the action.

### Data science profile

In Chapter 1 of their book, they introduce the idea of a person’s data science profile based on different levels of expertise in various domains (i.e. the seven key skill sets that I listed previously), and get students in their class to profile themselves.

The “perfect” data scientist would be someone with the highest ability in all seven skill sets. But that would be a unicorn, and becoming a unicorn is too much work, so I’ll settle for something more practical and achievable instead.

On a scale of 1 to 10, where 10 is the highest, here’s where I’d like to get to within a reasonable timeframe:

- Mathematics = 4
- Statistics = 4
- Computer science = 6
- Data visualisation = 10
- Machine learning = 8
- Communication = 10
- Domain expertise = 6

And here’s where I think I’m currently at:

- Mathematics = 3
- Statistics = 3
- Computer science = 3
- Data visualisation = 6
- Machine learning = 2
- Communication = 10
- Domain expertise = 3

Or, using a radar plot:

[Figure: radar plot of the scores above]

I clearly have a long road ahead of me, but it’ll no doubt be entertaining.
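The gap between the two profiles can also be worked out directly. Here is a minimal sketch in Python, using the scores from the two lists above. The names `target`, `current` and `skill_gaps` are my own choices for illustration and don’t come from the book.

```python
# Scores copied from the two lists above (scale of 1 to 10).
target = {
    "Mathematics": 4,
    "Statistics": 4,
    "Computer science": 6,
    "Data visualisation": 10,
    "Machine learning": 8,
    "Communication": 10,
    "Domain expertise": 6,
}
current = {
    "Mathematics": 3,
    "Statistics": 3,
    "Computer science": 3,
    "Data visualisation": 6,
    "Machine learning": 2,
    "Communication": 10,
    "Domain expertise": 3,
}


def skill_gaps(target, current):
    """Return (skill, gap) pairs, sorted from the largest gap to the smallest."""
    gaps = {skill: target[skill] - current[skill] for skill in target}
    return sorted(gaps.items(), key=lambda item: item[1], reverse=True)


gaps = skill_gaps(target, current)

# Print a crude text bar chart, one row per skill.
for skill, gap in gaps:
    print(f"{skill:<20} {'#' * gap} {gap}")
```

On these numbers, machine learning has the largest gap (6 points) and communication has none.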

### Mathematics and Statistics

Mathematics and statistics are the foundation of data science, and include topics like linear algebra, multivariate calculus and conditional probability, among others.

I’m aiming to acquire just enough knowledge in these areas to be dangerous, hence the relatively low scores.

### Computing

Computing is needed to process data, run algorithms, analyse results and support pretty much every other aspect of the data science workflow. This includes the ability to write code in Python or R, incorporate open-source packages like scikit-learn, use software like Tableau and leverage infrastructure like Amazon Web Services.

I’m treating computing as a means to an end. As with mathematics and statistics, my aim is to get things done rather than to become an actual developer.

### Visualisation

Visualisation covers the visual representation and analysis of everything from upstream data to downstream results. This is an area that I’m particularly interested in, especially how to view and interpret high-dimensional data in intuitive ways.

As humans, we’re constrained to three-dimensional space plus time as a fourth dimension. Given that any meaningful data analysis will likely include tens, hundreds, thousands or even more variables, being able to make sense of it all would be particularly valuable.

I plan to dive deep and hopefully become a subject matter expert in this field, hence the solid 10 target.

### Machine learning

Machine learning is a very broad field, ranging from simple techniques like linear regression to more advanced approaches like neural networks of various architectures.

I intend to acquire a broad working knowledge of most algorithms, understand the theory behind each one and be able to determine which to use for any particular problem.

### Communication and Domain expertise

Finally, communication and domain expertise.

It’s all well and good to tinker around endlessly with data and models, but eventually any findings need to be communicated to other people. To have any meaningful impact, the audience will likely be decision makers with access to sufficient resources to make things happen.

Understanding any particular business domain takes time and experience, so I’m not looking to be a subject matter expert in many domains. Instead, I want to acquire enough knowledge to understand the issues and assess results on an as-needed basis.

Communication is a critical soft skill, but I believe I’ve already reached a high level of proficiency over 22 years of work experience, and therefore won’t spend additional time on this area.

So, that describes the end that I have in mind. The next step is to lay out a plan to achieve it.