A Whirlwind Tour of Python

A whirlwind tour of A Whirlwind Tour of Python

“A Whirlwind Tour of Python” by Jake VanderPlas is a handy little reference for those with some prior programming experience but who are new to Python. It is compact (only 98 pages) yet feature rich, and is a good starting point for picking up Python.

In Jake’s own words:

“… this report in no way aims to be a comprehensive introduction to programming, or a full introduction to the Python language itself… Instead, this will provide a whirlwind tour of some of Python’s essential syntax and semantics, built-in data types and structures, function definitions, control flow statements, and other aspects of the language. My aim is that readers will walk away with a solid foundation from which to explore the data science stack”

Which sounds exactly right for me.

One of the first books I read on data science was the excellent “Python Data Science Handbook: Essential Tools for Working with Data” written by Jake VanderPlas. In addition to well-written and easily understandable chapters on various machine learning algorithms, the book also guides readers on Python programming, data manipulation using Pandas and visualisation using Matplotlib.

This Whirlwind report was mentioned in the Python section of that book, and so I decided to take a step back and read it to familiarise myself with the basics. Jake and O’Reilly Media have been kind enough to provide the entire report online and in downloadable PDF.

To give you an overview of what’s inside, here’s the content page:

Image credit: Jake VanderPlas and O’Reilly Media

After some preliminaries on how to install and run Python code (I recommend using Jupyter Notebooks via Anaconda), it starts by introducing the programming philosophy behind Python through an Easter egg within the interpreter that is exposed by running import this.

I’ve found that the best way for me to internalise what I’ve read is to physically type out and execute sample code. So, the code screenshots you see from here on were taken from a Jupyter notebook I set up for this purpose.

Basic Syntax

The basic Python syntax is introduced using the sample code below, which I’ve annotated with blue arrows for compactness.

Two things that struck me, coming from a C background, were: (i) code blocks are not delineated with curly braces {} but with indentations instead, and (ii) statements are not terminated with a special character like semicolon (;) but with a simple end-of-line. It takes a bit of getting used to, but I can see how it would improve code readability.
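To make the two points concrete, here is a minimal sketch of my own (the variable names and values are illustrative, not the book's annotated sample). Blocks are marked by indentation alone, and each statement simply ends at the end of its line.

```python
# Split the numbers 0..9 around a midpoint (illustrative values)
midpoint = 5
lower = []
upper = []

for i in range(10):
    # The indented lines below form the body of the loop; no braces needed
    if i < midpoint:
        lower.append(i)
    else:
        upper.append(i)

# Back at the left margin, so this runs once the loop has finished
print("lower:", lower)
print("upper:", upper)
```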

One trick I found which greatly simplifies the toggling of comments in Jupyter is to use <Ctrl-/> instead of typing out # all the time. It may not seem like a big deal, but it helps tremendously during debugging when you need to comment and uncomment multiple lines of code frequently. Just select all the relevant lines and press <Ctrl-/>.

Two important Python semantics that are useful to remember:

  • Variables are Pointers: a variable is just a name that points to an object. This allows for dynamic typing instead of the static type declarations needed in languages like C, but it can sometimes cause confusion. The sample code below illustrates this clearly.
  • Everything is an Object: even something as simple as a floating point number is an object that has attributes and methods.
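A short sketch of both ideas, using made-up variable names. Two names can point to the same list, a name can be rebound to a different type at any time, and even a plain float carries methods and attributes.

```python
x = [1, 2, 3]
y = x              # y points to the same list object as x
x.append(4)        # mutating the object through x...
print(y)           # ...is visible through y as well

x = "hello"        # rebinding x to a string; no type declaration needed
print(y)           # y still points to the original list

value = 4.5
print(value.is_integer())      # a method on a float object
print(value.real, value.imag)  # attributes on a float object
```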

The arithmetic, comparison and Boolean operators available are summarised below and are similar to most other programming languages. The identity (is/is not) and membership (in/not in) operators are new features which warrant a closer look.

The identity operators (is/is not) check for object identity, which is different from equality (==), as can be seen from the example below. Since all variables in Python are pointers, the is operator checks whether the two variables are pointing to the same object, rather than what the object contains. This is something to be aware of, and careful with, when using the identity operators.
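A minimal sketch of the difference, with list names chosen purely for illustration. Two separately created lists with the same contents are equal but not identical, while a second name bound to the same list is both.

```python
a = [1, 2, 3]
b = [1, 2, 3]   # a different object with the same contents
c = a           # another name for the very same object

print(a == b)   # True: the contents are equal
print(a is b)   # False: two distinct objects
print(a is c)   # True: both names point to one object
```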

The membership operators (in/not in) are really interesting: they test whether a value is present in a container. The same in keyword is also what makes for loops so pleasant to write. There’s no need to mess around with array indexing, just go straight to the individual items in a list and process them directly. Coupled with dynamic typing, this allows for very flexible and readable logic.
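A small sketch with made-up values: membership tests on a list, followed by a for loop that visits each item directly without any index bookkeeping, even when the items have different types.

```python
fruits = ["apple", "banana", "cherry"]

has_banana = "banana" in fruits       # membership test
no_durian = "durian" not in fruits    # negated membership test
print(has_banana, no_durian)

# Loop straight over the items; dynamic typing allows mixed types
seen_types = []
for item in [1, "two", 3.0]:
    seen_types.append(type(item).__name__)
    print(item, type(item).__name__)
```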

Data Types

The data types available in Python include the usual suspects of int, float, bool etc plus additional built-in data structures, namely list, tuple, dict and set.

Again, everything in Python is an object, and the built-in help function is very useful when you need to find out what methods are available for each data type. For example, you can create a new list variable and then call help on it to show the available methods.
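As a rough sketch of this kind of exploration: help prints the documentation, and I have also used the built-in dir function (my own addition, not mentioned above) to collect the method names into a list that can be inspected programmatically.

```python
L = [3, 1, 2]

# dir() lists attribute names; filter out the underscore-prefixed ones
public_methods = [name for name in dir(L) if not name.startswith("_")]
print(public_methods)

# help() prints the documentation for an object or one of its methods
help(L.append)
```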

List indexing is zero-based and so the first element of a list L is L[0]. Negative indices are allowed, and the last element can be accessed using L[-1]. Besides accessing single elements, slices of data can also be retrieved using a start point (inclusive), end point (non-inclusive) and an optional step size.

Leaving out the first index defaults to a value of 0 and leaving out the second index defaults to the length of the list, so the slice runs through to the last element (remember that the end point is non-inclusive). The indexing and slicing functionality also applies to setting values in the list. The simple examples below provide a good illustration of the various possibilities.

list
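A minimal sketch of list indexing, slicing and assignment. The list contents are illustrative values I picked, not taken from the original screenshot.

```python
L = [2, 3, 5, 7, 11]

print(L[0])      # 2, the first element
print(L[-1])     # 11, the last element
print(L[1:3])    # [3, 5], start inclusive, end non-inclusive
print(L[:3])     # [2, 3, 5], omitted start defaults to 0
print(L[-3:])    # [5, 7, 11], omitted end defaults to len(L)
print(L[::2])    # [2, 5, 11], every second element
print(L[::-1])   # [11, 7, 5, 3, 2], a negative step reverses

# Indexing and slicing also work for setting values
L[0] = 100
L[1:3] = [55, 56]
print(L)         # [100, 55, 56, 7, 11]
```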

Tuples are similar to lists except that they are immutable, i.e. their contents and size cannot be changed after creation. They are commonly used to return multiple values from a function. Tuples are defined using parentheses (or without any brackets) instead of square brackets, but note that accessing elements in a tuple is still done using square brackets.

tuple
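A short sketch of these tuple properties. The helper function min_max is a hypothetical name of mine, used only to show multiple return values.

```python
t = (1, 2, 3)
u = 1, 2, 3          # parentheses are optional
print(t == u)        # True
print(t[0])          # elements are still accessed with square brackets

# Tuples are immutable: assigning to an element raises TypeError
try:
    t[0] = 99
except TypeError as err:
    print("cannot modify a tuple:", err)

def min_max(values):
    """Return the smallest and largest value as a tuple."""
    return min(values), max(values)

lo, hi = min_max([4, 1, 9])   # unpack the returned tuple
print(lo, hi)
```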

Dictionaries are extremely flexible mappings of keys to values and are created using a comma-separated list of key:value pairs in curly braces {}. Elements are accessed and set by key rather than by zero-based index. The book describes dictionaries as having no defined order; note that since Python 3.7 they preserve insertion order.

dict
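A minimal sketch of creating a dictionary, then reading and setting items by key. The keys and values are illustrative.

```python
numbers = {"one": 1, "two": 2, "three": 3}

print(numbers["two"])      # access a value by its key
numbers["ninety"] = 90     # add a new key:value pair
print(numbers)

print("one" in numbers)    # membership tests check the keys
print(list(numbers.keys()))
```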

Sets are similar to lists and tuples, except that they are unordered, hold only unique items, and are declared with curly braces {}. Built-in methods and operators have been implemented for various set operations like union, intersection, difference and symmetric difference.

set
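A short sketch of the four set operations, using two small illustrative sets. Each operator also has an equivalent method, shown on the last line for union.

```python
primes = {2, 3, 5, 7}
odds = {1, 3, 5, 7, 9}

print(primes | odds)   # union: items in either set
print(primes & odds)   # intersection: items in both sets
print(primes - odds)   # difference: items in primes but not in odds
print(primes ^ odds)   # symmetric difference: items in exactly one set

print(primes.union(odds) == primes | odds)   # method form of the operator
```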

Flow Control

Python provides the usual flow control functionality such as if/else, for and while loops, together with break/continue execution controls. Syntax examples are given below and are fairly self-explanatory.

if/else
for
while
break/continue
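A compact sketch covering all four constructs in one place. The values and variable names are my own and only meant as illustration.

```python
# if / elif / else
x = -15
if x == 0:
    sign = "zero"
elif x > 0:
    sign = "positive"
else:
    sign = "negative"

# for loop over a range
squares = []
for n in range(5):
    squares.append(n * n)

# while loop with an explicit counter
countdown = []
i = 3
while i > 0:
    countdown.append(i)
    i -= 1

# break leaves the loop early, continue skips to the next iteration
odds_below_ten = []
for n in range(100):
    if n >= 10:
        break
    if n % 2 == 0:
        continue
    odds_below_ten.append(n)

print(sign, squares, countdown, odds_below_ten)
```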

Functions

Function definitions in Python use a straightforward syntax that allows for default values of input arguments.
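A minimal sketch of a function with default argument values. The function name and starting values are my own choice for illustration.

```python
def fibonacci(n, a=0, b=1):
    """Return the first n terms of a Fibonacci-style sequence."""
    terms = []
    while len(terms) < n:
        a, b = b, a + b
        terms.append(a)
    return terms

print(fibonacci(5))             # uses the defaults a=0, b=1
print(fibonacci(5, b=3, a=1))   # defaults overridden by keyword
```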

There is also flexibility to write functions where the input arguments are not pre-defined and can be whatever the user passes in. In this case, *args and **kwargs are used to catch all arguments passed in. I’ve not had experience writing functions that require this level of flexibility, but it’s good to know that I can if and when I ever have to.
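A small sketch of a catch-all function (catch_all is a hypothetical name). Positional arguments arrive as a tuple and keyword arguments as a dictionary.

```python
def catch_all(*args, **kwargs):
    """Accept any arguments and hand them back for inspection."""
    print("args =", args)        # positional arguments, as a tuple
    print("kwargs =", kwargs)    # keyword arguments, as a dict
    return args, kwargs

args, kwargs = catch_all(1, 2, 3, a=4, b=5)
```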

Another way of defining functions is to use short one-off lambda functions, which can also be passed into other functions as an argument, inception-style. Similar to the movie, this will take me a while to wrap my head around, but I can already see how powerful this could be, not to mention its value in writing compact and readable code.

In the example below, the lambda function is used to extract only the first name in the data and use it as the sorting key. It’s beautiful how concise and elegant this approach is.
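A minimal sketch of sorting with a lambda as the key. The records are illustrative data of mine and may differ from the original screenshot.

```python
people = [
    {"first": "Guido", "last": "Van Rossum"},
    {"first": "Grace", "last": "Hopper"},
    {"first": "Alan", "last": "Turing"},
]

# The lambda pulls out the first name, which sorted() uses as the key
by_first = sorted(people, key=lambda item: item["first"])
first_names = [person["first"] for person in by_first]
print(first_names)
```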

List Comprehension

List comprehensions are probably the most useful and unique syntax in Python, allowing for compact and readable code. The basic syntax is in the form of [expr for var in iter], where expr is any valid expression, var is a variable name and iter is any Python iterable object. A good way to visualise how they are coded is to compare with an equivalent for loop.
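A short side-by-side sketch, with names of my own choosing: the same list of squares built first with a for loop and then with a comprehension.

```python
# Building a list with a for loop
squares_loop = []
for n in range(12):
    squares_loop.append(n ** 2)

# The equivalent list comprehension: [expr for var in iter]
squares_comp = [n ** 2 for n in range(12)]

print(squares_loop == squares_comp)
```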

Nested iterations can be achieved by adding additional variables, with multiple levels possible. This looks like it would be useful when dealing with n-dimensional Euclidean space.
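A minimal sketch of a two-level comprehension with illustrative ranges. The first for is the outer loop and the second is the inner loop.

```python
# Every (i, j) pair; j varies fastest because its loop is innermost
pairs = [(i, j) for i in range(2) for j in range(3)]
print(pairs)
```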

Conditionals on the iterator are allowed by adding expressions. In the example below, if (val % 3 > 0) is added to the iterator to filter out multiples of 3.
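Written out as code, a filter of this kind might look like the sketch below. The range of 20 is my own assumption.

```python
# Keep only values that are not multiples of 3
not_multiples_of_3 = [val for val in range(20) if val % 3 > 0]
print(not_multiples_of_3)
```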

Conditionals on the value are also allowed. They make the code even more compact, though less readable, which can be easily addressed with the judicious use of line breaks. In the example below, val if val % 2 else -val is used as the value in the earlier list comprehension to negate any value that is a multiple of 2 (after the earlier filtering out of multiples of 3).
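A sketch of a comprehension with both kinds of conditional, split over two lines for readability. The range of 20 is again my own assumption.

```python
# The trailing "if" filters out multiples of 3;
# the leading "val if ... else -val" negates the even values that remain
signed = [val if val % 2 else -val
          for val in range(20) if val % 3]
print(signed)
```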

The syntax for comprehensions is not limited to lists but also applies to sets and dicts, as illustrated in the code samples below. Note that using parentheses () does not result in a tuple comprehension but in a generator expression, which is a dynamic way of generating values and a separate topic.
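A minimal sketch of the three variants, with illustrative ranges. The set drops duplicate squares, the dict maps each number to its square, and the parenthesised form yields a generator rather than a tuple.

```python
squares_set = {n ** 2 for n in range(-3, 4)}    # duplicates collapse
squares_dict = {n: n ** 2 for n in range(4)}    # key: value pairs
squares_gen = (n ** 2 for n in range(4))        # generator expression

print(squares_set)
print(squares_dict)
gen_type = type(squares_gen).__name__
print(gen_type)

# Values are only produced when the generator is consumed
gen_values = list(squares_gen)
print(gen_values)
```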

String Manipulation

Strings can be defined using either single or double quotes (both are identical) with multi-lines defined using triple quotes.
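A quick sketch with made-up string contents: single and double quotes produce the same string, and triple quotes allow a literal to span several lines.

```python
single = 'A string'
double = "A string"
print(single == double)   # the quote style makes no difference

multiline = """
one
two
three
"""
print(multiline)
```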

String manipulation is another aspect that Python does particularly well, with many built-in methods available for common tasks like adjusting case, removing and padding spaces,

Adjusting case
Removing spaces
Padding spaces

finding substrings,

replacing substrings,

… as well as partitioning, splitting, joining,

… and formatting strings. Details on additional formatting specifications are given in the Python online documentation.
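To give a flavour of these methods, here is a sketch of my own that runs through each group in turn. The sample strings are illustrative and all the methods are standard str methods.

```python
fox = "tHe qUICk bROWn fOx."

# Adjusting case
print(fox.upper())
print(fox.lower())
print(fox.title())

# Removing and padding spaces
padded = "   some content   "
print(repr(padded.strip()))    # whitespace removed from both ends
print(repr("435".rjust(6)))    # right-justified in a field of width 6
print("435".zfill(6))          # padded with leading zeros

# Finding and replacing substrings
line = "the quick brown fox jumped over a lazy dog"
print(line.find("fox"))        # index of the first match
print(line.find("bear"))       # -1 when there is no match
print(line.replace("brown", "red"))

# Partitioning, splitting and joining
print(line.partition("fox"))   # (before, separator, after)
words = line.split()           # split on whitespace
rejoined = "-".join(words[:3])
print(rejoined)

# Formatting
message = "pi is roughly {:.3f}".format(3.14159)
print(message)
```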

The full power and flexibility of Python’s string manipulation functionality is best seen when used with regular expressions (regex). I’ve come across regular expressions in the past but have found the syntax relatively cryptic (see example below). At this time though, I’ll stick with the simple stuff and deep-dive into regular expressions when I truly need them, starting with the basics in this same book.

Regular expression (regex) simple example
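A minimal sketch using the standard re module. The email pattern is deliberately simplistic and the addresses are placeholders I made up; real-world email matching needs a far more careful pattern.

```python
import re

line = "the quick brown fox jumped over a lazy dog"

# Split on one or more whitespace characters
whitespace = re.compile(r"\s+")
words = whitespace.split(line)
print(words)

# A naive pattern: word characters, "@", word characters, ".", three letters
email = re.compile(r"\w+@\w+\.[a-z]{3}")
text = "Write to alice@example.com or, failing that, bob@example.org."
found = email.findall(text)
print(found)
```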

Modules and Packages

The Python standard library comes with many useful tools, like the built-in regular expressions module and many others including those listed below, with details given in Python’s online standard library documentation.

Image credit: Jake VanderPlas, O’Reilly Media

Loading modules is done via the import statement using the respective module name with its contents preserved in a specific namespace. For longer module names, aliases can be used for shorter namespaces.
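A short sketch using two standard library modules so that it runs without installing anything. The alias stats is my own choice; the same pattern is what you see in the conventional import numpy as np.

```python
# Plain import: contents live in the "math" namespace
import math
cos_pi = math.cos(math.pi)
print(cos_pi)

# Import with an alias for a shorter namespace
import statistics as stats
average = stats.mean([1, 2, 3, 4])
print(average)
```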

The Python open source community has created an entire ecosystem of third-party modules that are particularly useful for data science. Almost anything you’d need as a starting data scientist is likely to be available in a module somewhere. The first instinct should therefore be to find an appropriate module, then understand and use it instead of writing code from scratch.

Before a third-party module can be imported, the package has to be fetched and installed first. The standard package registry is the Python Package Index (PyPI) and packages can be installed via command line or using a package manager like that found in the Anaconda distribution.

For ease of use and maintenance, the latter is recommended, but here’s an example command line instruction to install the popular scikit-learn package: $ pip install scikit-learn

The list below highlights some essential third-party packages for data science applications, but many more are available.

  • NumPy: efficient way to store and manipulate dense arrays, and used as foundation for other modules like Pandas and SciPy.
  • Pandas: labeled column-oriented data built on NumPy, providing a very useful interface for multi-dimensional data in the form of a DataFrame object that supports operations such as select, aggregate, join and group.
  • Matplotlib: currently the most popular scientific visualisation package in Python. Even though its functionality is more basic than other viz packages like Seaborn and Plotly, it is simpler and gets the job done.
  • SciPy: scientific functionality built on NumPy, including linear interpolation, numerical optimisation, statistical analysis, linear algebra, fast Fourier transforms etc.
  • Scikit-Learn: implementation of numerous machine learning algorithms, including linear regression, logistic regression, kNN, decision trees, random forests etc. Almost all the usual suspects can be found in this package.

Other Stuff

There are additional chapters and sections in the book that I’ve glossed over and haven’t read through in detail. I’ve listed them below for completeness, so that I can revisit them in the future as needed.

  • Errors and exceptions: try, except, else, finally
  • Iterators: enumerate, zip, filter, map, itertools
  • Generators: similar to list comprehensions, except that “… list is a collection of values, while a generator is a recipe for producing values”. More memory and compute efficient, but slightly more complicated.
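For future reference, a rough sketch of my own touching on each of the three topics. The function name safe_divide and all values are hypothetical, and this is only a first taste rather than a summary of those chapters.

```python
# Errors and exceptions: try / except / else / finally
def safe_divide(a, b):
    """Return a / b, or None when b is zero."""
    try:
        result = a / b
    except ZeroDivisionError:
        return None
    else:
        return result
    finally:
        print("safe_divide finished")   # runs in either case

print(safe_divide(6, 3))
print(safe_divide(1, 0))

# Iterators: enumerate gives (index, item), zip pairs up two iterables
labelled = []
for index, (letter, number) in enumerate(zip("abc", [10, 20, 30])):
    labelled.append((index, letter, number))
print(labelled)

# Generators: a recipe for values, produced only when consumed
squares_list = [n ** 2 for n in range(5)]   # all values stored up front
squares_gen = (n ** 2 for n in range(5))    # values produced on demand
total = sum(squares_gen)
print(total)
```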

“A Whirlwind Tour of Python” does exactly as advertised. To use a chess analogy, it explains the rules of chess, but doesn’t go into detail on the various chess tactics, strategies or opening moves.

Having already completed Kaggle Learn’s online Python course, I plan to review the programming chapters of Jake’s “Python Data Science Handbook”. After that, I should have sufficient knowledge in Python to be dangerous and start serious coding as I continue on my data science learning journey.

Source: “A Whirlwind Tour of Python by Jake VanderPlas (O’Reilly). Copyright 2016 O’Reilly Media, Inc., 978-1-491-96465-1.”

