Getting started on geospatial analysis with GeoPandas and Folium

After completing Kaggle’s “30 Days of Machine Learning” challenge, I decided to keep the momentum going by systematically taking the rest of the courses in their catalogue.

One particular course that caught my attention was the one on Geospatial Analysis, especially the lessons on creating interactive maps and manipulating geospatial data.

Source: Kaggle Learn

After completing the course, I decided to try out my newly acquired skills by applying them to a dataset outside of Kaggle. Since I’m such a coffee addict, why not combine my two interests and do some geospatial visualisation on coffee production?

I went digging around and found a good dataset with annual coffee production data from 1991 to 2020 broken down by country, published by the International Coffee Organization (ICO).

Importing Packages

First things first, I added the standard imports os, numpy and pandas, along with chardet and fuzzywuzzy for data processing, followed by geopandas and folium, which are required for geospatial analysis.

Loading Data

Next, I saved the ICO dataset as a CSV file, uploaded it to Kaggle Datasets, and added that dataset to the Kaggle Notebook where I’d do my coding.

Using pd.read_csv() to read the CSV file should have been pretty straightforward, but I quickly ran into my first error message:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf4 in position 4088: invalid continuation byte

Thankfully, this was covered in the Character Encodings lesson, so I was able to determine the correct encoding for the CSV file and pass it explicitly when calling pd.read_csv().
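Here is a minimal sketch of that workflow on a tiny stand-in file. The file contents, the file name and the Windows-1252 encoding are assumptions for illustration; the post does not say which encoding the real file used. As it happens, byte 0xf4 decodes to "ô" under Windows-1252, which would fit a name like Côte d'Ivoire, but that is only a guess.

```python
import os
import tempfile

import chardet
import pandas as pd

# Write a small sample CSV encoded as Windows-1252 (assumed for this sketch).
sample = "Country,Region\nCôte d'Ivoire,Africa\nBrazil,South America\n"
path = os.path.join(tempfile.mkdtemp(), "coffee_sample.csv")
with open(path, "wb") as f:
    f.write(sample.encode("Windows-1252"))

# Reading with the default UTF-8 codec fails on the non-UTF-8 byte.
try:
    pd.read_csv(path)
    utf8_failed = False
except UnicodeDecodeError as err:
    utf8_failed = True
    print("Default read failed:", err)

# Ask chardet for its best guess from the first bytes of the file.
# On a sample this small the guess can be unreliable, so treat it as a hint.
with open(path, "rb") as f:
    guess = chardet.detect(f.read(10000))
print("chardet guess:", guess)

# Pass the chosen encoding explicitly.
coffee = pd.read_csv(path, encoding="Windows-1252")
print(coffee)
```

On a real file, the value in guess["encoding"] is the candidate to try first, ideally after checking that the decoded text looks sensible.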

Downloading Country Boundaries

To draw Choropleth maps that show how much coffee each country produces, I needed the boundaries of each country, and these were readily available within GeoPandas. The boundaries are stored in the geometry column as either Polygon or MultiPolygon objects.
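To show what the geometry column holds, here is a small hand-built GeoDataFrame. The country names and shapes are invented; a real boundaries layer would normally be loaded with gpd.read_file() from a file such as a Natural Earth download. Older GeoPandas releases bundled a "naturalearth_lowres" layer for this purpose, but it was removed in GeoPandas 1.0, so this sketch avoids depending on it.

```python
import geopandas as gpd
from shapely.geometry import MultiPolygon, Polygon

# Two invented shapes: a single landmass and a landmass plus an island.
mainland = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
island = Polygon([(2, 0), (3, 0), (3, 1), (2, 1)])

world = gpd.GeoDataFrame(
    {"name": ["Mainlandia", "Archipelagia"]},  # hypothetical countries
    geometry=[mainland, MultiPolygon([mainland, island])],
    crs="EPSG:4326",  # plain latitude/longitude coordinates
)

# geom_type reports the shapely type of each row's geometry.
print(world.geom_type.tolist())
```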

Fixing Missing Countries

Since I was only interested in those countries that produce coffee, I only extracted the boundaries for countries in the coffee dataset.

However, I realised that eight countries were missing, and my best guess was that the spelling of those country names was not consistent between the coffee and geopandas datasets.
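One way to list the countries that fail to match is a simple membership check. The names below are illustrative stand-ins rather than the actual eight mismatches, which the post does not list.

```python
import pandas as pd

# Stand-in data: the coffee table and the names found in the boundaries table.
coffee = pd.DataFrame({"Country": ["Brazil", "Viet Nam", "Colombia"]})
world_names = pd.Series(["Brazil", "Vietnam", "Colombia", "France"], name="name")

# Keep only the coffee countries whose spelling has no exact match.
missing = coffee.loc[~coffee["Country"].isin(world_names), "Country"]
print(missing.tolist())
```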

Matching Using Fuzzy Text

I could have manually compared those missing countries against the full list of countries available in the geopandas dataset and tried to pick them out visually.

But then I remembered the fuzzy text matching package fuzzywuzzy (what a cute name!) introduced in the Inconsistent Data Entry lesson, and managed to narrow down the closest matches for the missing country names.

The matching wasn’t 100% complete, but only two countries remained unmatched, so I simplified their names and quickly located those as well.

Replacing Inconsistent Names

I created a dict that contained key:value pairs for the countries that had to be re-named, and used it to create a new name column in the coffee dataset that would be consistent with the corresponding name column in the geopandas dataset.

I decided to keep the Country column in the coffee dataset as-is, in case I needed to use it somewhere down the line.

After merging the two datasets, I double-checked that none of the 55 countries were missing.
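The renaming and merge steps can be sketched with plain pandas as follows. The rename entry, the table contents and the production values are placeholders, and the real notebook presumably merged into a GeoDataFrame rather than a plain DataFrame.

```python
import pandas as pd

# Placeholder tables; production values are dummy numbers, not real figures.
coffee = pd.DataFrame(
    {"Country": ["Brazil", "Viet Nam", "Colombia"], "2020": [3.0, 2.0, 1.0]}
)
world = pd.DataFrame({"name": ["Brazil", "Vietnam", "Colombia", "France"]})

# Map inconsistent spellings to the ones used by the boundaries table.
rename = {"Viet Nam": "Vietnam"}  # hypothetical entry

# Build a new `name` column and leave the original `Country` column untouched.
coffee["name"] = coffee["Country"].replace(rename)

# Keep only countries present in both tables.
merged = world.merge(coffee, on="name", how="inner")

# Double-check that no coffee-producing country was dropped by the merge.
missing = set(coffee["name"]) - set(merged["name"])
print(len(merged), "countries merged; missing:", missing)
```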

Adding Latitude and Longitude

The geopandas dataset didn’t include the latitude and longitude data for each country, which would be needed later when generating folium bubble maps, so I used the geocode functionality taught in the Manipulating Geospatial Data lesson to fetch those using the nominatim provider.

My first attempt using the name column didn’t achieve 100% retrieval, but this was easily solved by switching to the Country column, whose longer names seemed to work better. It was a good thing that I had kept that column.
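Below is a sketch of how the lookup could be wrapped so that failures show up as missing values instead of errors. The helper names nominatim_lookup and add_lat_lon and the user_agent string are invented for this example; geocode itself comes from geopandas.tools and needs geopy plus a network connection, so it is only called when the default lookup is actually used.

```python
import numpy as np
import pandas as pd


def nominatim_lookup(place):
    """Return (latitude, longitude) for a place name, or (nan, nan) on failure."""
    # Imported lazily so the rest of the sketch runs without geopy or a network.
    from geopandas.tools import geocode

    try:
        # user_agent is a made-up identifier passed through to the geocoder.
        result = geocode([place], provider="nominatim", user_agent="coffee-map-sketch")
        point = result.geometry.iloc[0]
        if point is None or point.is_empty:
            return np.nan, np.nan
        return point.y, point.x  # y is latitude, x is longitude
    except Exception:
        return np.nan, np.nan


def add_lat_lon(df, column, lookup=nominatim_lookup):
    """Return a copy of df with Latitude/Longitude columns looked up from df[column]."""
    out = df.copy()
    coords = [lookup(place) for place in out[column]]
    out["Latitude"] = [lat for lat, _ in coords]
    out["Longitude"] = [lon for _, lon in coords]
    return out
```

Calling add_lat_lon(df, "name") and then checking out["Latitude"].isna().sum() shows how many lookups failed; repeating the call with "Country" mirrors the switch described above.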

Completed Dataset with Geospatial Info

Now that I had the complete dataset with coffee production, country boundaries and latitude/longitude data, I could start on some data visualisation, beginning with a simple sorted bar chart of the 2020 data.

It was very obvious that Brazil is the world’s largest producer of coffee by far, with Vietnam coming in second and Colombia third.
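A sorted bar chart like this takes only a few lines with pandas and matplotlib. The numbers below are dummy values chosen only to reproduce the ranking mentioned above; they are not real production figures, and the column name "2020" is an assumption about how the data is laid out.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so the sketch runs without a display

import pandas as pd

# Dummy values that only preserve the ranking; not real production data.
coffee = pd.DataFrame(
    {"Country": ["Colombia", "Brazil", "Vietnam"], "2020": [1.0, 3.0, 2.0]}
)

# Sort from largest to smallest producer, then draw one bar per country.
ranked = coffee.sort_values("2020", ascending=False)
ax = ranked.plot.bar(x="Country", y="2020", legend=False)
ax.set_ylabel("Production (placeholder units)")
ax.figure.tight_layout()
```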

Generating Folium Choropleth and Circle Plots

To visualise the data even better, I generated Choropleth and Circle plots that were introduced in the Interactive Maps lesson.

The Choropleth plot used the country boundary geometry data, while the Circle plot used the Latitude and Longitude data, and I could overlap both plots on the same base map.

Generating these plots was easy and straightforward, and once they were rendered, I could zoom in and out and pan around the map. Adding a dynamic tooltip to the Circle plot was easy, though doing the same for the Choropleth plot required creating and configuring a GeoJSON file, which I didn’t attempt.

Conclusion

This was just a simple exercise using a dataset outside of Kaggle, but I was happy that I got to try out some of the useful skills that I picked up from various Kaggle Learn lessons.

I’ll probably do more analysis on this coffee production dataset, but even this first pass at geospatial visualisation was already quite interesting.

Note: The Kaggle Notebook with complete Python code can be found here, together with the interactive Folium map. You don’t need a Kaggle account to view the code and interact with the map, but you will need an account if you want to fork a copy of the code and run it.
