After completing Kaggle’s “30 Days of Machine Learning” challenge, I decided to keep the momentum going by systematically taking the rest of the courses in their catalogue.
One particular course that caught my attention was the one on Geospatial Analysis, especially the lessons on creating interactive maps and manipulating geospatial data.
After completing the course, I decided to try out my newly acquired skills by applying them to a dataset outside of Kaggle. Since I’m such a coffee addict, why not combine my two interests and do some geospatial visualisation on coffee production?
First things first, I added the standard imports: pandas as well as fuzzywuzzy for data processing, followed by folium, which is required for geospatial analysis.
Using pd.read_csv() to read the CSV file should have been pretty straightforward, but I quickly ran into my first error message:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf4 in position 4088: invalid continuation byte
Thankfully, this was covered in the Character Encodings lesson, so I was able to determine the correct encoding and pass it explicitly when calling pd.read_csv().
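As a rough sketch of what this kind of fix looks like: the example below assumes, purely for illustration, that the file is Latin-1 encoded (the actual encoding isn't stated here), and the sample file, column names and values are all made up. It only shows the general pattern of passing the encoding parameter to pd.read_csv().

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data. In Latin-1, "ô" is stored as the single byte 0xf4,
# which is not valid UTF-8 when followed by a plain ASCII letter.
csv_text = "Country,Production\nCôte d'Ivoire,1\nBrazil,2\n"

path = os.path.join(tempfile.mkdtemp(), "coffee_sample.csv")
with open(path, "wb") as f:
    f.write(csv_text.encode("latin-1"))

# Reading with the default UTF-8 codec fails on the 0xf4 byte.
try:
    pd.read_csv(path)
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Passing the (assumed) encoding explicitly resolves the error.
coffee = pd.read_csv(path, encoding="latin-1")
print(utf8_failed)
print(coffee["Country"].tolist())
```

If the encoding isn't known up front, it has to be worked out first (for example by inspecting the raw bytes or using a detection library) before it can be passed in like this.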
Downloading Country Boundaries
In order to draw Choropleth maps that show how much coffee each country produces, I needed the boundaries of each country, and these were easily available within GeoPandas. The boundaries can be found in the geometry column as either Polygon or MultiPolygon data types.
Fixing Missing Countries
Since I was only interested in those countries that produce coffee, I only extracted the boundaries for countries in the coffee dataset.
However, I realised that eight countries were missing, and my best guess was that the spelling of those country names was not consistent between the coffee and geopandas datasets.
Matching Using Fuzzy Text
I could have manually compared those missing countries against the full list of countries available in the geopandas dataset, and tried to pick them out visually.
But then I remembered the fuzzy text matching package fuzzywuzzy (what a cute name!) introduced in the Inconsistent Data Entry lesson, and managed to narrow down the closest matches for the missing country names.
The matching wasn’t 100% complete, but since only two countries were still unmatched afterwards, I simplified their names and managed to quickly locate those as well.
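To illustrate the idea, here is a minimal sketch of finding the missing names and suggesting close matches. Note that it uses difflib from the Python standard library as a stand-in for fuzzywuzzy, so the scores and ranking won't be identical to what fuzzywuzzy would give. The country lists and the closest_names helper are hypothetical, for illustration only.

```python
import difflib

# Hypothetical name lists; the real datasets are much longer.
coffee_names = ["Brazil", "Viet Nam", "Colombia"]
boundary_names = ["Brazil", "Vietnam", "Colombia", "Venezuela"]

# Names in the coffee data with no exact match in the boundary data.
missing = sorted(set(coffee_names) - set(boundary_names))


def closest_names(name, choices, n=3):
    """Return up to n choices ranked by similarity to name (case-insensitive)."""
    lowered = {choice.lower(): choice for choice in choices}
    hits = difflib.get_close_matches(name.lower(), list(lowered), n=n, cutoff=0.0)
    return [lowered[hit] for hit in hits]


suggestions = {name: closest_names(name, boundary_names) for name in missing}
print(missing)
print(suggestions)
```

The top suggestion still needs a quick manual check, since a high similarity score doesn't guarantee that two names refer to the same country.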
Replacing Inconsistent Names
I created a dict that contained key:value pairs for the countries that had to be renamed, and used it to create a new name column in the coffee dataset that would be consistent with the corresponding name column in the geopandas dataset.
I decided to keep the Country column in the coffee dataset as-is, in case I needed to use it somewhere down the line.
After merging the two datasets, I double-checked that none of the 55 countries were missing.
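The renaming, merging and checking steps can be sketched roughly as follows. This assumes plain pandas DataFrames as stand-ins for the real datasets (the boundaries would normally be a GeoDataFrame), and the rows, values and the rename_map entry are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the coffee and boundary datasets.
coffee = pd.DataFrame({"Country": ["Brazil", "Viet Nam"], "Production": [3, 2]})
boundaries = pd.DataFrame(
    {
        "name": ["Brazil", "Vietnam", "Venezuela"],
        "geometry": ["<polygon>", "<polygon>", "<polygon>"],  # placeholders
    }
)

# Map inconsistent spellings to the spelling used in the boundary data.
rename_map = {"Viet Nam": "Vietnam"}

# Keep the original Country column and add a consistent name column.
coffee["name"] = coffee["Country"].replace(rename_map)

# A left merge with indicator=True flags rows that found no boundary.
merged = coffee.merge(boundaries, on="name", how="left", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "Country"].tolist()
print(unmatched)
```

An empty unmatched list means every country in the coffee dataset picked up a boundary.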
Adding Latitude and Longitude
The geopandas dataset didn’t include the latitude and longitude data for each country, which would be needed later when generating folium bubble maps, so I used the geocode functionality taught in the Manipulating Geospatial Data lesson to fetch those.
My first attempt using the name column didn’t achieve 100% retrieval, but this was easily solved by switching to the Country column, which contained longer names that seemed to work better. It was a good thing that I kept that column previously.
Completed Dataset with Geospatial Info
Now that I had the complete dataset with coffee production, country boundaries and latitude/longitude data, I could start doing some simple data visualisation, starting with a simple (sorted) bar chart using 2020 data.
It was very obvious that Brazil is the world’s largest producer of coffee by far, with Vietnam coming in second and Colombia third.
Generating Folium Choropleth and Circle Plots
To visualise the data even better, I generated Choropleth and Circle plots that were introduced in the Interactive Maps lesson.
The Choropleth plot used the country boundary geometry data, while the Circle plot used the Latitude and Longitude data, and I could overlap both plots on the same base map.
Generating these plots was easy and straightforward, and once they were rendered, I could zoom in and out and pan to different locations. Adding a dynamic tooltip to the Circle plot was easy, though doing the same for the Choropleth plot required creating and configuring a GeoJSON file, which I didn’t attempt.
This was just a simple exercise using a dataset outside of Kaggle, but I was happy that I got to try out some of the useful skills that I picked up from various Kaggle Learn lessons.
I’ll probably do more analysis on this coffee production dataset, but even this first pass, limited to geospatial visualisation, was already quite interesting.
Note: The Kaggle Notebook with complete Python code can be found here, together with the interactive Folium map. You don’t need a Kaggle account to view the code and interact with the map, but you will need an account if you want to fork a copy of the code and run it.