# Data Visualisation - Lab 5 - Geographical data

---

**Authors: Claire Rocks and Richard Kirk**

---

Welcome to the fifth lab for Data Visualisation.

In this lab we are going to look at visualising geographical data. Lots of datasets involve features with a spatial or geographical dimension. We'll look at choropleth maps, scatter plots and bubble plots. Choropleth maps present aggregate statistics across geographical regions. Scatter plots are effective when you want to show specific locations, and bubble plots are useful when you want to show count data per region on a map.

There are a number of Python libraries we can use for this, e.g. [altair](https://altair-viz.github.io/) or [geopandas](https://geopandas.org/en/stable/), but in this lab we will use [plotly](https://plotly.com/python/). Plotly is a data visualisation library with interfaces for JavaScript, Python and R that lets us produce nice visualisations quickly.

## Setup for the Lab

Let's start by installing and importing the required libraries. The key one we will be using is Plotly Express, a high-level interface to **Plotly** that is easy to use, operates on a variety of types of data and produces easy-to-style figures. We install the following:
  * `pandas` - stores our data efficiently
  * `scipy` - includes scientific calculation functions
  * `plotly` - draws our maps

In [None]:
%pip install pandas scipy plotly

import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

## VS Code needs a different renderer to display the figures, which the next 2 lines set. Comment out these 2 lines if you are running this in Colab
import plotly.io as pio
pio.renderers.default = "notebook_connected"

## Choropleth Maps

A choropleth map is a map in which each region is coloured to indicate the value of a feature for that region, e.g. population per country.

In this example, we will use internet usage statistics from the [Our World in Data dataset](https://ourworldindata.org/internet). Our World in Data is an interesting site and worth a look - their mission is to publish the “research and data to make progress against the world’s largest problems”. The site has 3280 charts across 297 topics.

In [None]:
## Load the data set
internet_usage_url = "https://raw.githubusercontent.com/TrainingByPackt/Interactive-Data-Visualization-with-Python/master/datasets/share-of-individuals-using-the-internet.csv"

internet_usage_df=pd.read_csv(internet_usage_url)
internet_usage_df.head()

The `Code` feature in the dataset refers to a code assigned to each country by a standard called [ISO 3166-1](https://www.iso.org/iso-3166-country-codes.html). It is widely used so that developers across the world have a common way to refer to countries. Plotly also uses it to map data to the appropriate location on a world map, expecting the three-letter (alpha-3) form by default.
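
As a quick sanity check, the sketch below separates rows that carry a three-letter code from rows that do not. The toy rows and values are invented for illustration and are not taken from the lab dataset; the point is only that rows without a valid alpha-3 code cannot be placed on the built-in world map.

```python
import pandas as pd

# Toy rows for illustration only (not real figures from the lab dataset)
toy_df = pd.DataFrame({
    "Country": ["France", "Japan", "World", "Some Region"],
    "Code": ["FRA", "JPN", "OWID_WRL", None],
    "Value": [10.0, 20.0, 30.0, 40.0],
})

# An ISO 3166-1 alpha-3 code is exactly three upper-case letters;
# na=False treats missing codes as "no match"
has_alpha3 = toy_df["Code"].str.fullmatch(r"[A-Z]{3}", na=False)

mappable = toy_df[has_alpha3]
unmappable = toy_df[~has_alpha3]

print(mappable["Country"].tolist())
print(unmappable["Country"].tolist())
```

Running the same check on `internet_usage_df` would show which rows, such as regional aggregates, will be left off the map.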

In [None]:
## Create a subset specific to one year - 2016
internet_usage_2016 = internet_usage_df.query("Year == 2016")
internet_usage_2016.head()

To plot the choropleth map we need to provide a number of parameters:

  * The data set we want to use.
  * The name of the column providing the ISO 3166-1 codes.
  * The column by which to colour-code the countries.
  * The column containing the information we want to display when we hover over the country.
  * Which one of the [built-in colour scales](https://plotly.com/python/builtin-colorscales/) we want to use.

In [None]:
## Display the choropleth map
fig = px.choropleth(internet_usage_2016, locations = "Code", color = "Individuals using the Internet (% of population)", hover_name = "Country", color_continuous_scale = px.colors.sequential.Sunsetdark)
fig.show()

Hover over the map. There are a few things to note:

  1. When you hover over a country, the country name appears, along with its `Code` and its `Individuals using the Internet (% of population)` value.
  2. At the top right of the plot there is a menu bar that gives you options for selection types, zooming, resetting the plot and taking a snapshot.

It is worth taking some time to play around with this map.

### Refining the map

We can also start to refine the layout.

In [None]:
## We can add a title
fig.update_layout(title_text = "Internet usage as a percentage of population (2016)")

In [None]:
## Set geo_scope to asia to zoom into asia. This can be set to { world | north america | south america | africa | asia | europe | usa }.
fig.update_layout(title_text = "Internet usage as a percentage of population (2016)", geo_scope = "asia")

In [None]:
## Set the projection type to natural earth. By default this is "equirectangular"
fig.update_layout(title_text = "Internet usage as a percentage of population (2016)", geo_scope = "world", geo = dict(projection={"type":"natural earth"}))

Try dragging the map now and notice the rotation.

You will find other projections [here](https://plotly.com/python/reference/#layout-geo-projection).

What if we wanted to see other years? Plotly makes this easy: we can add a slider that moves through the values of a feature, e.g. year. To do this we pass the `animation_frame` parameter, set to the name of the column we wish to animate over.

In [None]:
fig = px.choropleth(internet_usage_df, locations = "Code", color = "Individuals using the Internet (% of population)", hover_name = "Country", animation_frame = "Year", color_continuous_scale = px.colors.sequential.Sunsetdark)
fig.show()

Notice that the years on the slider are not in the right order, which is annoying!

The easiest way to fix this is to sort the dataframe by year...

In [None]:
## Sort the dataset by year
internet_usage_df.sort_values(by=["Year"], inplace=True)

## Generate the figure again
fig = px.choropleth(internet_usage_df, locations = "Code", color = "Individuals using the Internet (% of population)", hover_name = "Country", animation_frame = "Year", color_continuous_scale = px.colors.sequential.Plasma)
fig.show()

## Working with other maps

Making choropleth maps requires two main types of input:

  1. Geometry information
  2. A list of values with a feature identifier

So far we have been working with one of Plotly's built-in geometries (US states and world countries) but what do we do when we want to plot on other maps?

We can supply a GeoJSON file. GeoJSON encodes the properties and coordinates of features such as polygons and can be used to define the map space. We will also need to ensure that our data has an identifier that will map onto the required field in the map file.
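
To make the structure concrete, here is a minimal hand-written GeoJSON `FeatureCollection` containing one square polygon. The feature name, `id` and coordinates are invented for illustration and do not come from any of the lab files.

```python
import json

# A made-up feature: a 1x1 degree square (positions are [longitude, latitude])
square = {
    "type": "Feature",
    "id": "Squareland",
    "properties": {"name": "Squareland"},
    "geometry": {
        "type": "Polygon",
        # One linear ring; the first and last positions must be identical
        "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]],
    },
}

toy_geojson = {"type": "FeatureCollection", "features": [square]}

# Round-trip through JSON text, as happens when loading a .json file
loaded = json.loads(json.dumps(toy_geojson))
for feature in loaded["features"]:
    ring = feature["geometry"]["coordinates"][0]
    print(feature["id"], feature["geometry"]["type"], len(ring))
```

The `id` of each feature is what the data is matched against, which is why the rest of this section concentrates on getting that identifier right.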

You can find lots of GeoJSON files online.

Let's have a look at how we might create the map above using GeoJSON.

First we need our map. [GeoJSON](https://geojson-maps.ash.ms/) lets you select parts of the world and a resolution and build a custom GeoJSON. The file [world_geoJSON.json](world_geoJSON.json) stores all 7 regions of the world in low resolution. It comes with lots of properties for each country including the ISO 3166-1 alpha-2 and alpha-3 codes (the 2 or 3 letter codes for a country). We can use these codes to link the dataset and the map but here we will show you how to do this when you don't have the codes.
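
For reference, when a map file does carry the codes, the link can be made without any name matching. The sketch below copies a code from each feature's `properties` into its `id`. The property name `iso_a3` and the toy features are assumptions for illustration, so check the keys in your own file. `px.choropleth` also accepts a `featureidkey` argument (for example `featureidkey="properties.iso_a3"`) that points at such a property directly.

```python
# Toy features standing in for a real map file; geometry omitted for brevity
toy_features = [
    {"type": "Feature", "properties": {"name": "France", "iso_a3": "FRA"}, "geometry": None},
    {"type": "Feature", "properties": {"name": "Japan", "iso_a3": "JPN"}, "geometry": None},
    {"type": "Feature", "properties": {"name": "Nowhere"}, "geometry": None},
]

linked, skipped = [], []
for feature in toy_features:
    # "iso_a3" is an assumed property name; .get avoids a KeyError if it is absent
    code = feature["properties"].get("iso_a3")
    if code:
        # Copy the feature and add the code as its top-level id
        linked.append({**feature, "id": code})
    else:
        skipped.append(feature["properties"]["name"])

print([f["id"] for f in linked])
print(skipped)
```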

We need to modify the GeoJSON data to add a new key named `id` to each feature. This `id` value must match a key in the dataset, i.e. the country name in the `Country` column of our data.

In [None]:
import json

## Load in the geoJSON file
world_path = "world_geoJSON.json"
with open(world_path) as f:
    geo_world = json.load(f)

## Load the data set
internet_usage_url = "https://raw.githubusercontent.com/TrainingByPackt/Interactive-Data-Visualization-with-Python/master/datasets/share-of-individuals-using-the-internet.csv"
internet_usage_df=pd.read_csv(internet_usage_url)

## Create a temporary dataframe
tmp = internet_usage_df.set_index('Country')
print(tmp.head())

## Instantiating necessary lists
found = []
missing = []
countries_geo = []

## Looping over the GeoJSON file
for country in geo_world['features']:
    
    ## Checking if that country is in the dataset
    country_name = country['properties']['name']
    if country_name in tmp.index:
        
        ## Adding country to our "Matched/found" countries
        found.append(country_name)
        
        ## Getting the geometry from the GeoJSON file
        geometry = country['geometry']
        
        ## Adding 'id' information for match between map and data 
        countries_geo.append({
            'type': 'Feature',
            'geometry': geometry,
            'id':country_name
        })
        
    ## Else, adding the country to the missing countries
    else:
        missing.append(country_name)

## Displaying metrics
print(f'Countries found    : {len(found)}')
print(f'Countries not found: {len(missing)}')
geo_world_ok = {'type': 'FeatureCollection', 'features': countries_geo}

print(missing)

This tells us that the `name` property in our GeoJSON map corresponds pretty well to the `Country` column in our dataset. However, 20 countries are not found. This is likely down to differences in spelling, e.g. *United States*, *The United States*, *US* and *USA* all refer to the same country, as do *Côte d'Ivoire* and *Cote d'Ivoire*. To deal with this we will create a conversion dictionary so that the GeoJSON we build uses the dataset's spelling of each country.
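
Finding the right spelling for each missing name by eye is tedious. One optional aid, which is not part of the original workflow, is the standard library's `difflib.get_close_matches`, which suggests the most similar name. The name lists below are short hand-made samples rather than the real `missing` list, and every suggestion still needs checking by a person: abbreviations or entirely different names can yield no suggestion, or a wrong one.

```python
import difflib

# Hand-made samples standing in for the GeoJSON names and the dataset names
geojson_names = ["Côte d'Ivoire", "Dominican Rep.", "Lao PDR"]
dataset_names = ["Cote d'Ivoire", "Dominican Republic", "Laos", "France"]

suggestions = {}
for name in geojson_names:
    # Return at most one candidate whose similarity ratio is at least 0.6
    matches = difflib.get_close_matches(name, dataset_names, n=1, cutoff=0.6)
    suggestions[name] = matches[0] if matches else None

for name, guess in suggestions.items():
    print(f"{name!r} -> {guess!r}")
```

Names that come back as `None` are the ones that need a manual entry in the conversion dictionary.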

In [None]:
country_conversion_dict = {
   'Dominican Rep.' :  'Dominican Republic',
   'N. Cyprus' : 'Cyprus',
   'Lao PDR' : 'Laos',
   'Korea' : 'South Korea',
   'Syria' : 'Syrian Arab Republic',
   'Timor-Leste' : 'Timor',
   'Central African Rep.' : 'Central African Republic',
   "Côte d'Ivoire" : "Cote d'Ivoire",
   'Dem. Rep. Korea' : 'North Korea',
   'Dem. Rep. Congo' : 'Democratic Republic of Congo',
   'Eq. Guinea' : 'Equatorial Guinea',
   'Equatorial Guinea' : 'Equatorial Guinea',
   'S. Sudan' : 'South Sudan',
   'Bosnia and Herz.': 'Bosnia and Herzegovina',
   'Somaliland':'Somalia',
   'Czech Rep.' : 'Czech Republic',
   'Solomon Is.' : 'Solomon Islands' 
}

We use the dictionary to create the correct GeoJSON data for our map.

In [None]:
## Load in the geoJSON file
world_path = "world_geoJSON.json"
with open(world_path) as f:
    geo_world = json.load(f)

## Load the data set
internet_usage_url = "https://raw.githubusercontent.com/TrainingByPackt/Interactive-Data-Visualization-with-Python/master/datasets/share-of-individuals-using-the-internet.csv"
internet_usage_df=pd.read_csv(internet_usage_url)

## Create a temporary dataframe
tmp = internet_usage_df.set_index('Country')
print(tmp.head())

## Instantiating necessary lists
found = []
missing = []
countries_geo = []

## Looping over the custom GeoJSON file
for country in geo_world['features']:
    
    ## Country name detection
    country_name = country['properties']['name'] 
    
    ## Replace the name with the dataset's spelling if it is in our conversion dictionary
    country_name = country_conversion_dict.get(country_name, country_name)
    go_on = country_name in tmp.index
    
    ## If the (possibly converted) country name is in the dataset
    if go_on:
        
        ## Adding country to our "Matched/found" countries
        found.append(country_name)
        
        ## Getting the geometry from the GeoJSON file
        geometry = country['geometry']
        
        ## Adding 'id' information for further match between map and data 
        countries_geo.append({
            'type': 'Feature',
            'geometry': geometry,
            'id':country_name
        })
        
    ## Else, adding the country to the missing countries
    else:
        missing.append(country_name)

## Displaying metrics
print(f'Countries found    : {len(found)}')
print(f'Countries not found: {len(missing)}')
geo_world_ok = {'type': 'FeatureCollection', 'features': countries_geo}

print(missing)

There are now just 4 countries not found. Let's go ahead and plot this, providing:

*   The DataFrame
*   The geoJSON file
*   The column in the DataFrame that links to the geoJSON file
*   The column of the DataFrame that will determine the colour
*   The colour scale we want to use
*   The column of the DataFrame we want to animate over

In [None]:
## Sort the dataset by year
internet_usage_df.sort_values(by=["Year"], inplace=True)

## Create figure
fig = px.choropleth(internet_usage_df, geojson = geo_world_ok, locations='Country', 
                    color = internet_usage_df['Individuals using the Internet (% of population)'],  color_continuous_scale = px.colors.sequential.Plasma, animation_frame = "Year")
fig.show()


## Scatter plots and Bubble plots on Maps

Scatter plots allow us to pinpoint locations on a map rather than show trends by area.

### UK pubs Example

In this example, we are going to map the locations of all the pubs within the UK, according to the dataset found in [open_pubs.csv](open_pubs.csv). Let's start by looking at the data.

In [None]:
df = pd.read_csv('open_pubs.csv')
print(df.describe())
print("====================")
print(df.info())

As we can see, whilst some columns have been assigned the correct type, the latitude and longitude have not. Let's correct this and check again.
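
To see what `errors = 'coerce'` does, here is a tiny sketch on a hand-made column. The placeholder string `\N` is an assumed example of a non-numeric entry, not a claim about what `open_pubs.csv` contains. Any value that cannot be parsed becomes `NaN`, and the result is a numeric column.

```python
import pandas as pd

# Hand-made column mixing numeric strings with an unparseable placeholder
raw = pd.Series(["51.97039", "\\N", "52.03859"])

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising
clean = pd.to_numeric(raw, errors="coerce")

print(clean)
print("Missing values:", int(clean.isna().sum()))
```

This is also why the later cell calls `dropna()`: rows whose coordinates could not be parsed have no position to plot.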

In [None]:
df = pd.read_csv('open_pubs.csv')

df['latitude'] = pd.to_numeric(df['latitude'],errors = 'coerce')
df['longitude'] = pd.to_numeric(df['longitude'],errors = 'coerce')
df['name'] = df['name'].astype('string')  ## May need to replace 'string' with 'str', depending on your version of pandas and Python

print(df.describe())
print("====================")
print(df.info())

Now that our data has been cleaned up, we can plot it. Plotly has a handy trace type called `Scattergeo` that plots points on a map. We can add this to a blank figure and let Plotly draw a map underneath. To make the data easier to see, we are also going to zoom in and centre the map on the mean position of all the pubs.

In [None]:
## Generate the scatter map
fig = go.Figure()
fig.add_trace(go.Scattergeo(
    lon=df["longitude"],
    lat=df["latitude"],
    text =df['name'],
    marker = dict(
        size = 1, color = 'rgb(217,0, 0)')
))

## Move and zoom the map into place
fig.update_layout(
    geo = dict(
        scope = 'europe',
        projection={"type" : "equirectangular", "scale": 4},
        center = {'lat': df.latitude.mean(),
                'lon': df.longitude.mean()},
        resolution = 50,
        landcolor = 'rgb(217, 217, 217)',
        showocean = True,
    )
)

## Let's see what we have so far!
fig.show()

If we want a more detailed map, we can import a UK GeoJSON file and overlay the pubs on it. The file comes from [here](https://github.com/martinjc/UK-GeoJSON); we have downloaded a copy of the required file for you, called [uk_geoJSON.json](uk_geoJSON.json). This uses the same technique we used when loading the world map and checking for countries, but this time we are looking for local authorities! Because the UK GeoJSON file is large (12+ MB), the map may take a fair amount of time to generate!

In [None]:
## Import UK GeoJSON file
uk_path = "uk_geoJSON.json"
with open(uk_path) as f:
    geo_uk = json.load(f)

## Clean out N/A values and get list of local authorities
df = df.dropna()
tmp = df.set_index('local_authority')

county_geo = []

## For each county...
for county in geo_uk['features']:
    county_name = county['properties']['LAD13NM']

    ## If the county name is present in our data, store the Geo data
    if county_name in tmp.index:
        county_geo.append({
            'type': 'Feature',
            'geometry': county['geometry'],
            'id': county_name
        })
    else:
        print(county_name + " NOT present")

## Package up the Geo data into the required format...
geo_uk_ok = {'type': 'FeatureCollection', 'features': county_geo}

## Draw the choropleth first, this will make sure it is under everything else
fig = px.choropleth(
    title="Map of Pubs in the UK",
    geojson=geo_uk_ok,
    locations=df['local_authority'],
    scope='europe',
    center = {'lat': df.latitude.mean(),
            'lon': df.longitude.mean()},
    height=1250,
    width=1500
)
fig.data[0].name='Detailed Map'
 
## Add our trace for the pubs
fig.add_trace(go.Scattergeo(
    name='Pubs',
    lon=df["longitude"],
    lat=df["latitude"],
    text =df['name'],
    marker = dict(
        size = 1, color = 'rgb(217,0, 0)')
))

## Let's centre the image on our data
fig.update_geos(fitbounds="locations")

## Show us what you've got!
fig.show()

You can see how easy it would be to also use another feature to determine the size of the marker. You may also notice that some areas were not displayed. We can use the dictionary process discussed previously to fix this.
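
As a sketch of the first idea, the snippet below counts pubs per local authority and takes the mean position of each group, which gives one bubble per authority. The rows are invented stand-ins for `open_pubs.csv`, and the `marker_size` scaling is an arbitrary choice for illustration.

```python
import pandas as pd

# Invented rows using the same column names as the lab dataset
pubs = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "local_authority": ["Leeds", "Leeds", "Leeds", "York", "York"],
    "latitude": [53.80, 53.82, 53.78, 53.96, 53.94],
    "longitude": [-1.55, -1.53, -1.57, -1.08, -1.10],
})

# One row per authority: number of pubs and the mean position of its pubs
bubbles = (
    pubs.groupby("local_authority")
    .agg(
        pub_count=("name", "size"),
        latitude=("latitude", "mean"),
        longitude=("longitude", "mean"),
    )
    .reset_index()
)

# Arbitrary scaling so the markers are visible; tune this for real data
bubbles["marker_size"] = bubbles["pub_count"] * 5

print(bubbles)
```

The resulting `latitude`, `longitude` and `marker_size` columns could then be passed to `go.Scattergeo` as `lat`, `lon` and `marker = dict(size = ...)`, in the same way as the pubs trace above.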


### Covid cases Example

Let's have a look at another example. As much as we may want to forget about it, Covid-19 affected every country, with some countries being hit harder than others. We can plot the number of cases per country using the dataset given. As always, let's have a look at the data.

In [None]:
covid_df = pd.read_csv("time_series_covid19_confirmed_global.csv")
covid_df.head() 

As you can see, we have a country/state per row, along with the position of the marker and the number of cases on each of a range of dates. As with the previous data, we can place a marker for each row. In this case, however, we can make the size of the marker a function of the number of cases on a given date. Using the knowledge you have gained in this lab, you can see how we could alter this to animate over the dates, or add a more detailed map underneath.
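
One way to prepare the data for animating over dates is to reshape it from wide (one column per date) to long (one row per place and date) with `DataFrame.melt`. The sketch below uses a two-row invented table with a column layout similar to the lab file; the place names and figures are made up. It also scales marker size with the square root of the count so that marker area, rather than diameter, tracks the number of cases. That scaling is a common design choice rather than something the lab prescribes.

```python
import pandas as pd

# Invented figures in a wide layout similar to the lab file
wide = pd.DataFrame({
    "Country/Region": ["Country A", "Country B"],
    "Lat": [10.0, 20.0],
    "Long": [30.0, 40.0],
    "6/15/22": [100, 400],
    "6/16/22": [900, 1600],
})

# Every column that is not an identifier is treated as a date column
id_cols = ["Country/Region", "Lat", "Long"]
long_df = wide.melt(id_vars=id_cols, var_name="Date", value_name="Cases")

# Square-root scaling: marker area grows in proportion to the case count
long_df["marker_size"] = long_df["Cases"] ** 0.5

print(long_df)
```

The long table has a `Date` column that could be passed as `animation_frame` to a Plotly Express function such as `px.scatter_geo`, just as `Year` was used for the choropleth earlier.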

In [None]:
## Generate the figure and add our markers to it
fig = go.Figure()
fig.add_trace(go.Scattergeo(
    lon=covid_df["Long"],
    lat=covid_df["Lat"],
    text =covid_df['Country/Region'],
    marker = dict(
        size = covid_df['6/16/22']*.000001, color = 'rgb(217,0, 0)')  ## This sets the size of the marker
))

## Update the layout to make it look a bit nicer...
fig.update_layout(
    geo = dict(
        scope = 'world',
        projection={"type" : "equirectangular"},
        resolution = 50,
        landcolor = 'rgb(217, 217, 217)',
        showocean = True,
    )
)

## Display our map
fig.show()

## Exercise

The renewable energy production and consumption datasets, available [here](https://raw.githubusercontent.com/TrainingByPackt/Interactive-Data-Visualization-with-Python/master/datasets/share-of-electricity-production-from-renewable-sources.csv) and [here](https://raw.githubusercontent.com/TrainingByPackt/Interactive-Data-Visualization-with-Python/master/datasets/renewable-energy-consumption-by-country.csv), are produced and managed by the *Our World in Data* team (more info [here](https://ourworldindata.org/renewable-energy#)).

Your task is to create animated choropleth maps for the total renewable energy production and consumption across different countries in the world between 2007 and 2017.

Take the time to refine your visualisations.

In [None]:
## Exercise code block here! ##