# Data Visualisation - Lab 2 - Data Passing

---

**Authors: Claire Rocks, Richard Kirk and Saif Anwar**

---

Welcome to the second lab for Data Visualisation.

In this lab we are going to look a bit more at acquiring data and pre-processing it before visualisation.

The first step is to load some data into your application.  Unless you own the data and it's stored in a certain way, things can get a little messy.  

  * How do you find good sources?
  * Do you have the right to use the data?
  * Is the data extracted in a format that is easy for the application to use?
  * What to do when the data set is very large?

In this lab, we will introduce some possible data sources.  We'll also look at a few different techniques for getting data, and at how to treat the data to make sure that it is clean and ready to be used.


## Acquiring Data

Using good search terms in your search engine is a good start when looking for data sets. Specifying the file format, or whether you want a feed or a downloadable file, might yield better results than more general search terms.

There are also many sites that are very good for acquiring datasets, e.g.

  * [Kaggle](https://www.kaggle.com/)
  * [UK government](https://data.gov.uk/)
  * [Google Cloud](https://cloud.google.com/datasets)
  * [NASA](https://www.earthdata.nasa.gov/)

Other organisations make their data available through APIs, e.g.

  * [OECD API](https://data.oecd.org/api/)
  * [Kaggle API](https://www.kaggle.com/docs/api)
  * [Twitter API](https://developer.twitter.com/en/docs/twitter-api)

Where there are no existing data sets and no APIs to help you generate one, you might find yourself scraping data from the web.

## Setup for the Lab
First up, let's set up some key libraries that we will require for the rest of the lab. These include:
  * ```pandas``` - used to store the data
  * ```matplotlib``` - used to plot some graphs
  * ```numpy``` - used to import common maths functions
  * ```seaborn``` - used to import a collection of data

In [3]:
# %pip install numpy pandas seaborn matplotlib

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np




## Parsing data

Data comes in all shapes and sizes, whether that be from large data sets that are used to [picture black holes](https://www.extremetech.com/extreme/289423-it-took-half-a-ton-of-hard-drives-to-store-eht-black-hole-image-data) to small datasets like the [iris dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set). However, in nearly all cases, the data needs to be preprocessed in order to avoid anomalies from ruining our visualisations. 

Parsing data converts a raw stream of data into a structure that can be manipulated in our application. For the most part, we are looking to get data into a pandas DataFrame.
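
As a minimal sketch of what parsing looks like in practice, the block below turns a raw block of CSV text into a DataFrame. The city names and readings are made up for illustration, and `raw_text` simply stands in for whatever a file or web request might return.

```python
import io

import pandas as pd

# Hypothetical raw data, as it might arrive from a file or a web request.
raw_text = """city,temperature_c,humidity
Coventry,12.5,81
Kenilworth,11.9,84
Sydney,24.1,60
"""

# read_csv accepts any file-like object, so StringIO lets us parse a string.
weather = pd.read_csv(io.StringIO(raw_text))

print(weather.dtypes)  # each column now has an inferred type
print(weather)
```

Once the text has been parsed, each column has a type that pandas has inferred, which is what lets us compute statistics and plot the values later on.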

## Data cleanup
In many cases, data coming from raw sources requires cleaning up. Common causes include:
  * Sensors failing or needing recalibration
  * Users entering one too many/few 0s when entering how many of an item they want
  * Reviewers writing things in subtly different ways but meaning the same thing
  * Data availability etc.

As such, we need to make some decisions about what to do with that data when we are analysing and visualising the data. There are multiple stages to cleaning the data, some of which will be described here.

### Invalid data
When collecting data that will later be analysed, there will be occasions where the data is not available or is not a valid value. As such, we need to be able to detect these, and correct as appropriate.

Let's take a hypothetical movie database. In this database, the cast and crew of a large body of films have been stored. These films span the full range, from massive blockbusters to small indie projects. For many larger films, the cast and crew will be properly documented due to the film's popularity. However, for smaller films, it is less likely the cast and crew will be provided. This could be for multiple reasons: maybe it is an old film and nobody knows any more, maybe there was a dispute about the film, or maybe a person is embarrassed by the film and wants to remove their association.

Invalid data may not just be the absence of data. Say, for example, the finances and box office earnings were also stored in this database. Again, for larger films, these are made public, validated and (often) boasted about by production companies. On the other hand, small films are not tracked anywhere near as thoroughly, so these values may go unreported and end up stored as 0.

We therefore need to take action: in the above example, the raw data would otherwise appear to show a considerable number of films that were both made by and starring nobody, yet still cost money to produce.

Checking for invalid data is very much dependent on the type of data you are examining. Therefore, we need different techniques for different situations. In some cases, missing data is stored as N/A. Where there are many N/As, we need to make a decision about what we are going to do with them.  

We might want to delete them, and pandas has a handy function for this! In the example shown in the function's documentation, a synthetic DataFrame contains N/A fields (represented by ```np.nan``` and ```pd.NaT```, which can be used interchangeably in this case). By calling ```dropna``` on a given DataFrame, we can get rid of all the rows that contain them.
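
To make this concrete, here is a small sketch using a made-up film table (the titles, names and dates are invented for illustration). It shows `dropna` with its default behaviour and with the `subset` parameter.

```python
import numpy as np
import pandas as pd

# Synthetic (made-up) film data with some missing entries.
films = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C", "Film D"],
    "director": ["J. Smith", np.nan, "A. Jones", np.nan],
    "released": [pd.Timestamp("2001-05-01"), pd.Timestamp("1999-11-12"),
                 pd.NaT, pd.Timestamp("2010-02-03")],
})

print(films.isna().sum())  # count the missing values in each column

# Drop every row that has a missing value in any column.
complete = films.dropna()

# Drop only the rows where the director is missing.
has_director = films.dropna(subset=["director"])

print(complete)
print(has_director)
```

Note that `dropna` returns a new DataFrame by default, so `films` itself is left unchanged.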

In [4]:
?pd.DataFrame.dropna

Signature:
pd.DataFrame.dropna(
    self,
    *,
    axis: 'Axis' = 0,
    how: 'AnyAll | lib.NoDefault' = <no_default>,
    thresh: 'int | lib.NoDefault' = <no_default>,
    subset: 'IndexLabel | None' = None,
    inplace: 'bool' = False,
    ignore_index: 'bool' = False,

We might also want to replace missing data with another value, such as the mean of the dataset, the mean of the previous 3 or so values, or the median of the data.  

pandas also has a handy function for this, `replace`!

  * `df['column name'] = df['column name'].replace(['old value'],'new value')`

This function replaces an old value with a new value in a single column. We can also do this with multiple values or across the entire dataset:

  * `df['column name'] = df['column name'].replace(['1st old value','2nd old value',...],'new value')`

  * `df = df.replace(['old value'],'new value')`
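
As a sketch of how this might look in practice, the block below uses made-up box office figures where 0 stands in for "not reported". It uses `replace` to turn the placeholder into a proper missing value, then `fillna` (a separate pandas method, not used elsewhere in this lab) to fill the gaps with the mean. Whether the mean is a sensible substitute is an assumption that depends on your data.

```python
import numpy as np
import pandas as pd

# Made-up box office figures (in millions); 0 stands in for "not reported".
box_office = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C", "Film D"],
    "gross": [120.0, 0.0, 80.0, 0.0],
})

# Step 1: turn the placeholder 0 into a proper missing value.
box_office["gross"] = box_office["gross"].replace(0, np.nan)

# Step 2: fill the gaps with the mean of the values that are present.
mean_gross = box_office["gross"].mean()  # NaN values are skipped by default
box_office["gross_filled"] = box_office["gross"].fillna(mean_gross)

print(box_office)
```

Keeping the filled values in a separate column makes it easy to see which entries were originally missing.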

What you do will be dependent on the data, and the insights you are trying to gain or show.


### Parsing numerical data
There are always outliers and exceptions to every case, and again we need to decide if they are genuine or mistakes, and whether including them helps us. For example, consider the following data set:

In [5]:
taxis = sns.load_dataset('taxis')
taxis

Unnamed: 0,pickup,dropoff,passengers,distance,fare,tip,tolls,total,color,payment,pickup_zone,dropoff_zone,pickup_borough,dropoff_borough
0,2019-03-23 20:21:09,2019-03-23 20:27:24,1,1.60,7.0,2.15,0.0,12.95,yellow,credit card,Lenox Hill West,UN/Turtle Bay South,Manhattan,Manhattan
1,2019-03-04 16:11:55,2019-03-04 16:19:00,1,0.79,5.0,0.00,0.0,9.30,yellow,cash,Upper West Side South,Upper West Side South,Manhattan,Manhattan
2,2019-03-27 17:53:01,2019-03-27 18:00:25,1,1.37,7.5,2.36,0.0,14.16,yellow,credit card,Alphabet City,West Village,Manhattan,Manhattan
3,2019-03-10 01:23:59,2019-03-10 01:49:51,1,7.70,27.0,6.15,0.0,36.95,yellow,credit card,Hudson Sq,Yorkville West,Manhattan,Manhattan
4,2019-03-30 13:27:42,2019-03-30 13:37:14,3,2.16,9.0,1.10,0.0,13.40,yellow,credit card,Midtown East,Yorkville West,Manhattan,Manhattan
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6428,2019-03-31 09:51:53,2019-03-31 09:55:27,1,0.75,4.5,1.06,0.0,6.36,green,credit card,East Harlem North,Central Harlem North,Manhattan,Manhattan
6429,2019-03-31 17:38:00,2019-03-31 18:34:23,1,18.74,58.0,0.00,0.0,58.80,green,credit card,Jamaica,East Concourse/Concourse Village,Queens,Bronx
6430,2019-03-23 22:55:18,2019-03-23 23:14:25,1,4.14,16.0,0.00,0.0,17.30,green,cash,Crown Heights North,Bushwick North,Brooklyn,Brooklyn
6431,2019-03-04 10:09:25,2019-03-04 10:14:29,1,1.12,6.0,0.00,0.0,6.80,green,credit card,East New York,East Flatbush/Remsen Village,Brooklyn,Brooklyn


If you have a look at entry 6429, you can see that the distance is a lot larger than those surrounding it. We can use Python to see how far it is from the mean, and to calculate how many standard deviations away the entry is.

In [6]:
## Finding the mean
print("Mean:", np.mean(taxis['distance']))
print("Entry 6429 distance from mean: ",
      np.abs(taxis['distance'][6429] - np.mean(taxis['distance'])))

print()

## Finding the standard deviation
print("Standard Deviation:", np.std(taxis['distance'], ddof=1))
print("Number of standard deviations entry 6429 is from mean: ",
      np.abs(taxis['distance'][6429] - np.mean(taxis['distance']))/np.std(taxis['distance'], ddof=1))

Mean: 3.024616819524328
Entry 6429 distance from mean:  15.715383180475671

Standard Deviation: 3.827867001011754
Number of standard deviations entry 6429 is from mean:  4.105519647449061


In this case, we could, for example, choose to leave the value alone, replace the value with the mean, or delete the row entirely before performing the visualisation. Let's try replacing it with the mean...

In [7]:
taxis.loc[6429,'distance']=np.mean(taxis['distance'])
taxis.tail()

Unnamed: 0,pickup,dropoff,passengers,distance,fare,tip,tolls,total,color,payment,pickup_zone,dropoff_zone,pickup_borough,dropoff_borough
6428,2019-03-31 09:51:53,2019-03-31 09:55:27,1,0.75,4.5,1.06,0.0,6.36,green,credit card,East Harlem North,Central Harlem North,Manhattan,Manhattan
6429,2019-03-31 17:38:00,2019-03-31 18:34:23,1,3.024617,58.0,0.0,0.0,58.8,green,credit card,Jamaica,East Concourse/Concourse Village,Queens,Bronx
6430,2019-03-23 22:55:18,2019-03-23 23:14:25,1,4.14,16.0,0.0,0.0,17.3,green,cash,Crown Heights North,Bushwick North,Brooklyn,Brooklyn
6431,2019-03-04 10:09:25,2019-03-04 10:14:29,1,1.12,6.0,0.0,0.0,6.8,green,credit card,East New York,East Flatbush/Remsen Village,Brooklyn,Brooklyn
6432,2019-03-13 19:31:22,2019-03-13 19:48:02,1,3.85,15.0,3.36,0.0,20.16,green,credit card,Boerum Hill,Windsor Terrace,Brooklyn,Brooklyn


Or we could delete the row entirely:

In [8]:
taxis = taxis.drop(index=[6429])
taxis.tail()

Unnamed: 0,pickup,dropoff,passengers,distance,fare,tip,tolls,total,color,payment,pickup_zone,dropoff_zone,pickup_borough,dropoff_borough
6427,2019-03-23 18:26:09,2019-03-23 18:49:12,1,7.07,20.0,0.0,0.0,20.0,green,cash,Parkchester,East Harlem South,Bronx,Manhattan
6428,2019-03-31 09:51:53,2019-03-31 09:55:27,1,0.75,4.5,1.06,0.0,6.36,green,credit card,East Harlem North,Central Harlem North,Manhattan,Manhattan
6430,2019-03-23 22:55:18,2019-03-23 23:14:25,1,4.14,16.0,0.0,0.0,17.3,green,cash,Crown Heights North,Bushwick North,Brooklyn,Brooklyn
6431,2019-03-04 10:09:25,2019-03-04 10:14:29,1,1.12,6.0,0.0,0.0,6.8,green,credit card,East New York,East Flatbush/Remsen Village,Brooklyn,Brooklyn
6432,2019-03-13 19:31:22,2019-03-13 19:48:02,1,3.85,15.0,3.36,0.0,20.16,green,credit card,Boerum Hill,Windsor Terrace,Brooklyn,Brooklyn


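You can also remove multiple rows by specifying an index range. Note that `taxis.index[6427:6431]` slices by position rather than by label, so after deleting row 6429 above it selects the rows labelled 6427, 6428, 6430 and 6431.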

In [9]:
taxis1 = taxis.drop(taxis.index[6427:6431])
taxis1.tail()

Unnamed: 0,pickup,dropoff,passengers,distance,fare,tip,tolls,total,color,payment,pickup_zone,dropoff_zone,pickup_borough,dropoff_borough
6423,2019-03-12 08:10:47,2019-03-12 08:35:35,1,4.3,18.5,0.0,0.0,19.3,green,credit card,Saint Albans,Hillcrest/Pomonok,Queens,Queens
6424,2019-03-30 20:52:15,2019-03-30 20:59:55,1,1.7,8.0,0.0,0.0,9.3,green,cash,Central Harlem,Central Harlem North,Manhattan,Manhattan
6425,2019-03-07 15:34:30,2019-03-07 16:31:06,1,9.12,26.32,0.0,0.0,26.82,green,credit card,Park Slope,East New York,Brooklyn,Brooklyn
6426,2019-03-28 08:04:47,2019-03-28 08:07:46,1,0.71,4.5,0.5,0.0,5.8,green,credit card,Central Park,Upper West Side North,Manhattan,Manhattan
6432,2019-03-13 19:31:22,2019-03-13 19:48:02,1,3.85,15.0,3.36,0.0,20.16,green,credit card,Boerum Hill,Windsor Terrace,Brooklyn,Brooklyn
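
Rather than inspecting rows by eye, the same standard-deviation check can be applied to every value at once. The sketch below uses a made-up series of distances rather than the taxis data, and the threshold of 3 standard deviations is an arbitrary choice for illustration, not a rule.

```python
import numpy as np
import pandas as pd

# Made-up trip distances; the last value is deliberately extreme.
distances = pd.Series([1.6, 0.79, 1.37, 7.7, 2.16, 0.75, 4.14, 1.12, 3.85, 4.3,
                       1.7, 9.12, 0.71, 7.07, 2.5, 3.1, 1.9, 2.8, 5.0, 60.0])

# Number of sample standard deviations each value lies from the mean.
z_scores = (distances - distances.mean()) / distances.std(ddof=1)

threshold = 3  # an arbitrary cut-off chosen for this sketch

# Boolean masks let us select the outliers, or everything except them.
outliers = distances[np.abs(z_scores) > threshold]
cleaned = distances[np.abs(z_scores) <= threshold]

print(outliers)
print(len(cleaned))
```

Bear in mind that an extreme value inflates both the mean and the standard deviation, so this kind of check works best as a prompt to look more closely rather than as an automatic filter.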


You may also have to convert the data types. This may involve writing custom functions, for example to remove *$* or *£* signs, before using a function like `pd.to_numeric()` - we'll see an example of this a bit later in the lab.
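
A minimal sketch of this kind of conversion is shown below. The price strings are invented, and the set of characters being stripped (currency symbols and thousands separators) is an assumption about what the raw data contains.

```python
import pandas as pd

# Made-up price strings in the kind of format a scraped table might contain.
prices = pd.Series(["£1,250.50", "$300", "£45", "n/a"])

# Strip currency symbols and thousands separators using a character class.
stripped = prices.str.replace(r"[£$,]", "", regex=True)

# errors="coerce" turns anything that is still not a number into NaN.
numeric_prices = pd.to_numeric(stripped, errors="coerce")

print(numeric_prices)
```

Using `errors="coerce"` means unparseable entries become missing values, which can then be handled with the techniques from the invalid data section.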

### Rearranging Data Frames

In some cases, we may need to rearrange the data frames themselves. This allows for better readability, as well as selecting only particular columns and rows. This is referred to as *melting* the data. This can get complex quite quickly, so the pandas library has a handy `melt` function. 

The `melt()` function is used to unpivot a given DataFrame from wide format to long format. It is also useful to reformat a DataFrame into a format where one or more columns are identifier variables, while all other columns, considered measured variables, are unpivoted to the row axis.

In [10]:
# creating a dataframe
df = pd.DataFrame({'Name': {0: 'Claire', 1: 'Richard', 2: 'Bob'},
                   'Favourite Food': {0: 'Sushi', 1: 'Pizza', 2: 'Sandwiches'},
                   'Age': {0: 27, 1: 23, 2: 21}})
df

Unnamed: 0,Name,Favourite Food,Age
0,Claire,Sushi,27
1,Richard,Pizza,23
2,Bob,Sandwiches,21


In [11]:
# Name is id_vars and Favourite Food is value_vars
pd.melt(df, id_vars =['Name'], value_vars =['Favourite Food'])

Unnamed: 0,Name,variable,value
0,Claire,Favourite Food,Sushi
1,Richard,Favourite Food,Pizza
2,Bob,Favourite Food,Sandwiches


In [12]:
# multiple unpivot columns
pd.melt(df, id_vars =['Name'], value_vars =['Favourite Food', 'Age'])

Unnamed: 0,Name,variable,value
0,Claire,Favourite Food,Sushi
1,Richard,Favourite Food,Pizza
2,Bob,Favourite Food,Sandwiches
3,Claire,Age,27
4,Richard,Age,23
5,Bob,Age,21
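
The default column names `variable` and `value` are not very descriptive. As a small sketch, the `var_name` and `value_name` parameters of `melt` can be used to rename them. The DataFrame is rebuilt here under a different name so the block runs on its own, and the labels `Attribute` and `Answer` are arbitrary choices.

```python
import pandas as pd

# The same small table as above, rebuilt so this block is self-contained.
people = pd.DataFrame({
    "Name": ["Claire", "Richard", "Bob"],
    "Favourite Food": ["Sushi", "Pizza", "Sandwiches"],
    "Age": [27, 23, 21],
})

long_people = pd.melt(
    people,
    id_vars=["Name"],
    value_vars=["Favourite Food", "Age"],
    var_name="Attribute",  # replaces the default column name "variable"
    value_name="Answer",   # replaces the default column name "value"
)

print(long_people)
```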


More information on this function can be found by running the code block below.

In [13]:
?pd.melt

Signature:
pd.melt(
    frame: 'DataFrame',
    id_vars=None,
    value_vars=None,
    var_name=None,
    value_name: 'Hashable' = 'value',
    col_level=None,
    ignore_index: 'bool' = True,
) -> 'DataFrame'
Docstring:
Unpivot a DataFrame from wide to long format, optionally leaving identifiers set.

This function is useful to massage a DataFrame into a format where one

## Data scraping

**PLEASE NOTE: Because this is using a live website, the website may change its underlying structure, so may break the code presented in this section at any moment. However, the principles covered should still apply**

The web is full of data, most of which cannot be obtained through APIs or pre-prepared datasets. In these cases, we need to gather this data ourselves through the use of data scraping. In this section, we will be going through a simple example to collect some data from a public website.

For this section, we will be using the following libraries:
  * ```beautifulsoup4``` - HTML and XML parser (more info can be found [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/))
  * ```pandas``` - stores data in a nice format for data analysis
  * ```requests``` - allows for requests to be made to external websites for files

As always, the first thing we need to do is install and import the libraries we need.

In [18]:
# %pip install beautifulsoup4 pandas requests
from bs4 import BeautifulSoup
import pandas as pd
import requests

### Downloading a webpage

We can then specify a webpage and use a combination of **requests** and **BeautifulSoup** to download the page and parse it using the `html.parser` parser. We can use `prettify` to see all the content on the page.

Here, we are going to extract a list of the largest companies in the US by revenue from https://en.wikipedia.org/wiki/List_of_largest_companies_in_the_United_States_by_revenue

In [None]:
url = "https://en.wikipedia.org/wiki/List_of_largest_companies_in_the_United_States_by_revenue"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
print(soup.prettify())

Look at the output.  You can see the HTML tags.  We can use those tags, with ```find()``` and ```find_all()```, to extract the information we want. ```find()``` returns the first match; ```find_all()``` returns a list of all matches.

The information we want is inside a ```<table>``` tag, but there are several tables on the page.  Here we use indexing to pick the right one, although we could also select the table by its class.
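
The difference between these approaches can be sketched on a tiny, made-up page, so that the example does not depend on a live website. The HTML and the class names `infobox` and `wikitable` are assumptions for illustration; inspect the real page to see which classes its tables actually use.

```python
from bs4 import BeautifulSoup

# A tiny, made-up page with two tables; the class names are assumptions.
html = """
<html><body>
  <table class="infobox"><tr><th>Note</th></tr><tr><td>ignore me</td></tr></table>
  <table class="wikitable sortable">
    <tr><th>Rank</th><th>Name</th></tr>
    <tr><td>1</td><td>Example Corp</td></tr>
  </table>
</body></html>
"""

demo_soup = BeautifulSoup(html, "html.parser")

first_table = demo_soup.find("table")     # the first <table> only
all_tables = demo_soup.find_all("table")  # a list-like collection of every <table>

by_index = all_tables[1]                                # pick a table by position
by_class = demo_soup.find("table", class_="wikitable")  # pick a table by class

print(len(all_tables))
print(by_class is by_index)
```

Selecting by class tends to be more robust than indexing if tables are added to or removed from the page, although it will still break if the class names change.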


In [20]:
table = soup.find_all("table")[1]

In [None]:
print(table)

Notice that the table headings have a ```th``` tag.  We can use that to extract the table titles from ```table```

In [22]:
titles = table.find_all('th')

In [23]:
titles

[<th>Rank
 </th>,
 <th>Name
 </th>,
 <th>Industry
 </th>,
 <th>Revenue <br/>(USD millions)
 </th>,
 <th>Revenue growth
 </th>,
 <th>Employees
 </th>,
 <th>Headquarters
 </th>]

We can tidy this up further to help us get it into a format ready for a DataFrame.

We create a new list, ```table_titles```, by iterating over each element in the ```titles``` list, extracting the text content of the element, and stripping any leading and trailing whitespace. ```print(table_titles)``` then prints the resulting list.

In [25]:
table_titles = [title.text.strip() for title in titles]
print(table_titles)

['Rank', 'Name', 'Industry', 'Revenue (USD millions)', 'Revenue growth', 'Employees', 'Headquarters']


Now we can start to think about getting our data into a DataFrame called ```df```.  The following code creates an empty pandas DataFrame with the column names specified by ```table_titles```.

In [28]:
df = pd.DataFrame(columns = table_titles)
df

Unnamed: 0,Rank,Name,Industry,Revenue (USD millions),Revenue growth,Employees,Headquarters


Now we need to get to the data.  Looking back at ```soup```, we can see that the data is inside ```<td>``` tags, each encapsulated in a ```<tr>``` tag (d for data, r for row).

The first job is to find all of the ```<tr>``` elements and store them in a new list, ```column_data```.

In [34]:
column_data = table.find_all('tr')

We then need to:

* Loop through the rows in ```column_data```, skipping the heading row.
* As we loop through each row, look for the ```<td>``` tags (cells) within that row and store them in ```row_data```.
* Iterate over each cell in ```row_data``` (```data``` is an individual ```<td>``` element). For each cell, we extract the text content (```data.text```) and strip any leading and trailing whitespace (```strip()```).


In [37]:
for row in column_data[1:]:
    row_data = row.find_all('td')
    individual_row_data = [data.text.strip() for data in row_data]
    print(individual_row_data)
    

['1', 'Walmart', 'Retail', '611,289', '6.7%', '2,100,000', 'Bentonville, Arkansas']
['2', 'Amazon', 'Retail and cloud computing', '513,983', '9.4%', '1,540,000', 'Seattle, Washington']
['3', 'ExxonMobil', 'Petroleum industry', '413,680', '44.8%', '62,000', 'Spring, Texas']
['4', 'Apple', 'Electronics industry', '394,328', '7.8%', '164,000', 'Cupertino, California']
['5', 'UnitedHealth Group', 'Healthcare', '324,162', '12.7%', '400,000', 'Minnetonka, Minnesota']
['6', 'CVS Health', 'Healthcare', '322,467', '10.4%', '259,500', 'Woonsocket, Rhode Island']
['7', 'Berkshire Hathaway', 'Conglomerate', '302,089', '9.4%', '383,000', 'Omaha, Nebraska']
['8', 'Alphabet', 'Technology and cloud computing', '282,836', '9.8%', '156,000', 'Mountain View, California']
['9', 'McKesson Corporation', 'Health', '276,711', '4.8%', '48,500', 'Irving, Texas']
['10', 'Chevron Corporation', 'Petroleum industry', '246,252', '51.6%', '43,846', 'San Ramon, California']
['11', 'Cencora', 'Pharmacy wholesale', '238

We're nearly there - we just need to get this into the DataFrame. It is easier to do this as we loop through.

In [38]:
for row in column_data[1:]:
    row_data = row.find_all('td')
    individual_row_data = [data.text.strip() for data in row_data ]
    
    length = len(df)
    df.loc[length] = individual_row_data
    
df


Unnamed: 0,Rank,Name,Industry,Revenue (USD millions),Revenue growth,Employees,Headquarters
0,1,Walmart,Retail,611289,6.7%,2100000,"Bentonville, Arkansas"
1,2,Amazon,Retail and cloud computing,513983,9.4%,1540000,"Seattle, Washington"
2,3,ExxonMobil,Petroleum industry,413680,44.8%,62000,"Spring, Texas"
3,4,Apple,Electronics industry,394328,7.8%,164000,"Cupertino, California"
4,5,UnitedHealth Group,Healthcare,324162,12.7%,400000,"Minnetonka, Minnesota"
...,...,...,...,...,...,...,...
95,96,Best Buy,Retail,46298,10.6%,71100,"Richfield, Minnesota"
96,97,Bristol-Myers Squibb,Pharmaceutical industry,46159,0.5%,34300,"New York City, New York"
97,98,United Airlines,Airline,44955,82.5%,92795,"Chicago, Illinois"
98,99,Thermo Fisher Scientific,Laboratory instruments,44915,14.5%,130000,"Waltham, Massachusetts"


As above, the loop:

* Skips the header row and iterates over the remaining rows.
* Finds all cell elements in the current row.
* Extracts and cleans the text from each cell.

Then it:
* Uses ```length = len(df)``` to determine the current length of the DataFrame.
* Uses ```df.loc[length] = individual_row_data``` to append the cleaned data as a new row to the DataFrame.
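
An alternative sketch, not the approach used in this lab, is to collect the rows in a plain Python list and build the DataFrame once at the end. This avoids growing the DataFrame one row at a time and makes it easy to skip rows that do not have the expected number of cells. The table below is made up so that the example runs without a network connection.

```python
import pandas as pd
from bs4 import BeautifulSoup

# A small, made-up table standing in for the scraped one.
html = """
<table>
  <tr><th>Rank</th><th>Name</th><th>Employees</th></tr>
  <tr><td>1</td><td>Example Corp</td><td>2,100</td></tr>
  <tr><td>2</td><td>Sample Ltd</td><td>1,540</td></tr>
</table>
"""
demo_table = BeautifulSoup(html, "html.parser").find("table")

headers = [th.text.strip() for th in demo_table.find_all("th")]

rows = []
for tr in demo_table.find_all("tr")[1:]:  # skip the header row
    cells = [td.text.strip() for td in tr.find_all("td")]
    if len(cells) == len(headers):        # ignore malformed rows
        rows.append(cells)

# Build the DataFrame in one step from the collected rows.
companies = pd.DataFrame(rows, columns=headers)
print(companies)
```

Every value is still a string at this point, so numeric columns would need converting before any analysis.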

Finally, we can export the DataFrame as a ```.csv``` file if that is useful. 

In [39]:
df.to_csv("companies.csv", index = False)

## Summary

In summary we have:
* Brought in the required libraries and packages
* Specified the url for the page we want to scrape
* Made our soup
* Found the table we wanted and selected it
* Working with the table, we:
    * Created a dataframe using the column titles
    * Got the individual row data and added that to the DataFrame
    * Exported the DataFrame to csv

## Exercises

### Exercise 1 - Analysing the scraped data
Using the data provided above, find the following information:
  * The average revenue
  * The average revenue growth
  * The average number of employees
  * The company with the largest number of employees
  * Any anomalous results (there may be none present)
  * Produce an interesting plot of data in the DataFrame

In [40]:
## Exercise 1 code here! ##

### Exercise 2
The BBC weather website has a connected RSS feed that allows users to pull information about the current weather for a given region (identified by a 7-digit ID number). For example, if you go [here](https://weather-broker-cdn.api.bbci.co.uk/en/observation/rss/2652221), you get the weather data for Coventry in a handy XML format. By changing the last 7 digits to a different ID number, you can get the weather somewhere else.

Examine the weather for the following places:
 * Kenilworth (ID: 2645822)
 * Washington DC (ID: 4140963)
 * Sydney (ID: 2147714)

In [41]:
## Exercise 2 code here! ##

### Exercise 3 - Scraping weather data
Using your knowledge of data scraping and data parsing, produce a DataFrame storing the following information for 1000 ID values (for example, all IDs between 2645000 and 2646000, or 1000 random unique IDs):

  * ID
  * Place Name
  * Place Latitude
  * Place Longitude
  * Time
  * Temperature
  * Wind direction
  * Wind speed
  * Humidity
  * Pressure
  * Visibility

  **NB:** Some IDs may not be valid, so you may not get 1000 valid entries. An example of an invalid ID is [2652230](https://weather-broker-cdn.api.bbci.co.uk/en/observation/rss/2652230). In these cases, no data should be stored for these IDs.

In [42]:
## Exercise 3 code here! ##