IM904 Labs Week 3

Aims

By the end of this session you should be able to:

Critically consider how to choose a research question.
Scrape data from a web page or copy website information into Excel.
Consider the elements of most empirical research projects.
Perform basic queries within the TCAT tool.

Preparation

Please do the following before attending the lab session:

Watch the presentation below.
Look briefly at the driverless cars and Brexit pages of the Guardian website. Our topics for the workshops are the post-referendum Brexit period and driverless cars. Please only look at a few articles to familiarise yourself with the topics. You are not expected to be an expert on them.
Install Data Miner in Chrome.
Install Recipe Creator in Chrome.
Read through the Data Miner help documentation.

You are not required to do the following before the lab session, but it might be useful to do so:

Consider what would be interesting to find out about Brexit or Driverless cars. Write down a research question and consider how you could answer the research question with online data. Bring this with you. It will be useful for tasks 1 and 2.
Try to download raw data from a web page using Data Miner. You may want to refer to the w3s tutorials about HTML and the Data Miner tutorial videos.

Presentation

Session

The lab sessions follow a flipped classroom structure. Please try to complete the preparation above, and you are very welcome to try the session tasks (below) before the session. The lab sessions are an opportunity to carry out the tasks in small groups and engage in group discussions about the topics and skills.

Task 1 - The question

The research question is a core component of data collection, analysis and visualisation. Your question helps you to choose the raw data you wish to collect, decide how to carry out your analysis and consider which visualisation to present in order to answer the question.

Your task is to:

Consider the two lab topics: Driverless cars and Brexit.
Think about possible research questions. What would be an interesting element of the issues for you to examine?
Write down a research question about Driverless cars or Brexit. For example, 'Are "liberal" newspapers a key actor in the Brexit debate?' or 'How are Driverless Cars discussed in the wider public?'.
How could you answer these questions? What would an answer look like in an academic paper? Discuss this in your groups.
Identify the data sources you would use. Which web pages or other online sources could you use?

Task 2 - Data Miner

'Web scraping' allows us to collect rich data from a web page. For example, we can collect the comments sections from blogs or pricing information from Amazon. One method is to copy and paste data from the web page into an Excel file. However, this process is tedious, time-consuming and prone to mistakes. Instead, we can use programming languages such as Python and R, or Internet browser extensions such as Data Miner, to automate data collection.
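To make the idea of automated collection concrete, here is a minimal sketch in Python using only the standard library. It is not how Data Miner works internally. The HTML fragment, the class names "comment" and "author", and the function names are all made up for illustration; a real page would need to be downloaded first and will use different markup.

```python
from html.parser import HTMLParser
import csv
import io

# Hypothetical page fragment (an assumption for this sketch). A real page
# would be fetched first and will almost certainly use different markup.
SAMPLE_HTML = """
<html><body>
  <div class="comment"><span class="author">Ann</span><p>Great article.</p></div>
  <div class="comment"><span class="author">Bob</span><p>I disagree.</p></div>
</body></html>
"""


class CommentParser(HTMLParser):
    """Collect (author, text) pairs from <div class="comment"> blocks."""

    def __init__(self):
        super().__init__()
        self.rows = []       # finished (author, text) pairs
        self._field = None   # field that the current text belongs to
        self._current = {}   # comment being assembled (empty = not in one)

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "comment" in classes:
            self._current = {"author": "", "text": ""}
        elif tag == "span" and "author" in classes:
            self._field = "author"
        elif tag == "p":
            self._field = "text"

    def handle_endtag(self, tag):
        if tag in ("span", "p"):
            self._field = None
        elif tag == "div" and self._current:
            self.rows.append((self._current["author"], self._current["text"]))
            self._current = {}

    def handle_data(self, data):
        # Only keep text that sits inside a recognised field of a comment.
        if self._field and self._current:
            self._current[self._field] += data.strip()


def scrape_comments(html):
    """Return a list of (author, text) tuples found in the HTML string."""
    parser = CommentParser()
    parser.feed(html)
    return parser.rows


def to_csv(rows):
    """Format the rows as CSV text that Excel can open."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["author", "comment"])
    writer.writerows(rows)
    return buffer.getvalue()


rows = scrape_comments(SAMPLE_HTML)
print(to_csv(rows))
```

The same pattern (find the repeating element, pick out the fields inside it, write one row per element) is what you describe to Data Miner when you build a recipe.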

You will use Data Miner to collect information from a few simple web pages. If you are having difficulty with this task then please raise your hand and James will come over and help. You should also review the tutorial web pages on HTML and the Data Miner documentation. In particular, watch the video about making recipes and read the tutorial section. You may also need to create a Google account if you wish to download a large amount of data using Data Miner.

The video below shows you how to use Data Miner to collect data from the Gateway Timeline web page.

Your task is to download comments from one of two webpages. James will go through how to complete the task. Please choose one of the following pages to scrape data from:

Data Miner offers an easy way to collect data from the internet, though you need a Google account to access the service.

In groups, consider:

1. Can your data help you address your question?

2. Is additional processing needed?
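As one illustration of what "additional processing" might mean, the sketch below tidies a list of scraped comments: it collapses stray whitespace, drops empty entries and removes exact duplicates. The function name and the sample data are assumptions for this example; your own data may need quite different steps.

```python
def clean_comments(comments):
    """Normalise whitespace, drop empty entries and exact duplicates (order kept)."""
    seen = set()
    cleaned = []
    for comment in comments:
        text = " ".join(comment.split())  # collapse runs of whitespace
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned


# Made-up scraped values, including a blank entry and a duplicate.
raw = ["  Great   article. ", "", "I disagree.", "Great article."]
print(clean_comments(raw))
```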

Task 3 - Mining an existing data set

Research often involves an existing data set. Here and in later workshops you will be using a Twitter tool called TCAT. TCAT holds tweets from the London, Coventry and Birmingham areas. The following is intended as an introduction to this tool.

James will demonstrate a simple query. The data sets we are using are the driverless and Brexit data sets. Please refer to the TCAT page for more details on TCAT and these data sets.

Your task is to:

1. Consider if you would examine data from all three locations. If you only choose one, why?

2. Query TCAT for data. Choose a time range of a few days and then enter your query (e.g. Brexit AND Theresa). How much data do you get?

3. Download a random sample of tweets. Look through them. What do you see and what can they tell you?
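The query and sampling steps above can be imitated in a few lines of Python, which may help you think about what the tool is doing. This is only a sketch: the tweets are invented, the function names are hypothetical, and TCAT's own query matching may behave differently (for example in how it treats case or partial words).

```python
import random

# Invented tweets for illustration only; a real TCAT export has many more columns.
TWEETS = [
    "Theresa May sets out her Brexit plan",
    "Brexit talks continue in Brussels",
    "Driverless cars to be tested in Coventry",
    "What does Brexit mean for Theresa May's government?",
    "Theresa May visits Birmingham",
]


def matches_all(text, terms):
    """True if every term appears in the text, ignoring case (a simple AND query)."""
    lowered = text.lower()
    return all(term.lower() in lowered for term in terms)


def query(tweets, terms):
    """Return the tweets that contain all of the search terms."""
    return [tweet for tweet in tweets if matches_all(tweet, terms)]


def sample_tweets(tweets, k, seed=None):
    """Random sample without replacement; never asks for more tweets than exist."""
    rng = random.Random(seed)
    return rng.sample(tweets, min(k, len(tweets)))


hits = query(TWEETS, ["Brexit", "Theresa"])
print(len(hits), "matching tweets")
for tweet in sample_tweets(hits, k=1, seed=42):
    print(tweet)
```

Fixing the seed makes the sample repeatable, which is useful if you want group members to look at the same tweets.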

Going further

Data Miner is one method for downloading data from web pages. There are others. A more common method is to use a programming language. Below are tutorials for downloading web data in Python and R:

Python
R

Both of these options are quite advanced. We will consider how to use R throughout these workshops. A simpler alternative is to use Excel's Power Query option:

Power Query

Power Query is simpler than R but not as flexible as Data Miner. It can also get very complicated, more so than R, if the data is not in a table (for example, see here).