IM904 Labs Week 7

Aims

The aim of this session is to learn how to: (a) use exploratory visualisations to analyse events/topics, and (b) critically evaluate the strengths and limitations of Twitter data in event analytics with the help of several visualisation methods (such as pie charts, Sankey diagrams and scatter plots).

Preparation

This session does not require any advance preparation.

Part 1: Exploratory data visualisations in Excel

During this session you will be working with one of the geolocated datasets (Coventry), for which a 6-month archive (01/06/17 - 31/12/17) has been uploaded to TCAT. To start with, we are going to construct an exploratory query to identify the most prominent topics (as reflected in the hashtags) for that city during that time period. This can be done with the TCAT option 'Hashtag-user activity' (the pre-downloaded dataset is here). After initial data exploration, we can see that the most prominent hashtag is #ukweather. We are therefore going to explore the topic 'weather' in more detail, using visualisation techniques for this purpose.

First of all, we run the query 'weather', which yields ~12,547 tweets. The data has been extracted and uploaded here. When you open the CSV file in Excel, you can observe one intriguing property of the dataset: the uneven distribution of the location data. As you are probably aware, Twitter data can carry two types of location data: (1) user-defined location (specified by the user, column 'Location') and (2) device-enabled location (recorded when users allow Twitter to read the exact location from which a tweet is sent, columns 'lat' and 'lng'). We are therefore going to look at how the location structure is represented across the tweets that have precise georeferencing (i.e., entries in the columns 'lat' and 'lng') and those without geographic coordinates (files for both scenarios can be downloaded from here). For our analysis, we are going to use the Excel 'Pivot Table' function, with a pie chart as the visualisation option. When analysing the results, consider the following question:

Which of the scenarios has a more diverse structure of locations? Why do you think this is the case?

[Figure: weather1]
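For those who prefer to check the Excel result in code, the split into the two scenarios and the Pivot Table count can be sketched in Python as below. This is only an illustrative sketch: the sample rows are invented, the 'from_user_name' column name is an assumption, and only the 'Location', 'lat' and 'lng' columns are taken from the description above.

```python
import csv
import io
from collections import Counter

# Invented sample rows (NOT real data); a real TCAT export has many more columns.
# 'from_user_name' is an assumed column name; 'Location', 'lat', 'lng' come from the lab text.
SAMPLE_CSV = """from_user_name,Location,lat,lng
alice,Coventry,52.4068,-1.5197
bob,"Coventry, UK",,
carol,West Midlands,,
alice,Coventry,52.4081,-1.5106
dave,Coventry,,
"""

def has_coordinates(row):
    """Return True when both 'lat' and 'lng' hold a non-empty value."""
    return bool(row["lat"].strip()) and bool(row["lng"].strip())

def location_counts(rows):
    """Count tweets per user-defined location, as a Pivot Table would."""
    return Counter(row["Location"].strip() or "(blank)" for row in rows)

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Scenario 1: tweets with precise georeferencing; scenario 2: tweets without it.
geo_rows = [row for row in rows if has_coordinates(row)]
non_geo_rows = [row for row in rows if not has_coordinates(row)]

geo_counts = location_counts(geo_rows)
non_geo_counts = location_counts(non_geo_rows)

print("Distinct locations (with lat/lng):", len(geo_counts))
print("Distinct locations (without lat/lng):", len(non_geo_counts))
```

The number of distinct locations in each scenario is a rough stand-in for the diversity you would read off the two pie charts.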

Next, we are going to look at the structure of the actors engaged with the topic 'weather'. We will use the same techniques as above. Take a look at the results and try to answer the following questions:

Can you see a noticeable difference in the structure of actors between the two scenarios? How do they differ from, or compare to, the results of the previous visualisation step?
What can you say about the scenario in which precise geolocations are dominated by very few (literally a couple of) actors?
What would be the next logical step in this type of data inquiry?

[Figure: weather2]
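One way to quantify how strongly a few actors dominate a scenario is to compute each actor's share of the tweets. The sketch below uses an invented list of user names (the account names are hypothetical, not taken from the dataset) and is meant only to illustrate the calculation behind the pie chart.

```python
from collections import Counter

def actor_shares(user_names):
    """Return (user, share of tweets) pairs, largest share first."""
    counts = Counter(user_names)
    total = sum(counts.values())
    return [(name, count / total) for name, count in counts.most_common()]

# Invented example: one automated account posts 8 of the 10 geolocated tweets.
geo_users = ["weather_bot"] * 8 + ["alice", "bob"]

shares = actor_shares(geo_users)
top_name, top_share = shares[0]

for name, share in shares:
    print(f"{name}: {share:.0%}")
```

If one or two accounts hold most of the share, it is worth checking whether they are automated feeds (for example, weather stations) rather than individual users.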

Part 2: Visualizing additional topic categories in RAWGraphs

In the second part of our session we will continue to dig into the structure of the topic 'weather' on Twitter, in order to understand the following: (a) what the structure of hashtags associated with this topic is, and whether it covers one or several events; (b) what the structure of the most active actors and their temporal activity is; and (c) whether there is segregation between the types of hashtags used during the event, and how it is reflected in the properties of actors' profiles.

First of all, we will look at the structure of the top 10 hashtags associated with this topic. For this purpose, we will use a dataset extracted from TCAT using a 1-month (01/08/17-31/08/17) interval query and the 'Hashtag-user activity' option, and subset to the top 10 hashtags as defined by the total number of tweets associated with each of them. To visualise the hashtag distribution, we will use the RAWGraphs 'Alluvial Diagram'.
Secondly, we will look at the temporal (daily) activity of the most active actors of the 'weather' topic in Coventry. For this purpose, we will use a dataset extracted from TCAT using a 1-month (01/08/17-31/08/17) interval query and the 'User stats (individual)' option, and subset to a minimum of 10 entries per day as recorded in the extracted dataset. To visualise user activity, we will use the 'Bump Chart'.
And finally, building on the previous findings, we will try to find out whether actors' profile characteristics on Twitter (number of followers vs number of friends) are associated with the hashtags they use in the topical conversation ('weather' in our case). For this purpose, we will use the 'Scatterplot'.
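The three subsetting steps described above can also be sketched in Python before loading the data into RAWGraphs. This is a minimal sketch under stated assumptions: all field names ('hashtag', 'user', 'tweets', 'date', 'followers', 'friends') and all sample values are hypothetical, and the reading of "minimum of 10 entries per day" as a per-user, per-day threshold is an interpretation rather than something the TCAT export defines.

```python
from collections import Counter, defaultdict

def top_hashtags(rows, n=10):
    """Rank hashtags by total tweet count and keep the n largest."""
    totals = Counter()
    for row in rows:
        totals[row["hashtag"]] += row["tweets"]
    return [tag for tag, _ in totals.most_common(n)]

def active_user_days(rows, min_tweets=10):
    """Keep (date, user) rows with at least min_tweets tweets on that day."""
    return [row for row in rows if row["tweets"] >= min_tweets]

def scatter_points(rows):
    """Group (followers, friends) pairs by hashtag, ready for a scatter plot."""
    points = defaultdict(list)
    for row in rows:
        points[row["hashtag"]].append((row["followers"], row["friends"]))
    return dict(points)

# Invented sample rows for each step (hypothetical field names and values).
hashtag_rows = [
    {"hashtag": "ukweather", "user": "weather_bot", "tweets": 40},
    {"hashtag": "ukweather", "user": "alice", "tweets": 5},
    {"hashtag": "rain", "user": "alice", "tweets": 7},
    {"hashtag": "sun", "user": "bob", "tweets": 2},
]
daily_rows = [
    {"date": "2017-08-01", "user": "weather_bot", "tweets": 24},
    {"date": "2017-08-01", "user": "alice", "tweets": 3},
    {"date": "2017-08-02", "user": "weather_bot", "tweets": 10},
]
profile_rows = [
    {"hashtag": "ukweather", "followers": 1200, "friends": 15},
    {"hashtag": "rain", "followers": 80, "friends": 95},
    {"hashtag": "ukweather", "followers": 900, "friends": 30},
]

top = top_hashtags(hashtag_rows, n=2)       # input for the Alluvial Diagram
active = active_user_days(daily_rows)       # input for the Bump Chart
points = scatter_points(profile_rows)       # input for the Scatterplot

print(top)
print(active)
print(points)
```

Each result corresponds to one chart: the hashtag subset feeds the Alluvial Diagram, the filtered daily rows feed the Bump Chart, and the grouped follower/friend pairs feed the Scatterplot.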