
IM904 Labs Week 5

Aims

By the end of this session you should be able to:

  • Import online data sources (Twitter data) into NVivo.
  • Extract useful bits of information by coding the relevant parts.
  • Query the created database with the help of three main NVivo modules: Word Frequency, Text Search and Coding.

Before the session...

  1. Download and install NVivo 11 for Mac, following the University ITS instructions.
  2. Watch the "Explore NVivo for Mac" video. (5mins 45s)
  3. Watch the "Code documents" video. (2mins 43s)

Lab Steps:

  1. Construct the query and extract datasets from the TCAT.

    You will be working with Twitter data collected between 01/01/2017 and 31/12/2017 within the bounding boxes surrounding 3 UK cities: London (~16mln tweets), Birmingham (~3mln tweets) and Coventry (~2mln tweets). During this session we will be analysing the event known as the 'Uber Ban' by Transport for London on the 20th of September 2017, when Uber was "stripped of its London licence due to a lack of corporate responsibility".

    Once you have logged in to the TCAT workspace, you will need to construct 3 queries (one for each city), using the following parameters: Query: uber AND ban; Startdate (UTC): 2017-09-20; Enddate (UTC): 2017-12-31. Once your filtering parameters have been specified, scroll down to the 'Tweet exports' section and make sure that the box next to 'hashtags' has been ticked before proceeding to the data download from the 'Export all tweets from selection' section.

    The files for this section can be downloaded here.
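    If you would like to sanity-check an export before moving on, the short Python/pandas sketch below (run outside NVivo, and not part of the lab steps) confirms that the date range and the 'uber AND ban' filter look right. It assumes a CSV export containing 'created_at' and 'text' columns and uses a hypothetical file name; adjust both to match your own download.

      import pandas as pd

      # Load the TCAT export (hypothetical file name - use your own download).
      df = pd.read_csv("uber_ban_london.csv")

      # Confirm the date range matches the query (2017-09-20 to 2017-12-31).
      dates = pd.to_datetime(df["created_at"], errors="coerce")
      print("Earliest tweet:", dates.min())
      print("Latest tweet:  ", dates.max())

      # Confirm the tweets actually match the 'uber AND ban' filter.
      text = df["text"].astype(str).str.lower()
      both = (text.str.contains("uber") & text.str.contains("ban")).sum()
      print(f"Tweets mentioning both terms: {both} of {len(df)}")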

  2. Start NVivo and import data into your workspace.

    In your Applications folder, find the NVivo icon and double-click it, then select 'Create new project'. In the next dialog window, save your project under the name 'UberBanLondon' and click 'Create'. You should now see the main NVivo workspace window. From here, select the Data tab and click the 'Dataset' option, which is used for importing tabular datasets and spreadsheets. Navigate to your datasets, which have been saved in .xlsx format, and click Open -> skip through the second step by clicking 'Next' -> during step 3 of the Import Dataset Assistant, click 'Deselect All' first, and then select the 3 columns that will be used in our analysis: 'from_user_name (Text)', 'text (Codable Text)' and 'hashtags (Text)'. You can make them active by selecting each one and ticking the 'Import Field' box. Click Next and then Import. Repeat the same procedure for the other two datasets.
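    For reference, the pandas sketch below does roughly what the Import Dataset Assistant does at this point: it loads the .xlsx export and keeps only the three fields used in the analysis. The file name is hypothetical; the column names are those listed above.

      import pandas as pd

      # Load the exported spreadsheet (hypothetical file name).
      df = pd.read_excel("UberBanLondon.xlsx")

      # Keep only the three fields selected during the NVivo import.
      subset = df[["from_user_name", "text", "hashtags"]]
      print(subset.head())
      print(f"{len(subset)} tweets imported with {subset.shape[1]} fields")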

  3. Explore thematic categories in the datasets using Word Cloud and Text Search queries.

    Click on the Query tab and select the Word Frequency option. In the new dialog window, click 'Selected item' and select one of the datasets, leave the default option for 'Finding matches' ('Exact match only'), and set the minimum word length to 4. Then click 'Run Query'. You should then see a list of words (with their associated % of the dataset) in the 'Summary' window and a word cloud in the 'Word Cloud' window. Run the query for all three datasets and compare the results.
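    To make explicit what this query is doing, here is a rough, simplified analogue of the Word Frequency calculation: exact-match counts for words of at least 4 characters, expressed as a percentage of all counted words. It assumes the dataset imported in the previous step, with tweet text in the 'text' column, and uses a hypothetical file name.

      import re
      from collections import Counter

      import pandas as pd

      df = pd.read_excel("UberBanLondon.xlsx")  # hypothetical file name

      # Tokenise all tweet text and keep words of at least 4 letters.
      all_text = " ".join(df["text"].astype(str)).lower()
      words = [w for w in re.findall(r"[a-z]+", all_text) if len(w) >= 4]

      # Report the 20 most frequent words and their share of all counted words.
      counts = Counter(words)
      total = sum(counts.values())
      for word, n in counts.most_common(20):
          print(f"{word:15s} {n:6d}  {100 * n / total:5.2f}%")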

    Please consider the following questions:

    • Can you see differences in the structure of the word clouds across the three cities?
    • Can you identify words that could be linked to a specific topic (e.g., Sentiment (positive/negative), Economy (jobs, professions), Actions (support/appeal), Politics (Sadiq Khan, @sadiqkhan))?

    Once you have made a list of potentially useful words, navigate, under the same Query tab, to the Text Search Query option and try to identify their proportional presence in all three data sources (example keywords: 'petition', 'support', 'jobs', 'ban', 'cabs', 'taxi', etc.). Compare the proportional presence of each keyword across all three datasets.
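    As a rough cross-check of this comparison outside NVivo, the sketch below computes, for each keyword, the percentage of tweets containing it in each city dataset. The file names are hypothetical and the keywords are the examples given above.

      import pandas as pd

      files = {"London": "UberBanLondon.xlsx",
               "Birmingham": "UberBanBirmingham.xlsx",
               "Coventry": "UberBanCoventry.xlsx"}
      keywords = ["petition", "support", "jobs", "ban", "cabs", "taxi"]

      for city, path in files.items():
          text = pd.read_excel(path)["text"].astype(str).str.lower()
          # Share of tweets in this dataset containing each keyword.
          shares = {k: f"{100 * text.str.contains(k).mean():.1f}%" for k in keywords}
          print(city, shares)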

    When performing this analysis, try to consider the following questions:

    • Are all keywords present in each dataset? If so, how similar are their proportions (estimated in %)?
    • Which conclusions can you draw at this stage of analysis, looking at how sub-topics ('problems') are represented in each city? Are Londoners more concerned about the Uber ban than people in Birmingham or Coventry?
  4. Auto-coding.

    During this step you will be working with the Analyze module, specifically with the Auto Code option. This option is designed to subset information by categories (or 'cases') of entities (e.g., names, gender, age, language spoken, etc.). Under Sources -> select Internals -> select one of the 3 datasets (so it is highlighted). Then click on the Analyze tab -> select Auto Code -> pick the option 'Code at cases for each value in a column' -> Next -> choose the column 'hashtags' -> Next -> make sure that the list of 'Selected columns' includes the column 'text' -> click Auto Code. Wait a few seconds for the auto-coding to execute.

    On the left-hand side navigation menu, under NODES, select Cases and click on the drop-down arrow under the newly created dataset. Observe the proportional distribution of each hashtag (or combination of hashtags) across the dataset.
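    A rough analogue of this auto-coding step outside NVivo: group the tweets by the value of the 'hashtags' column (each distinct value becoming a 'case') and count how many tweets fall under each. File and column names follow the earlier import; adjust them to your own dataset.

      import pandas as pd

      df = pd.read_excel("UberBanLondon.xlsx")  # hypothetical file name

      # One 'case' per distinct value of the 'hashtags' column.
      cases = df.groupby("hashtags")["text"].count().sort_values(ascending=False)

      # Proportional distribution of the 15 largest cases.
      total = cases.sum()
      for value, n in cases.head(15).items():
          print(f"{str(value):40s} {n:6d}  {100 * n / total:5.2f}%")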

    Consider the following questions:

    • Do the hashtag names (or combinations of hashtags) reflect the content of the text they are associated with?
    • Does the text reveal any additional (useful) information about the problem compared to the hashtag(s) alone?

BREAK

The second half of the lab focuses on the process of inductively creating categories, coding data by hand in NVivo, and drawing conclusions. You will be working with the London dataset only from now on. As inductive coding is an immersive, iterative and time-intensive process, this half of the lab functions like a 'taster' session.

5. Explore the data

The research question for this half of the lab is:

  • What are the concerns about the London Uber ban expressed by Twitter users?

Content analysis which uses inductive coding is most effective when researchers are very familiar with their data. You will begin by exploring the data in order to inductively create coding categories. Spend 10 minutes working on your own reading tweets and start to get a feel for your data. Make notes as you go about some of the concerns expressed by users about the London Uber ban.

6. Develop categories

In order to increase reliability, you will work in pairs or threes for this step. Discuss your preliminary notes with your partner(s) and come up with an initial list of categories for coding (these might be categories such as free-market economy, anti-leftism, individual finances, classism, public transportation, cultural heritage and so on). Create these categories as nodes in NVivo. Remember that this is not your final list of categories - as you code, it is highly likely that more categories will emerge.

Consider the following questions:

  • What are the relations between these initial categories?
  • Can you identify possible umbrella categories or sub-categories at this stage?

7. Code the text

Working together, begin to code your data. Unlike in the previous half of the lab, you will not take individual semantic units (words, sentences, etc.) as your unit of analysis. Rather, your unit of analysis will be the theme of the text. As such, some tweets or sections of tweets may contain several relevant themes and might therefore be coded into more than one category.

Read through the tweets in your pairs/threes and discuss your coding together as you work. If you determine that particular tweets are not relevant to your research, they can be omitted.

As you code consider the following questions:

  • Do you need to add new categories?
  • Are you coding consistently or are the categories changing as you work?
  • Can you begin to identify relationships and patterns between categories at this stage?

8. Drawing conclusions

There will not be time for you to go through your whole dataset during the lab. As such, you cannot draw firm conclusions from your coding so far. If you were to undertake this research as part of a future project, you would need to do several 'passes' of the data whereby you reach the end of coding your whole dataset, and then go back to the beginning. This iterative approach helps you achieve a nuanced, immersive relationship with the data and ensures consistency with categories.

Nevertheless, in pairs/threes, open one of the nodes you have created and look over the text coded to it. Consider the following:

  • How do users discuss this particular concern? Do they draw upon particular modes of expression, types of language, cultural common-senses, etc.?
  • Can you identify possible sub-categories in data coded to this node?
  • Are there links between this node and the other nodes coded to the same data? Why might this be?