
IM904 Labs Week 8

AIMS
  • Learn to use Google’s Ngram viewer to visualize n-gram frequency in Google’s books corpus

  • Learn to use Voyant Tools to analyse and visualize a text corpus

  • Learn to use the Stanford Natural Language Processing (NLP) tool to identify entities within a text corpus

  • Learn to export data from these tools for later analysis

BEFORE THE SESSION

Although not compulsory, it is recommended to:

1. Have a public and shareable text prepared for the class (it can be fiction writing, a compilation of news, etc., any .txt file like this one).

2. Install Java Development Kit 8.1 from here

3. Install Stanford’s NLP (instructions will be provided at the end of the lab) from https://nlp.stanford.edu/software/stanford-ner-2017-06-09.zip
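
If you want to check the Java installation before the session, the following terminal command is a quick sanity check (the exact version string printed will depend on the JDK build you installed; any Java 8 / 1.8.x build should be fine for the NER tool):

java -version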

1. Ngram

Ngram is a web tool to visualize the frequency of words and phrases (called n-grams, e.g. “United Kingdom” is a 2-gram, while “Kingdom” is a 1-gram) within Google’s Books corpus, which consists of 5.2 million printed book sources from 1500 to 2008 in 6 languages. The corpus does not include every word in the dataset, only those that appear more than 40 times.
When using Ngram remember that:

  • capitalization (“Ngram” vs “ngram”) and inflection (“play cards” vs “playing cards”) may produce different results
  • the graph’s y-axis shows the percentage of use in the corpus, and the x-axis shows time
  • it is possible to make queries in any of the corpus’ languages
  • it is possible to make queries over different periods of time (from 1500 to 2008); a sketch of a query URL built this way is given after this list
  • it is also possible to download parts of the corpus (or even all the datasets), however, this requires a large amount of disk space and bandwidth
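
For reference, queries can also be opened directly as URLs. Below is a sketch of a macOS terminal command that opens a comparison of the capitalized and lowercase forms of a 2-gram; the parameter names (content, year_start, year_end, smoothing) reflect Ngram Viewer URLs at the time of writing, so if they change, simply run a query in the browser and copy the URL from the address bar instead:

open "https://books.google.com/ngrams/graph?content=United+Kingdom,united+kingdom&year_start=1500&year_end=2008&smoothing=3"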
Instructions
  1. Select and compare 2 or more 1-grams

  2. Select and compare 2 or more 2-grams, change the capitalization and inflection of your queries, and observe if the outcome changes

  3. Consider the following:

    Think about the limits that these queries entail. What are the advantages of having a massive corpus like this at hand?

    What are the conundrums of querying old media archives through new media techniques?
2. Voyant Tools

Voyant Tools is a great off-the-shelf online tool to analyse and visualize text datasets. It works particularly well with long corpora (e.g. William Shakespeare’s complete oeuvre). It is also useful for sharing datasets online (they become easily accessible through a URL). However, unlike other services and tools previously used in the module (TCAT, NVivo, RAW), Voyant Tools keeps all data that you upload to the platform. For this reason, do not use Voyant if you have sensitive or private data, or anything that you would prefer not to share.

Instructions

1. Download the corpus “Bitcoin_academic_research_2008-22jun2015_ABSTRACTS.txt”, consisting of Bitcoin-related academic abstracts from 2008 to mid-2015 (originally compiled by Brett Scott).
2. Go to http://voyant-tools.org/ and upload your corpus (the .txt file).
3. A general dashboard will open. You can explore most of the tools in this initial dashboard, but you can also open your corpus with a specific tool by hovering over the upper-left corner of the dashboard.

voyant-menu.gif

4. You can export specific tools (e.g. an image of the word cloud), or data (e.g. a correlation table), by hovering over the menu of each tool.

voyant-export.gif

You can also share your whole corpus (e.g. the Bitcoin abstracts corpus is available at this URL: http://voyant-tools.org/?corpus=95f3246103cb5b07c96f5a11b5bcc76e), or embed a tool on a webpage.
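
As a small usage example, a shared corpus URL like the one above can be opened straight from the macOS terminal (open simply hands the URL to your default browser; on Linux, xdg-open plays the same role):

open "http://voyant-tools.org/?corpus=95f3246103cb5b07c96f5a11b5bcc76e"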


5. Use the Voyant tools to analyse the bitcoin abstracts or a corpus of your own, and consider the following:

Can you find any interesting patterns? Which tool was most useful for analysing your dataset?
Is this text analysis tool useful for all datasets?

Are some kinds of text data better suited? What about YouTube comments, Twitter data, or news/blogs?

BREAK
3. CoreNLP

Stanford CoreNLP is a Natural Language Processing tool. Among other things, it will help you to quantitatively recognize named entities (up to 7 classes) and export your findings for further analysis. CoreNLP has 6 language models: English (very complete), German, Spanish, Arabic, French, and Chinese (which requires word segmentation, but this is available through other means; you can find a tutorial here). The tool is free, open source, and does not collect any of your data. However, one of its downsides is that there is no documentation on how the models were/are constructed.

Instructions (online teaser)

1. Download “single_abstract.txt”, a small sample of the corpus.
2. Go to the online teaser version of NLP (http://nlp.stanford.edu:8080/ner/)
3. Paste the text from the “single_abstract.txt” file and press “Submit Query”. You’ll see some organizations and locations highlighted.
4. Select a different piece of text, and play with the different classifiers. On the online version you’ll have the following options for English:

3 class: Location, Person, Organization
4 class: Location, Person, Organization, Misc
7 class: Location, Person, Organization, Money, Percent, Date, Time

And for Chinese:

7 class: Location, Person, Organization, Facility, Demonym, Misc, GPE

5. Paste a small text of your choice and consider the following:

How accurate does the entity recognition technique seem to be?
What are the shortcomings and advantages of using this NLP tool/technique? (e.g. are figures of authority already pre-defined?)

Instructions (offline CoreNLP)

The online teaser of CoreNLP is limited to a few paragraphs. In order to use the full potential of the tool, it is necessary to download the program. Some functionalities are provided by a GUI (graphical user interface), while others are only available through the command line.
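
If you did not download the tool before the session, the commands below fetch and unpack it from the link given in the “Before the session” section (this assumes curl and unzip are available, as they are on macOS; the name of the unpacked folder may differ slightly depending on the release):

curl -L -O https://nlp.stanford.edu/software/stanford-ner-2017-06-09.zip
unzip stanford-ner-2017-06-09.zip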

James has created a video taking you through using the terminal to run the Stanford NLP tool. The video is shown below. Note: you may need to make the video fullscreen and set the quality to 720p HD to see all the details.

The GUI tool does not appear to work on macOS High Sierra. Therefore please use the command line interface. Guidance on using this interface will be given in the lab and in the below instructions.

1. Unzip and open the folder of CoreNLP
2. Double click the file “ner-gui.sh”
3. On the “Classifier” tab, select “Load CRF from file”, navigate to the folder called “classifiers”, and select the one you prefer (in English)
4. On the “File” tab, open the Bitcoin corpus used in the previous exercise
5. Press the “Run NER” button (this may take a moment)
6. Consider
7. Export the tagged corpus to a tsv file:

i. Open a Terminal (on OS X, open your Applications folder, then open the Utilities folder. Open the Terminal application)
ii. Navigate to the CoreNLP folder using:

cd [ FOLDER e.g. ~/Desktop/Downloads/CIM/nlp ]

iii. Use the following command. Modify the classifier, input file, and output file according to your needs:

java -mx600m -cp "*:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier classifiers/english.muc.7class.distsim.crf.ser.gz -outputFormat tabbedEntities -textFile bitcoin_in.txt > bitcoin_out.tsv
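
For reference, the same export with the 3-class classifier would look like the line below; the classifier file name assumes the standard contents of the classifiers folder in the download, so run ls classifiers/ first if you are unsure which models you have:

java -mx600m -cp "*:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz -outputFormat tabbedEntities -textFile bitcoin_in.txt > bitcoin_out_3class.tsv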

8. Open your .tsv file in Excel:

i. Open a blank worksheet
ii. On the “Data” tab, click “From Text”
iii. Change the file type to “All files” and browse to your file
iv. Choose “Delimited” on the first step of the import wizard
v. Make sure “Tab” is selected as a delimiter on the second step
vi. No changes should be needed on the last step


9. Modify the imported data as needed (e.g. add a first row with the title of each column; a command-line alternative is sketched below)
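
If you prefer to add the header row from the command line instead of in Excel, a sketch along these lines works (the column names here are placeholders; with the tabbedEntities output format the columns are roughly the entity text, its class, and the surrounding text, so check your own file and rename them accordingly):

printf 'Entity\tType\tContext\n' | cat - bitcoin_out.tsv > bitcoin_out_headed.tsv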
10. Open your edited .tsv file in RAW for visualisation
11. Try any of the previous exercises with a corpus of your own (a political speech, a news article, a Twitter dataset, etc.) and consider the following:

Is this automated technique effective for understanding the discourse in your chosen corpus?

How can we use this tool in conjunction with previously seen methods (e.g. issue mapping, content analysis, network analysis)?


*Extra corpora: Clinton and Trump speeches compiled by David Brown