Online Corpora Workshop 2

Hands-on Workshop Continued

Investigating lexis using online tools

Sketch Engine was created by Adam Kilgarriff, originally to produce Word Sketches for the Macmillan Dictionary (analyzing the BNC by part of speech to show how individual words behave). It can now process a number of other corpora and can even build a custom corpus for you from web pages. It uses the part-of-speech tagging of the BNC to offer very detailed information about a word, as well as comparisons between words that include syntactic information.

Log in to Sketch Engine at http://www.sketchengine.co.uk/ (take the 30-day free trial).

Task 1 Looking at a multiword unit

  1. You need to choose a corpus. The choices for English are the BNC (96 million words) or enTenTen12 (1.73 billion words). The latter is huge, so it requires a fast processor and a good Internet connection. For speed today, choose the BNC.
  2. Type in the query ‘as a matter of’.
  3. Choose ‘Phrase’ and click ‘Search’. Clearly this is a common string in English. Which are the most common nouns after ‘of’? To see this, we can sort the concordance lines by the context to the right.

Task 2 Sorting

  1. Click ‘Sort’ and choose ‘Simple Sort’. Click ‘Right Context’ and wait. You will see that the first lines have punctuation marks immediately to the right of the phrase. Find the ‘Jump to’ box and click on a letter, e.g. ‘l’. Which nouns are frequent here?

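  2. We can look at all the gerunds that follow this phrase. Go back to ‘Sort’. Under ‘Multilevel sort’, at the first level set the attribute to ‘word’, check ‘Ignore case’ and ‘Backward’, and choose position 1R (= one to the right of the node). ‘Backward’ sorts each word from its last letter, so words with the same ending appear together.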

  3. At second level, click 1R (this will sort alphabetically) and then click on third level 2R (this will sort the following words alphabetically).

  4. Look at the words grouped on the right by ending: -ing, -al, -tion.
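
The two kinds of sorting used in this task can be imitated outside Sketch Engine. The following is only a minimal sketch in Python, not Sketch Engine's own code: the sample sentences, the function names and the simple tokenizer are all made up for illustration. It finds every occurrence of the phrase, then sorts the lines either alphabetically by the right context or "backward" by the ending of the word at 1R.

```python
import re

# Invented sample text (not taken from the BNC).
SAMPLE = (
    "As a matter of fact, it rained. "
    "This is treated as a matter of urgency. "
    "They did it as a matter of course. "
    "As a matter of policy we refuse. "
    "Checks are made as a matter of routine."
)


def kwic(text, phrase, width=5):
    """Return (left, node, right) token lists for each occurrence of phrase."""
    # Very rough tokenizer: words and single punctuation marks, lower-cased.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    node = phrase.lower().split()
    n = len(node)
    lines = []
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == node:
            left = tokens[max(0, i - width):i]
            right = tokens[i + n:i + n + width]
            lines.append((left, node, right))
    return lines


def sort_right(lines):
    """Sort alphabetically by 1R, then 2R, and so on."""
    return sorted(lines, key=lambda line: line[2])


def sort_right_backward(lines):
    """Sort by the word at 1R read from its last letter (groups endings)."""
    return sorted(lines, key=lambda line: line[2][0][::-1] if line[2] else "")


if __name__ == "__main__":
    found = kwic(SAMPLE, "as a matter of")
    for title, ordered in (("Right context", sort_right(found)),
                           ("Backward at 1R", sort_right_backward(found))):
        print(title)
        for left, node, right in ordered:
            print("  ", " ".join(left), "[" + " ".join(node) + "]", " ".join(right))
```

In the backward sort, ‘policy’ and ‘urgency’ end up next to each other because they share the ending -cy, which is the same effect that groups the -ing words together in the real concordance.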

Task 3 Checking Collocation

  1. To look at the nouns that follow this phrase more efficiently, we can use the collocation tool. Click ‘Collocation’. Set the attribute to ‘word’ and the range from 0 to 5 (the default looks at 5 words on each side, but we are only interested in words coming after the phrase).

  2. Leave the rest and click ‘Make Candidate List’.

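  3. Why is ‘urgency’ higher than ‘fact’ even though ‘fact’ is more frequent? This is because the list is sorted by logDice, an association score which (like the MI score) measures how strongly a word is tied to the phrase. Raw frequency and the T-score, by contrast, favour words that are simply frequent.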

  4. Click on each column heading to sort by that item – what is the most frequent item? Why?

  5. Click on MI score and see that ‘fact’ goes down to 5th place. This is because some of the words above it (prudence, inexorable) are relatively infrequent in the corpus, so their collocation with this phrase is more significant.
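
To see why the different columns give different rankings, it can help to work the scores out by hand. The sketch below uses the commonly cited formulas for MI, T-score and logDice; the exact variants Sketch Engine applies may differ in detail. All the counts are invented for illustration (they are not BNC figures), and the function and variable names are our own.

```python
import math


def mi_score(f_xy, f_x, f_y, n):
    """Mutual information: log2 of observed over expected co-occurrence."""
    return math.log2(f_xy * n / (f_x * f_y))


def t_score(f_xy, f_x, f_y, n):
    """T-score: (observed - expected) divided by the square root of observed."""
    expected = f_x * f_y / n
    return (f_xy - expected) / math.sqrt(f_xy)


def log_dice(f_xy, f_x, f_y):
    """logDice: 14 + log2 of the Dice coefficient (independent of corpus size)."""
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


# Hypothetical counts, NOT real BNC figures.
N = 100_000_000   # corpus size in tokens (assumed)
NODE = 5_000      # frequency of the node phrase (assumed)
COLLOCATES = {    # word: (co-occurrences with node, total frequency of word)
    "fact": (1_500, 40_000),
    "urgency": (300, 1_000),
}


def scores(word):
    """Return all four figures for one collocate."""
    f_xy, f_y = COLLOCATES[word]
    return {
        "frequency": f_xy,
        "T-score": t_score(f_xy, NODE, f_y, N),
        "MI": mi_score(f_xy, NODE, f_y, N),
        "logDice": log_dice(f_xy, NODE, f_y),
    }


def ranked_by(measure):
    """Collocates ordered from highest to lowest on the chosen measure."""
    return sorted(COLLOCATES, key=lambda w: scores(w)[measure], reverse=True)


if __name__ == "__main__":
    for measure in ("frequency", "T-score", "MI", "logDice"):
        print(measure, ranked_by(measure))
```

With these made-up counts, ‘fact’ comes first by frequency and T-score, while ‘urgency’ comes first by MI and logDice, because a much larger share of all the uses of ‘urgency’ occur with the phrase.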

Task 4 Looking at a word sketch

  1. Click on Word Sketch. Type in ‘matter’ and click ‘Show Word Sketch’. Notice that in the pp_of-p box we have 5169 examples. This is because of the multiword unit we have been looking at, and the box contains many of the nouns we have already seen. Why do we have ‘seconds’ in this list?

  2. Click on the number next to the word to see the related concordance lines.

Task 5 Looking at differences between two words

  1. Click on Sketch-Diff. Type ‘rummage’ in the ‘Lemma’ box and choose ‘verb’.

  2. Click on ‘lemma’ and type in ‘search’

  3. Click ‘Show Diff’ and examine the green (rummage) and red (search) collocates. What differences do you notice? Are these all predictable from intuition?

Task 6 Search for similar / related words

  1. Click on Thesaurus
  2. Type 'upcoming' in the word box
  3. Click Search. Notice that the words are not synonyms but they share semantic features (i.e. they denote a time to come or a time in the past).

Task 7 Semantic Prosody

  1. Type in 'academic'
  2. In what circumstances does this have a negative prosody?
  3. Check the prosody of the phrasal verb 'set in' - what do you notice? What do you notice about the position of this phrase in a sentence?

Task 8 Make your own corpus

Click on Home and choose ‘WebBootCat’

  1. Give your corpus a name (no spaces)

  2. Choose a language and click ‘Get seed words from Wikipedia’.

  3. Choose English and type a sport or other topic you are interested in.

  4. Choose words from the list given which you recognize as typical of that topic and click ‘Use WebBootCat with selected words’.

  5. Give the corpus a name and click next. Wait for Sketch Engine to process the texts.

  6. Choose the texts you want to use for your corpus.

  7. Search the corpus for significant phrases related to that topic.
  8. In what circumstances is a customised corpus useful?
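
Tools in the BootCaT family, on which WebBootCat is based, turn the seed words into small random combinations ("tuples"), send each tuple to a search engine as a query, and download the pages that come back. The sketch below illustrates only the first of those steps. It is an assumption-laden illustration, not Sketch Engine's code: the seed words, the tuple size of 3, the number of tuples and the function name are all chosen here for the example.

```python
import math
import random


def make_tuples(seeds, size=3, count=5, rng=None):
    """Combine seed words at random into distinct query tuples."""
    rng = rng or random.Random(0)   # fixed seed so the example is repeatable
    seeds = sorted(set(seeds))      # remove duplicates, fix the order
    if size > len(seeds):
        raise ValueError("not enough seed words for tuples of this size")
    # Never ask for more distinct tuples than can possibly exist.
    target = min(count, math.comb(len(seeds), size))
    tuples = set()
    while len(tuples) < target:
        tuples.add(tuple(sorted(rng.sample(seeds, size))))
    return sorted(tuples)


# Hypothetical seed words for the topic "cricket".
SEEDS = ["wicket", "bowler", "innings", "batsman", "umpire", "over"]

if __name__ == "__main__":
    for query in make_tuples(SEEDS):
        # Each tuple would be sent to a search engine as one query.
        print(" ".join(query))
```

This suggests why the choice in step 4 matters: every query is built only from the words you selected, so a seed word that is not typical of the topic is likely to pull in pages that are off topic.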

Task 9 Other useful Websites

  1. Explore The Compleat Lexical Tutor. This website has a huge range of free tools related to corpus research, and gives access to a number of small to medium-sized corpora. You can check the profile of any text: copy and paste the text we looked at in the lecture (reproduced below, after this list) into the Vocab Profiler.
  2. Log in to Collins Wordbanks Online - take a month's free trial. Choose any of the words and phrases we have been discussing and explore the results.


There are many ways in which we can implement technologies in English teaching and learning. Most of us have become used to using the term e-learning for this purpose. What do we actually imagine when we hear or read the word e-learning? For some of us, e-learning has a positive connotation. We believe it is something that helps us and the students in the learning process. For others, however, e-learning can be a nightmare they are forced to take part in although they do not believe in its effect, an equivalent for time consuming, lonely activities for their own sake.

I think the approach differs according to how narrowly we understand the notion of e-learning and also how user friendly and manageable the devices we use are for both teachers and students. How broadly are we allowed to treat ‘e-learning’? We have heard many times that what we – between Masaryk University in Brno, Czech Republic and Aberystwyth University Wales (see Project INVITE 2008) – do is not real e-learning because we do not seat students in front of individual computer screens in a room where all you can hear are the sounds of typing and the buzz of a local server.