
Online Corpora Workshop 1

Hands-on Workshop

Investigating lexis using online tools - online corpora with built-in concordancers

Online corpora vary hugely in their make-up, access tools and general user-friendliness. We will start with an easily accessed small corpus, and then look at a much larger one using two different tools. Finally we will explore another large corpus with the tools provided at the Brigham Young University website.

ELISA English Language Interview Corpus as a Second-Language Application

According to its website, the ELISA corpus was developed at the University of Tuebingen (Dept of Applied English Linguistics, AEL) and the University of Surrey (Dept of Languages and Translation Studies, LTS) "as a resource for language learning and teaching, and interpreter training". For our purposes it shows what a small corpus can and can't do. The corpus is made up of transcribed interviews with native speakers of English who talk about their professional careers (e.g. in tourism, politics, the media or environmental education). As such it is a relatively rare example of a purely spoken corpus. All the words have been compiled into a word list which runs down the left-hand side, and each word is a clickable link. The list has been lemmatised, so related word forms are grouped into sublists. There is also a clickable alphabet along the top for quicker access to a specific word.
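For anyone curious about what a lemmatised word list involves behind the scenes, here is a minimal sketch in Python. It is not how ELISA itself was built: the sample text and the small LEMMAS table are invented for illustration, and a real tool would use a full lemmatiser rather than a hand-made lookup.

```python
from collections import Counter, defaultdict
import re

# Hypothetical mini-transcript; not taken from the ELISA corpus.
TEXT = ("I work with children. She works in a studio. "
        "We worked together and I have worked there since.")

# Hand-made lemma table for illustration only (an assumption, not ELISA's data).
LEMMAS = {"works": "work", "worked": "work", "working": "work"}

def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def lemmatised_wordlist(text):
    """Map each lemma to a Counter of the word forms found under it."""
    sublists = defaultdict(Counter)
    for token in tokenize(text):
        # Forms missing from the table are treated as their own lemma.
        sublists[LEMMAS.get(token, token)][token] += 1
    return sublists

wordlist = lemmatised_wordlist(TEXT)
for lemma in sorted(wordlist):
    forms = ", ".join(f"{form} ({n})" for form, n in wordlist[lemma].most_common())
    print(f"{lemma}: {forms}")
```

Each lemma is printed with its word forms and their counts, which is roughly what the sublists in the ELISA word list show.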

Task 1 Exploring a small corpus

1) Work in pairs, discussing points of interest as you go. Raise your hand if you need help. Go to www.uni-tuebingen.de/elisa/html/elisa_index.html

2) Click on Browse all Interviews - here you can see the list of 24 interviews in the corpus. Underneath, the interview sections are grouped by topic.

3) Click on the first interview 'Working as an arts therapist' and you see a range of activities (unfortunately the video does not play). Quickly browse the Exercises based on this text.

4) Click on View Wordlists - here the vocabulary of the interview has been listed by frequency. Content words are listed separately. Why might this be useful? Look at the most frequent words (overall) - why isn't 'the' the most frequent word (as it is in most other corpora)?

5) Click on Browse words. The word list is at the side. Notice that it is lemmatised. The lines for each word (concordance lines) are displayed on the right with the text underneath. Clicking on a word in the list will bring up the relevant lines - have a look at 'have' and 've' - what do you notice about the choice of pronunciation here?

6) Go back to ELISA home (click top right corner). Click on Browse All Words with Web Concordance. Click on G in the alphabet at the top and then choose ‘grow’. Look at the concordance lines – do you notice any difference in meaning according to the different forms of the word? This supports the case for maintaining that each word form should be searched in its own right – it may have a specific semantic or syntactic profile.

7) Click on M and choose 'mean'. The word list does not differentiate by meaning. What is the most frequent use of the word here? How predictable was this, considering the type of corpus this is?

8) Click on L and look at ‘like’ – how many instances of this word are there? How many are from the verb LIKE?

9) Click on R - is 'rummage' in the list? What does this tell you about small corpora?

From this small corpus, we will now move to looking at a much larger one.

BNC - British National Corpus

This corpus of 100 million words was created from texts published in Britain in the 1980s and early 1990s and includes some transcribed spoken texts. It was intended to be a balanced representation of British English. It is somewhat dated now (e.g. there are no instances of ‘web page’) but it remains a large, well-researched resource. The website uses a particular concordancer called XAiRA (formerly SARA). For each search word, you are offered a maximum of 50 instances in full sentences taken randomly from the corpus, so each time you search you may get different results. The small number of hits makes this a good way in to a word, as it is not overwhelming to read each one. Having each instance in a complete sentence is useful for context, and differs from other concordancers, which give only sentence fragments with the node word in the middle (i.e. a KWIC concordance). However, it is harder to review the word quickly since it is not highlighted in the sentences given.
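To make the difference between the two kinds of output concrete, here is a small Python sketch. It is only an illustration and does not reproduce how the BNC website works: the sentences are invented, and the function names sentence_sample and kwic are our own.

```python
import random
import re

# Invented sentences for illustration; not real BNC data.
SENTENCES = [
    "The upcoming election dominated the news.",
    "Fans are excited about the upcoming tour.",
    "A decision on the merger is imminent.",
    "She mentioned her upcoming exhibition in passing.",
]

def sentence_sample(sentences, word, limit=50, seed=None):
    """Return up to `limit` randomly chosen whole sentences containing `word`."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    hits = [s for s in sentences if pattern.search(s)]
    rng = random.Random(seed)
    return rng.sample(hits, min(limit, len(hits)))

def kwic(sentences, word, width=25):
    """Return KWIC lines: the node word with `width` characters of context each side."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    lines = []
    for s in sentences:
        for m in pattern.finditer(s):
            left = s[max(0, m.start() - width):m.start()]
            right = s[m.end():m.end() + width]
            # Pad both sides so the node word lines up in one column.
            lines.append(f"{left:>{width}} [{m.group()}] {right:<{width}}")
    return lines

for line in kwic(SENTENCES, "upcoming"):
    print(line)
```

In the KWIC output the node word sits in the same column on every line, which is what makes it quick to scan; sentence_sample returns whole sentences in a random order, so repeated searches can differ.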

Task 2 A large corpus basic search - looking at numbers and contexts

1) Go to the website http://www.natcorp.ox.ac.uk and type in ‘upcoming’

2) How many are there in total in the corpus? …………………..

3) Look at the contexts especially the nouns after ‘upcoming’. What field(s) do these nouns come from?

4) Now type in ‘imminent’ which means roughly the same as ‘upcoming’. How many instances are there?..............

5) Do you notice any differences in collocation?

6) Do the same for 'forthcoming' - what other meaning is involved here?

7) With the person next to you, agree a word of interest to you both and think of a synonym. One of you look for the search word and the other look for the synonym.

8) Compare results. How many in the whole corpus are there? Glance through the contexts and note down anything of interest and differences between the two searches.

Brigham Young University BYU-BNC

This website uses exactly the same corpus (set of texts) as its data, i.e. the British National Corpus. However, it uses a completely different concordancer which allows a number of different searches to be carried out. The fact that the BNC was designed with roughly equal sets of texts grouped by genre makes it easier to compare across genres. All the texts date from the 1980s and early 1990s, so it is not suitable for comparison across time or for looking at recently emerging words / phrases. It has been completely lemmatised, so it can be searched for grammatical as well as lexical information.

Task 3 Comparing two words

1) Go to the Brigham Young University BYU-BNC website http://corpus.byu.edu/bnc/

(Register if you need to).

2) Click ‘Compare’. Type in ‘upcoming’ and ‘imminent’.

3) What do you notice about the results?

4) Do the same with 'upcoming' and 'forthcoming'; also 'forthcoming' and 'imminent'.

5) Type in the two words you investigated earlier and click 'Compare'. Do these results confirm or challenge what you found out using the BNC's own website?

6) Choose 2 other words to compare – perhaps / maybe; just / equitable; mad / insane; deeply / closely

Task 4 Comparing per genre

  1. Stay on the BYU-BNC website. Click on ‘Chart’. Type in ‘upcoming’

  2. What do you notice about the use of this word?

  3. Choose another word (e.g. henceforth; funky; lovely) and check the frequency per genre.
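A note on comparing genres: raw counts can mislead when the genre sections differ in size, so genre comparisons are usually made on a normalised figure such as hits per million words. The Python sketch below shows the arithmetic only. All the figures in it are placeholders invented for illustration and are NOT real BNC counts or section sizes; the function names are our own.

```python
# Placeholder figures invented for illustration; they are NOT real BNC data.
GENRE_SIZES = {"spoken": 10_000_000, "fiction": 16_000_000, "newspaper": 10_000_000}
RAW_HITS = {"spoken": 5, "fiction": 8, "newspaper": 40}

def per_million(hits, size):
    """Normalise a raw count to occurrences per million words."""
    return hits * 1_000_000 / size

def genre_chart(raw_hits, sizes):
    """Return the per-million frequency of a word for each genre."""
    return {genre: per_million(raw_hits[genre], sizes[genre]) for genre in raw_hits}

chart = genre_chart(RAW_HITS, GENRE_SIZES)
for genre, freq in chart.items():
    # A crude text bar: one '#' per 0.5 occurrences per million.
    print(f"{genre:<10} {freq:5.2f} {'#' * round(freq * 2)}")
```

With these made-up numbers, fiction has more raw hits than spoken (8 against 5), yet both come out at the same rate per million because the fiction section is larger.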

Task 5 Key word in Context

  1. Choose KWIC and type in ‘upcoming’

  2. Look at the output – what do the colours seem to signify?

  3. Type in ‘just’ and look at the KWIC. This shows the limits of the colour coding.

Task 6 Corpus of Contemporary American English – complex searches

The acronym for this corpus is COCA. It contains 450 million words of American English dating from 1990 onwards. It can thus be searched for changes in the use of a word over the last few decades as well as by genre. As it is regularly updated, it offers a snapshot of contemporary English as used in the USA.

1) Click the BYU-BNC heading

2) Look at the Introduction and click the link for COCA http://corpus.byu.edu/coca/

3) At the bottom of the page click the ‘Five minute guided tour’.

4) Click each link and check how the information on the left side changes.

Task 7 Your Own query

1) Choose one of these or make your own search:

a) 'kith and kin'

b) 'house' as a verb

c) all the forms of 'bring'