
Basics of Online Corpus Research - definitions

Corpus

(plural = corpora) a collection of machine-readable texts which can be searched. Many corpora are designed to be representative of a language or genre. In reality, no corpus can reflect the entire language. It is important to be aware of the types of texts that make up the corpus you are using.

Concordancer

A tool for searching corpora. Online corpora have their own concordancers built in. For example, the British National Corpus’s concordancer is called XAIRA, and the Cobuild Bank of English concordancer was called ‘Look Up’. Stand-alone concordancers can be free, such as ‘AntConc’, or commercial, such as ‘WordSmith Tools’.

Concordance

The output from a concordancer: a list of the occurrences of the search word, each shown with some of its surrounding context.

Node

The search word when you are using a corpus and a concordancer.

Key Word In Context (KWIC)

A concordance which shows the search word (node) in the middle with an equal number of words of context either side.
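The KWIC layout can be illustrated with a minimal Python sketch. The function name `kwic`, the sample sentence and the window of three words are assumptions made for this illustration; they are not taken from XAIRA, AntConc or any other concordancer mentioned here.

```python
import re

def kwic(text, node, window=3):
    """Return (left context, node, right context) for each hit.

    Hypothetical helper for illustration only.
    """
    # Assumption: a word form is a run of letters, digits or apostrophes.
    words = re.findall(r"[A-Za-z0-9']+", text)
    lines = []
    for i, word in enumerate(words):
        if word.lower() == node.lower():
            # Take up to `window` words on each side of the node.
            left = " ".join(words[max(0, i - window):i])
            right = " ".join(words[i + 1:i + 1 + window])
            lines.append((left, word, right))
    return lines

sample = "The cat sat on the mat and the dog sat by the door."
for left, hit, right in kwic(sample, "sat", window=3):
    # Right-align the left context so the node lines up in one column.
    print(f"{left:>20}  {hit}  {right}")
```

Near the start or end of a text there may be fewer words available than the window asks for, so the context on the two sides is equal only where the text allows it.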

Word

A term which can be divided into a number of different concepts in linguistics and lexicography. Although we all ‘know’ what a word is, in linguistics the term is vague because it covers too many meanings. It is better to use more precisely defined terms.

Word form

A sequence of characters between two spaces or punctuation marks.

A computer can count word forms easily.
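As a rough sketch of why word forms are easy for a computer to find, the Python function below splits a text at spaces and punctuation. The function name `word_forms` and the decision to keep apostrophes inside a word form are assumptions for this example, not a rule stated in the definition above.

```python
import re

def word_forms(text):
    """Split a text into word forms (hypothetical helper)."""
    # Assumption: letters, digits and apostrophes belong to a word form;
    # every other character (spaces, commas, colons, ...) is a boundary.
    return re.findall(r"[A-Za-z0-9']+", text)

print(word_forms("Don't panic: the corpus is small, but useful."))
```

Treating the apostrophe as part of the word form keeps "Don't" in one piece. A strict reading of "between two spaces or punctuation marks" would split it in two, which shows that even this simple definition involves a choice.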

Token

A word form counted as an individual item.

A computer can tell you the number of tokens in a text, i.e. all the running words, with every repetition counted.

Type

A word form counted as representing all those with the same sequence of characters.

A computer can tell you the number of types in a text, i.e. all the different words, with each counted only once.
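The difference between tokens and types can be shown with a short Python sketch. The function name `count_tokens_and_types` and the `ignore_case` option are assumptions for this illustration; whether "The" and "the" count as one type depends on the tool and its settings.

```python
import re
from collections import Counter

def count_tokens_and_types(text, ignore_case=True):
    """Return (number of tokens, number of types, frequency table)."""
    if ignore_case:
        # Assumption: "The" and "the" are treated as the same type.
        text = text.lower()
    forms = re.findall(r"[A-Za-z0-9']+", text)
    frequencies = Counter(forms)  # one entry per type
    return len(forms), len(frequencies), frequencies

sample = "The cat sat on the mat. The dog sat too."
tokens, types, freq = count_tokens_and_types(sample)
print(tokens, types)        # 10 tokens, 7 types
print(freq.most_common(2))  # the two most frequent types
```

With `ignore_case=False` the same sentence has 8 types, because "The" and "the" are then different sequences of characters.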

Lexeme

An abstract concept which covers the word forms that are related in meaning.

A computer cannot handle meaning easily unless the corpus and context are extremely restricted.

Lemma

(plural = lemmas or lemmata) The representative word for the family of related word forms of a lexeme, e.g. the lemma BE includes is, was, are, were, being, been. This can also be called the citation form or dictionary headword.

Once it has been programmed to recognize them, a computer can sort word forms into groups under different lemmata.
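A minimal Python sketch of this idea follows. The lookup table `LEMMA_TABLE` is written by hand for illustration and covers only the lemma BE (the forms "be" and "am" are added to those listed above); the function name `group_by_lemma` and the fallback for unknown forms are also assumptions. Real lemmatisers rely on much larger lexicons or on part-of-speech tagging.

```python
from collections import defaultdict

# Hand-written table for illustration only: word form -> lemma.
LEMMA_TABLE = {
    form: "BE"
    for form in ("be", "am", "is", "was", "are", "were", "being", "been")
}

def group_by_lemma(forms):
    """Group word forms under their lemma (hypothetical helper)."""
    groups = defaultdict(list)
    for form in forms:
        # Assumption: a form missing from the table is its own lemma.
        lemma = LEMMA_TABLE.get(form.lower(), form.upper())
        groups[lemma].append(form)
    return dict(groups)

forms = ["She", "is", "here", "and", "they", "were", "there"]
for lemma, members in group_by_lemma(forms).items():
    print(lemma, members)
```

The computer only groups what the table tells it to group, which is the sense in which it must first be "programmed to recognize" the forms.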


Comparison of a selection of tools

| Concordancer | Corpus | Can we use our own corpus? | Platform |
| --- | --- | --- | --- |
| XAIRA | British National Corpus | No | Online |
| Lookup | Bank of English, Wordbanks online | No | Online |
| Wordcruncher (?) | Brigham Young University corpora | ? | |
| Sketch Engine | BNC and other corpora | Yes | Online |
| WordSmith Tools | Any corpus | Unlimited | PC and Mac |
| AntConc | Any corpus | Unlimited | PC and Mac |
| Concordance | Any corpus | Unlimited | PC only |