Basics of Online Corpus Research - definitions
Corpus |
(plural = corpora) A collection of machine-readable texts which can be searched. Many corpora are designed to be representative of a language or genre, but in reality no corpus can reflect the entire language, so it is important to be aware of the types of texts that make up the corpus you are using. |
Concordancer |
A tool for searching corpora. Online corpora have their own concordancers built in: for example, the British National Corpus’s concordancer is called XAIRA, and the Cobuild Bank of English concordancer was called ‘Look Up’. Standalone concordancers can be free, such as ‘AntConc’, or commercial, such as ‘WordSmith Tools’. |
Concordance |
The output from a concordancer |
Node |
The search word you enter when querying a corpus with a concordancer |
Key Word In Context (KWIC) |
A concordance which shows the search word (node) in the middle with an equal number of words of context either side. |
Word |
A term which can be divided into a number of different concepts in linguistics and lexicography. Although we all ‘know’ what a word is, in linguistics the term is vague because it covers too many meanings. It is better to use the more precisely defined terms that follow. |
Word form |
A sequence of characters between two spaces or punctuation marks. A computer can count word forms easily. |
Token |
A word form counted as an individual item, so every occurrence counts separately. A computer can tell you the number of tokens in a text, i.e. the total number of words, repeats included. |
Type |
A word form counted as representing all those with the same sequence of characters, so repeats count only once. A computer can tell you the number of types in a text, i.e. the number of different words. |
Lexeme |
An abstract concept which covers the word forms that are related in meaning. A computer cannot handle meaning easily unless the corpus and context are extremely restricted. |
Lemma |
(plural = lemmas or lemmata) The representative word for the family of related word forms of a lexeme, e.g. the lemma BE includes is, was, are, were, being, been. This can also be called the citation form or dictionary head word. Once it has been programmed to recognize them, a computer can sort word forms into groups under different lemmata. |
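To make the definitions above concrete, here is a minimal Python sketch of how a computer might count tokens and types, print a KWIC concordance for a node, and group word forms under lemmas. It is an illustration only, not how any of the concordancers named here work internally. The sample text, the function names (`tokenize`, `kwic`, `group_by_lemma`) and the hand-made `LEMMAS` table are all assumptions invented for this example. Lower-casing the text is also an assumed design choice: without it, ‘The’ and ‘the’ would count as two types.

```python
import re
from collections import defaultdict

# Hypothetical sample text, invented for this illustration.
TEXT = "The cat was on the mat. The cats were on the mats."

# Hypothetical hand-made lemma table; a real tool would use a much larger one.
LEMMAS = {
    "is": "BE", "was": "BE", "are": "BE", "were": "BE",
    "being": "BE", "been": "BE",
    "cat": "CAT", "cats": "CAT",
    "mat": "MAT", "mats": "MAT",
}


def tokenize(text):
    """Split text into lower-cased word forms (runs of letters or apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def kwic(tokens, node, width=3):
    """Return one KWIC line per hit, with up to `width` words either side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Right-align the left context so the node lines up in a column.
            lines.append(f"{left:>20}  [{tok}]  {right}")
    return lines


def group_by_lemma(tokens, lemma_table):
    """Group word forms under their lemma; unknown forms become their own lemma."""
    groups = defaultdict(set)
    for tok in tokens:
        groups[lemma_table.get(tok, tok.upper())].add(tok)
    return dict(groups)


tokens = tokenize(TEXT)   # every occurrence is one token
types = set(tokens)       # each distinct word form is one type

print(len(tokens), "tokens,", len(types), "types")
for line in kwic(tokens, "on"):
    print(line)
print(sorted(group_by_lemma(tokens, LEMMAS)["BE"]))
```

In this sample, ‘the’ occurs four times, so it contributes four tokens but only one type. Note that the lemma grouping only works because the word forms were listed in the table beforehand, which matches the point above that a computer has to be programmed to recognize them.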
Comparison of a selection of tools
Concordancer | Corpus | Can we use our own corpus? | Platform |
XAIRA | British National Corpus | No | Online |
Lookup | Bank of English, Wordbanks online | No | Online |
Wordcruncher (?) | Brigham Young University corpora | ? | |
Sketch Engine | BNC and other corpora | Yes | Online |
WordSmith Tools | Any corpus | Unlimited | PC and Mac |
AntConc | Any corpus | Unlimited | PC and Mac |
| Any corpus | Unlimited | PC only |