Free online Corpora for Lexical Research

This is a list of the most commonly used corpora that are free to use for research.

ENGLISH LANGUAGE CORPORA HOSTED BY BRIGHAM YOUNG UNIVERSITY - access is free, although the site monitors your usage and will ask you to register if you continue to use the corpora (access remains free).

1) Corpus of Contemporary American English http://corpus.byu.edu/coca/

This 450 million word corpus of American English, hosted on the Brigham Young University website, lets you compare how a word is used across genres and see how its use changed from 1990 to 2012.

2) Corpus of Historical American English (COHA) http://corpus.byu.edu/coha/

This is a 400 million word corpus of American English from 1810 to 2009, which lets you see changes in word use over a long period of time.

3) TIME Magazine Corpus of American English http://corpus.byu.edu/time/

This is a 100 million word corpus of American English from 1923 to 2006, where you can check the rise or decline in the use of a term in TIME magazine over the decades.

4) Corpus of American Soap Operas http://corpus2.byu.edu/soap/

This is a 100 million word corpus of American English drawn from popular TV soap operas from 2001 to 2012.

5) BYU-BNC: British National Corpus http://corpus.byu.edu/bnc/

This is the Brigham Young University interface for searching the 100 million word corpus of British English collected from the 1980s to 1993.

SOME OF THE MANY ENGLISH LANGUAGE CORPORA HOSTED BY THE COMPLEAT LEXICAL TUTOR - all corpora are accessed from a drop-down menu here: http://www.lextutor.ca/concordancers/concord_e.html

1) Brown Corpus (description from website)

The Brown is the classic early corpus that many of those that followed are based on. Compiled in the 1960s by Kucera and Francis at Brown University (Providence, RI) from American texts published in 1961, this corpus comprises 500 written texts of about 2,000 words each in three main divisions (press, journalism, and academic) and several subdivisions.

2) BNC Written (1 million), BNC Spoken (1 million)

After the compilation of the 100 million word British National Corpus, Oxford University Press publicized the achievement in two BNC Sampler corpora of roughly 1 million words each on CD-ROM, one of spoken English and one of written English. These were modified for work on Lextutor by having their tags removed, and they have served in applied linguistics classes to explore differences between written and spoken English (e.g. at http://www.lextutor.ca/range/).

3) Brown + BNC Written (2+ m)

These corpora are described above. The purpose of joining the Brown and the BNC Written Sampler into a single corpus was threefold: to form a corpus large enough to give at least 10 examples of most medium-frequency items; to create a corpus small enough to run over the Web on a phone line; and to combine British and American linguistic features.

4) 2k Graded Corpus (920,000)

This corpus is formed of hundreds of graded readers, scanned and digitized over 10 years. About 2,000 word families account for 95% of the running words overall (not counting proper nouns). This corpus answers a major need in pedagogical concordancing: for learners to perceive lexical or other patterns in a corpus, the corpus must be largely composed of items they are familiar with.

5) 1k Graded Corpus (530,000)

This is derived from the 2k Graded Corpus and is an even more simplified corpus (1,000 word families). It is the closest to a corpus suitable for absolute beginners.
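
The coverage figure quoted for the 2k Graded Corpus (a word list accounting for a given share of the running words) can be illustrated with a short calculation. This is only a minimal sketch and not Lextutor's own method: the function names `tokenize` and `coverage`, the tiny word list and the sample sentence are all invented for illustration, and the code matches plain word forms rather than full word families.

```python
import re

def tokenize(text):
    """Lowercase the text and return its word tokens (the running words)."""
    return re.findall(r"[a-z']+", text.lower())

def coverage(text, known_words):
    """Return the fraction of running words in `text` found in `known_words`."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in known_words)
    return known / len(tokens)

# Hypothetical mini word list standing in for a 2,000-family list.
word_list = {"the", "cat", "sat", "on", "mat", "a"}
sample = "The cat sat on the mat near a heron."

# Print coverage as a percentage of running words.
print(f"{coverage(sample, word_list):.0%}")
```

With a real graded reader and a real 2,000-family list, a result near 95% would match the figure given above; proper nouns would need to be filtered out first to follow the same counting rule.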

OTHER FREE ACCESS CORPORA

1) English Language Interview Corpus as a Second Language Application http://www.uni-tuebingen.de/elisa/html/elisa_index.html

This is a small corpus of interviews with native speakers of English which have been transcribed. The search interface is easy to use.

2) The Scottish Corpus of Texts and Speech (SCOTS) Project http://www.scottishcorpus.ac.uk/

This is a corpus of spoken Scottish with recordings and transcriptions available to listen to. You can search for a word, choose one of the concordance lines and hear it in context. The focus of many of the recordings is discussion of Scots dialect so there are many unusual words in the corpus.

3) The Corpus of Modern Scottish Writing http://www.scottishcorpus.ac.uk/cmsw/

This is a companion to the SCOTS project. It is a corpus of 5.5 million words of written Scottish texts from 1770 to 1945.

4) Michigan Corpus of Academic Spoken English http://micase.elicorpora.info/

This is 1.8 million words of transcribed speech from lectures, seminars and other academic situations. You can restrict your search by speaker attributes (e.g. gender, age) or by transcript attributes (e.g. speech act type, academic discipline).

5) British Academic Spoken English http://www.coventry.ac.uk/research/research-directory/art-design/british-academic-spoken-english-corpus-base/search-the-base-corpus/

This is a corpus of transcribed lectures and seminars recorded in UK universities, many of them at Warwick University. It can be downloaded and searched with your own concordancer, or more restricted access is available via Sketch Engine - see the page above for more information.

6) British Academic Written English http://www.coventry.ac.uk/research/research-directory/art-design/british-academic-written-english-corpus-bawe/contents-of-the-bawe-corpus/search/

This corpus was developed as a research project at the Universities of Warwick, Reading and Oxford Brookes. It has just over 6.5 million words of well-written, mostly undergraduate, essays. It is downloadable, but the above page has advice on different ways to search it online.
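
For the downloadable corpora above (BASE and BAWE), the basic job of a concordancer can be sketched in a few lines. The example below is only an illustration under stated assumptions: the function name `kwic`, the context width and the sample text are invented here, and a real download would first need its files read in and any markup removed.

```python
import re

def kwic(text, keyword, width=3):
    """Return key-word-in-context lines with `width` words either side of each hit."""
    tokens = re.findall(r"\w+(?:'\w+)?", text)
    lines = []
    for i, token in enumerate(tokens):
        # Case-insensitive match on the whole word.
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines

# Invented sample text standing in for a transcribed lecture.
sample = ("So the point of this lecture is simple. "
          "The main point is that data matter.")

for line in kwic(sample, "point"):
    print(line)
```

Each output line shows one occurrence of the search word in brackets with a few words of context on either side, which is the same kind of concordance line the online interfaces display.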

7) English