Corpora


A corpus can broadly be defined as a “principled collection of texts available for qualitative and quantitative analysis” (O’Keeffe, McCarthy, and Carter, 2007, p. 1). It is principled in that it is built according to specific design criteria concerning the size, balance, and representativeness of the language in the texts. Analyses of these collections may be quantitative (for example, the frequency of words in the texts) or qualitative (going beyond the word level with corpus techniques such as concordancing and cluster analysis).
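The two kinds of analysis can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the two-sentence `TEXTS` list is an invented toy stand-in for a real corpus, and the function names and the regular-expression tokenizer are hypothetical simplifications, not the method of any particular corpus tool.

```python
import re
from collections import Counter

# Toy data (assumption): two invented sentences standing in for a real corpus.
TEXTS = [
    "The learner wrote the essay in the classroom.",
    "The teacher read the essay aloud.",
]


def tokenize(text):
    """Lowercase a text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(texts):
    """Quantitative view: count how often each word occurs across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


def concordance(texts, keyword, window=3):
    """Qualitative view: return each hit of `keyword` with up to `window`
    words of context on either side (a keyword-in-context listing)."""
    lines = []
    for text in texts:
        tokens = tokenize(text)
        for i, token in enumerate(tokens):
            if token == keyword:
                left = " ".join(tokens[max(0, i - window):i])
                right = " ".join(tokens[i + 1:i + 1 + window])
                lines.append((left, token, right))
    return lines


if __name__ == "__main__":
    # Most frequent words in the toy corpus.
    print(word_frequencies(TEXTS).most_common(3))
    # Every occurrence of "essay" shown in its immediate context.
    for left, kw, right in concordance(TEXTS, "essay"):
        print(f"{left:>25} [{kw}] {right}")
```

The frequency count answers "how often?", while the concordance lines let the analyst read each occurrence in context, which is where qualitative interpretation begins.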

Different design criteria yield different types of corpora. Based on the medium or language form, there are spoken, written, and mixed corpora. With reference to the number of languages (and varieties of language), there are monolingual (one language, often a national corpus such as the BNC), comparable (two or more languages or varieties), and parallel (two or more languages with their equivalent translations) corpora. Based on the date of origin, there are synchronic or monitor (using contemporary language) and diachronic or historical (tracking the evolution of language) corpora. Finally, with reference to the types of texts, there are general (including many types of texts) and specialized (texts of a particular type) corpora. Examples of specialized corpora are pedagogic corpora (all the language produced within classrooms, including teacher talk and textbooks) and learner corpora (language produced by learners).

Corpora are considered invaluable for reasons that have to do with the quality, quantity, and ease of processing of the language they contain. First, the texts in a corpus are genuine instances of language; a learner corpus, for instance, contains language as actually produced by learners in the classroom, not fictitious examples. Second, corpora offer more and better samples of language; concordancing a native-speaker corpus, for instance, yields many more and better examples of language use than a dictionary. Finally, the digital form of corpora allows high-speed searches and analyses of the language they contain.

These advantages have made corpora indispensable in a number of disciplines. Corpora are used in (a) lexicography, (b) grammar, (c) translation, (d) discourse analysis, (e) forensic linguistics, (f) sociolinguistics, and (g) pedagogy (directly through Data-Driven Learning (DDL) or indirectly through materials design).

Links for Corpora

Corpus query system: Sketch Engine

Sketch Engine for Language Learning

English corpora
Greek corpora
Software tools
AUTh on corpus linguistics
Links to other universities
Associations
Journals