Statistical natural language processing

LDC (Linguistic Data Consortium) and its catalogue by year. Offers the biggest selection of corpora on CD-ROM. Cost varies from cheap (e.g. the ACL-DCI disk) to pricey. CD-ROMs can be bought individually, or institutions can become members and receive discounts on CDs. There is also an LDC Online service for searching over the internet (mainly intended for members, but samplers are available).

European Language Resources Association (ELRA) and its catalogue. The distribution agency is ELDA. A rapidly growing collection of materials in European languages.

ICAME (International Computer Archive of Modern English). Sells various corpora (including Brown and London-Lund). Information on the corpora is available on the internet, by sending the message "help" by email, or by ftp. Manuals for these corpora are also available.

Reuters @ NIST. Reuters corpora are now distributed by NIST.

TRACTOR (TELRI Research Archive of Computational Tools and Resources). Corpora, many multilingual, in European community languages. There is a joining fee to be able to obtain corpora (unless you have contributed corpora).

CLR (Consortium for Lexical Research). Focuses more on language processing tools and lexicons, but also has some corpora. As of February 1996, much of their material is available by anonymous ftp. Their catalog is available as a PostScript file.

OTA (Oxford Text Archive). Provides mainly literary texts. Has a bright new website. Most materials are available on the internet or by anonymous ftp. Some require negotiation with the providers.

Leipzig Corpora Collection. Sentence collections in a MySQL database for 17 mainly European languages.

BNC (British National Corpus). A 100 million word corpus of British English. You can search it on the internet using their simple web interface or via VIEW, a much better interface by Mark Davies, and there is an index to genres by David Lee. And now, an XML edition.

European Corpus Initiative Multilingual Corpus I (ECI/MCI). A 98 million word corpus, covering most of the major European languages, as well as Turkish, Japanese, Russian, Chinese, and Malay. Cheap. You have to sign a license agreement, available at either web site. Also available from the LDC.

Survey of English Usage. At the Department of English Language and Literature at University College London. Includes the British part of ICE, the International Corpus of English project. Now available tagged, and parsed for function. 83,419 sentences. Includes ICECUP, dedicated retrieval software. Also, the Diachronic Corpus of Present-Day Spoken English (800,000 words, tagged and parsed, half from ICE-GB and half from London-Lund).

International Corpus of English (ICE). Million-word collections of English from various world Englishes: ICE-NZ, ICE-HK, ICE-East Africa, etc. Several appear to be downloadable from this site.

Corpora held by Lancaster University. This link provides its own annotations.

The European Language Activity Network (ELAN). Promises a uniform query language for accessing corpora in all EU languages, but is not quite there yet.

Talkbank. Rich video and transcripts.

Particular languages

English-language corpora available from the sites above are not repeated here.

Corpora by Geoffrey Sampson’s team. The SUSANNE corpus and the CHRISTINE corpus (SUSANNE markup of a speech corpus).

Michigan Corpus of Academic Spoken English (MICASE). million words from 1997-2001.

Penn-Helsinki Parsed Corpus of Middle English. A syntactically annotated corpus of the Middle English prose samples in the Helsinki Corpus of Historical English, with additions. 1.3 million words. $200.

Corpus of Professional Spoken American-English (CPSA). Two million words from faculty and committee meetings and White House press conferences (a 50K word sample is free on the internet).

Lancaster Parsed Corpus

Dialogue Diversity Corpus (Bill Mann)

American National Corpus



Acquisition data

CHILDES database. A database of child language transcriptions in English and many other languages. Texts are also available by ftp. Certain usage requirements apply. Manuals and programs for accessing the data (the CLAN concordancer) are also available online. Now in Unicode XML.


Robin Cover’s SGML/XML Web Site. This is a wonderful compendium of information on SGML and XML, including information on the Text Encoding Initiative (TEI). This document is also a guide to many text collections (ones using SGML).

Information about the Text Encoding Initiative (TEI). (The Pizza Chef functions as a TEI tag set selector.)

Xaira (XML Aware Indexing and Retrieval Architecture). The successor of SARA.

Microsoft’s XML page

W3C XML page

The Corpus Encoding Standard. An SGML instance designed for language engineering applications. Also the XML version.


Dictionaries of subcategorization frames

The following dictionaries all list surface subcategorization frames (each with a different annotation scheme). They are also all available in electronic form from the publishers (not free).

COBUILD: Collins Cobuild English Language Dictionary. London: Collins, 1987. The COBUILD site lets you search their Bank of English corpus (but you have to pay to get more than a trial).

LDOCE: Longman Dictionary of Contemporary English. Burnt Mill, Harlow, Essex: Longman, 1978.

OALD: Oxford Advanced Learner’s Dictionary of Current English. Oxford: Oxford University Press, 4th edition, 1989. The third edition also had information on subcategorization frames, although in a different, incompatible format. However, an incomplete version of the third edition (with this information) is available online for free from the Oxford Text Archive.

Levin (1993): Beth Levin. 1993. English Verb Classes and Alternations: A Preliminary Investigation. Chicago. Discusses linguistic distinctions (such as unergative/unaccusative verbs, dative shift, etc.) that are not provided by the above dictionaries. The index of verbs is online.

English subcategorization evaluation resources. Gold standard data, from Cambridge University (Anna Korhonen).

See also COMLEX and CELEX, available from the LDC.

Dictionaries of various languages on the internet

The old version of Robert Beard’s Web of Online Dictionaries long ago mutated into a successor site. I am told the IPO has been postponed. Nonetheless, it is the most comprehensive index of dictionaries available on the internet.