Most of our corpora are provided by the Linguistic Data Consortium (LDC), but there are also several non-LDC corpora. The corpora are available on CD or DVD, and some are also available online on AFS. We are currently moving more corpora online, and this page will be updated with location details as new datasets are added.
LDC Corpora
If a corpus is stored on AFS, the table below shows its directory under /afs/ir/data/linguistic-data/. For example, the English Gigaword is stored at /afs/ir/data/linguistic-data/EnglishGigaword. If no AFS location is given, the corpus is not available online and you must borrow the disc in order to use it. A limited subset of (older) corpora is available for download from LDC Online, so if you don't find a corpus on disc or on AFS, check with the corpus TA. See Get Access for details on how to register to use corpora.
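
The path rule described above can be sketched in a few lines of Python. This is only an illustration: the helper name `afs_path` is hypothetical, and the sketch assumes that the AFS column of the table is simply appended to the base directory quoted above, with a blank column meaning the corpus is not online.

```python
import posixpath

# Base directory quoted in the paragraph above.
AFS_BASE = "/afs/ir/data/linguistic-data"


def afs_path(afs_entry):
    """Return the full AFS path for a table entry (hypothetical helper).

    Returns None when the AFS column is blank, which on this page
    means the corpus is only available on disc.
    """
    entry = afs_entry.strip()
    if not entry:
        return None
    # posixpath keeps forward slashes regardless of the local platform.
    return posixpath.join(AFS_BASE, entry)


# Entries copied from the AFS column of the table.
gigaword = afs_path("EnglishGigaword")
santa_barbara = afs_path("SantaBarbara/4")
not_online = afs_path("")
```

A `None` result is the cue to borrow the disc or ask the corpus TA, as described above.
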
ID | Name | AFS |
---|---|---|
LDC2009S05 | 2007 NIST Language Recognition Evaluation Supplemental Training Set | LanguageRecognitionTraining |
LDC2009T25 | Web 1T 5-gram, 10 European Languages | Web1T5gramEuropean |
LDC2009T26 | NXT Switchboard Annotations | |
LDC2009T24 | OntoNotes Release 3.0 | |
LDC2009T28 | French Gigaword Second Edition | |
LDC2009T30 | Arabic Gigaword Fourth Edition | |
LDC2009T29 | ACL Anthology Reference Corpus | |
LDC2009T12 | 2008 CoNLL Shared Task Data | |
LDC2009T05 | 2008 NIST Metrics for Machine Translation (MetricsMATR08) Development Data | |
LDC2009L01 | An English Dictionary of the Tamil Verb Second Edition | |
LDC2007A09 | Arab Armored Forces Egyptian Dialect Edition | |
LDC2009T22 | Arabic Treebank English Translation | |
LDC2009E72 | Arabic Treebank Part 5 V1.0 | |
LDC2009V01 | Audiovisual Database of Spoken American English | |
LDC2009T04 | BioProp 1.0 | |
LDC2009E39 | CoNLL 2009 Shared Task Chinese Development Set | |
LDC2009E39A | CoNLL 2009 Shared Task Chinese Development Set | |
LDC2009E39B | CoNLL 2009 Shared Task Chinese Development Set | |
LDC2009E37 | CoNLL 2009 Shared Task Chinese Test Set | |
LDC2009E38 | CoNLL 2009 Shared Task Chinese Training Set | |
LDC2009E36 | CoNLL 2009 Shared Task Chinese Trial Data Set | |
LDC2009E36A | CoNLL 2009 Shared Task Chinese Trial Data Set | |
LDC2009E36D | CoNLL 2009 Shared Task Chinese Trial Data Set | |
LDC2009E35 | CoNLL 2009 Shared Task Czech Development Set | |
LDC2009E35B | CoNLL 2009 Shared Task Czech Development Set | |
LDC2009E33 | CoNLL 2009 Shared Task Czech Test Data Set | |
LDC2009E34 | CoNLL 2009 Shared Task Czech Training Set | |
LDC2009E34A | CoNLL 2009 Shared Task Czech Training Set | |
LDC2009E32 | CoNLL 2009 Shared Task Czech Trial Set | |
LDC2009E32A | CoNLL 2009 Shared Task Czech Trial Set | |
LDC2009E31 | CoNLL 2009 Shared Task English Development Set | |
LDC2009E29 | CoNLL 2009 Shared Task English Test Data Set | |
LDC2009E30 | CoNLL 2009 Shared Task English Training Set | |
LDC2009E30A | CoNLL 2009 Shared Task English Training Set | |
LDC2009E28A | CoNLL 2009 Shared Task English Trial Set | |
LDC2009S01 | CSLU: Numbers Version 1.3 | |
LDC2009T20 | Czech Broadcast Conversation MDE Transcripts | |
LDC2009S02 | Czech Broadcast Conversation Speech | |
LDC2009T01 | English CTS Treebank with Structural Metadata | |
LDC2009T13 | English Gigaword Fourth Edition | |
LDC2009R30 | Fisher Spanish Speech and Transcripts | |
LDC2009T03 | GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1 | |
LDC2009T09 | GALE Phase 1 Arabic Newsgroup Parallel Text - Part 2 | |
LDC2009T02 | GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 1 | |
LDC2009T06 | GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 2 | |
LDC2009T15 | GALE Phase 1 Chinese Newsgroup Parallel Text - Part 1 | |
LDC2009R54 | Greybeard Eval | |
LDC2009T08 | Japanese Web N-gram Version 1 | |
LDC2009T10 | Language Understanding Annotation Corpus | |
LDC2009E44 | LDC Standard Arabic Morphological Analyzer (SAMA) version 3.0 | |
LDC2009A01 | NIST 2008 Speaker Recognition Evaluation Followup | |
LDC2009E42 | NIST LRE 2009 CTS Training Data, Indian English Development Data | |
LDC2009P01 | North American News Corpus, WSJ Subset | |
LDC2009T11 | REFLEX Entity Translation Training/DevTest | |
LDC2009T21 | Spanish Gigaword Second Edition | |
LDC2009E73 | Standard Arabic Morphological Analyzer (SAMA) Version 3.1 | |
LDC2009T14 | Tagged Chinese Gigaword Version 2.0 | |
LDC2009T07 | Unified Linguistic Annotation Text Collection | |
LDC2009R27 | VOA Dari & Pashto Audio Archive | |
LDC2008A02 | 2004 An Nahar News Archive | |
LDC2007A25 | 2004 Nove Broadcast Video | |
LDC2008A03 | 2005 An Nahar News Archives | |
LDC2008A01 | 2006 An Nahar Archives | |
LDC2008A08 | 2007 Al Hayat Newswire Archives | |
LDC2008E30 | Aquaint Download | |
LDC2008T25 | AQUAINT-2 Information-Retrieval Text Research Collection | |
LDC2008S09 | CHAracterizing INdividual Speakers (CHAINS) | |
LDC2008E32 | CoNLL 2008 Shared Task Training Set | |
LDC2008T22 | Czech Academic Corpus 2.0 | |
LDC2008E36 | Fisher Phanotics Calls | |
LDC2008E38 | GALE Phase 3 Release 2 - Broadcast Audio | |
LDC2008E41 | GALE Phase 3 Release 2 - Web Text | |
LDC2008L03 | Global Yoruba Lexical Database v. 1.0 | |
LDC2008E29 | LCTL Bengali Language Pack 2.1 | |
LDC2008E27 | LCTL Bengali v2.0 | |
LDC2008S08 | LDC Spoken Language Sampler | |
LDC2007A24 | NIST Pilot Meeting Corpus, Training Data | |
LDC2008E01 | NTCIR-7 Advanced Cross-Lingual Information Access Task | |
LDC2008T20 | PennBioIE CYP 1.0 | |
LDC2008T21 | PennBioIE Oncology 1.0 | |
LDC2008T19 | The New York Times Annotated Corpus | NYT-Annotated-Corpus |
LDC2007A21 | TRECVID Nov 2004 | |
LDC2008S05 | 2005 NIST Language Recognition Evaluation | |
LDC2008T13 | BLLIP North American News Text, Complete | |
LDC2008T14 | BLLIP North American News Text, General Release | |
LDC2008T17 | CALLHOME Mandarin Chinese Transcripts - XML version | |
LDC2008T24 | COMNOM v 1.0 | |
LDC2008S06 | CSLU: Alphadigit Version 1.3 | |
LDC2008S07 | CSLU: ISOLET Spoken Letter Database Version 1.3 | |
LDC2008T09 | GALE Phase 1 Arabic Broadcast News Parallel Text - Part 2 | |
LDC2008T08 | GALE Phase 1 Chinese Broadcast News Parallel Text - Part 2 | |
LDC2008T18 | GALE Phase 1 Chinese Broadcast News Parallel Text - Part 3 | |
LDC2008T23 | NomBank v 1.0 | |
LDC2008T15 | North American News Text, Complete | |
LDC2008T16 | North American News Text, General Release | |
LDC2008T03 | ACE 2005 English SpatialML Annotations | |
LDC2008L01 | An English Dictionary of the Tamil Verb | |
LDC2008T07 | Chinese Proposition Bank 2.0 | |
LDC2008S02 | CSLU: National Cellular Telephone Speech Release 2.3 | |
LDC2008S01 | CSLU: Portland Cellular Telephone Speech Version 1.3 | |
LDC2008T02 | GALE Phase 1 Arabic Blog Parallel Text | |
LDC2008T06 | GALE Phase 1 Chinese Blog Parallel Text | |
LDC2008L02 | Hindi WordNet | |
LDC2008T01 | Hungarian-English Parallel Text, Version 1.0 | |
LDC2008T04 | OntoNotes Release 2.0 | |
LDC2008T05 | Penn Discourse Treebank Version 2.0 | |
LDC2008S03 | STC-TIMIT 1.0 | |
LDC2008S04 | West Point Brazilian Portuguese Speech | |
LDC2007T22 | 2001 Topic Annotated Enron Email Data Set | Topic-Annotated-Enron-Email |
LDC2007S10 | 2003 NIST Rich Transcription Evaluation Data | |
LDC2007S12 | 2004 Spring NIST Rich Transcription (RT-04S) Evaluation Data | |
LDC2007S11 | 2004 Spring NIST Rich Transcription (RT-04S) Development Data | |
LDC2007T40 | Arabic Gigaword Third Edition | |
LDC2007S03 | ARL Urdu Speech Database, Training Data | |
LDC2007T38 | Chinese Gigaword Third Edition | ChineseGigaword |
LDC2007T36 | Chinese Treebank 6.0 (CTB6.0) | Chinese-Treebank |
LDC2007S08 | CSLU: Foreign Accented English Release 1.2 | |
LDC2007S18 | CSLU: Kids' Speech Version 1.1 | |
LDC2007S13 | CSLU: Apple Words and Phrases | |
LDC2007S05 | CSLU: Yes/No Version 1.2 | |
LDC2007T02 | English Chinese Translation Treebank v 1.0 | |
LDC2007T07 | English Gigaword Third Edition | EnglishGigaword |
LDC2007S02 | Fisher Levantine Arabic Conversational Telephone Speech | |
LDC2007T04 | Fisher Levantine Arabic Conversational Telephone Speech, Transcripts | |
LDC2007T24 | GALE Phase 1 Arabic Broadcast News Parallel Text - Part 1 | GALE |
LDC2007T23 | GALE Phase 1 Chinese Broadcast News Parallel Text - Part 1 | GALE |
LDC2007T20 | GALE Phase 1 Distillation Training | GALE |
LDC2007T08 | ISI Arabic-English Automatically Extracted Parallel Text | |
LDC2007T09 | ISI Chinese-English Automatically Extracted Parallel Text | |
LDC2007S01 | Levantine Arabic Conversational Telephone Speech | |
LDC2007T01 | Levantine Arabic Conversational Telephone Speech, Transcripts | |
LDC2007S09 | Mandarin Affective Speech | |
LDC2007T19 | MITRE 1997 Mandarin Broadcast News Speech Translations (Hub-4NE) | MandarinBroadcastNews |
LDC2007S15 | Nationwide Speech Project | |
LDC2007T21 | OntoNotes v 1.0 | OntoNotes |
LDC2007T03 | Tagged Chinese Gigaword | |
LDC2007V02 | TRECVID 2003 Keyframes & Transcripts | |
LDC2007V01 | TRECVID 2005 Keyframes & Transcripts | |
LDC2006S44 | 2004 NIST Speaker Recognition Evaluation | |
LDC2006T06 | ACE 2005 Multilingual Training Corpus | |
LDC2006S46 | Arabic Broadcast News Speech | ArabicBroadcastNews |
LDC2006T20 | Arabic Broadcast News Transcripts | ArabicBroadcastNews |
LDC2006T02 | Arabic Gigaword Second Edition | |
LDC2006S15 | CSLU: Spelled and Spoken Words | |
LDC2006S14 | CSLU: Stories v 1.2 | |
LDC2006S35 | CSLU: Multilanguage Telephone Speech Version 1.2 | |
LDC2006S39 | CSLU: Names Release 1.3 | |
LDC2006S26 | CSLU: Speaker Recognition Version 1.1 | |
LDC2006S16 | CSLU: Spoltech Brazilian Portuguese Version 1.0 | |
LDC2006S01 | CSLU: Voices | |
LDC2006T10 | English-Arabic Treebank v 1.0 | |
LDC2006T17 | French Gigaword First Edition | FrenchGigaword |
LDC2006S43 | Gulf Arabic Conversational Telephone Speech | |
LDC2006T15 | Gulf Arabic Conversational Telephone Speech, Transcripts | |
LDC2006S45 | Iraqi Arabic Conversational Telephone Speech | |
LDC2006T16 | Iraqi Arabic Conversational Telephone Speech, Transcripts | |
LDC2006S42 | Korean Broadcast News Speech | |
LDC2006T14 | Korean Broadcast News Transcripts | |
LDC2006T03 | Korean Propbank | |
LDC2006T09 | Korean Treebank Annotations Version 2.0 | |
LDC2006S29 | Levantine Arabic QT Training Data Set 5, Speech | |
LDC2006T07 | Levantine Arabic QT Training Data Set 5, Transcripts | |
LDC2006S33 | Middle East Technical University Turkish Microphone Speech v 1.0 | |
LDC2006T04 | Multiple-Translation Chinese (MTC) Part 4 | |
LDC2006S13 | N4 NATO Native and Non-Native Speech | |
LDC2006S31 | NIST 2003 Language Recognition Evaluation | |
LDC2006T01 | Prague Dependency Treebank 2.0 | |
LDC2006S34 | Russian through Switched Telephone Network (RuSTeN) | |
LDC2006T12 | Spanish Gigaword First Edition | |
LDC2006S30 | Speech Controlled Computing | |
LDC2006T18 | TDT5 Multilingual Text | |
LDC2006T19 | TDT5 Topics and Annotations | |
LDC2006T08 | TimeBank 1.2 | TimeBank |
LDC2006T13 | Web 1T 5-gram Version 1 (AFS has 1-, 2-, and 3-grams only) | Web1T5gram |
LDC2006S37 | West Point Heroico Spanish Speech | |
LDC2006S36 | West Point Korean Speech | |
LDC2005T09 | ACE 2004 Multilingual Training Corpus | ACE2004-Training |
LDC2005T07 | ACE Time Normalization (TERN) 2004 English Training Data v1.0 | TERN |
LDC2005T35 | American National Corpus (ANC) Second Release | |
LDC2005S07 | Arabic CTS Levantine Fisher Training Data Set 3, Speech | |
LDC2005T03 | Arabic CTS Levantine Fisher Training Data Set 3, Transcripts | |
LDC2005T02 | Arabic Treebank: Part 1 v 3.0 (POS with full vocalization + syntactic analysis) | Arabic-Treebank |
LDC2005T20 | Arabic Treebank: Part 3 (full corpus) v 2.0 (MPG + Syntactic Analysis) | Arabic-Treebank |
LDC2005T30 | Arabic Treebank: Part 4 v 1.0 (MPG Annotation) | Arabic-Treebank |
LDC2005S22 | Articulation Index | |
LDC2005T33 | BBN Pronoun Coreference and Entity Type Corpus | BBN-PCET |
LDC2005S08 | BBN/AUB DARPA Babylon Levantine Arabic Speech and Transcripts | |
LDC2005T13 | CCGbank | |
LDC2005T34 | Chinese <-> English Named Entity Lists v 1.0 | |
LDC2005T10 | Chinese English News Magazine Parallel Text | ChineseEnglishNewsText |
LDC2005T14 | Chinese Gigaword Second Edition | ChineseGigaword |
LDC2005T06 | Chinese News Translation Text Part 1 | |
LDC2005T23 | Chinese Proposition Bank 1.0 | Chinese-PropBank-1.0 |
LDC2005T01 | Chinese Treebank 5.0 | Chinese-Treebank |
LDC2005T01U01 | Chinese Treebank 5.1 | Chinese-Treebank |
LDC2005S26 | CSLU: 22 Languages Corpus | |
LDC2005T08 | Discourse Graphbank | |
LDC2005T12 | English Gigaword Second Edition | EnglishGigaword |
LDC2005S13 | Fisher English Training Part 2, Speech | |
LDC2005T19 | Fisher English Training Part 2, Transcripts | Fisher |
LDC2005T28 | HARD 2004 Text | |
LDC2005T29 | HARD 2004 Topics and Annotations | |
LDC2005S15 | HKUST Mandarin Telephone Speech, Part 1 | |
LDC2005T32 | HKUST Mandarin Telephone Transcript Data, Part 1 | HKUST-Mandarin |
LDC2005S14 | Levantine Arabic QT Training Data Set 4 (Speech + Transcripts) | |
LDC2005L01 | Mawukakan Lexicon | |
LDC2005T05 | Multiple-Translation Arabic (MTA) Part 2 | |
LDC2005S16 | RT-04 MDE Training Data Speech | |
LDC2005T24 | RT-04 MDE Training Data Text/Annotations | |
LDC2005S25 | Santa Barbara Corpus of Spoken American English Part IV | SantaBarbara/4 |
LDC2005S11 | TDT4 Multilingual Broadcast News Speech Corpus | |
LDC2005T16 | TDT4 Multilingual Text and Annotations | |
LDC2005S30 | The West Point Company G3 American English Speech Data Corpus | |
LDC2005S28 | West Point Croatian Speech Corpus | |
LDC2004T15 | 2000 Communicator Dialogue Act Tagged | |
LDC2004T16 | 2001 Communicator Dialogue Act Tagged | |
LDC2004S04 | 2002 NIST Speaker Recognition Evaluation | |
LDC2004S11 | 2002 Rich Transcription Broadcast News and Conversational Telephone Speech | |
LDC2004T18 | Arabic English Parallel News Part 1 | |
LDC2004T17 | Arabic News Translation Text Part 1 | |
LDC2004T02 | Arabic Treebank: Part 2 v 2.0 | |
LDC2004T11 | Arabic Treebank: Part 3 v 1.0 | |
LDC2004L02 | Buckwalter Arabic Morphological Analyzer Version 2.0 | BuckwalterArabicMA |
LDC2004T05 | Chinese Treebank 4.0 | |
LDC2004S01 | Czech Broadcast News Speech | |
LDC2004T01 | Czech Broadcast News Transcripts | |
LDC2004S13 | Fisher English Training Speech Part 1 Speech | |
LDC2004T19 | Fisher English Training Speech Part 1 Transcripts | |
LDC2004V01 | FORM1 Kinematic Gesture | |
LDC2004T08 | Hong Kong Parallel Text | |
LDC2004S02 | ICSI Meeting Speech | |
LDC2004T04 | ICSI Meeting Transcripts | ICSI-Transcripts |
LDC2004S05 | ISL Meeting Speech Part 1 | |
LDC2004T10 | ISL Meeting Transcripts Part 1 | |
LDC2004L01 | Klex: Finite-State Lexical Transducer for Korean | |
LDC2004T03 | Morphologically Annotated Korean Text | |
LDC2004T07 | Multiple-Translation Chinese (MTC) Part 3 | |
LDC2004S09 | NIST Meeting Pilot Corpus Speech | |
LDC2004T13 | NIST Meeting Pilot Corpus Transcripts and Metadata | |
LDC2004T23 | Prague Arabic Dependency Treebank 1.0 | |
LDC2004T25 | Prague Czech-English Dependency Treebank 1.0 | |
LDC2004T14 | Proposition Bank I | |
LDC2004S08 | RT-03 MDE Training Data Speech | |
LDC2004T12 | RT-03 MDE Training Data Text and Annotations | |
LDC2004S10 | Santa Barbara Corpus of Spoken American English Part III | SantaBarbara/3 |
LDC2004S07 | Switchboard Cellular Part 2 Audio | |
LDC2004S12 | TalkBank Ethology Data: Field Recordings of Vervet Monkey Calls | |
LDC2004T09 | TIDES Extraction (ACE) 2003 Multilingual Training Data | |
LDC2003T03 | 1997 HUB5 German Transcripts | |
LDC2003T04 | 1997 HUB5 Spanish Transcripts | |
LDC2003T02 | 1998 HUB5 English Transcripts | |
LDC2003S01 | 2001 Communicator Evaluation | |
LDC2003T01 | 2001 HUB5 Mandarin Transcripts | |
LDC2003T11 | ACE-2 Version 1.0 | |
LDC2003T20 | American National Corpus (ANC) First Release | |
LDC2003T12 | Arabic Gigaword | |
LDC2003T07 | Arabic Treebank: Part 1 - 10K-word English Translation | |
LDC2003T06 | Arabic Treebank: Part 1 v 2.0 | |
LDC2003T09 | Chinese Gigaword | ChineseGigaword |
LDC2003T05 | English Gigaword | EnglishGigaword |
LDC2003V01 | FORM2 Kinematic Gesture | |
LDC2003L01 | Grassfields Bantu Fieldwork: Dschang Lexicon | |
LDC2003S02 | Grassfields Bantu Fieldwork: Dschang Tone Paradigms | |
LDC2003P01 | Korean Telephone Conversations Complete Set | |
LDC2003L02 | Korean Telephone Conversations Lexicon | |
LDC2003S03 | Korean Telephone Conversations Speech | |
LDC2003T08 | Korean Telephone Conversations Transcripts | |
LDC2003T13 | Message Understanding Conference (MUC) 6 | |
LDC2003T18 | Multiple-Translation Arabic (MTA) Part 1 | MTA |
LDC2003T17 | Multiple-Translation Chinese (MTC) Part 2 | |
LDC2003T10 | SAID | SAID |
LDC2003S06 | Santa Barbara Corpus of Spoken American English Part II | |
LDC2003T15 | SLX Corpus of Classic Sociolinguistic Interviews | |
LDC2003T16 | SummBank 1.0 | |
LDC2003S05 | West Point Russian Speech | |
LDC2002S11 | 1997 HUB4 English Evaluation Speech and Transcripts | |
LDC2002S22 | 1997 HUB5 Arabic Evaluation | |
LDC2002T39 | 1997 HUB5 Arabic Transcripts | |
LDC2002S24 | 1997 HUB5 German Evaluation | |
LDC2002S25 | 1997 HUB5 Spanish Evaluation | |
LDC2002S10 | 1998 HUB5 English Evaluation | |
LDC2002S56 | 2000 Communicator Evaluation | |
LDC2002S13 | 2001 HUB5 English Evaluation | |
LDC2002S12 | 2001 HUB5 Mandarin Evaluation | |
LDC2002S34 | 2001 NIST Speaker Recognition Evaluation Corpus | |
LDC2002L49 | Buckwalter Arabic Morphological Analyzer Version 1.0 | |
LDC2002S37 | CALLHOME Egyptian Arabic Speech Supplement | |
LDC2002T38 | CALLHOME Egyptian Arabic Transcripts Supplement | |
LDC2002L27 | Chinese-English Translation Lexicon Version 3.0 | |
LDC2002S28 | Emotional Prosody Speech and Transcripts | |
LDC2002T26 | Korean English Treebank Annotations | |
LDC2002T01 | Multiple-Translation Chinese Corpus | |
LDC2002T07 | RST Discourse Treebank | RST_discourse_treebank |
LDC2002S06 | Switchboard-2 Phase III Audio | |
LDC2002T31 | The AQUAINT Corpus of English News Text | AQUAINT |
LDC2002S04 | Translanguage English Database (TED) Speech | |
LDC2002T03 | Translanguage English Database (TED) Transcripts | |
LDC2002S35 | Voicemail Corpus Part II | |
LDC2002S02 | West Point Arabic Speech Corpus | |
LDC2001S91 | 1997 HUB4 Broadcast News Evaluation Non-English Test Material | |
LDC2001S97 | 2000 NIST Speaker Recognition Evaluation | NIST2000 |
LDC2001T55 | Arabic Newswire Part 1 | |
LDC2001T61 | CALLHOME Spanish Dialogue Act Annotation | |
LDC2001T62 | CETEMpublico | |
LDC2001T11 | Chinese Treebank 2.0 | Chinese-Treebank |
LDC2001S16 | Grassfields Bantu Fieldwork: Ngomba Tone Paradigms | |
LDC2001T02 | Message Understanding Conference (MUC) 7 | MUC_7 |
LDC2001T10 | Prague Dependency Treebank 1.0 | PDT1.0 |
LDC2001S04 | Speech in Noisy Environments (SPINE2) Part 1 Audio | |
LDC2001T05 | Speech in Noisy Environments (SPINE2) Part 1 Transcripts | |
LDC2001S06 | Speech in Noisy Environments (SPINE2) Part 2 Audio | |
LDC2001T07 | Speech in Noisy Environments (SPINE2) Part 2 Transcripts | |
LDC2001S08 | Speech in Noisy Environments (SPINE2) Part 3 Audio | |
LDC2001T09 | Speech in Noisy Environments (SPINE2) Part 3 Transcripts | |
LDC2001S99 | Speech in Noisy Environments 1 (SPINE1 CODED) Coded Audio | |
LDC2001S13 | Switchboard Cellular Part 1 Audio | |
LDC2001S15 | Switchboard Cellular Part 1 Transcribed Audio | |
LDC2001T14 | Switchboard Cellular Part 1 Transcription | |
LDC2001T60 | Syllable-Final /s/ Lenition | |
LDC2001S93 | TDT2 Mandarin Audio Corpus | |
LDC2001T57 | TDT2 Multilanguage Text Version 4.0 | TDT/TDT2-Multilingual |
LDC2001S94 | TDT3 English Audio | |
LDC2001S95 | TDT3 Mandarin Audio | |
LDC2001T58 | TDT3 Multilanguage Text Version 2.0 | TDT/TDT3-Multilingual |
LDC2000S86 | 1998 HUB4 Broadcast News Evaluation English Test Material | |
LDC2000S88 | 1999 HUB4 Broadcast News Evaluation English Test Material | 1999-HUB4-Test |
LDC2000T43 | BLLIP 1987-89 WSJ Corpus Release 1 | BLLIP-WSJ |
LDC2000T50 | Hong Kong Hansards Parallel Text | Hansard-Hong-Kong |
LDC2000T47 | Hong Kong Laws Parallel Text | Hong-Kong-Laws |
LDC2000T46 | Hong Kong News Parallel Text | Hong-Kong-News |
LDC2000T45 | Korean Newswire | |
LDC2000S85 | Santa Barbara Corpus of Spoken American English Part I | SantaBarbara/1 |
LDC2000S96 | Speech in Noisy Environments (SPINE) Evaluation Audio | |
LDC2000T54 | Speech in Noisy Environments (SPINE) Evaluation Transcripts | |
LDC2000S87 | Speech in Noisy Environments (SPINE) Training Audio | SPINE |
LDC2000T49 | Speech in Noisy Environments (SPINE) Training Transcripts | SPINE |
LDC2000S92 | TDT2 Careful Transcription Audio | |
LDC2000T44 | TDT2 Careful Transcription Text | TDT/TDT2-Careful |
LDC2000T52 | TREC Mandarin | |
LDC2000T51 | TREC Spanish | |
LDC2000T53 | Voice of America (VOA) Broadcast News Czech Transcript Corpus | |
LDC2000S89 | Voice of America (VOA) Czech Broadcast News Audio | |
LDC99S80 | 1997 Speaker Recognition Benchmark | |
LDC99S81 | 1999 Speaker Recognition Benchmark | |
LDC99L23 | American English Spoken Lexicon | |
LDC99L22 | Egyptian Colloquial Arabic Lexicon | |
LDC99T34 | Japanese Business News Text Supplement | |
LDC99T40 | Portuguese Newswire Text | |
LDC99T41 | Spanish Newswire Text, Volume 2 | |
LDC99S78 | SUSAS | |
LDC99T33 | SUSAS Transcripts | |
LDC99S79 | Switchboard-2 Phase II | |
LDC99S83 | Tactical Speaker Identification Speech Corpus (TSID) | |
LDC99S84 | TDT2 English Audio | |
LDC99T42 | Treebank-3 | Treebank/3 |
LDC99S82 | USC Marketplace Broadcast News Speech | |
LDC99T36 | USC Marketplace Broadcast News Transcripts | |
LDC98T31 | 1996 CSR HUB4 Language Model | 1996-CSR-Hub-4-LM |
LDC98S71 | 1997 English Broadcast News Speech (HUB4) | |
LDC98T28 | 1997 English Broadcast News Transcripts (HUB4) | English-Broadcast-News |
LDC98S73 | 1997 Mandarin Broadcast News Speech (HUB4-NE) | |
LDC98T24 | 1997 Mandarin Broadcast News Transcripts (HUB4-NE) | |
LDC98S74 | 1997 Spanish Broadcast News Speech (HUB4-NE) | |
LDC98T29 | 1997 Spanish Broadcast News Transcripts (HUB4-NE) | Spanish-Broadcast-News |
LDC98S76 | 1998 Speaker Recognition Benchmark | |
LDC98L21 | COMLEX English Syntax Lexicon | |
LDC98S67 | HTIMIT | |
LDC98S69 | HUB5 Mandarin Telephone Speech Corpus | |
LDC98T26 | HUB5 Mandarin Transcripts | |
LDC98S70 | HUB5 Spanish Telephone Speech Corpus | |
LDC98T27 | HUB5 Spanish Transcripts | Hub5-Spanish-Transcripts |
LDC98T32 | JURIS | |
LDC98S68 | LLHDB | |
LDC98T30 | North American News Text Supplement | |
LDC98S75 | Switchboard-2 Phase I | |
LDC98S72 | Taiwanese Putonghua Speech and Transcripts | |
LDC98T25 | TDT Pilot Study Corpus | TDT/TDT-Pilot-Corpus |
LDC98S77 | Voicemail Corpus Part I | |
LDC97S66 | 1996 English Broadcast News Dev and Eval (HUB4) | |
LDC97S44 | 1996 English Broadcast News Speech (HUB4) | |
LDC97T22 | 1996 English Broadcast News Transcripts (HUB4) | |
LDC97L20 | CALLHOME American English Lexicon (PRONLEX) | CALLHOME |
LDC97S42 | CALLHOME American English Speech | |
LDC97T14 | CALLHOME American English Transcripts | CALLHOME |
LDC97S45 | CALLHOME Egyptian Arabic Speech | |
LDC97T19 | CALLHOME Egyptian Arabic Transcripts | CALLHOME |
LDC97L18 | CALLHOME German Lexicon | CALLHOME |
LDC97S43 | CALLHOME German Speech | |
LDC97T15 | CALLHOME German Transcripts | CALLHOME |
LDC97T12 | DSO Corpus of Sense-Tagged English | |
LDC97S62 | Switchboard-1 Release 2 | |
LDC97S63 | The CMU Kids Corpus | |
LDC96S61 | 1996 Speaker Recognition Benchmark | |
LDC96S36 | Boston University Radio Speech Corpus | Boston-University-Radio |
LDC96S46 | CALLFRIEND American English-Non-Southern Dialect | |
LDC96S47 | CALLFRIEND American English-Southern Dialect | |
LDC96S48 | CALLFRIEND Canadian French | |
LDC96S49 | CALLFRIEND Egyptian Arabic | |
LDC96S50 | CALLFRIEND Farsi | |
LDC96S51 | CALLFRIEND German | |
LDC96S52 | CALLFRIEND Hindi | |
LDC96S53 | CALLFRIEND Japanese | |
LDC96S54 | CALLFRIEND Korean | |
LDC96S55 | CALLFRIEND Mandarin Chinese-Mainland Dialect | |
LDC96S56 | CALLFRIEND Mandarin Chinese-Taiwan Dialect | |
LDC96S57 | CALLFRIEND Spanish-Caribbean Dialect | |
LDC96S58 | CALLFRIEND Spanish-Non-Caribbean Dialect | |
LDC96S59 | CALLFRIEND Tamil | |
LDC96S60 | CALLFRIEND Vietnamese | |
LDC96L17 | CALLHOME Japanese Lexicon | CALLHOME |
LDC96S37 | CALLHOME Japanese Speech | |
LDC96T18 | CALLHOME Japanese Transcripts | CALLHOME |
LDC96L15 | CALLHOME Mandarin Chinese Lexicon | CALLHOME |
LDC96S34 | CALLHOME Mandarin Chinese Speech | |
LDC96T16 | CALLHOME Mandarin Chinese Transcripts | CALLHOME |
LDC96L16 | CALLHOME Spanish Lexicon | CALLHOME |
LDC96S35 | CALLHOME Spanish Speech | |
LDC96T17 | CALLHOME Spanish Transcripts | CALLHOME |
LDC96L14 | CELEX2 | CELEX |
LDC96T11 | COMLEX Syntax Text Corpus Version 2.0 | |
LDC96S33 | CSR-IV HUB3 | |
LDC96S31 | CSR-IV HUB4 | |
LDC96S30 | CTIMIT | |
LDC96S38 | DCIEM/HCRC | |
LDC96S32 | FFMTIMIT | |
LDC96S29 | Frontiers in Speech Processing 93 | |
LDC96S40 | Frontiers in Speech Processing 94 | |
LDC96S64-1 | JEIDA/JCSD-Channel 0 City Names | |
LDC96S64 | JEIDA/JCSD-Channel 0 Complete | |
LDC96S64-2 | JEIDA/JCSD-Channel 0 Control Words | |
LDC96S64-4 | JEIDA/JCSD-Channel 0 Four Digit Sequences | |
LDC96S64-3 | JEIDA/JCSD-Channel 0 Isolated Digits | |
LDC96S64-5 | JEIDA/JCSD-Channel 0 Mono Syllables | |
LDC96S65-1 | JEIDA/JCSD-Channel 1 City Names | |
LDC96S65 | JEIDA/JCSD-Channel 1 Complete | |
LDC96S65-2 | JEIDA/JCSD-Channel 1 Control Words | |
LDC96S65-4 | JEIDA/JCSD-Channel 1 Four Digit Sequences | |
LDC96S65-3 | JEIDA/JCSD-Channel 1 Isolated Digits | |
LDC96S65-5 | JEIDA/JCSD-Channel 1 Mono Syllables | |
LDC96T10 | Message Understanding Conference (MUC) 6 Additional News Text | |
LDC96S41 | VAHA (POLYPHONE II) | |
LDC95T20 | Hansard French/English | Hansard-French |
LDC95T8 | Japanese Business News Text | Japanese-Business-News |
LDC95T21 | North American News Text Corpus | North-American-News |
LDC95S25 | TRAINS Spoken Dialog Corpus | TRAINS |
LDC95T7 | Treebank-2 | Treebank/2 |
LDC94S14A | Air Traffic Control Complete | Air-Traffic-Control |
LDC94T5 | ECI Multilingual Text | ECI-Multilingual |
LDC94S15 | SPIDRE | |
LDC94T4A | UN Parallel Text (Complete) | |
LDC93T1 | ACL/DCI | |
LDC93S4A | ATIS0 Complete | |
LDC93S4B | ATIS0 Pilot | |
LDC93S4B-2 | ATIS0 Read | |
LDC93S4B-3 | ATIS0 SD Read | |
LDC93S5 | ATIS2 | |
LDC93S6A | CSR-I (WSJ0) Complete | |
LDC93S6C | CSR-I (WSJ0) Other | |
LDC93S6B | CSR-I (WSJ0) Sennheiser | |
LDC93S12 | HCRC Map Task Corpus | HCRC-Maptask-Transcripts |
LDC93S2 | NTIMIT | |
LDC93S3A | Resource Management Complete Set 2.0 | |
LDC93S3B | Resource Management RM1 2.0 | |
LDC93S3C | Resource Management RM2 2.0 | |
LDC93S11 | Road Rally | |
LDC93S8 | Switchboard Credit Card | |
LDC93S7-T | Switchboard-1 Transcripts | |
LDC93S9 | TI 46-Word | |
LDC93S10 | TIDIGITS | TIDIGITS |
LDC93S1 | TIMIT Acoustic-Phonetic Continuous Speech Corpus | TIMIT |
LDC93T3A | TIPSTER Complete | |
LDC93T3B | TIPSTER Volume 1 | TREC/TREC-1 |
LDC93T3C | TIPSTER Volume 2 | TREC/TREC-2 |
LDC93T3D | TIPSTER Volume 3 | TREC/TREC-3 |
Non-LDC Corpora
If a corpus is stored on AFS, the table below shows its directory under /afs/ir/data/linguistic-data/. Corpora marked with an asterisk require you to agree to an additional usage license. See Get Access for details.
Name | Annotation | Language | AFS |
---|---|---|---|
Aleksova's corpus | | Bulgarian (spoken) | |
American Heritage Talking Dictionary (3rd edition) | | English | |
ATIS | Syntax, POS, some argument structure (use TIGERSearch) | English | |
Bavarian Archive of Speech Corpora (only annotations) | Prosody, syntax, POS, transcribed | German, English, Japanese | |
British National Corpus (BNC) World Edition | (use gsearch) | English | BNC-world |
British National Corpus (BNC) Web Version 2.0 | On disk, easy-to-use interface | English | |
Brown Corpus | Syntax, POS, some argument structure (use TIGERSearch) | English | Brown |
Census 1990 Names | | English | IE/census1990names |
CHRISTINE | | English | CHRISTINE |
CMU Pronouncing Dictionary | | English | CMU-Pronouncing-Dict |
Cornell SMART Archive | | English | SMART-Archive |
Corpus Gesproken Nederlands | | Contemporary Dutch (spoken) | |
Corpus of Spoken Professional American English | POS (use MonoConc) | American English (spoken) | |
DARPA Extended Resource Management Continuous Speech Speaker-Dependent Corpus (RM2) | | English | |
EMILLE/CIIL | Monolingual and parallel corpora, some Hindi annotated for demonstratives, some Urdu annotated with part-of-speech | Assamese, Bengali, Gujarati, Hindi, Kannada, Kashmiri, Malayalam, Marathi, Oriya, Punjabi, Sinhala, Tamil, Telugu, Urdu | |
Enron Email Corpus | | English | Enron-Email-Corpus |
Excite log | | English | IR |
International Computer Archive of Modern and Medieval English | diachronic corpus | English | ICAME |
International Corpus of English - British Component | (use tgrep2) | English | ICE-GB |
International Corpus of English - Singapore Component | (use tgrep2) | English | ICE-Singapore |
IViE | Prosody, phonetic, etc. | British dialects | |
John Rylands Univ Corpus of late 18c prose | | Early Modern English | Rylands18cProse |
Kristie Seymore's Information Extraction Data | | English | IE/Kristie-Seymore-IE |
LUCY | | English | LUCY |
Mooney Job Data | | English | IE/Mooney-Job-Data |
MuchMore Springer Bilingual Corpus | Part-of-Speech, Morphology (inflection and decomposition), Chunks, Semantic Classes, Semantic Relations | English, German | MuchMore |
MULTEXT-East | lexica, annotated translations of Orwell's 1984 | Bulgarian, Croatian, Czech, Estonian, English, Hungarian, Romanian, Slovene | MULTEXT |
NEGRA | Syntax (LFG-based), POS, some argument structure (use TIGERSearch) | German | NEGRA |
Nihon Kokugo Daijiten | | Japanese | KokugoDaijiten |
PPCME2* | diachronic corpus | | PPCME2 |
PropBank | predicate structure enriched treebank | English | Proposition-Bank-1 |
Remedia Story Comprehension* | | English | QA |
Reuters Corpus | | English | Reuters-Corpus |
RNC German radio news (Nachrichten) corpus | Prosodically annotated & transcribed speech files | German (spoken) | |
Switchboard Corpus | Syntax, POS, some argument structure (use TIGERSearch) | English (spoken) | Switchboard |
Switchboard LINK Project Corpus* | Syntax, POS; some arg-str, animacy, information status, and coreference (use tgrep2) | English (spoken) | Treebank/LINK-swbd |
SUSANNE Corpus, Release 5 | | English | SUSANNE |
TIGER Treebank | Syntax (LFG-based), POS, some argument structure (use TIGERSearch) | German | |
TIGER sample corpora | Syntax, POS, some argument structure (use TIGERSearch) | English | TIGERCorpus |
TREC Text Research Collection Vols. 4 (May 1996) & 5 (April 1997) | | English | |
Unified Medical Language System (UMLS) | | English | UMLS |
Verbmobil Dialogs | | German, English, Japanese | Verbmobil-Dialogs |
Wall Street Journal | Syntax, POS, some argument structure (use TIGERSearch) | English | Treebank |
Wolverhampton Coreference | coreference and anaphora | English | Wolverhampton-Coreference |
WordNet | lexical information database | English | WordNet |
YCOE* | Syntax, POS, CAT, lemma (use TIGERSearch) | English | |
Yomiuri Shinbun | | Japanese | YomiuriShinbun |
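
The asterisk convention for non-LDC corpora can be checked mechanically before you try to open a corpus. The following Python sketch is an assumption-laden illustration, not a tool provided by this page: the function name `describe_corpus` and the returned dictionary keys are hypothetical, and it assumes the asterisk always appears at the end of the corpus name, as it does in the table.

```python
import posixpath

# Base directory quoted in the Non-LDC Corpora paragraph.
AFS_BASE = "/afs/ir/data/linguistic-data"


def describe_corpus(name, afs_entry):
    """Summarize one table row (hypothetical helper).

    A trailing asterisk on the name marks a corpus that needs an
    additional usage license; a blank AFS entry means it is not online.
    """
    stripped = name.strip()
    needs_license = stripped.endswith("*")
    clean_name = stripped.rstrip("*").strip()
    entry = afs_entry.strip()
    path = posixpath.join(AFS_BASE, entry) if entry else None
    return {
        "name": clean_name,
        "needs_license": needs_license,
        "afs_path": path,
    }


# Rows copied from the table: one licensed corpus, one unrestricted corpus.
link = describe_corpus("Switchboard LINK Project Corpus*", "Treebank/LINK-swbd")
wordnet = describe_corpus("WordNet", "WordNet")
```

If `needs_license` is true, see Get Access before using the corpus, even when an AFS path is listed.
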