Text Corpora
Speech Corpora
  • LDC93S1 - TIMIT Acoustic-Phonetic Continuous Speech Corpus
  • LDC93S3A - Resource Management Complete Set 2.0
  • LDC93S6A - CSR-I (WSJ0) Complete
  • LDC95S24 - WSJCAM0 Cambridge Read News
  • LDC98S73 - 1997 Mandarin Broadcast News Speech (Hub-4NE)
  • LDC98T24 - 1997 Mandarin Broadcast News Transcripts (Hub-4NE)
  • LDC98T26 - Hub-5 Mandarin Transcripts
  • LDC98T28 - 1997 English Broadcast News Transcripts (Hub-4)
  • LDC2000S86 - 1998 HUB-4 Broadcast News Evaluation English Test Material
  • LDC2001S99 - Speech in Noisy Environments (SPINE) Vo-Coder Training Audio
  • LDC2001S91 - 1997 HUB-4 Broadcast News Evaluation Non English Test Material
  • LDC2002S12 - 2001 HUB5 Mandarin Evaluation
  • LDC2003T01 - 2001 HUB5 Mandarin Transcripts
  • LDC2004S02 - ICSI Meeting Speech
  • LDC2004T04 - ICSI Meeting Transcripts
  • LDC2004S05 - ISL Meeting Speech Part 1
  • LDC2004T10 - ISL Meeting Transcripts Part 1
  • LDC2004S11 - 2002 Rich Transcription Broadcast News and Conversational Telephone Speech
  • LDC2005S16 - MDE RT-04 Training Data Speech
  • LDC2005T24 - MDE RT-04 Training Data Text/Annotations
  • LDC2007S09 - Mandarin Affective Speech
  • LDC2007S10 - 2003 NIST Rich Transcription Evaluation Data
  • LDC2007S11 - 2004 Spring NIST Rich Transcription (RT-04S) Development Data
  • LDC2007S12 - 2004 Spring NIST Rich Transcription (RT-04S) Evaluation Data
  • LDC2013S09 - CSC Deceptive Speech
  • LDC2013S05 - Greybeard
  • LDC2013S03 - Mixer 6 Speech
  • LDC2015S04 - Mandarin-English Code-Switching in South-East Asia
  • LDC2015S05 - Mandarin Chinese Phonetic Segmentation and Tone

Lexicon
  • LDC96L15 - CALLHOME Mandarin Chinese Lexicon
  • LDC98L21 - COMLEX English Syntax Lexicon
  • LDC2002L27 - Chinese-English Translation Lexicon Version 3.0
  • LDC2004L01 - Klex: Finite-State Lexical Transducer for Korean

Speaker Recognition Evaluation
  • LDC98S76 - 1998 Speaker Recognition Benchmark
  • LDC2006S26 - CSLU: Speaker Recognition Version 1.1
  • LDC2001S97 - 2000 NIST Speaker Recognition Evaluation
  • LDC2002S34 - 2001 NIST Speaker Recognition Evaluation Corpus
  • LDC2004S04 - 2002 NIST Speaker Recognition Evaluation
  • LDC2010S03 - 2003 NIST Speaker Recognition Evaluation
  • LDC2006S44 - 2004 NIST Speaker Recognition Evaluation
  • LDC2011S01 - 2005 NIST Speaker Recognition Evaluation Training Data
  • LDC2011S04 - 2005 NIST Speaker Recognition Evaluation Test Data
  • LDC2011S09 - 2006 NIST Speaker Recognition Evaluation Training Set
  • LDC2011S10 - 2006 NIST Speaker Recognition Evaluation Test Set Part 1
  • LDC2012S01 - 2006 NIST Speaker Recognition Evaluation Test Set Part 2
  • LDC2011S05 - 2008 NIST Speaker Recognition Evaluation Training Set Part 1
  • LDC2011S07 - 2008 NIST Speaker Recognition Evaluation Training Set Part 2
  • LDC2011S11 - 2008 NIST Speaker Recognition Evaluation Supplemental Set
  • LDC2011S08 - 2008 NIST Speaker Recognition Evaluation Test Set
  • LDC2017S06 - 2010 NIST Speaker Recognition Evaluation Test Set

Language Recognition Evaluation
  • LDC94S17 - OGI Multilanguage Corpus
  • LDC2006S35 - CSLU: Multilanguage Telephone Speech Version 1.2
  • LDC2006S31 - 2003 NIST Language Recognition Evaluation
  • LDC2008S05 - 2005 NIST Language Recognition Evaluation
  • LDC2009S04 - 2007 NIST Language Recognition Evaluation Test Set
  • LDC2009S05 - 2007 NIST Language Recognition Evaluation Supplemental Training Set
  • LDC2009E40 - NIST LRE 2009 BN Training Data
  • LDC2009E41 - NIST LRE 2009 CTS Training Data
  • LDC2006E102 - 1996 NIST Language Recognition Evaluation Data
  • LDC2006E103 - 1996 NIST Language Recognition Development Data
  • LDC2006E107 - 2003 NIST Language Recognition Evaluation
  • LDC2006E104 - 2005 NIST Language Recognition Development Data
  • LDC2006E105 - 2005 NIST Language Recognition Evaluation Data
  • LDC2009E42 - NIST LRE 2009 CTS Training Data, Indian English Development Data
  • LDC2009E43 - NIST LRE 2009 BN Training Label Data
  • 2009 NIST Language Recognition Evaluation Data

Topic Detection and Tracking
  • LDC98T25 - TDT Pilot Study Corpus
  • LDC2001T57 - TDT2 Multilanguage Text Version 4.0
  • LDC2001S93 - TDT2 Mandarin Audio Corpus
  • LDC2001T58 - TDT3 Multilanguage Text Version 2.0
  • LDC2001S95 - TDT3 Broadcast News Mandarin Corpus (Audio)
  • LDC2005S11 - TDT4 Multilingual Broadcast News Speech Corpus
  • LDC2005T16 - TDT4 Multilingual Text and Annotations

Korean
Video Retrieval
  • LDC2007V01 - TRECVID 2005 Keyframes & Transcripts
  • LDC2007V02 - TRECVID 2003 Keyframes & Transcripts