The MATBN Mandarin Chinese broadcast news corpus contains a total of 198 hours of broadcast news from the Public Television Service Foundation (Taiwan), with corresponding transcripts in Big5 encoding. SGML tags annotate acoustic conditions, background conditions, story boundaries, speaker turn boundaries, and audible acoustic events such as hesitations, repetitions, vocal non-speech events, and external noises. The DGA&LDC Transcriber is used in this project. The primary motivation for this collection is to provide training and testing data for continuous speech recognition evaluation in the broadcast news domain. We originally planned to collect and process 220 hours of Mandarin Chinese broadcast news speech over three years (2001/08/01-2004/07/31); in the end, we were able to process only 198 hours.
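
As a rough illustration of working with Big5-encoded, SGML-tagged transcripts like those described above, the following Python sketch decodes raw bytes and strips the markup to leave plain text. It is only a sketch under stated assumptions: the function names are hypothetical, the `Turn` tag and its `speaker` attribute in the sample are invented for illustration and are not taken from the corpus, and real files may need a Big5 variant codec such as `cp950` or `big5hkscs` instead of plain `big5`.

```python
import re

# Matches any SGML-style tag, e.g. <Turn speaker="spk1"> or </Turn>.
# This is a generic pattern; it does not assume any particular tag set.
TAG_RE = re.compile(r"<[^>]+>")


def decode_transcript(raw: bytes) -> str:
    """Decode Big5 bytes to text, replacing undecodable bytes."""
    # Assumption: plain "big5" is sufficient; swap in "cp950" if needed.
    return raw.decode("big5", errors="replace")


def strip_tags(text: str) -> str:
    """Remove all tags and drop lines that become empty."""
    lines = (TAG_RE.sub("", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    # Hypothetical sample; the tag name is illustrative only.
    sample = '<Turn speaker="spk1">\n公共電視新聞\n</Turn>'.encode("big5")
    print(strip_tags(decode_transcript(sample)))
```

Stripping tags this way discards the annotations. To keep story or speaker turn boundaries, the tags would have to be parsed against the actual tag set used in the transcription files.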

For details of the MATBN corpus, please refer to "MATBN: A Mandarin Chinese Broadcast News Corpus," an article published in the International Journal of Computational Linguistics and Chinese Language Processing, special issue on Annotated Speech Corpora.


You can download the 40 transcription files here. (modified 2005/02/16)


You can download the 80 transcription files here. (modified 2005/02/16)


You can download the 78 transcription files here. (modified 2005/02/16)


You can query the sections using this tool. (modified 2012/06/16)
You can query the speakers using this tool. (modified 2012/06/16)


Hsin-min Wang (principal investigator)
Berlin Chen (co-principal investigator)
Mei-Li Chang (transcription and annotation)
Tzau-Fang Yan (transcription and annotation)
Shih-Sian Cheng (statistical corpus analysis)
Yi-Hsiang Chao (speech data processing)
Jen-Wei Kuo (corpus tool & speech recognition evaluation)
Kuan-jung Chen (transcription and annotation, former staff)


This project is funded by the National Science Council of the Republic of China under grants NSC 90-2213-E-009-109, NSC-91-2219-E-009-039, and NSC-92-2213-E-009-021. We would like to thank the Public Television Service Foundation (Taiwan) for sharing their broadcast news with us, and their employees for helping us set up the recording machines in their broadcasting studio and for operating them regularly. We also thank Dr. Chiu-yu Tseng and Dr. Shu-chuan Tseng for their valuable assistance and comments on the transcription and annotation, and Prof. Sadaoki Furui and his colleagues for sharing their experiences with us.