The MATBN Mandarin Chinese broadcast news corpus contains
a total of 198 hours of broadcast news from
the Public Television Service Foundation (Taiwan),
with corresponding Big5-encoded transcripts and SGML tagging that annotates
acoustic conditions, background conditions,
story boundaries, speaker turn boundaries and audible acoustic events,
such as hesitations, repetitions, vocal non-speech events, external noises, etc.
The DGA&LDC Transcriber is used in this project.
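As a rough illustration of working with Big5-encoded, SGML-tagged transcripts, the following Python sketch decodes a byte string and strips the tags to recover plain text. The sample line, the tag name `Turn`, and its `speaker` attribute are hypothetical and are not taken from the corpus files; the actual tag set should be checked against the transcripts themselves.

```python
import re

# Matches any SGML-style tag such as <Turn ...> or </Turn>.
TAG_PATTERN = re.compile(r"<[^>]+>")


def decode_transcript(raw: bytes) -> str:
    """Decode Big5 bytes into a string, replacing undecodable bytes."""
    return raw.decode("big5", errors="replace")


def strip_tags(text: str) -> str:
    """Remove SGML-style tags and collapse runs of whitespace."""
    without_tags = TAG_PATTERN.sub(" ", text)
    return " ".join(without_tags.split())


# Hypothetical sample line; real transcripts may use different tags.
sample = '<Turn speaker="spk1">今天天氣很好</Turn>'.encode("big5")
plain_text = strip_tags(decode_transcript(sample))
print(plain_text)
```

When reading from disk, the same decoding can be done by opening the file with `open(path, encoding="big5", errors="replace")`, where `path` is whatever location the transcript was saved to.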
The primary motivation for this collection is to provide
training and testing data for continuous speech recognition evaluation in the broadcast news domain.
We expected to collect and process 220 hours of Mandarin Chinese broadcast news speech over 3
However, in the end we were able to process only 198 hours.
You can download the 40 transcription files here.
You can download the 80 transcription files here.
You can download the 78 transcription files here.
Hsin-min Wang (principal investigator)
This project is funded by the National Science Council of the Republic of China
under grants NSC 90-2213-E-009-109, NSC-91-2219-E-009-039, and NSC-92-2213-E-009-021.
We would like to thank the Public Television Service Foundation (Taiwan) for sharing
their broadcast news with us and their employees for helping us to set up the
recording machines in their broadcasting studio and operating them regularly.
Acknowledgments go to Dr. Chiu-yu Tseng and Dr. Shu-chuan Tseng for their valuable
assistance and comments on the transcription and annotation,
and to Prof. Sadaoki Furui and his colleagues for sharing their experiences with us.