The Website for CAL500 Expansion (CAL500exp)

Authors: Ju-Chiang Wang*, Shou-Yang Wang, Yi-Hsuan Yang and Hsin-Min Wang

* Corresponding Email: asriver dot wang at gmail dot com



Music auto-tagging refers to automatically assigning semantic labels (tags) such as genre, mood, and instrument to music so as to facilitate text-based music retrieval. Although significant progress has been made in recent years, relatively little research has focused on semantic labels that vary over time within a track. Music is an art of time, and it is not unusual for a performer or composer to express different emotions and timbres within a single piece. There are therefore strong theoretical and empirical reasons to model the high-level semantics of music as time-varying. Moreover, instead of labeling a song with a number of track-level tags, it would be more interesting to specify the second-by-second variation of high-level musical semantics. These ideas, however, have not been widely studied so far, possibly because research on music auto-tagging is relatively young (it emerged around 2007), and/or because collecting second-by-second annotations from scratch is extremely time-consuming and labor-intensive.

Here we present the CAL500 Expansion (CAL500exp) dataset, an enriched version of the well-known CAL500 dataset [1]. CAL500exp is designed to facilitate music auto-tagging at a finer temporal scale, or more specifically, time-varying music tag tracking applications such as the Play-with-Tagging music player [2]. Details can be found here. The construction of CAL500exp differs from related work in several aspects.

  • The tags are annotated at the segment level instead of the track level. We perform audio-based segmentation to extract acoustically homogeneous, variable-length segments, followed by inter-segment clustering to select representative segments for annotation, making the connection between tags and music better defined.
  • Instead of annotating each segment from scratch using free responses (i.e., the so-called "folksonomy"), we initialize the annotation of each segment using the fixed tag vocabulary and the track-level labels of CAL500, and ask annotators to check and refine the labels by insertion or deletion, which improves the quality of the annotations.
  • Instead of resorting to crowdsourcing, we recruit eleven annotators with strong musical backgrounds for better annotation quality. Exemplar tracks and clear instructions are given to the subjects, and a new user interface is carefully designed for the annotation task.
  • A comprehensive performance study has validated the high quality of the CAL500exp labels and demonstrated the dataset's capability to improve the performance of time-varying music auto-tagging.

We hope that CAL500exp will encourage more research toward understanding the temporal context of musical semantics.



  1. D. Turnbull, L. Barrington, D. Torres, and G. Lanckriet, "Towards musical query-by-semantic-description using the CAL500 data set," in Proc. International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 439-446, 2007.
  2. J.-C. Wang, H.-M. Wang, and S.-K. Jeng, "Playing with tagging: A real-time tagging music player," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 77-80, 2012.



To use the CAL500exp dataset, please cite the following paper:

S.-Y. Wang, J.-C. Wang, Y.-H. Yang, and H.-M. Wang, "Towards time-varying music auto-tagging based on CAL500 expansion," in Proc. IEEE International Conference on Multimedia and Expo, Barcelona, Spain, 2014.



Here is the link to the original sound files of CAL500.

Time-varying Terms: There are 67 tags in CAL500 that are considered time-varying.

Segment-level Hard (Binary) Label: For each segment, we include the beginning and end times (in seconds, relative to the original sound files of CAL500) and the {0,1} labels obtained by majority voting among at least 5 subjects.

Segment-level Soft (Probability) Label: For each segment, we include the beginning and end times (in seconds, relative to the original sound files of CAL500) and the probability label (0-1), computed as the number of positive votes divided by the number of subjects.
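The two label types above can be derived from the same per-subject votes. The following sketch illustrates the computation; the list of votes is a hypothetical example, not the official CAL500exp file format.

```python
# Hypothetical per-subject binary votes for one (segment, tag) pair,
# e.g. from 5 annotators (1 = tag applies, 0 = it does not).
votes = [1, 0, 1, 1, 0]

# Soft (probability) label: fraction of positive votes.
soft_label = sum(votes) / len(votes)  # 3/5 = 0.6

# Hard (binary) label: majority voting among the subjects.
hard_label = 1 if sum(votes) * 2 > len(votes) else 0  # 1

print(soft_label, hard_label)
```

With an even number of voters, a tie-breaking rule would also be needed; the dataset sidesteps this by requiring at least 5 subjects per segment.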

Note that we can provide the MP3 files (22.5 kHz, mono) for all the segments upon request. Please contact Ju-Chiang Wang via the e-mail above.