Discover the Second Hand Songs Dataset: Your Ultimate Resource for Cover Song Research

Are you working in Music Information Retrieval (MIR) or conducting academic research on song versions and adaptations? The SecondHandSongs dataset is the official list of cover songs within the Million Song Dataset.

This dataset is the fruit of a proud partnership between the Million Song Dataset team and the dedicated experts at Second Hand Songs. It’s designed to empower researchers like you with a robust and readily accessible dataset to further your studies in cover song identification, analysis, and more.

We strongly encourage you to explore and contribute to the Second Hand Songs website. It stands as a premier online hub for the MIR community and anyone fascinated by the rich tapestry of song covers and musical adaptations.

What is the Second Hand Songs Dataset?

The SecondHandSongs dataset comprises a total of 18,196 tracks sourced from the Million Song Dataset, intelligently organized into “cliques.” These cliques represent groups of different versions of the same fundamental musical work. Whenever feasible, these cliques are linked to specific “works” on the SecondHandSongs website, accessible via URLs like: http://www.secondhandsongs.com/work/. In cases where a work is not found on SecondHandSongs, a negative number is used as a placeholder. Similarly, when available, the dataset includes performance numbers, with detailed information available at: http://www.secondhandsongs.com/performance/.

Each line in the dataset file adheres to the following format:

# - comment, ignore
%a,b,c, title - beginning of a clique. a,b,c are work IDs (negative if not available)
TID<SEP>AID<SEP>perf - track ID from the MSD (plus artist ID and SHS performance)
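
As a rough illustration of the line format above, the following Python sketch groups track lines under the clique header that precedes them. It rests on three assumptions: the fields of a track line are separated by the literal token `<SEP>`, the performance field is always an integer, and work IDs are written without spaces before the title. The names `Clique` and `parse_shs` are hypothetical and not part of any official tooling.

```python
import re
from dataclasses import dataclass, field

# Assumption: track-line fields are separated by the literal token "<SEP>".
SEP = "<SEP>"
# One or more "integer," groups right after "%", then the title.
CLIQUE_RE = re.compile(r"^%((?:-?\d+,)+)\s*(.*)$")


@dataclass
class Clique:
    work_ids: list   # SHS work IDs, negative when not available
    title: str
    tracks: list = field(default_factory=list)  # (track_id, artist_id, perf)


def parse_shs(lines):
    """Group track lines under the clique header that precedes them."""
    cliques = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # comment or blank line
        if line.startswith("%"):
            match = CLIQUE_RE.match(line)
            if match is None:
                raise ValueError(f"malformed clique header: {line!r}")
            ids = [int(x) for x in match.group(1).rstrip(",").split(",")]
            cliques.append(Clique(ids, match.group(2)))
            continue
        if not cliques:
            raise ValueError("track line before any clique header")
        track_id, artist_id, perf = line.split(SEP)
        cliques[-1].tracks.append((track_id, artist_id, int(perf)))
    return cliques
```

Passing an open file object works, since the function only iterates over lines. Titles that contain commas are kept intact because only the leading run of integers is read as work IDs.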

Note that the Million Song Dataset sometimes includes multiple versions of the same song by the same artist. To keep the focus on genuine cover songs, these near-duplicate tracks have been excluded from the cliques. They remain in the MSD itself, so disregard known duplicates when running tests and analyses with this dataset.

Recognizing the research community’s need for standardized evaluation, the SecondHandSongs dataset is thoughtfully divided into “train” and “test” sets, mirroring common practices in machine learning tasks. For rigorous evaluation, performance metrics should be reported exclusively on the “test” set, while the “train” set is intended for system development and parameter tuning.

Accessing the Second Hand Songs Dataset

Ready to dive in? You can access the dataset through these direct links:

These datasets are also included in the main GitHub repository. The training set contains 4,128 of the 5,854 cliques and 12,960 of the 18,196 tracks, which leaves 1,726 cliques and 5,236 tracks for the test set.

Frequently Asked Questions (FAQs) – General

What is the connection between this dataset and the Million Song Dataset?

The SecondHandSongs dataset is a separate entity but specifically references songs that are part of the Million Song Dataset (MSD). The majority of its data originates from the extensive Second Hand Songs website. Its creation was a collaborative effort between SecondHandSongs.com and the Million Song Dataset project team.

How was this dataset compiled?

The dataset’s foundation is the Second Hand Songs (SHS) database, the very backbone of their online platform. Additionally, the MSD team contributed further known cover songs to enrich the dataset. As a result, you might discover cover songs within this dataset that are not yet publicly listed on the SHS website.

Are there any potential limitations or inaccuracies?

Your feedback is crucial to improving this resource! We’ve already identified some potential areas for scrutiny:

Instances where covers might actually be duplicates of the same song by the same artist.
Possible inaccuracies arising from string matching processes on artist names or song titles sourced from SecondHandSongs.
Omissions of cover songs (some intentional, some unintentional).
Potential inaccuracies in the information drawn from SecondHandSongs.com.

For more in-depth technical details on these points, please consult the Technical FAQ section below.

What are the licensing terms for using this dataset?

In essence, its use is strictly limited to research purposes. Commercial applications, such as creating a cover song website for profit, or any similar endeavors, are prohibited without explicit written consent from the Second Hand Songs team (SHS). This licensing is akin to the terms governing The Echo Nest data within the MSD. However, SHS retains the right to promote and reference any research or publications that utilize this dataset.

How can I contribute to the growth and accuracy of this dataset?

If you identify a new cover song within the MSD that’s not yet in our dataset, first check if it’s already documented on the SHS website and add it there if necessary. Then, share the relevant details with us: the Echo Nest track ID from the MSD and the SHS performance ID or URL. Crucially, please verify that it is indeed a cover song – songs sharing the same title are not always covers.

How should I cite the SecondHandSongs dataset in my research?

Please cite the following publication: [bib]. Additionally, you are encouraged to mention or link to this web resource:

SecondHandSongs dataset, the official list of cover songs within the Million Song Dataset, available at: http://millionsongdataset.com/secondhand

Who should I contact for further assistance?

Thierry Bertin-Mahieux remains your best initial point of contact. Alternatively, you can reach out via the MSD mailing list. For questions specifically about secondhandsongs.com, please contact them directly.

Frequently Asked Questions (FAQs) – Technical

(For MIR practitioners) “If all else fails, read the instructions.” – Donald Knuth

What is the recommended procedure for training, testing, and evaluating models with this dataset?

During the training phase, you have access to the complete feature set, but only for the training portion of the dataset.
In the testing phase, for each track within a clique in the test set, query the MSD to identify and rank the most similar songs.
Evaluation should be performed using the ground-truth tracks designated for each clique, employing metrics like average precision (AP) or mean reciprocal rank (MRR).
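
The two metrics named above can be sketched in a few lines of Python. This is a minimal illustration, not an official scoring script. It assumes each query yields a ranked list of unique track IDs and that the ground truth is the set of the other tracks in the query's clique. Average precision is normalized here by the total number of relevant tracks, which is one common convention. The function names are hypothetical.

```python
def average_precision(ranked, relevant):
    """AP of a ranked list of track IDs against the set of true covers."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, track_id in enumerate(ranked, start=1):
        if track_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this hit
    return precision_sum / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first true cover, or 0.0 if none is retrieved."""
    relevant = set(relevant)
    for rank, track_id in enumerate(ranked, start=1):
        if track_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(queries):
    """Mean AP and mean reciprocal rank over (ranked, relevant) pairs."""
    queries = list(queries)
    if not queries:
        raise ValueError("no queries to evaluate")
    mean_ap = sum(average_precision(r, rel) for r, rel in queries) / len(queries)
    mrr = sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)
    return mean_ap, mrr
```

Reporting both numbers is useful because mean AP rewards retrieving every cover in a clique, while MRR only looks at the first correct hit.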

When using a song (song A) as a query, it’s crucial to exclude all songs by the same artist as song A, along with any known duplicates of the cover songs (refer to the official MSD duplicate list).
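
One possible way to apply this exclusion rule is to filter the ranked list before scoring it. The sketch below assumes you already have a mapping from track ID to artist ID and a set of track IDs taken from the duplicate list; `filter_candidates`, `artist_of`, and `known_duplicates` are hypothetical names chosen for illustration.

```python
def filter_candidates(ranked, query_artist_id, artist_of, known_duplicates):
    """Drop same-artist tracks and known duplicates, preserving rank order.

    ranked           -- track IDs ordered from most to least similar
    query_artist_id  -- artist ID of the query track (song A)
    artist_of        -- dict mapping track ID -> artist ID
    known_duplicates -- set of track IDs flagged as duplicates
    """
    return [
        track_id
        for track_id in ranked
        if artist_of.get(track_id) != query_artist_id
        and track_id not in known_duplicates
    ]
```

Because the query track shares its own artist ID, it is removed as well, so it cannot appear as a trivial top match. Tracks missing from `artist_of` are kept in this sketch; whether to keep or drop them is a design choice.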

This setup mirrors a realistic scenario: Imagine you are a platform like YouTube or iTunes with a vast music library and a training set of identified covers. When an artist identifies their track within your library and asks you to find covers, this is the task you’re addressing.

How reliable and accurate is the dataset?

The dataset is remarkably clean and reliable. SecondHandSongs.com maintains high data quality through staff-verified data entry. Most potential errors are traced back to string matching issues (artist names, song titles) and inherent inaccuracies within the MSD and The Echo Nest databases – for instance, bands sharing names or artists with slightly different spellings being treated as distinct entities. However, a more significant consideration is the potential for omissions – covers that are not yet included.

Why are work IDs missing for some cliques?

Prior to our collaboration with SecondHandSongs, the MSD team had already identified a number of cover songs independently. We opted to include these in the dataset. Regrettably, manually adding all of them to secondhandsongs.com is an extensive undertaking.

Why are many obvious cover songs seemingly absent from the dataset?

Our primary aim was to avoid MSD duplicates. Within each clique, most tracks are by different artists. Where tracks by the same artist appear, we made sure they have distinct song IDs and titles; relaxing that rule would have made the dataset much larger. We also intentionally omitted some “medleys”, because a single track belonging to multiple cliques complicates both clique creation and evaluation. Furthermore, the SecondHandSongs database does not list every cover song ever recorded. Consequently, during evaluation your top match may be a valid cover even though it is not listed in our dataset. That said, if your system is down to only a few such errors across a million songs, that is already a strong result!

Why were duplicate tracks excluded?

Identifying covers that are essentially duplicates is a straightforward task. Our aim is to avoid repeatedly testing with nearly identical tracks, which would not provide meaningful insights into cover song detection algorithms.

Extreme Cover Songs

Is your algorithm performing exceptionally well? Challenge it further with these tracks! (These were intentionally excluded from the main list – can you determine why?)

TRVMOOV128F92E4D89 TRKVYPP128F9337A83 TRLJJLR128E07927B6 TREGTQA128F426C0C2 TRDJEFP128F933925A TRJZRKL128F931369B TRZXLTH128E078AF43 TRRVJWB128F426C0A9 TRFTSLG128F92F7204 TROFRYV128F1482A5E TRVATYS128F9339241 TRSFNHG128F427DF40 TRKIDJH128F4298840 TRZYXXX128E078CE4A TRHSZPV128F1459651 TRPAXPU12903CAA835 TRZRYDQ128F9341FDC TRRKQAS128F42AE480 TRCZJGM128F4261493

Publications

Below is an informal compilation of published research results utilizing the SecondHandSongs dataset. It represents a subset of the broader MSD publications page. We generally include works that present results based on a substantial portion of the million songs. If you believe your research should be included in this list, please don’t hesitate to contact us via email. This collective effort helps the community stay informed about the current state-of-the-art in the field.