The Echo Nest and Columbia University Announce 'Million Song Dataset'

Mar 4, 2011

SOMERVILLE, MA — March 4, 2011 - The Echo Nest, a music intelligence platform powering smarter music apps across the web and various devices, announced on Tuesday that it has provided music analysis and metadata to the Million Song Dataset, a collaboration between The Echo Nest and Columbia University's LabROSA (Laboratory for the Recognition and Organization of Speech and Audio) department, with hosting by Infochimps and funding from the National Science Foundation.

The driving principle behind this massive release of data is to give back to the MIR (Music Information Retrieval) community in the hope of helping researchers pursue their ideas without having to "reinvent the wheel" each time by painstakingly building their own smaller sets of data.

Commercial software developers, academic researchers, and data scientists alike can use the Million Song Dataset to test theories and build and refine algorithms for music recommendation, cultural analysis, and countless other purposes.

Million Song Dataset team members Thierry Bertin-Mahieux and Dan Ellis hope to accomplish the following goals with the metadata and music analysis freely provided by The Echo Nest:
To encourage research on algorithms that scale to commercial sizes
To provide a reference dataset for evaluating research
To offer a shortcut alternative to creating a large dataset with The Echo Nest's API
To help new researchers get started in the MIR field

For too long, music software developers have lacked a freely available dataset like this to bridge the gap between theoretical research and commercial applications.

"One of the long-standing criticisms of academic music information research from our colleagues in the commercial sphere is that the ideas and techniques we develop simply aren't practical for real services, which must offer hundreds of thousands of tracks at a minimum," said Dan Ellis, Columbia University associate professor of electrical engineering and head of LabROSA. "But, as academics, how can we develop scalable algorithms without the large-scale datasets to try them on? The idea of a 'million song dataset' started as a flippant suggestion of what it would take to solve this problem. But the idea stuck -- not only in the form of developing a very large, common dataset, but even in the specific scale of one million tracks."

The core of the Million Song Dataset consists of detailed data about one million songs, but no audio files. However, it includes a mapping to 7digital's library of 30-second samples, allowing researchers to test their technologies in the real world. This large dataset (approximately 200GB, depending on which files the developer chooses) is hosted by Infochimps.

"There are a lot of compelling music applications that haven't been built because of the heavy lifting involved with the infrastructure," said Infochimps CEO Nick Ducoff. "Between The Echo Nest's platform and the Million Song Dataset available on Infochimps, the only thing keeping a developer from building a compelling music-focused app is his or her imagination."

Interested parties can visit for the code, instructions on how to use it, benchmark results for example tasks (such as automatic song tagging and artist recognition), artist mapping to Yahoo's user ratings, and demonstrations of how to fetch audio snippets from 7digital and represent artists on a world map using the data, as well as a forum and FAQ.

The Million Song Dataset is a collaboration between The Echo Nest and Columbia University's LabROSA department, with funding from the National Science Foundation and hosting provided by Infochimps.

About The Echo Nest
The Echo Nest is a music intelligence company that connects the greatest application developers to the best data and music to enable the next generation of music experiences. Powered by the world's only machine learning system that actively reads about and listens to music everywhere on the web, The Echo Nest opens up a massive repository of dynamic music data to application developers ranging from one-person operations to multinational corporations. Over 100 music applications have been built on its platform to date.

The Echo Nest was co-founded by two MIT Media Lab PhDs. The company has won three National Science Foundation SBIR grants, and its investors include Matrix Partners, Commonwealth Capital Ventures, Argos Management, and three co-founders of the MIT Media Lab. For more information, visit