Data Acquisition and Cleaning
- Genre Oracle
- Oct 22, 2018
- 2 min read
Updated: Oct 23, 2018
We explored many possible datasets and finally decided to use the Million Song Dataset produced by Columbia University. This dataset contains the databases that we need in order to build an effective genre classification model:
Song metadata (from musixmatch): this database contains basic information about each song, including song name, album name, artist name, artist familiarity, year, etc.
Lyrics (from musixmatch): this database contains the lyrics of each song in bag-of-words format. Each row of the lyrics database has a “track_id”, a “word” and a “count”, which means that in the song identified by that track_id, the given word appears that many times.
Tags (from lastfm): this database contains the tags of each song. The tags cover many aspects, such as music type, feeling, and era. For example, blues, romantic, and 00’s music are all tags in the database. There are about 50,000 unique tags.
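To make the bag-of-words layout concrete, here is a minimal sketch that groups (track_id, word, count) rows into one word-count dictionary per song. The track IDs, words, and counts are made up for illustration and are not taken from the dataset.

```python
from collections import defaultdict

# Hypothetical rows in the (track_id, word, count) layout described above.
rows = [
    ("TR0001", "love", 4),
    ("TR0001", "night", 2),
    ("TR0002", "love", 1),
    ("TR0002", "road", 3),
]


def rows_to_bags(rows):
    """Group (track_id, word, count) rows into one word-count dict per track."""
    bags = defaultdict(dict)
    for track_id, word, count in rows:
        bags[track_id][word] = count
    return dict(bags)


bags = rows_to_bags(rows)
print(bags["TR0001"])
```

Note that word order is lost in this format; only how often each word appears in a song is kept.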
The databases are stored in “.db” files, which are SQLite databases. We used the sqlite3 module in Python to write SQL queries that merge the databases together. Each song has a unique key called “track_id” or “tid”, which allows us to link the databases.
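The snippet below is a rough sketch of what such a merge can look like with sqlite3. It builds three small in-memory tables instead of opening the real files, and the table names, column names, and rows (songs, lyrics, tags) are simplified assumptions rather than the exact schema of the dataset.

```python
import sqlite3

# In-memory stand-in for the real .db files; schema and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE songs (track_id TEXT PRIMARY KEY, title TEXT, year INTEGER);
    CREATE TABLE lyrics (track_id TEXT, word TEXT, count INTEGER);
    CREATE TABLE tags (tid TEXT, tag TEXT);
    INSERT INTO songs VALUES ('TR0001', 'song a', 1999);
    INSERT INTO songs VALUES ('TR0002', 'song b', 2005);
    INSERT INTO lyrics VALUES ('TR0001', 'love', 4);
    INSERT INTO lyrics VALUES ('TR0001', 'night', 2);
    INSERT INTO lyrics VALUES ('TR0002', 'road', 3);
    INSERT INTO tags VALUES ('TR0001', 'blues');
    INSERT INTO tags VALUES ('TR0002', 'hip-hop');
""")

# Link the three tables through the shared key ("track_id" in two tables, "tid" in the third).
query = """
    SELECT s.track_id, s.title, s.year, t.tag, l.word, l.count
    FROM songs AS s
    JOIN lyrics AS l ON l.track_id = s.track_id
    JOIN tags AS t ON t.tid = s.track_id
    ORDER BY s.track_id, l.word
"""
merged = conn.execute(query).fetchall()
conn.close()

for row in merged:
    print(row)
```

With separate database files, SQLite's ATTACH DATABASE statement is one way to make several files visible to a single query; the join logic stays the same.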
For cleaning, we performed the following:
Changed all of the strings in the database to lowercase to allow for future matching.
There were many duplicated songs. For example, some songs had “album version” and “explicit version.” Since their lyrics were essentially the same, we had to remove one of them.
Some songs did not have complete information. For example, some songs lacked a “year” or “lyric” feature; these entries were removed. Because the dataset has a million songs and we only need a subset of them to train our model, obtaining enough songs with complete information will not be an issue.
Removed songs with non-English lyrics.
Some words in the lyrics were not useful. For example, the presence of “if” or “the” does not reveal much about a song. We downloaded a list of stopwords and removed them from the lyrics database.
We also found that many genre tags were essentially the same. For example, “hip-hop” and “hip hop” are the same, but they were stored as two different tags in the database. Tags of this nature were merged.
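Several of the cleaning steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the records, the tiny stopword set, the rule that a missing year is None or 0, and the regular expressions for version suffixes and tag spelling are all hypothetical, and language filtering is not covered here.

```python
import re

# Hypothetical records; the real data comes from the merged databases.
songs = [
    {"track_id": "TR0001", "title": "Night Drive (Album Version)", "artist": "Artist X", "year": 1999},
    {"track_id": "TR0002", "title": "Night Drive (Explicit Version)", "artist": "Artist X", "year": 1999},
    {"track_id": "TR0003", "title": "Open Road", "artist": "Artist Y", "year": None},
]
lyrics = [("TR0001", "The", 10), ("TR0001", "love", 4), ("TR0001", "if", 2)]
tags = ["Hip-Hop", "hip hop", "Blues"]

STOPWORDS = {"the", "if", "a", "and"}  # tiny illustrative list, not the one we downloaded


def base_title(title):
    """Lowercase a title and strip a trailing '(... version)' suffix."""
    return re.sub(r"\s*\([^)]*version\)\s*$", "", title.lower())


def clean_songs(songs):
    """Lowercase strings, drop songs without a year, keep one copy per (artist, base title)."""
    seen = set()
    kept = []
    for song in songs:
        if not song["year"]:  # assumption: a missing year is stored as None or 0
            continue
        key = (song["artist"].lower(), base_title(song["title"]))
        if key in seen:
            continue
        seen.add(key)
        kept.append({**song, "title": song["title"].lower(), "artist": song["artist"].lower()})
    return kept


def remove_stopwords(lyrics):
    """Lowercase each word and drop rows whose word is a stopword."""
    return [(tid, word.lower(), count) for tid, word, count in lyrics
            if word.lower() not in STOPWORDS]


def normalize_tag(tag):
    """Lowercase a tag and treat hyphens, underscores, and spaces as one separator."""
    return re.sub(r"[\s\-_]+", " ", tag.lower()).strip()


cleaned_songs = clean_songs(songs)
cleaned_lyrics = remove_stopwords(lyrics)
merged_tags = {normalize_tag(tag) for tag in tags}
print(cleaned_songs, cleaned_lyrics, merged_tags)
```

Deduplicating on artist plus base title is just one possible rule; it keeps the first version encountered and would miss duplicates whose titles differ in other ways.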