Exploratory Statistics
- Genre Oracle
- Oct 23, 2018
Our cleaned dataset is separated into two parts: tags and lyrics.
1. Dataset introduction
a. tag dataset

The tag dataset has 21167 rows and 9 columns.
The most important columns are:
‘Track_id’: string; key ID that connects to the “lyrics” dataset. Each song has exactly one track_id.
‘genre_merged’: string; cleaned tag. We obtained 14 genres after our initial cleaning.
‘Year’: int; the song’s year.
For example: The rap song “#1 fan” by Frankie J was published in 2006 and has a track_id “TRYASIL128F147089D.”
b. lyrics dataset

The lyrics dataset has 12833341 rows and 3 columns.
The meaning of each column:
‘Track_id’: string; key ID that connects to the “tag” dataset. Each song has exactly one track_id.
‘Word’: string; a word in the lyrics of a song. One track_id has many rows, since a song’s lyrics usually contain many different words.
‘Count’: int; the number of times a ‘word’ appears in the song (‘track_id’).
For example, in the lyrics of the song whose track_id is ‘TRAAAAV128F421A322’, ‘like’ is one of the most frequent words, as are ‘which’ and ‘poor’ (each appearing twice).
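To make the relationship between the two tables concrete, here is a minimal pandas sketch of how they could be joined on the track ID. The lowercase column names (`track_id`, `word`, `count`, `genre_merged`, `year`) and all rows are toy assumptions for illustration, not values from our data.

```python
import pandas as pd

# Toy stand-ins for the two tables; every ID and value here is made up.
tags = pd.DataFrame({
    "track_id": ["TOY001", "TOY002"],
    "genre_merged": ["rock", "pop"],
    "year": [1999, 2006],
})
lyrics = pd.DataFrame({
    "track_id": ["TOY001", "TOY001", "TOY002", "TOY002"],
    "word": ["road", "like", "love", "like"],
    "count": [1, 2, 3, 2],
})

# Each lyrics row (one word of one song) picks up the genre and year of its song.
# validate="many_to_one" raises if a track_id appears more than once in `tags`.
merged = lyrics.merge(tags, on="track_id", how="inner", validate="many_to_one")
print(merged)
```

The `validate` argument is a cheap way to enforce the "one song, one track_id" rule at join time.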
2. Explore the dataset
a. tag dataset
Across the 21167 songs, the genres are unevenly distributed:

In later data analysis, we will take this imbalance into consideration by sampling an equal number of songs from each genre. In addition to the overall genre distribution, we looked at the prevalence of each genre by year:

This is an interesting finding, and we will take it into consideration in our future analysis.
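The equal-size sampling and the genre-by-year breakdown could be sketched as follows. This is only one possible approach: it downsamples every genre to the size of the smallest one, and the column names and toy rows are assumptions.

```python
import pandas as pd

# Toy tag table with an uneven genre distribution (values are made up).
tags = pd.DataFrame({
    "track_id": [f"T{i:03d}" for i in range(10)],
    "genre_merged": ["rock"] * 5 + ["pop"] * 3 + ["jazz"] * 2,
    "year": [1990, 1990, 2000, 2000, 2010, 2000, 2010, 2010, 1990, 2000],
})

# Overall genre distribution.
genre_counts = tags["genre_merged"].value_counts()

# Downsample every genre to the size of the smallest one.
n_per_genre = int(genre_counts.min())
balanced = tags.groupby("genre_merged").sample(n=n_per_genre, random_state=0)

# Share of each genre within each year (each row sums to 1).
genre_share_by_year = pd.crosstab(
    tags["year"], tags["genre_merged"], normalize="index"
)
print(genre_counts, balanced, genre_share_by_year, sep="\n\n")
```

Downsampling discards songs from the larger genres; weighting classes in the model would be an alternative that keeps all the data.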
b. lyrics dataset
The lyrics dataset has 12833341 rows covering the lyrics of 237648 songs, which works out to about 54 words per song (excluding stopwords).
All of the songs in the tag dataset can be connected to the lyrics dataset.
In summary, our analysis is based on 21167 songs in 14 genres, with about 54 words of lyrics per song.
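Both checks above are simple to reproduce. The sketch below verifies on toy data that every tagged song has lyrics and computes the average number of rows (distinct words) per song. Column names and rows are assumptions; only the two totals in the last step come from the figures quoted above.

```python
import pandas as pd

# Toy tables: the lyrics table contains one song (D) that has no tag,
# mirroring the fact that the lyrics dataset covers more songs than the tag dataset.
tags = pd.DataFrame({"track_id": ["A", "B", "C"]})
lyrics = pd.DataFrame({
    "track_id": ["A", "A", "B", "B", "B", "C", "D", "D"],
    "word": ["love", "go", "like", "know", "love", "blue", "road", "like"],
    "count": [3, 1, 2, 2, 1, 4, 1, 1],
})

# Fraction of tagged songs that can be found in the lyrics table.
coverage = tags["track_id"].isin(lyrics["track_id"]).mean()

# Average number of rows per song; each row is one distinct word of one song.
rows_per_song = len(lyrics) / lyrics["track_id"].nunique()

# The same ratio using the totals reported for the full dataset.
reported_avg = 12833341 / 237648

print(coverage, rows_per_song, round(reported_avg, 1))
```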
The average number of words per song for each tag:

Since rap & hiphop songs contain far more words than jazz or blues songs, we may use the number of words in each song as a feature to help us classify genre in the future.
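A per-genre average like this can be computed by first aggregating the lyrics per song and then averaging per genre. The sketch below reports both the number of distinct words and the total word occurrences per song, since either could serve as the feature. Column names, genre labels, and rows are toy assumptions.

```python
import pandas as pd

# Toy data: two "hiphop" songs and one "jazz" song (values are made up).
tags = pd.DataFrame({
    "track_id": ["H1", "H2", "J1"],
    "genre_merged": ["hiphop", "hiphop", "jazz"],
})
lyrics = pd.DataFrame({
    "track_id": ["H1", "H1", "H1", "H1", "H2", "H2", "J1", "J1"],
    "word": ["like", "know", "go", "love", "like", "go", "blue", "moon"],
    "count": [3, 2, 2, 1, 4, 2, 1, 1],
})

# Step 1: one row per song with its distinct-word count and total word count.
per_song = (
    lyrics.groupby("track_id")
    .agg(n_distinct_words=("word", "nunique"), n_total_words=("count", "sum"))
    .reset_index()
    .merge(tags, on="track_id", how="inner")
)

# Step 2: average those per-song numbers within each genre.
per_genre = per_song.groupby("genre_merged")[
    ["n_distinct_words", "n_total_words"]
].mean()
print(per_genre)
```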
Upon closer inspection, we found that the most frequently used words in each genre overlapped:

The words ‘like’, ‘know’, ‘love’, and ‘go’ are among the most frequent in ‘hiphop’, ‘rock’, and ‘pop’. This creates a problem: we cannot classify genre by the top words alone, since they are very similar across genres.
Therefore, in later models, we will use more than 100 words per genre in order to maximize the differences between genres.
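One way to quantify this overlap is to take the top k words per genre and compare the resulting sets. The sketch below uses a hypothetical helper `top_words` and Jaccard similarity as the overlap measure; both are our illustration choices, and the data is made up.

```python
import pandas as pd

# Toy word counts already joined with genre (values are made up).
merged = pd.DataFrame({
    "genre_merged": ["rock"] * 3 + ["pop"] * 3 + ["jazz"] * 3,
    "word": ["love", "like", "road", "love", "like", "dance", "blue", "love", "moon"],
    "count": [5, 4, 1, 6, 3, 2, 4, 2, 1],
})

def top_words(df: pd.DataFrame, k: int) -> dict:
    """Return {genre: set of its k most frequent words} (hypothetical helper)."""
    totals = df.groupby(["genre_merged", "word"], as_index=False)["count"].sum()
    # Sort by count within each genre; break ties alphabetically for determinism.
    totals = totals.sort_values(
        ["genre_merged", "count", "word"], ascending=[True, False, True]
    )
    top = totals.groupby("genre_merged").head(k)
    return {genre: set(group["word"]) for genre, group in top.groupby("genre_merged")}

def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b)

top2 = top_words(merged, k=2)
shared_by_all = set.intersection(*top2.values())
rock_pop_overlap = jaccard(top2["rock"], top2["pop"])
print(top2, shared_by_all, rock_pop_overlap)
```

Rerunning the comparison with a larger k shows whether the sets start to diverge, which is the motivation for using more words per genre.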
Furthermore, we will consider other features such as artist, year, beats per minute, average word length, average number of words, and the average number of times a word repeats in a song.
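The lyrics-based features in this list can be derived directly from the word-count table. The sketch below is one possible definition: average word length is weighted by how often each word occurs, and repetition is total occurrences divided by distinct words. Column names and rows are toy assumptions; artist and beats per minute would have to come from another source.

```python
import pandas as pd

# Toy lyrics table (values are made up).
lyrics = pd.DataFrame({
    "track_id": ["A", "A", "B", "B"],
    "word": ["love", "go", "like", "know"],
    "count": [3, 1, 2, 2],
})

# Characters contributed by each row: word length times number of occurrences.
with_chars = lyrics.assign(char_total=lyrics["word"].str.len() * lyrics["count"])

features = with_chars.groupby("track_id").agg(
    n_distinct_words=("word", "nunique"),
    n_total_words=("count", "sum"),
    char_total=("char_total", "sum"),
)
# Occurrence-weighted average word length per song.
features["avg_word_length"] = features["char_total"] / features["n_total_words"]
# Average number of times each distinct word appears in the song.
features["avg_repeats_per_word"] = (
    features["n_total_words"] / features["n_distinct_words"]
)
features = features.drop(columns="char_total")
print(features)
```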