Baseline Model: Naive Bayes
- Genre Oracle
- Nov 2, 2018
- 2 min read
We constructed our own Naive Bayes model as follows:
- Take the lyrics of about 350 songs for each genre for training.
- Note that some songs have multiple genres.
- Remove stop words.
- For each word wj in genre Gi, calculate p(wj|Gi), the probability of seeing the word wj given the genre Gi.
- For each genre, calculate p(Gi), the probability of the genre Gi in our training set (a training sketch in Python follows this list).
- At prediction time, given a song with lyrics W = {w1, w2, …}, calculate p(Gi|W) for each genre Gi and return the three genres with the highest p(Gi|W).
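A rough sketch of the training steps above in Python; the stop-word list, whitespace tokenizer, and Laplace smoothing are our assumptions here, not details given in the post:

```python
from collections import Counter, defaultdict
import math

# Hypothetical stop-word list; the post does not say which list was used.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}

def tokenize(lyrics):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in lyrics.lower().split() if w not in STOP_WORDS]

def train(songs):
    """songs: list of (lyrics, genres) pairs; a song may carry several genres.

    Returns log p(w|G) per genre (Laplace-smoothed -- an assumption, the post
    does not say how unseen words were handled), log p(G), and the vocabulary.
    """
    word_counts = defaultdict(Counter)   # genre -> word -> count
    genre_counts = Counter()             # genre -> number of songs carrying it
    vocab = set()

    for lyrics, genres in songs:
        words = tokenize(lyrics)
        vocab.update(words)
        for g in genres:                 # a multi-genre song counts toward each genre
            genre_counts[g] += 1
            word_counts[g].update(words)

    total_labels = sum(genre_counts.values())   # normalize over (song, genre) pairs
    log_priors = {g: math.log(n / total_labels) for g, n in genre_counts.items()}

    log_word_probs = {}
    for g, counts in word_counts.items():
        total = sum(counts.values())
        log_word_probs[g] = {w: math.log((counts[w] + 1) / (total + len(vocab)))
                             for w in vocab}
    return log_word_probs, log_priors, vocab
```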
We used the classic Bayesian classification rule for prediction. Given a song with lyrics W = {w1, w2, …}, we look for the three genres G with the largest p(G|W); assuming the words in the lyrics are independent, ranking genres by p(G|W) is equivalent to ranking them by Σi log p(wi|G) + log p(G).
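A matching prediction sketch, reusing tokenize and the model returned by train above; skipping out-of-vocabulary words at prediction time is our assumption:

```python
def predict_top3(lyrics, log_word_probs, log_priors, vocab):
    """Score each genre G by sum_i log p(wi|G) + log p(G); return the 3 best."""
    words = [w for w in tokenize(lyrics) if w in vocab]   # skip out-of-vocabulary words
    scores = {
        g: log_prior + sum(log_word_probs[g][w] for w in words)
        for g, log_prior in log_priors.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:3]

# Example usage (hypothetical data):
# model = train([("hound dog lyrics ...", ["rock", "blues"]), ...])
# predict_top3("some new lyrics ...", *model)
```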
Because many songs have several genres, it is too difficult to get every genre exactly right. To measure accuracy, we count a prediction as a success if:
- at least 2 out of 3 of our predictions are correct, or
- when the song has fewer than 3 genres, at least half of its genres appear in our predictions (1 out of 2, or 1 out of 1).
Based on this accuracy measure, we achieved an accuracy of about 55%.
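For illustration, the success rule and accuracy measure might be coded like this; this is our reading of the rule, and it reuses predict_top3 from the sketch above:

```python
def is_success(predicted_top3, true_genres):
    """Our reading of the success rule above."""
    hits = sum(1 for g in predicted_top3 if g in true_genres)
    # With 3+ true genres we need 2 of our 3 predictions right;
    # with fewer, one hit is enough (1 out of 2, or 1 out of 1).
    needed = 2 if len(true_genres) >= 3 else 1
    return hits >= needed

def accuracy(test_songs, model):
    """Fraction of test songs on which the top-3 prediction counts as a success."""
    wins = sum(is_success(predict_top3(lyrics, *model), genres)
               for lyrics, genres in test_songs)
    return wins / len(test_songs)
```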
Some genres are subsets of others; for example, classic rock is a subset of rock. Other genres are very close to each other, such as jazz and blues. Therefore, we added a layer to the original Bayes model: we first predict whether a song is “rock”, then run another prediction to decide whether it is plain rock, classic rock, metal, etc. We do the same for categories such as blues, jazz, and soul. This improved our accuracy to 65%.
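A sketch of that extra layer, reusing the functions above; the grouping shown is illustrative, not the exact hierarchy used in the post:

```python
# Hypothetical coarse-to-fine grouping; the post only names rock/classic rock/metal
# and blues/jazz/soul as examples, so the full hierarchy is an assumption.
SUBGENRE_GROUPS = {
    "rock": ["rock", "classic rock", "metal"],
    "blues": ["blues", "jazz", "soul"],
}

def predict_two_stage(lyrics, coarse_model, fine_models):
    """First pick the best coarse group, then rerun Naive Bayes with a model
    trained only on that group's subgenres (models built with train() above)."""
    coarse = predict_top3(lyrics, *coarse_model)[0]         # best coarse group
    if coarse in fine_models:
        return predict_top3(lyrics, *fine_models[coarse])   # refine within the group
    return [coarse]
```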