
Baseline Model: Naive Bayes

  • Writer: Genre Oracle
  • Nov 2, 2018
  • 2 min read

We constructed our own Naive Bayes model.

  • Take the lyrics of about 350 songs for each genre for training

    • Note that some songs have multiple genres

  • Remove stop words

  • For each word wj in genre Gi, calculate p(wj|Gi), the probability of seeing the word wj given genre Gi

  • For each genre, calculate p(Gi), which is the probability of the genre Gi in our training set

  • For prediction, we use the classic Bayesian classification rule. Given a song with lyrics W = {w1, w2, …}, we calculate p(Gi|W) for each genre Gi and return the three genres with the highest posterior. Assuming the words in the lyrics are independent given the genre, log p(Gi|W) is, up to a constant, Σj log p(wj|Gi) + log p(Gi).

  • Because many songs have several genres, it is too difficult to get every label right. We count a prediction as a success if

    • at least 2 of our 3 predictions are correct, or

    • the song has fewer than 3 genres and we get more than half of them right (1 out of 2, or 1 out of 1)

  • Based on this accuracy measure, we achieved an accuracy of about 55%

  • Some genres are subsets of others; for example, classic rock is a subset of rock. Other genres are very close to each other, such as jazz and blues. We therefore added a layer on top of the original Bayes model: we first predict whether a song is “rock”, then run a second prediction to decide whether it is plain rock, classic rock, metal, etc. We do the same for categories such as blues, jazz, and soul. This improved our accuracy to 65%
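The training and prediction steps above can be sketched in Python. This is a minimal illustration, not our actual code: the stop-word list is a hypothetical stand-in, and we add Laplace (add-one) smoothing so unseen word counts don’t zero out a genre’s score.

```python
import math
from collections import Counter, defaultdict

# Hypothetical stop-word list; the real model used a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "to", "of", "i", "you"}

def tokenize(lyrics):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in lyrics.lower().split() if w not in STOP_WORDS]

def train(songs):
    """songs: list of (lyrics, [genres]) pairs. A song with several genres
    counts once toward each of them. Returns log p(G) and log p(w|G),
    the latter with add-one (Laplace) smoothing."""
    genre_counts = Counter()
    word_counts = defaultdict(Counter)
    for lyrics, genres in songs:
        words = tokenize(lyrics)
        for g in genres:
            genre_counts[g] += 1
            word_counts[g].update(words)
    vocab = {w for counts in word_counts.values() for w in counts}
    total = sum(genre_counts.values())
    log_prior = {g: math.log(n / total) for g, n in genre_counts.items()}
    log_likelihood = {}
    for g, counts in word_counts.items():
        denom = sum(counts.values()) + len(vocab)
        log_likelihood[g] = {w: math.log((counts[w] + 1) / denom) for w in vocab}
    return log_prior, log_likelihood, vocab

def predict_top3(lyrics, log_prior, log_likelihood, vocab):
    """Score log p(G) + sum_j log p(wj|G) over known words; return the
    top three genres by posterior score."""
    words = [w for w in tokenize(lyrics) if w in vocab]
    scores = {g: log_prior[g] + sum(log_likelihood[g][w] for w in words)
              for g in log_prior}
    return sorted(scores, key=scores.get, reverse=True)[:3]
```

With fewer than three genres in the training set, `predict_top3` simply returns all of them in order of score.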
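The success criterion above can be written as a small helper. This is a sketch; the function name is ours, not from our codebase.

```python
def is_success(predicted, actual):
    """Success rule: with 3 or more true genres, at least 2 of the 3
    predictions must be correct; with fewer, more than half of the true
    genres must be hit (1 out of 2, or 1 out of 1)."""
    hits = len(set(predicted) & set(actual))
    if len(actual) >= 3:
        return hits >= 2
    return hits >= 1
```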
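The two-layer idea can be sketched as a coarse-then-fine pipeline. Here `coarse_model` and `fine_models` are hypothetical callables standing in for separately trained Naive Bayes classifiers; the names and interface are illustrative assumptions, not our actual implementation.

```python
def predict_hierarchical(lyrics, coarse_model, fine_models, top_k=3):
    """Two-layer prediction sketch: a coarse classifier picks broad
    families (e.g. "rock"), then a per-family classifier refines the
    label (e.g. classic rock vs metal). fine_models maps a family name
    to its refining classifier; families without one pass through."""
    results = []
    for family in coarse_model(lyrics):
        refine = fine_models.get(family)
        if refine is None:
            results.append(family)        # no sub-genres to distinguish
        else:
            results.extend(refine(lyrics))  # e.g. ["classic rock", "metal"]
    return results[:top_k]
```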

 
 
 

©2018 by Love and Hate. Proudly created with Wix.com
