I have millions of songs, and each song has a unique Song ID. For each Song ID I have attributes such as song name, artist name, album name, year, etc.
Now, I have implemented a mechanism to compute a similarity ratio between two songs. It gives me a value between 0 and 100.
So, I need to show similar music to users, which cannot be done at run time. I need to precompute the similarity values between each and every pair of songs.
Hence, if I create a table (say, song_similarity) with three attributes,
song1, song2, similarity
I will have n*n records, where n is the number of songs.
And whenever I want to fetch the similar music, I need to execute this query:
SELECT song2 FROM song_similarity WHERE song1 = x AND similarity > 80 ORDER BY similarity DESC;
Please suggest something to maintain such information.
Thanks.
I think you'd be better off comparing similarity to a "prototypical" song or classification. Devise a fingerprint mechanism that includes metadata about the song and whatever audio mechanism you use to judge similarity. Place each song into one (or more) categories and score the song within each category: how closely does it match the prototype for the category, using the fingerprint? Note that you could have hundreds or thousands of categories, i.e., they're not the typical categories that you think of when you think of music.
Once you have this done, you can maintain indexes by category. When finding similar songs, devise a weight based on the category and the similarity measures within it, say by giving greater weight to the category in which the song is closest to the prototype. For each category, multiply the weight by the square of the difference between the candidate song's and the current song's scores against that category's prototype. Sum these weighted values over, say, the top 3 categories; lower totals mean more similar songs.
This way you only need to store a few items of metadata for each song rather than keep a relationship for every pair of songs. If the main algorithm runs too slowly, you could keep cached pair-wise data for the most common songs and fall back to the algorithmic comparison when a song isn't in your cached data set.
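The weighting described above could be sketched roughly as follows. This is only an illustration under assumptions: each song is represented by a hypothetical profile mapping category names to a distance from that category's prototype (0.0 means identical, 1.0 is assumed to be the maximum), the weight is assumed to be 1/rank, and a category missing from the candidate is assumed to count as maximally distant.

```python
from typing import Dict

# Hypothetical representation: category name -> distance to that category's
# prototype, assumed to lie in [0.0, 1.0] (0.0 = identical to the prototype).
Profile = Dict[str, float]


def dissimilarity(current: Profile, candidate: Profile, top_n: int = 3) -> float:
    """Return a score where lower values mean the candidate is more similar."""
    # Rank the current song's categories by closeness to the prototype
    # and keep only the top_n closest ones.
    ranked = sorted(current, key=lambda category: current[category])[:top_n]
    total = 0.0
    for rank, category in enumerate(ranked):
        # Assumption: a category the candidate lacks counts as maximally distant.
        candidate_distance = candidate.get(category, 1.0)
        # Assumed weighting: the closest category counts most (1, 1/2, 1/3, ...).
        weight = 1.0 / (rank + 1)
        total += weight * (candidate_distance - current[category]) ** 2
    return total


current_song = {"a": 0.1, "b": 0.2, "c": 0.5}
near_song = {"a": 0.15, "b": 0.25, "c": 0.5}
far_song = {"a": 0.3}

print(dissimilarity(current_song, near_song) < dissimilarity(current_song, far_song))
```

Only the per-song profiles need to be stored, so the storage grows with the number of songs rather than with the number of pairs.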
What you are proposing will work. However, you can roughly halve the number of rows by storing each pair only once, then modifying your query to match the song ID in either song1 or song2.
Something like:
SELECT IF(song1 = ?, song2, song1) AS similar FROM song_similarity WHERE (song1 = ? OR song2 = ?) AND similarity > 80 ORDER BY similarity DESC;
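As a rough, self-contained illustration of the store-each-pair-once idea, here is a sketch using Python's built-in sqlite3 module. The table name song_similarity, the index name, and the helper functions are assumptions made for the example. SQLite has no IF() function in older versions, so the sketch uses a CASE expression, which behaves the same way here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE song_similarity (
        song1 INTEGER NOT NULL,
        song2 INTEGER NOT NULL,
        similarity INTEGER NOT NULL,
        PRIMARY KEY (song1, song2),
        CHECK (song1 < song2)  -- each unordered pair is stored exactly once
    )
    """
)
# The primary key covers lookups by song1; this index covers lookups by song2.
conn.execute("CREATE INDEX idx_similarity_song2 ON song_similarity (song2)")


def store(a: int, b: int, similarity: int) -> None:
    """Store a pair in canonical order (smaller ID first)."""
    low, high = sorted((a, b))
    conn.execute(
        "INSERT OR REPLACE INTO song_similarity VALUES (?, ?, ?)",
        (low, high, similarity),
    )


def similar_songs(song_id: int, threshold: int = 80) -> list:
    """Return IDs of songs whose similarity to song_id exceeds threshold."""
    rows = conn.execute(
        """
        SELECT CASE WHEN song1 = ? THEN song2 ELSE song1 END AS other
        FROM song_similarity
        WHERE (song1 = ? OR song2 = ?) AND similarity > ?
        ORDER BY similarity DESC
        """,
        (song_id, song_id, song_id, threshold),
    ).fetchall()
    return [row[0] for row in rows]


store(2, 1, 95)
store(1, 3, 85)
store(3, 2, 60)
print(similar_songs(1))
```

Normalising the order at insert time means the query never sees both (a, b) and (b, a), so no de-duplication is needed when reading.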
It seems this would require massive computing power to maintain and access the similarity information. For example, if you have already processed 2000 songs, you still need to run the similarity analysis 2000 times for each new song. This may cause scalability problems, and the data schema can make the database slow within a short period of time.
I recommend that you find some patterns and tag each song. For example, you can analyze the songs for patterns such as "blues", "rock" or "90's" and give them tags. If you want to find songs similar to a given song, you can just query by all the tags that song has, e.g. "New age", "Slow" and "techno".
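A tag lookup of this kind might look like the following sketch. The tag data is made up for illustration, and ranking by the number of shared tags is an assumption; the answer above does not prescribe a particular ranking.

```python
from collections import Counter, defaultdict

# Hypothetical tag data, e.g. produced by an offline analysis step.
song_tags = {
    1: {"new age", "slow", "techno"},
    2: {"new age", "slow"},
    3: {"techno", "90s"},
    4: {"blues"},
}

# Inverted index: tag -> set of song IDs carrying that tag.
tag_index = defaultdict(set)
for song_id, tags in song_tags.items():
    for tag in tags:
        tag_index[tag].add(song_id)


def similar_by_tags(song_id, limit=10):
    """Return up to `limit` other songs, most shared tags first."""
    shared = Counter()
    for tag in song_tags[song_id]:
        for other in tag_index[tag]:
            if other != song_id:
                shared[other] += 1
    # Sort by shared-tag count (descending), then by ID for a stable order.
    ranked = sorted(shared.items(), key=lambda item: (-item[1], item[0]))
    return [other for other, _ in ranked[:limit]]


print(similar_by_tags(1))
```

Adding a new song only requires tagging that one song and updating the index, rather than comparing it against every existing song.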