Music Recommendation System

Situation

Recommendation systems are never perfect and leave plenty of room for improvement, since their quality depends on many features. They benefit both users and providers: users want songs recommended that match their taste, and accurate recommendations keep users engaged on the provider's platform. A hybrid recommendation system combines multiple techniques, such as content-based filtering and K-means clustering with sentiment analysis, to deliver more comprehensive and personalized recommendations.

Task

I combined two datasets: one built by running sentiment analysis on recent Reddit data, and another collected from song databases such as Spotify. I mined the text data using various NLP pre-processing techniques and ran it through a BERT-based sentiment model. The result was then merged with the Spotify data, which was further pre-processed through cleaning, scaling, feature extraction, and feature engineering before being passed to various unsupervised learning models.

Action

Data processing

I started by assessing the various data sources, including streaming-platform APIs (Spotify, Last.fm) and social media (Reddit). After understanding the structure and quirks of the data, I built a data model that could accommodate all the relevant fields and ensure the integrity and accuracy of the data.

To build the analytics pipeline, I used Python for its strong data-handling ecosystem. Here's a snippet that pulls top artists and their top tracks from the Last.fm API, fetches each track's audio features from the Spotify API, and loads everything into a data frame for further processing:

import json
import time

import pandas as pd
import requests

# pull the top artists from the Last.fm charts endpoint
url = f'http://ws.audioscrobbler.com/2.0/?method=chart.gettopartists&api_key={last_fm_api_key}&format=json&limit=1000'
response = requests.get(url)
top_artists = response.json()['artists']['artist']
artists = [artist['name'] for artist in top_artists]
artists = artists[:500]

# function to get top tracks
def get_top_tracks(artist, api_key):
    url = f'http://ws.audioscrobbler.com/2.0/?method=artist.gettoptracks&artist={artist}&api_key={api_key}&format=json'
    response = requests.get(url)
    data = json.loads(response.text)
    top_tracks = data['toptracks']['track']
    tracks = []
    for track in top_tracks:
        track_dict = {}
        track_dict['name'] = track['name']
        track_dict['playcount'] = track['playcount']
        track_dict['listeners'] = track['listeners']
        track_dict['artist'] = track['artist']['name']
        tracks.append(track_dict)
    return tracks

# add tracks
def get_data(artists):
    top_tracks = []
    for artist in artists:
        tracks = get_top_tracks(artist, last_fm_api_key)
        top_tracks += tracks

    # get Spotify audio features for each track
    def get_audio_features(track_id, spotify):
        try:
            features = spotify.audio_features(track_id)[0]
        except requests.exceptions.ReadTimeout:
            # wait a second and retry on timeout
            time.sleep(1)
            return get_audio_features(track_id, spotify)
        if features is not None:
            features_dict = {}
            features_dict['danceability'] = features['danceability']
            features_dict['energy'] = features['energy']
            features_dict['key'] = features['key']
            features_dict['loudness'] = features['loudness']
            features_dict['mode'] = features['mode']
            features_dict['speechiness'] = features['speechiness']
            features_dict['acousticness'] = features['acousticness']
            features_dict['instrumentalness'] = features['instrumentalness']
            features_dict['liveness'] = features['liveness']
            features_dict['valence'] = features['valence']
            features_dict['tempo'] = features['tempo']
            features_dict['duration_ms'] = features['duration_ms']
            features_dict['time_signature'] = features['time_signature']
            return features_dict
        else:
            # empty dict keeps the later track.update(features) call valid
            return {}

    for track in top_tracks:
        results = spotify.search(q=f"{track['name']} {track['artist']}", type='track', limit=1)
        if results['tracks']['items']:
            track_id = results['tracks']['items'][0]['id']
            features = get_audio_features(track_id, spotify)
            track.update(features)

    data = []
    for artist in artists:
        artist_tracks = [track for track in top_tracks if track['artist'] == artist]
        # missing fields default to NaN via track.get() below
        for track in artist_tracks:
            track_data = {
                'artist': artist,
                'album_name': track.get('album_name', float('nan')),
                'track_name': track['name'],
                'popularity': track.get('popularity', float('nan')),
                'acousticness': track.get('acousticness', float('nan')),
                'danceability': track.get('danceability', float('nan')),
                'energy': track.get('energy', float('nan')),
                'instrumentalness': track.get('instrumentalness', float('nan')),
                'key': track.get('key', float('nan')),
                'loudness': track.get('loudness', float('nan')),
                'mode': track.get('mode', float('nan')),
                'speechiness': track.get('speechiness', float('nan')),
                'liveness': track.get('liveness', float('nan')),
                'valence': track.get('valence', float('nan')),
                'tempo': track.get('tempo', float('nan')),
                'duration_ms': track.get('duration_ms', float('nan')),
                'time_signature': track.get('time_signature', float('nan')),
                'track_genre': track.get('track_genre', float('nan'))
            }
            data.append(track_data)
    return data

# final dataframe with all tracks 
df = pd.DataFrame(get_data(artists))
# df.to_csv('dataset.csv', index=False)
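The cleaning and scaling step described in the Task section can be sketched as below. This is a minimal illustration on toy data: the median imputation and StandardScaler choices are assumptions standing in for whatever preprocessing was actually applied, not the exact pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

def preprocess_audio_features(df, feature_cols):
    """Clean and scale audio features before clustering (illustrative sketch)."""
    out = df.copy()
    # drop tracks where every audio feature is missing
    out = out.dropna(subset=feature_cols, how='all').reset_index(drop=True)
    # impute remaining gaps with the column median
    out[feature_cols] = out[feature_cols].fillna(out[feature_cols].median())
    # standardize so distance-based models like K-means weight features equally
    out[feature_cols] = StandardScaler().fit_transform(out[feature_cols])
    return out

# toy frame with one missing tempo value
demo = pd.DataFrame({'tempo': [120.0, 90.0, np.nan], 'energy': [0.8, 0.4, 0.6]})
clean = preprocess_audio_features(demo, ['tempo', 'energy'])
```

Standardizing matters here because K-means uses Euclidean distance: without it, wide-range features such as tempo would dominate narrow-range ones such as energy.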

Sentiment analysis:

Once the dataset was created, I combined it with Reddit user comments about the tracks in the data frame. I analyzed each track's sentiment using the following code snippet. The resulting sentiment column was passed to K-means alongside the audio features, and content-based filtering was then performed on the clusters.

# pre-processing user reviews (text data): strip emoji before sentiment analysis
def remove_em(text):
    dem = demoji.findall(text)
    for item in dem.keys():
        text = text.replace(item, '')
    return text

for index, row in df_cols.iterrows():
    try:
        # Access the values of col1 and col2 for the current row
        col1_val = row['track_name']
        col2_val = row['artists']
        txt = []
        search_query = col1_val + ' ' + col2_val
        limit = 60

        search_results = reddit.subreddit('all').search(search_query, limit=limit)

        num_comments = 0
        for post in search_results:
            post.comments.replace_more(limit=None)
            for comment in post.comments.list():
                txt.append(comment.body)
                num_comments += 1

                # Stop iterating through comments once you have 5 comments
                if num_comments == 5:
                    break

            # Stop iterating through posts once you have 5 comments
            if num_comments == 5:
                break
        # clean the comments, run the sentiment model, keep the majority label
        for emo in range(len(txt)):
            txt[emo] = remove_em(txt[emo])
        result = pd.DataFrame(specific_model(txt)).label.value_counts().idxmax()
        dfs.loc[index, 'sentiment'] = result
    except Exception:
        # default to neutral when no comments are found or the model fails
        dfs.loc[index, 'sentiment'] = 'NEU'
    
# print(result)
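The `specific_model` call above is assumed to be a transformer-based sentiment classifier that returns one `{'label': ..., 'score': ...}` dict per comment. The majority-vote step it feeds can be isolated as a small helper; `majority_label` is a hypothetical name introduced here for illustration:

```python
import pandas as pd

def majority_label(predictions, default='NEU'):
    """Return the most frequent sentiment label, falling back to neutral."""
    if not predictions:
        return default
    labels = pd.Series([p['label'] for p in predictions])
    return labels.value_counts().idxmax()

# example: three per-comment predictions for one track
preds = [{'label': 'POS', 'score': 0.9},
         {'label': 'POS', 'score': 0.7},
         {'label': 'NEG', 'score': 0.6}]
```

Falling back to a neutral label when no comments are found keeps the sentiment column fully populated, so the downstream clustering never sees missing values.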

Result

The final model performs content-based filtering on clusters produced by K-means. The algorithm was selected using cluster-validity techniques: the PyCaret module was used to run several clustering algorithms and compare their results. The silhouette score measures how similar an object is to its own cluster compared to other clusters, and is a standard way to evaluate clustering quality. K-means achieved the highest silhouette score in this case, meaning its clusters were the best defined among the algorithms tried. Once the clustering was done, I computed the cosine similarity between the user's input song and the other songs in its cluster to find and recommend the most similar tracks.
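That clustering-plus-similarity flow can be sketched as follows. The feature matrix here is synthetic, the cluster count is fixed at 2 for the toy data, and `recommend` is a hypothetical helper illustrating the within-cluster cosine-similarity ranking; the real system worked on the scaled audio-feature data frame and used PyCaret for model selection.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
# toy stand-in for the scaled audio-feature matrix (rows = tracks)
X = np.vstack([rng.normal(0, 0.1, (20, 4)), rng.normal(3, 0.1, (20, 4))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
# silhouette score near 1 means compact, well-separated clusters
score = silhouette_score(X, km.labels_)

def recommend(seed_idx, X, labels, k=3):
    """Rank the tracks in the seed's cluster by cosine similarity to the seed."""
    same = np.where(labels == labels[seed_idx])[0]
    same = same[same != seed_idx]          # don't recommend the input song
    sims = cosine_similarity(X[seed_idx:seed_idx + 1], X[same])[0]
    return same[np.argsort(sims)[::-1][:k]]

recs = recommend(0, X, km.labels_)
```

Restricting the cosine-similarity search to the seed song's cluster keeps recommendations cheap (similarity is computed against one cluster, not the whole catalog) while the clustering step guarantees the candidates are already broadly alike.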

Quantifiable results include: