Link to the live webpage: https://sjesusa1.github.io/
On January 4, 1936, Billboard magazine published its first list of the most popular songs. Today its charts rank songs by sales, radio airplay, digital downloads, and online streaming, including YouTube. Each week, Billboard charts the popularity of the top 100 songs in the U.S. Similarly, Spotify, one of today's leading music streaming apps, compiles a yearly list of top songs based on the number of times each song was streamed by its users.
With the pressure to make it into the top hits, record companies try to create music that might appeal to a large audience; however, it is difficult to understand which aspects of a song make it a top hit.
For this project, I will be providing a data science analysis on data sets of top songs collected by Billboard and Spotify from various years to find different correlates of successful songs in order to determine what key qualities a song needs to become a hit.
This project will help explain how audio features play a role in the popularity of a song. The data set presented here can be used to find the correlation between audio features and song popularity by plotting each audio feature against song popularity and using data visualization.
Python libraries:
1. Data Analysis Libraries: numpy, pandas
2. Visualization Libraries: matplotlib.pyplot, seaborn
#data analysis libraries
import numpy as np
import pandas as pd
#visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
I found that the data I needed was already available on Kaggle in the dataset "Top Spotify songs from 2010-2019 - BY YEAR" by Leonardo Henrique. I checked the data and found that it was already tidy and had no validation errors. The dataset was extracted from Spotify and Billboard and organized by Spotify Organize Your Music: http://organizeyourmusic.playlistmachinery.com/. The data set included my target variable, the popularity score of each song, and explanatory variables, the audio features, which include beats per minute (BPM), danceability, loudness, energy, valence, and duration. It also included general characteristics, such as the genre, the title, the artists, and the year the song was released.
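One way to quantify the relationship between the explanatory variables and the target variable is a correlation table. The sketch below is only an illustration, not the analysis carried out in this project: the `sample` DataFrame and all of its values are made up, and its column names simply mirror the ones used for the Spotify data.

```python
import pandas as pd

# Hypothetical rows standing in for the Spotify data (all values are made up)
sample = pd.DataFrame({
    'BPM':        [120, 95, 128, 100, 140, 110],
    'Energy':     [80, 55, 90, 60, 85, 70],
    'Dance':      [70, 60, 75, 50, 65, 72],
    'Popularity': [85, 60, 88, 55, 80, 75],
})

# Pearson correlation of every audio feature with the target variable
feature_corr = (
    sample.corr()['Popularity']
          .drop('Popularity')
          .sort_values(ascending=False)
)
print(feature_corr)
```

A value near 1 or -1 would indicate a strong linear relationship with popularity, while a value near 0 would indicate little or none.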
from IPython.display import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))
# Show a ludicrous number of rows and columns
pd.options.display.max_rows = 500
pd.options.display.max_columns = 500
pd.options.display.width = 1000
#Artist dataset
artist_df = pd.read_csv("../data/artistDf.csv")
#Dropping unnecessary columns
artist_df = artist_df.drop(artist_df.columns[[0,4,5]], axis = 1)
artist_df.head()
#Top Spotify Songs 2010-2019
spotify_df = pd.read_csv("../data/top10s.csv")
#Dropping first column
spotify_df = spotify_df.drop(spotify_df.columns[0], axis = 1)
#Renaming columns
spotify_df.columns = ['Title', 'Artist', 'Top Genre', 'Year', 'BPM', 'Energy', 'Dance', 'dB', 'Liveness', 'Valence', 'Duration', 'Acoustic', 'Speech', 'Popularity']
spotify_df.head()
#Top 100 Billboard 2019 Songs
bb100 = pd.read_csv("../data/track_analyze.csv")
bb100.head()
import datetime
bbHot = pd.read_csv("../data/billboardHot100_1999-2019.csv")
#Drop unnecessary columns
bbHot = bbHot.drop(bbHot.columns[[0, 4, 5, 9, 11]], axis = 1)
#Change week and date objects to datetime objects
week = pd.to_datetime(bbHot["Week"])
song_release = pd.to_datetime(bbHot["Date"])
bbHot["Week"] = week
bbHot["Date"] = song_release
bbHot["Year"] = pd.DatetimeIndex(bbHot["Date"]).year
#Filter songs from 2010-2019
bbHot = bbHot.loc[bbHot['Year'] >= 2010.0]
#Tidying dataset
bbHot.columns = ['Artist', 'Track', 'Weekly Rank', 'Week', 'Date', 'Genre', 'Lyrics', 'Year']
bbHot = bbHot.sort_values(ascending = True, by = ['Artist', 'Track', 'Weekly Rank', 'Week', 'Date', 'Genre','Lyrics', 'Year'])
bbHot.head()
The first numerical variable I will look at is beats per minute (BPM), which is the tempo, or pace, of the song over its duration. For example, 60 BPM means one beat per second, while 120 BPM is twice as fast, with two beats per second.
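The conversion in the example above is plain arithmetic and can be sketched as follows. The helper names `beats_per_second` and `seconds_per_beat` are hypothetical and are not used elsewhere in this project.

```python
def beats_per_second(bpm):
    """Convert a tempo in beats per minute to beats per second."""
    return bpm / 60


def seconds_per_beat(bpm):
    """Return the time between two consecutive beats, in seconds."""
    return 60 / bpm


print(beats_per_second(60))    # 1.0
print(beats_per_second(120))   # 2.0
print(seconds_per_beat(120))   # 0.5
```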
bpm_df = spotify_df
#Creating BPM Bins
bpm_bins = [0, 50, 100, 150, 200, 250]
bin_names = ['0-50', '50-100', '100-150', '150-200', '200-250']
bpm_df['BPM Range'] = pd.cut(bpm_df['BPM'], bpm_bins, right = False, labels = bin_names)
#Plot the raw counts so the y-axis matches its "Frequency" label
bpm_df['BPM Range'].value_counts().plot.bar()
plt.title("Frequency of BPM Ranges")
plt.ylabel("Frequency")
plt.xlabel("BPM Ranges")
Based on the graph above, the majority of the popular songs have a BPM of 100-150, while the fewest fall in the 0-50 range.
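Because the bins are built with `right = False`, each range includes its left edge and excludes its right edge, so a song at exactly 100 BPM is counted in 100-150 rather than 50-100. The sketch below shows this on a few made-up tempos chosen to sit on or near the bin edges; the `tempos` series is hypothetical.

```python
import pandas as pd

bpm_bins = [0, 50, 100, 150, 200, 250]
bin_names = ['0-50', '50-100', '100-150', '150-200', '200-250']

# Hypothetical tempos chosen to sit on or near the bin edges
tempos = pd.Series([49, 50, 99, 100, 149, 150, 250])

# right=False makes each bin include its left edge and exclude its right edge
ranges = pd.cut(tempos, bpm_bins, right=False, labels=bin_names)
print(pd.DataFrame({'BPM': tempos, 'BPM Range': ranges}))
```

Note that the last value, 250, falls outside every bin and is left as a missing value, so a value equal to the final bin edge would not be counted in the bar chart.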
#Songs with a BPM of 100-150
bpm_df.loc[bpm_df['BPM Range'] == '100-150'].head()
#Songs with BPM of 0-50
bpm_df.loc[bpm_df['BPM Range'] == '0-50'].head()
The second numerical variable I will look at is energy. The higher the value, the more energetic the song is. Energetic songs are usually described with the general feeling of being fast, loud, and noisy. Musical energy can be related to the volume or intensity of the song.
energy_df = spotify_df
eng_bins = [0, 20, 40, 60, 80, 100]
bin_names = ['0-20', '20-40', '40-60', '60-80', '80-100']
energy_df['Energy Range'] = pd.cut(energy_df['Energy'], eng_bins, right = False, labels = bin_names)
energy_df['Energy Range'].value_counts().plot.bar()
plt.title("Frequency of Energy Ranges")
plt.ylabel("Frequency")
plt.xlabel("Energy Ranges")
The majority of popular songs have an energy level of 60-80, while the fewest have an energy level of 0-20.
#Songs with Energy levels of 60-80
energy_df.loc[energy_df['Energy Range'] == '60-80'].head()
#Songs with Energy levels of 0-20
energy_df.loc[energy_df['Energy Range'] == '0-20']
The third numerical variable I will look at is danceability. Danceability describes how suitable a track is for dancing based on a combination of musical elements, which include tempo, rhythm stability, and beat strength. The higher the value, the more danceable the track is.
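Danceability can be binned in the same way as the other features. The sketch below is a minimal illustration under stated assumptions: the helper `range_labels` is hypothetical (it only wraps the `pd.cut` pattern used throughout this notebook), and the danceability scores in `dance_df` are made up rather than taken from the data set.

```python
import pandas as pd


def range_labels(df, column, start, stop, step):
    """Bin df[column] into half-open ranges of width `step` and return the labels."""
    edges = list(range(start, stop + step, step))
    labels = [f'{lo}-{hi}' for lo, hi in zip(edges[:-1], edges[1:])]
    return pd.cut(df[column], edges, right=False, labels=labels)


# Hypothetical danceability scores (values are made up for illustration)
dance_df = pd.DataFrame({'Dance': [35, 58, 61, 66, 70, 72, 79, 84]})
dance_df['Dance Range'] = range_labels(dance_df, 'Dance', 0, 100, 20)

# sort=False keeps the bins in their natural order instead of ordering by count
dance_counts = dance_df['Dance Range'].value_counts(sort=False)
print(dance_counts)
```

Calling `dance_counts.plot.bar()` would then draw the same kind of frequency chart as for the other variables. As with the other bins, a score of exactly 100 would fall outside the last range.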
The fourth numerical variable I will look at is decibels (dB). The decibel is a relative unit of measurement for sound. To get an idea of the scale, a whisper is about 30 dB, a normal conversation is about 60 dB, and a motorcycle engine is about 95 dB. In this data set the dB values are negative, so values closer to 0 indicate louder songs.
db_df = spotify_df
#Sort by dB value so the bars run from quietest to loudest
db_df.dB.value_counts().sort_index().plot.bar()
plt.title("Frequency of dB")
plt.ylabel("Frequency")
plt.xlabel("Decibels (dB)")
The majority of the popular songs have a loudness of -5 dB. As the graph shows, popular songs become less frequent both below -5 dB (quieter) and above -5 dB (louder). This suggests that listeners are drawn to louder songs, but only up to a point.
#Songs with a dB of -5
db_df.loc[db_df['dB'] == -5].head()
#Songs higher than -5 dB
db_df.loc[db_df['dB'] > -5].head()
#Songs at or below -11 dB (much quieter than -5 dB)
db_df.loc[db_df['dB'] <= -11]
The fifth numerical variable I will look at is liveness. Liveness detects the presence of an audience in the recording. The higher the value, the more likely the song was performed live.
liv_df = spotify_df
liv_df.Liveness.value_counts().sort_index(ascending = False)
liv_bins = [0, 10, 20, 30, 40, 50, 60, 70, 80]
bin_names = ['0-10', '10-20', '20-30', '30-40', '40-50', '50-60', '60-70', '70-80']
liv_df['Liveness Range'] = pd.cut(liv_df['Liveness'], liv_bins, right = False, labels = bin_names)
liv_df['Liveness Range'].value_counts().plot.bar()
plt.title("Frequency of Liveness Ranges")
plt.ylabel("Frequency")
plt.xlabel("Liveness Ranges")
The majority of the popular songs have a liveness of 10-20, meaning that most of them were not recorded live. Listeners may prefer studio recordings over live recordings because studio recordings can be edited to produce sounds not found in live performances.
#Songs with a liveness range of 10-20
liv_df.loc[liv_df['Liveness Range'] == '10-20'].head()
#Songs with a liveness range of 70-80
liv_df.loc[liv_df['Liveness Range'] == '70-80']
The sixth numerical variable I will look at is valence. Higher values are associated with more positive moods, while lower values are associated with more negative moods.
val_df = spotify_df
val_bins = [0, 20, 40, 60, 80, 100]
bin_names = ['0-20', '20-40', '40-60', '60-80', '80-100']
val_df['Valence Range'] = pd.cut(val_df['Valence'], val_bins, right = False, labels = bin_names)
val_df['Valence Range'].value_counts().plot.bar()
plt.title("Frequency of Valence Ranges")
plt.ylabel("Frequency")
plt.xlabel("Valence Ranges")
The majority of popular songs have a valence of 40-60, while the fewest fall in the 0-20 range.
#Songs with valence of 40-60
val_df.loc[val_df["Valence Range"] == '40-60'].head()
#Songs with valence of 0-20
val_df.loc[val_df["Valence Range"] == '0-20'].head()
The seventh numerical variable I will look at is duration, the length of the song in seconds.
time_df = spotify_df
time_bins = [0, 100, 200, 300, 400, 500]
bin_names = ['0-100', '100-200', '200-300', '300-400', '400-500']
time_df['Duration Range'] = pd.cut(time_df['Duration'], time_bins, right = False, labels = bin_names)
time_df['Duration Range'].value_counts().plot.bar()
plt.title("Frequency of Duration Ranges")
plt.ylabel("Frequency")
plt.xlabel("Duration Ranges")
The majority of popular songs have a duration of 200-300 seconds (about 3 to 5 minutes). Far fewer fall in the ranges of 0-200 seconds (under about 3 minutes) and 300-500 seconds (about 5 to 8 minutes).
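The durations are binned in seconds, so a quick conversion shows where the bin edges fall in minutes. This is a small arithmetic sketch; the helper name `seconds_to_minutes` is hypothetical.

```python
def seconds_to_minutes(seconds):
    """Convert a duration in seconds to (whole minutes, remaining seconds)."""
    return divmod(seconds, 60)


# Bin edges used for the duration ranges discussed above
for edge in [200, 300, 500]:
    minutes, rest = seconds_to_minutes(edge)
    print(f'{edge} s = {minutes} min {rest} s')
```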
#Songs with a duration of 200-300 seconds
time_df.loc[time_df['Duration Range'] == '200-300'].head()
#Songs with a duration of 0-200 or 300-500 seconds
time_df.loc[time_df['Duration Range'].isin(['0-100', '100-200', '300-400', '400-500'])].head()
The eighth numerical variable I will look at is acousticness. The higher the value, the more acoustic the song is. Acoustic music solely or primarily uses instruments that produce sound through acoustic means, such as acoustic string instruments, as opposed to electric or electronic means. High acousticness is typical of folk music.
acoustic_df = spotify_df
acoustic_bins = [0, 10, 20, 30, 40, 50, 60, 70 , 80, 90, 100]
bin_names = ['0-10', '10-20', '20-30', '30-40', '40-50', '50-60', '60-70', '70-80', '80-90', '90-100']
acoustic_df['Acoustic Range'] = pd.cut(acoustic_df['Acoustic'], acoustic_bins, right = False, labels = bin_names)
acoustic_df['Acoustic Range'].value_counts().plot.bar()
plt.title("Frequency of Acoustic Ranges")
plt.ylabel("Frequency")
plt.xlabel("Acoustic Ranges")
The majority of popular songs have an acoustic level of 0-10, as most popular songs rely on electronic instrumentation.
#Songs that contain an acoustic level of 0-10
acoustic_df.loc[acoustic_df['Acoustic Range'] == '0-10'].head()
#Songs that contain an acoustic level of 60-70
acoustic_df.loc[acoustic_df['Acoustic Range'] == '60-70'].head()
Lastly, the final numerical variable I will look at is speech. The higher the value, the more spoken words the song contains.
speech_df = spotify_df
speech_df.Speech.value_counts().sort_values()
speech_bins = [0, 10, 20, 30, 40, 50]
bin_names = ['0-10', '10-20', '20-30', '30-40', '40-50']
speech_df['Speech Range'] = pd.cut(speech_df['Speech'], speech_bins, right = False, labels = bin_names)
speech_df['Speech Range'].value_counts().plot.bar()
plt.title("Frequency of Speech Ranges")
plt.ylabel("Frequency")
plt.xlabel("Speech Ranges")
#Songs with a speech value of 0-10
speech_df.loc[speech_df["Speech Range"] == '0-10'].head(10)
#Calculate the number of unique words in the lyrics of "Just the Way You Are"
print(len(set(bbHot.loc[bbHot['Track'] == "Just The Way You Are"].iloc[1].Lyrics.split())))
#Calculate the number of unique words in the lyrics of "Dynamite"
print(len(set(bbHot.loc[bbHot['Track'] == "Dynamite"].iloc[1].Lyrics.split())))
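#Calculate the number of unique words in the lyrics of "Marry You"
print(len(set(bbHot.loc[bbHot['Track'] == 'Marry You'].iloc[1].Lyrics.split())))
#Songs with a speech value of 40-50
speech_df.loc[speech_df["Speech Range"] == '40-50'].head()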
#Calculate the number of unique words in the lyrics of "I Luh Ya Papi"
print(len(set(bbHot.loc[bbHot['Track'] == "I Luh Ya Papi"].iloc[1].Lyrics.split())))
#Calculate the number of unique words in the lyrics of "Love Yourself"
print(len(set(bbHot.loc[bbHot['Track'] == "Love Yourself"].iloc[1].Lyrics.split())))
Generally, most popular songs have a speech value of 0-10. This suggests that hit songs tend to contain fewer spoken words.
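The unique-word counts above split the lyrics on whitespace only, so "Dance," and "dance" are counted as two different words. A possible refinement, sketched below, lowercases the text and strips punctuation before counting. The helper `unique_word_count` is hypothetical and the lyric line is made up for illustration, not taken from the data set.

```python
import re


def unique_word_count(lyrics):
    """Count distinct words, ignoring case and surrounding punctuation."""
    words = re.findall(r"[a-z0-9']+", lyrics.lower())
    return len(set(words))


# Made-up line used only to show the difference between the two counts
line = "Dance, dance all night. All night we dance!"
print(len(set(line.split())))     # count after a whitespace split only
print(unique_word_count(line))    # count after normalizing case and punctuation
```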
To better see how popular songs are distributed, let's plot the distribution of each feature and look at the general trends. As the medians and plots show, most popular songs share the following characteristics: a fast pace of about 120 BPM, high energy, high danceability, a short duration, low acousticness, and few spoken words.
#Median of each numerical column (text and range columns are skipped)
spotify_df.median(numeric_only = True)
#Distribution plot of each song characteristic
features = ['BPM', 'Energy', 'Dance', 'dB', 'Liveness', 'Valence', 'Duration', 'Acoustic', 'Speech']
#A taller figure keeps the three rows of plots readable
fig, axes = plt.subplots(3, 4, figsize = (20, 12), sharey = False)
#histplot with kde = True replaces the deprecated distplot
for ax, feature in zip(axes.ravel(), features):
    sns.histplot(spotify_df[feature], kde = True, ax = ax)
#Hide the three unused panels in the last row
for ax in axes.ravel()[len(features):]:
    ax.set_axis_off()
Next, I want to see whether the artist's gender plays a role in what makes a song popular. To do this, I merged the artists' data with the top songs from Spotify using a left join on the Artist column of Spotify's data set. This provides the gender of the artist for each song.
new_spotify = spotify_df.merge(artist_df, on=["Artist"], how="left")
new_spotify.Gender.value_counts().plot.pie()
Based on the graph above, gender is fairly evenly distributed among popular songs, with male artists accounting for slightly more than half; therefore, gender does not appear to play a significant role in the popularity of songs.
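Because the merge is a left join, any song whose artist is missing from the artist table keeps a missing `Gender` value, and `value_counts()` leaves those rows out of the pie chart by default. The sketch below shows one way this could be checked. The `songs` and `artists` tables are hypothetical stand-ins with made-up names and values; they are not the project's data.

```python
import pandas as pd

# Hypothetical stand-ins for spotify_df and artist_df (names and values are made up)
songs = pd.DataFrame({
    'Title':  ['Song A', 'Song B', 'Song C'],
    'Artist': ['Artist X', 'Artist Y', 'Artist Z'],
})
artists = pd.DataFrame({
    'Artist': ['Artist X', 'Artist Y'],
    'Gender': ['F', 'M'],
})

# indicator=True adds a _merge column that records where each row found a match
merged = songs.merge(artists, on='Artist', how='left', indicator=True)
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Artist'].tolist()
print(unmatched)

# dropna=False keeps songs with an unknown gender visible in the counts
print(merged['Gender'].value_counts(dropna=False))
```

If many artists turned out to be unmatched, the gender split in the chart would describe only part of the songs.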
Next, we want to see which genre was most dominant among hit songs.
top_10_genres = spotify_df['Top Genre'].value_counts().head(10)
#Plot the raw counts so the bar heights are the actual number of songs
top_10_genres.plot.bar()
Based on the graph above, the most prominent genre among popular songs is pop music. This is consistent with the general trends seen earlier when investigating each song attribute.
Between 2010 and 2019, the most prominent music genre, based on Spotify's data on Billboard songs, was pop music, including its subgenres.
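To measure how much of the chart belongs to pop and its subgenres together, the genre labels can be grouped by whether they contain the word "pop". The sketch below is a rough illustration: the `genres` series is a hypothetical sample written in the style of the 'Top Genre' column, and substring matching is a coarse rule that treats any label containing "pop" as pop.

```python
import pandas as pd

# Hypothetical genre labels in the style of the 'Top Genre' column
genres = pd.Series(['dance pop', 'pop', 'canadian pop', 'boy band',
                    'dance pop', 'electropop', 'hip hop'])

# Flag every label that contains the substring "pop"
is_pop = genres.str.contains('pop', case=False)
pop_share = is_pop.mean()

print(genres[is_pop].value_counts())
print(f'Share of pop labels: {pop_share:.0%}')
```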
The majority of popular songs showed common trends in the following song attributes, all provided by Spotify's data: BPM, dB, duration, energy, danceability, liveness, valence, acousticness, and speech. Based on the medians computed above, a typical popular song has a tempo of 120 beats per minute (BPM), an energy level of 74.0, a danceability of 66.0, a loudness of -5.0 dB, a liveness of 12.0, a valence of 52.0, a duration of 3.68 minutes, an acousticness of 6.0, and a speech value of 5.0.
Using this information, we can get a general idea about whether a song will become a hit or not.
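As a rough illustration of how these typical values might be used, the sketch below compares one song against them, feature by feature. It is not a predictive model: the `candidate` song is hypothetical with made-up values, the `typical` values are the ones listed in the summary above (with the duration converted from minutes to seconds), and the features are on different scales, so the gaps are not directly comparable with each other.

```python
import pandas as pd

# Typical values from the summary above (duration converted from 3.68 minutes)
typical = pd.Series({
    'BPM': 120, 'Energy': 74, 'Dance': 66, 'dB': -5, 'Liveness': 12,
    'Valence': 52, 'Duration': 3.68 * 60, 'Acoustic': 6, 'Speech': 5,
})

# Hypothetical candidate song (values are made up)
candidate = pd.Series({
    'BPM': 128, 'Energy': 80, 'Dance': 60, 'dB': -4, 'Liveness': 30,
    'Valence': 45, 'Duration': 210, 'Acoustic': 2, 'Speech': 8,
})

# Signed difference from the typical hit, ordered by the size of the gap
gap = candidate - typical
gap = gap.reindex(gap.abs().sort_values(ascending=False).index)
print(gap.round(1))
```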