Today I am going to compare three dimensionality reduction methods: PCA (principal component analysis), PLS (partial least squares), and UMAP (uniform manifold approximation and projection). The dataset is the Billboard Top 100 songs data from #TidyTuesday.
Explore data
Our goal is to reduce the dimensionality of the features of Billboard Top 100 songs, connecting each song's chart position with the features (mostly audio features) available from Spotify.
library(tidyverse)
## billboard ranking data
billboard <- readr::read_csv("billboard.csv")
## spotify feature data
audio_features <- readr::read_csv("audio_features.csv")
Importing .csv files with the data.table package is generally faster, but the Billboard Top 100 songs dataset is not very large, so readr is sufficient.
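For reference, here is a minimal sketch of what the data.table route could look like. It assumes the data.table package is installed and reads a tiny stand-in file written to a temporary path (not the real billboard.csv), so the column names and values are purely illustrative.

```r
library(readr)
library(data.table)
library(tibble)

# Write a tiny stand-in CSV so the example runs without billboard.csv
tmp <- tempfile(fileext = ".csv")
writeLines(
  c("song,week_position",
    "Song A,34",
    "Song A,22"),
  tmp
)

# readr: returns a tibble
via_readr <- read_csv(tmp, show_col_types = FALSE)

# data.table: fread() returns a data.table; convert it to a tibble
# so the rest of a tidyverse pipeline behaves the same way
via_fread <- as_tibble(fread(tmp))

print(via_readr)
print(via_fread)
```

Both calls produce the same columns and rows; the column types can differ slightly (for example integer versus double), which is worth checking before swapping one reader for the other.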
 | url | week_id | week_position | song | performer | song_id | instance | previous_week_position | peak_position | weeks_on_chart |
---|---|---|---|---|---|---|---|---|---|---|
1 | http://www.billboard.com/charts/hot-100/1965-07-17 | 7/17/1965 | 34.00 | Don’t Just Stand There | Patty Duke | Don’t Just Stand TherePatty Duke | 1.00 | 45.00 | 34.00 | 4.00 |
2 | http://www.billboard.com/charts/hot-100/1965-07-24 | 7/24/1965 | 22.00 | Don’t Just Stand There | Patty Duke | Don’t Just Stand TherePatty Duke | 1.00 | 34.00 | 22.00 | 5.00 |
3 | http://www.billboard.com/charts/hot-100/1965-07-31 | 7/31/1965 | 14.00 | Don’t Just Stand There | Patty Duke | Don’t Just Stand TherePatty Duke | 1.00 | 22.00 | 14.00 | 6.00 |
4 | http://www.billboard.com/charts/hot-100/1965-08-07 | 8/7/1965 | 10.00 | Don’t Just Stand There | Patty Duke | Don’t Just Stand TherePatty Duke | 1.00 | 14.00 | 10.00 | 7.00 |
5 | http://www.billboard.com/charts/hot-100/1965-08-14 | 8/14/1965 | 8.00 | Don’t Just Stand There | Patty Duke | Don’t Just Stand TherePatty Duke | 1.00 | 10.00 | 8.00 | 8.00 |
6 | http://www.billboard.com/charts/hot-100/1965-08-21 | 8/21/1965 | 8.00 | Don’t Just Stand There | Patty Duke | Don’t Just Stand TherePatty Duke | 1.00 | 8.00 | 8.00 | 9.00 |
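
The chart data has one row per song per week, and week_id is stored as text in month/day/year form. One way to connect chart positions with Spotify features is to collapse the chart data to one row per song and then join on song_id. The sketch below is only an illustration under stated assumptions: it uses small toy tibbles built from the first rows shown above, it assumes the audio feature table shares the song_id key, and the danceability and energy values are made up.

```r
library(dplyr)
library(tibble)

# Toy chart data copied from the first three rows of the table above
billboard_toy <- tibble(
  song_id = rep("Don't Just Stand TherePatty Duke", 3),
  week_id = c("7/17/1965", "7/24/1965", "7/31/1965"),
  week_position = c(34, 22, 14),
  weeks_on_chart = c(4, 5, 6)
)

# Hypothetical audio features: the values here are invented for illustration
audio_toy <- tibble(
  song_id = "Don't Just Stand TherePatty Duke",
  danceability = 0.5,
  energy = 0.5
)

# Parse the text dates, then summarise to one row per song
song_summary <- billboard_toy %>%
  mutate(week_id = as.Date(week_id, format = "%m/%d/%Y")) %>%
  group_by(song_id) %>%
  summarise(
    first_week = min(week_id),
    best_position = min(week_position),      # lower number = higher on the chart
    max_weeks_on_chart = max(weeks_on_chart),
    .groups = "drop"
  )

# Attach the audio features; songs without features are dropped by inner_join()
joined <- inner_join(song_summary, audio_toy, by = "song_id")
print(joined)
```

Whether to summarise by best position, by weeks on chart, or by something else depends on which outcome the dimensionality reduction should be related to; the summary columns above are just candidates.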