Update (March 22, 2017): I've added some additional information below regarding the words that contribute most to artists' positions along the axes as plotted below. This was spurred by a conversation with Mustafa Hameed, and goes a long way towards clarifying what differences in lyrics I'm picking up on.
A few years ago, Matt Daniels took a look at the size of hip-hop artists' vocabularies, differentiating artists on the basis of how many unique words they used within their first 35,000 lyrics. I stumbled across this project again recently and began considering other ways to try to map out the differences and similarities between hip-hop artists. In particular, I started thinking about how to find out which rap artists are similar in terms of their lyrical content, rather than just their lyrical diversity.
To answer this question, I gathered all the lyrics on Genius for about 150 hip-hop artists. A full list of the artists I looked at appears at the bottom of this post. The artists I considered were chosen from Genius's list of top artists in January 2017, and a few online lists of the best rappers of all time. When scraping the lyrics from Genius, I did my best to correctly attribute verses when multiple artists appeared on the same song. I also tried to ensure that hooks and choruses for a single song were only included once and that if one verse appeared in multiple songs (say, the original version of a song and a remix), it would only be included once. There are certainly mistakes in the lyrics I collected because of inconsistent formatting on Genius, unintentional mistakes on my part in attributing verses to the wrong artists, inaccurately transcribed songs, missing songs, etc. Despite this, I'm fairly confident that the lyrics I collected give a good representation of each artist's lyrical content.
Warning: The next few steps get a little technical and you might get bored. In short, I took a number of steps to reduce the thousands of lyrics I had collected to two numbers per artist representing each artist's lyrical content. This then allowed me to plot each artist in a 2-dimensional space, with the proximity of two artists' points in this space indicating how similar their lyrics are. If you're not really interested in the details of how I did this, feel free to skip ahead to the "Results" section.
Okay, if you decided not to skip ahead and just look at the results, the next step after collecting the lyrics was to stem them. This is a process by which various forms of the same word are reduced to a single root form. For example, a stemmer would reduce the word "grinding" to "grind," so that for the purposes of the analysis, these would be counted as the same word. In particular, I used the Snowball stemmer as distributed in NLTK. Next, I simply took a count of how many times each (stemmed) word appeared in the full set of lyrics for an artist. More technically, we can say that I represented each artist's lyrics using a "bag-of-words" model and calculated the term frequency for each word. In some ways, this is an incredibly simple approach that throws away quite a bit of interesting information. Still, it works surprisingly well in practice and often allows us to abstract away from small details to reveal larger patterns.
To give a concrete example of how this works, consider the opening lines to "C.R.E.A.M.":
I grew up on the crime side, the New York Times side
Staying alive was no jive
After stemming, we get the following:
I grew up on the crime side, the New York Time side
Stay aliv was no jive
If you're thinking that the stemmer should have replaced "grew" with the root "grow" or "was" with the root "be," that's a good intuition. To do this, I would have had to use a tool called a lemmatizer, rather than a stemmer. Ideally, I would have done so, but lemmatizers like the WordNet lemmatizer that is included in NLTK also require part-of-speech information for each word. That is, I'd have to have known whether each word was a noun, or a verb, or an adjective, or a preposition, etc. Getting part-of-speech information would have added extra time and complexity to the project, so I stuck with using a stemmer. (I may follow up on this project in the future by reanalyzing the data with more sophisticated methods, such as part-of-speech tagging and a lemmatizer.)
Next, I simply took the counts of how often each word appeared in a given artist's lyrics. We can represent this as a vector, i.e. a list of numbers. For example, the above lines from "C.R.E.A.M." could be represented by a vector (list) like the following: [1,1,1,1,2,1,2,1,1,1,1,1,1,1,1]. Each position in this vector corresponds to the number of times a particular word appears in the lyrics. In this case, the first number corresponds to the number of times "I" appears, the second number corresponds to the number of times "grew" appears, etc. Most of the numbers in this list are 1, since most words that appear in these lines only appear once. But two words, "the" and "side," appear twice. Just to illustrate the type of information lost by representing the lyrics in this way, we now have no information about syntax or rhymes or phrases larger than a single word. We have no way of knowing, for example, that the stemmed word "Time" in these lines is part of a proper name "New York Times."
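To make the counting step concrete, here is a minimal Python sketch that turns the two stemmed lines above into a count vector. It keeps the capitalization from the example so the output lines up with the vector in the text (a real pipeline would most likely lowercase everything first), and the simple regular-expression tokenizer is an assumption for illustration rather than the tokenizer used in this project.

```python
import re
from collections import Counter

# The two stemmed lines from the example above.
stemmed_lines = (
    "I grew up on the crime side, the New York Time side\n"
    "Stay aliv was no jive"
)

# Split into word tokens, dropping punctuation (assumed tokenizer).
tokens = re.findall(r"[A-Za-z']+", stemmed_lines)

# Count occurrences; Counter keeps keys in order of first appearance.
counts = Counter(tokens)

vocabulary = list(counts)                  # one position per distinct word
vector = [counts[w] for w in vocabulary]   # the bag-of-words vector

print(vocabulary)
print(vector)
```

The fifth and seventh positions of the vector correspond to "the" and "side," the only two words that appear twice.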
I repeated this process for all of the lyrics I had collected for each artist, the result being a vector for each artist that represented the totality of that artist's lyrics. Note that each position in each of these vectors represents the same word for every rapper. This means that every rapper's list includes a position for every word used by any rapper. So, even though Grandmaster Caz never rapped about what goes down in the DM, there is a position in his vector with a 0 in it that represents how many times his lyrics contain the word "DM."
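Here is a small sketch of the shared-vocabulary idea. The two "artists" and their tokens are made up for illustration (they are not real lyrics); the point is only that every vector has a slot for every word used by anyone, with a 0 where an artist never uses the word.

```python
from collections import Counter

# Made-up stemmed tokens for two hypothetical artists (not real lyrics).
artist_tokens = {
    "artist_a": ["mic", "rhyme", "mic", "crowd"],
    "artist_b": ["dm", "rhyme", "dm", "dm"],
}

# The shared vocabulary contains every word used by any artist.
vocabulary = sorted({w for toks in artist_tokens.values() for w in toks})

# Each artist's vector has one slot per vocabulary word, 0 if unused.
vectors = {}
for name, toks in artist_tokens.items():
    counts = Counter(toks)  # a Counter returns 0 for words it has not seen
    vectors[name] = [counts[w] for w in vocabulary]

print(vocabulary)
print(vectors)
```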
Since these vectors represent each artist's lyrics, artists whose lyrics cover similar (or different) topics should have similar (or different) vectors. In principle, we should be able to get a sense of how similar artists are by comparing their vectors. Unfortunately, there are a few problems with comparing these vectors directly without some more processing. First, at this point my vectors only contained raw or absolute frequencies of the words that appear in each artist's lyrics. So, for example, suppose we want to compare whether Jay-Z or Kanye West raps more about luxury cars. We find out that Jay's lyrics contain twice as many uses of words like "lambo," "benz," and "whip" as Kanye's lyrics. We might conclude that Jay talks about cars twice as much as 'Ye. But wait! This ignores the fact that Jay-Z's first album came out in 1996, whereas Kanye's didn't drop until 2004. Jay simply has many more lyrics overall, so it's hard to say whether he's rapping more about cars than Kanye. This suggests that instead of using raw frequencies, we should be representing how frequently each word appears in an artist's lyrics relative to how many words appear in that artist's lyrics overall.
A second problem is that some of the words in these vectors are going to be more helpful for identifying how similar artists are than others. Words like "the" and "of" are going to be frequent for all artists, and so they won't tell us too much about the content of artists' lyrics. The flip side of this is that words that aren't used frequently by artists in general are more informative about the contents of lyrics when they are used.
Both of these problems can be addressed by applying a weighting method known as tf-idf, for "term frequency–inverse document frequency," to the vectors we have for each artist. For each word that appears in an artist's lyrics, we calculate two values, the "tf" weight and the "idf" weight, and then multiply them. In general, if we're not simply using raw frequencies, we want the "tf" value to discount words with high raw frequencies. The specific version of the "tf" weight that I used also accounts for how many words appear in an artist's lyrics in total. The "idf" value for a word accounts for the total number of artists that use the word in their lyrics at least once. (The details aren't that important for getting the main idea, but there are a few different ways to calculate "tf" and "idf" values. Here is what the versions I used depend on. For a term t and a document d (the set of all lyrics of one artist), the "tf" weight is computed from the raw frequency of t in d and the maximum raw frequency of any word in d. For the total number of documents (artists) N, the "idf" weight is computed from N and the number of documents (sets of artists' lyrics) in which t appears. All vectors were normalized after applying tf-idf weighting.)
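To make the weighting concrete, here is a minimal sketch of tf-idf in plain Python. The specific formulas are assumptions chosen for illustration: an "augmented" term frequency, 0.5 + 0.5 × (raw frequency / maximum raw frequency in the document), and an idf of log(number of documents / number of documents containing the term), followed by length normalization. Treat this as one common variant, not as a record of the exact weights used in this project. The count vectors are made-up toy data.

```python
import math

def tfidf_vectors(count_vectors):
    """count_vectors: equal-length lists of raw counts, one list per artist."""
    n_docs = len(count_vectors)
    n_terms = len(count_vectors[0])
    # For each term, the number of documents in which it appears at least once.
    doc_freq = [sum(1 for vec in count_vectors if vec[j] > 0)
                for j in range(n_terms)]
    weighted = []
    for vec in count_vectors:
        max_f = max(vec)  # maximum raw frequency of any word in this document
        row = []
        for j, f in enumerate(vec):
            if f == 0:
                row.append(0.0)  # unused words keep a weight of 0
                continue
            tf = 0.5 + 0.5 * f / max_f            # assumed tf variant
            idf = math.log(n_docs / doc_freq[j])  # assumed idf variant
            row.append(tf * idf)
        # Scale each vector to unit (Euclidean) length.
        norm = math.sqrt(sum(x * x for x in row))
        weighted.append([x / norm for x in row] if norm > 0 else row)
    return weighted

# Toy counts for two hypothetical artists over a four-word vocabulary.
counts = [
    [1, 0, 2, 1],
    [0, 3, 0, 1],
]
weights = tfidf_vectors(counts)
print(weights)
```

Notice that the last word, which both toy artists use, ends up with a weight of 0: a word used by everyone tells us nothing about how artists differ.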
At this point, we still have one vector per artist that represents this artist's lyrics. We could simply take pairs of these vectors to directly compare how one artist's lyrics differ from another's, but it would be more satisfying to compare the similarity of all artists at once. To do this, it would be useful to have some way of visualizing these vectors that both gives a sense of the big picture and lets us compare individual artists. The good news is that a vector can be thought of as a point in space. For example, the vector [2,3] represents a point in two-dimensional space with an x-coordinate of 2 and a y-coordinate of 3. The vector [1,0,-4] represents a point in three-dimensional space with an x-coordinate of 1, a y-coordinate of 0, and a z-coordinate of -4. So, in theory, we could take the vectors we have for each artist to represent points in space and then plot all the points.
The bad news is that the vector we have for each rapper has roughly 33,500 numbers. (Remember that each vector needs to include a spot for any word used by any artist.) Suffice it to say that this is not easy to plot. (Even bosonic string theory only has 26 dimensions.) Luckily, there are a number of ways to reduce representations in high-dimensional spaces (i.e. a set of vectors with lots of numbers) to representations in low-dimensional spaces (i.e. a set of vectors with fewer numbers) while preserving the structure of the data as much as possible. One of the more well-known methods for doing this is called Principal Component Analysis. For this project, I used a different method that is often used in computational linguistics known as Latent Semantic Analysis (LSA). (I also tried out a more contemporary method called t-SNE, but this didn't seem to work as well as LSA.) One virtue of LSA for a project like this is that it works well for "sparse" data sets, i.e. data sets that contain many 0's. In the context of this project, the data was sparse in the sense that most words used by any artist are unlikely to be used by a particular artist. Because the vector we have for each artist contains about 33,500 numbers, this means that there were about 33,500 unique words used by all artists that I looked at. But individual artists tended to use between 1,000 and 5,000 unique words, meaning that roughly 85%-97% of each list for an artist's lyrics consisted of 0's. (If you're noticing that the numbers for unique words per artist are much lower than those reported by Matt Daniels, note that Daniels did not stem words, so "grind" and "grinding" would have counted as different words for him, and that Daniels did not try to distinguish verses in a song by guest or featured artists from those of the song's primary artist.)
Using LSA, I reduced each artist's vector to a vector with just 2 values, allowing me to plot all of the artists in 2-dimensional space. We can think of this process as a way of getting down to a 2-dimensional representation of artists' lyrics that preserves as much of the variation in the original data set as possible. I ran LSA a few times, each time removing outliers that made the plots of the data uninteresting. In general, these outliers were artists for whom I had few lyrics. For groups, I tried to represent each member of the group separately whenever possible. In some cases, these individual members would have been outliers in the plot, so I collected the lyrics of each member of the group and represented the group as a single artist.
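For readers who want to see the mechanics, here is a minimal sketch of the reduction step using NumPy. LSA boils down to a truncated singular value decomposition (SVD) of the artist-by-word matrix; the sketch computes a full SVD with np.linalg.svd and keeps the top two components, which is fine for a toy matrix but is an assumption made for illustration, not the routine used on the real data. The matrix values and the four "artists" are made up.

```python
import numpy as np

# Toy tf-idf-style matrix: 4 hypothetical artists (rows) x 5 words (columns).
X = np.array([
    [0.6, 0.0, 0.8, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.7, 0.1, 0.0],
    [0.0, 0.9, 0.0, 0.0, 0.4],
])

# SVD: X = U @ diag(s) @ Vt, with singular values sorted largest first.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2
coords = U[:, :k] * s[:k]   # each row: one artist's 2-D coordinates

# Equivalent view: project each artist onto the top-k word directions.
assert np.allclose(coords, X @ Vt[:k].T)

print(coords)
```

Unlike PCA, this does not center the data first, so 0's in the matrix stay 0's going in, which is part of why the approach suits sparse data. In the toy example, the first and third artists share most of their words and land close together, while the second artist lands far from both.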
Below, you can see the final results of using LSA to plot the artists' lyrics in two dimensions. (If you skipped over the methods section, just know that LSA is a technique I used to distill the totality of an artist's lyrics down to 2 numbers.) I haven't included values on either axis, since it's not particularly enlightening to try to directly interpret the values that LSA gives. Rather, it's more useful to look at the overall structure and patterns in the data. Mouse over or tap on the points to see which artists they represent. The color-coding is explained below the chart.
One immediate question that arises is whether I've actually plotted artists in a way that reflects lyrical similarity, i.e. whether lyrically similar artists are represented by points close to each other in the plot and dissimilar artists are represented by points that are farther apart. One way to answer this question is to see if the points' locations correspond to some difference that we intuitively think would be related to a difference in lyrical content. Here, I've color-coded the points to reflect time, with each artist coded by the year of their first release, appearance, or credit according to Discogs. Blue points represent earlier starts to artists' careers, whereas red points represent more recent artists. Based on this coding scheme, the first LSA dimension (i.e. the x-axis) seems to be picking up on how lyrical content in hip-hop has changed over time, with older artists on the left and more recent artists on the right. This is a good indication that the plot is picking up on real differences in the content of artists' lyrics.
One consequence of the LSA method that I used is that different words will contribute towards artists' positions along each axis to different degrees. Looking at which words contribute most to artists' positions along the x-axis gives us some additional information about what differences this axis is picking up on. Here are the ten words that contribute most towards being positioned on the left side of the above plot: "MC," "Mic," "Microphone," "Lyric," "Hip-hop," "Crowd," "Has," "Stage," "Rhyme," and "Control." And here are the words that contribute most towards being positioned on the right side of the plot: "Shawty," "Ridin'," "100," "Kush," "Wrist," "Molly," "Pour," "Im," and "Aye." The words that are most associated with the left side of the plot have a lot to do with the core elements of hip-hop itself -- mics, lyrics, rhymes, etc. The words most associated with the right side of the plot have to do with drugs and alcohol -- kush, molly, pour -- and, I guess, keeping it 100 while riding with shawties. In principle, there's no reason that the same artists couldn't rap about both topics, but this analysis tells us that, in general, they don't. When we consider that the x-axis seems to correlate fairly well with time, these two sets of words also give us a sense of how the topics discussed in hip-hop have shifted over time.
We can do the same thing for the y-axis. Words that most contribute to artists appearing at the bottom of the plot are the following: "I," "the," "you," "a," "and," "to," "it," "my," "n't," "in." (Note that in processing the lyrics, I split contractions, so "n't" comes from words like "don't," "won't," etc.) These happen to be among the most common words among all rappers. On the other hand, uncommon words contribute most towards artists being positioned at the top of the plot. It's not too useful to look at exactly which words contribute most in this respect, since there are many words that only appear once that contribute equally to artists' positions along this dimension. (There are also many typos and misspellings that contribute towards artists being positioned at the top of the plot. That's just part of the problem in dealing with data from a site like Genius.) This dimension seems to be picking up on whether an artist's lyrics are dominated by highly common words or uncommon words. In some cases, this corresponds to Matt Daniels's ranking of rappers in terms of lyrical diversity. Notice, for example, that Aesop Rock, RZA, and MF Doom are all at the top of my plot and are among the most lyrically diverse rappers according to Daniels. But, as Mustafa Hameed pointed out to me, it's more appropriate to say that this dimension captures something like lyrical uniqueness. That is, the higher you are on my plot, the more you use words not used by other rappers in general. But your lyrics still might not be diverse in the sense that you use a relatively small set of (uncommon) words over and over.
We can also categorize rappers by region to see if the LSA representation can identify any differences between, say, East Coast and West Coast rappers.
There's no particularly clean separation between rappers from different regions, but it does appear that East Coast rappers are concentrated on the upper-left edge of the points, while southern rappers are mostly located in the lower-right corner. Midwestern rappers tend to appear in the middle of the plot, mostly on the right side. There's not any obvious pattern, at least as far as I can tell, when it comes to West Coast artists. Looking at this in terms of the interpretation we've given to the axes, East Coast rappers' lyrics tend to cover themes that were more prominent in early hip-hop, including the theme of hip-hop itself, and tend to use less common words. (Fetty Wap, a native of Paterson, New Jersey, is a notable exception here.) Southern rappers' lyrics tend to focus on more hedonistic topics and their lyrics are more dominated by common words.
Another interesting pattern is something I like to call the "region of wokeness," a grouping of alternative, political, and so-called "conscious" rappers on the left edge of the points. In the plot below, I've highlighted the points for A Tribe Called Quest, Common, Immortal Technique, Lauryn Hill, Mos Def, Public Enemy, and Talib Kweli. (If Jay-Z truly wanted to be lyrically Talib Kweli or to rhyme like Common Sense, he would have to shift a bit to the left to join this group.)
It's also interesting to note the artists in this region that I haven't singled out, but which, according to my methods, are lyrically similar to the "woke" artists. These include RZA, GZA, MF Doom, Nas, Busta Rhymes, the Beastie Boys, Cypress Hill, and Rakim. While I'm not sure these artists would all intuitively be classified in the same category, this analysis seems to think they're fairly similar. Having looked at what the axes seem to represent, we could say that these artists are grouped together because they are generally artists who were active during the same time period, who are a little bit closer to the "microphone" and "rhyme" side of things than the "shawty" and "kush" side of things, and whose lyrics aren't dominated by extremely common words.
We can also take a look at how similar artists who are members of the same group are. Click on the buttons below to highlight members of different groups. (I've noticed these buttons can be somewhat finicky on mobile. If it's not working for you, try refreshing the page.):
In general, these results say that members of the same group are fairly similar lyrically, which I take to be a good sign that the method is working. But we do see a few cases where members are lyrically distinct. While Jim Jones and Juelz Santana are plotted close to each other, Cam'ron is plotted higher, closer to Jay-Z, Drake, and Kanye. (Not to mention, to my surprise, Tyler, the Creator.) In the case of N.W.A., Dr. Dre and Ice Cube are considered fairly similar, but MC Ren and Eazy-E are a bit removed. This probably has to do with the longevity of these artists' careers. MC Ren hasn't been particularly active since the mid-90s, and Eazy-E passed away in 1995.
Before wrapping up, let's consider who the most typical and the most unique artists are, lyrically, given the method used here. For the most typical artist, we can calculate the distance between every pair of artists in the 2-dimensional space (or, alternatively, the original 33,500-dimensional space) and see who has the smallest average distance. Using the 2-dimensional space, we find that Earl Sweatshirt is the most typical rapper in terms of lyrical content. This was pretty surprising to me, although it is consistent with the plots above in which Earl is pretty centrally located. If we look at distance in the 33,500-dimensional space, The Game comes out as the most typical rapper. This strikes me as more intuitive, although The Game appears at the very top of the above plots.
What about the most unique artist? Again, we can look at the distance between every pair of artists and see who has the largest average distance. In both the 2-dimensional space and the 33,500-dimensional space, the most unique artist by this measure is Grandmaster Caz. This result is probably just a consequence of the fact that there are few contemporaries of Grandmaster Caz in this data set. Another way we might try to determine the most unique rapper is to first find every rapper's closest lyrical neighbor. Then, we can figure out who is located farthest from their closest neighbor. In other words, rather than asking the question, "Who is lyrically distant from every rapper in general?" we can ask, "Who does not have any particularly close neighbors?" Using this method in the 33,500-dimensional space, we still categorize Grandmaster Caz as the most unique. But in the 2-dimensional space, Eric B., Grandmaster Caz, Grandmaster Flash, and Lauryn Hill are all about equally unique. For Eric B. and the two Grandmasters, this is again probably a consequence of them having few contemporaries in the data set. But the same is not true for Lauryn Hill, whose point appears on the left edge of the region of woke artists, making her uniqueness a bit more significant.
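Both measures are easy to sketch in Python. The points below are made-up 2-dimensional coordinates for four hypothetical artists, not values from my plots, and I'm assuming ordinary Euclidean distance here.

```python
import math

# Hypothetical 2-D coordinates; the names and numbers are made up.
points = {
    "A": (0.0, 0.0),
    "B": (1.0, 0.0),
    "C": (0.0, 2.0),
    "D": (5.0, 5.0),
}

def mean_distance(name):
    """Average Euclidean distance from one artist to every other artist."""
    others = [p for n, p in points.items() if n != name]
    return sum(math.dist(points[name], p) for p in others) / len(others)

def nearest_neighbor_distance(name):
    """Distance from one artist to their closest neighbor."""
    return min(math.dist(points[name], p)
               for n, p in points.items() if n != name)

most_typical = min(points, key=mean_distance)              # smallest average distance
most_unique = max(points, key=mean_distance)               # largest average distance
most_isolated = max(points, key=nearest_neighbor_distance) # farthest closest neighbor

print(most_typical, most_unique, most_isolated)
```

The same functions work unchanged on longer coordinate tuples, so nothing here is specific to the 2-dimensional case.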
(Note: An earlier version of this post only mentioned Lauryn Hill as the most unique artist according to the "whose closest neighbor is the farthest?" metric. To apply LSA to my data, I used the TruncatedSVD class in scikit-learn, which by default uses a randomized method. This means each run on the same data set can give a slightly different result. When I revisited my data, I started seeing artists other than Lauryn Hill come up as the most unique and investigated what was going on. It turns out that Hill, Eric B., and Grandmasters Caz and Flash are all about the same distance away from their nearest neighbor. When the randomized method is used, different artists may come out as the most unique each time the analysis is run. After noticing this effect, I re-ran the analysis using a non-random version of TruncatedSVD that uses ARPACK. This yields Grandmaster Flash as the "true" most unique artist.)
We've seen that the approach taken here for identifying similar and dissimilar artists gets us a number of intuitive results, so it stands to reason that it's correctly identifying real patterns in the lyrics of these artists. But the most interesting results here are probably those that are unintuitive. Of course, there are likely to be some artists that are incorrectly categorized as lyrically similar according to this method, but there are probably also some interesting lyrical similarities that this method reveals that have gone unnoticed. I'm not ready to say which results are accidents and which are insightful. (For example, I'm not going to defend the claim that Nate Dogg and B.o.B., who appear close to each other in the center of the bottom of the plots above, are two of rap's long lost lyrical siblings.) But I think it's worthwhile to ask questions along these lines that wouldn't have occurred to us had we not tried to compare artists in this way.
The full set of artists considered in this project is as follows: 2 Chainz, 2Pac, 50 Cent, A$AP Ferg, A Tribe Called Quest, A$AP Rocky, Ad-Rock, Aesop Rock, Ali Shaheed Muhammad, André 3000, B.o.B., Beastie Boys, Big Boi, Big Daddy Kane, Big L, Big Pun, Big Sean, Bun B, Busta Rhymes, Cam'ron, Cappadonna, Chance the Rapper, Chief Keef, Childish Gambino, Chuck D, Clipse, Common, Cypress Hill, Danny Brown, Desiigner, Diplomats, DMX, Dr. Dre, Drake, E-40, Earl Sweatshirt, Eazy-E, Eminem, Eric B., Fabolous, Fat Joe, Fetty Wap, Flatbush Zombies, Flavor Flav, Freekey Zekey, French Montana, Future, G-Eazy, Ghostface Killah, Grandmaster Caz, Grandmaster Flash, Gucci Mane, Guru, GZA, Hopsin, Ice Cube, Ice-T, Immortal Technique, Inspectah Deck, J. Cole, Ja Rule, Jarobi, Jay-Z, Jeezy, Jeremih, Jim Jones, Joey Bada$$, Juelz Santana, Juicy J, Kanye West, Kendrick Lamar, Kevin Gates, Kid Cudi, Kid Ink, Kool G Rap, Kool Moe Dee, KRS-One, Kurtis Blow, Lauryn Hill, Lil Durk, Lil Kim, Lil Wayne, LL Cool J, Logic, Ludacris, Lupe Fiasco, Mac Miller, Machine Gun Kelly, Macklemore and Ryan Lewis, Maseo, Masta Killa, MC Ren, MCA, Meek Mill, Melle Mel, Method Man, MF Doom, Migos, Mike D, Missy Elliott, Mobb Deep, Mos Def, N.W.A., Nas, Nate Dogg, Nelly, Nicki Minaj, No Malice, Offset, Outkast, Pharrell Williams, Pimp C, Posdnuos, Public Enemy, Pusha T, Q-Tip, Quavo, Queen Latifah, Rae Sremmurd, Raekwon, Rakim, Redman, Rich Homie Quan, Rick Ross, Run DMC, RZA, Scarface, Schoolboy Q, Slick Rick, Slim Jxmmi, Snoop Dogg, Soulja Boy, Swae Lee, T.I., Takeoff, Talib Kweli, Tech N9ne, The DOC, The Game, The Notorious B.I.G., Travis Scott, Trugoy the Dove, Twista, Ty Dolla $ign, Tyga, Tyler, The Creator, U.G.K., Wale, Will Smith, Wiz Khalifa, Wu-Tang Clan, Xzibit, YG, Yo Gotti, Young Thug.