Introduction
This is my first project using Word2Vec. I’m really more of a math and computer science guy than a literature one, though I think I found some cool results in this project, and the analysis Word2Vec allows you to do is quite amazing! I chose to do my topic on African Americans in the South (more specifics later) for two reasons: 1. there’s a lot to be researched and I enjoy discussing race and culture (an interest imparted to me by my English teachers over the years), and 2. the corpus was available to me. Before starting this project I considered compiling a corpus of mathematics texts and analyzing them; unfortunately, assembling and analyzing that corpus would have taken far too long. So I got to do something just as interesting that I still cared about!
What is Word2Vec?
Word2Vec is a software package that lets one build a computational model of a body of works and then query that model, analyzing the corpus in a way that complements “close reading.” Training a model on a corpus assigns each word a vector that captures its ‘meaning.’ A vector is an object made up of slots that hold numbers, and each slot represents a possible association. One can compute the distance between two vectors as well as add them, which makes for very flexible querying. For example, the words ‘good’ and ‘great’ are likely to appear in similar contexts in a text, so one would expect their vectors to be close together. Similarly, one can generate a list of all the words closest to a word (or to a group of words, since a combination of words is simply another vector in the model). There are several interesting visualizations and processes that can be done with these models that are not discussed or shown in this post; more information here and here. As context for this post, w2vModel is the trained model associated with the corpus.
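To make "closeness" concrete, here is a minimal sketch in base R of cosine similarity, the usual closeness measure for word vectors. The three-dimensional vectors are made-up toy numbers, not values from w2vModel, and the function names are my own for illustration.

```r
# Toy 3-dimensional "word vectors" (invented numbers, NOT taken from w2vModel)
vectors <- rbind(
  good  = c(0.9, 0.1, 0.0),
  great = c(0.8, 0.2, 0.1),
  table = c(0.0, 0.1, 0.9)
)

# Cosine similarity: dot product divided by the product of the vector lengths
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Rank every word in the toy vocabulary by similarity to a query vector
closest_words <- function(query, vocab) {
  scores <- apply(vocab, 1, cosine_similarity, b = query)
  sort(scores, decreasing = TRUE)
}

sim_good_great <- cosine_similarity(vectors["good", ], vectors["great", ])
sim_good_table <- cosine_similarity(vectors["good", ], vectors["table", ])
ranking <- closest_words(vectors["good", ], vectors)
print(ranking)
```

In this toy vocabulary ‘great’ scores much closer to ‘good’ than ‘table’ does, which is the kind of pattern one hopes a trained model will show on real text.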
The Corpus
The corpus I selected was the First-Person Narratives of the American South collection (found here), a group of documents put together by the University of North Carolina. The corpus has 294 files (which contain 344 items according to the readme), all digitized by people, meaning their accuracy is most likely better than that of a machine parser. The corpus contains diaries, memoirs, narratives, etc. of those underrepresented in the American South between 1860 and 1920, such as African Americans, women, and Native Americans. I am primarily interested in the views of ex-slaves and African Americans, but I kept the other narratives as they provide good context for the model. In total the corpus has over 11 million words, well above the commonly cited rule-of-thumb minimum of about 1 million words for a usable model.
Data Cleaning
In order to build a proper computational model, one must clean the data being worked with. For my project this was quite simple: the only items I had to strip out were annotations enclosed in brackets, an easy job for regular expressions. I chose not to remove section headers such as “Introduction,” “Preface,” and “Chapter I,” as they did not appear in a consistent way and the corpus is quite large. There were also some chapter, book, and illustration titles which I chose not to remove, again because of inconsistencies in how they occurred in the texts. I don’t think these choices hurt my model or findings, for several reasons: I didn’t query the model for any of those words, none of them turned up in my query results, and they are few and far between in the documents. Finally, I chose not to remove additional “trivial” words beyond a, and, the, etc. (such as I, me, they, etc.), as I thought I might query on them to see how people felt specifically. As they didn’t appear in any queries, I believe this decision had no consequences except possibly to dilute the “closeness” scores, which is not a serious problem.
I also had the choice of sifting through the collection for only narratives of ex-slaves (my primary query) and removing those of the other groups. However, I believe that the other narratives provide good context and strengthen the model, rather than dilute it, and so I kept them.
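The bracket-stripping step described above can be sketched in base R as follows. This assumes the annotations sit inside square brackets and never nest; the sample sentences and the function name strip_annotations are invented for illustration and are not from the corpus or my actual script.

```r
# Invented sample lines (NOT from the corpus) with bracketed annotations
raw_lines <- c(
  "I was born in Virginia [Page 12] in the year 1840.",
  "My mother [illegible] worked in the house."
)

# Remove every [ ... ] annotation, then tidy the whitespace left behind.
# Assumption: annotations use square brackets and are not nested.
strip_annotations <- function(text) {
  no_brackets <- gsub("\\[[^]]*\\]", "", text)
  trimws(gsub("[[:space:]]+", " ", no_brackets))
}

cleaned <- strip_annotations(raw_lines)
print(cleaned)
```

If the annotations used a different delimiter, only the first pattern would need to change.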
Training the Model
I created the model with 400 dimensions (meaning each word is associated with 400 different numbers, each corresponding to a different aspect of meaning that the machine learning process assigns) and trained it with 100 iterations and a window size of 12. Iterations denote how many times the training process passes over the entire body of texts. More passes generally give the model more chances to refine its vectors, so to improve the accuracy of the results I used a fairly large number. The window size describes how far apart two words can be and still be treated as context for one another; a very small window means training only ever sees very small groups of words at a time and is unlikely to catch all of the relationships between words. The larger window therefore helps me capture broader associations at the cost of diluting results (more connections mean that more similarities will appear when comparing words than if there were fewer connections).
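To see what the window size changes, here is a small base R sketch that lists the (target, context) word pairs a given window would expose. It is a simplification of what training actually does, and the sentence and the function name context_pairs are made up for illustration.

```r
# List every (target, context) pair within `window` words of each other.
# A simplified picture of the word pairs a model gets to learn from.
context_pairs <- function(tokens, window) {
  pairs <- list()
  for (i in seq_along(tokens)) {
    lo <- max(1, i - window)
    hi <- min(length(tokens), i + window)
    for (j in lo:hi) {
      if (j != i) {
        pairs[[length(pairs) + 1]] <- c(target = tokens[i], context = tokens[j])
      }
    }
  }
  do.call(rbind, pairs)
}

# Invented example sentence (NOT from the corpus)
tokens <- c("the", "preacher", "spoke", "of", "faith", "and", "mercy")
narrow <- context_pairs(tokens, window = 1)
wide   <- context_pairs(tokens, window = 3)

# Is "faith" ever seen as context for "preacher"?
pairs_link <- function(p) any(p[, "target"] == "preacher" & p[, "context"] == "faith")
print(c(narrow = nrow(narrow), wide = nrow(wide)))
```

With a window of 1 the pair (preacher, faith) never occurs, while a window of 3 includes it along with many more pairs. That is the trade-off I describe above: more associations captured, but more connections overall.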
Here’s a projection of the trained model/corpus onto the plane:
Querying the Model
When beginning my inquiries into the corpus I had a couple of questions in my head, the largest and most general being: “what was the importance of religion to slaves in the South?” I had a feeling this might give some interesting results. Namely, I wanted to know whether the strong influence of white power on the Christian faith in the South changed how African Americans responded to it: did they view the religion as more oppressive, or did they see past the oppression to the core beliefs? I wanted to query this, but I also wanted to make sure I wasn’t missing any interesting or surprising results hidden within the model, so the first query I made was a clustering. I asked the model to sort its words into 100 clusters of words that are ‘close together’ as vectors, then printed twenty words from each of thirty randomly chosen ‘buckets,’ giving me a decent variety of output to sort through. One of these buckets was particularly interesting and, luckily, applied very aptly to my question:
Code:
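# Partition the word vectors into 100 clusters with k-means
centers = 100
clustering = kmeans(w2vModel,centers=centers,iter.max = 100)
# Print the first 20 words from each of 30 randomly chosen clusters
sapply(sample(1:centers,30),function(n) {
  names(clustering$cluster[clustering$cluster==n][1:20])
})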
Output:
"christianity"
"brotherhood"
"profess"
"practices"
"precepts"
"denounce"
"hypocrisy"
"corrupt"
"enslave"
"inalienable"
"consciences"
"degrade"
"oppress"
"vindicate"
"dictates"
"blot"
"uphold"
"depravity"
"cling"
"upheld"
The contrast of the first three words, ‘christianity,’ ‘brotherhood,’ and ‘profess,’ with the rest of the highly negative words in the bucket surprised me; it seemed like my question about the meaning of religion held some validity. This bucket also gave me a list of words I could query on to “go down the rabbit hole” and continue. I decided to focus my inquiries on how African Americans responded to the white influence, not whether a response occurred: “Was the Christian faith viewed in a more positive or more negative light by the ex-slaves and African Americans of the South?” Looking at two other buckets of the output (twenty words each):
"god"
"lord"
"spirit"
"soul"
"heaven"
"christ"
"jesus"
"prayer"
"pray"
"faith"
"ye"
"gospel"
"glory"
"holy"
"blessed"
"mercy"
"divine"
"sin"
"grace"
"god's"
"system"
"cruel"
"cruelty"
"wicked"
"beings"
"oppression"
"miserable"
"victims"
"victim"
"creatures"
"passion"
"exposed"
"worst"
"helpless"
"savage"
"injustice"
"curse"
"degraded"
"horrible"
"overseers"
I noticed an obvious connection between the first of these buckets and my original one, and the ideas in the second seemed closely tied to those expressed in the latter portion of the original. It seemed to me that religion may have carried two different meanings among the underrepresented in the American South: that of God and holiness, and that of oppression by the authorities in the community. This refined my question: Christianity didn’t hold just one meaning, and white people holding power didn’t have just one result; to the ex-slaves and African Americans of the South, Christianity carried a dual meaning of religious faith and of an oppressive system.
Looking at Connections
Wanting to examine these findings a bit more, I queried the model with two questions: what words is Christianity ‘close to’ with an increased notion of it as a system, and what words is Christianity like with an increased importance of faith? The + in the code below indicates the vector addition operation, which semantically translates to “what words are used most with the combination (of meanings) of ‘christianity’ and ‘faith’?” Additionally, the number to the right of each word denotes how close that word is to the query in terms of usage/meaning. It is a cosine similarity, which can range from -1 to 1, with a higher value indicating more related; every score in the lists below falls between 0 and 1.
Code:
w2vModel %>% closest_to(~'christianity'+'system',20)
w2vModel %>% closest_to(~'christianity'+'faith',20)
Output:
word similarity to "christianity" + "system"
1 christianity 0.8442163
2 system 0.7997739
3 religion 0.5164677
4 slavery 0.4969512
5 principles 0.4766329
6 morality 0.4349808
7 civilization 0.4218466
8 doctrines 0.4152578
9 nature 0.4148770
10 evils 0.4130148
11 principle 0.4112998
12 institution 0.4098427
13 overthrow 0.3967365
14 humanity 0.3952436
15 fundamental 0.3808701
16 moral 0.3763819
17 its 0.3735489
18 oppression 0.3730632
19 doctrine 0.3693422
20 universal 0.368790
word similarity to "christianity" + "faith"
1 christianity 0.8284313
2 faith 0.7640235
3 religion 0.6478612
4 christ 0.5204228
5 christian 0.5088124
6 doctrines 0.4684567
7 spirit 0.4632661
8 principles 0.4627643
9 love 0.4477762
10 god 0.4399470
11 divine 0.4381351
12 precepts 0.4264785
13 morality 0.4242452
14 professed 0.4205163
15 gospel 0.4196988
16 truth 0.4185350
17 christians 0.4149958
18 principle 0.4077469
19 jesus 0.4056524
20 true 0.4045306
It is clear that the two lists have similarities, though there are some big contrasts: the former has ‘slavery’ with a score of about 0.5. This indicates that the idea of Christianity as a system is closely associated with the institution of slavery. For context, I treat 0.5 as a rough cutoff for a strong association, though given the size and diversity of the corpus, and the dilution effects discussed earlier in this post, this bound may reasonably be shifted slightly lower. Additionally, several more words with negative connotations, such as ‘evils,’ ‘institution,’ and ‘overthrow,’ occur farther down the list. In contrast, the words in the latter query are all positive, indicating that Christianity is not simply a negative system of oppression; some ideas in the faith carried good connotations for the people of the narratives.
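The + queries above can be pictured with a toy example in base R: add the vectors for the query words, then rank the vocabulary by cosine similarity to the sum. The vectors are invented numbers on deliberately neutral words rather than anything from w2vModel, additive_query is a hypothetical helper name, and the sketch sums raw vectors without assuming anything about how the package normalises them internally.

```r
# Invented 3-dimensional vectors on neutral words (NOT from w2vModel)
vectors <- rbind(
  bank  = c(0.6, 0.6, 0.1),
  water = c(0.9, 0.1, 0.0),
  money = c(0.1, 0.9, 0.1),
  river = c(0.8, 0.2, 0.1),
  loan  = c(0.2, 0.8, 0.2)
)

cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Sum the vectors of the query words, then score every word against the sum.
# The query words themselves stay in the ranking, as in the output above.
additive_query <- function(words, vocab) {
  query <- colSums(vocab[words, , drop = FALSE])
  scores <- apply(vocab, 1, cosine_similarity, b = query)
  sort(scores, decreasing = TRUE)
}

with_water <- additive_query(c("bank", "water"), vectors)
with_money <- additive_query(c("bank", "money"), vectors)
print(round(with_water, 3))
print(round(with_money, 3))
```

Adding ‘water’ pulls ‘river’ above ‘loan’ in the ranking, and adding ‘money’ does the reverse. This is the same mechanism by which adding ‘system’ or ‘faith’ to ‘christianity’ shifts which neighbors rise to the top.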
Visualizing Differences
To further examine the duality in meaning I generated a cluster dendrogram on the words ‘christianity,’ ‘faith,’ and ‘system.’ A cluster dendrogram is a tree structure that takes the words most closely associated with a list of words and groups them into clusters, allowing one to see the decomposition of meanings within the group:
Code:
# Take the 300 words nearest to the three query words
religion_words = w2vModel %>% nearest_to(w2vModel[[c("christianity", "system", "faith")]],300) %>% names
# Keep the vectors for the 75 closest of those words
religions = w2vModel[rownames(w2vModel) %in% religion_words[1:75],]
# Pairwise cosine distances, then hierarchical clustering drawn as a tree
religion_distances = cosineDist(religions,religions) %>% as.dist
plot(as.dendrogram(hclust(religion_distances)),horiz=F,cex=1,main="Cluster dendrogram of the 75 closest words to 'christianity,' 'system,' and 'faith'")
A note on interpreting the above: branchings closer to the top of the tree mark the highest-level divisions in meaning among these words. Therefore, to find the “decomposition” of the meaning of a list of words, one primarily wants to look at the subsets generated by the first few branchings. Of course, to go more in depth, one can also look at the lower-level branchings (see Prof. Benjamin Schmidt’s introductory piece on Word2Vec).
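For readers who want to try the idea without a trained model, here is a minimal base R sketch of the same pipeline on four invented vectors. It assumes cosine distance means one minus cosine similarity; the numbers and words are toy values, not from w2vModel.

```r
# Invented 3-dimensional vectors (NOT from w2vModel)
vectors <- rbind(
  river = c(0.8, 0.2, 0.1),
  water = c(0.9, 0.1, 0.0),
  money = c(0.1, 0.9, 0.1),
  loan  = c(0.2, 0.8, 0.2)
)

# Scale each row to unit length so a dot product equals cosine similarity
unit <- vectors / sqrt(rowSums(vectors^2))

# Assumed definition: cosine distance = 1 - cosine similarity, for every pair
cosine_distance <- as.dist(1 - unit %*% t(unit))

# Hierarchical clustering; the first branching is the top-level split
tree <- hclust(cosine_distance)
groups <- cutree(tree, k = 2)
print(groups)

# To draw the tree: plot(as.dendrogram(tree))
```

Cutting the tree at its first branching (k = 2) is the programmatic version of reading off the top-level split by eye: here ‘river’ and ‘water’ land in one group and ‘money’ and ‘loan’ in the other.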
As one can see, the first branching splits the words into two groups, and all of those on the left have to do with the faith side of Christianity: ‘God,’ ‘love,’ ‘spirit,’ etc. On the right side, after one more split in the tree, two groups appear: one having to do with the structure-like nature of religion, such as ‘laws,’ ‘doctrine,’ ‘practices,’ etc., while those on the right are of an entirely more sinister nature: ‘oppression,’ ‘institution,’ ‘slavery,’ ‘power,’ etc. Therefore, to the people of the narratives, religion has three high-level meanings: a general idea of faith and God, the structured nature that comes along with any organization, and a category with strong negative connotations. The two meanings of Christianity to the African American communities are then that of a faith and that of oppression. I drop the “less important” idea of Christianity as an organization because it is like assigning three meanings to government: helpful, controlling, and organization. Clearly the final idea is not as important as the first two, as it is a property of the objective rather than the subjective.
Supporting Evidence
These claims are not simply based on the whims of my model, however. In agreement with the claims above, Chapter 9, “Religion,” of Strategies for Survival: Recollections of Bondage in Antebellum Virginia presents a similar notion of the duality of Christianity: “When white clergymen preached to bondpeople, they normally presented an expurgated version of Christianity, which affirmed that obedience to masters would be rewarded in the afterlife” (Dusinberre 121). The Christian faith, when taught by white preachers, held that obedience to enslavement, to the system, is fundamentally and inherently right. To the enslaved people this was obviously not an acceptable notion, and the book explains the response and result: “[enslaved people] were often recruited… not by white preachers, but by black preachers and exhorters… evangelicals preached a doctrine of spiritual equality; even when this message was watered down by white preachers, black converts retained their own understandings of it” (Dusinberre 123). Instead of disregarding the Christian faith altogether, the African American communities decided to accept the core beliefs while deemphasizing the oppressive ideas taught by the Southern religious authority figures.
Therefore, the words on the far left of the dendrogram are the retained understanding of the faith itself, while those on the far right are the connotations of the messages white preachers specifically tuned for those in slavery, along with the reactions of African Americans to those messages. Although my model is based on narratives written between 1860 and 1920 and the source speaks of those still in slavery in the 1800s, I think that my claim still holds, as the treatment of African Americans in the South remained similar after the Civil War; the culture stayed very much the same. Additionally, many of the narratives are from people who lived before the Civil War and wrote afterward, such as all of the ex-slave narratives. I believe that the results I found apply to the African American points of view despite the diversity in the corpus, due to the close ties to the word ‘slavery’ in my queries as well as the discussion in the book above; the book and my claims line up well. Finally, I believe that the large size of the corpus gives the results statistical weight, so they are not baseless. The uniformity of the query results likewise suggests that the corpus was large enough: a corpus that was too small would produce random or misplaced words in the queries, which did not occur.
Although I don’t have the most background in Word2Vec or the South in the late 1800s, I think that my results are interesting (as I generalize the claim to the whole South and at least corroborate other results), especially as they were computed with the relatively new method of Word2Vec, which is not seen too often in literary studies.
Future Work
I think that this work can be extended in several ways: one can look at how other groups, such as women or lower-class white men, viewed the faith, as well as how African Americans viewed other cultural ideas such as family, work, freedom, etc. For the former, one can limit the dataset, restricting it to the groups of interest. Additionally, one can find supplemental databases to increase the accuracy of the queries. For the latter, one can query this model directly, restrict the data, or add other sources.
One final way to extend this work would be to compare African Americans in the North with those in the South; I don’t think the results would be obvious, and there may be interesting findings from comparing the two.
Works Cited
Dusinberre, William. “Religion.” Strategies for Survival: Recollections of Bondage in Antebellum Virginia, University of Virginia Press, 2009, pp. 121–140. JSTOR, www.jstor.org/stable/j.ctt6wrmk7.13.