In this project, students will use the programming language R to prepare a corpus of texts for analysis with the word2vec package. You should identify selection criteria for your corpus, complete whatever processes are necessary to make text files that can be loaded into RStudio, train your model or models, and then conduct analysis on your corpus with your trained model(s). You are encouraged to experiment with different parameters and data-preparation processes, and you will be asked to explain how you arrived at your decisions for model preparation and training in your final writeup for this assignment.
First, you should create a corpus with *at least* 500,000 words; a million or more words will give you more reliable results. Be thoughtful in how you select and prepare texts for your corpus—you should choose texts that will enable you to get at research questions you’re interested in exploring. You should submit your corpus to us for approval no later than November 9th. By email, send a brief description of your research question(s), a description of the texts you’re including (how many texts and words, where you got the texts, how you selected them), and your current plans for data preparation.
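One way to check a corpus against the 500,000-word minimum is to count whitespace-separated tokens in each text file. The following is a minimal sketch in base R, assuming your corpus is a folder of plain `.txt` files; the function name `count_corpus_words` and the tiny demo folder are hypothetical and exist only for illustration. Your word2vec package may tokenize differently, so treat the total as an estimate.

```r
# Hypothetical helper: count whitespace-separated tokens in every .txt file
# in a folder. This is a rough estimate, not the tokenization word2vec uses.
count_corpus_words <- function(corpus_dir) {
  files <- list.files(corpus_dir, pattern = "\\.txt$", full.names = TRUE)
  counts <- vapply(files, function(path) {
    text <- paste(readLines(path, warn = FALSE), collapse = " ")
    tokens <- strsplit(trimws(text), "\\s+")[[1]]
    sum(nzchar(tokens))  # ignore empty strings from blank files
  }, integer(1))
  names(counts) <- basename(files)
  counts
}

# Demonstration on a tiny throwaway corpus written to a temporary folder
demo_dir <- file.path(tempdir(), "demo_corpus")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("It is a truth universally acknowledged,", "that a single man"),
           file.path(demo_dir, "a.txt"))
writeLines("Call me Ishmael.", file.path(demo_dir, "b.txt"))

word_counts <- count_corpus_words(demo_dir)  # per-file counts
total_words <- sum(word_counts)              # corpus total
print(word_counts)
print(total_words)
```

To use this on your own corpus, replace `demo_dir` with the path to your corpus folder and compare `total_words` with the minimum. The per-file counts are also handy for the corpus description you send for approval.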
Remember that word2vec lets you ask questions only at the level of the full corpus (that is, unlike with topic modeling, you cannot use word2vec to study word relationships in a single text), so if you want to perform a comparative analysis, you will need to prepare and train more than one model. Have “Against Cleaning” in mind as you prepare your corpus (or corpora) for analysis—consider whether you want to regularize (or even lemmatize) the words in your corpus, whether you want to remove any kinds of textual content before you train the model (almost certainly metadata, perhaps things like speaker labels or editorial notes), and what stop words you want to use if the default list isn’t adequate for your purposes. You might also think about tokenization—are there any terms that you want to treat as a single token (for example, “new” and “york” when they refer to the location)? Experiment with different data preparation and model training processes (testing the results of different iterations and window sizes, for example), so that you will be able to articulate in your final post how you chose the preparation processes that you did—and what these enable you to say about your particular collection.
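The preparation choices described above (regularizing case, removing stop words, joining multi-word terms into one token) can be sketched in base R as follows. This is an illustration under stated assumptions, not the course's required pipeline: `prepare_text`, its arguments, and the example stop-word list are all hypothetical. Note that this sketch joins every occurrence of a phrase and cannot tell a place name from other uses, and that stripping punctuation also splits contractions.

```r
# Hypothetical helper: lowercase the text, drop punctuation and digits,
# join chosen phrases into single tokens, and remove stop words.
prepare_text <- function(text, phrases = character(0), stop_words = character(0)) {
  text <- paste(text, collapse = " ")                # treat all lines as one string
  text <- tolower(text)                              # regularize case
  text <- gsub("[^[:alpha:][:space:]]+", " ", text)  # replace non-letters with spaces
  text <- gsub("\\s+", " ", text)                    # collapse runs of whitespace
  for (phrase in phrases) {
    # Assumes each phrase is plain lowercase words separated by single spaces
    joined <- gsub(" ", "_", phrase, fixed = TRUE)
    text <- gsub(paste0("\\b", phrase, "\\b"), joined, text, perl = TRUE)
  }
  tokens <- strsplit(trimws(text), "\\s+")[[1]]
  tokens <- tokens[!tokens %in% stop_words]          # drop stop words
  paste(tokens, collapse = " ")
}

example <- "She moved to New York in 1899; the city was new."
prepared <- prepare_text(
  example,
  phrases = c("new york"),
  stop_words = c("the", "to", "in", "was")           # illustrative list only
)
print(prepared)
```

Running variants of a function like this (with and without stop words, with different phrase lists) and training a model on each output is one way to document how your preparation choices affect your results.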
Once your model is trained, you will query it to examine word relationships—you can look at “closeness” for individual words or combinations of terms; you can also use other aspects of vector math (such as subtraction) and you can experiment with constructing analogies. You can also perform more complicated queries, such as those outlined in the “vignettes” and blog posts authored by Professor Benjamin Schmidt (all linked below in the “Examples” section).
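To make the vector math behind these queries concrete, here is a small base R sketch of cosine similarity ("closeness"), vector subtraction and addition, and an analogy query. The three-dimensional vectors are invented purely for illustration and do not come from any trained model; `cosine_similarity` and `closest_words` are hypothetical helper names. In practice you would use your word2vec package's own query functions on your trained model.

```r
# Toy 3-dimensional "embeddings", invented for illustration only.
# A real model has far more dimensions and learns its values from the corpus.
toy_vectors <- rbind(
  king  = c(0.9, 0.8, 0.1),
  queen = c(0.9, 0.1, 0.8),
  man   = c(0.1, 0.9, 0.1),
  woman = c(0.1, 0.1, 0.9),
  apple = c(0.5, 0.5, 0.5)
)

# Cosine similarity: the cosine of the angle between two vectors
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Rank every word in the matrix by its similarity to a target vector
closest_words <- function(vectors, target, n = 3) {
  sims <- apply(vectors, 1, cosine_similarity, b = target)
  head(sort(sims, decreasing = TRUE), n)
}

# "Closeness" for a single word (the word itself always ranks first)
print(closest_words(toy_vectors, toy_vectors["king", ]))

# Analogy by vector arithmetic: king - man + woman
target <- toy_vectors["king", ] - toy_vectors["man", ] + toy_vectors["woman", ]
analogy <- closest_words(toy_vectors, target)
print(analogy)
```

With these invented values, the word nearest to `king - man + woman` is `queen`. Results from a real model are noisier and depend heavily on corpus size and preparation.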
The final step of this assignment asks you to write up your results in the form of a blog post. You can look at the examples below to get a better sense of how scholars in the digital humanities discuss their research in blog posts: these are less formal than peer-reviewed scholarly articles, but they are still serious research and are likely to be more formal than many blog posts you’ve read.
In your post, make sure to discuss your data preparation and model training processes, as well as the most significant results from your queries into your model, while also articulating the research question(s) you were trying to answer in your word2vec experimentation. You should talk about what your results mean and discuss the directions you think future research should take. You should also situate your own work within existing scholarship on your research question, discussing the relationship between your research question and the context provided by one or two scholarly articles. You don’t need to go into an extended explanation of word vector analysis, but you should set out the basics clearly enough that those unfamiliar with vector space models can still follow your arguments.
As you have seen from our examples, this kind of post can be quite long; yours should be at least 1,000 words (and probably not much more than 2,000). You will publish your blog post on our class website by midnight on November 17 (please note that this is public facing; spelling and grammar count). We will provide you with credentials to log into our class site and instructions for how to add your post.
At the end of this unit, you should submit:
- The .bin file with your trained model
- The file or folder with your corpus
- The blog post in which you write up your results
Because these files will be too large to send by email, we will set up a class Dropbox folder where you can upload the .bin file and the files/folders with your corpus. We will also show you how to link to this Dropbox folder from your blog post, if you wish to share your files there.
If you are interested in comparing models or would like to create a particularly large corpus, you can work in pairs or groups on this assignment. Group work requires prior approval, and each student will be expected to write and submit their own post.
Grading Rubric (20% each):
- Corpus appropriately scoped, selected, and prepared
- Compelling research questions clearly defined in relation to corpus
- Model trained correctly in R
- Research questions appropriately explored through queries of corpus with word2vec
- Blog post explaining results is clear, complete, and free of grammar and spelling errors; post connects research question with at least one scholarly article
There will be a grading penalty for late and incomplete submissions.
Examples
- Ryan Heuser ECCO exploration and response by Gabriel Recchia
- Ben Schmidt introduction and exploration vignettes
- Ben Schmidt post on rejecting the gender binary
- Ben Schmidt introductory post on word2vec
- Lynn Cherny Pride and Prejudice visualization
- Jonathan Fitzgerald WWO exploration and Shiny app
Links and Resources
- Our RStudio Server instance
- Lincoln Mullen: “Setting up R” from Digital History Methods
- word2vec basics file
- word2vec introductory exercise
- word2vec demo file
- word2vec template file
- word vectors folder
- DH Toychest list of datasets
- Kaggle datasets
- Download link for RStudio
- Download link for R
- “The Soul Through Life and Death in James Joyce” by Aislin Black
- “Political Authority in Children’s Literature” by Li Breite
- “Women in Wartime” by Giselle Briand
- “Gender Archetypes in 1960s Pulp Science Fiction” by Kieran Croucher
- “Portrayal of Women in Popular Culture Magazines” by Vanessa Gregorchik
- “Women in Engineering: How decades of stereotypes are impacting the field” by Bryce Griffin
- “Marriage and Race Before and After the Civil War” by Danielle Nguyen
- “Men and Women Discuss Men and Women: Gender in Victorian Novels” by Ciara McAloon
- “Government Representation in Dystopian Literature” by Nupur Neogi
- “The Dual Meaning of Christianity to Ex-slaves in the American South” by Benjamin Quiring
- “Are Russian Folklores Influenced Greatly by European or Asian Folklores?” by Sheetal Singh