This project has two major activities: first, you will explore basic computational text analysis using several web-based tools; second, you will learn more advanced methods of machine learning using the programming language R.
Introduction to text analysis
Due Date, Part One:
March 22, posted to Canvas before midnight
In the first part of this unit, you will be introduced to computational text analysis using several web-based tools to study various versions of the Inkle and Yarico account. You will then locate an Inkle and Yarico account of your own, transcribe it (or a selection from it, for longer versions), create metadata for it, and analyze the text with our web-based tools. You will write a brief (600–800 word) blog post on our class site describing your findings and explaining the rationale behind your metadata decisions. We will provide you with credentials to log into our class site and instructions for how to add your post.
At the end of this part, you should submit:
- Your transcription (posted to Canvas; please submit as an attached plain text [.txt] file)
- The metadata for your archival item (posted to Canvas; please either include this with the .txt transcription file or send as a separate Word doc, Google Doc, or .pdf file)
- The blog post in which you write up your results (by ‘publishing’ the post on our class site)
Word embedding models
Due Date, Part Two:
April 9, posted to Canvas before midnight
Introduction: In this assignment, you will use the programming language R to prepare a corpus of texts for analysis with the word2vec package. You should identify selection criteria for your corpus, complete whatever processes are necessary to make text files that can be loaded into RStudio, train your model or models, and then conduct analysis with your trained model(s). You are encouraged to experiment with different parameters and data-preparation processes, and you will be asked to explain how you arrived at your decisions for model preparation and training in your final writeup for this assignment.
Create Corpus: First, you should create a corpus with at least 800,000 words; a million or more words will give you more reliable results. Be thoughtful in how you select and prepare texts for your corpus—you should choose texts that will enable you to get at research questions you’re interested in exploring.
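To check whether your corpus meets the 800,000-word minimum, you can count words directly in R. The following is a minimal sketch using only base R. The function name `count_words` and the demo folder are illustrative assumptions, and splitting on whitespace gives only a rough count that may differ slightly from the number of tokens your model ends up seeing.

```r
# Rough word count for every .txt file in a folder (including subfolders).
# count_words is a hypothetical helper, not part of any package.
count_words <- function(dir) {
  files <- list.files(dir, pattern = "\\.txt$", full.names = TRUE, recursive = TRUE)
  counts <- vapply(files, function(f) {
    # Read the whole file and collapse it into one string
    text <- paste(readLines(f, warn = FALSE, encoding = "UTF-8"), collapse = " ")
    # Split on runs of whitespace and count the non-empty pieces
    tokens <- strsplit(text, "\\s+")[[1]]
    sum(nzchar(tokens))
  }, numeric(1))
  sum(counts)
}

# Demo on a throwaway folder so the sketch runs as-is
demo_dir <- file.path(tempdir(), "demo_corpus")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("Inkle and Yarico", "a tale retold"), file.path(demo_dir, "sample.txt"))

total <- count_words(demo_dir)
cat("Total words:", total, "\n")
if (total < 800000) cat("Below the 800,000-word minimum\n")
```

To check your own corpus, call `count_words()` with the path to your folder of .txt files in place of the demo folder.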
Train Your Model: Remember that word2vec lets you ask questions only at the level of the full corpus, so if you want to perform a comparative analysis, you will need to prepare and train more than one model. As you prepare your corpus (or corpora) for analysis, consider whether you want to remove any kinds of textual content before you train the model (almost certainly metadata, perhaps things like speaker labels or editorial notes). You might also think about tokenization—are there any terms that you want to treat as a single token (for example, “new” and “york” when they refer to the location)? Experiment with different data preparation and model training processes (testing the results of different iterations, numbers of vectors, and window sizes, for example), so that you will be able to articulate in your final post how you chose the preparation processes that you did—and what these enable you to say about your particular collection.
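As one illustration of the tokenization decision described above, here is a minimal base R sketch that lowercases text and joins chosen multiword terms into single tokens. The helper name `prepare_text`, the phrase list, and the parameter values in the grid are all hypothetical; the grid is only a way of keeping track of the training runs you try, and its column names are labels, not arguments of any particular training function.

```r
# Hypothetical list of multiword terms to treat as single tokens;
# build your own list from what matters in your corpus.
phrases <- c("new york", "new england")

# prepare_text is an illustrative helper, not part of any package.
prepare_text <- function(text, phrases) {
  text <- tolower(text)
  for (p in phrases) {
    joined <- gsub(" ", "_", p, fixed = TRUE)
    # \\b (word boundary) keeps "new york" from matching inside "new yorkshire"
    text <- gsub(paste0("\\b", p, "\\b"), joined, text, perl = TRUE)
  }
  text
}

example_text <- "She sailed from New York to New England."
cleaned <- prepare_text(example_text, phrases)
print(cleaned)

# A simple log of the parameter combinations you plan to test.
# The values here are placeholders, not recommendations.
runs <- expand.grid(vectors = c(100, 300), window = c(6, 12), iterations = c(5, 10))
runs$notes <- ""  # fill in observations after each training run
print(runs)
```

Keeping a table like `runs` as you experiment makes it easier to explain in your final post why you settled on one set of choices over another.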
Identify Research Question and Run Queries: Once your model is trained, you will query it to examine word relationships—you can look at cosine similarity for individual words or combinations of terms; you can also use other aspects of vector math (such as subtraction) and you can experiment with constructing analogies. You can also perform more complicated queries, such as those outlined in the “vignettes” and blog posts authored by Professor Benjamin Schmidt (all linked below in the “Examples” section).
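The queries described above rest on a few simple vector operations. Here is a minimal sketch in base R using made-up three-dimensional vectors; a trained model will have far more dimensions and learned values. The helper names `cosine` and `nearest` are illustrative assumptions and are not functions from the word2vec package, which has its own query functions.

```r
# Toy "embeddings" with invented values, for illustration only
vectors <- rbind(
  king  = c(0.9, 0.8, 0.1),
  queen = c(0.9, 0.1, 0.8),
  man   = c(0.1, 0.9, 0.1),
  woman = c(0.1, 0.1, 0.9)
)

# Cosine similarity: dot product divided by the product of the vector lengths
cosine <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Rank every word in the matrix by its cosine similarity to a target vector
nearest <- function(target, vectors, n = 3) {
  sims <- apply(vectors, 1, cosine, b = target)
  head(sort(sims, decreasing = TRUE), n)
}

# Similarity between two individual words
print(cosine(vectors["king", ], vectors["queen", ]))

# Analogy by vector math: king - man + woman
target <- vectors["king", ] - vectors["man", ] + vectors["woman", ]
print(nearest(target, vectors, n = 2))
```

For these toy values, "queen" ranks first for the analogy query. With a real model, results like this depend heavily on your corpus and training choices, so treat them as evidence to interpret and not as fixed facts about the words.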
Write Up Results: The final step of this assignment asks you to write up your results in the form of a blog post (you can look at the examples below to get a better sense of how scholars in the digital humanities discuss their research in blog posts—these are less formal than peer reviewed scholarly articles, but they are still serious research and are likely to be more formal than many blog posts you’ve read). In your post, make sure to discuss your data preparation and model training processes, as well as the most significant results from your queries into your model, while also articulating the research question(s) you were investigating in your word2vec experimentation. You should talk about what your results mean and discuss the directions you think future research should take. You don’t need to go into an extended explanation of word vector analysis, but you should set out the basics clearly enough that those unfamiliar with vector space models can still follow your arguments. As you have seen from our examples, this kind of post can be quite long; yours should be at least 1,000 words (and probably not much more than 2,000). You will publish your blog post on our class website: please note that this is public facing; spelling and grammar count.
In your blog post, you should situate your own work within existing scholarship on your research question, discussing the relationship between your research and the context provided by one or two scholarly articles.
To download your .bin file from RStudio Server, select the file using the check box on the left, then hit the “More” button near the top of the file viewer pane (with the little gear next to it), and choose “Export.” From there, follow the prompts to download your .bin file onto your computer.
At the end of this part, you should submit:
- The .bin file with your trained model (posted to Canvas or Dropbox)
- The folder with your corpus (posted to Canvas or Dropbox)
- The blog post in which you write up your results (by “publishing” the blog post on our class site)
If your corpus folder and .bin file are too large to post to Canvas (larger than 500MB), you can submit them using Dropbox instead. Go to this link and upload your materials; please put everything in a single folder.
If you are interested in comparing models or would like to create a particularly large corpus, you can work in pairs or groups on this assignment. Group work requires prior approval, and each student will be expected to write and submit their own post.
Grading Rubric (20% each):
- Corpus appropriately scoped, selected, and prepared
- Compelling research questions clearly defined in relation to corpus
- Model trained correctly in R
- Research questions appropriately explored through queries of corpus with word2vec
- Blog post explaining results is clear, complete, and free of grammar and spelling errors; post connects research question with one or two scholarly articles
There will be a grading penalty for late and incomplete submissions.
- Word vectors folder, with data and walkthroughs
- Our RStudio Server instance
- You will need the VPN client installed to use the RStudio Server instance and to log into the class WordPress. There is a FAQ here with links to installation instructions (note that if you have a Mac, you may need to go into System Preferences and Security & Privacy, then allow the VPN to open).
Examples
- Ryan Heuser ECCO exploration and response by Gabriel Recchia
- Ben Schmidt introduction and exploration vignettes
- Ben Schmidt post on rejecting the gender binary
- Ben Schmidt introductory post on word2vec
- Lynn Cherney Pride and Prejudice visualization
- Becky Standard post on gender and labor
- James Clawson post on modernist literature
Links and Resources
- Women Writers Vector Toolkit and Word Vector Interface
- Lincoln Mullen: “Setting up R” from Digital History Methods
- DH Toychest list of datasets
- Kaggle datasets
- NULab list of datasets
- Download link for RStudio
- Download link for R
- VPN instructions
Example Student Projects
- “Consent and Marriage” by Alden Gisholt Minard
- “Why Gender and Identity in Young Adult Literature Matters” by Dara Sostek
- “University Identities: Are they truly unique?” by Nick Payne
- “Using Twitter Data for Political Science: A Brief Crash Course” by Sarah Bernt
- “Agencies of Power in Children’s Literature” by Daliyah Middleton
- “Why Do the Red Sox Care More About the Yankees than the Yankees Do About the Red Sox, and What Is the Discourse Surrounding This Disparity?” by Kelly Fleming
- “What Kind of Memes are Created and Maintained in a Participatory Internet Subculture” by Christian Hauser
- “Freedom in the Eyes of Poetry” by Katie McColgan
- “LGBTQ Characters at Hogwarts: Using Data Analysis to Prove Fan Theories?” by Emma Reed
- “Freedom and Societal Expectations in the Golden Age of Children’s Literature” by Laura Packard
- “Presidential Speeches and Divisiveness” by Emily Hontoria
- “Scientific Truths are Not Self Evident: Science, Perception, and Identity in America Since the 1960s” by Niall Dalton
- “The Soul Through Life and Death in James Joyce” by Aislin Black
- “Political Authority in Children’s Literature” by Li Breite
- “Women in Wartime” by Giselle Briand
- “Gender Archetypes in 1960s Pulp Science Fiction” by Kieran Croucher
- “Portrayal of Women in Popular Culture Magazines” by Vanessa Gregorchik
- “Women in Engineering: How decades of stereotypes are impacting the field” by Bryce Griffin
- “Marriage and Race Before and After the Civil War” by Danielle Nguyen
- “Men and Women Discuss Men and Women: Gender in Victorian Novels” by Ciara McAloon
- “Government Representation in Dystopian Literature” by Nupur Neogi
- “The Dual Meaning of Christianity to Ex-slaves in the American South” by Benjamin Quiring
- “Are Russian Folklores Influenced Greatly by European or Asian Folklores?” by Sheetal Singh