PART II: Textual Corpora and Computational Text Analysis


This project has two major activities: first, you will explore basic computational text analysis using several web-based tools; second, you will learn more advanced machine learning methods using the programming language R.

Introduction to text analysis

Due Date, Part One:

March 22, posted to Canvas before midnight

In the first part of this unit, you will be introduced to computational text analysis using several web-based tools to study various versions of the Inkle and Yarico account. You will then locate an Inkle and Yarico account of your own, transcribe it (or a selection from it, for longer versions), create metadata for it, and analyze the text with our web-based tools. You will write a brief (600–800 word) blog post on our class site describing your findings and explaining the rationale behind your metadata decisions. We will provide you with credentials to log into our class site and instructions for how to add your post.

At the end of this part, you should submit:

  1. Your transcription (posted to Canvas; please submit as an attached plain text [.txt] file)
  2. The metadata for your archival item (posted to Canvas; please either include this with the .txt transcription file or send as a separate Word doc, Google Doc, or .pdf file)
  3. The blog post in which you write up your results (by ‘publishing’ the post on our class site)

Word embedding models

Due Date, Part Two:

April 9, posted to Canvas before midnight

Introduction: In this assignment, you will use the programming language R to prepare a corpus of texts for analysis with the word2vec package. You should identify selection criteria for your corpus, complete whatever processes are necessary to make text files that can be loaded into RStudio, train your model or models, and then conduct analysis with your trained model(s). You are encouraged to experiment with different parameters and data-preparation processes, and you will be asked to explain how you arrived at your decisions for model preparation and training in your final writeup for this assignment.

Create Corpus: First, you should create a corpus with at least 800,000 words; a million or more words will give you more reliable results. Be thoughtful in how you select and prepare texts for your corpus—you should choose texts that will enable you to get at research questions you’re interested in exploring.
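One way to check whether a corpus folder meets the 800,000-word minimum is to count the words in its plain text files before training. The sketch below is written in Python using only the standard library, purely as an illustration; it is not part of the R workflow for this assignment. It assumes your corpus is a single folder of UTF-8 .txt files and counts whitespace-separated words, so other tools may report slightly different totals. The file names in the demonstration are hypothetical.

```python
import tempfile
from pathlib import Path

MIN_WORDS = 800_000  # minimum corpus size stated in the assignment


def count_words(corpus_dir):
    """Return the total number of whitespace-separated words in all .txt files."""
    total = 0
    for path in sorted(Path(corpus_dir).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        total += len(text.split())
    return total


def meets_minimum(corpus_dir, minimum=MIN_WORDS):
    """Return True if the folder holds at least `minimum` words."""
    return count_words(corpus_dir) >= minimum


# Tiny demonstration on a throwaway folder (hypothetical file names).
with tempfile.TemporaryDirectory() as demo_dir:
    Path(demo_dir, "a.txt").write_text("Inkle and Yarico sailed away", encoding="utf-8")
    Path(demo_dir, "b.txt").write_text("a second short text", encoding="utf-8")
    demo_total = count_words(demo_dir)
    demo_ok = meets_minimum(demo_dir)

print(demo_total, demo_ok)
```

To use this on your own materials, you would call count_words with the path to your corpus folder in place of the throwaway demonstration folder.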

Train Your Model: Remember that word2vec lets you ask questions only at the level of the full corpus, so if you want to perform a comparative analysis, you will need to prepare and train more than one model. As you prepare your corpus (or corpora) for analysis, consider whether you want to remove any kinds of textual content before you train the model (almost certainly metadata, perhaps things like speaker labels or editorial notes). You might also think about tokenization—are there any terms that you want to treat as a single token (for example, “new” and “york” when they refer to the location)? Experiment with different data preparation and model training processes (testing the results of different iterations, numbers of vectors, and window sizes, for example), so that you will be able to articulate in your final post how you chose the preparation processes that you did—and what these enable you to say about your particular collection.
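The preparation decisions described above (removing editorial content and joining multi-word terms into a single token) can be sketched roughly as follows. This is a Python illustration using only the standard library, not the R code you will use in class. The conventions it relies on are assumptions made for the example: that editorial notes appear in square brackets, that speaker labels are all-caps names followed by a period at the start of a line, and that "new york" should always be joined. Your own corpus may need different rules.

```python
import re

# Hypothetical conventions, assumed for illustration only:
# editorial notes appear in [square brackets], and speaker labels are
# an all-caps name followed by a period at the start of a line.
EDITORIAL_NOTE = re.compile(r"\[[^\]]*\]")
SPEAKER_LABEL = re.compile(r"^[A-Z]{2,}\.\s*", flags=re.MULTILINE)

# Phrases to treat as single tokens; extend this to suit your corpus.
PHRASES = {"new york": "new_york"}


def prepare(text):
    """Strip notes and speaker labels, lowercase, join phrases, and tokenize."""
    text = EDITORIAL_NOTE.sub(" ", text)
    text = SPEAKER_LABEL.sub("", text)
    text = text.lower()
    for phrase, token in PHRASES.items():
        text = re.sub(r"\b" + re.escape(phrase) + r"\b", token, text)
    # Keep runs of letters, digits, and underscores as tokens.
    return re.findall(r"[a-z0-9_]+", text)


sample = "INKLE. I sailed to New York [editor's note: 1734] with a new ship."
tokens = prepare(sample)
print(tokens)
```

Note that this simple replacement joins every occurrence of the phrase, whether or not it refers to the location; deciding how to handle such cases is one of the choices you should be able to explain in your post.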

Identify Research Question and Run Queries: Once your model is trained, you will query it to examine word relationships—you can look at cosine similarity for individual words or combinations of terms; you can also use other aspects of vector math (such as subtraction) and you can experiment with constructing analogies. You can also perform more complicated queries, such as those outlined in the “vignettes” and blog posts authored by Professor Benjamin Schmidt (all linked below in the “Examples” section).
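To make the vector math behind these queries concrete, here is a small Python sketch (standard library only, and separate from the R tools used in class) of cosine similarity and a simple analogy built with subtraction and addition. The three-dimensional vectors are invented for illustration; a trained model would supply vectors with many more dimensions, and the helper names cosine and closest are hypothetical.

```python
import math

# Made-up 3-dimensional vectors, for illustration only.
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "ship": [0.5, 0.5, 0.5],
}


def cosine(u, v):
    """Cosine similarity: the dot product divided by the product of the lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def closest(target, vectors, exclude=(), n=1):
    """Return the n words whose vectors are most similar to the target vector."""
    scored = [
        (cosine(target, vec), word)
        for word, vec in vectors.items()
        if word not in exclude
    ]
    return [word for _, word in sorted(scored, reverse=True)[:n]]


# Analogy by vector arithmetic: king - man + woman.
target = [
    k - m + w
    for k, m, w in zip(toy_vectors["king"], toy_vectors["man"], toy_vectors["woman"])
]
analogy = closest(target, toy_vectors, exclude=("king", "man", "woman"))

print(round(cosine(toy_vectors["king"], toy_vectors["queen"]), 3))
print(analogy)
```

The words used to build the analogy are excluded from the search, since the result of the arithmetic usually stays close to its own inputs.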

Write Up Results: The final step of this assignment asks you to write up your results in the form of a blog post. You can look at the examples below to get a better sense of how scholars in the digital humanities discuss their research in blog posts: these are less formal than peer-reviewed scholarly articles, but they are still serious research and are likely to be more formal than many blog posts you’ve read. In your post, articulate the research question(s) you were investigating in your word2vec experimentation, and discuss your data preparation and model training processes as well as the most significant results from your queries into your model. You should talk about what your results mean and discuss the directions you think future research should take. You don’t need to go into an extended explanation of word vector analysis, but you should set out the basics clearly enough that those unfamiliar with vector space models can still follow your arguments. As you have seen from our examples, this kind of post can be quite long; yours should be at least 1,000 words (and probably not much more than 2,000). You will publish your blog post on our class website. Please note that this site is public-facing; spelling and grammar count.

In your blog post, you should situate your own work within existing scholarship on your research question, discussing the relationship between your research and the context provided by one or two scholarly articles.

To download your .bin file from RStudio Server, select the file using the check box on the left, then hit the “More” button near the top of the file viewer pane (with the little gear next to it), and choose “Export.” From there, follow the prompts to download your .bin file onto your computer.

At the end of this part, you should submit:

  1. The .bin file with your trained model (posted to Canvas or Dropbox)
  2. The folder with your corpus (posted to Canvas or Dropbox)
  3. The blog post in which you write up your results (by “publishing” the blog post on our class site)

If your corpus folder and .bin file are too large to post to Canvas (larger than 500MB), you can submit them using Dropbox instead. Go to this link and upload your materials; please put everything in a single folder.

If you are interested in comparing models or would like to create a particularly large corpus, you can work in pairs or groups on this assignment. Group work requires prior approval, and each student will be expected to write and submit their own post.

Grading Rubric (20% each):

  1. Corpus appropriately scoped, selected, and prepared
  2. Compelling research questions clearly defined in relation to corpus
  3. Model trained correctly in R
  4. Research questions appropriately explored through queries of corpus with word2vec
  5. Blog post explaining results is clear, complete, and free of grammar and spelling errors; post connects research question with one or two scholarly articles

There will be a grading penalty for late and incomplete submissions.

Exercise folder


Links and Resources

Example Student Projects

Spring 2021, Word2Vec projects

Fall 2018

Fall 2017