The scientific world is best known for its discoveries. What the average person knows of science is mostly the accumulated knowledge that makes up our understanding today. But after coming to college and speaking with research professors in STEM fields, I’ve realized how poorly the actual process of research fits my Good Will Hunting-esque image of a brilliant professor writing an equation on a chalkboard and achieving scientific superstardom. I’ve discovered a large amount of red tape and many roadblocks in the academic world that make real research a relatively small part of the research process. As a result, I’ve become curious about what this process is really like, and how often it shows itself in the public eye.
Deciding on a Corpus:
Initially, I had a different concept for a corpus, and hadn’t considered exploring this curiosity through text analysis. I had found a database with a few different sets of religious publications, and intended to combine them and analyze how different religious perspectives discuss similar topics. Unfortunately, parts of these publications were encoded differently, making some of them incompatible with the RStudio server. Without these incompatible pieces, the corpus became too small to yield any worthwhile information, so the corpus had to change. When I discovered this, I happened to be in a Computer Science Research class, and the professor was discussing a paper that attempted to skip over the red tape of research. This project, called Encore (Citation 1), ignored some rules of ethics and didn’t seek certain approvals. It was shunned from publications and considerations, and the researchers behind it lost what was referred to in class as “scientific respectability”. While the research was questionable, no explicit harm came from it; even so, ignoring these obligations and restrictions can seriously harm a researcher’s career. This inspired me to turn to science publications instead of religious publications, and to explore how this culture surrounding research makes its way into the larger public eye. I found four different databases of scientific publications from the 1990s and before, each covering a different area of research, most of which are known for their heavy regulation. At the end of the search, I had a corpus of news publications about research and discovery in the fields of space, medicine, electronics, and cryptology/data security, totaling over 1.2 million words: a solid, trainable corpus.
Identifying a Central Question:
With this corpus defined, I turned to some of the reactions to the Encore paper, and how the community responded to an innovative project that was unsupported by the organizations that control research. One response paper (Citation 2), which is both critical and supportive of the project, pointed out issues with the research process that made it nearly impossible for a project like Encore to truly adhere to the rules and still be worth exploring. The paper states, “The uncertainty induced by the potential rejection of research by program committees might lead to researchers abandoning some research ideas or entire areas of research — especially research that pursues methodological innovation”, suggesting that Encore either had to exist beyond the restrictions of the committees or not exist at all. This guided me to investigate how these kinds of restrictions reach real publications, as I was confident that this ever-present body that controls research would be mentioned several times throughout the text. If this were true, we would then be able to see, more specifically, what opinions of research-controlling bodies were held at the time of this corpus’ publications. To put this as a question: What opinion of research regulation appears in research publications from before the year 2000?
Investigating the Central Question:
Two major developments shaped the historical context of research restriction. The first was the foundation of IRBs (Institutional Review Boards) following World War II, where abusive human experimentation harmed many individuals, leaving minorities and other disenfranchised groups in particular subject to the horrors of dangerous human testing. The second was the Belmont Report, a document setting guidelines for human-centered research that was widely accepted as a regulatory document. Knowing that both of these were in play during this period of research, we can relate specific words to our trained model of research publications. We begin with some clusters that could relate to this issue, which were found mostly by chance rather than through direct queries. The following clusters were considered notable to this question:
Cluster 1: without, access, order, users, legal, police, court, illegal, allow
Cluster 2: political, committee, party, opportunity, libertarian, platform
Cluster 3: issue, administration, Clinton, proposal, agency, proposed
We can see from Cluster 1 that there is significant discussion of the law, meaning that legality comes up when research is discussed. There is also significant discussion of politics, as can be seen in Clusters 2 and 3, which cover political parties and elected bodies, despite none of the groups of publications being about politics. While this is not enough information to draw conclusions from, it makes it reasonable to assume that there is some relationship between the research community and the government that creates these research regulations.
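To make the idea of a cluster more concrete, here is a minimal sketch of how words can be grouped by their vectors. It assumes the clusters came from something like k-means with a random starting point, which I have not stated above, and it uses made-up two-dimensional toy vectors rather than the trained model. The function name `kmeans`, the `seed` parameter, and every number in `toy_vectors` are illustrative assumptions.

```python
import random


def kmeans(vectors, k, seed=0, iterations=20):
    """Group words into k clusters by Euclidean distance between their vectors."""
    rng = random.Random(seed)
    words = sorted(vectors)
    # Start from k randomly chosen words; a different seed can surface different clusters.
    centroids = [list(vectors[w]) for w in rng.sample(words, k)]
    assignment = {}
    for _ in range(iterations):
        new_assignment = {}
        for w in words:
            distances = [
                sum((a - b) ** 2 for a, b in zip(vectors[w], c)) for c in centroids
            ]
            new_assignment[w] = distances.index(min(distances))
        if new_assignment == assignment:
            break  # nothing moved, so the clusters are stable
        assignment = new_assignment
        for i in range(k):
            members = [vectors[w] for w in words if assignment[w] == i]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    clusters = [[] for _ in range(k)]
    for w in words:
        clusters[assignment[w]].append(w)
    return clusters


# Toy vectors invented for illustration only; they are not values from the trained model.
toy_vectors = {
    "legal": [1.0, 0.0],
    "court": [0.9, 0.1],
    "police": [0.95, 0.05],
    "diet": [0.0, 1.0],
    "vitamin": [0.1, 0.9],
    "protein": [0.05, 0.95],
}

for cluster in kmeans(toy_vectors, k=2):
    print(cluster)
```

Because the starting centroids are random, which clusters show up in a real model depends partly on chance, which matches how the clusters above were found.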
From here, we begin specific queries, looking for words that could relate to the general idea of regulation. I began with the word “safety”, attempting to see whether the model treats safety in science more as a moral necessity or a legal one. When finding the words closest to “safety”, we find words like “restraints”, “mandatory”, and “incompetence”. While this naturally characterizes safety as a requirement and not a goal, the word “incompetence” does not immediately fit this idea. However, when considering the cause of this regulation, it is possible that the model relates “safety” to “mandatory” because human incompetence is the cause of the mandatory safety requirements. This is further suggested by a query on “human”, which comically returns “stupidity” as its closest related word. Given the obvious connection between stupidity and incompetence, we can connect these two queries to form a reasonable picture: the purpose of regulation is seen as preventing foolish researchers and subjects from inadvertently putting themselves in danger. While this is optimistic compared to treating research dangers as malicious acts, it does show a generally negative perspective on research restrictions, as researchers who likely think highly of themselves may pin their frustration and blame for the rules on less intelligent individuals who require these bounds. This negative opinion also shows itself in a query of “regulate”, which returns the words “conform”, “enforce”, “rules”, and “oppose”, in that order. These words, specifically “conform” coming before the rest and “oppose” coming shortly after, reflect a rather negative opinion of regulations in the science community. Conforming seems to reflect a kind of binding to a certain practice, suggesting that researchers likely feel there is only one way to operate in a research community, which inhibits projects like Encore that were innovative but outside the bounds of ethical research guidelines.
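For readers unfamiliar with what a “closest words” query does, the sketch below shows the usual idea: rank every other word by the cosine similarity between its vector and the query word’s vector. This is an illustration under assumptions, not the code I ran; the helper names `cosine` and `closest_to` and all values in `toy_vectors` are made up, and a real model would have hundreds of dimensions rather than three.

```python
import math


def cosine(u, v):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def closest_to(vectors, word, n=3):
    """Return the n words most similar to `word`, best match first."""
    target = vectors[word]
    scored = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != word  # a word is trivially closest to itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Toy vectors invented for illustration only; they are not values from the trained model.
toy_vectors = {
    "safety": [0.9, 0.8, 0.1],
    "mandatory": [0.8, 0.9, 0.2],
    "restraints": [0.7, 0.6, 0.0],
    "protein": [0.0, 0.1, 0.9],
}

for neighbor, score in closest_to(toy_vectors, "safety"):
    print(neighbor, round(score, 3))
```

The order of the results matters in the same way it does for the “regulate” query: words earlier in the list point in a direction more similar to the query word.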
Unrelated Notable Findings:
While not directly related to the central question, there were a few other queries and clusters that I found interesting and thought would be worth sharing. One cluster that was particularly notable was a health-focused cluster:
Health Cluster: Diet, Vitamin, Blood, Fat, Increase, Protein, Levels
Across many clustering runs, it was rare for a medical cluster to appear, but this cluster was clearly filled with tightly related words, revealing that a large amount of discussion in the medical category concerned health, wellness, and dieting. This may be a result of more popular topics being included in the corpus. While most medical topics would go over the head of the average individual, writings about health are often very popular, as people are often concerned with their appearance and general health. This may be important to consider for the corpus as a whole, as topics that interest readers may appear more often than topics that do not.
A few queries were also particularly interesting. Gender-based biases were immediately obvious when querying “men” and “women” and subtracting each gender’s results from the other’s. As might be expected, the most related term to “women” was “children”, but the most related term to “men” ended up being “infections”. While this likely comes from the medical category, I struggle to find a reason to relate these two, and am open to ideas of why this may be (my only thought at the moment being several AIDS-related works, considering this is primarily 1990s research). The final two queries were “religion” and “ethical”, whose most connected words were “prohibiting” and “constitutional”, respectively. For “religion”, this suggests that the science community associates religion with restrictions and rules that scientists may not share, revealing a potential divide that is obvious today. For “ethical”, we can see that the ethics of science have, in a way, been outsourced to the government: because of government restrictions, the scientific community may consider ethics only in terms of legality rather than determining its own.
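The gender comparison above can be sketched as vector arithmetic: subtract one word’s vector from the other’s, then look for the words closest to the difference. This is a hedged illustration of the general technique, not a record of my exact query; the helper names `subtract` and `closest_to_vector` and every number in `toy_vectors` are invented for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def subtract(u, v):
    """Element-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]


def closest_to_vector(vectors, query, exclude=(), n=3):
    """Rank words by similarity to an arbitrary query vector, skipping `exclude`."""
    scored = [
        (word, cosine(query, vec))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Toy vectors invented for illustration only; they are not values from the trained model.
toy_vectors = {
    "men": [1.0, 0.2, 0.5],
    "women": [0.2, 1.0, 0.5],
    "infections": [0.9, 0.1, 0.3],
    "children": [0.1, 0.9, 0.4],
    "orbit": [0.3, 0.3, 0.9],
}

pair = ("men", "women")
men_minus_women = subtract(toy_vectors["men"], toy_vectors["women"])
women_minus_men = subtract(toy_vectors["women"], toy_vectors["men"])

print(closest_to_vector(toy_vectors, men_minus_women, exclude=pair, n=1))
print(closest_to_vector(toy_vectors, women_minus_men, exclude=pair, n=1))
```

Subtracting removes what the two words have in common, so the words that remain close to the difference are the ones the model associates with one gender and not the other.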
After examining texts from prior to the year 2000, I feel comfortable concluding that the sentiments expressed by my professor when discussing research restrictions are not new, and have existed for years before today. When negative words are related to research restrictions, it becomes clear that there isn’t a warm embrace of the rules of research, but rather a more begrudging acceptance. To extend this further, it may be worthwhile to develop larger corpora for analysis, while splitting the research into topic groups to determine which fields are the most affected by regulation. It’s possible, if not likely, that a field like electronics or cryptology did not face as much restriction because of its minimal use of human subjects, but it is impossible to determine that with this corpus because the subjects are grouped together. I expect a great deal of this regulation falls on medical research, which is notably slow because of the great care required to ensure that subjects aren’t harmed in the process. However, this issue does still likely permeate research as a whole, and the fact that it has not gone away after many decades indicates that there may not be a perfect balance of regulation and freedom to be found.
Citation 1: Encore
Sam Burnett and Nick Feamster. 2015. Encore: Lightweight Measurement of Web Censorship with Cross-Origin Requests. SIGCOMM Comput. Commun. Rev. 45, 4 (October 2015), 653–667. DOI: https://doi.org/10.1145/2829988.2787485
Citation 2: No Encore for Encore?
Arvind Narayanan and Bendert Zevenbergen. 2015. No Encore for Encore? Ethical Questions for Web-Based Censorship Measurement (September 24, 2015). Available at SSRN: https://ssrn.com/abstract=2665148 or http://dx.doi.org/10.2139/ssrn.2665148