Defining Digital Humanities with Digital Humanities
This post defines a process of defining Digital Humanities recursively.
Our naive technique will use simple text acquisition to create tables and word clouds that represent the context of URLs related to the Digital Humanities. The definition will not be in words; instead, it will be determined by the reader when observing patterns in the data (word cloud/frequency).
Our first decision is selecting sources to acquire our text. Because this is a ‘quick’ process, we will crawl the web manually at a small scale. An automated approach could be used to increase the volume of our data set and look for more general convergences in the future.
We now define the steps to determine which documents (URLs) we will include in our data set.
Our manual data crawler will use three search engines, referred to below as (1), (2), and (3).
These three queries will be run (selected empirically):
- (a) “Definition of Digital Humanities”
- (b) “Digital Humanities”
- (c) “Digital Humanities Research”
The web browser used will be Google Chrome with incognito mode to prevent profile bias. Note this will not prevent location bias.
The selection criteria are as follows:
1. Run query (a) with search engine (1).
2. Select the first normal result (no snippet, pre-parsed URL, or advertised URL).
3. If this document has not already been added, add it; otherwise, move down the results page and add the first non-selected document.
4. Repeat steps 1–3 with search engines (2) and (3).
5. Repeat steps 1–3 with queries (b) and (c).
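The selection steps above can be sketched in code. This is only an illustration of the manual procedure: `fetch_results` is a hypothetical stand-in for a human scanning a results page top to bottom, not a real API.

```python
def select_documents(queries, engines, fetch_results):
    """Pick one previously unseen 'normal' result per (query, engine) pair."""
    selected = []
    for query in queries:
        for engine in engines:
            # fetch_results yields candidate URLs in results-page order
            for url in fetch_results(engine, query):
                if url not in selected:   # skip already-selected documents
                    selected.append(url)
                    break
    return selected
```

With 3 queries and 3 engines this yields at most 9 distinct URLs, matching the list below.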
This process resulted in the following URLs (2/15/2016).
- URL_1a: http://digitalhumanities.org/answers/topic/what-is-digital-humanities
- URL_2a: https://en.wikipedia.org/wiki/Digital_humanities
- URL_3a: http://whatisdigitalhumanities.com/
- URL_1b: http://digitalhumanitiesnow.org/
- URL_2b: http://shc.stanford.edu/digital-humanities/
- URL_3b: http://www.cdh.ucla.edu/
- URL_1c: http://guides.library.ucla.edu/c.php?g=180354&p=1186136
- URL_2c: http://idrh.ku.edu/
- URL_3c: http://www.mitpress.mit.edu/sites/default/files/titles/content/9780262018470_Open_Access_Edition.pdf
With a simple, repeatable process we have found our data set. The process may be repeated at another time, or in another location, to compare results or track progression.
Now that we have our data, we must decide whether to use the raw data or parse it. We could go through each file, remove the HTML tags, and find the content blocks to make sure every word we analyze appears on the website, or…
… let's include what's under the hood and see what Digital Humanities is really made of on the web (what's your opinion of this? Do you see any drawbacks? Positives?).
Our first technique will be parsing the raw data with a simple text processor that will tokenize (split the words), stop (remove uninformative words like ‘the’ and ‘you’), and naively stem (e.g., running → run, exceedingly → exceed) the 9 documents.
The stop words being used can be found here: https://sourceforge.net/p/lemur/galago/ci/default/tree/core/src/main/resources/stopwords/inquery
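A minimal sketch of the tokenize-and-stop stages might look like the following. `STOPWORDS` here is a tiny illustrative subset; the actual run uses the full INQUERY list linked above.

```python
import re

# Illustrative subset only -- the real list is the INQUERY stopword file.
STOPWORDS = {"the", "you", "a", "an", "of", "and", "is", "to", "in"}

def tokenize(text):
    """Lowercase and split on any run of non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def stop(tokens):
    """Remove uninformative words."""
    return [t for t in tokens if t not in STOPWORDS]
```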
The stemming will be very simple, following only two rules of the Porter stemming technique*:
- Replace sses by ss (e.g., stresses → stress).
- Delete s if the preceding word part contains a vowel not immediately before the s (e.g., gaps → gap but gas → gas).
- Replace ied or ies by i if preceded by more than one letter, otherwise by ie (e.g., ties → tie, cries → cri).
- If suffix is us or ss do nothing (e.g., stress → stress).
- Replace eed, eedly by ee if it is in the part of the word after the first non-vowel following a vowel (e.g., agreed → agree, feed → feed).
- Delete ed, edly, ing, ingly if the preceding word part contains a vowel, and then if the word ends in at, bl, or iz add e (e.g., fished → fish, pirating → pirate), or if the word ends with a double letter that is not ll, ss, or zz, remove the last letter (e.g., falling → fall, dripping → drip), or if the word is short, add e (e.g., hoping → hope).
*More information on page 82: http://ciir.cs.umass.edu/downloads/SEIRiP.pdf
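The rules above can be transcribed fairly directly into code. This is a sketch, not the exact implementation used for the post: in particular, Porter's “measure” and “short word” conditions are approximated crudely here (a vowel-consonant scan and a length cutoff).

```python
VOWELS = set("aeiou")

def _has_vc(s):
    # Crude stand-in for Porter's "vowel followed by a non-vowel" condition.
    return any(s[i] in VOWELS and s[i + 1] not in VOWELS for i in range(len(s) - 1))

def step_1a(word):
    # Plural endings.
    if word.endswith("sses"):
        return word[:-2]                        # stresses -> stress
    if word.endswith(("ied", "ies")):
        # Replace by "i" if preceded by more than one letter, else by "ie".
        return word[:-2] if len(word) > 4 else word[:-1]
    if word.endswith(("us", "ss")):
        return word                             # stress -> stress
    if word.endswith("s") and any(c in VOWELS for c in word[:-2]):
        return word[:-1]                        # gaps -> gap, but gas -> gas
    return word

def step_1b(word):
    # Verb endings.
    for suf in ("eedly", "eed"):
        if word.endswith(suf):
            root = word[: -len(suf)]
            return root + "ee" if _has_vc(root) else word   # agreed -> agree
    for suf in ("ingly", "edly", "ing", "ed"):
        if word.endswith(suf):
            root = word[: -len(suf)]
            if not any(c in VOWELS for c in root):
                return word
            if root.endswith(("at", "bl", "iz")):
                return root + "e"               # pirating -> pirate
            if (len(root) >= 2 and root[-1] == root[-2]
                    and root[-2:] not in ("ll", "ss", "zz")):
                return root[:-1]                # dripping -> drip
            if len(root) <= 3:                  # crude "short word" test
                return root + "e"               # hoping -> hope
            return root                         # fished -> fish
    return word

def stem(word):
    return step_1b(step_1a(word))
```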
First, we list the top 25 most frequently used ‘tokens’ that our parser finds. Observing this data, we struggle to find much value in most of the rankings. But notice that the 9th document (found with the search “Digital Humanities Research” on Yahoo!) was a PDF of a book named Digital Humanities. Scroll through Figure 1 below to see the patterns:
Figure 1: Top 25 Rankings of 9 URLs
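Given the processed token list for one document (here a hypothetical `tokens` list), the per-URL rankings behind Figure 1 reduce to a frequency count with the standard library:

```python
from collections import Counter

def top_tokens(tokens, n=25):
    """Return the n most frequent tokens with their counts."""
    return Counter(tokens).most_common(n)
```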
Well, we have our definition of Digital Humanities in tabular form, but it looks more like parentheses and code than anything we thought we were interested in.
So Our Data Stinks! (?)
Depends. We can gain a lot of insight from the web-development patterns across the 8 websites, and valuable information about the terms associated with the book, Digital Humanities, from the last link. This process, done at a smaller scale, is valuable because it shows the importance of early decisions in a project. Although intuition in this case would prevent the decision to keep the raw HTML tags, it is curious to see how ‘code-heavy’ websites are relative to the content displayed on the web page. If a researcher were more inclined to find Digital Humanities content with this process, they would much rather have 9 PDFs about Digital Humanities than raw HTML files. By keeping everything at a smaller scale to start, we successfully ‘failed fast’ in this regard.
With this first representation of Digital Humanities out of the way, let’s now create our word cloud based on the tokenized words we generated. Maybe it will be more informative. We have 84,211 of them, so something must have converged (?). To simplify the task, we will use wordle.net (Settings: Font: Primer Print Medium, Straighter Edges, Mostly Horizontal, yramirP, A Little Variation, Do Not Remove Common Words).
Remember, we processed all of the text through tokenization (splitting), stopping (removing irrelevant words), and stemming (merging tenses/etc).
Figure 2: Word Cloud of Formatted Data
If we observe this word cloud we see the consequences of keeping all of the HTML, but under all of the larger tags, we see the more humanly valuable words such as research, information, history, and scholarship. An interesting remark is that this represents what a lot of Digital Humanities experts experience: constant wrangling with code/technology/data in order to pull forth the more valuable aspects of their work.
So what good has our process done for Digital Humanities? Did it even get us closer to our goal of creating a visualization of the Digital Humanities? Have we pulled forth, even a little, anything valuable? Is it ‘the Digital Humanities’ or just ‘Digital Humanities’?! Let’s go back to the data.
Comparison: Raw Data
Observe Figure 3 below. A result of no processing.
Figure 3: Word Cloud of Raw Data
Like an onion, the raw data shows a new layer of words that inhabit our language, casting aside some of the more valuable words. Our process successfully peeled back a small portion of the noise. And that is all that is needed, for the data and process still exist, waiting for the next Digital Humanities researcher to iterate just a little bit more in pursuit of the Definition of Digital Humanities.