Defining Digital Humanities with Digital Humanities

This post describes a process for defining Digital Humanities recursively, that is, by using Digital Humanities methods.

Our naive technique uses simple text acquisition to build tables and word clouds that represent the content of URLs related to the Digital Humanities. The definition will not be given in words; instead, it is left to the reader to determine by observing patterns in the data (word clouds and frequency tables).

The Data

Our first decision is selecting the sources from which we acquire our text. Because this is a ‘quick’ process, we will crawl the web manually at a small scale. An automated approach could be used in the future to increase the volume of our data set and look for more general convergences.

We now define the steps to determine which documents (URLs) we will include in our data set.

Our manual data crawler will use the following search engines:

  1. Google
  2. Bing
  3. Yahoo!

These 3 queries will be run (selected empirically):

  • a.) “Definition of Digital Humanities”
  • b.) “Digital Humanities”
  • c.) “Digital Humanities Research”.

The web browser used will be Google Chrome in incognito mode to prevent profile bias. Note that this will not prevent location bias.

The selection criteria are as follows:

  1. Run query (a) with search engine (1).
  2. Select the first normal result (no snippet, pre-parsed URL, or advertised URL).
  3. If this document has not been added, add it. Otherwise, move down the results page and add the first non-selected document.
  4. Repeat steps 1-3 with search engines (2) and (3).
  5. Repeat steps 1-4 with queries (b) and (c).
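The selection criteria above can also be written as a small loop. This is only a sketch of the manual procedure, not a crawler: it assumes a hypothetical `results` dictionary that maps each (engine, query) pair to the ordered list of normal result URLs we copied down by hand.

```python
from typing import Dict, List, Tuple

ENGINES = ["Google", "Bing", "Yahoo!"]
QUERIES = [
    "Definition of Digital Humanities",
    "Digital Humanities",
    "Digital Humanities Research",
]


def select_documents(results: Dict[Tuple[str, str], List[str]]) -> List[str]:
    """Pick one not-yet-selected URL for every (query, engine) pair.

    `results` is a hypothetical, hand-collected mapping from
    (engine, query) to the ordered list of normal result URLs.
    """
    selected: List[str] = []
    for query in QUERIES:            # step 5: repeat for every query
        for engine in ENGINES:       # step 4: repeat for every engine
            for url in results.get((engine, query), []):
                if url not in selected:   # step 3: skip documents already added
                    selected.append(url)
                    break
    return selected
```

With three engines and three queries, this yields at most nine documents, one per pair.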

 

This process resulted in the following URLs (2/15/2016).

With a simple, repeatable process we have found our data-set. The process may be used at another time in the future, or in another location to compare results or progression.

The Formatting

Now that we have our data, we must decide whether to use the raw data or parse it. We could go through each file, remove the HTML tags, and find the content blocks to make sure every word we analyze appears on the website or…

… let's include what's under the hood and see what Digital Humanities is really made up of on the web (what's your opinion of this? Do you see any drawbacks? Positives?).

Our first technique will be feeding the raw data into a simple text processor that will tokenize (split the words), stop (remove uninformative words like ‘the’ and ‘you’), and naively stem (i.e., running -> run / exceedingly -> exceed) the 9 documents.

The stop words being used can be found here: https://sourceforge.net/p/lemur/galago/ci/default/tree/core/src/main/resources/stopwords/inquery
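As a rough illustration of the first two stages, here is a minimal sketch of tokenizing and stopping. It assumes tokens are split on whitespace only, and it uses a tiny placeholder stop list rather than the full INQUERY list linked above; the function names are our own.

```python
from typing import Iterable, List, Set

# Placeholder stop list for illustration only; the post uses the INQUERY list.
STOP_WORDS: Set[str] = {"the", "you", "a", "of", "and", "is"}


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it on runs of whitespace."""
    return text.lower().split()


def remove_stop_words(tokens: Iterable[str],
                      stop_words: Set[str] = STOP_WORDS) -> List[str]:
    """Drop every token that appears in the stop list."""
    return [token for token in tokens if token not in stop_words]


tokens = remove_stop_words(tokenize("The definition of Digital Humanities"))
# tokens is now ["definition", "digital", "humanities"]
```

Because raw HTML goes through the same splitter, markup fragments survive as tokens right alongside the words.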

The stemming will be very simple, following only two steps of the Porter stemming technique*:

Step 1a:

  • Replace sses by ss (e.g., stresses → stress).
  • Delete s if the preceding word part contains a vowel not immediately before the s (e.g., gaps → gap but gas → gas).
  • Replace ied or ies by i if preceded by more than one letter, otherwise by ie (e.g., ties → tie, cries → cri).
  • If suffix is us or ss do nothing (e.g., stress → stress).
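Step 1a is compact enough to sketch directly. The version below is an illustration under two assumptions of ours: only a, e, i, o, and u count as vowels, and the longest matching suffix is tried first.

```python
VOWELS = set("aeiou")  # assumption: 'y' is never treated as a vowel here


def step_1a(word: str) -> str:
    """Apply the Step 1a suffix rules, trying the longest suffix first."""
    if word.endswith("sses"):
        return word[:-2]                      # stresses -> stress
    if word.endswith(("ied", "ies")):
        stem = word[:-3]
        # 'i' if more than one letter precedes the suffix, otherwise 'ie'
        return stem + ("i" if len(stem) > 1 else "ie")
    if word.endswith(("us", "ss")):
        return word                           # stress -> stress
    if word.endswith("s"):
        # Delete the s only if a vowel occurs somewhere before the
        # letter that immediately precedes it.
        if any(ch in VOWELS for ch in word[:-2]):
            return word[:-1]                  # gaps -> gap, but gas -> gas
    return word
```

Note that `step_1a("humanities")` returns `"humaniti"`, which is the stemmed form that shows up in the rankings later in the post.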

Step 1b:

  • Replace eed, eedly by ee if it is in the part of the word after the first non-vowel following a vowel (e.g., agreed → agree, feed → feed).
  • Delete ed, edly, ing, ingly if the preceding word part contains a vowel, and then if the word ends in at, bl, or iz add e (e.g., fished → fish, pirating → pirate), or if the word ends with a double letter that is not ll, ss, or zz, remove the last letter (e.g., falling → fall, dripping → drip), or if the word is short, add e (e.g., hoping → hope).
  • Whew!

*More information on page 82: http://ciir.cs.umass.edu/downloads/SEIRiP.pdf
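Step 1b has more moving parts, so a sketch may help. The rules above do not spell out what a "short" word is; the code below borrows the definition from the Snowball English (Porter2) stemmer, which is an assumption on our part, and again treats only a, e, i, o, and u as vowels. The helper names are ours.

```python
VOWELS = set("aeiou")  # assumption: 'y' is never treated as a vowel here


def r1_start(word: str) -> int:
    """Index where the part after the first non-vowel following a vowel begins."""
    for i in range(1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)


def ends_in_short_syllable(word: str) -> bool:
    """Assumed (Porter2-style) short syllable test on the end of the word."""
    if len(word) == 2:
        return word[0] in VOWELS and word[1] not in VOWELS
    if len(word) >= 3:
        a, b, c = word[-3], word[-2], word[-1]
        return a not in VOWELS and b in VOWELS and c not in VOWELS and c not in "wxy"
    return False


def is_short(word: str) -> bool:
    """Assumed definition: ends in a short syllable and has an empty R1."""
    return ends_in_short_syllable(word) and r1_start(word) == len(word)


def step_1b(word: str) -> str:
    """Apply the Step 1b suffix rules as listed above."""
    for suffix in ("eedly", "eed"):
        if word.endswith(suffix):
            if len(word) - len(suffix) >= r1_start(word):
                return word[: -len(suffix)] + "ee"   # agreed -> agree
            return word                              # feed -> feed
    for suffix in ("ingly", "edly", "ing", "ed"):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if not any(ch in VOWELS for ch in stem):
                return word                          # no vowel before the suffix
            if stem.endswith(("at", "bl", "iz")):
                return stem + "e"                    # pirating -> pirate
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-2:] not in ("ll", "ss", "zz"):
                return stem[:-1]                     # dripping -> drip
            if is_short(stem):
                return stem + "e"                    # hoping -> hope
            return stem                              # fished -> fish
    return word
```

The double-letter test follows the wording above literally, so it also fires on doubled vowels; a fuller stemmer would likely restrict it to specific consonant pairs.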

First, we list the 25 most frequently used ‘tokens’ that our parser finds in each document. Observing this data, we struggle to find much value in most of the rankings. Notice, however, that the 9th document (found with the search “Digital Humanities Research” on Yahoo!) was a PDF of a book named Digital Humanities. Scroll through Figure 1 below to see the patterns:

Figure 1: Top 25 Rankings of 9 URLs
URL_1a:

  1. class= 106
  2. http 89
  3. > 85
  4. answer 78
  5. href= 74
  6. < 73
  7. div> 62
  8. >< 50
  9. digitalhumanitiesorg 46
  10. post 45
  11. id= 39
  12. digital 38
  13. p> 36
  14. a> 34
  15. humaniti 33
  16. <a 29
  17. value= 28
  18. option> 26
  19. >&nbsp 26
  20. <option 25
  21. s 25
  22. <div 25
  23. avatar 24
  24. a>< 24
  25. topic 22
URL_1b:

  1. > 217
  2. < 181
  3. class= 154
  4. http 112
  5. div> 89
  6. digitalhumanitiesnoworg 87
  7. <div 86
  8. href= 75
  9. fa 56
  10. id= 49
  11. >< 49
  12. menu 45
  13. a>< 43
  14. item 42
  15. wp 41
  16. content 40
  17. text 35
  18. li> 32
  19. dhnow 32
  20. 2016 29
  21. digital 29
  22. humaniti 28
  23. col 25
  24. type= 25
  25. 02 25
URL_1c:

  1. s 245
  2. > 233
  3. class= 205
  4. lg 190
  5. div> 125
  6. < 122
  7. } 119
  8. { 119
  9. lib 108
  10. id= 104
  11. box 84
  12. content 80
  13. <div 60
  14. font 55
  15. color 52
  16. border 49
  17. href= 47
  18. guide 46
  19. * 43
  20. http 43
  21. profile 42
  22. # 40
  23. tab 40
  24. link 36
  25. a> 36
URL_2a:

  1. class= 570
  2. href= 465
  3. a>< 271
  4. span>< 264
  5. title= 262
  6. id= 223
  7. wiki 213
  8. humaniti 185
  9. li> 172
  10. cite 170
  11. digital 164
  12. text 157
  13. ><a 154
  14. a> 151
  15. < 149
  16. mw 148
  17. <a 142
  18. > 139
  19. reference 137
  20. ref 118
  21. #cite 118
  22. ><span 113
  23. note 112
  24. <span 98
  25. rel= 92
URL_2b:

  1. class= 561
  2. field 417
  3. < 275
  4. > 266
  5. div> 258
  6. <div 227
  7. item 190
  8. http 138
  9. href= 138
  10. menu 134
  11. pane 117
  12. ><div 107
  13. link 105
  14. 1 92
  15. digital 90
  16. humaniti 89
  17. content 89
  18. div>< 80
  19. a>< 79
  20. dh 77
  21. ><a 75
  22. header 68
  23. view 62
  24. name 61
  25. file 59
URL_2c: 

  1. class= 498
  2. < 375
  3. > 345
  4. href= 320
  5. ku 277
  6. block 254
  7. <div 252
  8. div> 249
  9. li> 224
  10. a>< 194
  11. ><a 165
  12. menu 158
  13. http 156
  14. <li 149
  15. id= 123
  16. static 117
  17. view 98
  18. region 95
  19. text 94
  20. title= 93
  21. clearfix 84
  22. row 79
  23. new 77
  24. search 73
  25. inner 69
URL_3a:

  1. > 13
  2. text 12
  3. < 11
  4. } 9
  5. { 9
  6. <meta 6
  7. href= 6
  8. http 5
  9. line 5
  10. digital 5
  11. content= 5
  12. quote 5
  13. = 4
  14. humaniti 4
  15. function 4
  16. height 4
  17. font 4
  18. >< 4
  19. <a 4
  20. id= 4
  21. name= 4
  22. lib 4
  23. random 3
  24. css 3
  25. script> 3
URL_3b:

  1. menu 203
  2. item 184
  3. class= 137
  4. > 130
  5. http 112
  6. < 99
  7. wwwcdhuclaedu 89
  8. href= 77
  9. type 74
  10. post 55
  11. id= 47
  12. ><a 42
  13. wp 40
  14. li> 38
  15. object 38
  16. a>< 36
  17. content 35
  18. page 35
  19. { 31
  20. } 29
  21. text 27
  22. <li 26
  23. type= 25
  24. div> 24
  25. css 23
URL_3c:  (PDF!)

  1. digital 707
  2. humaniti 608
  3. project 339
  4. work 242
  5. knowledge 216
  6. design 199
  7. new 180
  8. research 179
  9. data 174
  10. cultural 167
  11. form 149
  12. model 138
  13. social 133
  14. create 133
  15. process 123
  16. way 123
  17. information 123
  18. text 111
  19. media 110
  20. scholarship 106
  21. platform 104
  22. system 100
  23. method 99
  24. network 98
  25. world 96
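A ranking like the ones in Figure 1 takes only a few lines to compute once a document has been reduced to a list of tokens. The sketch below uses Python's `collections.Counter`; the function names and the sample tokens are our own and are not taken from the original parser.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_tokens(tokens: Iterable[str], n: int = 25) -> List[Tuple[str, int]]:
    """Return the n most frequent tokens with their counts, highest first."""
    return Counter(tokens).most_common(n)


def format_ranking(ranking: List[Tuple[str, int]]) -> str:
    """Render a ranking in the 'rank. token count' layout used in Figure 1."""
    return "\n".join(
        f"{rank}. {token} {count}"
        for rank, (token, count) in enumerate(ranking, start=1)
    )


sample = ["digital", "humaniti", "digital", "class=", "digital", "humaniti"]
print(format_ranking(top_tokens(sample, n=2)))
```

Running the same function over each of the 9 token lists gives one ranking per URL.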

What Happened?!

Well, we have our definition of Digital Humanities in tabular form, but it looks more like brackets and code than anything we thought we were interested in.

So Our Data Stinks! (?)

Depends. We can gain a lot of insight into website development patterns from the 8 websites, and valuable information about the terms associated with the book, Digital Humanities, from the last link. This process, done on a smaller scale, is valuable because it shows the importance of early decisions in a project. Although intuition in this case would argue against keeping the raw HTML tags, it is curious to see how ‘code-heavy’ websites are relative to the content displayed on the web page. If the researcher were more inclined to find Digital Humanities content with this process, they would much rather have 9 PDFs about Digital Humanities than raw HTML files. By keeping everything at a smaller scale to start, we successfully ‘failed fast’ in this regard.
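For anyone who would rather take the other road and drop the markup, here is a minimal sketch using Python's standard `html.parser` module. It keeps only the text outside of `script` and `style` elements, and it also gives a crude measure of how ‘code-heavy’ a page is. The class name, function names, and the ratio itself are our own illustrative choices, not part of the original process.

```python
from html.parser import HTMLParser
from typing import List


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self.parts: List[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def visible_text(html: str) -> str:
    """Return the page text with tags, scripts, and styles removed."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def content_ratio(html: str) -> float:
    """Share of the raw characters that belong to visible text (0.0 to 1.0)."""
    return len(visible_text(html)) / len(html) if html else 0.0
```

A low ratio would indicate a page where most of the characters are markup rather than content.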

Clouds

With this first representation of Digital Humanities out of the way, let's now create our word cloud based on the tokenized words we generated. Maybe it will be more informative. We have 84,211 of them, so something must have converged (?). To simplify the task, we will use wordle.net (Settings: Font: Primer Print Medium, Straighter Edges, Mostly Horizontal, yramirP, A Little Variation, Do Not Remove Common Words).

Remember, we processed all of the text through tokenization (splitting), stopping (removing irrelevant words), and stemming (merging tenses/etc).

Figure 2: Word Cloud of Formatted Data

tokenized_words.png

If we observe this word cloud, we see the consequences of keeping all of the HTML, but beneath the larger tags we also see more humanly valuable words such as research, information, history, and scholarship. Interestingly, this mirrors what many Digital Humanities experts experience: constant wrangling with code, technology, and data in order to pull forth the more valuable aspects of their work.

So what good has our process done for Digital Humanities? Did it even get us closer to our goal of creating a visualization of the Digital Humanities? Have we pulled forth, even a little, anything valuable? Is it ‘the Digital Humanities’ or just ‘Digital Humanities’?! Let's go back to the data.

Comparison: Raw Data

Observe Figure 3 below, the result of no processing.

Figure 3: Word Cloud of Raw Data

raw_words.png

Like an onion, the raw data shows a new layer of words that inhabit our language, casting aside some of the more valuable words. Our process successfully peeled back a small portion of the noise. And that is all that is needed, for the data and process still exist, waiting for the next Digital Humanities researcher to iterate just a little bit in pursuit of the Definition of Digital Humanities.

 

@Alex_M_Sullivan
