Voyant Tools 2

David Hoover lists a number of interesting possibilities for textual analysis, including “assessing how similar imitations, pastiches, completions, continuations, prequels, and sequels of texts written by other authors are to the original texts.” Sherlock Holmes has a very long and rich history of pastiches, so I thought it would be interesting to compare some to the original stories and see how they hold up. I located some pastiches from this site and loaded them into Voyant.

Let’s start with some word clouds. I had to trim out some common words to make a cleaner cloud, and I also found I needed to remove ‘Holmes’ from the pastiche cloud to produce anything useful. (Otherwise ‘Holmes’ dominates the cloud and every other word shrinks to an incredibly small size.)

Now, ideally, the word clouds for the original Doyle stories and the pastiches would be very similar. And, in fact, there are some decent similarities. The cloud on the left shows the original stories; the one on the right shows the imitators. Here’s a list of the words in each, starting with the most frequent:

Top 15 Original Words: said, come, little, know, room, time, sir, came, face, think, house, way, night, watson, good

Top 15 Pastiche Words: said, asked, watson, time, case, way, door, just, looked, room, good, know, left, turned, I’m

Seven words appear in both lists: ‘said,’ ‘know,’ ‘room,’ ‘time,’ ‘way,’ ‘watson,’ and ‘good.’ (‘door’ and ‘case’ also appear in the top 20 words from the original stories.) The pastiches stack up better than I expected. One very clear distinction, however, is that ‘Watson’ sits much further up the list for the pastiches, as the third most common word. It’s tempting to conclude that fans of Doyle’s work ascribe a more important role to Watson than the original author did.
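For anyone who wants to check this sort of overlap outside of Voyant, here is a minimal Python sketch. It is not how Voyant works internally; the stopword list and the function names are my own assumptions, and the two lists at the bottom are simply the top-15 lists reported above.

```python
import re
from collections import Counter

# Hypothetical stopword list for illustration; a real one would be much longer.
STOPWORDS = {"the", "a", "and", "of", "to", "in", "i", "it", "he", "was", "that", "holmes"}

def top_words(text, n=15, stopwords=STOPWORDS):
    """Return the n most frequent non-stopword tokens, most frequent first."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in stopwords)
    return [word for word, _ in counts.most_common(n)]

def shared_words(list_a, list_b):
    """Return the words found in both lists, kept in list_a's order."""
    in_b = set(list_b)
    return [w for w in list_a if w in in_b]

# The two top-15 lists from the post, typed in by hand.
original = ["said", "come", "little", "know", "room", "time", "sir", "came",
            "face", "think", "house", "way", "night", "watson", "good"]
pastiche = ["said", "asked", "watson", "time", "case", "way", "door", "just",
            "looked", "room", "good", "know", "left", "turned", "i'm"]

overlap = shared_words(original, pastiche)
print(overlap)
```

With the full texts on hand, `top_words` could generate the two lists directly instead of copying them from the word cloud.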

So, the word clouds look somewhat different, but decently similar. What else can we check? Let’s look at some phrases.

Here are the five most frequent phrases and how often they appear in each story. The original Doyle stories are the first eight, with the bottom five being the imitation stories. Here the distinction between authors is much more apparent. The way Doyle habitually combines words differs quite a bit from how the other authors do, with ‘all this’ being the only phrase of his that appears in the pastiches with any regularity. (That said, it is interesting to note that Doyle’s final collection of short stories, ‘His Last Bow,’ also appears quite distinct from the others. Perhaps Doyle’s style had begun to shift at that point.)
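The phrase comparison above can be roughed out in a few lines of Python. This is only a sketch under my own assumptions (simple lowercase tokenizing, exact word-for-word matches), not Voyant’s method, and the sample texts are made-up placeholders rather than lines from the stories.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def count_phrase(text, phrase):
    """Count how many times the phrase occurs as a run of consecutive words."""
    tokens = tokenize(text)
    target = tokenize(phrase)
    if not target:
        return 0
    n = len(target)
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == target)

def phrase_table(documents, phrases):
    """Map each document title to a {phrase: count} dictionary."""
    return {
        title: {p: count_phrase(text, p) for p in phrases}
        for title, text in documents.items()
    }

# Placeholder documents, invented for illustration only.
sample_docs = {
    "doc one": "All this is odd. What is all this, and all this?",
    "doc two": "Nothing of the kind appears here.",
}
table = phrase_table(sample_docs, ["all this"])
print(table)
```

A row of mostly zeros for a pastiche would show up here the same way it does in the Voyant grid.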

These are just two quick and simple ways of comparing the texts. The next steps would be to apply more sophisticated techniques to compare them, such as cluster analysis. It also would be interesting to gather many pastiches from multiple authors and compare them to Doyle’s work, one author at a time, so that in theory you could locate the author who has been the most successful at imitating Doyle’s work!

Experimenting With Voyant Tools

For my textual analysis, I used (almost) all of Sir Arthur Conan Doyle’s stories on Sherlock Holmes. All of these stories have passed into the public domain, except for The Casebook of Sherlock Holmes collection. This means the complete corpus has 8 documents: 4 novels and 4 collections of short stories.

The word cloud these texts produce as a whole seems like a nice place to start, although it mostly confirms what we would already expect to see. ‘Holmes’ tops the list of most frequent words, as he is of course the star of the show. His sidekick and biographer, Watson, makes an appearance as well, but only as the 14th most frequent word. Again, considering Watson’s relatively humble role in the partnership, this comes as no surprise.

Words centered around the intellectual nature of Holmes’ efforts (‘know’ and ‘think’) and around observing physical details of people and the environment (‘face,’ ‘hand,’ ‘door,’ ‘house,’ ‘room’) are also what we would expect.

So let’s do something a little more interesting. Let’s type in the word ‘cocaine’ and see how often it appears across the timeline in a simple plot graph. These texts are arranged in order of when they were published, starting with A Study in Scarlet and ending with His Last Bow.

Everyone knows Holmes has a cocaine addiction, right? But if we check the data, it actually looks like his addiction is not a life-long problem. The word has zero mentions in the first novel, peaks in the second novel, The Sign of the Four, then tapers off fast, with three mentions and then one in the two short story collections that follow. After that, it isn’t mentioned again. It seems Holmes kicks the habit pretty fast. I find this especially interesting because I never got around to reading the last of the stories, so I can confirm this without having even read them. (Unless Doyle suddenly starts referring to cocaine with euphemisms, which seems unlikely, given his style.)
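The same kind of count is easy to reproduce by hand. Below is a small Python sketch that tallies a term across documents kept in publication order. It reports raw counts, the function name is my own, and the sample texts are invented placeholders rather than the real stories.

```python
import re

def term_counts(documents, term):
    """Count whole-word, case-insensitive matches of term in each document.

    documents is a list of (title, text) pairs, already in publication order.
    """
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return [(title, len(pattern.findall(text))) for title, text in documents]

# Placeholder corpus, invented for illustration only.
sample_corpus = [
    ("first", "No mention of the drug here."),
    ("second", "Cocaine, he said. The cocaine bottle sat on the mantel."),
    ("third", "Only one passing reference to cocaine."),
]

counts = term_counts(sample_corpus, "cocaine")
for title, count in counts:
    print(title, count)
```

Because the match is on the exact word, this shares the blind spot mentioned above: a euphemism would go uncounted.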

The “knots” tool is a somewhat more complex visualization. The green line here is for Holmes and the blue is for Watson. The twistier the line is, the more frequently a word occurs, which explains why the green is so much twistier. However, the points of intersection also mean something. The more the lines overlap, the more strongly linked the two words are in a text. Looking at these knots, you might be tempted to conclude that Holmes and Watson have almost no relation to one another. After all, they only seem to overlap once or twice! But if you click on the points of overlap, Voyant will give the context for those little pieces of overlap:

Ah-ha. So when the Voyant Tools Guide described the words as being ‘linked,’ it meant it quite literally: it checks to see if the words occur very closely together. I now have a better understanding of the Knots tool, and what it can and cannot do. Clearly, you can’t just plug in character names and expect the results to show how often these characters interact with one another.
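To make the idea of words occurring ‘very closely together’ concrete, here is a rough Python sketch of a proximity check. I don’t know how Knots actually computes its overlaps, so the window size and the function below are purely my own assumptions, and the sample sentence is invented.

```python
import re

def near_occurrences(text, word_a, word_b, window=5):
    """Return a context snippet for each word_a that has word_b within `window` tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    positions_b = [i for i, t in enumerate(tokens) if t == word_b]
    hits = []
    for i, token in enumerate(tokens):
        if token != word_a:
            continue
        if any(abs(i - j) <= window for j in positions_b):
            start = max(0, i - window)
            hits.append(" ".join(tokens[start:i + window + 1]))
    return hits

# Invented sample text, for illustration only.
sample = ("Holmes turned to Watson and smiled. Much later that long and "
          "dreary evening the good doctor Watson slept.")

close = near_occurrences(sample, "holmes", "watson", window=3)
print(close)
```

A check like this only finds the two names sitting side by side in a sentence, which is exactly why it says little about how often the characters actually interact.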

There are still plenty more tools for me to explore, but I am intrigued by the possibilities they pose.

Polybius Progress

The more I work with the article, the more I realize how poorly organized the entire page is. There are many areas that need reorganization and expansion. I’m currently working on an outline.

Urban Legend

(a) Summary of original legend as listed by coinop.org (need to confirm all details in article are cited properly)

Note: Strongly consider adding the ‘screencap’ of the game, which is the oldest visual ‘proof’ of the game and very influential to the myth

(b) Explain the significance of the myth’s appearance in GamePro magazine (first time it appeared in print, the single biggest source of spreading myth)

(c) Steven Roach adds details to the myth that become embedded in the story. (esp. vector graphics) Briefly summarize his involvement and his added details

(d) Consider adding a little bit of info about the photos of fake arcade cabinets that popped up, as this helped further fuel and spread the story


(a) Debunking the myth (move relevant material from elsewhere in the article to this section): BRIEFLY review the proof that the game does not exist (no proof of the company existing, no printed mention from media of the time, no ROMs or machines found, name of the company seems to be made up by a non-German speaker)

(b) Argument for 80s origin (Dunning, DeSpira), with focus mainly on why the myth developed and was popular/appealed to people

(c) Argument for 2000 origin (Brown, et al), with focus mainly on why the myth developed (a hoax by coinop.org to drive traffic) and why it was popular and appealed to people/why it still appeals to people



  • I realize ‘legacy’ is a common method of organizing Wiki articles. It does feel a bit odd here, though, to separate it from ‘history.’ That is because the nature of urban legends is that they are ongoing and always evolving. The ‘legacy’ really is just the next step in the story’s history, spread, and morphing. Still, it may be easiest to stick with this header.
  • Add a screenshot of The Simpsons episode that has Polybius, if it is legal to do so. It was arguably the biggest mainstream reference to the myth.
  • Tend to the ‘clarification needed’ tag in this section and clean up the ‘in popular culture’ section
  • Check to see if the legacy list is accurate and if anything needs to be added

Wiki Project

For my Wikipedia page project, I am currently looking at editing the page for the urban legend surrounding the arcade game Polybius. This article is rated C-class on the quality scale, implying there’s some room for improvement.


Gives a brief account of the legend itself but does not discuss the history of the legend’s development in much depth. Urban legends are all about growing and evolving: this is how they develop, get added on to, and spread. As such, it’s extremely relevant to include that information.

A man named Roach has had a large impact on the development of the legend. In the Talk page, you can see he has been added into the page and removed multiple times. I argue that he should be added back in. A Talk note says the main reason for removal was poor sources (a primary source, not secondary, which apparently is bad?), so I will aim to use good secondary sources. (I have already gathered some.) The legend changes over time in part because of the details Roach introduces, so it’s relevant to include them.

Also involved in the spread of the legend is a series of fake Polybius arcade machines that cropped up, which helped promulgate the tales.


(The History section also includes analysis of the legend, from Snopes.com in particular, which needs to be moved to the Analysis section of the article.)

Can be further expanded. Theories abound that the development/popularity of the urban legend is due to a number of factors: the controversy and parental paranoia surrounding the new and exciting arcade machines, real-life FBI raids of these arcades, and popularity of conspiracy theories of men-in-black (such as The X-Files) and mind control experimentation (some games were indeed employed by the military in brief experiments, which seems worthy of a brief mention).

The analysis section can pull from many more sources than just Dunning. Young children in the 80s may have heard of these events and stories and formed distorted recollections of them, which could have further fueled and distorted the legend. I have a source that draws quotes from the kid who suffered a migraine after a marathon session of a game, one of the events believed to have fueled the rumor, and the boy explains how his friends were convinced evil video games were to blame. There is also a related rumor about a death linked directly to the arcade game Berzerk, and it’s not mentioned at all in this article.


Needs images, for starters! Namely, the Simpsons screenshot. The Popular Culture subsection is also extremely brief (one entry has a ‘details needed’ tag) and I suspect there may be a few major games or pop culture references to add. I still need to double-check that.


Here’s a list of my sources so far. I won’t necessarily use them all; the list is mostly to help me keep track of things.

What is most likely the original source of the Polybius myth, the coinop.org post


Article in Atlas Obscura (I still need to evaluate the authority of this source)

Stuart Brown produces video-game documentaries on YouTube. His extensive video recounts some pretty exhaustive research he’s conducted via archive searches and interviews.

This post mentions another ‘suspect’ for the origin of the legend (but also discusses the ultimate futility in locating the ground zero for an urban legend online.)

There was an aborted Kickstarter for making a film about Polybius

Polygon is a very well-known gaming news outlet

A very old interview with Steven Roach. Because the Wikipedians said this was a primary source, they didn’t seem to consider it an appropriate source. I’m still unclear as to why this is a problem.

(List incomplete.)

Will You Annotate? I Would Prefer Not To.

When I read Melville’s Bartleby, the Scrivener, I definitely preferred reading the story on its own, without any annotations. I could simply focus on the story itself and my own reactions to it.

When I began to read the annotated versions, I found I often felt frustrated. Many of the annotations offered potential interpretations of the text, which made me feel as though I was being robbed of the chance to interpret the story in my own way. It also felt as though I was letting other people do all the thinking for me. Perhaps if I had waited for a longer period of time in between reading the story on my own and reading the annotated versions, this frustration would not have been as strong. Presumably I would have had more of a chance to process and really ponder the story if I had waited longer.

Another frustration was how distracting the annotations were. Some of the annotations were entirely frivolous. For example, one annotation of the word ‘luny’ was simply a .gif of Homer Simpson. Even when they were not frivolous, the annotations could still be overwhelming or distract from the story. In general, I found almost all annotations with images or video to be especially distracting, and they felt jarring to me.

I must admit, though, that there were times when I enjoyed the annotations. Occasionally, an annotation would explain a phrase, term, or cultural concept I didn’t understand, and I actually appreciated this. It enhanced my ability to understand the story, hopefully without overwhelming me with too much context or background information. There were also a few annotations that genuinely made me laugh (one simply was titled “Laugh at this” and said ‘It’s a joke’) while simultaneously being informative. I also sometimes appreciated that the annotations made me consider new interpretations. For instance, a small series of annotations suggested the possibility of homoerotic subtext to the story, which I had not before considered.

Another thing I noticed was how different my reactions were to the two annotation websites. I strongly preferred Slate over Genius. Slate’s visual layout was clean and uncluttered, and it was still easy to read the text itself. The annotations were not very intrusive and it was much easier to ignore them if I wished to. Overall, I also felt the quality of the annotations from Slate was a lot higher. (That said, the annotations could also provide excessive information or give interpretations that I felt were reading an awful lot into things.) Finally, the tagging system on Slate was a nice bonus, making the annotations very well-organized.

Genius, meanwhile, impressed me a lot less. All the highlighted text was very ugly and distracting, and the visual layout felt a lot clunkier and more constrained. The extra interactive features on the annotations (being able to upvote, downvote, share, or suggest an improvement) added a ridiculous level of distraction. I also found the content of the annotations to be of lower quality. Additionally, those writing the annotations tended to state their interpretations as if they were fact, which annoyed me.

However, Slate clearly has a professional staff creating their web pages and writing their annotations. Genius, on the other hand, is composed of crowdsourced annotations. This could certainly explain the differences I found. It only makes sense the quality of annotations from Genius would vary a great deal. This is still no excuse for a poor visual layout, though.

All things considered, I feel there’s much I can take away from this when writing my own annotations. I may not be able to control the visual layout of Hypothesis, but I can control the content of my annotations, at the very least. I will strive to give my own interpretations very rarely, and allow the reader to think on their own and draw their own conclusions. I will try my best to avoid frivolous annotations just for the sake of humor; they should actually have something to say. And if there is an area that I feel could benefit from further context, I will try not to overwhelm the reader with a massive wall of text that gives too much context.