Exploring with Text Analytics

2017 marks 40 years of space exploration by Voyagers 1 and 2. As a huge fan of science, science fiction, and travel, I find that the idea of two scientific instruments at the farthest distance from Earth captures my imagination as much as the Enterprise’s encounter with V’Ger (the fictional Voyager 6, lost in space).

Into the Unknown

At the moment of this writing, Voyager 1 is 13,147,751,940 miles from Earth. As Elton John once said, “It’s lonely out in space.” Despite being in what many of us would consider empty space, the Voyagers are still collecting information and transmitting it back to Earth. For the times when it can’t communicate directly with Earth, Voyager 1 has a tape recorder which can collect about 64 kilobytes of data. The documentary about the Voyagers, The Farthest, stated that the amount of computer processing power and information storage on the Voyagers equates roughly to that of an electronic key fob.

The Explorers & The Unknown

We are explorers. When we read, we “explore strange new worlds…and new civilizations”. As much as we have moved to an online world, the concept of reading for knowledge, exploration, and pleasure is not dead. What the online world has given us, however, is something much bigger than our own world: a veritable universe of information. Much as the Voyagers are probing through the dark reaches of space outside our solar system, we probe through the vast universe of information seeking truth and understanding.

What’s in this universe? Well, here are some mind-boggling facts about the rate of information creation from Analytics Week. As an opener, “more data has been created in the past two years than in the entire previous history of the human race.” When you stop to think about what exactly makes up all that data, it can include intriguing blogs like this one, posted pictures of your friends’ lunches, shared and reshared and posted and reposted and downloaded and reposted posts, all the versions of your work document and all downloaded and saved and resaved in multiple locations versions of your work document, the one piece of content you are actually searching for, and…a multiverse of useful and not very useful information.

Consider what that means. At the time the Voyagers were launched, we were dealing with information sizes in the kilobytes. Today, I have a relatively inexpensive 1TB storage device on my desk. That’s just for me. I’m not storing the collective knowledge of humanity.

Unless you have a V’Ger-like mind, you are probably more than a little overwhelmed by the size of this information universe, especially when you are seeking a particular truth. The trending term is big data, but data has always been big, even before it was digital. The difference now is mainly one of medium and how to access and retrieve what you are looking for.

The Known

The Voyagers haven’t unravelled all the mysteries of the universe (at least not yet), but they are important tools providing us with much sought-after information. Likewise, drowning as we are in our sea of information, we need tools to help us both stay afloat and probe the depths.

One of the ways we seek information is through pattern detection. Whether the pattern is known words or phrases, the way information is usually communicated, or learned through a lifetime of experience, finding patterns is one of the best ways we have of looking within larger quantities of information. Just as the Very Large Array is seeking patterns in blips by listening to the skies, text analytics can detect patterns in unstructured information.

The most basic form of pattern detection is matching. For example, if you have a list of names or a taxonomy of controlled values, you can use text analytics to match the same concept in text. The problem is, how often do we use language in an absolutely consistent manner? How common is it that there is only one word or phrase form to indicate a topic? For example, if I’m looking for “Dr. Seuss” in text, the documents which discuss “Theodor Seuss Geisel,” “Geisel, Theodor Seuss,” “Theodor Geisel,” or “Geisel, Theodor” may or may not be retrieved (or automatically tagged). While it’s possible to use synonymy to equate multiple forms of a concept to a preferred concept, this is labor-intensive work if it is the only means of expanding on a concept.
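
To make the synonymy idea concrete, here is a minimal sketch in Python. The synonym table and the function name tag_concepts are made up for illustration; a real taxonomy would be far larger and would likely live outside the code.

```python
import re

# Hypothetical synonym table: every variant form maps to one preferred concept.
SYNONYMS = {
    "Dr. Seuss": "Dr. Seuss",
    "Theodor Seuss Geisel": "Dr. Seuss",
    "Geisel, Theodor Seuss": "Dr. Seuss",
    "Theodor Geisel": "Dr. Seuss",
    "Geisel, Theodor": "Dr. Seuss",
}

# Try longer variants first so the fullest form of a name is matched
# before any shorter form that is contained in it.
_VARIANTS = sorted(SYNONYMS, key=len, reverse=True)
_PATTERN = re.compile(
    r"(?<!\w)(?:" + "|".join(re.escape(v) for v in _VARIANTS) + r")(?!\w)"
)


def tag_concepts(text):
    """Return the set of preferred concepts whose variants appear in text."""
    return {SYNONYMS[match.group(0)] for match in _PATTERN.finditer(text)}


print(tag_concepts("A new biography of Geisel, Theodor Seuss is out."))
print(tag_concepts("Ted Geisel drew cartoons before he wrote books."))
```

The first call finds the concept; the second finds nothing, because “Ted Geisel” was never added to the table. Every form has to be entered by hand, which is exactly the labor-intensive part.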

There are patterns we know to be true that allow us to extrapolate unknown concepts or reinforce known concepts. Using the same example, we could include rules to say that when we see two or three nouns in a row with capital letters, this is very likely a name. Of course, this is not true in all languages, but a single rule like this covers a lot of ground despite the inherent weaknesses in coverage. Using one rule can move us from the known, such as a list of names, to the unknown, such as any proper name which may appear in text.
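
A rough sketch of such a rule in Python follows. It is a simplification: the regular expression only checks capitalization and does not verify that the words are nouns, and the names CAPITALIZED_RUN and candidate_names are invented for this example.

```python
import re

# Rule: two or three capitalized words in a row are probably a proper name.
# Simplification: only capitalization is checked, not part of speech.
CAPITALIZED_RUN = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+){1,2}\b")


def candidate_names(text):
    """Return every run of two or three capitalized words found in text."""
    return CAPITALIZED_RUN.findall(text)


print(candidate_names("The book by Theodor Seuss Geisel impressed Carl Sagan."))
# A capitalized sentence opener gets swept into the match.
print(candidate_names("Yesterday Carl Sagan spoke."))
```

The first call picks up both names without any list of names at all. The second shows one of the inherent weaknesses: the capitalized word at the start of the sentence is pulled into the match.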

Rule writing can also be labor intensive, but rules can expand the universe of unknowns in text greatly when used in conjunction with known entities which are important to an organization.
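
As a small illustration of rules working alongside known entities, the sketch below runs a capitalization rule over text and then checks each hit against a list of known names. The list, the rule, and the function name split_known_unknown are all assumptions made up for this example.

```python
import re

# Hypothetical list of names already important to an organization.
KNOWN_NAMES = {"Carl Sagan", "Theodor Seuss Geisel"}

# Simplified rule: two or three capitalized words in a row.
CAPITALIZED_RUN = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+){1,2}\b")


def split_known_unknown(text):
    """Sort rule matches into names already on the list and new candidates."""
    known, unknown = [], []
    for name in CAPITALIZED_RUN.findall(text):
        if name in KNOWN_NAMES:
            known.append(name)
        else:
            unknown.append(name)
    return known, unknown


known, unknown = split_known_unknown("A talk by Carl Sagan inspired Ann Druyan.")
print("known:", known)
print("unknown:", unknown)
```

The known list confirms what we already track, while the rule surfaces candidates that a person can review and, if they matter, add to the list.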

The Known Unknown

“As we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns – the ones we don’t know we don’t know. And if one looks throughout the history of our country and other free countries, it is the latter category that tend to be the difficult ones.” When Donald Rumsfeld said this, he was widely mocked for using circumlocution to obfuscate events in Iraq. That criticism was probably fair, and the remark was definitely not made in the context of imagination and space exploration, but it does hit home that there are basically things we know and things we don’t. Known entities in text can be matched and found relatively easily. Unknown entities are the latter category: the more difficult ones.

In later posts, I’ll expand on the search for the known unknowns, using text analytics as a tool for identification, retrieval, and exploration. In the spirit of the Voyagers, hurtling through the vastness of space in their ongoing mission to collect data, let’s be explorers using text analytics to cut through the universe of semi-structured and unstructured content.
