There’s much written about unstructured data and how it is notoriously over-generated and underutilized. What is unstructured data? Wikipedia provides the following definition:
Unstructured data (or unstructured information) is information that either does not have a pre-defined data model or is not organized in a pre-defined manner. Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well. This results in irregularities and ambiguities that make it difficult to understand using traditional programs as compared to data stored in fielded form in databases or annotated (semantically tagged) in documents.
While this is a pretty good definition, obviously it depends on your perspective. Using the term loosely, text does have a data model.
The basic data model for English is subject-verb-object (S-V-O). Unless I’m being poetic or I’m speaking like Yoda, I’ll use a subject followed by a verb and often include an object. That’s just the basics, and then I can add all kinds of other parts of speech, like adjectives and helping verbs. I swam. I was swimming. I was swimming intensely. I don’t usually construct sentences like “Swam” with an implied subject “I” or mix the order of the words and let the reader sort it out, such as “was swimming I intensely” (though native speakers will probably get the meaning anyway).
When matching text against a known taxonomy, linguistic structure is not necessarily that important. We have defined terms, their synonyms, their acronyms, and their related terms, and we simply want to match these if they are found in the document. Language processing is much more important when we are doing entity extraction. The system needs to know parts of speech and which words belong together as phrases. Recognizing single-word entities is only going to go so far. Recognizing “New York City” as “New”, “York”, and “City” doesn’t do us much good, and only getting “New York” is also misleading, but at least closer to the truth. What we really want is good language processing that can parse words and sentence structure, identify parts of speech, and know which terms make up a phrase.
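To make the “New York City” point concrete, here is a minimal sketch of taxonomy matching that prefers the longest phrase at each position. The taxonomy entries, the labels, and the function name match_taxonomy are all invented for illustration; a real text analytics tool would do far more (part-of-speech tagging, phrase detection for terms it has never seen).

```python
import re

# Hypothetical taxonomy: surface forms (terms, synonyms, acronyms)
# mapped to a preferred label. These entries are assumptions.
TAXONOMY = {
    "new york city": "New York City",
    "nyc": "New York City",
    "new york": "New York (state)",
    "york": "York",
}


def match_taxonomy(text, taxonomy=TAXONOMY):
    """Return (matched text, label) pairs, trying the longest phrase first."""
    tokens = re.findall(r"\w+", text)
    max_len = max(len(key.split()) for key in taxonomy)
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for size in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + size])
            label = taxonomy.get(candidate.lower())
            if label is not None:
                matches.append((candidate, label))
                i += size  # skip the words consumed by this phrase
                break
        else:
            i += 1  # no taxonomy entry starts at this token
    return matches


print(match_taxonomy("She moved to New York City from York."))
```

Because the three-word phrase is tried before the shorter ones, “New York City” is matched as a single term rather than as “New York” or “York”.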
Linguistic structuring is one of the pillars of text analytics, using the rules of the language to break it up and parse it so a machine can use it. In a sense, text analytics takes structure, breaks down that structure, and applies that structure back to the original text so we can do things like find related concepts and cluster the results.
Another source of structure is the content format itself, which was devised precisely to structure text into a human-readable presentation. A common structuring language is HTML, which prescribes how a web page, including the text on that page, should be rendered. HTML uses markup tags to tell web browsers whether a portion of the text is a title, whether it is in bold, and what font size should be used. HTML gives web browsers the directions a human would follow when writing by hand: using capital letters, slightly larger or darker handwriting, and skipping a line to start a new paragraph. How anachronistic that analogy sounds!
The same is true for various kinds of Microsoft Office documents, PDFs, and a variety of other formats that allow us to read text on a screen. They all carry some form of markup or formatting instructions which tell the computer how to render the document. Essentially, they take raw text and define the way it should be presented.
In a slightly strange cycle of necessity, human speech is converted into symbols we can read and interpret (Roman characters, for English), converted into a format for machines to read and render, and presented back to us in a marked-up form so we can understand what is being shown. In turn, we can use content markup and text placement to add meaning to text which we could usually only get by reading.
One of the more obvious pieces of structure in an electronic document is the set of metadata fields, such as title, author, and date created, which are included with the document. Unless the data feeding these fields came from controlled values, the content may be irregular, but it’s often a very useful place to start. Similarly, titles often include very important terms or phrases which give us an indication of what the whole document may be about. In fact, it’s fairly standard for search engines to weight a title more heavily as part of their overall relevance ranking.
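As a rough illustration of title weighting, the sketch below scores a document by counting query terms and multiplying title hits by a weight. The function name score_document and the default weight of 3.0 are arbitrary assumptions; real search engines use much richer relevance formulas.

```python
import re


def score_document(query, title, body, title_weight=3.0):
    """Count query-term hits, weighting title hits more than body hits.

    title_weight is an illustrative value, not a recommended setting.
    """
    terms = re.findall(r"\w+", query.lower())
    title_tokens = re.findall(r"\w+", title.lower())
    body_tokens = re.findall(r"\w+", body.lower())
    score = 0.0
    for term in terms:
        score += title_weight * title_tokens.count(term)
        score += body_tokens.count(term)
    return score


print(score_document(
    "text analytics",
    title="Text Analytics Basics",
    body="Analytics of text is useful.",
))
```

With these inputs, each query term appears once in the title and once in the body, so the title contributes three times as much to the score as the body does.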
Defining Document Sections
In a text analytics tool – assuming we can view and understand the markup language – one way to define a text boundary in order to create a document section is by using tags. In this case, we can create a title section using the standard HTML title opening and closing tags, <title></title>, as beginning and end markers. Once we’ve defined a title section within the tool, we can add customized weightings. We can also give extra weight to known concepts we think are important when they appear in the title. In entity extraction, defining important document sections means that unknown concepts which might be important will be ranked higher because they appear in the title.
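Outside of any particular tool, the same idea can be sketched with Python’s standard html.parser module: collect the text found between a chosen pair of opening and closing tags. The class name SectionExtractor and the function extract_section are hypothetical names used only for this example.

```python
from html.parser import HTMLParser


class SectionExtractor(HTMLParser):
    """Collect the text that appears between <tag> and </tag>."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0   # greater than 0 while inside the target tag
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)


def extract_section(html, tag="title"):
    """Return the whitespace-normalized text inside the given tag."""
    parser = SectionExtractor(tag)
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


page = "<html><head><title>Text  Analytics</title></head><body><p>Body</p></body></html>"
print(extract_section(page))
```

The extracted title text could then be passed to whatever weighting step follows, while the rest of the page is treated as ordinary body text.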
In cases in which we don’t know the markup language, or simply for ease of use, we can define document sections based on text indicators. For example, we can define an introduction document section as starting at the beginning of the document and running for n words. Likewise, we could define a section counting backwards n words from the end of the document and exclude this section from text analytics to avoid things like references or bibliographies.
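A minimal sketch of word-count sections, assuming whitespace-separated words are a good enough unit. The function name split_by_word_count, the dictionary keys, and the default counts are illustrative choices, not settings from any specific tool.

```python
def split_by_word_count(text, intro_words=100, tail_words=50):
    """Define sections purely by position in the word sequence.

    Returns the first intro_words words as the introduction, the last
    tail_words words as an excluded tail (e.g. references), and
    everything before that tail as the text to analyze.
    """
    words = text.split()
    cutoff = max(len(words) - tail_words, 0)
    return {
        "introduction": " ".join(words[:intro_words]),
        "analyzed": " ".join(words[:cutoff]),
        "excluded_tail": " ".join(words[cutoff:]),
    }


sections = split_by_word_count(
    "one two three four five six seven eight",
    intro_words=3,
    tail_words=2,
)
print(sections["introduction"])
print(sections["excluded_tail"])
```

Note that the introduction overlaps with the analyzed text here: it is a region that can be weighted more heavily, not a separate copy of the document.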
Another way to define a document section is by using text as it appears in the document. For example, setting up an introduction document section that starts at the term “Introduction” and ends at the term “Table of Contents” might ensure we always get an introduction weighted more heavily in templated documents. Using a combination of text indicators and markup expands the possibilities considerably.
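Marker-based sections can be sketched in a few lines as well. The function name section_between is hypothetical, and the case-insensitive matching is an assumption; a templated document might call for stricter rules, such as requiring the marker to sit on its own line.

```python
import re


def section_between(text, start_marker, end_marker):
    """Return the text between two literal markers, or None if either is missing.

    Matching is case-insensitive; the markers themselves are not included.
    """
    start = re.search(re.escape(start_marker), text, flags=re.IGNORECASE)
    if start is None:
        return None
    rest = text[start.end():]
    end = re.search(re.escape(end_marker), rest, flags=re.IGNORECASE)
    if end is None:
        return None
    return rest[:end.start()].strip()


document = (
    "Annual Report\n"
    "Introduction\n"
    "This report covers text analytics.\n"
    "Table of Contents\n"
    "1. Scope\n"
)
print(section_between(document, "Introduction", "Table of Contents"))
```

Returning None when a marker is absent lets the caller fall back to another strategy, such as a word-count section, for documents that do not follow the template.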
Although it is labeled as unstructured data, text often carries a lot of structure in both its language and its document format, which we can use to define our own document sections and find meaningful concepts and phrases.