Taxonomies do not build themselves. Constructing any controlled vocabulary requires, at minimum, input from end users, subject matter experts, and a taxonomist. End users and subject matter experts can provide concepts and their attributes for inclusion in an enterprise taxonomy. A taxonomist validates literary warrant, concept usage in content sources, and concept labels, definitions, scope notes, and other attributes and relationships.
The construction of knowledge organization systems (KOS), including taxonomies, can be time-consuming and even contentious. How do we build up useful KOSs, especially those used for content tagging, in an efficient manner while maintaining precision and accuracy?
Ask the Experts
The surest way to develop thorough, accurate, and well-constructed KOSs is to consult or hire a taxonomist. An experienced taxonomist has the knowledge and background to seek out and reuse vocabularies designed by other taxonomists or build new models from the ground up. Hiring a taxonomist for a consulting period or as a full-time employee leads to the development of an organizationally or domain-specific vocabulary (or, more often, a set of vocabularies) with longevity.
The drawback to this preferred method is that development can take a significant amount of time. Even reusing existing vocabularies is rarely a plug-and-play methodology as organizations frequently require using only part of the vocabulary, using several vocabularies which must be mapped and interrelated, modifying the vocabulary to be suitable for purpose, or a combination of all of these needs.
While beginning development, the taxonomist or the fledgling taxonomy project is often a business process bottleneck. Taxonomy projects usually begin by tackling a piece of the overall domain in a proof of concept. For example, the direction may come from a particular organizational vertical to improve an existing process; this could be communications content tagging, manufacturing document identification, call-center scripts, or a host of other applications. Onboarding and scaling across the organization is a process which takes coordination and negotiation. Terms, even when streamlined through a formal request process, take time to research, approve, and make available for tagging. Taxonomy projects can also become victims of their own success. Once the business realizes the advantages of taxonomy use, ramping up quickly can overwhelm the solo taxonomist.
If the main obstacle is scalability, how do we speed up the process to get to an end product more quickly?
Ask the Users
In the early 2000s, the availability of publicly generated and shared content was already scaling beyond individual and organizational ownership. Since expert review and validation were time-consuming and unable to grow in proportion to the amount of content being generated, many content platforms relied on users to supply concept tags to content in the form of folksonomies. The thinking was that, over time, the natural selection of preferred concept forms and use would generate useful classification systems. These systems started with individual user tagging, making it easy for users to create and apply tags to their own content. Eventually, users were able to tag nearly any content. We still see this today in the ease of hashtag creation and application on social media sites.
The creation of concept tags in what is essentially a crowdsourcing activity has some advantages. First, asking end users to supply concepts can generate a vast number of tags in a relatively short time. If many people are sharing the task of creating (or even just tagging existing) content, many useful concepts can be collected quickly. Second, including even basic mechanisms to sort similar terms for analysis or further development into a hierarchical structure with attributes and relationships can help ramp up usable vocabularies much more quickly than starting from scratch. Finally, depending on the content being tagged and the user base doing the tagging, many domains of knowledge can be covered with some breadth and depth from the people who are the subject matter experts.
On the other hand, there are many disadvantages to allowing users to tag content, and these can be significant enough to thwart a taxonomy project. One major disadvantage is the certainty of human inconsistency, even within the parameters of a single tagger. If you’ve ever lost a document because you can’t remember the folder you put it in, you know what makes sense one day may not make sense the next. In practice, the creation of variant tags in the form of plurals, alternative labels, misspellings, and near-synonyms results in extremely large tag clouds lacking precision. This issue becomes amplified across many users, resulting in many nearly synonymous concepts with differing scopes of application.
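As a rough illustration of how variant tags might be surfaced for review, here is a minimal Python sketch. It is not how any particular platform works: the normalization rules, the 0.9 similarity threshold, and the tag counts are all assumptions made for the example, and the plural handling is deliberately naive.

```python
from difflib import SequenceMatcher


def normalize(tag: str) -> str:
    """Lowercase, trim, collapse whitespace, and strip a naive trailing plural 's'."""
    key = " ".join(tag.casefold().split())
    if key.endswith("s") and not key.endswith("ss") and len(key) > 3:
        key = key[:-1]
    return key


def cluster_variants(tag_counts: dict[str, int], threshold: float = 0.9) -> list[list[str]]:
    """Group tags whose normalized forms are identical or nearly identical."""
    clusters: list[tuple[str, list[str]]] = []  # (representative key, member tags)
    # Visit the most-used tags first so they become the cluster representatives.
    for tag in sorted(tag_counts, key=tag_counts.get, reverse=True):
        key = normalize(tag)
        for rep_key, members in clusters:
            if SequenceMatcher(None, key, rep_key).ratio() >= threshold:
                members.append(tag)
                break
        else:
            clusters.append((key, [tag]))
    return [members for _, members in clusters]


# Hypothetical tag usage counts, invented for illustration.
tag_counts = {"air pollution": 40, "Air Pollution": 12, "air polution": 3, "strip mining": 9}
print(cluster_variants(tag_counts))
# [['air pollution', 'Air Pollution', 'air polution'], ['strip mining']]
```

A grouping like this only proposes candidates. Deciding which label is preferred, and whether near-synonyms really share a scope, remains a review task.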
Another issue is differences in levels of granularity. If one user is tagging content as being about environmental issues while another is tagging similar content with more specific concepts like air pollution, rising sea levels, and strip mining, the resulting mix of general and specifically tagged content will make for messy search results.
One of the keys to getting users to tag content in the first place is designing a mechanism allowing for easy tag creation and application. The problem is that ease of tag creation doesn’t always support thoughtful formation and application. The result is often a quick shortcut to a concept which already exists in another form.
Let me provide an example. I recently went on vacation in Puerto Vallarta, Mexico. I wanted to create a social media post checking in and tagging it with the name of the airport. Here is the set of results I was presented:
Note that my smartphone is providing me localized options since I’m back home. Typing just aeropuerto to start the search does show options from my most recent history, but also localized language variants for Oakland, San Francisco, and even the not-so-local Mexico City. The maddening part for a taxonomist is the first two options: they are nearly identical, but the second includes the misspelling “Inrernacional”. Worse, 70.2K people have chosen that as the preferred option! So much for the wisdom of crowds.
When I add more information to narrow my search, I get a mix of versions of the Puerto Vallarta airport name as well as other recent check-in locations. My top two choices are the same while I also get other versions of the airport name, all added by end users. These are just the first two pages of results. How many results are there? Which is the correct version? Should I pick by volume and select the misspelled variant? Should I pick the name which provides the most information, including the Mexican state of Jalisco?
What is the actual name of the Puerto Vallarta Airport? Even in the local guide, it’s not clear. If you look closely at the photo, you will see the full name of the airport (with the first word in the title abbreviated).
Wikipedia gives the name as “Licenciado Gustavo Díaz Ordaz International Airport (sometimes abbreviated as Lic. Gustavo Díaz Ordaz International Airport)” but does not include “Aeropuerto de Puerto Vallarta” as part of the name.
When I worked on a large vocabulary, part of the job was to research terms and proper names for inclusion in the thesaurus and the name authority, including the accepted term form, definition, scope note, and relationships to other terms. This is where a taxonomist comes in, performing the verification tasks which are not expected of end user taggers, even when they are subject matter experts.
Asking the users to provide tags, as Facebook and Google (especially when adding a location to Google Maps) do, addresses the issue of scale, but cannot successfully enforce precision without preventing users from adding any tags at all.
Ask the Content
An alternative to asking the experts or the end users (who may also be subject matter experts) is to ask the content. Using text analytics to interrogate textual content or image recognition on non-textual content can provide a scalable solution to ramping up both the identification of candidate concepts (named entity recognition and extraction) and the automatic tagging of large quantities of content based on these concepts.
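To make the idea of candidate concept identification concrete, here is a deliberately simple Python sketch. It is not named entity recognition or machine learning. It only counts repeated two-word phrases as a stand-in for what a text analytics tool would do with far more sophistication. The stopword list, the function name, and the sample sentences are assumptions made for the example.

```python
import re
from collections import Counter

# A tiny, hypothetical stopword list; real tools use much larger ones.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "are", "for", "on", "by"}


def candidate_terms(documents: list[str], n: int = 2, min_count: int = 2) -> list[tuple[str, int]]:
    """Count n-word phrases across documents and keep those seen at least min_count times."""
    counts: Counter[str] = Counter()
    for doc in documents:
        words = re.findall(r"[a-z]+", doc.lower())
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            # Skip phrases that start or end with a stopword.
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            counts[" ".join(gram)] += 1
    return [(term, count) for term, count in counts.most_common() if count >= min_count]


docs = [
    "Air pollution is rising in the city.",
    "Strip mining adds to air pollution.",
    "The report covers air pollution and strip mining.",
]
print(candidate_terms(docs))
# [('air pollution', 3), ('strip mining', 2)]
```

Even in this toy form, the output is a list of candidates with counts and no definitions, scope, or relationships, which is why the results still need a reviewer.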
Often sold as Artificial Intelligence, machine learning can develop models based on selected content to tag new content as it is added or discovered. In essence, machine learning is “reading” the content and asking it what it is about. Sometimes, the resulting concepts are organized into navigable hierarchies. If machine tagging is the only method for applying concepts to content, there may be no human-friendly, navigable structure at all.
Machine learning is very useful for large quantities of content, pattern recognition, and other tasks which take people a lot of time and effort to perform. What machine learning is not good at is understanding notional concepts which are not deliberately selected to teach the algorithms. For instance, sarcasm, implied meanings, subtexts, extra-textual references, different versions of concept labels which mean the same thing (including abbreviations), and slang and jargon are just some of the stumbling blocks which trip up automated entity recognition, tagging, and taxonomy construction.
Like uncontrolled user tagging, machine learning can generate and apply a large number of concepts to very large sets of content, but there is a significant time investment in training machine learning on curated document sets to be accurate and precise.
Ask the UI
Content and tag creators must have a user interface in which to perform their work. This UI should allow for easy content creation, whether created directly in the UI or uploaded from another location, and for the easy application and creation of concept tags. Frequently, these UIs pull from a KOS in order to ensure consistent, vetted tags are applied to content. However, content can be tagged with inappropriate or outdated concepts if there is no mechanism to allow users to suggest concepts.
In the folksonomy years, users were allowed to add any concept they liked, with or without hashtags. As discussed above, this leads to large, inconsistent, and inaccurate tag clouds. What can we do to make sure the people who create content can also suggest concepts for more accurate tagging and taxonomy development?
One way to mitigate erroneous or duplicative user-generated tag entries is to provide a typeahead suggestion list of existing concepts in the tagging panel. The typeahead should support searching the typed text anywhere in a multi-word phrase. For example, if a user begins typing wate-, matching concepts such as water, heavy water, waterlogged, and inland waterways should all be returned as possible matches.
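The matching behavior described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the vocabulary is a plain list of labels held in memory; the function name and the extra non-matching term are made up for the example.

```python
def typeahead(query: str, concepts: list[str]) -> list[str]:
    """Return every concept containing the typed text, ignoring case."""
    needle = query.casefold().strip()
    if not needle:
        return []  # nothing typed yet, so suggest nothing
    return [concept for concept in concepts if needle in concept.casefold()]


vocabulary = ["water", "heavy water", "waterlogged", "inland waterways", "weather"]
print(typeahead("wate", vocabulary))
# ['water', 'heavy water', 'waterlogged', 'inland waterways']
```

Matching anywhere in the phrase, not only at the start, is what lets heavy water and inland waterways appear alongside water.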
Ideally, the number of times a concept has been used to tag content should be presented after the term in parentheses so users can use more popular concepts where appropriate and learn from the system. Over time, taggers learn that concepts used only a few times are frequently misspellings, variants, or less accurate. These terms then fall to the bottom in usage and can be identified as candidates for removal or replacement by more accurate options during routine taxonomy reviews.
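A small Python sketch of both ideas follows: showing the usage count next to each label, and listing rarely used concepts for the next taxonomy review. The usage figures and the cutoff of three uses are assumptions for illustration only.

```python
def label_with_usage(concept: str, usage: dict[str, int]) -> str:
    """Format a concept for display with its usage count in parentheses."""
    return f"{concept} ({usage.get(concept, 0)})"


def review_candidates(usage: dict[str, int], max_uses: int = 3) -> list[str]:
    """List concepts used at most max_uses times, least used first."""
    rare = [concept for concept, count in usage.items() if count <= max_uses]
    return sorted(rare, key=lambda concept: (usage[concept], concept))


# Hypothetical usage counts, invented for illustration.
usage = {"water": 120, "heavy water": 8, "watre": 1, "waterlogged": 2}
print(label_with_usage("water", usage))  # water (120)
print(review_candidates(usage))          # ['watre', 'waterlogged']
```

Low usage alone does not prove a concept is wrong. In this example, waterlogged is a legitimate but rare term, so the list is a starting point for review and not a deletion queue.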
If it is possible to set the sorting order within the typeahead, consider what will work best for your users. Should matching tags with more usage be presented at the top of the list regardless of where the typed word matches in the phrase? Or should concepts be presented in order by the quality of the match, with concepts which begin with the typed letters shown first, then concepts in which the typed letters match the middle of the concept, and so forth? Also, consider how many matches are possible and whether all or only some of them should be shown. If only some are shown, how do you decide which concepts should be presented first? If your cutoff is ten terms, you may be doing a disservice to the content taggers by not showing them all possible options.
Make It Easy, but Not Too Easy
If you allow your content taggers to suggest concepts, make sure the mechanism is easy, but not too easy. Users should be allowed to add concepts, but offer them suggestions based on what they enter. If they choose to add a concept because there is no suitable alternative, ask them to verify the concept label form, including any rules on language, capitalization, spelling, and the like.
You may also ask them to provide additional information, such as a brief note, the concept source or reference links, or other helpful information. You can even require them to add their name and email address so they are accountable for the concepts they suggest. While this may add some overhead in tag suggestion, it also forces users to consider what they are asking for and why. If they share the same goal of accurate content creation and tagging, it is to their benefit to suggest accurate concepts and describe their content well.
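A suggestion form along these lines might be checked as in the following Python sketch. The field names and the specific rules (no stray whitespace, no hashtag, a note and contact details required) are assumptions standing in for whatever label rules your organization adopts.

```python
from dataclasses import dataclass


@dataclass
class ConceptSuggestion:
    """Hypothetical suggestion form; the fields are illustrative."""
    label: str
    note: str
    source_url: str
    requester_name: str
    requester_email: str


def validate(suggestion: ConceptSuggestion) -> list[str]:
    """Return a list of problems for the requester to fix; empty means acceptable."""
    problems = []
    label = suggestion.label.strip()
    if not label:
        problems.append("label is required")
    elif label != suggestion.label or "  " in label:
        problems.append("label has stray whitespace")
    if label.startswith("#"):
        problems.append("label should not include a hashtag")
    if not suggestion.note.strip():
        problems.append("a brief note is required")
    if not suggestion.requester_name.strip():
        problems.append("requester name is required")
    if "@" not in suggestion.requester_email:
        problems.append("requester email looks invalid")
    return problems


good = ConceptSuggestion("Strip mining", "Needed for mining reports", "", "A. Tagger", "tagger@example.com")
bad = ConceptSuggestion("#strip  mining", "", "", "", "no-email")
print(validate(good))  # []
print(len(validate(bad)))  # 5
```

Returning every problem at once, instead of stopping at the first, keeps the form quick to correct while still making the requester think about what they are asking for.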
Make It a Request
Another possibility is to give users the option to request the concept. The ability to request concepts should be part of the tagging workflow so users don’t need to leave the screen they are working in. The request should be a formatted template and could be delivered by email (sent directly to the taxonomist as a named or role account) or through an existing request ticketing system like Jira or ServiceNow.
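For the email route, a formatted request could be assembled with Python's standard library as in this sketch. The addresses, the content identifier, and the template fields are hypothetical, and the message is only built here, not sent. A ticketing integration would need that system's own API, which is not shown.

```python
from email.message import EmailMessage


def build_concept_request(label: str, content_id: str, justification: str,
                          requester: str, taxonomist: str = "taxonomy@example.com") -> EmailMessage:
    """Fill a fixed template so every request reaches the taxonomist in the same shape."""
    msg = EmailMessage()
    msg["From"] = requester
    msg["To"] = taxonomist  # hypothetical role account
    msg["Subject"] = f"Concept request: {label}"
    msg.set_content(
        f"Requested concept: {label}\n"
        f"Content awaiting tag: {content_id}\n"
        f"Justification: {justification}\n"
    )
    return msg


request = build_concept_request(
    "strip mining", "DOC-42", "No existing concept covers surface mining.", "tagger@example.com"
)
print(request["Subject"])  # Concept request: strip mining
```

Including the content identifier in the template gives the requester and the taxonomist a shared record of which item is waiting on the new concept.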
One drawback to this method is that the requested concept is not immediately available for use. Content creators must then be responsible for tracking their request and going back to tag the content once the concept becomes available. A taxonomy management system integration with the content tagging platform may allow for the concept to be applied to the content, suggested to the taxonomist, and then verified and retained or removed once the concept is reviewed.
Have a Target
While some taxonomists balk at the idea of allowing end users to suggest concepts, if managed and governed well, it makes taxonomy development easier and faster. Like crowdsourcing and folksonomies, concepts are suggested from many users and added to a backlog for review. The manual suggestions tend to be more accurate and come in at a steady pace as compared to automatic term extraction of target document sets which potentially identifies many concepts with or without textual context.
Don’t ask the end users to be taxonomists; have a target instead. This target could be a separate taxonomy or a top level category called General, Miscellaneous, or Other which includes only suggested concepts. Just kidding! A taxonomist will never go for that. How about Suggested Concepts or Pending Concepts instead? It is then up to the taxonomist(s) to review and make decisions about the suggested concept. Concept review and approval may involve adding attributes and relationships and moving the concept within or between vocabularies. Once it has been reviewed, the concept is then officially approved and published for use by all taggers.
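The suggest, review, and publish flow can be modeled as a small state change on each concept. The Python sketch below is one possible shape under stated assumptions: the class names, the three statuses, and the in-memory storage are all invented for illustration and do not represent any taxonomy management system.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class Concept:
    label: str
    status: Status = Status.PENDING
    scope_note: str = ""


@dataclass
class Vocabulary:
    concepts: dict[str, Concept] = field(default_factory=dict)

    def suggest(self, label: str) -> Concept:
        """Add a suggestion to the pending backlog unless the concept already exists."""
        return self.concepts.setdefault(label.casefold(), Concept(label))

    def review(self, label: str, approve: bool, scope_note: str = "") -> Concept:
        """Record the taxonomist's decision and any attributes added during review."""
        concept = self.concepts[label.casefold()]
        concept.status = Status.APPROVED if approve else Status.REJECTED
        concept.scope_note = scope_note
        return concept

    def with_status(self, status: Status) -> list[str]:
        """Return the labels currently in the given status, sorted alphabetically."""
        return sorted(c.label for c in self.concepts.values() if c.status is status)


vocab = Vocabulary()
vocab.suggest("Strip mining")
vocab.suggest("Rising sea levels")
vocab.review("strip mining", approve=True, scope_note="Surface mining of mineral seams.")
print(vocab.with_status(Status.APPROVED))  # ['Strip mining']
print(vocab.with_status(Status.PENDING))   # ['Rising sea levels']
```

Only approved concepts would be offered to all taggers. Everything else stays in the Pending Concepts backlog until the taxonomist decides.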
In summary, while folksonomies and crowdsourcing may have their drawbacks, developing taxonomies over time based on end user concept suggestions leads to a more content-centric and accurate vocabulary.