Taxonomy Transformation

Legacy in Language

There are potential or new clients who come to us with a definitive version of their taxonomy, typically in a spreadsheet format like CSV or Excel. The file might need a little formatting, but it is more or less ready for import into a dedicated taxonomy management system.

Then there’s everyone else.

Typically, before an organization is even ready to consider which taxonomy management software vendors to invite for evaluations and trials, it is plumbing the depths of its information (lake, sea, black hole…pick your metaphor) for controlled vocabularies. If there has been no taxonomist, centralized taxonomy program, or other party responsible for developing and managing one or more enterprise taxonomies, chances are there are multiple locations and document versions in which to look for terminology.

Let’s talk about where to look for legacy vocabularies, methods for consolidation, and how to move from all those different sources to a final taxonomy ready for additional development and deployment.

Vocabulary Sources

The most obvious place to start with vocabularies is in any systems which were purpose-built—or near purpose-built—to manage taxonomies. If you’re lucky, you have a legacy taxonomy management system (TMS) which can export the vocabularies in one of several useful formats like CSV, Excel, RDF-XML, Turtle, or similar. A legacy TMS could have been abandoned when a role was vacated or eliminated or when budget constraints forced the system to be retired. Whatever the case, an existing system will provide the easiest transition to a new TMS.

If there were no prior dedicated taxonomy management systems, look to your current and legacy content, digital asset, or web content management systems (CMS, DAM, WCM). Did anyone do some work to create taxonomies in SharePoint Term Store? What about controlled metadata lists or schemas? Are there systems that managed metadata in one form or another, and what was that metadata used for? Check those systems for navigational structures, both for internal and external use. Can the navigational structures be exported and consolidated? Dig deeper. What about search logs? Despite being unorganized and unstructured, they can be a gold mine for the types of concepts end users are really searching for.

The good news is that while you may have to check several different systems and conduct a knowledge audit for existing systems, once you find them, you can probably export the values into useful formats. The hard work will be finding and extracting concepts from other document types. Imagine someone made a taxonomy in a spreadsheet. Great! Spreadsheets are a very good place to start working with taxonomies if there is no taxonomy management software. They are relatively easy to manipulate and can be modified to include additional information such as concept values, definitions, identification numbers, etc.

Imagine, however, that someone built a glossary of terms and definitions in a word processing format (most likely Word). You have to get that information out of the document and into a spreadsheet which requires more than just cut and paste. Worse (though they did it for the better), they shared that document with others who found it useful. Those people made additional versions…and maybe added a few more terms and definitions or modified what was already there. Maybe a few people copied this document and then put it in PDF format so they could share it without others changing it. Now there are many different versions of a glossary with mostly the same, but not completely the same, terms and definitions. All of these vocabulary documents, in various formats and with multiple versions, are scattered across the information landscape in one or more systems throughout the organization.

Concept Cartography

In this search for definitive vocabularies, you are both discovering and mapping. You are an explorer, discoverer, and surveyor. You are performing a concept cartography in which you are surveying the information landscape of systems and content and trying to inventory and map your findings. Welcome to the undiscovered country! You’ve just become a cartographer mapping potentially uncharted geographies and, yes, finding monsters at the edge of the known world. What is undiscovered for you may very well be known to others. Still, you are mapping the known to the unknown and creating a topography of the information landscape.

Some organizations task themselves with this knowledge audit, while others hire consultants to perform the work and potentially even create a unified taxonomy from the results. Whichever way you go, consolidating and mapping your findings can be challenging. There are several methods, or combinations of methods, you can use.

The first is manual. This includes finding the best examples of your vocabularies, cutting and pasting them into a master spreadsheet, and then using spreadsheet tools like finding duplicates, concatenating or separating string values, and other tricks to work toward a consolidated taxonomy master. While the manual route is potentially tedious, there are some advantages. One is that one or more people become familiar with the types of concepts which are of value to the organization. By virtue of becoming submerged in the content, they become expert navigators of the landscape. These are the true trailblazers. Another advantage is that the work is centralized to one or a few people who can make quick decisions about the taxonomy development, including what to keep, what to throw out, and what to deal with later.
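The spreadsheet tricks described above (finding duplicates, cleaning up string values) can also be sketched in a few lines of code. The following is a minimal illustration, not a prescribed method: the source names and labels are hypothetical, and the only normalization assumed is lowercasing and collapsing whitespace.

```python
def normalize(label):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(label.lower().split())


def consolidate(sources):
    """Merge {source_name: [labels]} into one master dict keyed by normalized label."""
    master = {}
    for source_name, labels in sources.items():
        for label in labels:
            key = normalize(label)
            if not key:
                continue  # skip blank cells
            # Keep the first spelling seen as the working label.
            entry = master.setdefault(key, {"label": label.strip(), "sources": []})
            if source_name not in entry["sources"]:
                entry["sources"].append(source_name)
    return master


# Hypothetical inputs: labels copied out of two legacy sources.
sources = {
    "hr_glossary.docx": ["Remote Work", "Parental Leave", "remote  work "],
    "sharepoint_term_store": ["parental leave", "Expense Report"],
}

master = consolidate(sources)
for key, entry in sorted(master.items()):
    print(entry["label"], "<-", ", ".join(entry["sources"]))
```

Recording which sources each label came from keeps the "what to keep, what to throw out" decisions traceable back to the original documents.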

The second is semi-automated. I say “semi” because even with the best tools, there will inevitably be human-in-the-loop intervention to clean up the results. Semi-automated methods can include text analytics systems. Gather your vocabulary documents and run them through a text analytics tool and see what you get. Some of these tools claim they can even build taxonomies from the results, but the end products vary in quality. Regardless, the time saved with a tool which builds even a not very good taxonomy might well be worth the effort to take that result and manipulate it into a higher quality vocabulary.

What text analytics tools are good at is clustering and mapping: finding concepts which are similar and grouping them. Search logs are a good case. The many ways users typed their queries fall all over the alphabet, but text analytics software can consolidate them rapidly. For instance, “can I work remotely”, “remote working policy”, and “work from home policy” are all asking the same thing, but they have only a few words in common and fall in different places alphabetically in what is likely a very long list of searches. Text analytics software can pick apart these multi-term search queries from several different file types and cluster them into sensible “buckets” which can then be mapped to a single taxonomy concept with or without alternative labels (synonyms).
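To make the "bucket" idea concrete, here is a toy sketch of query clustering. It is a deliberately simple stand-in for what commercial text analytics software does, not a description of any product: the stopword list, the crude suffix stripping, the Jaccard word-overlap measure, and the 0.5 threshold are all assumptions chosen for illustration.

```python
import re

# Hypothetical stopword list; a real tool would use a much fuller one.
STOPWORDS = {"can", "i", "from", "the", "a", "to", "how", "do"}


def stem(token):
    """Very crude suffix stripping; real tools use proper linguistic analysis."""
    for suffix in ("ing", "ly", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def keywords(query):
    """Reduce a query to a set of stemmed, non-stopword tokens."""
    tokens = re.findall(r"[a-z]+", query.lower())
    return {stem(t) for t in tokens if t not in STOPWORDS}


def jaccard(a, b):
    """Share of words two keyword sets have in common (0.0 to 1.0)."""
    return len(a & b) / len(a | b) if a | b else 0.0


def cluster(queries, threshold=0.5):
    """Greedy grouping: a query joins the first bucket holding a similar-enough member."""
    buckets = []
    for query in queries:
        for bucket in buckets:
            if any(jaccard(keywords(query), keywords(m)) >= threshold for m in bucket):
                bucket.append(query)
                break
        else:
            buckets.append([query])
    return buckets


queries = [
    "can I work remotely",
    "remote working policy",
    "work from home policy",
    "submit expense report",
    "expense report form",
]
for bucket in cluster(queries):
    print(bucket)
```

With these inputs the three remote-work queries land in one bucket and the two expense queries in another. Each bucket is a candidate concept, and the individual queries are candidate alternative labels for a person to review.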

Another semi-automated method is to purchase a TMS, load up your resulting spreadsheet(s), and clean up the results using dedicated software functionality. For example, a taxonomy tool can perform duplicate concept searches (with near or exact matching), create general or custom mapping relationships between concepts using drag and drop or typeahead, and support the moving of concepts within and between schemes. Taxonomy systems also allow for replicating and versioning schemes so that multiple builds can be tried and accepted or rejected. These are just some of the features allowing for the clean-up, manipulation, and finalization of one or more vocabularies. While it can be difficult to make the case for a dedicated TMS before the final taxonomy (or taxonomies) is complete, the functionality in the system may speed the process of taxonomy development significantly.
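Near matching of duplicate concepts can be approximated with a string-similarity score. The sketch below uses `SequenceMatcher` from Python's standard `difflib` module. It is only an illustration of the idea, not how any particular TMS implements it, and the labels and the 0.9 threshold are assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations


def near_duplicates(labels, threshold=0.9):
    """Return label pairs whose similarity ratio meets the threshold."""
    pairs = []
    for a, b in combinations(labels, 2):
        score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if score >= threshold:
            pairs.append((a, b, round(score, 2)))
    return pairs


# Hypothetical labels loaded from a consolidated spreadsheet.
labels = ["Human Resources", "Human Resource", "Expense Report", "Remote Work"]

for a, b, score in near_duplicates(labels):
    print(f"Review: {a!r} vs {b!r} (similarity {score})")
```

The output is a review list rather than an automatic merge. Labels that look alike can still mean different things, so a person should make the final call.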

Chances are high that your process will include a healthy mix of semi-automated processing and manual work. The result will hopefully be a navigable map.

Your Organizational Map

Your organizational map will not just be a concept cartography resulting in a taxonomy that maps and structures all of the nouns and concepts of importance; it will also be an archaeology of knowledge and a mapping of your conceptual domain. What these maps reveal is not only the concepts, but also the relationships between concepts. At a high level, the key to your organizational map is an ontology that is much like a legend. If the map is the landscape itself (concepts as places and landmarks), the legend is the ontology explaining how everything on the map fits together (verbs as relationships). The concepts are all the many points of interest on the map while the legend tells you how to read the map and what rules the map follows.

The result of your cartography is a complete organizational map. The ontology tells you what things can be on the map and how they are related. The taxonomy is made up of the values themselves. Both components are supported by the standards developed to guide building and maintaining ontologies and taxonomies and are expressed simply in RDF as S-P-O (subject-predicate-object or subject-verb-object, just like in many languages).
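To illustrate the S-P-O pattern, the sketch below stores a few statements as plain Python tuples. The predicate names are borrowed from SKOS, the W3C vocabulary commonly used for taxonomies, which is an assumption here since the text does not name a specific standard. The `ex:` concept identifiers and labels are hypothetical.

```python
# Each statement is one (subject, predicate, object) triple.
triples = [
    ("ex:RemoteWorkPolicy", "rdf:type", "skos:Concept"),
    ("ex:RemoteWorkPolicy", "skos:prefLabel", "Remote work policy"),
    ("ex:RemoteWorkPolicy", "skos:altLabel", "Work from home policy"),
    ("ex:RemoteWorkPolicy", "skos:broader", "ex:HRPolicy"),
    ("ex:HRPolicy", "rdf:type", "skos:Concept"),
    ("ex:HRPolicy", "skos:prefLabel", "HR policy"),
]


def objects(subject, predicate):
    """Return every object stated for a given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


# Reading the map: what sits above "Remote work policy", and what else is it called?
print(objects("ex:RemoteWorkPolicy", "skos:broader"))
print(objects("ex:RemoteWorkPolicy", "skos:altLabel"))
```

In the map metaphor, the subjects and objects are the places on the map, and the predicates are the legend entries that say how those places relate.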

You’ve mapped the whole landscape! Take these results and load them into your GPS. In this case, the GPS is a taxonomy and ontology management system which provides centralized guidance for many applications like search, content tagging, and analytics dashboards, just to name a few. Once in a dedicated system, you can maintain and expand your map to other regions (domains of knowledge).

Who knew that moving from legacy vocabularies to a dedicated management system would make you an explorer and mapmaker in the process?

Author Ahren Lehnert
