
I am not a fan.

A taxonomist working on any taxonomy of sufficient size and scope will encounter disambiguation issues. These occur when the label for a term isn’t clear or unique enough to describe the concept unambiguously in the context of the vocabulary’s domain(s); that is, the same word (text string) is needed to represent two or more different concepts.

The root problem, of course, is that language is ambiguous, especially without the context provided by other words in a conversation, document, or subject area. English alone has some 170,000 words (depending on how you count them; if you include variants the number is more like 600,000); compare French, with about 60,000. But even given this abundance of semantic riches, we still use the same words to describe different concepts in different contexts.

Some well-worn examples will suffice; the reader is invited to peruse Wikipedia disambiguation pages for further examples.*

Bridge can refer to dental work, civil engineering structures, card games, metaphorical connections, any number of people, places, films, art objects, and so forth (see https://en.wikipedia.org/wiki/Bridge_(disambiguation)).

Mercury, a proper noun, has almost as many meanings: it can refer to a planet, a car brand, a chemical element, a Roman god, a programming language, and (again) any number of brands, companies, people, and creative works (see https://en.wikipedia.org/wiki/Mercury).

Most taxonomies describe one or more domains in which words have relatively unambiguous meanings; any particulars can be clarified in Scope Notes or other definition-type fields. In a taxonomy describing concepts in civil engineering there’s little chance that a dental bridge or card game is the intended meaning of the term Bridge (although chances are high that multiple types of bridges will be represented) so disambiguation for this term is not required.

Multi-domain and very large taxonomies, however, will require disambiguation strategies and methodologies. Additionally, as taxonomies grow and expand to cover additional content or use cases, taxonomists must be on the lookout for disambiguation problems. An astronomy thesaurus may not require disambiguation for Mercury but expanding the domain to include physics or chemistry will necessitate some way to differentiate the planet from the silvery metallic element.
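One rough way to stay on the lookout is to check for labels that turn up in more than one domain before merging vocabularies. The sketch below is only an illustration: the `find_collisions` helper and the tiny sample vocabularies are hypothetical, and it assumes each domain is available as a flat list of label strings.

```python
from collections import defaultdict


def find_collisions(vocabularies):
    """Return labels used in more than one domain, with those domains.

    `vocabularies` maps a domain name to a list of term labels.
    Labels are compared case-insensitively.
    """
    domains_by_label = defaultdict(set)
    for domain, labels in vocabularies.items():
        for label in labels:
            domains_by_label[label.casefold()].add(domain)
    # Keep only labels claimed by two or more domains.
    return {
        label: sorted(domains)
        for label, domains in domains_by_label.items()
        if len(domains) > 1
    }


# Hypothetical sample data: an astronomy thesaurus expanded to cover chemistry.
sample = {
    "astronomy": ["Mercury", "Venus", "Mars"],
    "chemistry": ["Mercury", "Gold", "Lead"],
}
collisions = find_collisions(sample)
```

With the sample data, only Mercury is flagged as needing disambiguation; Venus and Gold each live in a single domain and can keep their plain labels.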

Keeping in mind three important taxonomic principles:

  1. Term labels must be unique;
  2. The label for a term should unambiguously describe the concept without reference to the hierarchy; and
  3. Insofar as possible, term labels should be expressed in natural language (that is: as it might be commonly encountered “in the wild” in speech or writing);

…we can identify two common strategies for disambiguating terms.

 

Disambiguation Strategy 1: Parenthetical Disambiguation

This technique, recommended explicitly in the ANSI/NISO Z39.19 standard and widely used by working taxonomists, involves clarifying the context of the ambiguous term in parentheses after the term name as part of the label. For example:

  • Mercury (planet) or Mercury (astronomy)
  • Mercury (chemistry) or Mercury (chemical element)
  • Mercury (automobile make)
  • Mercury (Roman deity)
  • Bridge (dentistry)
  • Bridge (civil engineering)
  • Bridge (card game)

…and so on.

This approach has advantages: it is concise, the format is easy to grasp and memorize, and it provides a simple, at-a-glance way for users to understand the concept being communicated. 

Parenthetical disambiguation, however, does not express concepts in natural language, which is contrary to a basic taxonomic principle (although, admittedly, it is an exception that the standards expressly allow).

Moreover, it is detrimental to text processing: term labels with parentheses appended require special treatment before they can be matched in text or used in the other NLP-related operations found in automatic categorization logic. That is to say: nowhere in your document will a machine categorization engine find “Bridge (dentistry)” as a text string to be matched.
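To make that special treatment concrete, here is a minimal sketch, assuming labels follow the “Term (qualifier)” pattern shown above. The `split_label` and `find_candidates` helpers are hypothetical names of my own, not the API of any categorization product.

```python
import re

# Matches a trailing parenthetical qualifier such as " (dentistry)".
QUALIFIER = re.compile(r"\s*\(([^()]*)\)\s*$")


def split_label(label):
    """Split 'Bridge (dentistry)' into ('Bridge', 'dentistry').

    Labels without a trailing parenthetical return (label, None).
    """
    match = QUALIFIER.search(label)
    if match is None:
        return label.strip(), None
    return label[:match.start()].strip(), match.group(1).strip()


def find_candidates(labels, text):
    """Return the labels whose base term occurs in `text` as a whole word."""
    hits = []
    for label in labels:
        base, _qualifier = split_label(label)
        pattern = r"\b" + re.escape(base) + r"\b"
        if re.search(pattern, text, re.IGNORECASE):
            hits.append(label)
    return hits


labels = ["Bridge (dentistry)", "Bridge (card game)", "Mercury (planet)"]
text = "The patient was fitted with a bridge after the extraction."

literal_hit = "Bridge (dentistry)" in text   # the full label never occurs in prose
candidates = find_candidates(labels, text)   # both Bridge labels match
```

Note what happens: stripping the qualifier makes the term findable, but it also brings the ambiguity straight back, since both Bridge labels now match the same sentence. The qualifier still has to be used somehow (as a context hint, for instance) to pick the right concept.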

Parenthetical disambiguation also tends to increase the character count of disambiguated terms, which can be an issue for display in certain applications.

This strategy may have been logical in the era of print-focused taxonomies, but with the advent of taxonomy software it makes little sense. It is also clumsy and looks awful:

  • Astronomy
    • Planetary science
      • Planets
        • Earth
        • Jupiter
        • Mars
        • Mercury (planet)
        • Neptune
        • Saturn
        • Uranus
        • Venus

 

Disambiguation Strategy 2: Label Wrangling

To avoid parentheticals, one can instead change the term labels themselves, making them more descriptive with something closer to natural language to achieve the same ends.

Sometimes this works out nicely:

 

Dairy (farm)

Dairy (products)

 

…can become, with little fuss and good results:

 

Dairy farms

Dairy products

 

Sadly, this does not always work and can result in silly alphabetization and otherwise bizarre-looking hierarchies; although it seems reasonable that “Planet Mercury” is a fair enough term label, it might result in something like:

 

  • Planets
    • Earth
    • Jupiter
    • Mars
    • Neptune
    • Planet Mercury
    • Saturn
    • Uranus
    • Venus

This result, besides looking terrible, has also altered the order of the terms as Mercury is now sorted under “P”.
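The shift is easy to reproduce with a plain alphabetical sort. This is just a small illustration, assuming siblings are ordered by their label strings; real taxonomy tools may sort differently.

```python
planets = ["Earth", "Jupiter", "Mars", "Neptune", "Saturn", "Uranus", "Venus"]

# Parenthetical label: the qualifier trails the term, so it still sorts under "M".
parenthetical = sorted(planets + ["Mercury (planet)"])

# Wrangled label: the leading word "Planet" moves the term under "P".
wrangled = sorted(planets + ["Planet Mercury"])

position_parenthetical = parenthetical.index("Mercury (planet)")
position_wrangled = wrangled.index("Planet Mercury")
```

The parenthetical label stays between Mars and Neptune, while the wrangled label lands between Neptune and Saturn.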

Other similar attempts to avoid parentheticals can result in straight-up silly term formation; while “Dental bridges” is fine, the following are undesirable:

 

Civil engineering bridges

Bridges in civil engineering

Bridge, the card game

Card game of bridge

 

None of these are very good; they require either a prepositional phrase or an unnatural-language expression of the term, resulting in the same text-processing issues faced by parenthetical solutions.

In the end, the “rejigger the term label” approach is no better than parentheticals as a universally applicable technique for disambiguation. It seems even worse to mix and match these strategies (rejigger the term if you can, else use parenthetical disambiguation) as there is great potential for a confusing mess–and, after all, making confusing messes understandable is our business.

Sadly, I have no panacea to recommend. I tend to solve these problems on a case-by-case basis with the goals of clean term formation and natural language whenever possible, resorting to one or the other technique above when absolutely necessary.

———————————

*Be sure to check out the outstanding Twitter account @wikishoutouts, which delightfully automates shout-out Tweets using information from Wikipedia disambiguation pages.