Taxonomy Theory and Practice

1. Theory

In information science (related to but distinct from generalized information theory, which is much more mathematical and concerned with, for example, signals), taxonomy theory is more or less dictated by commonly accepted standards.

These standards describe formalized guidelines for, say, determining whether a given proposed Broader-Narrower (BT-NT) hierarchical relationship is valid. In theory, Term A is a valid BT of Term B (and therefore Term B a valid NT of Term A) if and only if:

(1) B is a part of A (as Albany is part of New York, or an Engine is part of a Car)

(2) B is an instance of A (as the Sears Tower is an instance of a Skyscraper, or the Red Sox are an instance of a Baseball Team)

(3) B is a subgenre of A (as Algebra is a subgenre of Math, or a Tree is a subgenre of Plants)

In hierarchical fashion, these examples can be represented thus (with a little term formatting and additional hierarchical context for clarity):

United States

-New York

--Albany

Vehicles

-Automobiles

--Automobile engines

Buildings

-Skyscrapers

--Sears Tower

Sports teams

-Baseball teams

--Major League Baseball teams

---Boston Red Sox

Mathematics

-Algebra

-Geometry

-Number theory

Eukaryotes

-Plants

--Trees

All of these guidelines, to generalize, follow the “is a” or All-Some rule: B is a valid NT of A if every B is a(n) A.
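
One way to make the All-Some rule concrete is to treat each term as the set of things it names, so that the rule reduces to a subset check. The following is only a minimal sketch under that assumption: the sample members and the function name `is_valid_nt` are invented for illustration and do not come from any standard.

```python
# Illustrative only: each term is modeled as the set of things it names.
# The sample members are assumptions made up for this example.
MEMBERS = {
    "Buildings": {"Sears Tower", "Empire State Building", "Monticello"},
    "Skyscrapers": {"Sears Tower", "Empire State Building"},
}


def is_valid_nt(narrower: str, broader: str) -> bool:
    """All-Some rule: every member of the narrower term belongs to the broader term."""
    return MEMBERS[narrower] <= MEMBERS[broader]


# Every skyscraper is a building, so Skyscrapers is a valid NT of Buildings.
print(is_valid_nt("Skyscrapers", "Buildings"))
# Not every building is a skyscraper, so the reverse relationship fails.
print(is_valid_nt("Buildings", "Skyscrapers"))
```

Read the other way around, the same check explains the name of the rule: all Skyscrapers are Buildings, but only some Buildings are Skyscrapers.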

Of these three guidelines, it’s the last one that gets tricky, as it seemingly provides license to group things together on a looser basis than (1) and (2) allow. Specifically, it permits grouping things together by topic, which can produce plausible but questionable structures. Consider the following hierarchical fragment:

Economics

-Macroeconomics

-Microeconomics

--Consumer demand

--Opportunity cost

While Microeconomics and Macroeconomics are clearly subfields (in our parlance, subgenres) of Economics (as all Macroeconomics is Economics, and Micro- and Macroeconomics are established subfields in academic departments, etc.), it’s less clear that “Consumer demand is a Microeconomics”.

At first this seems reasonable: the concept of Consumer demand arose in Microeconomics, where it is an important and widely studied notion. Besides, where else would you put it in your hierarchy?

2. Practice

As in many fields, theory and practice diverge when theory collides with reality (in this case, practical applications). External factors like user testing, subject matter expert input, and business requirements can disrupt a well-intentioned, standards-compliant hierarchical structure.

To use one of my favorite examples: If I’m designing a website for a pet store, I don’t care that Dog food isn’t a Dog as long as people can find what they’re looking for.

In taxonomies designed primarily for information tagging and retrieval, the Microeconomics/Consumer demand issue is not really a problem. In fact, we might justify this choice using something like the following reasoning:

Term B is a valid NT of Term A if all content about Term B is about the broader topic of Term A

That is to say: if I do a search for documents related to Microeconomics, would I expect content about Consumer demand to be included in the result set? Put another way: are all papers about Consumer demand about Microeconomics?

By this slightly altered criterion we can assign relevant topics studied in a field as NTs of the field (and, therefore, interdisciplinary fields may belong in several branches of a taxonomy).
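
The retrieval criterion above can be sketched as a search that expands a term to include all of its narrower terms. This is a minimal illustration, not a description of any particular search system: the document identifiers, the single-tag-per-document simplification, and the function names `expand` and `search` are all assumptions.

```python
# Hypothetical fragment of the hierarchy: each term maps to its narrower terms.
NARROWER = {
    "Economics": ["Macroeconomics", "Microeconomics"],
    "Microeconomics": ["Consumer demand", "Opportunity cost"],
}

# Hypothetical documents and the one term each was tagged with.
DOC_TAGS = {
    "paper-1": "Consumer demand",
    "paper-2": "Macroeconomics",
    "paper-3": "Microeconomics",
}


def expand(term):
    """Return the term together with all of its descendants in the hierarchy."""
    terms = {term}
    for child in NARROWER.get(term, []):
        terms |= expand(child)
    return terms


def search(term):
    """Return documents tagged with the term or with any of its narrower terms."""
    wanted = expand(term)
    return sorted(doc for doc, tag in DOC_TAGS.items() if tag in wanted)


print(search("Microeconomics"))  # ['paper-1', 'paper-3']
```

Under this sketch, a paper tagged only with Consumer demand turns up in a search for Microeconomics, which is the behavior the altered criterion is meant to justify.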

3. Practice as Theory

Let us consider the microeconomics and dog food examples for a moment. In both cases, the logic of the hierarchy is dictated by objects external to the taxonomy: content or products described by the vocabulary, which is used to classify or categorize the products/content for retrieval.

This formulation, however, leaves out an important aspect: the retrieval of products/content is done by a person. Whatever abstract rules (the discourse of information science) govern the hierarchical structure of idealized taxonomies must make room for the discourse of the user.

Taxonomies for information retrieval, as well as taxonomies for website navigation and product classification (which are related but not always the same), must take into account where people will look for things (if, indeed, the goal is for products/content to be found) when considering how to arrange a useful, implementable hierarchical structure.

The downside of loosening the BT-NT restrictions is the conflation of topics and concepts (discussed in a previous post) leading to irrational or cumbersome hierarchies.

Consider a small taxonomy about Sports. By the standards, we might expect to find a list of sports, or categories of sports, as NTs of Sports.

Sports

-Individual sports

-Olympic sports

-Team sports

-Winter sports

Or, alternatively, but just as reasonably:

Sports

-Badminton

-Baseball

-Basketball

-Gymnastics

-Lacrosse

-Tennis

However, according to our new logic outlined above, we can also include Topics about sports:

Sports

-Athletes

-Sporting equipment

-Sporting events

-Sports leagues

-Sports medicine

This is all well and good until we combine the NTs, all of which we have decided are valid:

Sports

-Athletes

-Badminton

-Baseball

-Basketball

-Gymnastics

-Individual sports

-Lacrosse

-Olympic sports

-Sporting equipment

-Sporting events

-Sports leagues

-Sports medicine

-Team sports

-Tennis

-Winter sports

This is a bit of a mess, and no longer seems intuitive. We have mixed the concept of Sports (containing different sports, or types of sports) with the topic of Sports (which includes stuff like equipment and adjacent fields like sports medicine).
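
To see the two senses side by side, the NTs can be labeled by the kind of relationship each has to Sports and then grouped. This is only an illustrative sketch: the labels "kind" and "topic" and the function name `split_by_link_type` are assumptions introduced here, and labeling links is just one of several possible approaches.

```python
# Each NT of Sports is labeled with a hypothetical link type:
# "kind"  = a sport or type of sport (the concept of Sports)
# "topic" = subject matter about sports (the topic of Sports)
SPORTS_NTS = {
    "Athletes": "topic",
    "Badminton": "kind",
    "Baseball": "kind",
    "Basketball": "kind",
    "Gymnastics": "kind",
    "Individual sports": "kind",
    "Lacrosse": "kind",
    "Olympic sports": "kind",
    "Sporting equipment": "topic",
    "Sporting events": "topic",
    "Sports leagues": "topic",
    "Sports medicine": "topic",
    "Team sports": "kind",
    "Tennis": "kind",
    "Winter sports": "kind",
}


def split_by_link_type(nts):
    """Group terms by link type, keeping each group in alphabetical order."""
    groups = {"kind": [], "topic": []}
    for term, link_type in sorted(nts.items()):
        groups[link_type].append(term)
    return groups


groups = split_by_link_type(SPORTS_NTS)
print(groups["kind"])   # sports and types of sports
print(groups["topic"])  # things that are about sports
```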

There are various ways to clean up this hierarchy, but the principles here, for product and retrieval taxonomies, are:

(1) You have to take into account where people will look for things, but

(2) This can cause a mess

Slightly paraphrased, we might say that the two laws governing the construction of taxonomies of these types, which are often in conflict with each other, are:

(1) What is it? (The discourse of information science)

(2) Where will people look for it? (The discourse of the user)


4. Coda

The implications of this for the construction of most product and information retrieval taxonomies are minimal; the taxonomist will find ways (working with the search, design, content/product, and other teams) to present a useful structure within the competing constraints discussed above. As long as the vocabularies are being used for organizing and presenting content/products, the compromise (compromised?) vocabulary structure is of little consequence.

This all changes once we are using taxonomies for inference. Feeding data tagged from a taxonomy as input for AI or ML presupposes a certain (that is: strictly standards-compliant) rigidity of the structure. Algorithms rely on rigorously structured input to produce good output, and less strictly structured input will produce poor output if we, as part of the input file, tell the system that dog food is a dog. That is to say: the algorithm doesn’t care where people will look for things; it does not participate in the discourse of the user, so vocabularies used for this purpose need to be constructed with this in mind.
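
The dog food problem can be sketched with a naive reasoner that reads every broader-term link as a strict “is a” statement. This is a minimal illustration only: the Mammals and Animals terms and the function name `inferred_ancestors` are assumptions added for the example, and real inference systems are considerably more involved.

```python
# Hypothetical broader-term links, read by a machine as strict "is a" statements.
BROADER = {
    "Dog food": "Dogs",   # placed here so shoppers can find it
    "Dogs": "Mammals",
    "Mammals": "Animals",
}


def inferred_ancestors(term):
    """Follow broader-term links upward, as a naive is-a reasoner would."""
    ancestors = []
    while term in BROADER:
        term = BROADER[term]
        ancestors.append(term)
    return ancestors


print(inferred_ancestors("Dog food"))  # ['Dogs', 'Mammals', 'Animals']
```

A link that helps a shopper navigate leads the reasoner to conclude that dog food is a mammal and an animal, because it has no way of knowing the link was placed for findability rather than as a statement of fact.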

So, different taxonomies need to maintain different rules and guidelines for their construction. Idealized taxonomy guidelines need to participate in the discourse of the user as well as follow the standards whenever possible; other vocabularies used for projects on the computer science side of the ledger require the opposite approach.

You can follow Bob Kasenchak on Twitter.

@taxobob
