The Source of the Source

The Information Ecosystem

I like to think of information systems as ecosystems in which technology, people, and processes interact in a (hopefully) symbiotic relationship. In such an ecosystem, information is managed according to governance processes throughout its lifecycle, from creation until archiving or deletion. Content is tagged with meaningful and accurate metadata upon creation or ingestion; made findable through navigation, search, and push and pull notifications; versioned and managed through its active life; and then archived or deleted according to records management policies or industry best practices.

Like a natural ecosystem, things fall apart when an information ecosystem is thrown out of balance. Perhaps an apex predator (a competing or rogue system?) grows its numbers and eats itself or its competition into starvation. Perhaps an invasive species (new or untrained employees?) is introduced. Perhaps there is an environmental change (workaround processes?) thwarting the current functioning processes.

In the information ecosystem, taxonomies are a source of truth. If taxonomies are a lake from which we draw our truth, then what happens when the source of the lake is contaminated? When the river runs foul, the content of the lake is suspect. When the lake feeds downstream systems, they are also polluted. The entire ecosystem can be corrupted with the introduction of unwanted elements.

The Semantic Ecosystem

The deeper level of the information ecosystem is the semantic ecosystem. At the semantic level, the concepts and relationships used in information provide context and meaning. Taxonomies are more than just labels used for tagging content with metadata: the labels have meanings, both overt and enigmatic.

How can it be possible that a taxonomy, which seemingly has no agenda, can introduce bias into the information and semantic ecosystem? I’ve previously written about this kind of bias in machine learning and taxonomies. The main issue is not usually the technology, but the people and processes involved in creating, maintaining, and implementing taxonomies.

In some cases, there is concept shift over time which causes meaning to slip. For example, concepts describing race have shifted greatly over the past 50 years. Terms Martin Luther King, Jr. used in his speeches would not necessarily be used in conversation today. If these were indexing terms at one time, they should be updated to the contemporary equivalent while maintaining a link to, and a record of, the original indexing language.
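One way to picture keeping that link is a concept record which remembers its retired labels. The following is only a minimal sketch: the Concept class, its field names, and the sample terms are hypothetical, and a real taxonomy management system will have its own mechanism for this.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Concept:
    """A concept whose preferred label can change without losing its history."""
    preferred_label: str
    # Each entry is (retired label, date retired, note on why it changed).
    former_labels: list = field(default_factory=list)

    def relabel(self, new_label: str, retired_on: date, note: str) -> None:
        """Record the current label as retired, then adopt the new one."""
        self.former_labels.append((self.preferred_label, retired_on, note))
        self.preferred_label = new_label

    def matches(self, term: str) -> bool:
        """True if the term is the current label or any retired label."""
        retired = [old for old, _, _ in self.former_labels]
        return term == self.preferred_label or term in retired


# Hypothetical example terms, chosen only to illustrate the mechanism.
concept = Concept("wireless set")
concept.relabel("radio receiver", date(2020, 1, 1), "updated to contemporary usage")
```

Content indexed under the older term can still be found through matches(), while new content is tagged with the current label.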

Another bias can be ingrained in the corporate culture itself. Internal jargon or a myopic view of the organization’s industry can lead to concepts which have a completely different, or no, meaning to anyone else. These meanings may conflict with industry standards. While this may seem innocuous, differences in language in health and safety may lead to critical misunderstandings.

Shifts in meaning or the use of narrowly defined jargon create semantic ambiguity which then impacts content across the ecosystem.

Definitive Sources

If the taxonomy is our source of truth, then what is the source of the source? In other words, how did we determine taxonomy labels and what sources did we use?

When I first started in taxonomy work as a thesaurus editor, the organization I worked for defined definitive sources for various types of information. For example, for a common or everyday concept we used Merriam-Webster’s dictionary. For a geographic location we used GeoNames. We used a combination of print (yes, print) and online sources we determined to be verified sources of truth.

How did we determine what would define our semantic ecosystem? We used sources which were:

  • Authoritative: the resource is a well-established and recognized expert individual or organization in the field;
  • Accurate: the resource must be truthful, updated, and verifiable; and
  • Objective: even when a resource is accurate, there may be a hidden (or obvious) agenda which may drive the inclusion of only a carefully curated set of truths to skew the final result.

For any given industry, it should be fairly easy to pick out individual experts and recognized organizations whose reputations are strong enough that their truthfulness and accuracy don’t require deep investigation.

Form & Definition

When it comes to taxonomies, the first line of semantics is the concept form and its definition.

Even when a concept seems straightforward, it’s best to check the form with a definitive source to ensure it has the proper spelling and appropriate diacritic marks if used. The label form may depend on whether it is singular or plural or if it should or should not include hyphens. The form may also be language-specific and may require the use of a native language term mixed with the primary language of the vocabulary. Most taxonomy management systems should be able to handle the form, so it’s worth the effort to get it right.
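As a rough illustration of checking a label's form against a definitive source, the sketch below compares a candidate label with an authoritative form and flags cases which differ only in diacritics, case, hyphens, or spacing. The function names fold and check_form are hypothetical, and the check deliberately ignores singular versus plural, which needs language-specific handling.

```python
import unicodedata


def fold(label: str) -> str:
    """Build a comparison key: strip diacritics, case, hyphens, and spaces."""
    decomposed = unicodedata.normalize("NFKD", label)
    # Drop the combining marks left behind by decomposition (accents etc.).
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return "".join(no_marks.casefold().replace("-", "").split())


def check_form(candidate: str, authoritative: str) -> str:
    """Classify a candidate label against the form from the definitive source."""
    if candidate == authoritative:
        return "match"
    if fold(candidate) == fold(authoritative):
        return "variant"  # same term, but the form needs correcting
    return "different"
```

For example, check_form("creme brulee", "crème brûlée") returns "variant", which tells the taxonomist the term is right but the diacritics are missing.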

In addition to the form of the term, you might find variant spellings and/or alternative label names for the same concept. This is particularly true of acronyms. As long as each form has been verified, the decision about which one to use may be up to the organization. That said, simple checks like seeing how many search results come back with each form or checking SEO tools for concept use online may offer guides as to which form to use.

The label name is the first step. The next step is to verify the definition and whether the concept has a different meaning than the one intended. For any given concept, there may not be a definition conflict within a single taxonomy, but once that taxonomy grows over time and is connected to internal and external systems, the ecosystem expands and the number of potential conflicts grows. It’s best to lay the groundwork at the source and get the concept label and definition correct from the start.

Definitions are typically stored in a text field attribute associated with the concept in question. This field may be a Scope Note and/or Definition. The simplest way to supply a definition for a concept may be to cut and paste it from the definitive source, citing the source name, URL, and the date it was applied. These pieces of information may live in one field or in several. The term-by-term method may work well when you are developing a taxonomy or adding to an existing taxonomy, but for an existing structure of thousands of terms, a bulk import of definitions may be faster and more efficient.
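To make the bulk import idea concrete, here is a minimal sketch which reads definitions and their citations from CSV text. The column names, the Definition class, and the sample row are all assumptions made for illustration; the definition text and source are placeholders, not taken from any real dictionary, and an actual taxonomy management system will define its own import format.

```python
import csv
import io
from dataclasses import dataclass
from datetime import date


@dataclass
class Definition:
    """A definition plus the citation details for its definitive source."""
    text: str
    source_name: str
    source_url: str
    date_applied: date


def load_definitions(csv_text: str) -> dict[str, Definition]:
    """Parse one row per concept into Definition records keyed by label."""
    definitions = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        definitions[row["label"]] = Definition(
            text=row["definition"],
            source_name=row["source_name"],
            source_url=row["source_url"],
            date_applied=date.fromisoformat(row["date_applied"]),
        )
    return definitions


# Placeholder data: the concept, definition, and source are hypothetical.
SAMPLE = """label,definition,source_name,source_url,date_applied
Example concept,"Placeholder definition text, copied from the source.",Example Dictionary,https://example.org/example-concept,2024-01-15
"""

records = load_definitions(SAMPLE)
```

Keeping the source name, URL, and date in separate columns makes it easier to later find every definition drawn from a given source, or every definition older than a given date.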

Another option is to use Linked Data sources such as DBpedia, which may offer abstracts for the concepts. If you trust Wikimedia projects as a definitive source for your terminology, then adding links will shorten the time needed to add definitions (or other attributes of interest) to large numbers of terms and keep them updated without taxonomist intervention. DBpedia is not the only source of Linked Data, and there may be more specific sources of information depending on the field of study.
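For a sense of what pulling an abstract from DBpedia might involve, the sketch below builds a SPARQL query for a concept's abstract and the URL which would submit it to DBpedia's public endpoint. The function names are hypothetical, the resource name is assumed to already be a valid DBpedia resource identifier, and the format parameter reflects my understanding of how that endpoint accepts requests rather than a guarantee. No network call is made here.

```python
from urllib.parse import urlencode

DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"


def abstract_query(resource_name: str, lang: str = "en") -> str:
    """Build a SPARQL query for the abstract of one DBpedia resource."""
    return (
        "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
        "SELECT ?abstract WHERE {\n"
        f"  <http://dbpedia.org/resource/{resource_name}> dbo:abstract ?abstract .\n"
        f'  FILTER (lang(?abstract) = "{lang}")\n'
        "}"
    )


def request_url(resource_name: str, lang: str = "en") -> str:
    """Build the GET URL for the query (assumed request parameters)."""
    params = {
        "query": abstract_query(resource_name, lang),
        "format": "application/sparql-results+json",
    }
    return f"{DBPEDIA_ENDPOINT}?{urlencode(params)}"


url = request_url("Tomato")
```

Whatever comes back should still be reviewed against your own criteria for a definitive source before it is stored as a definition.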

Relationships

A final consideration in the way a taxonomy as a source of truth impacts downstream consumers is the way terms are interrelated. Just as the concept label and definition imply a source of truth, so too do the relationships. Once standalone concepts are labeled and defined based on a definitive source of truth, where the concept lives in the taxonomy and what terms it is connected to also imply a truth which may be explicit or implicit.

For instance, spelling a concept like tomato correctly and having the right definition may make the concept authoritative, but placing it under vegetable instead of under fruit or berry reflects a choice between scientific accuracy and popular use. A scientist may wish to see the formal classification, whereas someone using an online recipe resource may not think to look for tomatoes as an ingredient under berries or nightshade.
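One way to make that viewpoint explicit, rather than implied, is to record each broader relationship together with the perspective it reflects and the source it came from. This is only a sketch: the BroaderLink class, the viewpoint names, and the source strings are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BroaderLink:
    """A narrower-to-broader relationship with its viewpoint and citation."""
    narrower: str
    broader: str
    viewpoint: str  # the perspective the relationship reflects
    source: str     # placeholder for the citation of a definitive source


LINKS = [
    BroaderLink("tomato", "fruit", "botanical", "hypothetical botany reference"),
    BroaderLink("tomato", "vegetable", "culinary", "hypothetical cookery reference"),
]


def broader_terms(concept: str, viewpoint: Optional[str] = None) -> list:
    """List broader terms for a concept, optionally limited to one viewpoint."""
    return [
        link.broader
        for link in LINKS
        if link.narrower == concept and viewpoint in (None, link.viewpoint)
    ]
```

A recipe site could then ask for broader_terms("tomato", "culinary") and a research database for broader_terms("tomato", "botanical"), each with a cited source behind the placement.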

Where you basket your tomatoes is a fairly non-contentious example. Depending on the subject matter of your taxonomy, the relationships between concepts may be political or even life-threatening in the cases of medicine and pharmaceuticals or manufacturing and safety.

Your source of truth needs its own source or sources of truth. It may seem specific to the realm of academic vocabularies, but it’s worth taking the time and making the effort to cite the source of your taxonomy concepts so your information ecosystem has a clean spring from which to drink.
