Language is inherently ambiguous. The dream of creating a universally agreed-upon, standardized conceptual language will be extremely difficult to realize. We can achieve widely accepted consensus, but that’s the silent agreement we already have in place with written and spoken language. We agree to accept certain rules and a set of vocabulary, and we also agree that these rules and vocabulary will change over time.
In the world of information semantics, and especially controlled vocabularies, we create even narrower and more stringent rules around what vocabulary we will accept and what rules govern how those concepts are defined and related. The more specific and unambiguous the domain, the easier it is to do this. The broader and more subjective the domain, the more difficult.
The development of standards, such as the Resource Description Framework (RDF) and the Simple Knowledge Organization System (SKOS), allows us to construct standardized, portable vocabularies and ontologies defining the structural rules for these vocabularies. “Using SKOS, concepts can be identified using URIs, labeled with lexical strings in one or more natural languages, assigned notations (lexical codes), documented with various types of note, linked to other concepts and organized into informal hierarchies and association networks, aggregated into concept schemes, grouped into labeled and/or ordered collections, and mapped to concepts in other schemes” (W3C).
In recent client interactions, we’ve had many discussions about concept identifiers and their applicable use cases. Let’s dig into this.
For most of us, the two most important lexical labels used in SKOS-based controlled vocabularies are the preferred label (skos:prefLabel) and the alternative label (skos:altLabel). SKOS describes a lexical label as “a string of UNICODE characters…in a given natural language”. These two labels matter most because they are both human and machine readable. In the world of controlled vocabularies, this is how most users understand and interact with concepts.
In term-based vocabularies, the preferred label may be called a descriptor or preferred term, and the alternative label may be called an entry term, non-preferred term, or synonym. Whatever term is used, the result is a human-readable string representing the preferred form of the concept, plus associated alternative labels used for synonyms. For much of our work in the field and for those consuming the controlled vocabularies on the front end, these are the lexical identifiers that matter most.
Less apparent to end users, but still an important SKOS lexical label, is the hidden label (skos:hiddenLabel). Hidden labels can be used in search to account for things like frequently misspelled words so that a relevant concept is still found. A hidden label is intended for background search redirection and is not shown to the end user. It is still human readable, even if it is a misspelling. Hidden labels work much like the “did you mean…” prompt in search, except that they are already associated with a preferred label in a vocabulary management system.
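To make the distinction between the three label types concrete, here is a minimal sketch in Python. The concept record, its URI, and its labels are invented for illustration, and the two helper functions are hypothetical rather than part of SKOS or any particular tool. The point is only that a search index can match on every label, while the interface shows just the preferred and alternative ones.

```python
# Hypothetical concept record; the URI and labels are invented for illustration.
concept = {
    "uri": "https://example.org/vocab/c-001",
    "prefLabel": "accommodation",                     # skos:prefLabel
    "altLabel": ["lodging"],                          # skos:altLabel
    "hiddenLabel": ["accomodation", "acommodation"],  # skos:hiddenLabel (misspellings)
}


def searchable_labels(c):
    """All labels a search index could match on, hidden labels included."""
    return [c["prefLabel"], *c["altLabel"], *c["hiddenLabel"]]


def displayable_labels(c):
    """Labels that may be shown to end users; hidden labels are left out."""
    return [c["prefLabel"], *c["altLabel"]]
```

With this split, a query for “accomodation” can still reach the concept, but the misspelling never appears on screen.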
Despite their obvious use for human consumption, lexical labels don’t cover all the use cases for identifying concepts. For instance, SKOS allows a concept only one preferred label per language, and by thesaurus convention no two concepts in the same controlled vocabulary share a preferred label in a given language. This uniqueness requirement may constrain use cases that rely on lexical labels alone.
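As a rough illustration of that constraint, the sketch below looks for preferred labels claimed by more than one concept in the same language. The vocabulary, URIs, and function name are assumptions made up for the example, and comparing labels case-insensitively is also an assumption; a real vocabulary management system may apply different rules.

```python
from collections import defaultdict


def find_pref_label_clashes(concepts):
    """Return {(language, label): [uris]} for labels used by more than one concept.

    `concepts` maps a concept URI to its preferred labels keyed by language.
    Labels are compared case-insensitively (an assumption of this sketch).
    """
    owners = defaultdict(list)
    for uri, pref_labels in concepts.items():
        for lang, label in pref_labels.items():
            owners[(lang, label.casefold())].append(uri)
    return {key: uris for key, uris in owners.items() if len(uris) > 1}


# Hypothetical vocabulary with two concepts competing for one English label.
vocab = {
    "https://example.org/vocab/c-101": {"en": "Mercury", "fr": "Mercure"},
    "https://example.org/vocab/c-102": {"en": "mercury"},
}
clashes = find_pref_label_clashes(vocab)
```

Here the two English labels collide, so both URIs are reported under the key `("en", "mercury")`, while the French label is left alone.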
To unambiguously identify a concept, RDF uses Uniform Resource Identifiers (URIs), which can identify anything described in RDF. Because a concept URI combines a namespace with either the concept name or a randomly generated character sequence, each URI is unique to a single concept.
URIs are meant to be persistent, which makes them ideal for identifying a concept regardless of any changes to its preferred or alternative labels. While a concept label may change, the URI is designed not to, and can thus always represent the concept. Additionally, some vocabularies require the same label in different contexts. Rather than using polyhierarchy, in which a single concept is represented in many hierarchical locations, a vocabulary may hold concepts that share a label but are contextually unique. In many vocabularies, these differences are identified by parenthetical qualifiers, such as mercury (metal) and Mercury (planet). However, it may be that these concepts are used in the same environment but truly have different meanings, or are even sourced from different vocabularies. In these cases, a URI clearly indicates which is which.
URIs can be human readable for properties, such as http://www.w3.org/2004/02/skos/core#prefLabel, which clearly indicates that this property comes from the W3C namespace, that the namespace dates from February 2004, and that it is part of the SKOS standard defining the core property prefLabel. Concept URIs are generally not human readable and are really meant to be machine readable and identifiable. The namespace of a concept URI may be as recognizable as the one in this property URI, but the end of a concept URI may be a GUID generated by a system, such as a 36-character randomly generated string. While not human readable, it clearly identifies a unique concept for machine-to-machine APIs.
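As a small sketch of how such a concept URI might be minted, the snippet below appends a random UUID to a namespace. The namespace and the function name are hypothetical, and real systems may use a different identifier scheme entirely; the only thing taken from the description above is the pattern of namespace plus a 36-character generated string.

```python
import uuid

# Hypothetical namespace; substitute the one your vocabulary actually uses.
NAMESPACE = "https://example.org/vocab/"


def mint_concept_uri(namespace=NAMESPACE):
    """Build a concept URI from a namespace plus a randomly generated UUID."""
    # str(uuid.uuid4()) is 36 characters: 32 hex digits and 4 hyphens.
    return namespace + str(uuid.uuid4())


uri = mint_concept_uri()
```

Nothing in the resulting URI depends on the concept’s label, which is what lets the label change later without touching the identifier.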
Labels and URIs may seem like enough to identify a concept, but there are also other codes, numerical identifiers, and classification codes (such as those used in library catalogs) which may be used to identify a concept. In SKOS, these are notations (skos:notation). They may be legacy values assigned to concepts from previous systems or classification methods.
Notations differ from lexical labels in that they can be alphanumeric strings, integers, floats (supporting decimal places), or dates. These alphanumeric and numeric datatypes may be consumed by other systems and can be used in conjunction with the more human-readable lexical labels. Labels and notations are not mutually exclusive. Examples of notations could include classification codes indicating hierarchical levels and sequence order, concept IDs used by other systems, or part numbers generated in a product information management (PIM) system (as opposed to the name of the part).
Notations may also be generated by the system for use in downstream consuming systems. For instance, each new concept can be assigned a unique integer, counting up from a specified value, so that every concept has a unique number separate from its lexical labels and URI.
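The counting scheme described above could look something like the following sketch. The function names, the starting value of 1000, and the example URIs are all assumptions for illustration; an actual system would also need to persist the counter between sessions.

```python
from itertools import count


def make_notation_assigner(start=1000):
    """Return a function that gives each concept URI a unique integer notation."""
    counter = count(start)   # yields start, start + 1, start + 2, ...
    assigned = {}            # concept URI -> notation already handed out

    def assign(uri):
        # A concept keeps its first notation; only new concepts consume a number.
        if uri not in assigned:
            assigned[uri] = next(counter)
        return assigned[uri]

    return assign


assign = make_notation_assigner(start=1000)
first = assign("https://example.org/vocab/c-001")
second = assign("https://example.org/vocab/c-002")
again = assign("https://example.org/vocab/c-001")
```

In this run the first concept receives 1000, the second 1001, and asking again for the first concept returns 1000 rather than a new number.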
A Concept By Other Names
Lexical labels, URIs, and notations may be used in conjunction to identify concepts in scenarios in which both human end users and machines consume vocabularies for several purposes.
On the front end, whether internal or external to an organization, users may use a search field in which they type free-text keywords or are prompted to select existing concepts using a typeahead. Let’s imagine that the user types “mobile phones” into the search box in hopes of finding smartphone products. The search takes the browser’s region into account, so it prompts a “did you mean ‘cell phones’?” message and allows the user to refine the search using “cell phones” as the new keyword. In this case, “cell phones” is the regionally preferred label and “mobile phones” is the regional alternative label. Likewise, if the user had typed “cel phone”, a hidden label would redirect the search to the preferred label to correct for the frequent misspelling.
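The search behavior in this scenario can be sketched roughly as follows. The concept record, its URI, and the `resolve_query` function are hypothetical, and the exact matching rules (case-insensitive, whole-string) are assumptions; the sketch only mirrors the three outcomes described above.

```python
from typing import Optional

# Hypothetical regional concept record based on the scenario above.
CONCEPT = {
    "uri": "https://example.org/vocab/c-200",
    "prefLabel": "cell phones",
    "altLabel": ["mobile phones"],
    "hiddenLabel": ["cel phone"],
}


def resolve_query(query: str, concept: dict) -> Optional[dict]:
    """Decide what to search for and whether to show a 'did you mean' prompt."""
    typed = query.strip()
    needle = typed.casefold()
    pref = concept["prefLabel"]
    if needle == pref.casefold():
        return {"search_for": pref, "message": None}
    if needle in (label.casefold() for label in concept["altLabel"]):
        # Alternative label: keep the user's term but offer the preferred one.
        return {"search_for": typed, "message": f"Did you mean '{pref}'?"}
    if needle in (label.casefold() for label in concept["hiddenLabel"]):
        # Hidden label: redirect silently to the preferred label.
        return {"search_for": pref, "message": None}
    return None  # no label of this concept matches the query
```

Calling `resolve_query("mobile phones", CONCEPT)` produces the prompt, while `resolve_query("cel phone", CONCEPT)` quietly searches for “cell phones” without exposing the misspelling.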
Usually located in the left-hand pane of the search interface, facets built from the controlled vocabulary’s preferred labels can help the user narrow the search parameters. Used alone or in conjunction with other preferred labels applied as metadata to content, they let users refine their search results.
On the back end, concepts surfacing in search are retrieved via an API that uses URIs as the basis for identifying the correct concept labels to display in the search interface. Changes made to the vocabulary’s preferred or alternative labels are updated in the search index based on these same immutable URIs, so the index does not retain conflicting old and new labels.
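One way to picture this is content tagged with concept URIs rather than label strings, with labels looked up only at display time. The sketch below assumes a tiny in-memory index; the URI, the document, and the replacement label “smartphones” are invented for illustration.

```python
# Hypothetical lookup from concept URI to its current preferred label.
labels_by_uri = {"https://example.org/vocab/c-200": "cell phones"}

# Content is tagged with URIs, never with label strings.
documents = [
    {"title": "Spring handset lineup",
     "concepts": ["https://example.org/vocab/c-200"]},
]


def display_tags(doc):
    """Resolve a document's concept URIs to their current preferred labels."""
    return [labels_by_uri[uri] for uri in doc["concepts"]]


before = display_tags(documents[0])

# The preferred label changes in the vocabulary; the URI stays the same.
labels_by_uri["https://example.org/vocab/c-200"] = "smartphones"

after = display_tags(documents[0])
```

The document itself is never edited. Only the label attached to the URI changes, so `before` holds the old label, `after` holds the new one, and no stale copy of the old label lingers in the tagged content.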
Finally, specific cell phone models, part numbers, or other classification codes used by the systems that supply concept properties are retrieved for display on a product page by cross-referencing the notation properties associated with concepts. In another scenario, a user may use a notation code to look up the preferred name of a product in order to order a part or find more information about the product.
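The notation lookup in that second scenario might be sketched like this. The part numbers, product names, URIs, and function name are all made up for the example, and the sketch assumes each notation belongs to exactly one concept.

```python
# Hypothetical concepts, each carrying a preferred label and a part-number notation.
concepts = {
    "https://example.org/vocab/c-310": {"prefLabel": "Example Phone X", "notation": "PN-48213"},
    "https://example.org/vocab/c-311": {"prefLabel": "Example Phone X charger", "notation": "PN-48214"},
}

# Reverse index from notation to concept URI (assumes notations are unique).
uri_by_notation = {data["notation"]: uri for uri, data in concepts.items()}


def pref_label_for_notation(code):
    """Return the preferred label for a notation code, or None if it is unknown."""
    uri = uri_by_notation.get(code)
    return None if uri is None else concepts[uri]["prefLabel"]


name = pref_label_for_notation("PN-48214")
```

The lookup goes from notation to URI to label, so the part number and the display name stay linked through the concept rather than through each other.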
We tend to think of controlled vocabularies as words, but identifying a concept by other names is part of many complex organizational information architectures. The ability to identify concepts by other names (or numbers or URIs) allows a taxonomy to power multiple applications and be more flexible and adaptive to business and system needs.