On Granularity

September 2, 2021 | Categorization, Taxonomy

In taxonomy, granularity refers to the number of terms used to describe a concept and its sub-concepts. The measure can apply to an entire vocabulary or to some subset of it. In short, the question to be answered is:

How many terms do I need to usefully cover [some subject area] to aid in content discovery?

Imagine that you have some large corpus of documents (say, several hundred thousand) about science–a bunch of journal articles or web pages or any other collection of content–and you need to build a taxonomy to tag them for retrieval. How big should your taxonomy be?

Clearly, having a single term, “Science”, to describe every document is not useful: all of the content is about science, so the term does nothing to help the user narrow things down. On the other hand, science is an extremely broad topic; a taxonomy containing every possible concept or subject in the field would easily run into the tens or hundreds of thousands of terms. And, in any event, it’s unlikely that your corpus includes every possible topic in science; we can therefore omit any topics that are not covered in the corpus (users tend to dislike navigating to a topic of interest only to find no associated content).

The answer is therefore somewhere between “one” and “all of them”, which is not very useful; let us consider a smaller case.

Consider the Onion

Imagine, instead, that we are building a taxonomy to index content about food, perhaps for a cooking website or similar information environment. I want to include terms describing “onions” – but how many kinds of onions do I need to include? That is to say: how granular must my coverage of onions be?

To circumscribe the scope of the problem, we can ask a few questions:

  • How many kinds of onions are there?
  • How many assets in my corpus are about onions?
  • How many kinds of onions are represented in my content?
  • How many kinds of onions represented in my content warrant inclusion in my vocabulary?

Although there seems to be no strict consensus, cursory research suggests that there are twenty-something onion varieties (not including other, related allium vegetables like shallots). This is not a huge number, and they could easily all be included in a taxonomy.

[Image: selection of onions]

Corpus Analysis

If the total number of documents describing onions is small–say, enough to fit on a couple of pages of search results; perhaps 10-20, depending on the size of your corpus–it is probably not necessary to include any additional terms besides “onions” to index this content for retrieval.

If there is a great deal of onion content, it makes sense to see what kinds of onions are mentioned–and in how many documents. A simple frequency count of the number of documents mentioning each kind of onion will give you a rough idea of the scope of onions represented in your corpus.
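As a rough illustration of such a frequency count, here is a minimal Python sketch. The variety list, the sample documents, and the function name `document_frequency` are all made up for this example; a real corpus would be read from files or a search index, and plain substring matching is only a first approximation.

```python
from collections import Counter

# Hypothetical variety list and toy corpus, for illustration only.
ONION_VARIETIES = ["red onion", "yellow onion", "vidalia onion", "pearl onion"]

def document_frequency(documents, terms):
    """Count how many documents mention each term at least once."""
    counts = Counter({term: 0 for term in terms})
    for text in documents:
        lowered = text.lower()
        for term in terms:
            # A document counts once per term, however often the term appears.
            if term in lowered:
                counts[term] += 1
    return counts

docs = [
    "Pickled red onions brighten any taco. Red onion also works raw.",
    "Caramelize a yellow onion slowly over low heat.",
    "A red onion and a Vidalia onion make a sweet salad.",
]

freq = document_frequency(docs, ONION_VARIETIES)
for term, n in freq.most_common():
    print(f"{term}: {n} document(s)")
```

Counting documents rather than total mentions keeps one onion-heavy article from skewing the picture.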

To accomplish this, you need some familiarity with basic text mining tools, or access to someone who has it. Ideally, I think, taxonomists should have some familiarity with regular expressions, which are supported by many search tools and are fairly simple to learn without requiring any facility with coding. Alternatively, access to a programmer familiar with NLP-friendly languages like Python, R, or AWK has been very helpful to me in the past.
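To give a sense of what a regular expression buys you over a plain keyword search, here is a small sketch. The pattern is one I made up for “red onion”; it tolerates capitalization, plurals, and a hyphen or extra whitespace between the words, while word boundaries keep it from matching inside longer words. Python is used only to demonstrate it; the pattern itself should carry over, with minor syntax differences, to most tools that accept regular expressions.

```python
import re

# Hypothetical pattern for one variety:
#   \b        word boundary, so "bored onion" does not match
#   [\s-]+    whitespace or a hyphen between the two words
#   onions?   singular or plural
RED_ONION = re.compile(r"\bred[\s-]+onions?\b", re.IGNORECASE)

def mentions(pattern, text):
    """Return True if the pattern occurs anywhere in the text."""
    return pattern.search(text) is not None

samples = [
    "Slice the Red Onions thinly.",
    "A jar of red-onion jam.",
    "Fried onion rings for everyone.",
    "The bored onion farmers went home.",
]

for text in samples:
    print(mentions(RED_ONION, text), "-", text)
```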

One approach to make this information useful is to set some threshold above which a specific type of onion will be included. Again, this needs to scale to your corpus: if you have 1000 articles, 10 articles on red onions might be enough to warrant inclusion; if you have 100,000 articles, 10 might not make the cut.
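One way to make a threshold scale with the corpus is to express it as a share of all documents, with a fixed floor so that very small corpora do not admit everything. The sketch below assumes that approach; the 0.5% share, the floor of 5 documents, and the counts are illustrative numbers, not recommendations.

```python
def terms_to_include(doc_counts, corpus_size, min_share=0.005, min_docs=5):
    """Keep terms whose document count meets a corpus-scaled threshold.

    min_share and min_docs are hypothetical defaults; tune them to your
    own corpus and to how deep you want the taxonomy to go.
    """
    threshold = max(min_docs, corpus_size * min_share)
    return sorted(term for term, n in doc_counts.items() if n >= threshold)

# Made-up counts of documents mentioning each variety.
counts = {"red onion": 10, "pearl onion": 3, "yellow onion": 600}

small = terms_to_include(counts, corpus_size=1_000)    # threshold: 5 documents
large = terms_to_include(counts, corpus_size=100_000)  # threshold: 500 documents

print(small)
print(large)
```

With these numbers, ten articles on red onions clear the bar in the 1,000-article corpus but not in the 100,000-article one, which mirrors the example above.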

External Factors

Now, other factors may be in play. Perhaps you know that a special feature about Spanish onions is coming out, or that Vidalia onions are popular right now (featured in another influential publication, perhaps), or you have learned from reviewing search logs that people often search for sweet onions. External factors beyond frequency can and should be taken into account to make the taxonomy as useful as possible. (This is why content review is an important piece of any taxonomy governance plan.)
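If you want to fold these signals into the same selection step, one possibility is to let a term qualify on any of several grounds and to record which ones applied. The following is a sketch under that assumption; the function name, the thresholds, and all of the counts are hypothetical.

```python
def select_terms(doc_counts, doc_threshold, search_counts=None,
                 search_threshold=100, editorial_picks=()):
    """Return {term: [reasons]} for every term that qualifies on any ground."""
    search_counts = search_counts or {}
    candidates = set(doc_counts) | set(search_counts) | set(editorial_picks)
    selected = {}
    for term in sorted(candidates):
        reasons = []
        if doc_counts.get(term, 0) >= doc_threshold:
            reasons.append("frequency")        # enough documents in the corpus
        if search_counts.get(term, 0) >= search_threshold:
            reasons.append("search demand")    # users are asking for it
        if term in editorial_picks:
            reasons.append("editorial")        # e.g. an upcoming feature
        if reasons:
            selected[term] = reasons
    return selected

# Made-up inputs for illustration.
doc_counts = {"red onion": 40, "sweet onion": 3, "spanish onion": 2, "pearl onion": 1}
search_counts = {"sweet onion": 250, "pearl onion": 4}
editorial_picks = {"spanish onion"}

chosen = select_terms(doc_counts, doc_threshold=10,
                      search_counts=search_counts,
                      editorial_picks=editorial_picks)
for term, reasons in chosen.items():
    print(term, "->", ", ".join(reasons))
```

Keeping the reasons alongside each term gives later content reviews something to check: a term admitted only for an editorial feature can be revisited once that feature has run.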

In the abstract, it would be simple to include all common onion varieties. But since our imaginary taxonomy is going to be used to index a specific corpus of content, we should let the content (and some external factors) be our guide when deciding how granular our coverage of onions should be.

Does this solution scale to our first imaginary case about science? Given an unlimited amount of time and resources: sure! But practically: no, because instead of 20 kinds of onions, there are thousands upon thousands of possible concepts to consider.

One possible remedy is subject matter expert (SME) review: the practice of bringing in experts on both the domain and the content (this is critical!) to review your emerging taxonomy and recommend the inclusion (or removal) of terms they know will be useful to index the corpus.

Starting with Existing Vocabularies

Another approach that may be available, depending on the domain (and access to tooling), involves beginning with one or more existing vocabularies (many are published as open data) as a starting point. This is good practice, as it can shorten the vocabulary development process and provide a stepping stone towards using linked data to enrich your taxonomy.

To tailor such a vocabulary to your specific needs, the next steps are to cull out the concepts (perhaps entire branches) that are not required and subsequently to add missing concepts needed to describe your assets.

Auto-classification tools can be used to facilitate both of these processes. If you run your entire corpus through a document classifier (even if not trained or with only simple string-matching classification rules), any term in the vocabulary that does not generate a hit (or, more likely, some threshold of hits) can be discarded. Conversely, any document that does not receive any tags should be examined for concepts to add to your taxonomy.
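Even without a dedicated auto-classification product, the idea can be tried out with simple string-matching rules. The sketch below is a stand-in for a real classifier, not a description of any particular tool: it tags each document with every vocabulary term it mentions, then reports terms with too few hits (candidates for culling) and documents with no tags (candidates for new concepts). The vocabulary, documents, and function names are invented for the example.

```python
import re

def tag_documents(documents, vocabulary):
    """Tag each document with the vocabulary terms it mentions (naive matching)."""
    patterns = {
        # Whole-word match, case-insensitive, with an optional plural "s".
        term: re.compile(r"\b" + re.escape(term) + r"s?\b", re.IGNORECASE)
        for term in vocabulary
    }
    return {
        doc_id: {term for term, pattern in patterns.items() if pattern.search(text)}
        for doc_id, text in documents.items()
    }

def audit_vocabulary(documents, vocabulary, min_hits=1):
    """Return (terms below the hit threshold, ids of documents with no tags)."""
    tags = tag_documents(documents, vocabulary)
    hits = {term: sum(term in t for t in tags.values()) for term in vocabulary}
    unused_terms = sorted(term for term, n in hits.items() if n < min_hits)
    untagged_docs = sorted(doc_id for doc_id, t in tags.items() if not t)
    return unused_terms, untagged_docs

# Hypothetical starting vocabulary and toy corpus.
vocabulary = ["onion", "shallot", "leek"]
documents = {
    "a": "Caramelized onions on toast",
    "b": "Shallot vinaigrette for spring greens",
    "c": "Roasted garlic soup",
}

unused_terms, untagged_docs = audit_vocabulary(documents, vocabulary)
print("Candidates to cull:", unused_terms)
print("Documents needing new concepts:", untagged_docs)
```

Here “leek” never generates a hit, and the garlic soup document receives no tags, which suggests “garlic” as a concept worth adding.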

While there is no one-size-fits-all solution to the problem of granularity, there are guidelines and techniques to help tackle the issue. For this and other reasons, it’s increasingly clear to me that some familiarity with NLP tools and techniques should be part of a well-rounded taxonomist’s skill set.