For over fifteen years, Keith Bloomfield-DeWeese has been engaged in all aspects of metadata development, application, and management across the publishing, finance, and news media industries. He started as an index and thesaurus editor working on digital consumer products at Encyclopaedia Britannica, and his experience there led him to manage controlled vocabularies and machine-assisted indexing initiatives and programs for the Federal Reserve, Dow Jones, the Tribune Company, and Gannett-USA Today.
For this Insight interview, Keith shared his experiences and his continued interest in controlled vocabularies, ontology development, and content categorization as part of machine learning.
Tell us about you and your early experiences.
Keith: In the early part of my career, I worked in academic libraries and taught. After receiving my Master of Science in Information Science and Library Studies, I augmented my studies and work experience over the years by informally studying natural language processing, ontology, and machine learning, and by applying what I learned to my work as I advanced in my career. My foundation and grounding, though, came from being an index and thesaurus editor for Encyclopaedia Britannica.
Britannica was the first encyclopedia to create an index of its content, which means that a reader could find the main article on Joan of Arc under “J,” using the alphabetical arrangement of entries; but, also, within the index, locate all other entries in the encyclopedia that included significant information about her, say, in the article about Charles VII of France, in the article about the notorious Gilles de Rais, and so on.
During this period, in the early 2000s, I was fortunate to have been mentored by a brilliant linguist, Carmen Hetrea, at Britannica. She shared with me tactics that I still use today. It was a grounding different, perhaps, to someone who came up solely through the library path. The experience I acquired working in libraries I consider inestimable; but the basics of linguistics that I learned at this time really helped me find the niche in which I’ve enjoyed working for so many years now.
The people I worked with at this time were incredibly forward-thinking. I can still distinctly recall discussions with them about how encyclopedia knowledge, combined with the index and thesaurus, could feed into an artificial intelligence system. For me, it was one “light bulb moment” after another, moments I would keep going back to as my career progressed.
What became increasingly clear to me was the need to relate terms, or concepts, to each other semantically. Implicitly, human beings understand the relationships that exist between concepts, but it’s not so simple, not implicit, for machines. For automation, relationships between concepts need to be made explicit; they need to be specified. No concept exists on an island, alone. All concepts exist within some kind of network.
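As a rough illustration of what “making relationships explicit” can mean in practice, the sketch below stores relationships as (subject, relation, object) triples and looks up a concept’s neighbors. The relation names `related` and `broader`, the helper functions, and the “Historical figures” concept are assumptions made for this example; the interview does not describe a specific data model.

```python
from collections import defaultdict

# Illustrative triples only. The "related" links echo the Joan of Arc index
# example above; "Historical figures" and the relation names are hypothetical.
RELATIONS = [
    ("Joan of Arc", "related", "Charles VII of France"),
    ("Joan of Arc", "related", "Gilles de Rais"),
    ("Joan of Arc", "broader", "Historical figures"),
]


def build_network(triples):
    """Index (subject, relation, object) triples by subject and relation."""
    network = defaultdict(lambda: defaultdict(set))
    for subject, relation, obj in triples:
        network[subject][relation].add(obj)
        # Assumption: "related" is symmetric, so store the reverse link too.
        if relation == "related":
            network[obj][relation].add(subject)
    return network


def neighbors(network, concept, relation):
    """Return the concepts explicitly linked to `concept`, sorted for stable output."""
    return sorted(network.get(concept, {}).get(relation, set()))


network = build_network(RELATIONS)
print(neighbors(network, "Joan of Arc", "related"))
print(neighbors(network, "Gilles de Rais", "related"))
```

Nothing is inferred here: a machine only “knows” that two concepts are connected because someone declared the link.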
When did you start working with news content?
Keith: My first exposure to the news world was when I joined Dow Jones as a taxonomy consultant. It was during this period that I met and worked with Dave Clarke and his brilliant team. Even though I wasn’t working directly with Dow Jones news content at that time, I was in the news sector and moving closer to working directly with news content.
At the time, the news media world was unlike any I had encountered until then, with quite different approaches, processes, and ways of organizing and accessing content than I had been used to. I found the cultures of Dow Jones and the Tribune Company very different from those of academia and Britannica: much faster paced in terms of content creation and surfacing workflows. Whereas Britannica was deliberate and painstaking in its editorial and publishing processes, reducing the tempo of news publishing is just not an option. In the news world, no matter what you do, moving fast is everything. It was exciting! During Obama’s first inauguration, we couldn’t go into the Tribune office due to a Chicago snowstorm; but I was able to watch the ceremony on TV, update the taxonomy remotely, and deploy it to production, all while the inauguration was occurring, in real time.
For more specific examples of what it’s like to apply descriptive metadata to news content and categorize it, let me refer to Gannett, which is a US media holding company that owns the national newspaper USA Today and over 100 local newspapers. In 2019, Gannett merged with GateHouse Media, making Gannett the largest newspaper publisher in the States.
As you can expect, in recent months, COVID-19 has become one of Gannett’s focal points, which, with regard to Gannett’s ontology, means defining and contextualizing the concept “COVID-19” using numerous attribute values and relationships between the concept and a wide range of others, spanning various classes with instances covering the economy, health resources, politics, and so on. In Gannett’s case, part of this process also entails writing the programs needed to auto-categorize content covering the virus. In other words, it’s not just a matter of creating a hashtag label of #covid19 and leaving it at that.
With news, declaring new instances of classes and creating the programs that auto-categorize content is a constant, and always interesting. For example, the evolution of a concept’s preferred term and its synonyms, say, for something like COVID-19, can take a while. When the virus first appeared on my radar in January news feeds, it was commonly referred to as “Wuhan virus.” Before Gannett and other news organizations settled on what to call it, weighing what was in the best interest of journalistic excellence and exercising sensitivity with regard to the cultural facets of the virus, its preferred term label changed a handful of times.
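To make the point about shifting preferred terms concrete, here is a minimal sketch of a concept whose identifier stays fixed while its preferred label changes, paired with a naive phrase-matching categorizer. This is not Gannett’s system, which the interview does not describe in detail: the `Concept` class, the ID `c-0001`, the synonym “novel coronavirus,” and the matching rule are all assumptions for illustration.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Concept:
    concept_id: str                      # hypothetical stable identifier
    preferred_label: str
    synonyms: set = field(default_factory=set)

    def rename(self, new_label):
        """Promote a new preferred label and keep the old one as a synonym."""
        self.synonyms.add(self.preferred_label)
        self.synonyms.discard(new_label)
        self.preferred_label = new_label

    def labels(self):
        """All labels that should trigger this concept."""
        return {self.preferred_label} | self.synonyms


def categorize(text, concepts):
    """Return IDs of concepts whose labels occur in the text as whole phrases."""
    hits = []
    for concept in concepts:
        for label in concept.labels():
            # Lookarounds stop a label from matching inside a longer word.
            pattern = r"(?<!\w)" + re.escape(label) + r"(?!\w)"
            if re.search(pattern, text, flags=re.IGNORECASE):
                hits.append(concept.concept_id)
                break
    return hits


# "novel coronavirus" is an assumed synonym, not one taken from the interview.
virus = Concept("c-0001", "Wuhan virus", {"novel coronavirus"})
virus.rename("COVID-19")

print(categorize("Officials tracked the Wuhan virus in January.", [virus]))
print(categorize("COVID-19 cases rose statewide.", [virus]))
```

Because content is tagged with the identifier rather than the label, older stories that used an earlier name still resolve to the same concept after the preferred term changes.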
What kinds of challenges did the merger create?
Keith: Well, at all levels of the enterprise, it challenged notions of what is valuable content, which, by extension, challenges priorities when augmenting controlled vocabularies and the relations that exist between existing and new instances of classes. It meant, too, that now, instead of only USA Today and the local news sites publishing before the merger, there were, post-merger, many more local news sites to consider. For controlled vocabulary work, it meant many, many more named entities needing to be manually defined in the ontology and related to each other.
In news organizations, at least the ones I’ve worked at, relating concepts is typically a manual exercise, which is often surprising to some because there do seem to be pervasive notions that machines just “know” which concepts relate to each other without any kind of human prompting or training. That’s a nice idea and, certainly, there are brilliant people programming machines to do just that; but, since so many news organizations are “in the red,” they just can’t afford to experiment and invest in those systems.
Tell us about the idea of a tag being like the ‘tip of an iceberg’.
Keith: I’ve used the metaphor of an iceberg when explaining taxonomy and ontology, to help stakeholders understand that a tag is not just a hashtag or a string of letters; it’s a label that represents a concept. I’ve frequently encountered the assumption among enterprise stakeholders that what the end-user sees, the preferred term, is all that exists, all that the knowledge worker creates for a vocabulary. That’s completely understandable, particularly when, more often than not, news media project owners do not have any hands-on experience in actually creating vocabularies. I use the iceberg metaphor to help them understand that the preferred term is just “the tip of the iceberg” so they get a better sense of what is beneath the tip: the attribute values a term record stores other than just a preferred label. Having even a bit of that understanding can mean the difference between successfully navigating whatever project will use the vocabulary and being surprised, hitting the iceberg, and sinking the project.
Can you describe what a news organization’s ontology is like? What are the components?
Keith: The Gannett ontology, in its current state, has nine classes used to organize 70,000+ concept instances. An example of a class is Subjects, which has instances that represent general concepts (i.e., not named entities). These general concepts are frequently the same as the names of news site sections, such as “Sports” and “Entertainment.” Then there are other classes, such as Companies and Persons, to name two, which contain instances of named entities, for example, “IBM” and “Claude Monet,” respectively. Lastly, there are classes, such as Systems, the instances of which are used to describe how and where content may be used, should be used, in the publishing system. An example of this class would be “sponsored Lockoff,” which is unique to Gannett and doesn’t really make much sense outside of the enterprise.
Across the classes, the instances are related to each other using recommendations put forward by standards organizations, one being the W3C. Additionally, class instances are, where appropriate, mapped to IPTC “Media Topics” and, for advertising purposes, to the Interactive Advertising Bureau’s taxonomy. A subset of instance records contain Wikidata links, too.
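One way to picture the structure described above is as typed instance records that carry optional mappings to external schemes. The sketch below is an assumption-laden illustration, not Gannett’s schema: the `Instance` class and the mapping keys are hypothetical, only the four classes named in the interview are modeled, and the external identifiers are placeholders rather than real IPTC, IAB, or Wikidata codes.

```python
from collections import Counter
from dataclasses import dataclass, field

# Only four of the nine classes are named in the interview.
KNOWN_CLASSES = {"Subjects", "Companies", "Persons", "Systems"}


@dataclass
class Instance:
    label: str
    cls: str
    # Scheme name -> external identifier. Keys and values are illustrative.
    mappings: dict = field(default_factory=dict)

    def __post_init__(self):
        if self.cls not in KNOWN_CLASSES:
            raise ValueError(f"unknown class: {self.cls}")


instances = [
    Instance("Sports", "Subjects",
             {"iptc_media_topic": "PLACEHOLDER", "iab": "PLACEHOLDER"}),
    Instance("IBM", "Companies", {"wikidata": "PLACEHOLDER"}),
    Instance("Claude Monet", "Persons"),
]


def count_by_class(items):
    """Count how many instances each class holds."""
    return dict(Counter(item.cls for item in items))


def unmapped(items, scheme):
    """List labels of instances that lack a mapping to the given scheme."""
    return [item.label for item in items if scheme not in item.mappings]


print(count_by_class(instances))
print(unmapped(instances, "wikidata"))
```

A report like `unmapped` hints at the ongoing maintenance involved: mappings exist only “where appropriate,” so gaps are expected and have to be reviewed by a person.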
What do you feel is the biggest challenge for the news sector?
Keith: Well, these days, it’s decreasing revenue, particularly from advertising. The ripple effect of that is that dollars are allocated for mission critical initiatives, for example, merging near-incompatible publishing platforms; and dollars are not readily available for initiatives considered important but non-mission critical, just not important enough. Taxonomy, ontology, machine learning, etc., tend to fall into the latter category.
I think, frequently, some news organizations then get into “Catch-22” kinds of situations in which they could benefit enormously from work invested in developing and maintaining controlled vocabularies; but, in a kind of desperation to generate revenue, taxonomy, in order to get buy-in, is cast as a panacea that can make up for issues created by broken news management systems and their myriad workflows. To paraphrase a former news colleague, taxonomy just isn’t auto-magical enough to meet mismanaged expectations.
What words of advice do you have for those working with taxonomy and news content?
Keith: Three pieces of advice which, though pretty obvious, I think are always worth repeating:
- Build on the work of others. By that, I mean consider starting your taxonomy initiative, laying its foundation, using resources such as the IPTC “Media Topics” and the Interactive Advertising Bureau’s (IAB) taxonomy. These resources are good starting points and designed to be used for news content.
- Avoid thinking that IPTC, IAB, or any other resource is a “set it and forget it” kind of tool. It’s only a foundation on which to continually build, so know that, accept it, and plan, accordingly, to engage in the continuous expansion of your taxonomy. Neither IPTC nor IAB has “Coronaviruses” in its taxonomy, but there could be many use cases requiring that it be in a news organization’s taxonomy.
- Manage expectations up and across the stakeholder chain. Not an easy task considering the number of stakeholders that one encounters in news organizations and their experience with or understanding of taxonomy; but, honesty, transparency regarding what to expect, I have found, is always best.
More “in the weeds,” when you’re actually analyzing news content to find concepts to add to your vocabulary, you might consider what I call ‘the rule of three,’ which I picked up when I worked at Encyclopaedia Britannica. I asked my mentor, “How do you know when to add a new term to the index?” Her advice: when analyzing an encyclopedia article to find potential index terms, and you think you may have found one, test it by asking yourself, ‘Will readers learn at least three things about the concept when they follow the new index term to the article?’ If the answer is ‘Yes,’ then consider including it in the index. It’s a simple rule of thumb, but one that I still use. Of course, it’s not the only heuristic that one might use because indexing and taxonomy work are too complex to be reduced to such simplicity. But, as I’ve mentored taxonomists over the years, I’ve shared it and they’ve told me they’ve found it helpful.
I would also recommend referencing the NISO Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies. Memorize it if you can.
Lastly, keep an open mind. As much as news revenue might be generated by advertising on a sports page or in the entertainment section, something like COVID-19 comes along and impacts everything, across all sections of a news site. Expect and plan for it. Though newspapers were once printed in black and white, news sites do not exist in a black-and-white world, and will not prosper with black-and-white thinking.