The Associated Press is an independent, not-for-profit news cooperative headquartered in New York City. AP works with organizations of all sizes across a broad spectrum of industries and is considered the definitive source for news. We interviewed Veronika Zielinska, Deputy Director, Data Sciences about her experiences with The Associated Press and developing the Tagging and Taxonomy Service. AP made its extensive set of news-related taxonomies for topics, people, places, organizations and companies available to Synaptica KMS and Graphite.
Can you tell us about the history of The Associated Press and its services today?
VZ: The Associated Press is one of the oldest news organizations, founded as an independent, not-for-profit news cooperative. Its members are U.S. newspapers and broadcasters. In 1846 five New York City newspapers funded a pony express route through Alabama to bring news of the Mexican War north faster than the U.S. Post Office could deliver it. Today AP remains at the forefront of news delivery and technology, with two-thirds of our employees being journalists. News is delivered digitally through words and images: Associated Press publishes 2,000 stories a day, 70,000 news and sports videos a year and 1 million photos a year. More than half the world’s population sees AP content every day. The Associated Press has earned 53 Pulitzer Prizes, including 31 for photography.
Can you tell us about early experiences and your current role?
VZ: My background is in natural language and computational linguistics. I was interested in how computers could understand human language in a useful way: the idea of working with taxonomies, automated tagging and creating rules or statistical engines that apply taxonomies to content was one application of that. I worked with a small news aggregator first, and then joined AP as part of the Information Management team, building a taxonomy framework from the ground up. The taxonomy effort started before I joined, in 2005 – it was a long journey, and we looked at the entire body of content The Associated Press produces in order to define the scope for the taxonomy. Associated Press content has historically been authored by various production systems depending on media type, which can result in metadata that is unharmonized and content that is difficult to search and discover. We needed to create a way to unify these disparate sets of content, and the taxonomy and corresponding tagging enrichment pipeline was one way to bring everything together.
My current responsibility on the Metadata Technology team is, among other things, to continue to develop and grow the commercially available Metadata Services platform. Metadata Services gives external customers API access to The Associated Press’ standardized taxonomy and automated tagging system – the very same one we use ourselves. My role also encompasses search quality and relevance tuning for some of AP’s content distribution platforms; schema design, data transformations and building out AP’s analytics and business intelligence program.
My team is also responsible for working with product owners to develop new methods of content enrichment. We are using machine learning and natural language processing techniques and technologies to further enhance content enrichment and metadata workflows. For example, we are working on an automated photo tagging project where we built an editorial photo taxonomy based on image product requirements, and trained models to tag photos using computer vision. We also support the Associated Press’s searchable digital archive, going back to 1985, for various internal metadata analysis and insights needs.
I am also actively involved in the IPTC (International Press Telecommunications Council) NewsCodes development group. IPTC maintains controlled vocabularies and standards for use by media across the globe. MediaTopics is the IPTC’s subject taxonomy; it is separate from The Associated Press’s taxonomy but has the same mission: to describe the news in a standard way.
How did you initially develop the taxonomy?
VZ: We looked at all the content that The Associated Press produced and scoped our taxonomy to cover all possible topics, events, places, organizations, people, and companies that our news production covered. News can be about anything – it’s broad, but we also took into account that there are certain areas where The Associated Press produces more content than others. We have verticals with huge news coverage, such as government, politics, sports and entertainment, as well as emerging areas like health, environment, nature, and education. Looking at the content and knowing what the news is about helped us to develop the taxonomy framework. We took this content base and divided the entire news domain into smaller domains. Each person on the team was responsible for three or four taxonomy domains and became the subject matter expert for them.
“The user gets a wider range of choices for the search term they are using because of the effectiveness of the taxonomy in the background.”
Can you tell us how AP values its taxonomies?
VZ: Seeing the value is not always easy; it’s not something that is obvious on the surface. The value comes from seeing the content package, which can include photos, articles and videos around certain themes, unified by descriptive metadata and therefore aggregable and findable. Taxonomies are used for fine-tuning our search engine and APNews.com, our world-facing news hub, which exposes visitors to our content and topic areas.
We have several internal distribution platforms for our content and our taxonomy metadata is used among these in various ways. For example, our subjects, events or locations may be part of the search algorithm to improve recall in your result set – content that may not be findable using simple keywords can still be found because it was tagged with a more abstract but comprehensive taxonomy term. If someone is searching for news on education policy, they will also see relevant content where those keywords aren’t necessarily explicitly mentioned, but which has the tag. The user gets a wider range of choices for the search term they are using because of the effectiveness of the taxonomy in the background.
We have a deep hierarchy in our taxonomy, which is automatically part of the metadata of each piece of content. If your search targets a granular term in the taxonomy, for example “Local elections,” the results will also contain the news vertical, i.e. Politics. Our taxonomy is also used to aggregate content on various topics, create landing pages (for example for major events like Olympics or Elections), which improves browsing and filtering.
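The way hierarchy improves recall can be illustrated with a minimal sketch. It assumes a toy parent map and a handful of made-up content items; the term names, item ids and function names are hypothetical and are not taken from AP's actual taxonomy or systems. Each item is tagged with a granular term, the broader terms are added at index time, and a search on the broad term then finds the item even though it was never tagged with it directly.

```python
# Hypothetical parent links (narrower term -> broader term); illustrative only.
PARENT = {
    "Local elections": "Elections",
    "Elections": "Politics",
    "Education policy": "Education",
}


def with_ancestors(term):
    """Return the term followed by every broader term above it."""
    chain = [term]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def index_tags(direct_tags):
    """Expand directly applied tags so an item also carries all ancestor terms."""
    expanded = set()
    for tag in direct_tags:
        expanded.update(with_ancestors(tag))
    return expanded


def search(items, term):
    """Return ids of items whose expanded tag set contains the term."""
    return sorted(
        item_id for item_id, tags in items.items() if term in index_tags(tags)
    )


# Made-up content items of different media types with their direct tags.
ITEMS = {
    "story-1": ["Local elections"],
    "photo-7": ["Education policy"],
    "video-3": ["Elections"],
}

print(search(ITEMS, "Politics"))
```

In this sketch a search for "Politics" returns both the story tagged "Local elections" and the video tagged "Elections", which is the behavior described above: the granular tag brings the news vertical along with it.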
Another major benefit of our taxonomy metadata is that we create Breaking News event taxonomy terms, which automatically tag content as updates on the event are published. This allows for an automatically aggregated multi-format set of content about a breaking news event, dynamically updated throughout the event’s life cycle. This is a valuable feature to have when you’re talking about a quickly developing story, and as an editor you don’t want to miss any details.
What value do tagging and taxonomy bring to your external metadata customers?
VZ: In many ways, an AP Metadata Services customer will see the same type of value from our standard AP News Taxonomy and Tagging system as we do ourselves at AP. Because the AP’s news coverage is broad, the taxonomy scope is general and wide-ranging enough to be useful across the board – not just for small newspapers, broadcasters, or larger news organizations, but for data or insights companies as well.
External customers incorporate our taxonomy and tagging service into their workflows and platforms to aid in content findability and discovery, much in the same way as we ourselves do. They may have other use cases as well, such as improving targeted contextual advertising, topic-based content recommendations or suggestions. We make our taxonomy data available to Metadata Services customers using semantic web standards, which makes the data easier to integrate. The rich relationships between concepts in our taxonomy allow for dynamic content linking, leading to further discovery – for example, an athlete is connected to their sport and the team they are a member of, from that team to the league, etc.
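As a rough illustration of that kind of concept linking, the sketch below stores relationships as subject, predicate, object triples, in the spirit of semantic web data, and walks outward from one concept. The concept names, predicate names and the related function are all hypothetical placeholders; the actual format and vocabulary of the AP data are not described here.

```python
from collections import deque

# Hypothetical triples (subject, predicate, object); names are placeholders.
TRIPLES = [
    ("Athlete A", "playsSport", "Basketball"),
    ("Athlete A", "memberOf", "Team T"),
    ("Team T", "partOf", "League L"),
]


def related(concept, triples, max_hops=2):
    """Breadth-first walk over outgoing links, up to max_hops away."""
    seen = {concept}
    queue = deque([(concept, 0)])
    found = []
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for subj, _pred, obj in triples:
            if subj == node and obj not in seen:
                seen.add(obj)
                found.append(obj)
                queue.append((obj, hops + 1))
    return found


print(related("Athlete A", TRIPLES))
```

Starting from the athlete, one hop reaches the sport and the team, and a second hop reaches the league, so a page about the athlete could link onward to content about any of them.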
The taxonomy is robust and detailed, in some cases containing eight or nine levels of hierarchy. There’s coverage for various levels of content granularity. The time and resources a taxonomy and tagging system like this takes to build is something that small organizations may not have at their disposal in house, so they find benefit in integrating AP’s system instead.
Over the past year and a half, AP has been upgrading the Metadata Services platform, adding to the core components of the service. The same standard taxonomy and tagging are still available, but we added flexibility to the offerings as well as new features like additional taxonomies and tags. For example, we now offer the IAB (Interactive Advertising Bureau) taxonomy and tags as part of our service, and allow our customers to incorporate their own additional taxonomy terms directly into our taxonomy via API.
How do you maintain the taxonomy and metadata?
VZ: AP constantly updates the taxonomy and tagging engine with new terms, as people become newsworthy or companies go public, and often due to emerging and breaking news. New topics appear all the time, e.g. “vaping” emerged as a hot topic not too long ago and is now covered by news publications looking at health. When a breaking news event occurs, we put the new taxonomy term and tagging rule in production often within 24 hours so that content gets automatically tagged with that breaking news event term as new developments come out in the story. We may also retire a topic that is no longer relevant to the news.
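To make the idea of a term paired with a tagging rule concrete, here is a deliberately simplified sketch. It assumes a rule is just a list of phrases that must all appear in the text, plus a flag for retiring the term; the TagRule class, the apply_rules function and the example terms are hypothetical, and a production tagging engine would be far more sophisticated than substring matching.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class TagRule:
    term: str                 # taxonomy term to apply when the rule matches
    all_of: Tuple[str, ...]   # phrases that must all appear in the text
    active: bool = True       # set to False to retire the term


def apply_rules(text, rules):
    """Return the terms of every active rule whose phrases all occur in text."""
    lowered = text.lower()
    return [
        rule.term
        for rule in rules
        if rule.active and all(phrase.lower() in lowered for phrase in rule.all_of)
    ]


# Hypothetical rules: a standing topic, a new breaking news event, a retired term.
RULES = [
    TagRule("Vaping", ("vaping",)),
    TagRule("Hurricane Example", ("hurricane example",)),
    TagRule("Retired topic", ("officials",), active=False),
]

update = "Officials said Hurricane Example made landfall overnight."
print(apply_rules(update, RULES))
```

Once the event rule is added to the rule set, each new update that matches it picks up the event term automatically, while the retired rule no longer tags anything.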
The AP taxonomy team follows the news and collaborates with the AP Editorial team to identify these new and emerging topics. Topics are used throughout AP’s platforms to route and aggregate content. On our consumer-facing platforms, we employ both automated tagging results and manual curation to define story hubs. We also collaborate with Sales and Product teams to define topical products and to highlight relevant content for editorial customers.
We also receive external requests to add specific topics to our taxonomy. We often accommodate these requests if they are likely to also serve the scope and coverage of the AP content set. Sometimes, however, a requested taxonomy term may be low priority for our internal needs because we don’t have a lot of content in that area. For example, a customer in Australia might be interested in tagging Australian Rules football squads, which the AP doesn’t cover heavily. We wouldn’t add these terms in this case, but Metadata Services customers can still customize AP’s taxonomy with terms most useful to their content production.
“There is always going to be some nuance that a computer will get wrong. At AP we are striving to get as close to 100% precision and recall as possible.”
What are the big challenges you face as part of your work?
VZ: Maintenance is a major concern. We want to maintain the quality of content enrichment at enterprise scale, including the automated taxonomy elements. Trimming the taxonomy and keeping it relevant and up to date is not a trivial exercise. We need high-quality tagging, both for our own content and to attract other organizations that want to use the service. We find that the expertise and judgement of our taxonomists provide significant value to both our content and our Metadata Services customers.
The nature of human language makes it difficult for machines to understand the nuances that come to us so naturally and are reflected in the way we write, so any natural language classification system is hard pressed to reach 100% precision and recall. There is always going to be some nuance that a computer will get wrong. At AP we are striving to get as close to 100% precision and recall as possible.
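For readers less familiar with the two measures: precision is the share of tags applied by the system that are correct, and recall is the share of correct tags that the system managed to apply. A minimal sketch for a single document follows, assuming a human-reviewed set of expected tags; the tag names are made up, and returning 0.0 for an empty set is a convention chosen for this example.

```python
def precision_recall(predicted, expected):
    """Compute precision and recall for one document's tags."""
    predicted, expected = set(predicted), set(expected)
    true_positives = len(predicted & expected)
    # Convention for this sketch: an empty denominator yields 0.0.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical tags from the automated tagger and from a human reviewer.
system_tags = ["Politics", "Elections", "Sports"]
reviewer_tags = ["Politics", "Elections", "Local elections", "Education"]

precision, recall = precision_recall(system_tags, reviewer_tags)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the three system tags are correct, giving a precision of about 0.67, and two of the four reviewer tags were found, giving a recall of 0.50. The wrong "Sports" tag lowers precision, while the two missed tags lower recall.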
Maintaining a list of geographic and political entities requires diplomacy. How does AP negotiate these issues?
VZ: The Associated Press has strong editorial standards on all content we generate, and these are outlined in the AP Stylebook. Whether it’s the name of a political organization or geographic location, our taxonomy labels follow the Stylebook’s guidance and editorial standards. We also regularly consult with AP editors and authoritative sources when developing taxonomy terms, scope and definitions.
“As taxonomists, our role is to organize and scope concepts, and to create relationships between them. Having an open mind and broad scope of others’ understanding is often crucial, as a taxonomy needs to be more than one’s own interpretation.”
In your view what makes a good taxonomist?
VZ: Consider what the taxonomy is supposed to solve. What business problem is the taxonomy addressing and what types of assets will it describe? Who is the end-user and how will they interact with your taxonomy? In our case, we need to be grounded in the body of content that AP produces and knowledgeable about our domain. Keeping stakeholder goals in mind and understanding what you want to solve is incredibly important.
Often, we are working with a stakeholder who has a certain idea of how their content or product should be organized. They may have a search-specific use case or an end-user in mind. Their approach and perspective on the problem they are trying to solve may offer a different and valuable insight that you may not have considered. Approach a taxonomy development project with an open mind and listen to the needs of your team, colleagues and stakeholders. Listen to the various ways something can be described and the various ways certain language is used in the organization. As taxonomists, our role is to organize and scope concepts, and to create relationships between them. Having an open mind and broad scope of others’ understanding is often crucial, as a taxonomy needs to be more than one’s own interpretation.
What advice would you share with others developing a similar project?
VZ: You can build a beautiful, complex, well-structured taxonomy but if no one is going to use it, then it’s a loss. One important aspect of our work is ensuring the taxonomy works for its users. Look at the larger organization and evaluate whether your taxonomy will be seen as valuable on that scale – can the value of the taxonomy be directly tied to larger organizational priorities? You need buy-in, organizational support and understanding – and part of that is being able to explain and expose the value very clearly. You want your colleagues to understand the necessity of the work. There are a lot of internal relations you need to undertake as part of the project process.
“You can build a beautiful, complex, well-structured taxonomy but if no one is going to use it, then it’s a loss. One important aspect of our work is ensuring the taxonomy works for its users.”
What do you think are the biggest challenges for your industry for the future?
VZ: In the news industry, we are all aware of the impact of shrinking budgets across newsrooms, but there is still an objective to grow news production at the local and regional level. We don’t want to lose that local coverage, and AP has made efforts to help in this area, though the challenge remains. Another ever-growing challenge to news is the spread of misinformation across the world, driven by the rapid emergence and refinement of technologies that allow misleading and false visuals to be spread as facts. Information Managers can help by creating standardized vocabularies for fact checking across media types, for example, so that deep fakes and other misleading media can be identified consistently across various outlets.