Stanford Knowledge Graph Course Not-Quite-Live-Blog 4

Bob Kasenchak and Ahren Lehnert, Synaptica

The fourth installment of the Stanford Knowledge Graphs Course addressed the question “What are some knowledge graph inference algorithms?” with talks covering inference algorithms, entity disambiguation, and resolving heterogeneous/diverse graph structures.

NB: Once again, although it would be helpful to post screengrabs of the presenters’ slides, it does not seem appropriate.

The video for this session is not available yet.

AnHai Doan, University of Wisconsin-Madison and Informatica

Recent Advances in Entity Matching

Entity matching (also called entity linking, entity resolution, or entity disambiguation) continues to be a problem of interest for both researchers and practitioners. Doan considers it a “Fundamental Operation in Building Knowledge Graphs” and discussed two software-based solutions developed at the University of Wisconsin.

The essential problem is to find duplicate/identical entities (perhaps customers or authors) across two or more datasets and collapse them into single records, even though names (of people and also places) can appear in different forms.

Doan reviewed the state of the art as it has developed over the past five decades: algorithmic solutions are often not end-to-end entity matching systems and rely on many handcrafted rules, while more recent machine-learning-based solutions do not always scale well (to, for example, millions of triple statements and entities).

The Magellan Project at UW-Madison, launched in 2015, seeks to develop end-to-end platforms to solve real-world problems and currently includes two Python-based platforms: PyMatcher, an open-source solution, and CloudMatcher, the closed-source, enterprise-level, cloud-based solution.

CloudMatcher features an interactive interface that lets users help teach the system to match entities from their datasets. It uses a blocking or clustering technique to first identify names that are similar enough to compare (harnessing a variety of text-matching scores to this end) and applies the Random Forest method to sort out the potential matches.
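The block-then-match idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not CloudMatcher's or PyMatcher's code: the function names (normalize, candidate_pairs, similarity, match) and the sample names are hypothetical, blocking is done by shared name tokens, and a simple similarity threshold stands in for the Random Forest classifier that the talk described.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase a name, drop commas and periods, and return its sorted tokens."""
    return sorted(name.lower().replace(",", " ").replace(".", " ").split())


def candidate_pairs(left, right):
    """Blocking step: only pair records that share at least one name token."""
    index = defaultdict(set)
    for j, name in enumerate(right):
        for token in normalize(name):
            index[token].add(j)
    pairs = set()
    for i, name in enumerate(left):
        for token in normalize(name):
            for j in index.get(token, ()):
                pairs.add((i, j))
    return pairs


def similarity(a, b):
    """Text-matching score in [0, 1] computed on the order-insensitive token strings."""
    return SequenceMatcher(None, " ".join(normalize(a)), " ".join(normalize(b))).ratio()


def match(left, right, threshold=0.75):
    """Matching step: keep candidate pairs whose score reaches the threshold.

    Assumption: a fixed threshold replaces the trained classifier here.
    """
    return sorted(
        (i, j)
        for i, j in candidate_pairs(left, right)
        if similarity(left[i], right[j]) >= threshold
    )


if __name__ == "__main__":
    customers = ["John Smith", "Maria Garcia"]           # hypothetical dataset A
    authors = ["Smith, John", "M. Garcia", "Alan Turing"]  # hypothetical dataset B
    for i, j in match(customers, authors):
        print(customers[i], "<->", authors[j])
```

Blocking keeps the number of comparisons manageable: "Alan Turing" shares no token with any record in the first list, so it is never scored at all. In a learned system, several such scores would become features for the classifier rather than being compared against a single cutoff.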

——————————

Yuxiao Dong, Microsoft Research

Learning with Academic Knowledge Graphs

Dong’s talk focused on his work with the Microsoft Academic Knowledge Graph, which is an open (!) dataset driven by a knowledge graph to enable semantic search.

The graph currently comprises over 200 million papers, 250 million authors, over 25,000 institutional affiliations, and almost 1.4 billion (!) references. The interface features a single search box that allows users to mix and match authors, topics, dates, and affiliations to find research articles. The live demo of the interface was very impressive and showed how any text is fair game as input for the search, returning a selection of results filterable by various facets (if you will) including associated entities, topics, authors, etc.

The discussion included entity resolution (again), the essentially social network-style connection of co-authors and references, and automatic topic/field-of-study modeling (about which we would have liked to hear more: surely there’s a taxonomy underlying the subtopic structure).

Dong discussed how they are trying to extract information (facts) to reason about from the content: for example, whether in a paper about diabetes they can extract causes or treatment information using co-occurring text; this included a timely discussion of COVID-19 research data.

Lastly, Dong talked about methods for resolving heterogeneous graph structures to be able to reliably combine information contained in disparate datasets.

——————————

Georg Gottlob, University of Oxford and Vienna Technical University

VADALOG: A Swift Logic for Big Data and a System Combining Datalog Reasoning with Machine Learning

Gottlob’s talk described a system for taking data from various sources and combining them for reasoning and queries in a single system; for example, data may be stored as triples in a graph database, in tables in a relational database, or in NoSQL databases, or it may be extracted via NLP or acquired from the Internet of Things, among many other sources. The disparate formats of these data are challenging to combine and query, and VADALOG is the language and system used to resolve and query (and apply reasoning) across data types.

Gottlob discussed how the built-in reasoners in, for example, graph databases, can be insufficient to answer some kinds of questions. VADALOG allows the user to build, essentially, business rules to combine and query heterogeneous (there’s that word again) data, including the ability to apply machine learning operations, in a single system.
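To give a flavor of what a recursive, Datalog-style rule over mixed sources looks like, here is a minimal Python sketch. It is not VADALOG syntax or the VADALOG engine: the "owns"/"controls" example, the sample records, and the function names are all hypothetical, and the evaluation is a naive fixpoint loop. The rule being evaluated, in generic Datalog notation, is: controls(X, Z) :- owns(X, Z). controls(X, Z) :- controls(X, Y), owns(Y, Z).

```python
def load_owns(triples, rows):
    """Normalize two differently shaped sources into one set of owns(X, Y) facts."""
    facts = {(s, o) for (s, p, o) in triples if p == "owns"}   # graph-style triples
    facts |= {(r["owner"], r["company"]) for r in rows}       # relational-style rows
    return facts


def derive_controls(owns):
    """Apply the recursive rule repeatedly until no new controls(X, Z) fact appears."""
    controls = set(owns)  # base rule: controls(X, Z) :- owns(X, Z)
    while True:
        derived = {
            (x, z)
            for (x, y) in controls
            for (y2, z) in owns
            if y == y2
        }
        if derived <= controls:  # fixpoint reached
            return controls
        controls |= derived


if __name__ == "__main__":
    triples = [("A", "owns", "B")]                # hypothetical graph database content
    rows = [{"owner": "B", "company": "C"}]       # hypothetical relational table content
    print(sorted(derive_controls(load_owns(triples, rows))))
```

The derived fact that A controls C exists in neither source on its own; it only appears once both are expressed as the same kind of fact and the rule is applied across them.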

——————————

This was another installment in this interesting series of lectures, showing how graph technologies are incorporated into real-world systems for problem solving and, crucially, data integration.


You can read the full series of blogs on the Stanford University Knowledge Graphs course here. You can also follow Bob and Ahren on Twitter.

Bob Kasenchak @TaxoBob

Ahren Lehnert @AhrenLehnert