Bob Kasenchak and Ahren Lehnert, Synaptica
The second session of the course, “How to Create a Knowledge Graph,” featured three speakers with very different perspectives on creating (and using) knowledge graphs.
Video of this session is available here.
Juan Sequeda (data.world) – The Socio-Technical Phenomena of Data Integration and Knowledge Graphs
Juan Sequeda’s talk focused on creating enterprise knowledge graphs for business intelligence, that is, for answering specific business questions. Much of his approach involves taking information held in disparate relational databases (and, crucially, other sources) and integrating it in a graph.
Importantly, he emphasized the need to draw on the knowledge, siloed systems, and datasets owned by various people throughout the enterprise, since need-based integration is often done on the fly in spreadsheets and desktop databases. This information is critical to constructing a useful graph, which is built around answering specific business questions (“how many orders went out today?”), because different people will have different definitions and data for a given question. In a large organization, the various databases may contain many hundreds or thousands of tables and attributes that were constructed (and named!) to serve specific needs, not with an eye towards integration.
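To make the integration problem concrete, here is a minimal sketch in Python of two siloed tables being mapped into one set of graph triples so that the "how many orders went out today?" question can be asked once. The table names, column names, mapping structure, and the `shippedOn` predicate are all hypothetical, invented for illustration; they are not taken from the talk.

```python
# Hypothetical rows from two siloed systems that name the same things differently.
sales_orders = [
    {"order_id": 1, "ship_dt": "2024-05-01"},
    {"order_id": 2, "ship_dt": None},  # not shipped yet
]
warehouse_shipments = [
    {"ord_no": 3, "shipped_on": "2024-05-01"},
]

# Hypothetical mappings from each source's columns to shared graph predicates.
MAPPINGS = {
    "sales": {"key": "order_id", "columns": {"ship_dt": "shippedOn"}},
    "warehouse": {"key": "ord_no", "columns": {"shipped_on": "shippedOn"}},
}


def rows_to_triples(rows, mapping):
    """Turn relational rows into (subject, predicate, object) triples."""
    triples = set()
    for row in rows:
        subject = f"order/{row[mapping['key']]}"
        triples.add((subject, "type", "Order"))
        for column, predicate in mapping["columns"].items():
            if row.get(column) is not None:
                triples.add((subject, predicate, row[column]))
    return triples


# One graph, built from both sources.
graph = rows_to_triples(sales_orders, MAPPINGS["sales"]) | rows_to_triples(
    warehouse_shipments, MAPPINGS["warehouse"]
)


def orders_shipped_on(graph, day):
    """Answer the business question against the integrated graph."""
    return sorted(s for s, p, o in graph if p == "shippedOn" and o == day)


print(len(orders_shipped_on(graph, "2024-05-01")))
```

The hard part in practice is agreeing on the mapping, since each team's definition of "went out" may differ; the code only records whatever agreement has been reached.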
We thought his definition of a knowledge graph was a useful one. A knowledge graph
- Integrates knowledge across diverse sources
- Into a structure in which concepts and relationships are first-class citizens, and
- Includes Linked Data and metadata in a graph, which
- Integrates knowledge and data at scale.
He also emphasized that integrated toolsets are required for designing, mapping, and integrating data, as multiple tools that don’t communicate with one another are problematic.
Chris Ré (Stanford) – Theory and Systems for Weak Supervision
The second talk, by Chris Ré of Stanford’s Computer Science Department, focused on graphs as a tool for machine learning. The problem to be solved is essentially how to label a large quantity of data (perhaps documents or images) while reducing the effort required to hand-tag it.
Of the three components required for this kind of exercise (models, training data, and tools), training data (pre-tagged datasets, for whatever application) is the hardest to come by. Machine learning algorithms use such training data to infer tags for the rest of the dataset, with the understanding that the tagging data could be incomplete or noisy.
Ré’s assertion is that graph structures are useful for providing this kind of training data for machine learning applications, as RDF-based data is widely reusable and shareable and well-suited to machine learning use cases.
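As a rough illustration of how noisy, incomplete signals can stand in for hand-tagged training data, the sketch below combines a few heuristic "labeling functions" by majority vote, with one of them consulting a tiny graph of triples. This is a common weak supervision baseline and not necessarily the method presented in the talk; the function names, the labels `ORG` and `OTHER`, and the mini graph are all assumptions made for the example.

```python
from collections import Counter

ABSTAIN = None  # a labeling function may decline to vote

# Hypothetical mini knowledge graph of (subject, predicate, object) triples.
KG = {
    ("Stanford", "type", "Organization"),
    ("Apple", "type", "Organization"),
}


def lf_known_organization(text):
    # Vote ORG if an entity typed as Organization in the graph appears in the text.
    orgs = {s for s, p, o in KG if p == "type" and o == "Organization"}
    return "ORG" if any(name in text for name in orgs) else ABSTAIN


def lf_university_keyword(text):
    # Vote ORG on a simple keyword match.
    return "ORG" if "university" in text.lower() else ABSTAIN


def lf_no_capitals(text):
    # Noisy heuristic: no capitalized word after the first suggests no organization name.
    words = text.split()[1:]
    return "OTHER" if not any(w[:1].isupper() for w in words) else ABSTAIN


LABELING_FUNCTIONS = [lf_known_organization, lf_university_keyword, lf_no_capitals]


def weak_label(text):
    """Majority vote over non-abstaining labeling functions (ties go to the earliest vote)."""
    votes = [v for v in (lf(text) for lf in LABELING_FUNCTIONS) if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    return Counter(votes).most_common(1)[0][0]


print(weak_label("She works at Stanford"))
```

Labels produced this way are expected to be wrong some of the time; the point is that they are cheap to generate at scale and can then be used to train a model.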
Xiao Ling (Apple Siri) – Creating the Knowledge Graph for SIRI
Ling discussed knowledge base construction at Apple Siri, where the stated goals are (1) to build a graph that represents all human knowledge (!) and (2) to be able to answer domain questions using Siri. The discussion of automatic knowledge base construction (AKBC) revolved around two specific problems: extracting structured information from infoboxes, and entity resolution, or resolving conflicting data extracted from disparate sources.
Extracting information from infoboxes involves mapping the scraped information to target predicates; linked information is easier to resolve correctly than unlinked information, and sometimes the information is incorrect.
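The mapping step described above might look something like the following sketch, which assumes wiki-style infoboxes where linked values are written as `[[Target|label]]`. The field-to-predicate table, the predicate names, and the sample infobox are hypothetical; nothing here reflects Apple's actual pipeline.

```python
# Hypothetical mapping from scraped infobox field names to target graph predicates.
PREDICATE_MAP = {
    "born": "birthDate",
    "birth_date": "birthDate",
    "spouse": "spouse",
}


def parse_value(raw):
    """Distinguish linked values (assumed [[Target|label]] markup) from plain text."""
    raw = raw.strip()
    if raw.startswith("[[") and raw.endswith("]]"):
        target = raw[2:-2].split("|")[0]
        # A link already names an entity, so it is easier to resolve.
        return {"kind": "entity", "value": target}
    # Unlinked text still needs to be resolved to an entity or parsed as a literal.
    return {"kind": "literal", "value": raw}


def infobox_to_triples(subject, infobox):
    """Map known infobox fields to predicates; skip fields with no mapping."""
    triples = []
    for field, raw in infobox.items():
        predicate = PREDICATE_MAP.get(field.lower())
        if predicate is None:
            continue
        triples.append((subject, predicate, parse_value(raw)))
    return triples


# Placeholder data, not a real person.
example = {
    "Born": "1 January 1900",
    "Spouse": "[[Example Person|E. Person]]",
    "Signature": "sig.png",
}
for triple in infobox_to_triples("person/1", example):
    print(triple)
```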
Information derived from various sources (text, semi-structured information, structured data, and human-curated input) is compared to find potential duplicates and resolve conflicts. This is handled by inferencing and algorithms, which also helps to validate the accuracy of the data.
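The talk summary does not say which algorithms are used for conflict resolution. As one simple possibility, the sketch below resolves conflicting values for a single fact by weighted voting, trusting some source types more than others. The weights, the source-type names, and the normalization rule are assumptions chosen for illustration.

```python
from collections import defaultdict

# Hypothetical trust weights per source type; real values would have to be tuned or learned.
SOURCE_WEIGHT = {
    "curated": 3.0,
    "structured": 2.0,
    "semi_structured": 1.5,
    "text": 1.0,
}


def normalize(value):
    """Collapse case and whitespace so trivially different strings count as duplicates."""
    return " ".join(value.lower().split())


def resolve(claims):
    """Pick the best-supported value from (source_type, value) claims.

    Returns (value, confidence), where confidence is the winner's share of total weight.
    """
    if not claims:
        return None, 0.0
    scores = defaultdict(float)
    for source, value in claims:
        scores[normalize(value)] += SOURCE_WEIGHT.get(source, 0.5)
    best = max(scores, key=scores.get)
    return best, scores[best] / sum(scores.values())


# Placeholder claims about one attribute of one entity.
claims = [("text", "Blue"), ("semi_structured", "blue "), ("structured", "Green")]
print(resolve(claims))
```

Here two weaker sources that agree outvote one stronger source that disagrees; a low confidence score could be used to route the fact to a human curator.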
Overall, the message running throughout all three talks is that data and content must be prepared and rationalized if a coherent, useful knowledge graph is to be constructed.