Bob Kasenchak and Ahren Lehnert, Synaptica
The latest installment of the Stanford class on Knowledge Graphs was entitled “What are some of the knowledge graph engines prevalent in the industry today?” and again featured three interesting speakers from diverse parts of the field. Video of the session is available here.
Neo4j and the property graph data model
Rathle’s talk was broken into three topic areas:
- Property Graphs (and how they compare to RDF graph databases)
- Graph Algorithms, and
- Practical applications of graphs
Neo4j is a labeled property graph database (not RDF) and powers many implemented, industrial-scale knowledge graphs at organizations such as NASA, eBay, the German Center for Diabetes Research, and the International Consortium of Investigative Journalists.
At NASA, documents are the source of graph information: NASA uses natural language processing (NLP) to scan documents for data to include in the knowledge graph.
eBay maintains a knowledge graph of meanings and concepts representing over a billion products; speech recognition output is parsed and tied back to this graph.
The German Center for Diabetes Research posits that enough research has already been done, but the information is scattered across disparate papers. They use a knowledge graph to bring the results together.
The International Consortium of Investigative Journalists (ICIJ) won a Pulitzer Prize for its work on the Panama Papers: 11.5 million documents (2.6 TB of emails, scanned documents, bank statements, and more) leaked to a German newspaper on a hard drive. The ICIJ curated this information into a graph of people, banks, account numbers, shared accounts, and companies, which allowed them to trace patterns of hidden money.
The final example was Covid Graph, a nonprofit collaboration of software developers, data scientists, and medical and other researchers working on COVID-19.
Rathle then discussed the popularity of graphs and graph databases, citing some history and the current volume of searches and papers on the topic. He compared labeled property graphs to RDF graphs in terms of their motivations: property graphs are motivated by data storage and management, querying, and developers and applications, while RDF graphs are motivated by data exchange, interoperability, inference, and machine consumption. He also compared their query languages, Cypher (www.opencypher.org) and SPARQL.
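To make the modeling contrast concrete, here is a minimal Python sketch of the same fact expressed both ways. All entities, prefixes, and property names below are invented for illustration, not drawn from the talk:

```python
# The same fact -- "Alice works at NASA since 2015" -- modeled as a labeled
# property graph and as RDF triples. (Hypothetical data; not Neo4j's or any
# RDF store's actual API.)

# Labeled property graph: nodes and edges carry labels and key/value properties.
property_graph = {
    "nodes": {
        "n1": {"labels": ["Person"], "props": {"name": "Alice"}},
        "n2": {"labels": ["Organization"], "props": {"name": "NASA"}},
    },
    "edges": [
        # The edge itself holds a property ("since") -- a hallmark of property graphs.
        {"from": "n1", "to": "n2", "type": "WORKS_AT", "props": {"since": 2015}},
    ],
}

# RDF: everything is a subject-predicate-object triple. To attach data to the
# relationship itself, the edge is reified as its own resource.
rdf_triples = [
    ("ex:alice", "rdf:type", "ex:Person"),
    ("ex:nasa", "rdf:type", "ex:Organization"),
    ("ex:alice", "ex:worksAt", "ex:nasa"),
    ("ex:employment1", "rdf:type", "ex:Employment"),  # reified edge
    ("ex:employment1", "ex:employee", "ex:alice"),
    ("ex:employment1", "ex:employer", "ex:nasa"),
    ("ex:employment1", "ex:since", "2015"),
]
```

Note how the property graph attaches `since` directly to the edge, while plain RDF must reify the relationship to say anything about it; this difference drives much of the property-graph-versus-RDF comparison Rathle described.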
Finally, he noted a strong synergy between graphs, machine learning (ML), and artificial intelligence (AI) and cited a huge increase in research papers on graphs.
Putting data into Context using AWS Neptune
Bebee started with a brief overview of the evolution of graphs and the semantic web starting in about 1999 to the present.
Amazon Neptune is Amazon’s Graph solution, and customers are excited about graphs, linking things, and building applications on them.
He said that “graphs are all around us” and described when graphs should be used: when your application is about relationships and you need to ask questions about them, a graph is a good fit. Examples include:
- Social networking
- Knowledge Graphs
- Fraud Detection
- Life sciences, and
- Network & IT operations
He then asked whether many of these problems could instead be solved with relational databases (RDBMS) or key-value stores. They could, but those systems may be poor at querying, processing, and expressing graph patterns, and their rigid, inflexible schemas are hard to change.
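To illustrate the querying point, here is a small Python sketch (the social data is invented): a variable-depth “who is within N hops” question, which would require recursive self-joins in an RDBMS, becomes a plain traversal over an adjacency list:

```python
from collections import deque

# Hypothetical "follows" relationships as an adjacency list.
follows = {
    "ann": ["bob", "cat"],
    "bob": ["dan"],
    "cat": ["dan", "eve"],
    "dan": ["fay"],
    "eve": [],
    "fay": [],
}

def within_hops(graph, start, max_hops):
    """Return everyone reachable from `start` in at most `max_hops` edges."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        person, depth = frontier.popleft()
        if depth == max_hops:
            continue  # do not expand past the hop limit
        for nxt in graph.get(person, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    seen.discard(start)
    return seen

# Two hops from "ann" reaches bob and cat (1 hop), then dan and eve (2 hops).
```

Changing the depth means changing one argument, not rewriting a chain of joins, which is the schema-flexibility point Bebee was making.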
When customers think of a graph, they think of something like associating people with purchased products, or social networks for recommendations; perhaps people visited locations and want to know about art museums, or facts about locations and travel destinations. They are not typically thinking about graph models and frameworks, and especially not about whether those are property graphs or RDF graphs. Bebee also compared the two, noting that RDF brings standards, frameworks, schema, query languages, W3C standards for interchange, and so on.
At the end of the day, many customers want both. Amazon Neptune supports both models on cloud-native storage, surfacing interfaces and APIs on both the property graph and RDF sides.
Information is stored in Neptune as quads: Subject, Predicate, Object (SPO) plus Graph (SPOG). Edges also have IDs, so, as in labeled property graphs, edges can have properties. The data is served by three indices: an SPOG index for fast, efficient lookups when subject and predicate are bound; POGS, for when predicate and object are bound; and GPSO, for when the graph (edge ID) and predicate are bound.
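A toy Python sketch of the indexing idea (an illustration only, not Neptune’s actual implementation; the quads are invented) showing how three orderings of the same data turn different access patterns into cheap prefix scans:

```python
# (S, P, O, G) quads. Here g1/g2 double as edge IDs, letting an edge carry
# a property ("since") the way a labeled property graph would.
quads = [
    ("alice", "knows", "bob", "g1"),
    ("g1", "since", "2015", "g1"),
    ("alice", "knows", "carol", "g2"),
]

# Each index is the same data reordered and sorted; a real store would use
# B-trees or similar so that prefix lookups are efficient.
spog = sorted((s, p, o, g) for s, p, o, g in quads)
pogs = sorted((p, o, g, s) for s, p, o, g in quads)
gpso = sorted((g, p, s, o) for s, p, o, g in quads)

def prefix_scan(index, prefix):
    """Return all entries whose leading fields match `prefix`."""
    return [row for row in index if row[: len(prefix)] == prefix]

# Subject and predicate bound -> use SPOG.
who_alice_knows = prefix_scan(spog, ("alice", "knows"))
# Predicate and object bound -> use POGS.
who_knows_bob = prefix_scan(pogs, ("knows", "bob"))
# Graph (edge ID) and predicate bound -> use GPSO.
edge_props = prefix_scan(gpso, ("g1", "since"))
```

Each query touches only one index and reads a contiguous, sorted range, which is why a handful of orderings can cover the common lookup patterns.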
He went on to talk about the size of the graph market segment. There are many more customers who could benefit from graphs than know they want or need a graph. Customers want graphs and have use cases across graph models. Customers also want to know how to model data in graphs, how to query graph data, and how to exchange data between applications (graph and non-graph).
Bebee noted that “a rising graph floats all nodes, edges, and properties,” which makes it easier to model, query, and exchange information.
Large-Scale Graph Analytics with Apache Spark
Apache Spark is an open-source computing engine, and Zaharia is a co-founder of Databricks, a startup providing cloud-based data and ML platforms and services to thousands of enterprises; Databricks processes exabytes of data per day. Very large graphs (terabytes and up) arise in many settings, especially where data is auto-generated or machine-collected.
There are many practical and interesting applications in such domains, and he went on to describe several use cases for very large graphs.
The first use case was FINRA, a regulatory organization that traces illegal trading activity. Its data sources include up to 100 billion events per day from trading, plus 30 petabytes of historical data. The main task is to identify trading patterns across exchanges and commodities that indicate illegal activity: there is a great deal of automated trading, and people attempt to manipulate stock prices, trade on inside information, and so on. Detection involves both “hard” rules and ML algorithms. Historical data is included because once a pattern is detected, FINRA wants to search for it retroactively. The scale is massive: events, data (nodes and edges), monitoring actions, firms, brokers, markets, and exchanges. They used Apache Spark to ingest and prepare the data, SQL for ad-hoc queries, and then pattern matching and algorithms to detect possible illegal activity.
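As a rough illustration of what a “hard” rule might look like, here is a Python sketch; the rule, data, and thresholds are invented for this example and are not FINRA’s actual logic:

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Invented trade events: flag accounts that take opposing sides in the same
# symbol within a short window -- a crude wash-trading-style signal.
trades = [
    {"account": "A1", "symbol": "XYZ", "side": "BUY",  "time": datetime(2020, 5, 1, 9, 30)},
    {"account": "A1", "symbol": "XYZ", "side": "SELL", "time": datetime(2020, 5, 1, 9, 31)},
    {"account": "A2", "symbol": "XYZ", "side": "BUY",  "time": datetime(2020, 5, 1, 9, 30)},
    {"account": "A2", "symbol": "XYZ", "side": "SELL", "time": datetime(2020, 5, 1, 14, 0)},
]

def flag_wash_trades(events, window=timedelta(minutes=5)):
    """Return (account, symbol) pairs with opposing trades inside `window`."""
    by_key = defaultdict(list)
    for e in events:
        by_key[(e["account"], e["symbol"])].append(e)
    flagged = set()
    for key, evs in by_key.items():
        evs.sort(key=lambda e: e["time"])
        for a, b in zip(evs, evs[1:]):  # compare consecutive trades
            if a["side"] != b["side"] and b["time"] - a["time"] <= window:
                flagged.add(key)
    return flagged

# A1 trades both sides within a minute and is flagged; A2's opposing trades
# are hours apart and are not.
```

Running a rule like this over historical data is exactly why FINRA keeps petabytes of past events: once a new pattern is defined, it can be applied retroactively.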
Similarly, in the realm of drug discovery, AstraZeneca built and queries a knowledge graph to aid drug discovery. Its data sources include genomics, proteomics, medical records, research papers and databases, and chemistry. The task is to recommend new compounds to test at various stages and/or new targets on which to test existing compounds. Issues they face include identifying the correct target, safety, finding the right patients, and commercial potential. As in the other use cases, there are massive data sources to combine and resolve, and much work to maintain them and keep them current. They used Apache Spark and SQL for preparation, then NLP to extract information from content, along with neural networks and many custom data types and algorithms.
Finally, Apple’s use case involved network security: collecting and querying information about computer-system events to find and counter security threats (breaches, infiltration, etc.). The data sources are events on company systems and networks, which means a large volume of flowing information. The task is to collect detailed information on events, find patterns, and prevent breaches and other bad behavior. Analysts then perform ad-hoc analysis to answer how an incident happened and whether there are similar events, so quick, interactive queries matter. Apple found that a graph is a natural way to organize this information for modeling as well as analysis, and the data has to be stored long term for retroactive analysis, since some attackers operate over the course of years.
Zaharia summarized that graphs with machine-generated data can be very large, and there needs to be a good way to model, store, and query this information.