Knowledge Graph Demo
Written by: Dr. Clair J. Sullivan, Graph Data Science Advocate, Neo4j
email: clair.sullivan@neo4j.com
Twitter: @CJLovesData1
Last updated: 2021-05-13
Introduction
This repository contains a demonstration of how to create and query a knowledge graph using Neo4j. It is based on a Docker container that runs both the Neo4j database and a Jupyter Lab instance. The demo covers two methods for building the graph:
- A version based on natural language processing (NLP), using spaCy to extract (subject, verb, object) triples from Wikipedia text and from the Google Knowledge Graph via its API.
- A version that queries Wikidata given a series of items (based on the Wikidata Q-values) and their claims (using the Wikidata P-values). The Q-values are used to create the subjects and objects of the triples while the P-values are used to create the verbs.
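Either way, the resulting triples end up as nodes and relationships in Neo4j. As a quick illustration, here is a minimal sketch of turning one triple into a Cypher `MERGE` statement; the `Entity` label and the helper name are my own, not the exact Cypher used in the notebooks:

```python
# Hypothetical helper: turn one (subject, verb, object) triple into a
# Cypher MERGE statement. Label and naming are illustrative only.

def triple_to_cypher(subject: str, verb: str, obj: str) -> str:
    """Build a Cypher MERGE for a single triple."""
    # Neo4j relationship types cannot contain spaces; upper-case the verb.
    rel = verb.strip().upper().replace(" ", "_")
    return (
        f'MERGE (s:Entity {{name: "{subject}"}}) '
        f'MERGE (o:Entity {{name: "{obj}"}}) '
        f"MERGE (s)-[:{rel}]->(o)"
    )

print(triple_to_cypher("Neo4j", "is a", "graph database"))
```

Using `MERGE` rather than `CREATE` keeps the graph free of duplicate nodes when the same entity appears in many triples.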
General Comment
While either of these methods works, the benefit of the first approach is that you will have a limitless number of verbs (since they are detected more or less automatically from text), but you will have a problem with entity disambiguation. The benefit of the second approach is that Wikidata handles entity disambiguation for you, but you have to supply the list of verbs (claims) that you care about.
Personally, I prefer the second approach. The reason is that you don't have to do much NLP on the unstructured text. You will still use named entity recognition on the input text, but spaCy handles that pretty easily. The first approach, on the other hand, relies on the ability to accurately detect the verbs and attribute them to subjects and objects, which is very complicated. The second approach is much cleaner. Further, complicated NLP approaches like the first require much more tuning. NLP is not a so-called "silver bullet": it requires a lot of tuning and is very specific to the language and vocabulary. If the vocabulary is particularly technical, you will likely find that Wikidata gives you superior results.
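To make the Q-value/P-value idea concrete, here is a minimal sketch of turning Wikidata-style claims into triples. The dict below is a trimmed, hard-coded stand-in shaped like Wikidata's entity JSON (real code would fetch it via Pywikibot or the Wikidata API), and the function name is mine:

```python
# Q42 = Douglas Adams, P31 = "instance of", Q5 = human (real Wikidata ids;
# the dict is a cut-down stand-in for the full API response).
entity = {
    "id": "Q42",
    "claims": {
        "P31": [
            {"mainsnak": {"datavalue": {"value": {"id": "Q5"}}}}
        ],
    },
}

# Only the claims we care about become verbs; everything else is ignored.
verbs = {"P31": "INSTANCE_OF"}

def claims_to_triples(entity: dict, verbs: dict) -> list:
    """Extract (subject Q-id, verb, object Q-id) triples for chosen P-values."""
    triples = []
    for p_value, statements in entity["claims"].items():
        if p_value not in verbs:
            continue
        for statement in statements:
            obj = statement["mainsnak"]["datavalue"]["value"]["id"]
            triples.append((entity["id"], verbs[p_value], obj))
    return triples

print(claims_to_triples(entity, verbs))  # [('Q42', 'INSTANCE_OF', 'Q5')]
```

Note how the `verbs` dict is exactly the "list of verbs (claims) that you care about" mentioned above: the P-values you leave out simply never become relationships.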
A note on the use of the Google Knowledge Graph
(This is only used for the first approach above, and just for demonstration purposes. You can easily substitute any additional data source, including Wikidata.)
We will be working with the Google Knowledge Graph REST API in this example. Users are permitted 100,000 free calls to the API per day, but an API key is required for the calls. A link on how to create this API key is below. Once the key is created, it is recommended that you store it in a file named `.api_key` in the `notebooks/` subdirectory of this repo.
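For illustration, here is a minimal sketch of reading that key and composing a call to the kgsearch REST endpoint (the endpoint is Google's real one; the file path, function names, and query are my assumptions):

```python
# Sketch: call the Google Knowledge Graph Search API with the key stored
# in notebooks/.api_key. Function names are illustrative.
from pathlib import Path
from urllib.parse import urlencode
from urllib.request import urlopen  # only used when you actually hit the API

KG_ENDPOINT = "https://kgsearch.googleapis.com/v1/entities:search"

def build_kg_url(query: str, api_key: str, limit: int = 1) -> str:
    """Compose the REST URL for a Knowledge Graph entity search."""
    params = {"query": query, "key": api_key, "limit": limit, "indent": True}
    return f"{KG_ENDPOINT}?{urlencode(params)}"

def search_kg(query: str, key_file: str = "notebooks/.api_key") -> bytes:
    """Read the key from disk and issue the request (network call)."""
    api_key = Path(key_file).read_text().strip()
    with urlopen(build_kg_url(query, api_key)) as response:
        return response.read()

print(build_kg_url("Neo4j", "DUMMY_KEY"))
```

Keeping the key in a dot-file (and out of version control) means the notebooks never have to hard-code it.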
A note on scraping Wikidata with a bot
We will be using Pywikibot to scrape entries from Wikidata. In order to do this, you will need to create a token for this bot. Directions on how to do so can be found here. Once you have that token, save it into a file named `.wiki_api_token` in the `notebooks/` subdirectory.
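A hedged sketch of what the Pywikibot side might look like (the function names are mine, and Pywikibot also expects a `user-config.py`, which this sketch omits):

```python
# Sketch: read the saved bot token and fetch one item's claims with
# Pywikibot. Names and paths are illustrative, not the repo's exact code.
from pathlib import Path

def read_token(path: str = "notebooks/.wiki_api_token") -> str:
    """Read the bot token saved per the directions above."""
    return Path(path).read_text().strip()

def fetch_claims(q_value: str) -> dict:
    """Fetch the claims dict for a Wikidata item (network call)."""
    import pywikibot  # imported lazily so the rest of the file runs without it
    site = pywikibot.Site("wikidata", "wikidata")
    item = pywikibot.ItemPage(site.data_repository(), q_value)
    return item.get()["claims"]
```

The claims dict returned for, say, `"Q42"` is what gets filtered down to the P-values (verbs) you care about when building the triples.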
How to run the code
With Docker and docker-compose installed, from the CLI:

```
docker-compose build
docker-compose up
```
Take the link for Jupyter Lab from the terminal (it has the notebook token with it) and copy and paste it into your web browser. To open the Neo4j browser, navigate to `localhost:7474`. The login is `neo4j` and the password is `kgDemo`. These are set on line 15 of `docker-compose.yml`, and you can change them to anything you like.
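For reference, the credentials line looks something like the fragment below (illustrative only; check the actual `docker-compose.yml` in this repo — Neo4j's official image reads its credentials from the `NEO4J_AUTH` variable):

```yaml
services:
  neo4j:
    image: neo4j:latest
    environment:
      - NEO4J_AUTH=neo4j/kgDemo   # username/password — change as you like
```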
When you are done, you can shut down the container by hitting `CTRL-C` in the terminal and then entering at the prompt:

```
docker-compose down
```
Note on using multiple databases (as shown in this repo)
You might find it convenient to have two different databases, one for each method. To achieve this, edit lines 8 and 9 in `docker-compose.yml` accordingly (i.e., make a different directory for each graph). You might find this helpful if, like me, you screw up one graph and don't want to recreate the other. :)
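The edit amounts to pointing each graph's volume at its own host directory, along these lines (illustrative — the directory names in the actual file may differ):

```yaml
    volumes:
      - ./data/nlp_graph:/data    # swap the host directory per method
      - ./logs:/logs
```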
Useful links
- Docker for Data Science -- A Step by Step Guide
- Google Knowledge Graph Search API
- Neo4j
- spacy Documentation
- Wikipedia package docs
Final Note
This notebook is heavily based on a talk I gave at the 2021 Open Data Science Conference East, whose repository you can find here.