
Knowledge Graph Demo

Written by: Dr. Clair J. Sullivan, Graph Data Science Advocate, Neo4j

email: clair.sullivan@neo4j.com

Twitter: @CJLovesData1

Last updated: 2021-05-13

Introduction

This repository contains a demonstration of how to create and query a knowledge graph using Neo4j. It is based on a Docker container that stands up both the Neo4j database and a Jupyter Lab instance. The demo covers two methods of doing this:

  1. A version based on natural language processing (NLP) that uses spaCy to extract (subject, verb, object) triples from Wikipedia and the Google Knowledge Graph via their APIs (a minimal sketch of the triple extraction follows this list).
  2. A version that queries Wikidata given a series of items (based on the Wikidata Q-values) and their claims (using the Wikidata P-values). The Q-values are used to create the subjects and objects of the triples while the P-values are used to create the verbs.
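To make the first approach concrete, here is a minimal sketch of pulling (subject, verb, object) triples out of text with spaCy's dependency parse. The model name and the helper function are illustrative only; the actual notebooks do considerably more (entity extraction, filtering, and the Google Knowledge Graph lookups).

```python
# Minimal sketch of approach 1: extract (subject, verb, object) triples from a
# sentence using spaCy's dependency parse. Assumes the en_core_web_sm model has
# been downloaded (python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_triples(text):
    triples = []
    for sent in nlp(text).sents:
        for token in sent:
            if token.pos_ != "VERB":
                continue
            subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
            objects = [c for c in token.children if c.dep_ in ("dobj", "attr")]
            for subj in subjects:
                for obj in objects:
                    triples.append((subj.text, token.lemma_, obj.text))
    return triples

print(extract_triples("Neo4j stores graphs. Clair wrote the demo."))
# -> roughly [('Neo4j', 'store', 'graphs'), ('Clair', 'write', 'demo')]
```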

General Comment

While either of these methods works, the benefit of the first approach is that you get an essentially limitless set of verbs (since they are detected more or less automatically from the text), but you will have a problem with entity disambiguation. The benefit of the second approach is that Wikidata handles the entity disambiguation for you, but you have to supply the list of verbs (claims) that you care about.

Personally, I prefer the second approach. The reason is that you don't have to do as much NLP on the unstructured text. You will still run named entity recognition on the input text, but spaCy handles that pretty easily. The first approach, on the other hand, relies on accurately detecting the verbs and attributing them to subjects and objects, which is very complicated. The second approach is much cleaner. Further, complicated NLP approaches like the first require much more tuning. NLP is not a so-called "silver bullet": it requires a lot of tuning and is very specific to the language and vocabulary involved. If the vocabulary is particularly technical, you will likely find that Wikidata gives you superior results.
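For reference, that named entity recognition step looks roughly like this with spaCy; the model and the example sentence are placeholders, not taken from the notebooks.

```python
# Named entity recognition with spaCy. The entity labels (PERSON, ORG, GPE, ...)
# come from the pretrained model, not from anything in this repo.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Clair Sullivan works at Neo4j in the United States.")
for ent in doc.ents:
    print(ent.text, ent.label_)
# e.g. "Clair Sullivan PERSON", "Neo4j ORG", "the United States GPE"
```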

A note on the use of the Google Knowledge Graph

(This is only used for the first approach above, and just for demonstration purposes. You can easily substitute any additional data source, including Wikidata.)

We will be working with the Google Knowledge Graph REST API in this example. Users are permitted 100,000 free calls per day, but every call requires an API key. A link on how to create this API key is below. Once the key is created, it is recommended that you store it in a file named .api_key in the notebooks/ subdirectory of this repo.
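As a rough sketch (not the exact code in the notebooks), a call to the Knowledge Graph Search API with the key read from that file looks something like this; the endpoint and parameters follow Google's public API documentation, and the query string is just an example. Adjust the path if you keep the key elsewhere.

```python
# Sketch of a Google Knowledge Graph Search API call, reading the key from the
# notebooks/.api_key file described above.
import requests

with open("notebooks/.api_key") as f:
    api_key = f.read().strip()

params = {
    "query": "Neo4j",   # the entity to look up
    "key": api_key,
    "limit": 1,
}
resp = requests.get("https://kgsearch.googleapis.com/v1/entities:search", params=params)
resp.raise_for_status()
for element in resp.json().get("itemListElement", []):
    result = element["result"]
    print(result.get("name"), "-", result.get("description", ""))
```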

A note on scraping Wikidata with a bot

We will be using Pywikibot to scrape entries from Wikidata. In order to do this, you will need to create a token for this bot. Directions on how to do so can be found here. Once you have that token, save it into a file named .wiki_api_token in the notebooks/ subdirectory.
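For orientation, here is a minimal Pywikibot sketch of fetching one item and one claim from Wikidata; Q42 (Douglas Adams) and P31 ("instance of") are just illustrative Q- and P-values, and it assumes Pywikibot has been configured with the token described above. The notebooks build on this pattern with their own item and claim lists.

```python
# Fetch a single Wikidata item and read one of its claims with Pywikibot.
import pywikibot

site = pywikibot.Site("wikidata", "wikidata")
repo = site.data_repository()

item = pywikibot.ItemPage(repo, "Q42")
item_dict = item.get()                      # labels, descriptions, claims, ...
print(item_dict["labels"]["en"])            # "Douglas Adams"

for claim in item_dict["claims"].get("P31", []):
    target = claim.getTarget()              # the object of the (subject, verb, object) triple
    print("P31 ->", target.get()["labels"]["en"])
```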

How to run the code

With Docker and docker-compose installed, from the CLI:

docker-compose build
docker-compose up

Copy the Jupyter Lab link from the terminal output (it includes the notebook token) and paste it into your web browser. To open the Neo4j browser, navigate to localhost:7474. The login is neo4j and the password is kgDemo. These are set on line 15 of docker-compose.yml, and you can change them to anything you like.
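If you want to hit the database from Python rather than the browser, a minimal sketch with the official neo4j driver looks like this (the notebooks may use a different client library). It assumes the standard Bolt port (7687) and the credentials above.

```python
# Connect to the running Neo4j container and run a simple Cypher query.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "kgDemo"))

with driver.session() as session:
    result = session.run("MATCH (n) RETURN count(n) AS node_count")
    print(result.single()["node_count"])

driver.close()
```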

When you are done, you can shut down the container by hitting CTRL-C in the terminal and then running, at the prompt:

docker-compose down

Note on using multiple databases (as shown in this repo)

You might find it convenient to have two different databases, one for each method. To achieve this, edit lines 8 and 9 of docker-compose.yml accordingly (i.e. make a different directory for each graph). You might find this helpful if, like me, you screw up one graph and don't want to recreate the other. :)

Useful links

Final Note

This notebook is heavily based on a talk I gave at the 2021 Open Data Science Conference East, whose repository you can find here.
