Creating Knowledge graphs from the literature: the case of health resilience in Green Building Neighbourhoods
The Horizon 2020 PROBONO project (Grant agreement 101037075) aims at demonstrating "strong examples of how Green Building Neighbourhoods (GBNs) technological and social innovations can be applied, with a vision focused on building infrastructure and a renewed focus on people and sustainability, taking full advantage of digitization and smart technologies for the benefit of society". The Task 3.5 of this projects aims at reviewing ”Interventions to mitigate diseases outbreaks”.
In this document, we summarize how we used new technologies, including Large Language Models (LLMs), to consolidate a knowledge graph of (KG) this topic, based on the literature, to demonstrate the feasibility of building a body of knowledge pertaining to a certain domain, with a specific angle. This opens the door to more opportunities the growing space existing between LLMs and KGs.
The main objective of the task is to review the scientific literature to identify key risks, stakeholders, technologies and mitigations measures both at building and neighbourhood scales.
The present knowledge graph has been created using new tools, helping to streamline, faster and more consistently, information from the literature:
- Parsing of the literature was done with GROBID. This provided structured text (XML) from the article PDFs;
- The data was processed and structured in an RDF that was produced with the Owlready2 python library;
- Vector embedding based on Pinecone because of the early possibility to integrate with Langchain, and because of its ease of use.
Once the data was prepared, information structuring was done using different solutions:
- NLTK was used to extract topics and themes of the articles;
- Spacy was used for entity recognition, and CoreferenceResolver to tackle disambiguation ;
- Text was processed using a combination of OpenAI API (both using GPT3.5 and GPT4 endpoints) as well as running local models (NOUS/LLaMa), using the python requests or langchain libraries.
We started with defining a basic ontology based on classes of interest for mitigation measures, namely ’Risks’, ’Mitigations’, ’People’, and ’Technologies’. These constitute an initial body of knowledge, which is then used to build ’Blueprints’, possible interventions to mitigate diseases outbreaks.
To date, the Knowledge Graph (v0.3) contains information on 376 articles, from which were programmaticaly derived 3418 risks, 5295 mitigation measures, 2640 stakeholders and 3915 technologies. The team used these to build 24 blueprints manually, and automated the production of 50 others.
We hope that releasing the knowledge graph under an open-source license (CC BY-NC-SA) will drive use of this knowledge graph and that health professionals can use this to derive useful, professionally-approved mitigation measures.
This repository branch contains:
- A zipped knowledge graph
- Support materials
- Examples of work on the knowledge graph
- Example of SPARQL queries
- Helper tools to
- manage the knowledge graph
- work with LLMs
- simplified kgs and focusing on blueprints
- Ordering properties when displaying
- export of KGs in different formats, including rdf, ttl, nt.
- Tools to process data from a body of knowledge
- ... and serve this body of knowledge to a fastapi app used for further work on the KG.
- A technical note describing the work
Streamlit site lives here online -- source code there.
Static pages live here.
This activity yielded expected outcomes, consisting in mapping out risks, technologies and stakeholders, as well as suggesting mitigation measures.
This however is an asset that has a highly reusable potential, and this list, albeit listing possible actions from a project perspective, could be undertaken by other parties:
- We plan on integrating more robust graph management solutions, possibly Neo4j or similar, to continually review and enrich the knowledge graph;
- We would want to enrich the semantic content of the graph to make it more usable and possible an input to KG-backed LLMs;
- Reusing existing semantic assets (eg Wikidata) might help structure
- Connect this knowledge graph to other KGs or ontologies.
Please note that this document is a research-based exploration compiled by knowledge management researchers, not medical professionals. Our findings are presented with the intention to inform and contribute to the dialogue on public health strategies. They are indicative and should serve as a preliminary guide. We encourage all readers to consult with qualified health professionals for expert advice and to confirm and enrich these insights.
- Environment is defined in the requirements.txt
- Using python 3.10.2 (TBC)
Copyright 2022-2024 LJ