araobp / bach-network

J. S. Bach's network with spaCy(NLP)

Home Page: https://araobp.github.io/bach-network/

J. S. Bach's Network with spaCy(NLP)

This web app is hosted on: https://araobp.github.io/bach-network/bach_network.html

Note: I worked on this project independently, as self-study, during the 2023-2024 winter break.

Background and Motivation

Twenty-five years ago, I lived in Berlin, Germany. Berlin is in the northeastern part of Germany, and I listened to a lot of Bach's music there, including organ music and the Christmas Oratorio in Protestant churches. Using Natural Language Processing (NLP) with spaCy, I created a social network diagram of Bach from a book on his works obtained from Project Gutenberg. I followed Thu Vu's video on YouTube to learn how to generate such a network from a book, then made further improvements of my own to get a more satisfying network.

Processing Pipeline

This is the pipeline I devised to generate the network from the web book and visualize it in a browser, with no external database.

<--- BeautifulSoup ---->  <------ spaCy ------>  <--- networkx ---->  <-- graphology.js --->  <-- vis.js --->
[Web book]=>[Paragraphs]=>[NER/DependencyGraph]=>[Network Formation]=>[Graph DB]=>[Subgraph]=>[Visualization]
<-- paragraphs.ipynb -->  <--------- bach_network.ipynb ----------->  <------ bach_network.html ------------>

Jupyter Notebook

All the resulting data is stored in SQLite.
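As a rough illustration of storing the results in SQLite, the sketch below writes weighted edges into a single table using Python's built-in `sqlite3` module. The table name `edges`, its columns, and the helper `save_edges` are assumptions made for this example; the notebooks may use a different schema.

```python
import sqlite3


def save_edges(conn, edges):
    """Store {(source, target): weight} pairs in a hypothetical 'edges' table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS edges ("
        "source TEXT NOT NULL, target TEXT NOT NULL, weight REAL NOT NULL, "
        "PRIMARY KEY (source, target))"
    )
    # INSERT OR REPLACE lets the notebook be re-run without duplicate rows.
    conn.executemany(
        "INSERT OR REPLACE INTO edges (source, target, weight) VALUES (?, ?, ?)",
        [(a, b, w) for (a, b), w in edges.items()],
    )
    conn.commit()


# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
save_edges(conn, {("Bach", "Buxtehude"): 2.0, ("Bach", "Pachelbel"): 1.0})
rows = conn.execute(
    "SELECT source, target, weight FROM edges ORDER BY weight DESC"
).fetchall()
```

Replacing `":memory:"` with a file path would persist the table to disk.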

Single Page App without SQLite3

All the resulting data is imported to the web app in the form of JSON.
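A minimal sketch of such a JSON export is shown below, using only the standard `json` module. The layout (`nodes` with an `id`, `edges` with `source`, `target`, and `weight`) and the helper `to_graph_json` are placeholders chosen for this example, not necessarily the format that `bach_network.html` reads.

```python
import json


def to_graph_json(edges):
    """Serialize {(source, target): weight} into a hypothetical nodes/edges layout."""
    names = sorted({name for pair in edges for name in pair})
    graph = {
        "nodes": [{"id": name} for name in names],
        "edges": [
            {"source": a, "target": b, "weight": w}
            for (a, b), w in sorted(edges.items())
        ],
    }
    # ensure_ascii=False keeps names such as "Böhm" readable in the output.
    return json.dumps(graph, ensure_ascii=False)


payload = to_graph_json({("Bach", "Böhm"): 1.0, ("Bach", "Buxtehude"): 2.0})
```

The resulting string can be written to a file and loaded by the page, so the browser needs no database at run time.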

Single Executable App with SQLite3 (Work in Progress)

Client-server architecture in a single executable app, built with PyInstaller and Flask.

Generating the Network

The network was generated by extracting pairs of named entities from each paragraph.
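The pairing step can be sketched as a co-occurrence count. The example assumes that named-entity recognition has already been run (for instance, by keeping the spaCy entities in `doc.ents` whose `label_` is `"PERSON"`) and that each paragraph is reduced to a list of names. The function name `cooccurrence_edges` and the sample data are hypothetical, and the notebook may weight pairs differently.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(paragraph_entities):
    """Count how often each pair of names appears in the same paragraph."""
    edges = Counter()
    for names in paragraph_entities:
        # Deduplicate within a paragraph; sorting gives each pair one canonical order.
        unique = sorted(set(names))
        for a, b in combinations(unique, 2):
            edges[(a, b)] += 1
    return edges


# Hypothetical NER output: one list of person names per paragraph.
paragraphs = [
    ["Bach", "Buxtehude", "Bach"],
    ["Bach", "Pachelbel"],
    ["Buxtehude", "Bach"],
]
edges = cooccurrence_edges(paragraphs)
```

Each count can then serve as an edge weight when the graph is built.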

Grouping of Personal Names

The following is a famous passage also found in the book:

"The most renowned Clavier composers of that day were Froberger, Fischer, Johann Caspar Kerl, Pachelbel, Buxtehude, Bruhns, and Böhm. Johann Christoph possessed a book containing several pieces by these masters, and Bach begged earnestly for it, but without effect. Refusal increasing his determination, he laid his plans to get the book without his brother's knowledge. It was kept on a book-shelf which had a latticed front. Bach's hands were small. Inserting them, he got hold of the book, rolled it up, and drew it out. As he was not allowed a candle, he could only copy it on moonlight nights, and it was six months before he finished his heavy task. As soon as it was completed he looked forward to using in secret a treasure won by so much labour. But his brother found the copy and took it from him without pity, nor did Bach recover it until his brother's death soon after."

For the passage above, the network needs to take the list of names into account, as follows:

Froberger ---------------+---------+--------- Johann Christoph Bach ------- Johann Sebastian Bach
                         |         |                                                 |
Fischer -----------------+         +-------------------------------------------------+
                         |
Johann Caspar Kerl ------+
                         |
Pachelbel ---------------+
                         |
Buxtehude ---------------+
                         |
Bruhns ------------------+
                         |
Böhm --------------------+

How can we perform relation extraction accurately? I devised the method below.

When names are listed, it helps to treat them as a group of nodes and to connect that group by edges to the names (nodes) mentioned a little further away:

  • Extract named entities within a paragraph.
  • If named entities are enumerated, treat them as a group, and make the edge weights between named entities inside the group weaker.
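One way to sketch the grouping idea is shown below. It is an illustration under stated assumptions, not the notebook's actual method: consecutive names count as enumerated when only a comma and/or "and" separates them, and the weights `inside=0.2` and `across=1.0` are arbitrary placeholders. The helper names are hypothetical, and the character offsets stand in for what an NER step would provide.

```python
import re
from itertools import combinations

# Text between two names that still counts as one list: a comma, "and", or both.
LIST_SEPARATOR = re.compile(r"^\s*(,|and|,\s*and)\s*$")


def group_enumerations(text, spans):
    """Split entity spans (name, start, end) into runs of enumerated names."""
    groups = []
    for span in spans:
        if groups:
            prev_end = groups[-1][-1][2]
            if LIST_SEPARATOR.match(text[prev_end:span[1]]):
                groups[-1].append(span)
                continue
        groups.append([span])
    return [[name for name, _, _ in group] for group in groups]


def weighted_edges(groups, inside=0.2, across=1.0):
    """Weak edges inside an enumerated group, normal edges between groups."""
    members = [(name, i) for i, group in enumerate(groups) for name in group]
    edges = {}
    for (a, group_a), (b, group_b) in combinations(members, 2):
        if a == b:
            continue
        key = tuple(sorted((a, b)))
        weight = inside if group_a == group_b else across
        edges[key] = edges.get(key, 0.0) + weight
    return edges


text = "Composers were Froberger, Fischer, and Böhm. Johann Christoph possessed a book."
names = ["Froberger", "Fischer", "Böhm", "Johann Christoph"]
# Offsets are computed here with str.index; an NER step would normally supply them.
spans = [(n, text.index(n), text.index(n) + len(n)) for n in names]

groups = group_enumerations(text, spans)
edges = weighted_edges(groups)
```

In this example the three listed composers form one group with weak edges among themselves, while each keeps a full-weight edge to Johann Christoph.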

Dependencies

Extra
