Neo4j Scraper Procedures

Preface

This is a pretotype implementation of an idea for the Global GraphHack 2019 competition. It is not "ready", it is not nice, but it works.

Description

When working with Neo4j, we sometimes need textual information from a web page, and sometimes we need the links of a website to build a graph from it. Following links and extracting information from the linked pages is a quite common use case, and now that NLP is becoming more mainstream, collecting textual content from the web is often required as well. Ideally, we would do this in the very same step in which we create or modify our graph with Cypher commands. This is why I created this tiny tool: it lets you do web scraping with Cypher commands, via stored procedures in Neo4j. The procedures use the jSoup Java library for the actual scraping.

Install

To use this plugin, drop the .jar file (you can download it here) into the plugins directory of your Neo4j installation.

Configuration

Add this line to neo4j.conf to enable the scraper procedures (Neo4j needs to be restarted for the change to take effect):

dbms.security.procedures.unrestricted=scraper.*
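
After the restart, a quick way to verify that the procedures are registered (assuming Neo4j 3.x or 4.x, where dbms.procedures() is available) is:

call dbms.procedures() yield name where name starts with 'scraper' return name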

Examples

Node from a random Wikipedia page

call scraper.select("https://en.wikipedia.org/wiki/Special:Random","body") yield element create (:Wikinode {url:element.url,text:element.text})

Reference URL list from a Wikipedia page

call scraper.select('https://en.wikipedia.org/wiki/Budapest','div.reflist cite a.external') yield element with element.attributes.`abs:href` as url return url

Get content from reference URL list

This way you can get the content of the URLs from the reference section of a Wikipedia page. Note: this can take a long time, depending on the number of URLs and on your internet connection, CPU, memory, etc.

call scraper.select('https://en.wikipedia.org/wiki/Budapest','div.reflist cite a.external') yield element with element.attributes.`abs:href` as url
call scraper.getPlainText(url) yield value
create (w:Page {url: url, text: value})
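
If you only want to try it out without fetching every reference, you can cap the number of URLs with a LIMIT (plain Cypher; the value 10 below is an arbitrary choice):

call scraper.select('https://en.wikipedia.org/wiki/Budapest','div.reflist cite a.external') yield element with element.attributes.`abs:href` as url limit 10
call scraper.getPlainText(url) yield value
create (w:Page {url: url, text: value})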

Trick to get eBay prices

Sometimes you want to get specific elements from an HTML page. You can use the selector syntax to get them.

call scraper.select('https://www.ebay.com/sch/i.html?_nkw=seiko+turtle&rt=nc&LH_BIN=1','.s-item__price') yield element return element.text
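
The prices come back as display text (for example '$123.45'). If you need numbers, a rough conversion with plain Cypher string functions can work for simple single-price USD listings (an assumption about the format; price ranges would need extra handling):

call scraper.select('https://www.ebay.com/sch/i.html?_nkw=seiko+turtle&rt=nc&LH_BIN=1','.s-item__price') yield element
return toFloat(replace(replace(element.text, '$', ''), ',', '')) as price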

More advanced usage: scraping eBay listings

You can easily scrape title, link, and price information from a result page.

call scraper.select('https://www.ebay.com/sch/i.html?_nkw=seiko+turtle&rt=nc&LH_BIN=1','.s-item__wrapper') yield element with element as row
call scraper.selectInHtml(row.html,'.s-item__link') yield element with element.attributes.href as url, row
call scraper.selectInHtml(row.html,'.s-item__title') yield element with element.text as title, url, row
call scraper.selectInHtml(row.html,'.s-item__price') yield element with element.text as price, title, url
return title, url, price
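
Instead of just returning the rows, you can persist them; a minimal sketch, assuming a hypothetical :Listing label:

call scraper.select('https://www.ebay.com/sch/i.html?_nkw=seiko+turtle&rt=nc&LH_BIN=1','.s-item__wrapper') yield element with element as row
call scraper.selectInHtml(row.html,'.s-item__link') yield element with element.attributes.href as url, row
call scraper.selectInHtml(row.html,'.s-item__title') yield element with element.text as title, url, row
call scraper.selectInHtml(row.html,'.s-item__price') yield element with element.text as price, title, url
create (:Listing {title: title, url: url, price: price})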

All the procedures

scraper.getDocument(url) YIELD value - Return the content of a URL.
scraper.select(url,selector) YIELD element - Find elements that match the Selector CSS query, with this element as the starting context.
scraper.selectInHtml(html,selector) YIELD element - Find elements that match the Selector CSS query, with this element as the starting context.
scraper.getLinks(url) YIELD element - Get link elements from a URL (see the sketch after this list).
scraper.getLinksInHtml(html) YIELD element - Get link elements from HTML.
scraper.getMediaLinks(url) YIELD element - Get media link elements from a URL.
scraper.getMediaLinksInHtml(html) YIELD element - Get media link elements from HTML.
scraper.getPlainText(url,selector) YIELD value - Get plain text version of a given page.
scraper.getPlainTextInHtml(html,selector) YIELD value - Get plain text version of the given HTML.
scraper.getElementById(url,id) YIELD element - Find an element by ID, including or under this element.
scraper.getElementByIdInHtml(html,id) YIELD element - Find an element by ID, including or under this element.
scraper.getElementsByTag(url,tag) YIELD element - Finds elements, including and recursively under this element, with the specified tag name.          
scraper.getElementsByTagInHtml(html,tag) YIELD element - Finds elements, including and recursively under this element, with the specified tag name.
scraper.getElementsByClass(url,className) YIELD element - Find elements that have this class, including or under this element.
scraper.getElementsByClassInHtml(html,className) YIELD element - Find elements that have this class, including or under this element.
scraper.getElementsByAttribute(url,key) YIELD element - Find elements that have a named attribute set.
scraper.getElementsByAttributeInHtml(html,attribute) YIELD element - Find elements that have a named attribute set.
scraper.getElementsByAttributeStarting(url,keyPrefix) YIELD element - Find elements that have an attribute name starting with the supplied prefix. Use data- to find elements that have HTML5 datasets.
scraper.getElementsByAttributeStartingInHtml(html,keyPrefix) YIELD element - Find elements that have an attribute name starting with the supplied prefix. Use data- to find elements that have HTML5 datasets.
scraper.getElementsByAttributeValue(url,key,value) YIELD element - Find elements that have an attribute with the specific value.
scraper.getElementsByAttributeValueInHtml(html,key,value) YIELD element - Find elements that have an attribute with the specific value.
scraper.getElementsByAttributeValueContaining(url,key,match) YIELD element - Find elements that have attributes whose value contains the match string.
scraper.getElementsByAttributeValueContainingInHtml(html,key,match) YIELD element - Find elements that have attributes whose value contains the match string.
scraper.getElementsByAttributeValueEnding(url,key,valueSuffix) YIELD element - Find elements that have attributes that end with the value suffix.
scraper.getElementsByAttributeValueEndingInHtml(html,key,valueSuffix) YIELD element - Find elements that have attributes that end with the value suffix.
scraper.getElementsByAttributeValueMatching(url,key,regex) YIELD element - Find elements that have attributes whose values match the supplied regular expression.
scraper.getElementsByAttributeValueMatchingInHtml(html,key,regex) YIELD element - Find elements that have attributes whose values match the supplied regular expression.
scraper.getElementsByAttributeValueNot(url,key,value) YIELD element - Find elements that either do not have this attribute, or have it with a different value.
scraper.getElementsByAttributeValueNotInHtml(html,key,value) YIELD element - Find elements that either do not have this attribute, or have it with a different value.
scraper.getElementsByAttributeValueStarting(url,key,valuePrefix) YIELD element - Find elements that have attributes that start with the value prefix.
scraper.getElementsByAttributeValueStartingInHtml(html,key,valuePrefix) YIELD element - Find elements that have attributes that start with the value prefix.
scraper.getElementsByIndexEquals(url,index) YIELD element - Find elements whose sibling index is equal to the supplied index.
scraper.getElementsByIndexEqualsInHtml(html,index) YIELD element - Find elements whose sibling index is equal to the supplied index.
scraper.getElementsByIndexGreaterThan(url,index) YIELD element - Find elements whose sibling index is greater than the supplied index.
scraper.getElementsByIndexGreaterThanInHtml(html,index) YIELD element - Find elements whose sibling index is greater than the supplied index.
scraper.getElementsByIndexLessThan(url,index) YIELD element - Find elements whose sibling index is less than the supplied index.
scraper.getElementsByIndexLessThanInHtml(html,index) YIELD element - Find elements whose sibling index is less than the supplied index.
scraper.getElementsContainingOwnText(url,searchText) YIELD element - Find elements that directly contain the specified string.
scraper.getElementsContainingOwnTextInHtml(html,searchText) YIELD element - Find elements that directly contain the specified string.
scraper.getElementsContainingText(url,searchText) YIELD element - Find elements that contain the specified string.
scraper.getElementsContainingTextInHtml(html,searchText) YIELD element - Find elements that contain the specified string.
scraper.getElementsMatchingOwnText(url,regex) YIELD element - Find elements whose text matches the supplied regular expression.
scraper.getElementsMatchingOwnTextInHtml(html,regex) YIELD element - Find elements whose text matches the supplied regular expression.
scraper.getElementsMatchingText(url,pattern) YIELD element - Find elements whose text matches the supplied regular expression.
scraper.getElementsMatchingTextInHtml(html,pattern) YIELD element - Find elements whose text matches the supplied regular expression.
scraper.getAllElements(url) YIELD element - Find all elements under this element (including self, and children of children).
scraper.getAllElementsInHtml(html) YIELD element - Find all elements under this element (including self, and children of children).
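
As one more example, scraper.getLinks can be combined with ordinary Cypher to build a small link graph. This is a minimal sketch; it assumes the yielded element map exposes text and an `abs:href` attribute the same way the select examples above do, and it uses a hypothetical :WebPage label:

call scraper.getLinks('https://en.wikipedia.org/wiki/Budapest') yield element
with element.attributes.`abs:href` as target, element.text as anchor
where target is not null and target <> ''
merge (p:WebPage {url: 'https://en.wikipedia.org/wiki/Budapest'})
merge (t:WebPage {url: target})
merge (p)-[r:LINKS_TO]->(t)
set r.anchor = anchor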

Useful links

Jsoup selector syntax

License

MIT License

