Recipe based scraping of website content.
Create recipes for any website and extract any content from the html source in a simple and straightforward matter. Use transformers to convert the extracted data in a way most suitable for your further processing.
Detailed description needs to be written.
This is your basic recipe json-file. It contains all the datapoints you want to extract.
/path/to/config.json
[
{
"name": "title",
"pointer": "{find:h1.title}.text",
"transformers": [
{
"name": "string_replace",
"find": "an Example",
"replace": "awesome"
}
]
}
]
This is an example usecase using a mockup html markug.
example.py
from bs4 import BeautifulSoup
# import create_schema and string transformers
from parsley.schema import create_schema_from_file
from parsley.transformers import string
# register string transformers
string.register()
# create schema from json file
schema = create_schema_from_file("/path/to/config.json")
# create html example mockup
example_source = (
"<html>"
"<head>"
"<title>Example</title>"
"</head>"
"<body>"
"<h1 class=\"title\">This is an Example</h1>"
"</body>"
"</html>"
)
soup = BeautifulSoup(example_source, "lxml")
# apply schema to soup
data = schema.apply(soup)
print(data)
# prints: 'This is awesome'
string_replace
Find value in string and replace with new value.
{
"name": "string_replace",
"find": "value_to_find",
"replace": "value_to_replace
}
string_split
Split string at delimiter and return list.
{
"name": "string_split",
"delimiter": ","
}
string_strip
Strip whitespace from input.
{
"name": "string_strip"
}
cast_float
Cast input to float.
{
"name": "cast_float"
}
cast_int
Cast input to int.
{
"name": "cast_int"
}
cast_string
Cast input to string.
{
"name": "cast_string"
}
math_multiply
Multiply input by value.
{
"name": "math_multiply",
"value": 2
}