fboecker / airflow_demo_dags

airflow_demo_dags

DAGS

Website-Crawler -> crawler_dag.py

A web crawler that visits HTML pages within the same domain as a given URL. The crawler outputs a file (CSV or XML) that lists, for each page, its assets (e.g. CSS, images, JavaScript files) and the links between pages.
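The crawling behaviour described above could be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the code in crawler_dag.py; the names `PageParser`, `parse_page`, and `crawl`, the `fetch` callable, and the `max_pages` limit are all assumptions made for the example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class PageParser(HTMLParser):
    """Collects asset URLs and same-domain links from one HTML page."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.domain = urlparse(page_url).netloc
        self.assets = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            # Resolve relative links and drop "#fragment" parts.
            target, _ = urldefrag(urljoin(self.page_url, attrs["href"]))
            # Only follow links that stay on the starting domain.
            if urlparse(target).netloc == self.domain:
                self.links.append(target)
        elif tag in ("img", "script") and attrs.get("src"):
            self.assets.append(urljoin(self.page_url, attrs["src"]))
        elif tag == "link" and attrs.get("href"):
            # Treat stylesheet links as CSS assets.
            if "stylesheet" in (attrs.get("rel") or "").lower():
                self.assets.append(urljoin(self.page_url, attrs["href"]))


def parse_page(page_url, html):
    """Return the assets and same-domain links found in one page."""
    parser = PageParser(page_url)
    parser.feed(html)
    return {"url": page_url, "assets": parser.assets, "links": parser.links}


def crawl(start_url, fetch, max_pages=50):
    """Breadth-first crawl; `fetch` is any callable mapping a URL to HTML text."""
    seen = {start_url}
    queue = deque([start_url])
    pages = []
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        page = parse_page(url, fetch(url))
        pages.append(page)
        for link in page["links"]:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP client; each returned entry holds one page's assets and outgoing links, which could then be written out as CSV or XML rows.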

To access the file, check that an Airflow -> S3 connection ("aws_conn") exists with the following format:

Name: aws_conn

Type: S3

Extra: {"aws_access_key_id":"your_aws_access_key_id", "aws_secret_access_key": "your_aws_secret_access_key"}
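Before saving the connection, it can help to confirm that the Extra field is valid JSON and carries both keys. The helper below is a hypothetical sketch for that check (the name `missing_extra_keys` is an assumption); it uses only the standard library and does not call Airflow itself.

```python
import json

# The two keys shown in the Extra field of the "aws_conn" connection.
REQUIRED_KEYS = ("aws_access_key_id", "aws_secret_access_key")


def missing_extra_keys(extra_json):
    """Return the required keys that are absent or empty in the Extra JSON."""
    extra = json.loads(extra_json)  # raises ValueError if the text is not valid JSON
    return [key for key in REQUIRED_KEYS if not extra.get(key)]
```

An empty list means the Extra value has the expected shape; any returned key names are the ones still to be filled in.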

ToDo: The bucket name is not unique: airflow-crawler-test-bucket
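One possible way to address this ToDo, offered only as a sketch, is to append a short random suffix to the base name. The function `unique_bucket_name` and the 8-character suffix length are assumptions, not part of the repository; the truncation keeps the result within the 63-character limit for S3 bucket names.

```python
import uuid

BASE_NAME = "airflow-crawler-test-bucket"


def unique_bucket_name(base=BASE_NAME):
    """Append a random 8-character hex suffix so the name is unlikely to collide."""
    suffix = uuid.uuid4().hex[:8]
    # 54 characters of base + "-" + 8 of suffix stays within the 63-character limit.
    return f"{base[:54]}-{suffix}"
```

Because the name changes on every call, the generated value would need to be stored (for example in configuration) so that later runs read from the same bucket.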

Languages

Language:Python 100.0%