SirJson / crawl2mark

A bunch of scrambled scripts that output some markdown in the end

crawl2mark

NOTE: This project has been replaced by MagicMark, which more closely resembles a usable application. This repository never grew out of the "just a bunch of undocumented scripts" phase.


This is a hacked-together solution for fetching a webpage and creating a readable Markdown file from it.

Basically, what these scripts do is:

- First, compile a Markdown-friendly CSS file from SCSS (shoutout to sakura.css).

  NEW: The CSS is already included, so no gulp dependency is required.

- Navigate to the desired target with headless Chrome and Puppeteer.

- Inject Readability.js into the page and execute some more JavaScript. This transforms the DOM into something like what Firefox produces in "reader mode".

- Wait for a console event that contains a tagged JSON object with the results of Readability.js.

- Use a Mustache template and the received content to render an HTML file into a temporary folder.

- Run the tool "prettier" after all HTML is written.

- Finally, call pandoc with all desired extensions.
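The "wait for a tagged console event" step can be sketched roughly like this. It is an illustration, not the code from this repository: the tag string `__READABILITY_RESULT__` and the names `parseTaggedMessage` and `waitForTaggedResult` are made up for the example. It assumes the injected script logs the tag immediately followed by the JSON, and that `page` behaves like a Puppeteer page (`on`/`off` for the `console` event, messages exposing `text()`).

```javascript
// Hypothetical tag; the real scripts may mark their result differently.
const RESULT_TAG = "__READABILITY_RESULT__";

// Return the parsed payload if `text` is a tagged result message, else null.
function parseTaggedMessage(text, tag = RESULT_TAG) {
  if (typeof text !== "string" || !text.startsWith(tag)) {
    return null; // ordinary console output from the page
  }
  try {
    return JSON.parse(text.slice(tag.length));
  } catch {
    return null; // tagged but malformed: treat it as "no result yet"
  }
}

// Resolve with the first tagged payload that shows up on the page console.
function waitForTaggedResult(page, tag = RESULT_TAG) {
  return new Promise((resolve) => {
    const onConsole = (msg) => {
      const payload = parseTaggedMessage(msg.text(), tag);
      if (payload !== null) {
        page.off("console", onConsole); // stop listening once we have it
        resolve(payload);
      }
    };
    page.on("console", onConsole);
  });
}

// parseTaggedMessage('__READABILITY_RESULT__{"title":"Hi"}') returns { title: "Hi" }
// parseTaggedMessage("some unrelated log line") returns null
```

Tagging the message is what lets the crawler ignore whatever else the target page writes to its console. A real version would probably also want a timeout so a page that never reports back does not hang the run.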

Not an elegant solution, but it works.

The biggest downside is that I don't know whether this random collection of scripts could even be packaged into something that just runs without installing a bunch of developer dependencies.

If you know a way to do something like this, let me know.

Usage

You don't want to use this, but if you really, really want to: just install yarn, Node.js, pandoc, TypeScript and gulp-cli.

After installing all dependencies, run this shell script to start converting a page:

./crawl2mark.sh https://example.com

and enjoy your GitHub Flavored Markdown :)
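For the curious, the final HTML-to-Markdown step probably boils down to something like the following. This is a sketch under assumptions: `buildPandocArgs` is a hypothetical helper, the file names are placeholders, and which extensions the actual script passes to pandoc is not documented here. `gfm` is pandoc's output format for GitHub Flavored Markdown.

```javascript
// Build the argument list for a pandoc run that converts HTML to
// GitHub Flavored Markdown. `extensions` are pandoc extension toggles
// such as "+name" or "-name", appended directly to the output format.
function buildPandocArgs(inputHtml, outputMd, extensions = []) {
  const to = "gfm" + extensions.join("");
  return ["--from", "html", "--to", to, "--output", outputMd, inputHtml];
}

// buildPandocArgs("page.html", "page.md") returns
// ["--from", "html", "--to", "gfm", "--output", "page.md", "page.html"]
//
// The array can then be handed to Node's child_process.execFile:
//   execFile("pandoc", buildPandocArgs("page.html", "page.md"), callback);
```

Passing the arguments as an array instead of one shell string avoids quoting problems when a path contains spaces.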

License: MIT License


Languages

JavaScript 85.7%, TypeScript 13.1%, PowerShell 1.1%, Shell 0.1%