TeamHG-Memex

TeamHG-Memex's repositories

eli5

A library for debugging/inspecting machine learning classifiers and explaining their predictions

Language: Jupyter Notebook | License: MIT | Stargazers: 2720 | Issues: 67 | Issues: 257

scrapy-rotating-proxies

Use multiple proxies with Scrapy

Language: Python | License: MIT | Stargazers: 713 | Issues: 21 | Issues: 52

tensorboard_logger

Log TensorBoard events without touching TensorFlow

Language: Python | License: MIT | Stargazers: 632 | Issues: 30 | Issues: 24

sklearn-crfsuite

A scikit-learn-inspired API for CRFsuite

aquarium

Splash + HAProxy + Docker Compose

Language: Python | License: MIT | Stargazers: 194 | Issues: 16 | Issues: 30

deep-deep

Adaptive crawler which uses Reinforcement Learning methods

Language: Jupyter Notebook | Stargazers: 168 | Issues: 25 | Issues: 5

arachnado

Web Crawling UI and HTTP API, based on Scrapy and Tornado

html-text

Extract text from HTML

Language: HTML | License: MIT | Stargazers: 122 | Issues: 15 | Issues: 15

autologin

A project that attempts to log in to a website automatically, given a single seed

Language: Python | License: Apache-2.0 | Stargazers: 119 | Issues: 15 | Issues: 18

Formasaurus

Formasaurus tells you the type of an HTML form and its fields using machine learning

autopager

Detect and classify pagination links

page-compare

Simple heuristic for measuring web page similarity (& data set)

undercrawler

A generic crawler

scrapy-crawl-once

Scrapy middleware that crawls only new content

Language: Python | License: MIT | Stargazers: 77 | Issues: 8 | Issues: 4

soft404

A classifier for detecting soft 404 pages

Language: Jupyter Notebook | Stargazers: 55 | Issues: 12 | Issues: 10

agnostic

Agnostic Database Migrations

Language: Python | License: MIT | Stargazers: 52 | Issues: 10 | Issues: 18

autologin-middleware

Scrapy middleware for autologin

json-lines

Read JSON lines (jl) files, including gzipped and broken

Language: Python | License: MIT | Stargazers: 34 | Issues: 10 | Issues: 5

scrapy-kafka-export

Scrapy extension which writes crawled items to Kafka

Language: Python | License: MIT | Stargazers: 29 | Issues: 13 | Issues: 2

MaybeDont

A component that tries to avoid downloading duplicate content

Language: Python | License: MIT | Stargazers: 27 | Issues: 6 | Issues: 2

sitehound-frontend

Site Hound (previously THH) is a Domain Discovery Tool

Language: HTML | License: Apache-2.0 | Stargazers: 23 | Issues: 16 | Issues: 5

domain-discovery-crawler

Broad crawler for domain discovery

Language: Python | License: MIT | Stargazers: 19 | Issues: 6 | Issues: 3

url-summary

Show summary of a large number of URLs in a Jupyter Notebook

Language: Python | License: MIT | Stargazers: 17 | Issues: 14 | Issues: 0

sitehound

This is the facade for installation and access to the individual components

Language: Shell | License: Apache-2.0 | Stargazers: 16 | Issues: 5 | Issues: 0

docker-tor-rotator

A rotating socks proxy using Tor, Delegate and Haproxy

hh-page-classifier

Headless Horseman Page Classifier service

Language: Python | License: MIT | Stargazers: 7 | Issues: 10 | Issues: 0

scrapy-cdr

Item definition and utilities for storing items in CDR format for Scrapy

Language: Python | License: MIT | Stargazers: 7 | Issues: 7 | Issues: 4

scrash-lua-examples

A collection of example Lua scripts and JS utilities

Language: JavaScript | Stargazers: 7 | Issues: 9 | Issues: 1

sitehound-backend

Sitehound's backend

Language: HTML | License: Apache-2.0 | Stargazers: 6 | Issues: 12 | Issues: 0

sshadduser

A simple tool to add a new user with OpenSSH keys.

Language: Python | License: MIT | Stargazers: 2 | Issues: 4 | Issues: 1