maivanteo / CrawlerMaster

Crawler management master (just kidding)


CrawlerMaster

Your crawler commander-in-chief


TODOs

  • index => list all crawlers

    • endpoint: /crawlers
    • show last_run_at
    • show running workers in queue (Sidekiq::Queue find class name)
    • [*] show how many courses each crawler has done
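For the running-workers item, Sidekiq's API exposes each queue as an enumerable of jobs. A minimal sketch, assuming `sidekiq/api` is loaded; `queued_counts` is a hypothetical helper name, not code from this repo:

```ruby
# Assumes the sidekiq gem is loaded: require "sidekiq/api"
# Hypothetical helper: count queued jobs per worker class by scanning
# one Sidekiq queue (Sidekiq::Queue is Enumerable over its jobs).
def queued_counts(queue_name = "default")
  Sidekiq::Queue.new(queue_name).each_with_object(Hash.new(0)) do |job, counts|
    counts[job.klass] += 1
  end
end
```

An index page could then render each crawler class next to its queued-job count.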
  • show => display a single crawler's info: name / crawling status

    • endpoint: /crawlers/ntust, i.e. /crawlers/{school name}
    • track each worker job progress and status
    • Start crawler anytime => track job ids => maybe save it to another model?
    • ScheduledSet / RetrySet / DeadSet status (filtered by class name)
    • Limit queued crawlers (e.g. at most 5 instances per class)
    • Manage/Track Rufus Scheduled Job
    • Unschedule EveryJob / CronJob (EveryJob first)
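Filtering the retry, scheduled, and dead sets by class name can use the same Sidekiq API. A sketch, again assuming `sidekiq/api` is loaded; the helper name is mine:

```ruby
# Assumes the sidekiq gem is loaded: require "sidekiq/api"
# Hypothetical helper: pull one worker class's entries out of the retry set.
# The same select pattern works for Sidekiq::ScheduledSet and Sidekiq::DeadSet.
def retries_for(class_name)
  Sidekiq::RetrySet.new.select { |job| job.klass == class_name }
end
```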
  • setting => configure a single crawler's API secrets / retry interval / scheduling

    • endpoint: /crawlers/{school name}/setting (edit page)
    • understanding sidekiq scheduler usage and parameters
    • Schedule crawler (whenever, etc.)
  I eventually went with rufus-scheduler
  • set up the existing scheduling behavior in an initializer
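Since the note above says rufus-scheduler won out, here is a minimal scheduling sketch. The helper name and the 6h interval are illustrative assumptions; `CourseCrawler::Worker` mirrors the Sidekiq push snippet later in this list:

```ruby
# Sketch assuming rufus-scheduler is loaded (require "rufus-scheduler")
# and `scheduler` is a Rufus::Scheduler instance.
def schedule_crawler(scheduler, interval: "6h")
  # `every` returns an EveryJob handle; keep it around if you want to
  # call `job.unschedule` later (per the unschedule TODO above).
  scheduler.every(interval) do
    CourseCrawler::Worker.perform_async("NtustCourseCrawler")
  end
end
```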
  • Course Model

    • [*] Copy and Paste from Colorgy/Book :p
    • Check data integrity (no blank class name / no blank class period data / no invalid period data ...)
    • Check course_code
    • [*] Sync data to Core
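The integrity checks could start as a plain predicate. The field names and the period range below are assumptions about the course data shape, not the actual model:

```ruby
# Hypothetical validity check for a course hash; :name/:periods keys
# and the 1..14 period range are assumptions for illustration.
VALID_PERIODS = (1..14)

def valid_course?(course)
  return false if course[:name].to_s.strip.empty?   # no blank class name
  periods = course[:periods]
  return false if periods.nil? || periods.empty?    # no blank period data
  periods.all? { |p| VALID_PERIODS.cover?(p) }      # no invalid period data
end
```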
  • Later fine-tuning

    • Redis Namespace
    • queue namespace (Sidekiq::Client pushes to a specific queue name)
    • Limiting retry count
    • limit queue number
    • we can't kill workers orz
    • sidekiq-limit_fetch set limit
    • Check Sidekiq processes

        Sidekiq::Client.push('queue' => 'NtustCourseCrawler', 'class' => CourseCrawler::Worker, 'args' => ['NtustCourseCrawler'])
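For the limit_fetch item above, capping a queue is essentially one line. A sketch assuming the sidekiq-limit_fetch gem (which extends Sidekiq::Queue with `[]` lookup and a `limit=` writer); the helper name is mine and the cap of 5 echoes the "5 instances" TODO:

```ruby
# Assumes the sidekiq-limit_fetch gem is loaded; it adds
# Sidekiq::Queue[] lookup and a per-queue limit= writer.
# Hypothetical helper: cap how many workers a queue may run at once.
def cap_queue(queue_name, max_workers)
  Sidekiq::Queue[queue_name].limit = max_workers
end
```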
  • Throw AdminLTE on it when there's spare time lol



Languages

Ruby 78.2% · HTML 19.8% · CSS 1.0% · JavaScript 0.6% · CoffeeScript 0.4%