handanchen's repositories
awesome-wechat-weapp
A curated collection of WeChat Mini Program development resources :100:
canal
Alibaba's component for incremental subscription to and consumption of MySQL binlogs
CDHExample
Example code for HDFS, MapReduce, Hive, HBase, Kafka, Solr, Spark, ZooKeeper, and Mahout in a CDH cluster environment
CDS
Content Data Store (HDFS/HBase)
ChatBotCourse
A hands-on tutorial on building your own chatbot
clouderasizer
Multipurpose tool for discovering and collecting Cloudera Manager metrics.
django-dynamic-scraper
Creating Scrapy scrapers via the Django admin interface
dw_etl
Data warehouse ETL tooling: incremental and full extraction from MySQL to Hive, merging of Hive tables, and other data-platform cleansing utilities
FinancialNewsSearchEngine
A very simple search engine specialised in financial news (built with Nutch, HBase, Solr, Spring Boot, Bootstrap, and AngularJS)
hbase-increment-index
Secondary indexing for HBase, implemented with HBase + Solr
hbase-indexer
Lily HBase Indexer - indexing HBase, one row at a time
hive-third-functions
Some useful custom Hive UDFs, especially array and JSON functions
kafka-example-in-scala
A Kafka producer and consumer example in Scala and Java
kafka-offset-manager
Move Consumer offsets as you please
kafkaLowLevelConsumer
An example of the Kafka low-level consumer API
KafkaProducerTool
A wrapper around a custom Kafka producer
maxwell
Maxwell's daemon, a MySQL-to-JSON Kafka producer
papers-we-love
Papers from the computer science community to read and discuss.
puppet-cdh
Puppet module for Hadoop and the rest of Cloudera's CDH 5.
reair
ReAir is a collection of easy-to-use tools for replicating tables and partitions between Hive data warehouses.
scrapyd
A service daemon to run Scrapy spiders
show-me-the-code
A Python exercise book: one small program a day
streamingpro
Build Spark Streaming applications with SQL
ThinkBayes
Code repository for Think Bayes.
wechat_sogou_crawl
A crawler for WeChat official-account articles, based on Sogou WeChat search
wechat_spider
A WeChat crawler based on the Sogou WeChat entry point, implemented in Python on top of PhantomJS and using paid dynamic proxies. Collects article text, read counts, like counts, comments, and comment like counts. Throughput: 500 official accounts per hour. Accounts are partitioned across multiple threads for parallel crawling.
weixin
Scraping Sogou WeChat articles with Scrapy
yugong
Alibaba's Oracle-migration data sync tool (full + incremental; supported targets: MySQL/DRDS)