jeremy-feng / investment_data

Scripts and doc for https://www.dolthub.com/repositories/chenditc/investment_data

Chinese blog about this project: 量化系列2 - 众包数据集 (Quant Series 2: Crowdsourced Dataset)

How to use it

  1. Download the tarball from the latest release page on GitHub
  2. Extract the tar file to the default qlib directory
wget https://github.com/chenditc/investment_data/releases/download/2023-04-20/qlib_bin.tar.gz
tar -zxvf qlib_bin.tar.gz -C ~/.qlib/qlib_data/cn_data --strip-components=2
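
For environments without GNU tar, the extraction step above can be approximated in Python. This is a minimal sketch, not part of the project's scripts: the function name extract_stripped is hypothetical, and it only approximates tar's --strip-components behaviour by dropping the first path components of each regular file.

```python
import shutil
import tarfile
from pathlib import Path, PurePosixPath


def extract_stripped(tar_path, dest_dir, strip=2):
    """Extract regular files from a .tar.gz, dropping the first `strip` path components.

    Hypothetical helper that approximates `tar -zxvf ... -C dest --strip-components=N`.
    """
    dest = Path(dest_dir).expanduser().resolve()
    dest.mkdir(parents=True, exist_ok=True)
    written = []
    with tarfile.open(tar_path, "r:gz") as tar:
        for member in tar:
            # PurePosixPath drops "." components, so "./a/b" and "a/b" are treated alike.
            parts = PurePosixPath(member.name).parts[strip:]
            if not parts or not member.isfile():
                continue  # skip directories, links, and entries that are too shallow
            target = dest.joinpath(*parts).resolve()
            if dest not in target.parents:
                raise ValueError(f"refusing to write outside {dest}: {member.name}")
            target.parent.mkdir(parents=True, exist_ok=True)
            with tar.extractfile(member) as src, open(target, "wb") as out:
                shutil.copyfileobj(src, out)
            written.append(target)
    return written
```

Usage would look like extract_stripped("qlib_bin.tar.gz", "~/.qlib/qlib_data/cn_data", strip=2).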

Development Setup

If you want to contribute to the set of scripts or the data, here is what you should do to set up a dev environment.

Install dolt

Follow the installation instructions at https://github.com/dolthub/dolt

Clone data

The raw data is hosted on DoltHub: https://www.dolthub.com/repositories/chenditc/investment_data

To download it as a Dolt database:

dolt clone chenditc/investment_data

Export to qlib format

docker run -v /<some output directory>:/output -it --rm chenditc/investment_data bash -c "bash dump_qlib_bin.sh && cp ./qlib_bin.tar.gz /output/"

Run Daily Update

You need a tushare token to use the tushare API. Get one from https://tushare.pro/

export TUSHARE=<Token>
bash daily_update.sh

Daily update and output

docker run -v /<some output directory>:/output -it --rm chenditc/investment_data bash -c "bash daily_update.sh && bash dump_qlib_bin.sh && cp ./qlib_bin.tar.gz /output/"

Extract the tar file to the qlib directory

tar -zxvf qlib_bin.tar.gz -C ~/.qlib/qlib_data/cn_data --strip-components=2

Initiative

  1. Try to fill in missing data by combining data from multiple data sources, for example data for delisted companies.
  2. Try to correct data by cross-validating against multiple data sources.

Project Detail

Data Source

Each database table on DoltHub is named with a prefix that identifies its data source, for example ts_a_stock_eod_price. The meaning of each prefix:

Initial import

  • w(wind): Use one_time_db_scripts to import the w_a_stock_eod_price table, which is used as the initial price standard
  • c(caihui): SQL import to c_a_stock_eod_price table
  • ts(tushare):
    1. Use tushare/update_stock_list.sh to load stock list
    2. Use tushare/update_stock_price.sh to load stock price
  • yahoo
    1. Use yahoo collector to load stock price

Daily Update

Currently the daily update uses only the tushare data source and is triggered by a GitHub Action.

  1. I maintain an offline job which runs daily_update.sh every 30 minutes to collect data and push it to DoltHub.
  2. A GitHub Action, .github/workflows/upload_release.yml, is triggered daily. It calls bash dump_qlib_bin.sh to generate the daily tar file and uploads it to the release page.

Merge logic

  1. Use the w data source as the baseline, and use the other data sources to validate against it.
  2. Since the w data's adjclose differs from the ts data's adjclose, we use a "link date" to calculate a ratio that maps ts adjclose to w adjclose. The link date can be the maximum of the first valid dates across the data sources. We don't use a fixed link date because some stocks might not be trading on a specific date, and listing and delisting dates differ from stock to stock. We store the link date and adj_ratio in link_table. adj_ratio = link_adj_close / w_adj_close
  3. Append the ts data to the final dataset; its adjclose is ts_adj_close / ts_adj_ratio
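
The merge steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's actual implementation: it reads link_adj_close as the ts adjclose on the link date, picks the earliest date present in both sources as the link date, and uses plain dicts keyed by ISO date strings. The function names are hypothetical.

```python
def first_common_date(w_adj, ts_adj):
    """Earliest date on which both sources have an adjclose (assumed link-date rule)."""
    common = sorted(set(w_adj) & set(ts_adj))
    if not common:
        raise ValueError("the two sources share no trading date")
    return common[0]


def merge_adjclose(w_adj, ts_adj):
    """Map ts adjclose onto the w scale and append it after the last w date.

    w_adj, ts_adj: dicts of ISO date string -> adjclose for a single stock.
    Returns (link_date, adj_ratio, merged_series).
    """
    link = first_common_date(w_adj, ts_adj)
    # Assumed reading of the formula: ts adjclose on the link date over w adjclose.
    adj_ratio = ts_adj[link] / w_adj[link]
    merged = dict(w_adj)  # w is the baseline and is kept unchanged
    last_w_date = max(w_adj)  # ISO dates sort correctly as strings
    for date, value in ts_adj.items():
        if date > last_w_date:
            merged[date] = value / adj_ratio
    return link, adj_ratio, merged


w = {"2020-01-02": 10.0, "2020-01-03": 11.0}
ts = {"2020-01-03": 22.0, "2020-01-06": 24.0}
link, ratio, merged = merge_adjclose(w, ts)
print(link, ratio, merged["2020-01-06"])  # 2020-01-03 2.0 12.0
```

Storing the link date and ratio per stock, as link_table does, means the ratio does not need to be recomputed when new ts rows arrive.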

Validation logic

  1. Generate the final data by concatenating the w data and the ts data.
  2. Run validation by pairing two data sources:
    • Compare the absolute values of high, low, open, close, and volume.
    • Calculate the adjclose conversion ratio using a link date for each stock.
    • Calculate the w data's adjclose using the link date's ratio, and compare it with the final data.
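
The field-by-field comparison can be pictured with a short Python sketch. The relative tolerance, the row layout (dicts keyed by date), and the name compare_sources are assumptions made for illustration; the project's own validation scripts may differ.

```python
import math

PRICE_FIELDS = ("high", "low", "open", "close", "volume")


def compare_sources(a, b, fields=PRICE_FIELDS, rel_tol=1e-3):
    """Return (date, field, value_a, value_b) for every field that disagrees.

    a, b: dicts of ISO date string -> dict of field name -> number.
    Only dates present in both sources are compared.
    """
    mismatches = []
    for date in sorted(set(a) & set(b)):
        for field in fields:
            if not math.isclose(a[date][field], b[date][field], rel_tol=rel_tol):
                mismatches.append((date, field, a[date][field], b[date][field]))
    return mismatches


row = {"high": 10.5, "low": 9.8, "open": 10.0, "close": 10.2, "volume": 1000.0}
w_rows = {"2020-01-02": row}
ts_rows = {"2020-01-02": {**row, "close": 10.9}}
print(compare_sources(w_rows, ts_rows))  # [('2020-01-02', 'close', 10.2, 10.9)]
```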

Contribution Guide

Add more stock index

To add a new stock index, we need to change:

  1. Add an index weight download script. Change the tushare/dump_index_eod_price.py script to dump the index info. If the index is not available in tushare, write a new script and add it to the daily_update.sh script. Example commit
  2. Add a price download script. Change tushare/dump_index_eod_price.py to add the index price. Example commit
  3. Modify the export script. Change the qlib dump script qlib/dump_index_weight.py#L13 so that the index is dumped and renamed to a txt file for use. Example commit
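
To give a feel for what the exported txt file might contain, here is a hedged Python sketch. It assumes the file follows qlib's instrument-list layout (one tab-separated line per membership interval: symbol, start date, end date) and that index membership is available as dated snapshots. Both the input shape and the function names are hypothetical and are not taken from qlib/dump_index_weight.py.

```python
def membership_intervals(snapshots):
    """Turn dated constituent snapshots into (symbol, start, end) intervals.

    snapshots: dict of ISO date string -> set of symbols in the index on that date.
    A symbol missing from a snapshot closes its interval at the last date it was seen.
    """
    open_since = {}  # symbol -> start date of its currently open interval
    last_seen = {}   # symbol -> latest snapshot date that contained it
    intervals = []
    for date in sorted(snapshots):
        members = snapshots[date]
        for symbol in sorted(set(open_since) - members):
            intervals.append((symbol, open_since.pop(symbol), last_seen[symbol]))
        for symbol in members:
            open_since.setdefault(symbol, date)
            last_seen[symbol] = date
    for symbol in sorted(open_since):
        intervals.append((symbol, open_since[symbol], last_seen[symbol]))
    return sorted(intervals)


def to_instrument_lines(intervals):
    """Format intervals as tab-separated lines (assumed qlib instrument layout)."""
    return [f"{symbol}\t{start}\t{end}" for symbol, start, end in intervals]


snapshots = {
    "2020-01-31": {"SH600000", "SZ000001"},
    "2020-02-28": {"SH600000"},
    "2020-03-31": {"SH600000", "SZ000001"},
}
for line in to_instrument_lines(membership_intervals(snapshots)):
    print(line)
```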
