How to download the full Kepler dataset and Missing Link to Documentation
exowanderer opened this issue · comments
Describe the bug
The website that originally contained the documentation, http://dan.iel.fm/kplr, is no longer valid.
To Reproduce
Click here: http://dan.iel.fm/kplr
Also: go to the main repo page and click the words the documentation! at the bottom.
Expected behavior
For the website to load the documentation and tutorials
Additional context
Link should open https://dfm.io/kplr/ instead
thanks for this! I'll fix it, but I'd generally recommend using lightkurve instead for most things. Are there features here that you still want?
To answer your question: Not exactly.
I'm trying to download the long- and short-cadence data for all ~200k Kepler targets across all 17 quarters. Friends at MAST suggested that `kplr` was an efficient API for downloading all of the Kepler time series data (and headers). I don't want to click 200k x 17 download buttons, and I don't want the FFIs.
I have a mediocre internet connection, but I also have 5 TB of storage capacity, so I decided that downloading everything and operating locally is probably better for me than fetching data on the fly.
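For scale, those figures imply a rough per-file budget. A quick back-of-the-envelope check (all numbers taken from this thread; the file count is an upper bound, since not every target has data in every quarter and short-cadence coverage is much sparser):

```python
# Rough storage budget for the full Kepler light-curve download,
# using the figures quoted in this thread.
n_targets = 200_000
n_quarters = 17
n_files = n_targets * n_quarters  # upper bound on long-cadence files

storage_tb = 5  # available disk capacity
storage_bytes = storage_tb * 1024**4

budget_per_file = storage_bytes / n_files
print(f'{n_files} files, ~{budget_per_file / 1024**2:.2f} MiB each on average')
```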
To my (very limited) knowledge, there is not a sequence of `wget` scripts to grab the full light curve info. I want to do injection tests, then train both GPs and DNNs on a subset to get a feel for the time scales involved in both analyses.
Sorry to derail this comment section into a "can you help" thread, but do you suggest using `kplr`, `lightkurve`, or a completely different technique to download all of this data?
ah - happy to help!
What do you mean by "headers" and "full light curve info" here? I would expect that all the information would be in the FITS files themselves. If I wanted to download all of the light curves, I would definitely use something like `wget` rather than `kplr` or `lightkurve`, because the latter two need to send multiple requests to MAST for every target/light curve instead of just one per light-curve file.
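To illustrate the one-request-per-file point, here is a minimal sketch that builds a direct archive URL for a single light-curve file, suitable for a plain `wget` or `requests` fetch. The directory layout (first four digits of the zero-padded KIC id as a parent directory) is assumed from the public archive listing linked later in this thread; the timestamp portion of the filename varies per target and quarter, so it must be known or scraped beforehand:

```python
# Sketch: construct a direct download URL for one Kepler light-curve
# FITS file, so that a single HTTP request fetches it. The directory
# layout below is an assumption based on the public archive listing.
BASE = 'https://archive.stsci.edu/pub/kepler/lightcurves'

def lightcurve_url(kic_id, filename):
    """Return the full archive URL for a given KIC id and FITS filename."""
    kic = f'{kic_id:09d}'  # zero-pad the KIC id to nine digits
    return f'{BASE}/{kic[:4]}/{kic}/{filename}'

# The tutorial's example file for KIC 9787239:
print(lightcurve_url(9787239, 'kplr009787239-2009166043257_llc.fits'))
```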
I'm sorry that I was confusing. I realize that now.
I am looking for the information contained in the FITS files that the `kplr` code caches, e.g. `kplr009787239-2009166043257_llc.fits`. [This filename is from the tutorial.]
I like the `kplr` API, which is why I'm happy to use it. But when downloading TB of data, speed is of the essence.
This next question feels less relevant to `kplr` than to my global objective: where do I find a list of `wget` scripts -- or at least websites -- for the Kepler light curves? The goal is to have the full 200k x 17 stacks of `_llc.fits` and `_slc.fits` files.
After googling around, I found this database on the archive: https://archive.stsci.edu/pub/kepler/lightcurves/. I think that this is what `kplr` uses as well.
I could use `wget -r -np https://archive.stsci.edu/pub/kepler/lightcurves/` to get all of the light curves in there, but that would download everything serially. I have 16 cores, and "know how to use them" (wild west joke).
I wrote an asynchronous download script that takes in a list of `wget` calls (one per file) and spreads them across the 16 cores. This reduces my download time by ~5x.
I can generate the full list of files too, but I'm hoping that the list already exists on the archive.
Take a look over here for the options: https://archive.stsci.edu/kepler/download_options.html
I'd probably recommend downloading the tar files, but there are lots of options here!
I hope that this is not too long for an issue thread, but I wanted to give back by providing my multiprocessing download script, which is the result of our conversation here.
This Python `multiprocessing` script searches the MAST Kepler archive for all `.tgz` files in the `QXX_public` subdirectories under the Kepler tar-file archive.
I had to do some manual web scraping, but it works at the 99% level. I will only know after completion whether the full download worked correctly, so there may be small bugs to be found.
Multiprocessing Download Script for Full Kepler Dataset in Quarterly Tar Files
```python
import json                                   # store urls for later
import os                                     # check and make directories
import requests                               # grab the HTML listings
from subprocess import call                   # run `wget` from Python
from multiprocessing import Pool, cpu_count   # run `call` in parallel

store_quarterly_tar_filenames = True  # store json of urls and filenames

# Assign the local storage directory, creating it if necessary
HOME = os.environ['HOME']
store_dir = f'{HOME}/.kepler_data/'
if not os.path.exists(store_dir):
    os.mkdir(store_dir)

# Map each quarterly archive url to its list of tar file names
tar_file_names = {}

# Archive base URL for the quarterly tar files
base_url = 'https://archive.stsci.edu/pub/kepler/lightcurves/tarfiles'

# Grab the base url listing and split it into lines
lines = requests.get(base_url).text.split('\n')

# Identify the quarterly `QXX_public` subdirectories in the listing
quarter_directories = []
for line in lines:
    if '_public' in line:
        quarter_directories.append(line.split('a href="')[1].split('/')[0])

# Loop over Kepler quarters to identify all available tar files
for quarter_dir in quarter_directories:
    # Allocate the url for this quarter's archive directory
    url_ = f'{base_url}/{quarter_dir}'

    # Create (as needed) the matching local storage directory
    local_dir = os.path.join(store_dir, quarter_dir)
    if not os.path.exists(local_dir):
        os.mkdir(local_dir)

    # Create a blank list entry for this quarter, if one does not exist
    if url_ not in tar_file_names:
        tar_file_names[url_] = []

    # Grab the archive HTML page (the list of files) and split into lines
    lines = requests.get(url_).text.split('\n')
    for line in lines:
        # Identify the tar files with links and store their file names
        if '.tgz' in line and 'a href' in line:
            filename = line.split('a href="')[1].split('"')[0]
            tar_file_names[url_].append(filename)

# If requested, store all of the urls and filenames as a json file
if store_quarterly_tar_filenames:
    with open('kepler_quarterly_tar_filenames.json', 'w') as outfile:
        json.dump(tar_file_names, outfile)

# Loop over urls + filenames to create the wget commands
call_commands = []
for url_quarter, filenames in tar_file_names.items():
    for filename in filenames:
        url_file = f'{url_quarter}/{filename}'  # full archive url/filename
        command = ['wget', '-c', '-O', filename, url_file]
        call_commands.append(command)

# Fan the `wget` commands out over `cpu_count` worker processes
with Pool(cpu_count()) as pool:
    pool.map(call, call_commands)
```