How to download the full Kepler dataset and Missing Link to Documentation
exowanderer opened this issue · comments
Describe the bug
The website that originally contained the documentation, http://dan.iel.fm/kplr, is no longer valid.
To Reproduce
Click here: http://dan.iel.fm/kplr
Also: go to the main repo page and click the words the documentation! at the bottom.
Expected behavior
For the website to load the documentation and tutorials
Additional context
Link should open https://dfm.io/kplr/ instead
thanks for this! I'll fix it, but I'd generally recommend using lightkurve instead for most things. Are there features here that you still want?
To answer your question: Not exactly.
I'm trying to download the long- and short-cadence data for all ~200k Kepler targets across all 17 quarters. Friends at MAST suggested that `kplr` was an efficient API for downloading all of the Kepler time series data (and headers). I don't want to click 200k x 17 download buttons, and I don't want the FFIs.
I have a mediocre internet connection, but I also have 5 TB of storage capacity, so I decided that downloading everything and operating locally is probably better for me than fetching data on the fly.
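For scale, those figures imply a rough per-file budget. A quick back-of-the-envelope check (all numbers taken from this thread; the file count is an upper bound, since not every target has data in every quarter and short-cadence coverage is much sparser):

```python
# Rough storage budget for the full Kepler light-curve download,
# using the figures quoted in this thread.
n_targets = 200_000
n_quarters = 17
n_files = n_targets * n_quarters  # upper bound on long-cadence files

storage_tb = 5  # available disk capacity
storage_bytes = storage_tb * 1024**4

budget_per_file = storage_bytes / n_files
print(f'{n_files} files, ~{budget_per_file / 1024**2:.2f} MiB each on average')
```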
To my (very limited) knowledge, there is not a sequence of `wget` scripts to grab the full light curve info. I want to do injection tests, then train both GPs and DNNs on a subset to get a feel for the time scales involved in both analyses.
Sorry to derail this comment section into a "can you help" thread, but do you suggest using `kplr`, `lightkurve`, or a completely different technique to download all of this data?
ah - happy to help!
What do you mean by "headers" and "full light curve info" here? I would expect that all the information would be in the FITS files themselves. If I wanted to download all of the light curves, I would definitely use something like `wget` rather than `kplr` or `lightkurve`, because the latter two need to send multiple requests to MAST for every target/light curve instead of just one per light-curve file.
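To illustrate the one-request-per-file point, here is a minimal sketch that builds a direct archive URL for a single light-curve file, suitable for a plain `wget` or `requests` fetch. The directory layout (first four digits of the zero-padded KIC id as a parent directory) is assumed from the public archive listing linked later in this thread; the timestamp portion of the filename varies per target and quarter, so it must be known or scraped beforehand:

```python
# Sketch: construct a direct download URL for one Kepler light-curve
# FITS file, so that a single HTTP request fetches it. The directory
# layout below is an assumption based on the public archive listing.
BASE = 'https://archive.stsci.edu/pub/kepler/lightcurves'

def lightcurve_url(kic_id, filename):
    """Return the full archive URL for a given KIC id and FITS filename."""
    kic = f'{kic_id:09d}'  # zero-pad the KIC id to nine digits
    return f'{BASE}/{kic[:4]}/{kic}/{filename}'

# The tutorial's example file for KIC 9787239:
print(lightcurve_url(9787239, 'kplr009787239-2009166043257_llc.fits'))
```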
I'm sorry that I was confusing. I realize that now.
I am looking for the information contained in the FITS files that the `kplr` code caches, e.g. `kplr009787239-2009166043257_llc.fits`. [This filename is from the tutorial.]
I like the `kplr` API, which is why I'm happy to use it. But when downloading TB of data, speed is of the essence.
This next question feels less relevant to `kplr` than to my global objective: where do I find a list of `wget` scripts -- or at least websites -- for the Kepler light curves? The goal is to have the full 200k x 17 stacks of `_llc.fits` and `_slc.fits` files.
After googling around, I found this database on the archive: https://archive.stsci.edu/pub/kepler/lightcurves/. I think that this is what `kplr` uses as well.
I could use `wget -r -np https://archive.stsci.edu/pub/kepler/lightcurves/` to get all of the light curves in there, but that would download everything serially. I have 16 cores, and "know how to use them" (wild west joke).
I wrote an asynchronous download script that takes in a list of `wget` calls (one per file) and spreads them across the 16 cores. This reduces my download time by ~5x.
I can generate the full list of files too, but I'm hoping that the list already exists on the archive.
Take a look over here for the options: https://archive.stsci.edu/kepler/download_options.html
I'd probably recommend downloading the tar files, but there are lots of options here!
I hope that this is not too long for an issue thread, but I wanted to give back by providing my multiprocessing download script, which is the result of our conversation here.
This Python `multiprocessing` script searches the MAST Kepler archive for all `.tgz` files in the `QXX_public` subdirectories under the Kepler tar-file archive.
I had to do some manual web scraping, but it works at the 99% level. I will only know after completion whether the full download worked correctly, so there may be small bugs to be found.
Multiprocessing Download Script for Full Kepler Dataset in Quarterly Tar Files
```python
import json                                   # store urls for later
import os                                     # check and make directories
import requests                               # grab the HTML listings
from subprocess import call                   # run `wget` from Python
from multiprocessing import Pool, cpu_count   # run `call` in parallel

store_quarterly_tar_filenames = True  # store json of urls and filenames

# Assign the local storage directory, creating it if necessary
HOME = os.environ['HOME']
store_dir = f'{HOME}/.kepler_data/'
if not os.path.exists(store_dir):
    os.mkdir(store_dir)

# Map each quarterly archive url to its list of tar file names
tar_file_names = {}

# Archive base URL for the quarterly tar files
base_url = 'https://archive.stsci.edu/pub/kepler/lightcurves/tarfiles'

# Grab the base url listing and split it into lines
lines = requests.get(base_url).text.split('\n')

# Identify the quarterly `QXX_public` subdirectories in the listing
quarter_directories = []
for line in lines:
    if '_public' in line:
        quarter_directories.append(line.split('a href="')[1].split('/')[0])

# Loop over Kepler quarters to identify all available tar files
for quarter_dir in quarter_directories:
    # Allocate the url for this quarter's archive directory
    url_ = f'{base_url}/{quarter_dir}'

    # Create (as needed) the matching local storage directory
    local_dir = os.path.join(store_dir, quarter_dir)
    if not os.path.exists(local_dir):
        os.mkdir(local_dir)

    # Create a blank list entry for this quarter, if one does not exist
    if url_ not in tar_file_names:
        tar_file_names[url_] = []

    # Grab the archive HTML page (the list of files) and split into lines
    lines = requests.get(url_).text.split('\n')
    for line in lines:
        # Identify the tar files with links and store their file names
        if '.tgz' in line and 'a href' in line:
            filename = line.split('a href="')[1].split('"')[0]
            tar_file_names[url_].append(filename)

# If requested, store all of the urls and filenames as a json file
if store_quarterly_tar_filenames:
    with open('kepler_quarterly_tar_filenames.json', 'w') as outfile:
        json.dump(tar_file_names, outfile)

# Loop over urls + filenames to create the wget commands
call_commands = []
for url_quarter, filenames in tar_file_names.items():
    for filename in filenames:
        url_file = f'{url_quarter}/{filename}'  # full archive url/filename
        command = ['wget', '-c', '-O', filename, url_file]
        call_commands.append(command)

# Fan the `wget` commands out over `cpu_count` worker processes
with Pool(cpu_count()) as pool:
    pool.map(call, call_commands)
```