Fetch metadata information from the following databases:
- GEO: Gene Expression Omnibus,
- SRA: Sequence Read Archive,
- EMBL-EBI: European Molecular Biology Laboratory’s European Bioinformatics Institute,
- DDBJ: DNA Data Bank of Japan,
- NIH Biosample: Biological source materials used in experimental assays,
- ENCODE: The Encyclopedia of DNA Elements.
ffq receives an accession and returns the metadata for that accession as well as the metadata for all downstream accessions following the connections between GEO, SRA, EMBL-EBI, DDBJ, and Biosample.
By default, ffq returns all downstream metadata down to the level of the SRR record. However, the desired level of resolution can be specified.
ffq can also skip returning the metadata, and instead return the raw data download links from any available host (FTP, AWS, GCP or NCBI) for GEO and SRA ids.
The latest release can be installed with
pip install ffq
The development version can be installed with
pip install git+https://github.com/pachterlab/ffq
ffq [accession]
where [accession] is either:
- an SRA/EBI/DDBJ accession
  - (SRR, SRX, SRS or SRP)
  - (ERR, ERX, ERS or ERP)
  - (DRR, DRS, DRX or DRP)
- a GEO accession (GSE or GSM)
- an ENCODE accession (ENCSR, ENCSB or ENCSD)
- a Bioproject accession (CXR)
- a Biosample accession (SAMN)
- a DOI
$ ffq SRR9990627
#=> Returns metadata for the SRR9990627 run.
$ ffq SRX7347523
#=> Returns metadata for the experiment SRX7347523 and for its associated SRR run.
$ ffq GSE129845
#=> Returns metadata for GSE129845 and for its 5 associated GSM, SRS, SRX and SRR ids.
$ ffq DRP004583
#=> Returns metadata for the study DRP004583 and its 104 associated DRS, DRX and DRR ids.
$ ffq ENCSR998WNE
#=> Returns metadata for the ENCODE experiment ENCSR998WNE.
ffq [accession 1] [accession 2] ...
where [accession 1] and [accession 2] are accessions belonging to any of the above usage example categories.
$ ffq SRR11181954 SRR11181955 SRR11181956
#=> Returns metadata for the three SRR runs.
$ ffq GSM4339769 GSM4339770 GSM4339771
#=> Returns metadata for the three GSM accessions, as well as for their corresponding downstream SRS, SRX and SRR accessions.
ffq -l [level] [accession]
where [level] is the number of accession levels to fetch, counting the queried accession itself as level 1.
$ ffq -l 1 GSM4339769
#=> Returns metadata only for GSM4339769, and not for any downstream accession.
$ ffq -l 3 GSE115469
#=> Returns metadata for GSE115469 and its downstream GSM and SRS accessions.
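One way to picture the effect of -l is as a cut through the chain of accession types. The sketch below is not how ffq is implemented; it only models the GEO-to-SRA chain implied by the examples in this document (GSE, GSM, SRS, SRX, SRR), and the name accession_types_returned is a hypothetical helper introduced for illustration.

```python
# Illustrative model only: the chain is inferred from the usage examples,
# and accession_types_returned is a hypothetical helper, not part of ffq.
GEO_CHAIN = ["GSE", "GSM", "SRS", "SRX", "SRR"]


def accession_types_returned(start_type, level=None):
    """Return the accession types covered when querying start_type.

    level counts the queried accession itself, so level=1 means
    "this accession only". level=None models the default behaviour
    of descending all the way to the SRR record.
    """
    start = GEO_CHAIN.index(start_type)
    chain = GEO_CHAIN[start:]
    return chain if level is None else chain[:level]


print(accession_types_returned("GSM", 1))  # ['GSM']
print(accession_types_returned("GSE", 3))  # ['GSE', 'GSM', 'SRS']
print(accession_types_returned("SRX"))     # ['SRX', 'SRR']
```

The first two calls mirror the -l 1 and -l 3 examples, and the last one mirrors the default behaviour shown earlier for an SRX accession.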
ffq --ftp [accession(s)]
where [accession(s)] is either a single accession or a space-delimited list of accessions.
ffq --aws [accession(s)]
ffq --gcp [accession(s)]
ffq --ncbi [accession(s)]
# FTP with an SRR
$ ffq --ftp SRR10668798
[
    {
        "accession": "SRR10668798",
        "filename": "SRR10668798_1.fastq.gz",
        "filetype": "fastq",
        "filesize": 31876537192,
        "filenumber": 1,
        "md5": "bf8078b5a9cc62b0fee98059f5b87fa7",
        "urltype": "ftp",
        "url": "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR106/098/SRR10668798/SRR10668798_1.fastq.gz"
    },
...
# FTP with a GSE
$ ffq --ftp GSE115469
[
    {
        "accession": "SRR7276474",
        "filename": "P1TLH.bam",
        "filetype": "bam",
        "filesize": 48545467653,
        "filenumber": 1,
        "md5": "d0fde6bf21d9f97bdf349a3d6f0a8787",
        "urltype": "ftp",
        "url": "ftp://ftp.sra.ebi.ac.uk/vol1/SRA716/SRA716608/bam/P1TLH.bam"
    },
...
# AWS with SRX
$ ffq --aws SRX7347523
[
    {
        "accession": "SRR10668798",
        "filename": "T84_S1_L001_R1_001.fastq.1",
        "filetype": "fastq",
        "filesize": null,
        "filenumber": 1,
        "md5": null,
        "urltype": "aws",
        "url": "s3://sra-pub-src-6/SRR10668798/T84_S1_L001_R1_001.fastq.1"
    },
...
# GCP with ERS
$ ffq --gcp ERS3861775
[
    {
        "accession": "ERR3585496",
        "filename": "4834STDY7002879.bam.1",
        "filetype": "bam",
        "filesize": null,
        "filenumber": 1,
        "md5": null,
        "urltype": "gcp",
        "url": "gs://sra-pub-src-17/ERR3585496/4834STDY7002879.bam.1"
    }
]
# NCBI with GSM
$ ffq --ncbi GSM2905292
[
    {
        "accession": "SRR6425163",
        "filename": "SRR6425163.1",
        "filetype": "sra",
        "filesize": null,
        "filenumber": 1,
        "md5": null,
        "urltype": "ncbi",
        "url": "https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos2/sra-pub-run-13/SRR6425163/SRR6425163.1"
    }
]
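Because the link output is plain JSON, it can be inspected with any JSON library before anything is downloaded. The following is a minimal sketch, assuming the record layout shown in the outputs above; summarize_links is a hypothetical helper, and the two sample records are copied from separate outputs above and combined here only for illustration. Note that filesize is null for some hosts, so the sketch counts those files separately instead of treating them as zero bytes.

```python
import json

# Two records copied from the example outputs (one FTP, one NCBI),
# combined into one list purely for illustration.
ffq_output = """
[
    {
        "accession": "SRR10668798",
        "filename": "SRR10668798_1.fastq.gz",
        "filetype": "fastq",
        "filesize": 31876537192,
        "filenumber": 1,
        "md5": "bf8078b5a9cc62b0fee98059f5b87fa7",
        "urltype": "ftp",
        "url": "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR106/098/SRR10668798/SRR10668798_1.fastq.gz"
    },
    {
        "accession": "SRR6425163",
        "filename": "SRR6425163.1",
        "filetype": "sra",
        "filesize": null,
        "filenumber": 1,
        "md5": null,
        "urltype": "ncbi",
        "url": "https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos2/sra-pub-run-13/SRR6425163/SRR6425163.1"
    }
]
"""


def summarize_links(records):
    """Group link records by run accession and total the known file sizes."""
    summary = {}
    for rec in records:
        entry = summary.setdefault(
            rec["accession"], {"files": 0, "known_bytes": 0, "missing_size": 0}
        )
        entry["files"] += 1
        if rec["filesize"] is None:
            # The host did not report a size; count it rather than guess.
            entry["missing_size"] += 1
        else:
            entry["known_bytes"] += rec["filesize"]
    return summary


for accession, info in summarize_links(json.loads(ffq_output)).items():
    print(accession, info)
```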
ffq -o [JSON_PATH] [accession(s)]
where [JSON_PATH] is the path to the JSON file that will contain the information and [accession(s)] is either a single accession or a space-delimited list of accessions.
ffq -o [OUT_DIR] --split [accessions]
where [OUT_DIR] is the path to the directory in which to write the JSON files and [accessions] is a space-delimited list of accessions.
Information about each accession will be written to its own separate JSON file named [accession].json.
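The per-accession files written with --split can be read back into a single mapping. This is a minimal sketch that assumes only the [accession].json naming described above; load_split_outputs is a hypothetical helper, and the structure inside each file is passed through untouched.

```python
import json
from pathlib import Path


def load_split_outputs(out_dir):
    """Map each accession to the parsed content of OUT_DIR/[accession].json."""
    results = {}
    for path in sorted(Path(out_dir).glob("*.json")):
        with path.open() as handle:
            # path.stem drops the ".json" suffix, leaving the accession.
            results[path.stem] = json.load(handle)
    return results
```

For example, after running ffq with -o and --split on a directory named out (a placeholder), load_split_outputs("out") would return one entry per accession that was written.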
ffq [DOIS]
where [DOIS] is a space-delimited list of one or more DOIs. The output is a JSON-formatted string (or a JSON file if -o is provided) with SRA study accessions as keys. When --split is also provided, each study is written to its own separate JSON.
Examples of complete outputs are available in the examples directory.
ffq is specifically designed to download metadata and to facilitate obtaining links to sequence files. To download raw data from the links obtained with ffq you can use one of the following:
- cURL and wget for FTP links,
- aws for AWS links,
- gsutil for GCP links,
- fasterq-dump for converting SRA files to FASTQ files.
cURL is installed by default on most computers and can be used to download files with FTP links. Alternatively, wget can be used.
# Obtain FTP links
$ ffq --ftp SRR10668798
[
    {
        "accession": "SRR10668798",
        "filename": "SRR10668798_1.fastq.gz",
        "filetype": "fastq",
        "filesize": 31876537192,
        "filenumber": 1,
        "md5": "bf8078b5a9cc62b0fee98059f5b87fa7",
        "urltype": "ftp",
        "url": "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR106/098/SRR10668798/SRR10668798_1.fastq.gz"
    },
    {
        "accession": "SRR10668798",
        "filename": "SRR10668798_2.fastq.gz",
        "filetype": "fastq",
        "filesize": 43760586944,
        "filenumber": 2,
        "md5": "351df47dca211c1f66ef327e280bd4fd",
        "urltype": "ftp",
        "url": "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR106/098/SRR10668798/SRR10668798_2.fastq.gz"
    }
]
# Download the files one-by-one
$ curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR106/098/SRR10668798/SRR10668798_1.fastq.gz
$ curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR106/098/SRR10668798/SRR10668798_2.fastq.gz
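Files of this size are worth checking after download. The sketch below compares a local file against the md5 value from the ffq record, under the assumption that this field is the MD5 checksum of the file served at the record's url; md5_matches is a hypothetical helper, and it reads the file in chunks so that multi-gigabyte FASTQ files do not need to fit in memory.

```python
import hashlib


def md5_matches(path, expected_md5, chunk_size=1024 * 1024):
    """Return True if the MD5 digest of the file at path equals expected_md5."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_md5.lower()


# Example call, using the first record above (requires the downloaded file):
# md5_matches("SRR10668798_1.fastq.gz", "bf8078b5a9cc62b0fee98059f5b87fa7")
```

Records whose md5 is null, as in the AWS, GCP and NCBI outputs, cannot be checked this way.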
Alternatively, the urls can be extracted from the JSON output with jq and then piped into cURL.
$ ffq --ftp SRR10668798 | jq -r '.[] | .url' | xargs -n 1 curl -O
If you don't have jq installed, you can use the default program grep.
$ ffq --ftp SRR10668798 | grep -Eo '"url": "[^"]*"' | grep -o '"[^"]*"$' | xargs -n 1 curl -O
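The same extraction can also be written with Python's standard json module, which parses the output instead of pattern-matching it. This is a minimal sketch assuming the list-of-records layout shown above; extract_urls is a hypothetical helper, and the sample string holds a shortened copy of one record from the FTP output.

```python
import json


def extract_urls(ffq_json_text):
    """Return the "url" field of every record in ffq's link output."""
    return [record["url"] for record in json.loads(ffq_json_text)]


# Shortened copy of one record from the FTP example output.
sample = """
[
    {
        "accession": "SRR10668798",
        "urltype": "ftp",
        "url": "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR106/098/SRR10668798/SRR10668798_1.fastq.gz"
    }
]
"""

for url in extract_urls(sample):
    print(url)
```

In a pipeline, the text would come from standard input (for example via sys.stdin.read()) rather than from a literal string.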
In order to download files from AWS, the aws tool must be installed and credentials must be set up.
# Pipe AWS links to aws s3 cp and download
$ ffq --aws SRX7347523 | jq -r '.[] | .url' | xargs -I {} aws s3 cp {} .
In order to download files from GCP, the gsutil tool must be installed and credentials must be set up.
# Pipe GCP links to gsutil cp and download
$ ffq --gcp ERS3861775 | jq -r '.[] | .url' | xargs -I {} gsutil cp {} .
SRA files downloaded from NCBI can be converted to FASTQ files using fasterq-dump, which is installed as part of SRA Toolkit.
# Pipe SRA link to curl and download the SRA file
$ ffq --ncbi GSM2905292 | jq -r '.[] | .url' | xargs -n 1 curl -O
# Convert the SRA file to FASTQ files (fastq-dump, also part of SRA Toolkit, is used here because fasterq-dump has no --gzip option)
$ fastq-dump ./SRR6425163.1 --split-files --include-technical --gzip -O ./SRR6425163
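The per-host commands in this section can be collected into a single lookup keyed on the urltype field of each record. The sketch below only builds the command as an argument list and does not run it; build_download_command and DOWNLOADERS are hypothetical names, and the command templates are taken from the curl, aws and gsutil examples in this document.

```python
# Command templates mirror the examples in this document; nothing is executed.
DOWNLOADERS = {
    "ftp": lambda url: ["curl", "-O", url],
    "ncbi": lambda url: ["curl", "-O", url],
    "aws": lambda url: ["aws", "s3", "cp", url, "."],
    "gcp": lambda url: ["gsutil", "cp", url, "."],
}


def build_download_command(record):
    """Return the download command (as an argv list) for one ffq link record."""
    urltype = record["urltype"]
    if urltype not in DOWNLOADERS:
        raise ValueError(f"unsupported urltype: {urltype!r}")
    return DOWNLOADERS[urltype](record["url"])


record = {
    "urltype": "gcp",
    "url": "gs://sra-pub-src-17/ERR3585496/4834STDY7002879.bam.1",
}
print(build_download_command(record))
```

The returned list can be passed to subprocess.run(cmd, check=True) once the corresponding tool is installed and, for AWS and GCP, credentials are set up.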