nicolo-tellini / S.cerevisiaeData

Genomic data for 1,674 S. cerevisiae isolates

NEWS:

Easier access to the collection and the manuscripts

CONTENT

What's inside:

- pipeline
- scerstrains.csv # info on S.cer. strains
- sceradditionalstrains.txt # sequencing not available in ENA but downloadable elsewhere
- scerlowcoveragestrains.txt # scer strains without DP filtering
- publicdata.txt # link to the data
- QRCode.png # access to the data
- collectionIncluded.txt # collection references
- short Illumina read data for the Scer x Spar hybrids OS162 and OS2389

COLLECTIONS

First Name Year Population Paper
Dunn (coming soon) 2012 - link
Skelly 2013 Beer link
Marsit (GENOWINE proj) 2015 Wine link
Marsit (EVOLYA proj) (coming soon) 2015 Wine link
Almeida 2015 Mediterranean/North America/Japan link
Strope (100-genome proj) 2015 Multi link
Barbosa 2016 Brazilian link
(coming soon) 201? Clinical link
Gonçalves 2016 Wine/Beer link
(coming soon) 2016 Wine link
Gallone 2016 Beer link
Almeida (coming soon) 2017 - link
Barbosa 2018 Cachaça link
Peter and De Chiara (1011 proj) 2018 Multi link
Legras 2018 Wine/Fermented food environments link
Duan 2018 Chinese link
Ramazzotti 2019 Wine/Insect link
Pontes 2019 Alpechin link
Gallone 2019 Beer link
Fay (coming soon) 2019 Beer link
(coming soon) 202? NZ link
(coming soon) 202? African Fermented Food link
(coming soon) 202? Wild link
(coming soon) 202? ??? link

SOFTWARE

gVCF/BCF FEATURES

The fastqs have been aligned against S288C_reference_genome_R64-3-1_20210421.fa (Scer.genome.fa in pipeline/rep).

The files are provided in text format (.gvcf).

All the genomic positions are included (as long as at least 1 strain has been genotyped at that position).

Chromosome names use a lower-case "chr" prefix and keep Roman numerals, e.g. chrIII (chrMT is the only exception).
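
Region files must use these chromosome names exactly, so a small helper can translate chromosome numbers into this convention. The sketch below is illustrative only: the function name to_chr_name, the file name regions.bed and the coordinates are hypothetical and do not come from the repository.

```shell
#!/bin/sh
# Convert a chromosome number (1-16) or "MT" into the naming used in the
# gVCFs, e.g. 3 -> chrIII. The function name is hypothetical.
to_chr_name() {
  case "$1" in
    1)  echo chrI ;;
    2)  echo chrII ;;
    3)  echo chrIII ;;
    4)  echo chrIV ;;
    5)  echo chrV ;;
    6)  echo chrVI ;;
    7)  echo chrVII ;;
    8)  echo chrVIII ;;
    9)  echo chrIX ;;
    10) echo chrX ;;
    11) echo chrXI ;;
    12) echo chrXII ;;
    13) echo chrXIII ;;
    14) echo chrXIV ;;
    15) echo chrXV ;;
    16) echo chrXVI ;;
    MT|mt) echo chrMT ;;
    *)  echo "unknown chromosome: $1" >&2; return 1 ;;
  esac
}

# Write one tab-separated region line (chromosome, start, end).
# The coordinates are placeholders chosen only for illustration.
printf '%s\t%s\t%s\n' "$(to_chr_name 3)" 1000 2000 > regions.bed
```

A file built this way can be passed to the per-region extraction command in the HOWTO.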

Strain names are replaced by the ENA archive Run Accession.

The HOWTO below explains how to rename the strains in the header.

Example

The strain MTZ13.12 is named SRR7851920 in the gVCF. Renaming SRR7851920 restores the name MTZ13.12.

Using the Run Accession makes filtering easier: it prevents selecting the wrong strains when strain names overlap, are similar, or contain special characters.

The gVCFs/BCFs were filtered as follows:

  • MQ >= 5

  • QUAL >= 20

  • DP >= 10

NOTE: some of the S.cerevisiae isolates were made available via custom websites. These strains are listed in sceradditionalstrains.txt and their genomic data are stored in the sceradditionalstrains files.

NOTE: some of the S.cerevisiae isolates have low coverage, so DP filtering was not applied to them. These strains are listed in scerlowcoveragestrains.txt and their genomic data are stored in the scerlowcoveragestrains files.

DATA ACCESS

🔗 Public link

QRCODE

HOWTO

  • extract data for one or more samples
bcftools view -S thisFILEcontainsONEstrainPERline.txt gvcf.gz -Oz -o myfavoritesamples.gvcf.gz
  • extract data for one or more regions
bcftools view -R thisFILEcontainsCHRstartENDtabSEPARATEDcoordinates.bed gvcf.gz -Oz -o myfavoriteregions.gvcf.gz
  • extract only variant positions (SNPs)
bcftools view -e 'ALT="."' gvcf.gz -Oz -o vcf.gz
  • replace ENA archive Run Accession codes with the original strain names

Before proceeding: the order of the ENA archive Run Accessions in scerstrains.csv must match the output of

bcftools query -l gvcf.gz
Important

The file we provide is already ordered. If you subset the gVCF by samples, you must also subset scerstrains.csv and make sure the order is maintained.

Once the order is the same, you can move to the next step.

cut -f2 scerstrains.csv | grep -v strainName > fromENAtoStrainName.txt

bcftools reheader --samples fromENAtoStrainName.txt -o gvcf.renamedstrains.gz gvcf.gz
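
The ordering requirement can be checked automatically, or avoided altogether, because bcftools reheader also accepts a samples file made of "old_name new_name" pairs. The sketch below is a minimal illustration under assumptions the repository does not confirm: the strain table is tab-separated with a header row, column 1 holds the Run Accession and column 2 the strain name. The function names and the files samples.txt and pairs.txt are hypothetical.

```shell
#!/bin/sh
# ASSUMPTIONS (not confirmed by the repository): the strain table is
# tab-separated, has one header row, column 1 = Run Accession and
# column 2 = strain name. Function names are hypothetical.

# Print "old_name new_name" pairs, one per line, skipping the header row.
# Strain names containing spaces would need escaping and are not handled.
make_rename_pairs() {
  awk -F '\t' 'NR > 1 && $1 != "" && $2 != "" { print $1 " " $2 }' "$1"
}

# Succeed only if the accessions in the table ($2) are listed in exactly
# the same order as the sample list ($1).
same_order() {
  awk -F '\t' 'NR > 1 { print $1 }' "$2" | cmp -s - "$1"
}

# Possible usage (requires bcftools, so it is left commented out):
#   bcftools query -l gvcf.gz > samples.txt
#   same_order samples.txt scerstrains.csv && echo "order matches"
#   make_rename_pairs scerstrains.csv > pairs.txt
#   bcftools reheader --samples pairs.txt -o gvcf.renamedstrains.gz gvcf.gz
```

With a pairs file, a sample is renamed by matching its old name, so a subset gVCF does not need a re-ordered table.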
  • rename the strains in any other .txt file from downstream analyses

Make a backup copy of the .txt file before running sed (sed is as powerful as it is dangerous).

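# For each Run Accession (column 1, header row "vcfname" skipped),
# look up the strain name (column 2) and replace whole-word matches in place.
# Note: \< and \> are GNU sed word boundaries.
for j in $(cut -f1 DATAonSCER.csv | grep -v vcfname)
do
 k=$(grep -w "$j" DATAonSCER.csv | cut -f2)
 sed -i "s+\<${j}\>+${k}+g" myresults.txt
done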
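
As an alternative to editing the file in place once per strain, the same renaming can be sketched as a single awk pass that writes to a new file and leaves the original untouched. This is only an illustration under stated assumptions: both files are tab-separated, the mapping table has a header row, and each Run Accession appears as a whole field in the results file. The function name rename_fields and the output file name are hypothetical.

```shell
#!/bin/sh
# ASSUMPTIONS: mapping table ($1) and results file ($2) are tab-separated;
# the mapping table has a header row, column 1 = Run Accession and
# column 2 = strain name; accessions occupy whole fields in the results.
# The function name is hypothetical.
rename_fields() {
  awk -F '\t' -v OFS='\t' '
    # First file: load the accession -> strain name map, skipping the header.
    NR == FNR { if (FNR > 1) map[$1] = $2; next }
    # Second file: replace every field that exactly equals an accession.
    { for (i = 1; i <= NF; i++) if ($i in map) $i = map[$i]; print }
  ' "$1" "$2"
}

# Possible usage (left commented out; file names as in the loop above):
#   rename_fields DATAonSCER.csv myresults.txt > myresults.renamed.txt
```

Because only exact field matches are replaced, an accession that is a prefix of another one is never rewritten by mistake. Unlike the sed loop, accessions embedded inside a longer field are left unchanged.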

ADDITIONAL DATA

Additional data are stored on a network-attached storage (NAS) device and shared through a personal, password-protected link; both the link and the password will be provided by email.

The password is personal and unique.

For security reasons, access to the data is restricted to a few devices.

The link remains valid only until the download is completed.

Additional data:

  • single-strain gVCF (filtered as described below) [available]

    Filters

    MQ >= 5

    QUAL >= 20

    DP >= 10

  • single-strain BAM files [not available but can be requested]

  • Any other intermediate file [not available but can be requested]

CONTACTS

Short-term contact:
To: nicolo.tellini.2@gmail.com
Subject: DATAEXT-yourname-DD/MM/YYYY

Long-term contact:
To: matnamo@gmail.com
Subject: DATAEXT-yourname-DD/MM/YYYY

VERSION

Version Date N. isolates
1.0 09/12/2022 1,674

CITATION

Please cite the paper below when using the data in your publications, along with the papers for the collections you use and the software listed above.

Tellini, N., De Chiara, M., Mozzachiodi, S., Tattini, L., Vischioni, C., Naumova, E., Warringer, J., Bergström, A. and Liti, G., 2023. Ancient and recent origins of shared polymorphisms in yeast. (preprint)

License:MIT License