nicolo-tellini/S.cerevisiaeData

NEWS:

Easier access to the collection and the manuscripts

CONTENT

What's inside:

- pipeline
- scerstrains.csv # info on S.cer. strains
- sceradditionalstrains.txt # sequencing not available in ENA but downloadable elsewhere
- scerlowcoveragestrains.txt # scer strains without DP filtering
- publicdata.txt # link to the data
- QRCode.png # access to the data
- collectionIncluded.txt # collection references
- Illumina short-read data for the ScerxSpar hybrids OS162 and OS2389

COLLECTIONS

First author	Year	Population	Paper
Dunn (coming soon)	2012	-	link
Skelly	2013	Beer	link
Marsit (GENOWINE proj)	2015	Wine	link
Marsit (EVOLYA proj) (coming soon)	2015	Wine	link
Almeida	2015	Mediterranean/North America/Japan	link
Strope (100-genome proj)	2015	Multi	link
Barbosa	2016	Brazilian	link
(coming soon)	201?	Clinical	link
Gonçalves	2016	Wine/Beer	link
(coming soon)	2016	Wine	link
Gallone	2016	Beer	link
Almeida (coming soon)	2017	-	link
Barbosa	2018	Cachaça	link
Peter and De Chiara (1011 proj)	2018	Multi	link
Legras	2018	Wine/Fermented food environments	link
Duan	2018	Chinese	link
Ramazzotti	2019	Wine/Insect	link
Pontes	2019	Alpechin	link
Gallone	2019	Beer	link
Fay (coming soon)	2019	Beer	link
(coming soon)	202?	NZ	link
(coming soon)	202?	African Fermented Food	link
(coming soon)	202?	Wild	link
(coming soon)	202?	???	link

SOFTWARE

bwa v. 0.7.17-r1198-dirty
samtools v. 1.14
bcftools v. 1.15.1

gVCF/BCF FEATURES

The fastqs have been aligned against S288C_reference_genome_R64-3-1_20210421.fa (Scer.genome.fa in pipeline/rep).

The files are provided in text format (.gvcf).

All the genomic positions are included (as long as at least 1 strain has been genotyped at that position).

Chromosome names are lowercase and keep Roman numerals, e.g. chrIII (chrMT is the only exception).

Strain names are replaced by the ENA archive Run Accession.

The HOWTO below shows how to rename the strains in the header.

Example

The strain MTZ13.12 appears in the gVCF as SRR7851920. Renaming SRR7851920 restores MTZ13.12.

Using the Run Accession simplifies the filtering phase: it prevents selecting the wrong strain when strain names overlap, look alike, or contain special characters.

The gVCFs/BCFs were filtered as follows:

MQ >= 5
QUAL >= 20
DP >= 10

NOTE: some of the S.cerevisiae isolates were made available via custom websites rather than ENA. These strains are listed in sceradditionalstrains.txt and their genomic data are stored in the sceradditionalstrains files.

NOTE: some of the S.cerevisiae isolates have low coverage, so DP filtering was not applied to them. These strains are listed in scerlowcoveragestrains.txt and their genomic data are stored in the scerlowcoveragestrains files.

DATA ACCESS

🔗 Public link

HOWTO

extract per-sample/s data

bcftools view -S thisFILEcontainsONEstrainPERline.txt gvcf.gz -Oz -o myfavoritesamples.gvcf.gz
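
bcftools view -S stops with an error when the list contains a sample that is absent from the file (unless --force-samples is given), so it can help to check the list first. The following is only a minimal sketch: the function name missing_samples and the file names wanted.txt and available.txt are placeholders, not files shipped with this repository.

```shell
# missing_samples WANTED AVAILABLE
# Print every line of WANTED that does not appear, as a whole line, in AVAILABLE.
missing_samples() {
  grep -vxFf "$2" "$1"
}

# Possible use (requires bcftools and the gVCF):
#   bcftools query -l gvcf.gz > available.txt
#   missing_samples wanted.txt available.txt
```

An empty output means that every requested sample is present in the gVCF.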

extract per-region/s data

bcftools view -R thisFILEcontainsCHRstartENDtabSEPARATEDcoordinates.bed gvcf.gz -Oz -o myfavoriteregions.gvcf.gz
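
Two details are easy to miss with -R: the input must be compressed and indexed, and bcftools reads a file ending in .bed as 0-based, half-open intervals. A sketch with made-up coordinates follows; the file names regions.bed and regions.subset.gvcf.gz and the intervals themselves are illustrative assumptions.

```shell
# Two hypothetical regions; columns are CHROM, START (0-based), END, separated by tabs.
printf 'chrIII\t0\t1000\n' >  regions.bed
printf 'chrMT\t500\t2000\n' >> regions.bed

# Sanity check: exactly three tab-separated fields and START < END on every line.
awk -F'\t' 'NF != 3 || $2 >= $3 { bad = 1 } END { exit bad + 0 }' regions.bed \
  && echo "regions.bed looks fine"

# Run bcftools only when it is installed and the gVCF is present.
if command -v bcftools >/dev/null 2>&1 && [ -f gvcf.gz ]; then
  # Build an index if none exists yet (-R needs one).
  [ -f gvcf.gz.csi ] || [ -f gvcf.gz.tbi ] || bcftools index gvcf.gz
  bcftools view -R regions.bed gvcf.gz -Oz -o regions.subset.gvcf.gz
fi
```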

extract only variant positions (SNPs)

bcftools view -e 'ALT="."' gvcf.gz -Oz -o vcf.gz

replace ENA archive Run Accession codes with the original strain names

Before proceeding: the order of the ENA archive Run Accessions in scerstrains.csv must match the order of the samples reported by

bcftools query -l gvcf.gz

Important

The file we provide is already ordered, but if you subset the gVCF by sample you must subset scerstrains.csv as well and make sure the order is maintained.

Once the order is the same, you can move to the next step.
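
One way to subset the table and verify the order is sketched here. It assumes that scerstrains.csv is tab-separated with a single header line and that the Run Accession sits in column 1; only the strain name in column 2 is confirmed by the commands in this HOWTO. The function names and samples.txt are hypothetical.

```shell
# subset_table SAMPLES TABLE
# Print the header of TABLE, then the rows whose first column is listed in
# SAMPLES, in the order given by SAMPLES.
subset_table() {
  awk -F'\t' 'NR == FNR { if (FNR == 1) print; else row[$1] = $0; next }
              ($1 in row) { print row[$1] }' "$2" "$1"
}

# check_order SAMPLES TABLE
# Succeed only if column 1 of TABLE (header skipped) equals SAMPLES line by line.
check_order() {
  tail -n +2 "$2" | cut -f1 | diff - "$1" >/dev/null
}

# Possible use (requires bcftools and a subsetted gVCF):
#   bcftools query -l myfavoritesamples.gvcf.gz > samples.txt
#   subset_table samples.txt scerstrains.csv > scerstrains.subset.csv
#   check_order samples.txt scerstrains.subset.csv && echo "order matches"
```

If check_order fails, the renaming step would assign strain names to the wrong samples, so it is worth stopping there.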

cut -f2 scerstrains.csv | grep -v strainName > fromENAtoStrainName.txt

bcftools reheader --samples fromENAtoStrainName.txt -o gvcf.renamedstrains.gz gvcf.gz

rename the strains in any other .txt file from downstream analyses

Make a backup copy of the .txt file before running sed (sed is as powerful as it is dangerous).

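# Column 1 of DATAonSCER.csv: name used in the VCF; column 2: original strain name (tab-separated).
for j in $(cut -f1 DATAonSCER.csv | grep -v vcfname)
do
 # Look up the strain name paired with the accession $j
 k=$(grep -w "$j" DATAonSCER.csv | cut -f2)
 # Replace whole-word matches only (\< and \> are GNU sed word boundaries)
 sed -i "s+\<${j}\>+${k}+g" myresults.txt
done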
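
As an alternative that leaves the input file untouched, the same substitutions can be collected into a sed script and applied in a single pass. This is a sketch under assumptions, not the method used by the authors: it assumes GNU sed, a tab-separated table with one header line, and identifiers and strain names free of +, & and backslashes. The function name rename_ids is hypothetical.

```shell
# rename_ids MAP INPUT OUTPUT
# MAP: tab-separated, header on line 1, old name in column 1, new name in column 2.
# Writes a renamed copy of INPUT to OUTPUT; INPUT itself is not modified.
rename_ids() {
  local script
  script=$(mktemp)
  # One whole-word substitution per table row, e.g. s+\<SRR7851920\>+MTZ13.12+g
  awk -F'\t' 'NR > 1 && $1 != "" && $2 != "" { printf "s+\\<%s\\>+%s+g\n", $1, $2 }' "$1" > "$script"
  sed -f "$script" "$2" > "$3"
  rm -f "$script"
}

# Possible use:
#   rename_ids DATAonSCER.csv myresults.txt myresults.renamed.txt
```

Because the output goes to a new file, the original results remain available as a backup.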

ADDITIONAL DATA

Additional data are stored on a network-attached storage (NAS) device and shared through a personal, password-protected link; both the link and the password will be provided by email.

The password is personal and unique.

Access to the data is restricted to a few devices for security reasons.

The link remains valid only until the download is completed.

Additional data:

single-strain gVCF (filtered as described below) [available]

Filters

MQ >= 5

QUAL >= 20

DP >= 10

single-strain BAM files [not available, but can be requested]

Any other intermediate file [not available, but can be requested]

CONTACTS

Short-term contact:
To: nicolo.tellini.2@gmail.com
Subject: DATAEXT-yourname-DD/MM/YYYY

Long-term contact:
To: matnamo@gmail.com
Subject: DATAEXT-yourname-DD/MM/YYYY

VERSION

Version	Date	N. isolates
1.0	09/12/2022	1,674

CITATION

Please cite the paper below when using the data in your publications, along with the papers describing the collections you use and the software listed above.

Tellini, N., De Chiara, M., Mozzachiodi, S., Tattini, L., Vischioni, C., Naumova, E., Warringer, J., Bergström, A. and Liti, G., 2023. Ancient and recent origins of shared polymorphisms in yeast. (preprint)
