Customize database guidance

dcm9123 opened this issue 6 months ago · comments

Hello!

I am new to PICRUSt2 and I am trying to figure out how to use a customized database for my ASVs. We originally designed two consortia from two sets of bacteria (consortia #1 and #2). The consortia overlap slightly in membership, and each contains roughly 60 different bacteria. We performed WGS on these strains, assembled their genomes, and estimated their metabolic pathways using DRAM. We then delivered these consortia to animal models by lavage to measure their engraftment efficiency in WT and transgenic mice. To determine this, we extracted DNA from fecal pellets, amplified the 16S V3-V4 region, sequenced it (MiSeq 2x250 PE), inferred ASVs with DADA2, and assigned taxonomy with GTDB-Tk.

I now have the .biom table with the abundance of each ASV, as well as a FASTA file with each ASV sequence, and I would like to estimate functional abundances across these two types of mice. My question is: what modifications do I need to make to estimate the functional abundances properly? In the FAQ, I noticed these instructions:

A multiple-sequence alignment (extension .fna or .fasta; can optionally be gzipped)
A tree in Newick format (extension .tre)
A hidden Markov model of the multiple-sequence alignment (extension .hmm)
A model file output by RAxML specifying the best parameters for the tree (extension .model)

I can easily create an MSA of my ASVs, build a RAxML tree, and build an HMM from the MSA, but do I also need to modify the files in /Users/danielcm/Desktop/Sycuro/Projects/Diabetes/picrust2/default_files/prokaryotic/ (EC, KO, etc.) using the metabolic pathways generated by DRAM? I could adapt a KO table if that would help reduce the bias of calling a protein from the reference files that may not actually be present in my strains. However, I am unsure whether I need to adapt every file to my DRAM output (pfam, ko, cog, ec, 16s, pheno, tigrfam).

Thanks in advance for all your help!!! (:

Robyn Wright · Answer 1 · Sat Dec 09 2023 09:11:40 GMT+0800 (China Standard Time)

Hi there,

For your purposes of customising a database, you would want to create those files (the MSA, tree, HMM and RAxML model) using your MAGs (well, with the 16S sequences of your MAGs). And yes, you'd then need to modify some of those files with your DRAM output, although you won't need all of them, depending on which of those classifications you want your output to be for your 16S amplicon sequencing ASVs. You'll need the 16S and then 1+ of the other ones.

Let me know if you have other questions or that didn't cover your question properly!

Best wishes,
Robyn
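
As a rough illustration of adapting a trait table from DRAM output, the Python sketch below counts KO assignments per genome and writes a tab-separated genome-by-KO matrix. The column names `fasta` and `ko_id`, the `assembly` header, and the example rows are assumptions made for illustration; check them against your own DRAM annotations file and against the default PICRUSt2 trait tables before relying on this. The row IDs would presumably also need to match the sequence IDs used in the reference alignment and tree.

```python
import csv
import io
from collections import Counter, defaultdict


def ko_counts_per_genome(annotation_rows, genome_col="fasta", ko_col="ko_id"):
    """Count how many genes in each genome were assigned to each KO.

    Column names are assumptions; adjust them to your annotations file.
    """
    counts = defaultdict(Counter)
    for row in annotation_rows:
        genome = row[genome_col]
        ko = (row.get(ko_col) or "").strip()
        counts[genome]  # touch the key so genomes without KO hits still get a row
        if ko:
            counts[genome][ko] += 1
    return counts


def write_trait_table(counts, handle, id_header="assembly"):
    """Write a zero-filled, tab-separated genome x KO count matrix."""
    all_kos = sorted({ko for per_genome in counts.values() for ko in per_genome})
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow([id_header] + all_kos)
    for genome in sorted(counts):
        writer.writerow([genome] + [counts[genome][ko] for ko in all_kos])


# Tiny made-up table standing in for a real annotations file.
example = io.StringIO(
    "gene\tfasta\tko_id\n"
    "g1\tstrainA\tK00001\n"
    "g2\tstrainA\tK00001\n"
    "g3\tstrainA\t\n"
    "g4\tstrainB\tK00002\n"
)
table = ko_counts_per_genome(csv.DictReader(example, delimiter="\t"))
out = io.StringIO()
write_trait_table(table, out)
print(out.getvalue())
```

The same pattern would apply to EC numbers or other classifications by pointing `ko_col` at a different column.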

dcm9123 · Answer 2 · Fri Jan 12 2024 06:29:03 GMT+0800 (China Standard Time)

Hello,
Thank you for your fast response! I was finally able to get back to this, and I am encountering a new problem. I now have the four files for my new reference folder: the .hmm, .tre, .model, and .fna (aligned ASVs). I was able to run the same data against the original PICRUSt2 reference folder, but I cannot run it against the new local folder I made, using the following command:
picrust2_pipeline.py -s test_diabetes_data/raw_dada2_asv.fasta -i test_diabetes_data/feature_table_gtdb.biom -o picrust2_with_local_files_output --ref_dir local_files_picrust

I get the following error:
Warning - non-default reference files specified with default pathway mapfile of prokaryote-specific MetaCyc pathways (--pathway_map option). This usage may be unintended.

Error running this command:

place_seqs.py --study_fasta test_diabetes_data/raw_dada2_asv.fasta --ref_dir local_files_picrust --out_tree picrust2_with_local_files_output/out.tre --processes 1 --intermediate picrust2_with_local_files_output/intermediate/place_seqs --min_align 0.8 --chunk_size 5000 --placement_tool epa-ng

Standard error of the above failed command:
Stopping - all 125 input sequences aligned poorly to reference sequences (--min_align option specified a minimum proportion of 0.8 aligning to reference sequences).

I have not yet modified the KO, EC, and metabolic pathway files, as I first wanted to see whether the pipeline runs smoothly with the current ones. I've checked the FAQ page, and it seems the suggested solution is to fix the FASTA headers of my ASVs, but they ran perfectly fine before (and the headers are simply "ASV1", "ASV2", ...). Do I need to modify the MetaCyc pathways too?
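
To get a feel for what a minimum-alignment threshold like --min_align 0.8 is testing, here is a small Python sketch. It assumes that the check compares the number of non-gap characters left in each aligned study sequence with the length of the original sequence; that is one plausible reading of the error message, not PICRUSt2's actual code, and the sequences are made up.

```python
def aligned_fraction(raw_seq, aligned_seq, gap_chars="-."):
    """Fraction of the raw sequence that survives in the alignment (assumed metric)."""
    if not raw_seq:
        return 0.0
    kept = sum(1 for ch in aligned_seq if ch not in gap_chars)
    return kept / len(raw_seq)


def poorly_aligned(raw, aligned, min_align=0.8):
    """Return the IDs whose aligned fraction falls below min_align."""
    return sorted(
        seq_id
        for seq_id, seq in raw.items()
        if aligned_fraction(seq, aligned.get(seq_id, "")) < min_align
    )


# Made-up sequences: ASV1 keeps 9 of 10 bases, ASV2 keeps only 4 of 10.
raw = {"ASV1": "ACGTACGTAC", "ASV2": "ACGTACGTAC"}
aligned = {"ASV1": "ACGTACGTA-", "ASV2": "--ACGT----"}

flagged = poorly_aligned(raw, aligned)
print(flagged)
```

Under this reading, every sequence being flagged would point at a problem with the reference alignment or HMM as a whole rather than with individual ASV headers.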

This is what I've done so far:

1. Created an MSA file from my original ASVs and put it in the local reference folder with the .fna extension.
2. Using the file from the previous step, I created a tree model and best tree using raxml-ng-mpi:
   raxml-ng-mpi --msa aligned_asv.fna --model GTR+G --threads 5 --prefix raxml_ng_mpi_local_asv
3. Changed the extension of .bestTree to .tre, as it appears to be in Newick format.
4. Put the .model file in the local reference folder.
5. Created a .hmm file from my ASV file using hmmbuild.
6. Finally, I ran the pipeline as shown above, using as input the .fasta of the same ASVs that I used to generate the MSA and tree.
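
Since the last step feeds the pipeline the same ASVs that were used to build the reference, one quick sanity check is to compare sequence IDs between the reference alignment and the study FASTA. The Python sketch below is only an illustration with made-up records; it detects shared IDs and does not by itself explain the alignment error.

```python
def fasta_ids(lines):
    """Collect sequence IDs (first token of each '>' header line)."""
    ids = set()
    for line in lines:
        if line.startswith(">") and line[1:].strip():
            ids.add(line[1:].split()[0])
    return ids


# Made-up records; in practice, pass open file handles instead of lists.
reference_msa = [">ASV1", "ACGT-ACGT", ">ASV2", "ACGTTACGT"]
study_fasta = [">ASV1 some description", "ACGTACGT", ">ASV3", "ACGTTTTT"]

shared = sorted(fasta_ids(reference_msa) & fasta_ids(study_fasta))
print(shared)
```

A non-empty result means the study sequences and the reference share IDs, which suggests the reference was built from the study data rather than from independent genome-derived sequences.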

Thanks in advance,

Daniel CM

Robyn Wright · Answer 3 · Tue Jan 23 2024 03:03:06 GMT+0800 (China Standard Time)

Hi Daniel,

When you say that you created the MSA using your original ASVs, do you mean that literally? The MSA in the local reference folder needs to be built from the 16S sequences of your MAGs. Also note that your file names need to have the same prefix as the folder name (as shown here).
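
A naming check along these lines can be sketched in Python. The sketch assumes that each of the four files must be named exactly as the folder name plus the extension listed in the FAQ; the function name and the demo file names are hypothetical.

```python
import tempfile
from pathlib import Path

# File roles and accepted extensions, taken from the FAQ list quoted in this thread.
REQUIRED = {
    "alignment": (".fna", ".fna.gz", ".fasta", ".fasta.gz"),
    "tree": (".tre",),
    "hmm": (".hmm",),
    "model": (".model",),
}


def missing_reference_files(ref_dir):
    """Return the roles for which no '<folder name><extension>' file exists."""
    ref_dir = Path(ref_dir)
    prefix = ref_dir.name
    missing = []
    for role, extensions in REQUIRED.items():
        if not any((ref_dir / f"{prefix}{ext}").is_file() for ext in extensions):
            missing.append(role)
    return missing


# Demo in a throwaway folder: the model file deliberately has the wrong prefix.
with tempfile.TemporaryDirectory() as tmp:
    ref = Path(tmp) / "local_files_picrust"
    ref.mkdir()
    for name in (
        "local_files_picrust.fna",
        "local_files_picrust.tre",
        "local_files_picrust.hmm",
        "aligned_asv.model",
    ):
        (ref / name).touch()
    missing = missing_reference_files(ref)

print(missing)
```

In the demo, only the model file is reported because its name does not start with the folder name.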

If that is all correct, I'd recommend running through step-by-step rather than just using the full pipeline (so start with the place_seqs.py script, etc.)
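
For running the first step on its own, the Python sketch below rebuilds the place_seqs.py call that the pipeline reported in the error log above, using only the flags that appear in that log. The helper name is hypothetical, and the command is printed rather than executed.

```python
import shlex


def place_seqs_command(study_fasta, ref_dir, out_dir, processes=1, min_align=0.8):
    """Assemble the place_seqs.py call shown in the pipeline's error output."""
    return [
        "place_seqs.py",
        "--study_fasta", study_fasta,
        "--ref_dir", ref_dir,
        "--out_tree", f"{out_dir}/out.tre",
        "--processes", str(processes),
        "--intermediate", f"{out_dir}/intermediate/place_seqs",
        "--min_align", str(min_align),
        "--chunk_size", "5000",
        "--placement_tool", "epa-ng",
    ]


cmd = place_seqs_command(
    "test_diabetes_data/raw_dada2_asv.fasta",
    "local_files_picrust",
    "picrust2_with_local_files_output",
)
print(shlex.join(cmd))
# To actually run it (PICRUSt2 must be installed and on PATH):
#   import subprocess; subprocess.run(cmd, check=True)
```

Running this step alone keeps its intermediate files separate from the rest of the pipeline, which makes it easier to inspect the alignment output when the placement fails.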

If you have already done all of that, could you please send the files you're using?

Thanks,
Robyn