Clarifying the file input format

Question

anncsun opened this issue 9 months ago · comments

Hi, I recently came across your article in Blood Advances and I'm interested in trying your program. I installed everything and the test ran successfully with all the outputs specified by your wiki. However, I'm confused about what input I should be giving to the program. Your wiki says that it should be raw counts, but when I looked through your text.csv file, it can't possibly be raw counts since the values are not all integers. In your supp, you converted your data to logCPM, normalized to TMM, and then standardized. Can you please clarify the format of my input? Also, will it be alright to feed in a dataset with all Ensembl genes, or should I filter out noncoding/low-count genes first? Thank you for your help and for this interesting software! It's an analysis that I had discussed with my PI before, but I lacked the informatics skills to do it.

AllenZPGu · Answer 1 · Wed Nov 08 2023 16:07:25 GMT+0800 (China Standard Time)

Hi Ann, thanks for your interest! To answer your questions:

The counts in the test files were quantified by Salmon, which reports estimated counts that may be non-integer. TALLSorts doesn't require count inputs to be integers, and is robust to different count quantification methods.
The default TALLSorts classification model essentially picks out and weights the most informative genes, which are listed in Supp Table S6. So it doesn't matter whether your input matrix is pre-filtered or not.
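
If it helps, a quick sanity check of a counts matrix before running anything could look something like the sketch below. This is only an illustration with pandas, not part of TALLSorts: the toy matrix, its genes-by-samples orientation, and the helper name summarize_counts are all made up for the example. It simply reports whether there are negative or missing values and what fraction of values are non-integer, which is expected for estimated counts such as those from Salmon.

```python
import pandas as pd

# Hypothetical toy matrix (genes as rows, samples as columns); the orientation
# and the gene/sample names are assumptions made for this illustration only.
counts = pd.DataFrame(
    {"sample_A": [10.0, 3.27, 0.0], "sample_B": [12.5, 0.0, 7.0]},
    index=["GENE1", "GENE2", "GENE3"],
)


def summarize_counts(df):
    """Report basic properties of a counts matrix (hypothetical helper)."""
    values = df.to_numpy(dtype=float)
    return {
        # Counts, estimated or raw, should never be negative.
        "has_negative": bool((values < 0).any()),
        # Missing values usually point to a parsing or merging problem.
        "has_missing": bool(pd.isna(values).any()),
        # Estimated counts are often fractional, so a non-zero value is fine.
        "fraction_non_integer": float((values % 1 != 0).mean()),
    }


summary = summarize_counts(counts)
print(summary)
```

A non-zero fraction_non_integer on its own is not a problem for the reasons given above; negative or missing values would be worth investigating.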

Please let me know if I can be of more help!

dkapadia612 · Answer 2 · Wed Jan 03 2024 02:40:11 GMT+0800 (China Standard Time)

Hello! I have a follow-up query on the normalization question posed by @anncsun. Is converting data to logCPM, normalizing to TMM, and then standardizing required before running TALLSorts, or are these steps performed by the TALLSorts -s sample.csv function? Thank you!
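
For anyone unsure what the steps in question involve, here is a rough NumPy sketch of log-CPM followed by per-gene standardization. It is a simplified illustration only: TMM normalization is deliberately left out, the prior of 1 and the genes-by-samples orientation are assumptions, and the function names are made up. Nothing in this thread states whether TALLSorts performs these steps internally, so this should not be read as a description of what the tool does.

```python
import numpy as np


def log_cpm(counts, prior=1.0):
    """Simplified log2 counts-per-million (no TMM scaling factors applied)."""
    counts = np.asarray(counts, dtype=float)
    # Library size = total counts per sample (one column per sample).
    lib_sizes = counts.sum(axis=0, keepdims=True)
    cpm = counts / lib_sizes * 1e6
    # The prior avoids log2(0); its value here is an assumption.
    return np.log2(cpm + prior)


def standardize_genes(values):
    """Z-score each gene (row) across samples."""
    mean = values.mean(axis=1, keepdims=True)
    std = values.std(axis=1, keepdims=True)
    # Leave constant genes at zero instead of dividing by zero.
    std[std == 0] = 1.0
    return (values - mean) / std


# Hypothetical toy matrix: 3 genes x 3 samples, with fractional counts.
toy_counts = np.array(
    [
        [10.0, 20.0, 30.0],
        [5.5, 0.0, 12.25],
        [100.0, 100.0, 100.0],
    ]
)

z = standardize_genes(log_cpm(toy_counts))
print(z.round(3))
```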