cellTypeSpecificAlleleSpecificExpression: filename and sample names

Question

colindaven opened this issue 8 years ago · comments

Colin Davenport commented 8 years ago

Hi,

Which filename does the error message mean? Several different ones have to be declared.

Also: is "snp Location: NONE" a critical error as well?

Thanks,
Colin

coupling file format:

x_rep1x_rep1.bam

also tried

x_rep1x_rep1

java -Xmx24g -jar cellTypeSpecificAlleleSpecificExpression-0.1-jar-with-dependencies.jar -A 1 --bam_file x_rep1.bam --coupling_file coupling1.txt --genotype_location x.vcf.gz -O test1_ASreads_output.txt
---- Starting ASREADS for the following settings: ----
input bam: x_rep1.bam
genotype location: x.vcf.gz
coupling file: coupling1.txt
output location: test1_ASreads_output.txt

snp Location: NONE

Location of bam file:
x_rep1.bam
Location of bam file is an existing file, will continue.
Sample names were loaded.
sample_name: x_rep1
sample_map: {}
Exception in thread "main" java.lang.IllegalArgumentException: Couldn't find the filename in the sample names. Quitting.
at nl.systemsgenetics.cellTypeSpecificAlleleSpecificExpression.ReadGenoAndAsFromIndividual.readGenoAndAsFromIndividual(ReadGenoAndAsFromIndividual.java:180)
at nl.systemsgenetics.cellTypeSpecificAlleleSpecificExpression.MainEntryPoint.main(MainEntryPoint.java:354)

Adriaan van der Graaf · Answer 1 · Mon May 16 2016 18:54:28 GMT+0800 (China Standard Time)

Hi colindaven!

Thanks for your question; it's nice to see you are interested in the program.

Based on the output of the program, I'm pretty sure you haven't input a correct coupling file. Make sure there is a tab character between the individual name (in the genotype file) and the sample name (the bam file).
Then the sample_map line will contain the individual and the sample name, and it should run.

A simple coupling file should look like this: click
Could you try that and see if it solves the problem?
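A quick way to rule out the coupling file is to check it before running the tool. The following is a minimal sketch in Python, not part of the program: it assumes only what is described above, namely one line per sample, with the individual name and the sample name separated by a single tab. The function name `check_coupling_lines` is made up for illustration.

```python
def check_coupling_lines(lines):
    """Return (line_number, problem) tuples for lines that are not
    exactly two non-empty, tab-separated fields."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            # Blank lines are flagged because they are easy to miss in an editor.
            problems.append((number, "empty line"))
            continue
        fields = line.split("\t")
        if len(fields) != 2:
            problems.append(
                (number, f"expected 2 tab-separated fields, found {len(fields)}")
            )
        elif not all(field.strip() for field in fields):
            problems.append((number, "empty individual or sample name"))
    return problems


if __name__ == "__main__":
    # Hypothetical file contents: the first line uses a tab, the second a space.
    example = ["x_rep1\tx_rep1\n", "x_rep1 x_rep1\n"]
    for number, problem in check_coupling_lines(example):
        print(f"line {number}: {problem}")
```

To check a real file, pass an open file handle, for example `check_coupling_lines(open("coupling1.txt"))`. Spaces that an editor inserted in place of a tab will show up as a one-field line.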

You should not worry about the line "snp Location: NONE"; it only matters if you want to take ASreads from a defined set of SNPs.
I plan to change this line in a future commit so that it is less confusing.

Thanks, Adriaan

Colin Davenport · Answer 2 · Tue May 17 2016 15:45:37 GMT+0800 (China Standard Time)

Adriaan, thanks for the reply.

The formatting above didn't work too well, apologies

`coupling file format:(with tabs)

x_rep1 x_rep1.bam

also tried

x_rep1 x_rep1`

If the coupling file is ok, then might the command line not be correct ? Do you always have all source bams and files in the same working directory for example ? I don't, but I excluded the paths for simplicity.

To come back to my original question too, could you change this error message please (if I'm correct)
"Couldn't find the filename in the sample names. Quitting."

Suggest
"Couldn't find the BAM filename in the sample names specified in the coupling file. Quitting."

Also, am I now using the most up to date version ?
cellTypeSpecificAlleleSpecificExpression-0.1-jar-with-dependencies.jar

Thanks,
Colin

Adriaan van der Graaf · Answer 3 · Tue May 17 2016 17:54:36 GMT+0800 (China Standard Time)

Dear Colin,

First, I can't reproduce the problem, so I don't see why you still have it. In my opinion the problem is still the coupling file. I've attached a minimum working example that is also used in testing the program. This works on my machine. Could you check whether it works on your side?

minimumASReadsExample.tar.gz

Second: I will update the error messages and will update the program subsequently.

Third: You are using the latest version of the program if you have downloaded it from this link.

Thanks, Adriaan

Colin Davenport · Answer 4 · Tue May 17 2016 19:52:56 GMT+0800 (China Standard Time)

Hi Adriaan,

thanks for this.

The minimal example works fine:

java -jar ../../cellTypeSpecificAlleleSpecificExpression-0.1-jar-with-dependencies.jar -A 1 -C coupling.txt -O test1 --bam_file testBam.bam --genotype_location TriTyperFolder/
---- Starting ASREADS for the following settings: ----
input bam: testBam.bam
genotype location: TriTyperFolder/
coupling file: coupling.txt
output location: test1

snp Location: NONE

Location of bam file:
testBam.bam
Location of bam file is an existing file, will continue.
2016-05-17 13:40:32,141 INFO [TriTyperGenotypeData] Loaded 1 out of 1 samples.
2016-05-17 13:40:32,170 INFO [TriTyperGenotypeData] Loaded 12 out of 12 SNPs, 12 of loaded SNPs have annotation.
Sample names were loaded.
sample_name: testBam
sample_map: {testBam=0}
sample_index: 0
Initialized for reading bam file

The coupling file is so simple; I have tried multiple different versions and all possible variants that occur to me (with .bam, without .bam, with empty lines after it, which gives a parse error, and without empty lines after it), yet nothing works. This leads me to believe it is another issue.

I am using a VCF file not the TriTyper format. I will try to create TriTyper format from the VCF and retry with this (no idea how tricky this will be!). The VCF is classically bgzipped and tabixed. I have also now got all files in the same directory, so this cannot be the problem. Also, does the VCF need to be phased (it's not!) ?

The VCF "sample" on the "#CHROM" header line might be an (additional?) issue. Does this have to be the same as that given to the JAR in the coupling.txt or original BAM file name ?
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT x_rep1.bam
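One way to see which individual names a VCF actually declares is to read them from the `#CHROM` header line and compare them with the first column of the coupling file. This is a small Python sketch, not part of the program; whether a mismatch here explains the empty `sample_map` was never confirmed in this thread, so treat it as a diagnostic aid only. The function names are made up for illustration.

```python
import gzip


def sample_names_from_header(lines):
    """Return the sample columns of the #CHROM header line.

    The first nine columns (CHROM .. FORMAT) are fixed by the VCF format;
    everything after them is a sample name.
    """
    for line in lines:
        if line.startswith("#CHROM"):
            return line.rstrip("\r\n").split("\t")[9:]
        if not line.startswith("#"):
            break  # reached the data lines without seeing a header
    raise ValueError("no #CHROM header line found")


def vcf_sample_names(path):
    """Read sample names from a plain or gzip/bgzip-compressed VCF file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sample_names_from_header(handle)


def missing_individuals(coupling_individuals, vcf_samples):
    """Individuals named in the coupling file that the VCF does not declare."""
    declared = set(vcf_samples)
    return [name for name in coupling_individuals if name not in declared]
```

With the header shown above, `vcf_sample_names("x.vcf.gz")` would return `["x_rep1.bam"]`, and `missing_individuals(["x_rep1"], ["x_rep1.bam"])` would report `x_rep1` as not declared in the VCF.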

Sorry for this!
Colin

Adriaan van der Graaf · Answer 5 · Tue May 17 2016 20:03:37 GMT+0800 (China Standard Time)

Hey Colin,

First off, I'm happy you're using my program. So please don't feel sorry, it's nice to see that I did not only make it for myself. Issues will crop up and that is fine.

The information you've given me leads me to believe that the coupling file is fine, and that the problem is in parsing the VCF.
My suggestion is to make it work using the TriTyper format, which is a lot faster than VCF.
It's really easy to do with Genotype Harmonizer, using the commands in the wiki.

If you are allowed and comfortable, I'd like to see what kind of data you are using so that I can debug this Issue on my side. I understand this may be problematic, but I'm really interested in what goes wrong, so I can change my program subsequently.
Should you be willing to provide me with this, I'd be very interested in the vcf file (it only needs to contain 100 variants) and a bam file with transcripts that overlap some of these variants. That way, I may be able to solve the problem in a later iteration.

Thanks, Adriaan

Colin Davenport · Answer 6 · Tue May 17 2016 20:32:56 GMT+0800 (China Standard Time)

Thanks, I'll try the TriTyper format. Keep in mind this is very sketchy non-model-organism data (plants), so it's all a lot less sophisticated than what you might be used to, coming from the human arena. I need to build a public test case but don't have one ready as yet.

Another observation:

This is interesting: when I purposefully destroy the example coupling file, replacing testBam as testBamXXX, and renaming as coupling_colin I get an entry in the sample_map output (sample_map: {testBamXXX=0}),
which I didn't have with my own data (see above, sample_map: {}). That is a bit strange isn't it ?

Might this not be due to the use of TriTyper as opposed to VCF format ?

java -jar ../../cellTypeSpecificAlleleSpecificExpression-0.1-jar-with-dependencies.jar -A 1 -C coupling_colin.txt -O test2 --bam_file testBam.bam --genotype_location TriTyperFolder/
---- Starting ASREADS for the following settings: ----
input bam: testBam.bam
genotype location: TriTyperFolder/
coupling file: coupling_colin.txt
output location: test2

snp Location: NONE

Location of bam file:
testBam.bam
Location of bam file is an existing file, will continue.
2016-05-17 13:57:25,306 INFO [TriTyperGenotypeData] Loaded 1 out of 1 samples.
2016-05-17 13:57:25,337 INFO [TriTyperGenotypeData] Loaded 12 out of 12 SNPs, 12 of loaded SNPs have annotation.
Sample names were loaded.
sample_name: testBam
sample_map: {testBamXXX=0}
Exception in thread "main" java.lang.IllegalArgumentException: Couldn't find the filename in the sample names. Quitting.
at nl.systemsgenetics.cellTypeSpecificAlleleSpecificExpression.ReadGenoAndAsFromIndividual.readGenoAndAsFromIndividual(ReadGenoAndAsFromIndividual.java:180)
at nl.systemsgenetics.cellTypeSpecificAlleleSpecificExpression.MainEntryPoint.main(MainEntryPoint.java:354)

Adriaan van der Graaf · Answer 7 · Tue May 17 2016 20:38:34 GMT+0800 (China Standard Time)

It is strange, but I'm afraid I don't want to go into it before I have a minimal test case here.
The fact that it is plant data is fine.
Responding to an earlier question, non-phased data is fine as well.

Thanks, Adriaan

Colin Davenport · Answer 8 · Wed May 18 2016 18:40:02 GMT+0800 (China Standard Time)

Yep, can understand that. I'll be building a test case with pure public data in the next days/weeks as far as I can get the time together.

Interestingly, using Trityper input (created from the VCF as per the default instructions for GenotypeHarmonizer you pointed out) came up with a naming error. This will hopefully come up in the test case as well though.

Location of bam file is an existing file, will continue.
2016-05-18 08:07:10,796 INFO [TriTyperGenotypeData] Loaded 1 out of 1 samples.
2016-05-18 08:07:11,438 INFO [TriTyperGenotypeData] Loaded 126532 out of 126532 SNPs, 126532 of loaded SNPs have annotation.
2016-05-18 08:07:11,438 INFO [GeneticVariantRange] Variants where not sorted, loading will be faster if the data is pre-sorted
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
at nl.systemsgenetics.cellTypeSpecificAlleleSpecificExpression.ReadGenoAndAsFromIndividual.convert_individual_names(ReadGenoAndAsFromIndividual.java:560)
at nl.systemsgenetics.cellTypeSpecificAlleleSpecificExpression.ReadGenoAndAsFromIndividual.readGenoAndAsFromIndividual(ReadGenoAndAsFromIndividual.java:160)
at nl.systemsgenetics.cellTypeSpecificAlleleSpecificExpression.MainEntryPoint.main(MainEntryPoint.java:354)

Colin Davenport · Answer 9 · Thu May 19 2016 20:51:28 GMT+0800 (China Standard Time)

Interestingly, the test case I built passed the test with flying colours ...... (dammit!).

So no need to pass that one on to you as yet. I'll try the rest of the pipeline and report back. Potentially I'll just need to rename my true samples (they have a lot of underscores in them which may be a reserved character or split somewhere in the tool).

Adriaan van der Graaf · Answer 10 · Fri May 20 2016 16:10:41 GMT+0800 (China Standard Time)

Hmm, that could be it; I will look into it further.
I'm still interested in what causes the problem. I do have some funky logic in the handling of the bam name, but this should not really matter unless there are dot characters in the bam names.
Thank you for checking out the program. If you can find a reproduction, I'll definitely take a look at it, but I'm going to close this issue now.

Thanks, Adriaan

sbombin · Answer 11 · Fri Sep 09 2016 01:29:29 GMT+0800 (China Standard Time)

Hello,

I have exactly the same problem.

My error message is:
Exception in thread "main" java.lang.IllegalArgumentException: Couldn't find the filename in the sample names. Quitting.
at nl.systemsgenetics.cellTypeSpecificAlleleSpecificExpression.ReadGenoAndAsFromIndividual.readGenoAndAsFromIndividual(ReadGenoAndAsFromIndividual.java:185)
at nl.systemsgenetics.cellTypeSpecificAlleleSpecificExpression.MainEntryPoint.main(MainEntryPoint.java:354)
Did you find the solution already?

Adriaan van der Graaf · Answer 12 · Fri Sep 09 2016 08:04:49 GMT+0800 (China Standard Time)

Hey sbombin,

Thanks for having a look at CS-ASE. Do you have any dot characters "." in your filenames, other than in the .bam extension? If you remove these from the bam filename, it should solve the problem. Just to be sure, could you paste the terminal command you used and the coupling file?
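To illustrate why extra dots in a bam filename can matter, here is a Python sketch of two ways a sample name might be derived from a bam path. Both functions are hypothetical and are not taken from the CS-ASE source; they only show how a name cut at the first dot can differ from the name written in the coupling file.

```python
import os


def name_up_to_first_dot(bam_path):
    # Hypothetical rule: keep everything before the first dot in the file name.
    return os.path.basename(bam_path).split(".")[0]


def name_without_bam_extension(bam_path):
    # Hypothetical rule: strip only a trailing ".bam" and keep other dots.
    base = os.path.basename(bam_path)
    return base[:-len(".bam")] if base.endswith(".bam") else base


if __name__ == "__main__":
    # Made-up paths; the second one has extra dots in its name.
    for path in ["x_rep1.bam", "bam_files/sample.v2.sorted.bam"]:
        print(path, name_up_to_first_dot(path), name_without_bam_extension(path))
```

For `x_rep1.bam` both rules give `x_rep1`, which matches the `sample_name: x_rep1` line in the log above. For `sample.v2.sorted.bam` the first rule gives `sample` and the second gives `sample.v2.sorted`, so a coupling file written with the longer name would not match under the first rule.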

Thanks, Adriaan

sbombin · Answer 13 · Tue Sep 13 2016 11:03:03 GMT+0800 (China Standard Time)

Hello,
I do not have any dots in my file names.
The command that I used is:
java -jar cellTypeSpecificAlleleSpecificExpression-0.1-jar-with-dependencies.jar --action ASreads --output bc_a-ASreads.txt --coupling_file bc_a_bc_b_coupling.txt --genotype_location triyper_files/bc_a_final/ --bam_file bam_files/bc_a_split.bam > bc_a_ASreads_output.txt

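I actually converted my .vcf reference file to TriTyper format. I attached a screenshot of my coupling file because it does not look the same after downloading:
https://cloud.githubusercontent.com/assets/17582109/18460097/84e5c410-7923-11e6-9f43-7909af519f8d.png

Thanks for your assistance.

bc_a_bc_b_coupling.txt
https://github.com/molgenis/systemsgenetics/files/468775/bc_a_bc_b_coupling.txt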

Adriaan van der Graaf · Answer 14 · Tue Sep 13 2016 12:21:26 GMT+0800 (China Standard Time)

Could you try removing the underscore characters in the bam names?
Then it should work.

Adriaan

sbombin · Answer 15 · Tue Sep 13 2016 23:11:21 GMT+0800 (China Standard Time)

Hello,

I am sorry, but it still does not work. I deleted all underscores from my file names and from the names inside the coupling file. I even removed the underscores from the name of the coupling file, but I got the same error message:
Exception in thread "main" java.lang.IllegalArgumentException: Couldn't find the filename in the sample names. Quitting.
at nl.systemsgenetics.cellTypeSpecificAlleleSpecificExpression.ReadGenoAndAsFromIndividual.readGenoAndAsFromIndividual(ReadGenoAndAsFromIndividual.java:185)
at nl.systemsgenetics.cellTypeSpecificAlleleSpecificExpression.MainEntryPoint.main(MainEntryPoint.java:354)

sbombin · Answer 16 · Wed Sep 14 2016 00:00:19 GMT+0800 (China Standard Time)

I also tried to use a vcf file instead of the TriTyper files. When my vcf file was unzipped, I got an error that it could not be found, even when I specified the exact path and vcf file name. When the vcf file was zipped and indexed, I got this error:
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
at nl.systemsgenetics.cellTypeSpecificAlleleSpecificExpression.ReadGenoAndAsFromIndividual.convert_individual_names(ReadGenoAndAsFromIndividual.java:582)
at nl.systemsgenetics.cellTypeSpecificAlleleSpecificExpression.ReadGenoAndAsFromIndividual.readGenoAndAsFromIndividual(ReadGenoAndAsFromIndividual.java:165)
at nl.systemsgenetics.cellTypeSpecificAlleleSpecificExpression.MainEntryPoint.main(MainEntryPoint.java:354)

Sincerely,
Sergei