Don't enforce "Non-unique gene name" and "Annotation" columns

AdmiralenOla opened this issue 7 years ago · comments

Stop enforcing the "Non-unique gene name" and "Annotation" columns in the output. Some users might have an input file with only a single identifier column (Gene ID) before the sample info starts, and want to run with -s 2.

In the current version, this causes Scoary to fill the "Non-unique gene name" and "Annotation" columns with sample data, because it automatically assumes that this information is in columns 2 and 3. There is really no need to enforce any columns other than Gene ID.
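
To make the problem concrete, here is a minimal sketch of splitting a row by the start column instead of by fixed positions. It is not Scoary's actual code: `split_row` and the sample input are hypothetical, and only the idea of a 1-based start column (as with -s) comes from the issue.

```python
import csv
import io

def split_row(header, row, start_col):
    """Split one row into identifier/annotation fields and sample fields.

    start_col is the 1-based column where sample data begins
    (hypothetical stand-in for the -s option discussed here).
    """
    n_meta = start_col - 1
    meta = dict(zip(header[:n_meta], row[:n_meta]))
    samples = dict(zip(header[n_meta:], row[n_meta:]))
    return meta, samples

# Hypothetical input: a single identifier column, then the samples.
text = "Gene,Sample1,Sample2\ngeneA,1,0\n"
rows = list(csv.reader(io.StringIO(text)))
header, first = rows[0], rows[1]

meta, samples = split_row(header, first, start_col=2)
print(meta)     # only the Gene ID, nothing borrowed from sample columns
print(samples)  # all sample columns, none mislabelled as annotation
```

With this kind of split, everything before the start column is treated as metadata, so no sample value can end up under "Non-unique gene name" or "Annotation".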

dutchscientist · Answer 1 · Fri Apr 21 2017 21:32:05 GMT+0800 (China Standard Time)

Actually, I was just about to suggest an alternative, allowing the user to specify column numbers to be included in the output (so I can see the gene numbers of specific strains in the dataset in the Scoary output).

For now I am modifying the "Non-unique gene name" column for this and then splitting that one out.

Ola Brynildsrud · Answer 2 · Thu May 04 2017 20:18:33 GMT+0800 (China Standard Time)

Hi! Trying to wrap my head around this, but I don't quite see how it would work. I think I'm confused by "gene numbers of specific strains in the dataset". Do you mean grabbing columns from the input Roary file or producing some kind of aggregate column? Would you mind giving an example?

dutchscientist · Answer 3 · Thu May 04 2017 20:48:36 GMT+0800 (China Standard Time)

The way I envisage it is similar to the existing switch that makes Scoary start counting from column 15 in the Roary output. Say that these are the headers from a Roary output:

Gene, Non-unique Gene name, Annotation, No. isolates, No. sequences, Avg sequences per isolate, Genome Fragment, Order within Fragment, Accessory Fragment, Accessory Order with Fragment, QC, Min group size nuc, Max group size nuc, Avg group size nuc, Sample1, Sample2, Sample3

The Scoary output will contain the first 3 columns followed by the counts, etc.:

Gene, Non-unique gene name, Annotation

I would like to have a switch that also includes the information in the rows for Sample1, Sample2 and/or Sample3, something like "--columns_included 16,17,18". The group output of Roary is not always informative, but the gene number can be. I will see whether I can upload an example.
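
A rough sketch of what I mean, in Python. The function names and the tiny example table are made up for illustration and are not part of Scoary; the only assumption carried over from above is that columns are given as 1-based, comma-separated numbers.

```python
def parse_columns(spec):
    """Turn a 1-based spec such as '16,17,18' into zero-based indices."""
    return [int(part) - 1 for part in spec.split(",") if part.strip()]

def extra_columns(header, row, spec):
    """Pick the requested input columns from one row, keyed by header name."""
    picked = {}
    for idx in parse_columns(spec):
        if not 0 <= idx < len(header):
            raise ValueError(f"column {idx + 1} is out of range")
        picked[header[idx]] = row[idx]
    return picked

# Hypothetical, shortened table: the sample cells hold per-strain gene IDs.
header = ["Gene", "Annotation", "Sample1", "Sample2"]
row = ["group_1", "hypothetical protein", "strainA_00012", ""]

extras = extra_columns(header, row, "3,4")
print(extras)
```

The selected values would then simply be appended to each gene's line in the results, next to the usual statistics.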

Ola Brynildsrud · Answer 4 · Tue May 09 2017 17:19:08 GMT+0800 (China Standard Time)

OK, I think I understand what you mean now. Sure, I can implement that, should be fairly easy! I will schedule it for the next release.

dutchscientist · Answer 5 · Tue May 09 2017 17:20:38 GMT+0800 (China Standard Time)

Cool! I aim to get you a lot of citations and help you increase your h-index ;-)

Ola Brynildsrud · Answer 6 · Mon Jul 03 2017 20:14:06 GMT+0800 (China Standard Time)

Hi @dutchscientist. This functionality is included in the latest version. Hope you like it!

dutchscientist · Answer 7 · Thu Jul 06 2017 10:19:12 GMT+0800 (China Standard Time)

Hi Ola, great! Will try it soon (currently travelling for a few weeks) :)

dutchscientist · Answer 8 · Wed Jul 12 2017 15:05:03 GMT+0800 (China Standard Time)

Yes, this is great! Exactly what I wanted, the --include_input_columns is just what I needed. Thanks very much!