veg / tn93

TN93 fast distance calculator

Invalid character results in wrong error message ("All sequences must have the same length")

niemasd opened this issue

I have one sequence (hCoV_19_Norway_1539_2020_EPI_ISL_417487) that tn93 keeps reporting as one character shorter than it actually is (or at least seems to be). I have attached a minimal working example below:

example.txt

I tried to run tn93 as follows:

cat example.aln | tn93 -l 1 -t 1

But I get the following error message:

All sequences must have the same length (29811), but sequence 'hCoV_19_Norway_1539_2020_EPI_ISL_417487' had length 29810

However, I tried checking it in Python (lines[3] is the problematic sequence):

lines = open('example.txt').readlines()

len(lines[1])  # prints 29812 (includes the newline at the end)
lines[1][:10]  # 'CTTCCCAGGT'
lines[1][-10:] # 'AATTTTAGT\n'
set(lines[1])  # {'\n', 'R', 'G', 'A', 'C', 'T', 'M'}

len(lines[3])  # prints 29812 (includes the newline at the end)
lines[3][:10]  # 'CTTCCCAGGT'
lines[3][-10:] # 'AATTTTAGT\n'
set(lines[3])  # {'V', 'S', '\n', 'R', 'G', 'I', 'A', 'C', 'Y', 'T'}

len(lines[5])  # prints 29812 (includes the newline at the end)
lines[5][:10]  # '----------'
lines[5][-10:] # 'AATTTTAGT\n'
set(lines[5])  # {'\n', 'G', 'A', '-', 'C', 'T'}

Excluding the newline character after every line (which is included in the lengths printed by the above code), each sequence has exactly 29811 characters.

The only weird character I see in the problematic sequence is I, which doesn't seem to be a standard IUPAC character. Thoughts?

Actually, yes, it seems as though the I was the culprit. Replacing I with N makes tn93 run properly.

example_replaced.txt

Read 3 sequences of length 29811
Will perform 3 pairwise distance calculations
Progress: ID1,ID2,Distance
Progress:       0% (       0 links found,         -nan evals/sec)hCoV_19_Norway_1539_2020_EPI_ISL_417487,hCoV_19_Pakistan_Gilgit1_2020_EPI_ISL_417444,0.000335814
Progress:    33.3% (       1 links found,          inf evals/sec)hCoV_19_Norway_1538_2020_EPI_ISL_417486,hCoV_19_Norway_1539_2020_EPI_ISL_417487,6.70927e-05
hCoV_19_Norway_1538_2020_EPI_ISL_417486,hCoV_19_Pakistan_Gilgit1_2020_EPI_ISL_417444,0.000268644
Progress:     100% (       3 links found,          inf evals/sec)
{
        "Actual comparisons performed" :3,
        "Comparisons accounting for copy numbers " :3,
        "Total comparisons possible" : 3,
        "Links found" : 3,
        "Maximum distance" : 0.000336,
        "Sequences" : 3,
        "Mean distance" : 0.000224,
        "Histogram" : [[0.005,3],[0.01,0],[0.015,0],[0.02,0],[0.025,0],[0.03,0],[0.035,0],[0.04,0],[0.045,0],[0.05,0],[0.055,0],[0.06,0],[0.065,0],[0.07,0],[0.075,0],[0.08,0],[0.085,0],[0.09,0],[0.095,0],[0.1,0],[0.105,0],[0.11,0],[0.115,0],[0.12,0],[0.125,0],[0.13,0],[0.135,0],[0.14,0],[0.145,0],[0.15,0],[0.155,0],[0.16,0],[0.165,0],[0.17,0],[0.175,0],[0.18,0],[0.185,0],[0.19,0],[0.195,0],[0.2,0],[0.205,0],[0.21,0],[0.215,0],[0.22,0],[0.225,0],[0.23,0],[0.235,0],[0.24,0],[0.245,0],[0.25,0],[0.255,0],[0.26,0],[0.265,0],[0.27,0],[0.275,0],[0.28,0],[0.285,0],[0.29,0],[0.295,0],[0.3,0],[0.305,0],[0.31,0],[0.315,0],[0.32,0],[0.325,0],[0.33,0],[0.335,0],[0.34,0],[0.345,0],[0.35,0],[0.355,0],[0.36,0],[0.365,0],[0.37,0],[0.375,0],[0.38,0],[0.385,0],[0.39,0],[0.395,0],[0.4,0],[0.405,0],[0.41,0],[0.415,0],[0.42,0],[0.425,0],[0.43,0],[0.435,0],[0.44,0],[0.445,0],[0.45,0],[0.455,0],[0.46,0],[0.465,0],[0.47,0],[0.475,0],[0.48,0],[0.485,0],[0.49,0],[0.495,0],[0.5,0],[0.505,0],[0.51,0],[0.515,0],[0.52,0],[0.525,0],[0.53,0],[0.535,0],[0.54,0],[0.545,0],[0.55,0],[0.555,0],[0.56,0],[0.565,0],[0.57,0],[0.575,0],[0.58,0],[0.585,0],[0.59,0],[0.595,0],[0.6,0],[0.605,0],[0.61,0],[0.615,0],[0.62,0],[0.625,0],[0.63,0],[0.635,0],[0.64,0],[0.645,0],[0.65,0],[0.655,0],[0.66,0],[0.665,0],[0.67,0],[0.675,0],[0.68,0],[0.685,0],[0.69,0],[0.695,0],[0.7,0],[0.705,0],[0.71,0],[0.715,0],[0.72,0],[0.725,0],[0.73,0],[0.735,0],[0.74,0],[0.745,0],[0.75,0],[0.755,0],[0.76,0],[0.765,0],[0.77,0],[0.775,0],[0.78,0],[0.785,0],[0.79,0],[0.795,0],[0.8,0],[0.805,0],[0.81,0],[0.815,0],[0.82,0],[0.825,0],[0.83,0],[0.835,0],[0.84,0],[0.845,0],[0.85,0],[0.855,0],[0.86,0],[0.865,0],[0.87,0],[0.875,0],[0.88,0],[0.885,0],[0.89,0],[0.895,0],[0.9,0],[0.905,0],[0.91,0],[0.915,0],[0.92,0],[0.925,0],[0.93,0],[0.935,0],[0.94,0],[0.945,0],[0.95,0],[0.955,0],[0.96,0],[0.965,0],[0.97,0],[0.975,0],[0.98,0],[0.985,0],[0.99,0],[0.995,0],[1,0]]
}
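For anyone hitting the same error, the I-to-N replacement described above can be sketched in Python as follows. This is only an illustration of the workaround, not part of tn93: the helper name `replace_in_sequences` is hypothetical. It edits sequence lines only, because the FASTA header lines in this dataset also contain the letter I (e.g. `EPI_ISL`), so a blind search-and-replace over the whole file would corrupt the sequence IDs.

```python
def replace_in_sequences(fasta_text, bad="I", replacement="N"):
    """Replace `bad` with `replacement` on sequence lines of a FASTA string.

    Header lines (those starting with '>') are copied through unchanged,
    so sequence names such as '..._EPI_ISL_417487' keep their 'I' characters.
    (Hypothetical helper; not part of tn93.)
    """
    fixed_lines = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fixed_lines.append(line)  # header: leave untouched
        else:
            fixed_lines.append(line.replace(bad, replacement))
    # Re-terminate every line with a newline; empty input stays empty.
    return "".join(line + "\n" for line in fixed_lines)


# Possible usage with the attachment names from this issue:
#   with open("example.txt") as src:
#       fixed = replace_in_sequences(src.read())
#   with open("example_replaced.txt", "w") as dst:
#       dst.write(fixed)
```

Mapping I to N treats the unknown symbol as a fully ambiguous base, which is the same choice made manually above.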

I would suggest a more descriptive error message (e.g. "Invalid Character: I").

Dear @niemasd,

The acceptable list of characters is shown at

const char ValidChars[] = "ACGTURYSWKMBDHVN?-",

These are indeed IUPAC-based. I agree that a more descriptive error message might be in order (to suggest that the user look for non-IUPAC letters), but the current assumption is that most FASTA files are going to contain some non-sequence characters (e.g. newlines, spaces, etc.).

Best,
Sergei
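Until the error message changes, a pre-check along the following lines may help locate the offending characters before running tn93. It is a minimal sketch, not tn93's own logic: the valid set is copied from the `ValidChars` string quoted above, while the function name `find_invalid_characters`, the decision to ignore whitespace, and the decision to accept lowercase letters are assumptions made for illustration.

```python
# Copied from the ValidChars string quoted in this thread.
VALID_CHARS = set("ACGTURYSWKMBDHVN?-")


def find_invalid_characters(fasta_text, valid=VALID_CHARS):
    """Return {sequence_name: {character: count}} for characters outside `valid`.

    Assumptions (not confirmed against the tn93 source): whitespace is
    ignored, and lowercase letters are checked via their uppercase form.
    Sequences with no invalid characters do not appear in the result.
    """
    report = {}
    name = None  # stays None for sequence lines that precede any header
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            name = line[1:].strip()
            continue
        for ch in line:
            if ch.isspace() or ch.upper() in valid:
                continue
            counts = report.setdefault(name, {})
            counts[ch] = counts.get(ch, 0) + 1
    return report


# Possible usage with the attachment from this issue:
#   with open("example.txt") as handle:
#       for seq_name, counts in find_invalid_characters(handle.read()).items():
#           print(f"{seq_name}: invalid characters {counts}")
```

If characters outside the valid list are skipped when sequences are read, as the reported length of 29810 instead of 29811 suggests, then each character flagged here would shorten that sequence by one and could trigger the "same length" error.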