single sequence input fails with IndexError
kiramt opened this issue · comments
Hi - I saw your MLCB talk and was hoping to try out fastism. I was working through the code in your tutorial, as below:
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model("deepseabeluga.h5")
chr3_enhancer = "CCGGTGATTTTCTGGAGTCTATATCCTTCATCAGATTTTCCAAGGGGTGTCTGTCCCCTCAAAAGAATGATTGTCATTATTTGAAAGACTAG" \
"TTCCAGACAGATATTTTATACAAATTTTCCCAGCATTGACATCCCTGAACCAAACTGTTTTTCTTCCCAACATTACTGTTTTCTTCCTTTCT" \
"GTCGAGTTTGTTGTTTTGTAATATCAGAATCTCCAGCTCACCTGAGTAAATGGTAACAAGGTGCCACCACCTTTGAATTCTCCCAGAATCCA" \
"CCCCACCCTCCGTCAGAGCCACTGCCAAGGCACTCTTACTGATTTCTCCCACACTGCTGGCTCATTGCAAGTGGGAAGACAGCATGTGGAGT" \
"GGGTGTGCGGCTTATTAAAGTGAGAACTCAGGGTCAGGGCAGAACCAGGAAGAGAGCAGTGAGATATCCTGCTACCTAATCCAATTCTCCTT" \
"TTTGTGCATTTAGCACCCTCCCCTCCGCCTGCATAACAATGGAAGGAAAGAGGAAGTGGGAAAAAAGAAAGTCATGTAATTGAGTTAGAAGA" \
"GGTAATGACCAAGACCCTGGAGCAGAGGGAAAGCGGGTTACAAAAGGTGGGTTAAAGAAATCACAAGAGTATGAAGAGCTGGGAAATTACTA" \
"ACAAATATTTGCTTGTGTGGGAAAGCAAAAAAGTAAAAACTTCAGTGCTGAATTGGGGCGCTGAGCCACCAGGGAAATTTGAGATTGGCATC" \
"AAGGACCGTGTTGAAGCAGGGTGGGCGGAGAAGGAGGGAAAACTACCAGCCAGCTGAGATTTTGCAGCTAGGCTGTGGCCTGATACCGAGTA" \
"TCGATGCCGCAAGGGAGGGATGAGTCAGTCCTAGCACGTCCAAGTTTAGAATAATAGACTGTTTGCCACTGGGAAGGCAAACACCTTTCCTG" \
"TGAGAGGGCTTGCTGACAGTTCCAATGTCCAAAGTCCAATGCCGACCCAGAAAACTGAGGAGGCCCTGGCCCCTGCAGGAAGGGCTCATTTA" \
"CATGGAGACTGAGTAAAGTGCTGTCTTAAACCCTCCTTCCTTCCCCCACTGGGAGGTTTCAGCCAGATATGCCACCCTTTGTAGGATTTCAT" \
"AGGGTTGTCTAAAGCCAGGGTTGGCACAGAGCAGAAGCCACAGGGCTAAGTACCAGATTATAATTGTCAATGTCACACCTTACTGCAGAAGC" \
"CAGGGAAGGGAGCTAGGAAACTGAAGAGCTTTCTTGGTTATGGGCGGGGCTGTAAATGCAGAGTGTGCCCTGGTGACTCATGGGAGACAGTG" \
"AGAAACACTGTGGGGATCTGGTCAACCGGGTACTGATTCCTTTGAGGAAGGTATACTCCACATGCCAACCTGATACTCATGGCTAGTGAAGA" \
"GATGGCAGGATTGGGTTGCATCAGCCAGCCTAACTCGACTTGGAAACACAGAAAATAACCCAGAGCAGGTCTCAAGCACTGTGTAACTTTAT" \
"TAGTTCATAGTGGCTGAACAGCCATGTTTAGGGCCTCTCAGAAGAAAGAGTTTCATCTTTGGGAAGAAATTTGTGTTGGGTGATTTTGTTCA" \
"TATAATTTTGTGTTTTTTGTTTTGTTTTGGTGTTTGAGACAGGGCCTCACTCTCTCACACAGGCTGGAGTGCAGTGGCACCATCTTAGCTCA" \
"CTGCAACCTCTACCTTCCTGCCTCAAGCGATCCTCCTACTTCAGCCTCCTGCATAGCTGGGACTACAGGCACGTATCACTCAACCCAGCTAA" \
"TTTTTTTTTTTTCGAGATGCAGTCTTGCTCTGTCACCCAGGCTGGAGAGCAATGGCACTATCTTGGCTCACTGTAACCCCCGCCTCCCAGTC" \
"TCTGCCTCCTGAGTAGCTGGGATTACAGGCTCCTGCCACCACCCCCGGCTCAGCTAATTATTTCTTTCTTTCTTTTTTCTGAGATGAAGTTT" \
"CACTCTTGTTGCCCAGGCTGGAGTGCAATGGCACGATCTCAGCTCACTGCAATGTCTGCTTCTGGGGT"
sequences = [chr3_enhancer]*1
# We define a function to do the one-hot encoding
onehot_mapping = {
    'A': [1,0,0,0],
    'C': [0,1,0,0],
    'G': [0,0,1,0],
    'T': [0,0,0,1],
    'N': [0,0,0,0],
    'a': [1,0,0,0],
    'c': [0,1,0,0],
    'g': [0,0,1,0],
    't': [0,0,0,1],
}
def one_hot_encode(sequence):
    return np.array([onehot_mapping[x] for x in sequence])
onehot_sequences = np.array([one_hot_encode(x) for x in sequences])
x = tf.constant(onehot_sequences, dtype=model.input.dtype)
mutations = [[1,0,0,0],
             [0,1,0,0],
             [0,0,1,0],
             [0,0,0,1]]
from fastism import FastISM
fast_ism_model = FastISM(model, test_correctness=False)
fast_ism_out = [fast_ism_model(x, replace_with=mut) for mut in mutations]
It runs fine when I supply 5 x chr3_enhancer, but if I make it a batch of 1 sequence I get the following error:
Traceback (most recent call last):
  File "...test.py", line 328, in test_example
    fast_ism_out = [fast_ism_model(x, replace_with=mut) for mut in mutations]
  File "...test.py", line 328, in <listcomp>
    fast_ism_out = [fast_ism_model(x, replace_with=mut) for mut in mutations]
  File "...python3.7/site-packages/fastism/ism_base.py", line 78, in __call__
    ism_ith_output = self.get_ith_output(inp_batch, i, idxs_to_mutate)
  File "...python3.7/site-packages/fastism/fast_ism.py", line 68, in get_ith_output
    fast_ism_inputs = self.prepare_ith_input(self.padded_inputs, i, idxs_to_mutate)
  File "...python3.7/site-packages/fastism/fast_ism.py", line 73, in prepare_ith_input
    num_to_mutate = idxs_to_mutate.shape[0]
  File "...python3.7/site-packages/tensorflow/python/framework/tensor_shape.py", line 887, in __getitem__
    return self._dims[key].value
IndexError: list index out of range
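The last frame of the traceback fails inside `TensorShape.__getitem__`, which suggests that `idxs_to_mutate` had no dimensions at that point (a rank-0 tensor), so `shape[0]` had nothing to index. The sketch below reproduces the same kind of failure with NumPy. It is only an analogy: how the dimension was lost inside fastISM is not shown in this thread, and the use of `squeeze` here is an assumption for illustration, not fastISM's actual code.

```python
import numpy as np

# Hypothetical stand-in for an index array holding a single position to mutate.
idxs = np.array([7])
print(idxs.shape[0])  # the array has one axis, so shape[0] is valid

# Assumption for illustration: a squeeze drops the length-1 axis,
# leaving a rank-0 array whose shape is the empty tuple ().
squeezed = np.squeeze(idxs)

try:
    squeezed.shape[0]
except IndexError as err:
    # An empty shape has no entry 0, much like the empty dims list in the traceback.
    print("IndexError:", err)
```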
Hi Kira, thanks for trying fastISM out! You're right, it seems to be bugging out for a batch size of 1. I'll look into it.
fastISM performs best when each batch holds as many sequences as GPU memory allows. For small batch sizes it may well end up slower than a standard implementation because of overheads. If you could roughly describe your use case, I may be able to offer more help.
Thanks Surag. I expect I'd mostly be running with larger batch sizes anyway, and could fall back on the standard implementation if a small batch was required. I was using the single sequence (with my own model etc.) just as a check that I had my input and output processing set up correctly, so when I got an error I went back to the tutorial to see if I had done something wrong.
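For reference, the "standard implementation" fallback mentioned here can be sketched as a plain loop that mutates one position at a time and re-runs the model. This is a minimal sketch, not fastISM's code: `naive_ism` and `toy_predict` are hypothetical names, the output layout `(batch, length, n_outputs)` is a choice made for this example, and `toy_predict` is a toy stand-in for a real model.

```python
import numpy as np

def naive_ism(predict, x, replace_with):
    """Mutate each position in turn and collect the predictions.

    predict: callable mapping a (batch, length, 4) array to (batch, n_outputs)
    x: one-hot array of shape (batch, length, 4)
    replace_with: length-4 vector written at the mutated position
    Returns an array of shape (batch, length, n_outputs).
    """
    x = np.asarray(x, dtype=np.float32)
    outputs = []
    for pos in range(x.shape[1]):
        mutated = x.copy()
        mutated[:, pos, :] = replace_with  # overwrite one position per pass
        outputs.append(predict(mutated))
    return np.stack(outputs, axis=1)

def toy_predict(batch):
    # Toy stand-in for a model: counts the 'A' bases in each sequence.
    return batch[:, :, 0].sum(axis=1, keepdims=True)

# One sequence, "ACGTA", as a batch of size 1.
x = np.eye(4, dtype=np.float32)[[0, 1, 2, 3, 0]][None]
out = naive_ism(toy_predict, x, [1, 0, 0, 0])
print(out.shape)  # (1, 5, 1)
```

With a Keras model, something like `model.predict` could be passed as `predict`. The loop runs one forward pass per position, so it is only practical as a correctness check or for small inputs.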
Sounds good! Please don't hesitate to reach out if you get stuck. I'll get to the batch size 1 case soon.
Hi Surag, I also get the same error if I input 2 sequences which are not identical, e.g. if I set chr3_enhancer_a to chr3_enhancer with the first base changed to G, and use sequences = [chr3_enhancer, chr3_enhancer_a].
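A minimal sketch of how such a two-sequence batch could be built, assuming the same one-hot mapping as in the original post. The 10-base string is a shortened stand-in for the full `chr3_enhancer` (its first 10 bases), used only to keep the example small.

```python
import numpy as np

onehot_mapping = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
}

# Shortened stand-in for the full enhancer sequence.
chr3_enhancer = "CCGGTGATTT"

# Same sequence with only the first base replaced by G.
chr3_enhancer_a = "G" + chr3_enhancer[1:]

sequences = [chr3_enhancer, chr3_enhancer_a]
onehot_sequences = np.array(
    [[onehot_mapping[base] for base in seq] for seq in sequences]
)
print(onehot_sequences.shape)  # (2, 10, 4)
```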
Hi Kira, I've pushed some fixes to v0.4.2. Please give it a try and let me know if it works. Thanks!
Thanks Surag - that seems to have fixed it!
Great, thanks!