AllonKleinLab / SPRING_dev


Problem loading counts matrices - HPCs tutorial

sejjbia opened this issue · comments

Hi
I am a python novice and I am interested in using SPRING
With some adaptations I successfully managed to run the pbmc4k tutorial on python 3.7.3
Instead when trying the HPCs tutorial, upon loading the counts matrices at this stage...

for s in sample_name:
    print('_________________', s)

    if os.path.isfile(input_path + s + '.raw_counts.unfiltered.npz'):
        print('Loading from npz file')
        D[s]['E'] = scipy.sparse.load_npz(input_path + s + '.raw_counts.unfiltered.npz')
    else:
        print('Loading from text file')
        E,cell_bcs = load_text(file_opener(input_path + s + '.counts.tsv.gz'), delim = '\t', load_cell_bcs=True)
        D[s]['E'] = E
        D[s]['cell_bcs'] = cell_bcs
        scipy.sparse.save_npz(input_path + s + '.raw_counts.unfiltered.npz', D[s]['E'], compressed = True)
    print(D[s]['E'].shape)

....I get the following error:

UnicodeDecodeError Traceback (most recent call last)
in
10 else:
11 print('Loading from text file')
---> 12 E,cell_bcs = load_text(file_opener(input_path + s + '.counts.tsv.gz'), delim = '\t', load_cell_bcs=True)
13 D[s]['E'] = E
14 D[s]['cell_bcs'] = cell_bcs

~/SPRING_dev-spring-of-rebirth/data_prep/spring_helper.py in load_text(file_data, delim, load_cell_bcs)
149 start_column = -1
150 start_row = -1
--> 151 for row_ix, dat in enumerate(file_data):
152 dat = dat.strip('\n').split(delim)
153 if start_row == -1:

~/anaconda3/lib/python3.7/gzip.py in readline(self, size)
372 def readline(self, size=-1):
373 self._check_not_closed()
--> 374 return self._buffer.readline(size)
375
376

~/anaconda3/lib/python3.7/_compression.py in readinto(self, b)
66 def readinto(self, b):
67 with memoryview(b) as view, view.cast("B") as byte_view:
---> 68 data = self.read(len(byte_view))
69 byte_view[:len(data)] = data
70 return len(data)

~/anaconda3/lib/python3.7/gzip.py in read(self, size)
461 # jump to the next member, if there is one.
462 self._init_read()
--> 463 if not self._read_gzip_header():
464 self._size = self._pos
465 return b""

~/anaconda3/lib/python3.7/gzip.py in _read_gzip_header(self)
404
405 def _read_gzip_header(self):
--> 406 magic = self._fp.read(2)
407 if magic == b'':
408 return False

~/anaconda3/lib/python3.7/gzip.py in read(self, size)
89 self._read = None
90 return self._buffer[read:] +
---> 91 self.file.read(size-self._length+read)
92
93 def prepend(self, prepend=b''):

~/anaconda3/lib/python3.7/codecs.py in decode(self, input, final)
320 # decode input (taking the buffer into account)
321 data = self.buffer + input
--> 322 (result, consumed) = self._buffer_decode(data, self.errors, final)
323 # keep undecoded input until the next call
324 self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

I tried several workarounds, with no luck, on the file_opener() function of spring_helper.py (I got the latest version, the one with "Added cell_BC export"), as my hypothesis is that this is somehow related to the GzipFile missing the utf-8 encoding.
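For what it's worth, the byte named in the error message is consistent with this hypothesis; a minimal sketch (not part of SPRING itself):

```python
# The two-byte gzip magic number is b'\x1f\x8b'. Byte 0x8b at position 1
# is not valid UTF-8 -- exactly the byte and position in the traceback above,
# so any text-mode read of the raw .gz file fails on its very first bytes.
magic = b"\x1f\x8b"
try:
    magic.decode("utf-8")
    failed = False
except UnicodeDecodeError as err:
    failed = True
    print(err)  # 'utf-8' codec can't decode byte 0x8b in position 1 ...
```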

I might be completely wrong and/or missing something totally obvious... Could you help me?

Thank you for your reply.
The files are the ones you provided as samples for this analysis (see below).

P9A.counts.tsv.gz
P11A.counts.tsv.gz
P11B.counts.tsv.gz
P12A.counts.tsv.gz

Note: the files are not corrupt because I can actually open them with the following...
if fname.endswith('.gz'):
    os.system('gunzip -c ' + fname + ' > tmp')
    f = open('tmp')

...but I would really need to use your code for the barcodes extraction and for the whole downstream processing.
Let me know
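As an aside, the same sanity check can be done without shelling out to gunzip, using gzip.open() in text mode. A self-contained sketch (with the real data, fname would be e.g. 'P9A.counts.tsv.gz' rather than a throwaway file):

```python
import gzip

# Write a tiny gzipped TSV so the sketch is self-contained, then read the
# first line back in text mode; gzip.open(..., 'rt') decompresses and
# decodes in one step, so if this works the archive itself is intact.
fname = 'tmp_check.tsv.gz'
with gzip.open(fname, 'wt', encoding='utf-8') as f:
    f.write('barcode\tgene1\tgene2\nAAAC\t3\t0\n')

with gzip.open(fname, 'rt', encoding='utf-8') as f:
    first_line = f.readline().rstrip('\n')
print(first_line)
```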

Additional note:
using the corresponding four *unfiltered.npz files you provided, it works up until

D[s]['cell_bcs'].shape, D[s]['total_counts'].shape

where I get the following error.

KeyError Traceback (most recent call last)
in
----> 1 D[s]['cell_bcs'].shape

KeyError: 'cell_bcs'

Again, I would need to make it work starting from the *.counts.tsv.gz files, also for future projects.

Hi @sejjbia,

I think that Python 3 requires you to read the file in binary mode from the very beginning, which means you need to modify the file_opener() function. See here for a version that works for me (this link also includes most of the same helper functions and more, all Python 3 compatible).
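One minimal sketch of that idea (an alternative sketch, not necessarily identical to the linked version): open the raw file in binary mode, decompress with GzipFile, and wrap the result in io.TextIOWrapper so iteration yields str lines:

```python
import gzip
import io

# Self-contained demo: write a tiny gzipped TSV, then read it back the way
# a binary-mode file_opener() could -- raw bytes -> GzipFile -> TextIOWrapper,
# so downstream code sees str lines instead of bytes.
fname = 'tmp_wrap.tsv.gz'
with gzip.open(fname, 'wb') as f:
    f.write(b'a\tb\nc\td\n')

raw = open(fname, 'rb')
text = io.TextIOWrapper(gzip.GzipFile(fileobj=raw, mode='rb'), encoding='utf-8')
lines = [ln.strip('\n').split('\t') for ln in text]
text.close()
print(lines)
```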

Thank you swolock, but no luck yet.
I tried the new function you gave me, both embedding it in my spring_helper.py and using the whole spring_helper.py version from your link.

This is the function I am using now
def file_opener(filename):
    '''Open file and return a file object, automatically decompressing zip and gzip
    Arguments
    - filename : str
        Name of input file
    Returns
    - outData : file object
        (Decompressed) file data
    '''
    if filename.endswith('.gz'):
        fileData = open(filename, 'rb')
        import gzip
        outData = gzip.GzipFile(fileobj = fileData, mode = 'rb')
    elif filename.endswith('.zip'):
        fileData = open(filename, 'rb')
        import zipfile
        zipData = zipfile.ZipFile(fileData, 'r')
        fnClean = filename.strip('/').split('/')[-1][:-4]
        outData = zipData.open(fnClean)
    else:
        outData = open(filename, 'r')
    return outData

Your workaround is similar to one I tried before.
However, now I get the following:

_________________ P9A
Loading from text file

TypeError Traceback (most recent call last)
in
10 else:
11 print('Loading from text file')
---> 12 E,cell_bcs = load_text(file_opener(input_path + s + '.counts.tsv.gz'), delim = '\t', load_cell_bcs=True)
13 D[s]['E'] = E
14 D[s]['cell_bcs'] = cell_bcs

~/SPRING_dev-spring-of-rebirth/data_prep/spring_helper.py in load_text(file_data, delim, load_cell_bcs)
164 start_row = -1
165 for row_ix, dat in enumerate(file_data):
--> 166 dat = dat.strip('\n').split(delim)
167 if start_row == -1:
168 current_col = 0

TypeError: a bytes-like object is required, not 'str'

I guess it still doesn't read it in binary mode...

Actually, I think it is now reading in binary mode (dat is a bytes-like object), but you're using a string (delim) to split it.

You need to decode the input data before treating it like a string:

for row_ix, dat in enumerate(file_data):
    if type(dat) == bytes:
        dat = dat.decode('utf-8')
    dat = dat.strip('\n').split(delim)

Or see this example.
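Here is that pattern as a self-contained sketch, using a throwaway gzipped TSV in place of the real counts files:

```python
import gzip

# Write a tiny gzipped TSV, open it in binary mode (as file_opener does),
# and decode each line from bytes to str before splitting on the delimiter.
fname = 'tmp_demo.tsv.gz'
with gzip.open(fname, 'wb') as f:
    f.write(b'gene\tc1\tc2\nACTB\t5\t2\n')

rows = []
with open(fname, 'rb') as raw:
    file_data = gzip.GzipFile(fileobj=raw, mode='rb')
    for row_ix, dat in enumerate(file_data):
        if type(dat) == bytes:
            dat = dat.decode('utf-8')
        rows.append(dat.strip('\n').split('\t'))
print(rows)
```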

Thank you swolock
It seemed it was going through with your solution, BUT it took several minutes to load the first file, only to end up with this (BTW, I got the very same error when, in one of my attempts, I tried to load pre-decompressed .tsv files):

_________________ P9A
Loading from text file


ValueError Traceback (most recent call last)
in
10 else:
11 print('Loading from text file')
---> 12 E,cell_bcs = load_text(file_opener(input_path + s + '.counts.tsv.gz'), delim = '\t', load_cell_bcs=True)
13 D[s]['E'] = E
14 D[s]['cell_bcs'] = cell_bcs

ValueError: too many values to unpack (expected 2)

I'm not quite sure why you're getting this particular error, but it's likely there are other changes you need to make to get this Python 3-compatible. For example:

rowdat = np.array(map(float, dat[current_col:]))

becomes:

rowdat = np.array(list(map(float, dat[current_col:])))
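A quick illustration of why the list() wrapper matters in Python 3 (with hypothetical row data):

```python
import numpy as np

dat = ['ACTB', '1.0', '2.5', '0.0']  # hypothetical row: label, then counts
current_col = 1

# Python 2: map() returned a list, so np.array() built a numeric array.
# Python 3: map() returns a lazy iterator, which np.array() wraps in a
# 0-dimensional object array instead of consuming it.
bad = np.array(map(float, dat[current_col:]))
good = np.array(list(map(float, dat[current_col:])))

print(bad.shape, good.shape)
```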

Unless you're excited about going through this exercise, you're probably better off just using my function load_annotated_text().

Use it like so:

E, cell_bcs, gene_names = hf.load_annotated_text(
    hf.file_opener(input_path + s + '.counts.tsv.gz'),
    delim='\t', 
    read_row_labels=True, 
    read_column_labels=True)

Another thing: in your previous comment, I noticed that you're using spring-of-rebirth. Although we will eventually merge this PR, I think it is still buggy.

Thank you swolock
I embedded your last solution in the module and it worked perfectly well!!
for s in sample_name:
    print('_________________', s)

    if os.path.isfile(input_path + s + '.raw_counts.unfiltered.npz'):
        print('Loading from npz file')
        D[s]['E'] = scipy.sparse.load_npz(input_path + s + '.raw_counts.unfiltered.npz')
    else:
        print('Loading from text file')
        E, cell_bcs, gene_names = load_annotated_text(file_opener(input_path + s + '.counts.tsv.gz'), delim='\t', read_row_labels=True, read_column_labels=True)
        D[s]['E'] = E
        D[s]['cell_bcs'] = cell_bcs
        scipy.sparse.save_npz(input_path + s + '.raw_counts.unfiltered.npz', D[s]['E'], compressed = True)
    print(D[s]['E'].shape)

Just a note:
once the .npz files are created, I am still getting the following error at this stage:

D[s]['cell_bcs'].shape, D[s]['total_counts'].shape

KeyError Traceback (most recent call last)
in
----> 1 D[s]['cell_bcs'].shape

KeyError: 'cell_bcs'

I solved it by removing the generated *.npz files from the raw_counts folder and starting over from the *.tsv.gz files.
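For anyone hitting the same KeyError: the npz branch of the loop only restores the counts matrix, never cell_bcs. One option (a sketch, not part of SPRING) is to cache the barcodes alongside the matrix the first time, and reload both on later runs:

```python
import numpy as np
import scipy.sparse

# Hypothetical names: a tiny counts matrix E and its cell barcodes cell_bcs.
E = scipy.sparse.csr_matrix(np.array([[3, 0], [1, 2]]))
cell_bcs = np.array(['AAAC', 'TTTG'])

# First run: save both the matrix and the barcodes.
scipy.sparse.save_npz('P_demo.raw_counts.unfiltered.npz', E, compressed=True)
np.save('P_demo.cell_bcs.npy', cell_bcs)

# Later runs: reload both, so D[s]['cell_bcs'] is always populated.
E2 = scipy.sparse.load_npz('P_demo.raw_counts.unfiltered.npz')
bcs2 = np.load('P_demo.cell_bcs.npy')
print(E2.shape, list(bcs2))
```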