AllonKleinLab / SPRING_dev


Problem loading counts matrices - HPCs tutorial

sejjbia opened this issue · comments

Hi
I am a python novice and I am interested in using SPRING
With some adaptations I successfully managed to run the pbmc4k tutorial on python 3.7.3
Instead when trying the HPCs tutorial, upon loading the counts matrices at this stage...

for s in sample_name:
    print('_________________', s)

    if os.path.isfile(input_path + s + '.raw_counts.unfiltered.npz'):
        print('Loading from npz file')
        D[s]['E'] = scipy.sparse.load_npz(input_path + s + '.raw_counts.unfiltered.npz')
    else:
        print('Loading from text file')
        E,cell_bcs = load_text(file_opener(input_path + s + '.counts.tsv.gz'), delim = '\t', load_cell_bcs=True)
        D[s]['E'] = E
        D[s]['cell_bcs'] = cell_bcs
        scipy.sparse.save_npz(input_path + s + '.raw_counts.unfiltered.npz', D[s]['E'], compressed = True)
    print(D[s]['E'].shape)

....I get the following error:

UnicodeDecodeError Traceback (most recent call last)
in
10 else:
11 print('Loading from text file')
---> 12 E,cell_bcs = load_text(file_opener(input_path + s + '.counts.tsv.gz'), delim = '\t', load_cell_bcs=True)
13 D[s]['E'] = E
14 D[s]['cell_bcs'] = cell_bcs

~/SPRING_dev-spring-of-rebirth/data_prep/spring_helper.py in load_text(file_data, delim, load_cell_bcs)
149 start_column = -1
150 start_row = -1
--> 151 for row_ix, dat in enumerate(file_data):
152 dat = dat.strip('\n').split(delim)
153 if start_row == -1:

~/anaconda3/lib/python3.7/gzip.py in readline(self, size)
372 def readline(self, size=-1):
373 self._check_not_closed()
--> 374 return self._buffer.readline(size)
375
376

~/anaconda3/lib/python3.7/_compression.py in readinto(self, b)
66 def readinto(self, b):
67 with memoryview(b) as view, view.cast("B") as byte_view:
---> 68 data = self.read(len(byte_view))
69 byte_view[:len(data)] = data
70 return len(data)

~/anaconda3/lib/python3.7/gzip.py in read(self, size)
461 # jump to the next member, if there is one.
462 self._init_read()
--> 463 if not self._read_gzip_header():
464 self._size = self._pos
465 return b""

~/anaconda3/lib/python3.7/gzip.py in _read_gzip_header(self)
404
405 def _read_gzip_header(self):
--> 406 magic = self._fp.read(2)
407 if magic == b'':
408 return False

~/anaconda3/lib/python3.7/gzip.py in read(self, size)
89 self._read = None
90 return self._buffer[read:] +
---> 91 self.file.read(size-self._length+read)
92
93 def prepend(self, prepend=b''):

~/anaconda3/lib/python3.7/codecs.py in decode(self, input, final)
320 # decode input (taking the buffer into account)
321 data = self.buffer + input
--> 322 (result, consumed) = self._buffer_decode(data, self.errors, final)
323 # keep undecoded input until the next call
324 self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

I tried several workarounds, with no luck, on the file_opener() function of spring_helper.py (I got the latest version, the one with "Added cell_BC export"), as my hypothesis is that this is somehow related to the GzipFile missing the utf-8 encoding.
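For what it's worth, the byte named in the error message is consistent with this hypothesis; a minimal sketch (not part of SPRING itself):

```python
# The two-byte gzip magic number is b'\x1f\x8b'. Byte 0x8b at position 1
# is not valid UTF-8 -- exactly the byte and position in the traceback above,
# so any text-mode read of the raw .gz file fails on its very first bytes.
magic = b"\x1f\x8b"
try:
    magic.decode("utf-8")
    failed = False
except UnicodeDecodeError as err:
    failed = True
    print(err)  # 'utf-8' codec can't decode byte 0x8b in position 1 ...
```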

I might be completely wrong and/or missing something totally obvious... Could you help me?

Thank you for your reply.
The files are the ones you provided as samples for this analysis (see below).

P9A.counts.tsv.gz
P11A.counts.tsv.gz
P11B.counts.tsv.gz
P12A.counts.tsv.gz

Note: the files are not corrupt because I can actually open them with the following...
if fname.endswith('.gz'):
    os.system('gunzip -c ' + fname + ' > tmp')
    f = open('tmp')

...but I would really need to use your code for the barcodes extraction and for the whole downstream processing.
Let me know
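As an aside, the same sanity check can be done without shelling out to gunzip, using gzip.open() in text mode. A self-contained sketch (with the real data, fname would be e.g. 'P9A.counts.tsv.gz' rather than a throwaway file):

```python
import gzip

# Write a tiny gzipped TSV so the sketch is self-contained, then read the
# first line back in text mode; gzip.open(..., 'rt') decompresses and
# decodes in one step, so if this works the archive itself is intact.
fname = 'tmp_check.tsv.gz'
with gzip.open(fname, 'wt', encoding='utf-8') as f:
    f.write('barcode\tgene1\tgene2\nAAAC\t3\t0\n')

with gzip.open(fname, 'rt', encoding='utf-8') as f:
    first_line = f.readline().rstrip('\n')
print(first_line)
```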

Additional note:
using the corresponding four *unfiltered.npz files you provided, it works up until

D[s]['cell_bcs'].shape, D[s]['total_counts'].shape

where I get the following error.

KeyError Traceback (most recent call last)
in
----> 1 D[s]['cell_bcs'].shape

KeyError: 'cell_bcs'

Again, I would need to make it work starting from the *.counts.tsv.gz files, also for future projects.

Hi @sejjbia,

I think that Python 3 requires you to read the file in binary mode from the very beginning, which means you need to modify the file_opener() function. See here for a version that works for me (this link also includes most of the same helper functions and more, all Python 3 compatible).
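One minimal sketch of that idea (an alternative sketch, not necessarily identical to the linked version): open the raw file in binary mode, decompress with GzipFile, and wrap the result in io.TextIOWrapper so iteration yields str lines:

```python
import gzip
import io

# Self-contained demo: write a tiny gzipped TSV, then read it back the way
# a binary-mode file_opener() could -- raw bytes -> GzipFile -> TextIOWrapper,
# so downstream code sees str lines instead of bytes.
fname = 'tmp_wrap.tsv.gz'
with gzip.open(fname, 'wb') as f:
    f.write(b'a\tb\nc\td\n')

raw = open(fname, 'rb')
text = io.TextIOWrapper(gzip.GzipFile(fileobj=raw, mode='rb'), encoding='utf-8')
lines = [ln.strip('\n').split('\t') for ln in text]
text.close()
print(lines)
```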

Thank you swolock, but no luck yet.
I tried the new function you gave me, both embedding it in my spring_helper.py and using the whole spring_helper.py version from your link.

This is the function I am using now
def file_opener(filename):
    '''Open file and return a file object, automatically decompressing zip and gzip
    Arguments
    - filename : str
        Name of input file
    Returns
    - outData : file object
        (Decompressed) file data
    '''
    if filename.endswith('.gz'):
        fileData = open(filename, 'rb')
        import gzip
        outData = gzip.GzipFile(fileobj = fileData, mode = 'rb')
    elif filename.endswith('.zip'):
        fileData = open(filename, 'rb')
        import zipfile
        zipData = zipfile.ZipFile(fileData, 'r')
        fnClean = filename.strip('/').split('/')[-1][:-4]
        outData = zipData.open(fnClean)
    else:
        outData = open(filename, 'r')
    return outData

Your workaround is similar to one I tried before.
However, now I get the following:

_________________ P9A
Loading from text file

TypeError Traceback (most recent call last)
in
10 else:
11 print('Loading from text file')
---> 12 E,cell_bcs = load_text(file_opener(input_path + s + '.counts.tsv.gz'), delim = '\t', load_cell_bcs=True)
13 D[s]['E'] = E
14 D[s]['cell_bcs'] = cell_bcs

~/SPRING_dev-spring-of-rebirth/data_prep/spring_helper.py in load_text(file_data, delim, load_cell_bcs)
164 start_row = -1
165 for row_ix, dat in enumerate(file_data):
--> 166 dat = dat.strip('\n').split(delim)
167 if start_row == -1:
168 current_col = 0

TypeError: a bytes-like object is required, not 'str'

I guess it still doesn't read it in binary mode...

Actually, I think it is now reading in binary mode (dat is a bytes-like object), but you're using a string (delim) to split it.

You need to decode the input data before treating it like a string:

for row_ix, dat in enumerate(file_data):
    if type(dat) == bytes:
        dat = dat.decode('utf-8')
    dat = dat.strip('\n').split(delim)

Or see this example.
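Here is that pattern as a self-contained sketch, using a throwaway gzipped TSV in place of the real counts files:

```python
import gzip

# Write a tiny gzipped TSV, open it in binary mode (as file_opener does),
# and decode each line from bytes to str before splitting on the delimiter.
fname = 'tmp_demo.tsv.gz'
with gzip.open(fname, 'wb') as f:
    f.write(b'gene\tc1\tc2\nACTB\t5\t2\n')

rows = []
with open(fname, 'rb') as raw:
    file_data = gzip.GzipFile(fileobj=raw, mode='rb')
    for row_ix, dat in enumerate(file_data):
        if type(dat) == bytes:
            dat = dat.decode('utf-8')
        rows.append(dat.strip('\n').split('\t'))
print(rows)
```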

Thank you swolock
It seemed it was going through with your solution, BUT it took several minutes to load the first file, only to end up with this (BTW, I got the very same error when, in one of my attempts, I tried to load pre-decompressed .tsv files):

_________________ P9A
Loading from text file


ValueError Traceback (most recent call last)
in
10 else:
11 print('Loading from text file')
---> 12 E,cell_bcs = load_text(file_opener(input_path + s + '.counts.tsv.gz'), delim = '\t', load_cell_bcs=True)
13 D[s]['E'] = E
14 D[s]['cell_bcs'] = cell_bcs

ValueError: too many values to unpack (expected 2)

I'm not quite sure why you're getting this particular error, but it's likely there are other changes you need to make to get this Python 3-compatible. For example:

rowdat = np.array(map(float, dat[current_col:]))

becomes:

rowdat = np.array(list(map(float, dat[current_col:])))
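A quick illustration of why the list() wrapper matters in Python 3 (with hypothetical row data):

```python
import numpy as np

dat = ['ACTB', '1.0', '2.5', '0.0']  # hypothetical row: label, then counts
current_col = 1

# Python 2: map() returned a list, so np.array() built a numeric array.
# Python 3: map() returns a lazy iterator, which np.array() wraps in a
# 0-dimensional object array instead of consuming it.
bad = np.array(map(float, dat[current_col:]))
good = np.array(list(map(float, dat[current_col:])))

print(bad.shape, good.shape)
```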

Unless you're excited about going through this exercise, you're probably better off just using my function load_annotated_text().

Use it like so:

E, cell_bcs, gene_names = hf.load_annotated_text(
    hf.file_opener(input_path + s + '.counts.tsv.gz'),
    delim='\t', 
    read_row_labels=True, 
    read_column_labels=True)

Another thing: in your previous comment, I noticed that you're using spring-of-rebirth. Although we will eventually merge this PR, I think it is still buggy.

Thank you swolock
I embedded your last solution in the module and it worked perfectly well!!
for s in sample_name:
    print('_________________', s)

    if os.path.isfile(input_path + s + '.raw_counts.unfiltered.npz'):
        print('Loading from npz file')
        D[s]['E'] = scipy.sparse.load_npz(input_path + s + '.raw_counts.unfiltered.npz')
    else:
        print('Loading from text file')
        E, cell_bcs, gene_names = load_annotated_text(file_opener(input_path + s + '.counts.tsv.gz'), delim='\t', read_row_labels=True, read_column_labels=True)
        D[s]['E'] = E
        D[s]['cell_bcs'] = cell_bcs
        scipy.sparse.save_npz(input_path + s + '.raw_counts.unfiltered.npz', D[s]['E'], compressed = True)
    print(D[s]['E'].shape)

Just a note:
once the .npz files are created, I am still getting the following error at this stage:

D[s]['cell_bcs'].shape, D[s]['total_counts'].shape

KeyError Traceback (most recent call last)
in
----> 1 D[s]['cell_bcs'].shape

KeyError: 'cell_bcs'

I solved it by removing the generated *.npz files from the raw_counts folder and starting over from the *.tsv.gz files.
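For anyone hitting the same KeyError: the npz branch of the loop only restores the counts matrix, never cell_bcs. One option (a sketch, not part of SPRING) is to cache the barcodes alongside the matrix the first time, and reload both on later runs:

```python
import numpy as np
import scipy.sparse

# Hypothetical names: a tiny counts matrix E and its cell barcodes cell_bcs.
E = scipy.sparse.csr_matrix(np.array([[3, 0], [1, 2]]))
cell_bcs = np.array(['AAAC', 'TTTG'])

# First run: save both the matrix and the barcodes.
scipy.sparse.save_npz('P_demo.raw_counts.unfiltered.npz', E, compressed=True)
np.save('P_demo.cell_bcs.npy', cell_bcs)

# Later runs: reload both, so D[s]['cell_bcs'] is always populated.
E2 = scipy.sparse.load_npz('P_demo.raw_counts.unfiltered.npz')
bcs2 = np.load('P_demo.cell_bcs.npy')
print(E2.shape, list(bcs2))
```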