splicebox / MntJULiP

Comprehensive and scalable differential splicing analyses

Converting gene name error?

camelest opened this issue

Hi, thank you so much for the great tool. I'm trying MntJULiP with the GENCODE gene model and getting the error below.

Traceback (most recent call last):
  File "/local/home/ubuntu/anaconda3/envs/mntjulip-env/lib/python3.9/site-packages/dask/dataframe/utils.py", line 193, in raise_on_meta_error
    yield
  File "/local/home/ubuntu/anaconda3/envs/mntjulip-env/lib/python3.9/site-packages/dask/dataframe/core.py", line 6470, in elemwise
    meta = partial_by_order(*parts, function=op, other=other)
  File "/local/home/ubuntu/anaconda3/envs/mntjulip-env/lib/python3.9/site-packages/dask/utils.py", line 1327, in partial_by_order
    return function(*args2, **kwargs)
  File "/local/home/ubuntu/anaconda3/envs/mntjulip-env/lib/python3.9/site-packages/pandas/core/ops/common.py", line 81, in new_method
    return method(self, other)
  File "/local/home/ubuntu/anaconda3/envs/mntjulip-env/lib/python3.9/site-packages/pandas/core/arraylike.py", line 60, in __ge__
    return self._cmp_method(other, operator.ge)
  File "/local/home/ubuntu/anaconda3/envs/mntjulip-env/lib/python3.9/site-packages/pandas/core/series.py", line 6097, in _cmp_method
    res_values = ops.comparison_op(lvalues, rvalues, op)
  File "/local/home/ubuntu/anaconda3/envs/mntjulip-env/lib/python3.9/site-packages/pandas/core/ops/array_ops.py", line 286, in comparison_op
    res_values = comp_method_OBJECT_ARRAY(op, lvalues, rvalues)
  File "/local/home/ubuntu/anaconda3/envs/mntjulip-env/lib/python3.9/site-packages/pandas/core/ops/array_ops.py", line 75, in comp_method_OBJECT_ARRAY
    result = libops.scalar_compare(x.ravel(), y, op)
  File "pandas/_libs/ops.pyx", line 107, in pandas._libs.ops.scalar_compare
TypeError: '>=' not supported between instances of 'str' and 'int'

I'm using GENCODE v41 now. Is it related to some format problem with the annotation file? Thank you so much for your help.

commented

Hi camelest, thanks for using MntJULiP! I can reproduce the error message when the input bam-list contains non-existent paths. Could you please double-check your input bam-list?
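(For anyone hitting the same TypeError: a quick way to spot non-existent paths in a bam-list before running MntJULiP is sketched below. This is not part of the tool and assumes the whitespace-delimited, header-first bam-list format shown later in this thread.)

import os
import sys

def missing_bams(bam_list_path):
    """Return BAM paths from a whitespace-delimited bam-list that do not exist on disk."""
    missing = []
    with open(bam_list_path) as fh:
        next(fh, None)  # skip the "sample condition" header line
        for line in fh:
            fields = line.split()
            if fields and not os.path.exists(fields[0]):
                missing.append(fields[0])
    return missing

if __name__ == "__main__":
    for path in missing_bams(sys.argv[1]):
        print("missing:", path)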

Thank you so much for your quick response.

I double-checked it and it all seems fine. It actually prints these two lines before the errors above:
mnt-JULiP: 07-Apr-23 14:37:36: Generating splice files (or reusing splice files if save-tmp set to true and splice files exist) ...
mnt-JULiP: 07-Apr-23 14:37:36: Processing 6 samples ...
which corresponds to the number of samples I listed in the bam list.

commented

Thanks, got it. Would it be possible to share your input bam-list and example splice/BAM files so we can reproduce the error?

Thank you so much. Let me check, since it's collaborative data. I will also check whether I get the same error with the example data.

commented

Appreciate it. A simulated bam-list and splice/BAM files would work too.

I fixed the error that @camelest mentioned. However, I got something else afterward:
Traceback (most recent call last):
  File "run.py", line 153, in <module>
    main()
  File "run.py", line 113, in main
    df, index_df, anno_intron_dict = process_introns_with_annotation(out_data_dir, num_samples, anno_intron_dict, start_site_genes_dict,
  File "/home/melisa/MntJULiP/utils.py", line 184, in process_introns_with_annotation
    df['index'] = df[[coord_columns]].apply(lambda x: tuple(x), axis=1)
  File "/home/melisa/.local/lib/python3.8/site-packages/pandas/core/frame.py", line 3767, in __getitem__
    indexer = self.columns._get_indexer_strict(key, "columns")[1]
  File "/home/melisa/.local/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 5877, in _get_indexer_strict
    self._raise_if_missing(keyarr, indexer, axis_name)
  File "/home/melisa/.local/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 5938, in _raise_if_missing
    raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Index([('chromosome', 'strand', 'start', 'end')], dtype='object')] are in the [columns]"

A GTF file does not have those column names. Do you have any idea what the reason is? Thank you.

Hi melisa-r,

You are right: 'chromosome', 'strand', 'start', 'end' are the dataframe column names used for the splice-file input. It's probably an issue with the splice files; it would be great if you could check them, or share your command along with some example input files.
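(For context, the KeyError above is pandas reporting that none of the expected coordinate columns were found in the dataframe built from the splice files. A minimal illustration of the message, using a made-up dataframe rather than MntJULiP's actual loader:)

import pandas as pd

# columns MntJULiP expects when it loads splice files into a dataframe
coord_columns = ["chromosome", "strand", "start", "end"]

# a dataframe parsed from malformed or stale splice files will not have them
df = pd.DataFrame({"field_0": [1, 2], "field_1": [3, 4]})

try:
    df[coord_columns]
except KeyError as err:
    # prints: "None of [Index(['chromosome', 'strand', 'start', 'end'], dtype='object')] are in the [columns]"
    print(err)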

Hello Ed,

I didn't use any splice files. I used bam files with the command:

python run.py --bam-list bam_list.txt --anno-file /home/melisa/Mus_musculus.GRCm39.110.gtf --num-threads 8

bam_list.txt is like this:
sample condition
/home/melisa/MntJULiP/S1.bam uninjured
/home/melisa/MntJULiP/S2.bam injured

Then it generates the splice files in the out/data folder; they look like this:
sample_1.splice.gz sample_2.splice.gz
and the data inside them is:

1 4068912 4928042 1 ? 1 0 7 0
1 4816651 4928017 1 ? 0 1 0 0
1 4816651 4928035 1 ? 0 1 0 0
1 4844739 4847748 2 ? 2 0 2 0

There are no column names at all. So either it cannot process the BAM files properly for some reason, or is the issue with the GTF file? Thank you.
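(To sanity-check what ended up in out/data, the splice files can be loaded into pandas. The sketch below uses guessed column names, since the files are headerless: only chromosome/start/end and the '?' strand field are evident from the rows above, and the remaining integer fields are given generic names.)

import gzip
import pandas as pd

# guessed names for inspection only; the meaning of the count fields is not documented in this thread
names = ["chromosome", "start", "end", "field_4", "strand",
         "count_1", "count_2", "count_3", "count_4"]

with gzip.open("out/data/sample_1.splice.gz", "rt") as fh:
    splice = pd.read_csv(fh, sep=r"\s+", header=None, names=names)

print(splice.head())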

Hi again. I realized that the out/data folder had been generated earlier from the incorrect input files I had given, and apparently those incorrectly generated splice files were being reused over and over. So the column-names error is fixed now. Sorry, you can ignore my previous question. However, I now get this error:

File "run.py", line 153, in
main()
File "run.py", line 127, in main
diff_nb_intron_dict, pred_intron_dict = NB_model(df, conditions, model_dir,
File "/home/melisa/MntJULiP/models.py", line 72, in NB_model
ys.append(np.array(row_list[1:-1], dtype=np.int))
File "/home/melisa/.local/lib/python3.8/site-packages/numpy/init.py", line 305, in getattr
raise AttributeError(former_attrs[attr])
AttributeError: module 'numpy' has no attribute 'int'.
np.int was a deprecated alias for the builtin int. To avoid this error in existing code, use int by itself. Doing this will not modify any behavior and is safe. When replacing np.int, you may wish to use e.g. np.int64 or np.int32 to specify the precision. If you wish to review your current use, check the release note link for additional information.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
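(The fix follows numpy's own guidance: on the models.py line quoted in the traceback, replace the removed np.int alias with the builtin int or an explicit width. A sketch with made-up row values:)

import numpy as np

# stand-in for one parsed row: an identifier, some integer counts, and a label
row_list = ["intron_1", 3, 0, 7, 0, "annotated"]

ys = []
# before (raises AttributeError on NumPy >= 1.24, where the np.int alias was removed):
#   ys.append(np.array(row_list[1:-1], dtype=np.int))
# after:
ys.append(np.array(row_list[1:-1], dtype=np.int32))
print(ys[0])  # [3 0 7 0]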

I am using Python 3.8. I fixed the error according to the recommendation (replaced np.int with np.int32 in models.py). Unfortunately, I now have a new error:

File "run.py", line 153, in
main()
File "run.py", line 127, in main
diff_nb_intron_dict, pred_intron_dict = NB_model(df, conditions, model_dir,
File "/home/melisa/MntJULiP/models.py", line 84, in NB_model
results_batches = list(compute(delayed_results, traverse=False, num_workers=num_workers, scheduler="processes"))
File "/home/melisa/.local/lib/python3.8/site-packages/dask/base.py", line 599, in compute
results = schedule(dsk, keys, **kwargs)
File "/home/melisa/.local/lib/python3.8/site-packages/dask/multiprocessing.py", line 233, in get
result = get_async(
File "/home/melisa/.local/lib/python3.8/site-packages/dask/local.py", line 511, in get_async
raise_exception(exc, tb)
File "/home/melisa/.local/lib/python3.8/site-packages/dask/local.py", line 319, in reraise
raise exc
File "/home/melisa/.local/lib/python3.8/site-packages/dask/local.py", line 224, in execute_task
result = _execute_task(task, data)
File "/home/melisa/.local/lib/python3.8/site-packages/dask/core.py", line 119, in _execute_task
return func(
(_execute_task(a, cache) for a in args))
File "/home/melisa/MntJULiP/models.py", line 109, in batch_run_NB_model
results.append(run_NB_model(y, conditions, count, null_model, alt_model, aggressive_mode))
File "/home/melisa/MntJULiP/models.py", line 158, in run_NB_model
fit_alt = alt_model.optimizing(data=alt_data_dict, as_vector=False, init_alpha=1e-5)
File "/home/melisa/.local/lib/python3.8/site-packages/pystan/model.py", line 542, in optimizing
fit = self.fit_class(data, seed)
File "stanfit4anon_model_23ba00928262ba4aeb4432082f7437b8_1046099302502438532.pyx", line 474, in stanfit4anon_model_23ba00928262ba4aeb4432082f7437b8_1046099302502438532.StanFit4Model.cinit
File "/home/melisa/.local/lib/python3.8/site-packages/pystan/misc.py", line 412, in _split_data
raise ValueError(msg.format(k))
ValueError: Variable z is neither int nor float nor list/array thereof

In models.py, I see that z = conditions. Are these the conditions from bam_list.txt?
Any suggestions? Thank you very much for your assistance.

Hi melisa-r,

Thanks for your prompt feedback and all the sample data provided! Yes, with your input, the conditions are converted from bam_list.txt into:

array([[1, 0],
[0, 1]], dtype=uint8)

However, I could not reproduce your error with the data. What versions of numpy and pandas are you using, please?
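(For reference, a minimal sketch of how condition labels like those in bam_list.txt can be turned into a one-hot matrix like the one above; whether MntJULiP does exactly this internally is an assumption.)

import numpy as np
import pandas as pd

# condition labels, one per sample, as listed in bam_list.txt
labels = pd.Series(["uninjured", "injured"])

# factorize preserves order of first appearance; index into an identity matrix for one-hot rows
codes, categories = pd.factorize(labels)
conditions = np.eye(len(categories), dtype=np.uint8)[codes]

print(conditions)
# [[1 0]
#  [0 1]]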

Hello Ed,
Thank you for your reply.

Pandas is 2.0.3
NumPy is 1.24.4

Hello Ed,

I resolved the issue by changing z to:

z = conditions.astype(np.int32)

I guess it is because of the NumPy version I have. Thank you for your help!
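(For anyone hitting the same ValueError, here is the workaround in context. Whether the uint8 dtype of the one-hot matrix is what this pystan version rejects may depend on the installed numpy/pystan versions, so treat this as the fix that worked in this particular setup.)

import numpy as np

# one-hot condition matrix as shown earlier in the thread; it comes out as uint8
conditions = np.array([[1, 0],
                       [0, 1]], dtype=np.uint8)

# cast to a plain integer dtype before it is passed to the Stan model as variable z
z = conditions.astype(np.int32)
print(z.dtype)  # int32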