splicebox / MntJULiP

Comprehensive and scalable differential splicing analyses

Converting gene name error?

camelest opened this issue

Hi, thank you so much for the great tool. I'm trying MntJULiP with the GENCODE gene model and getting the error below.

Traceback (most recent call last):
  File "/local/home/ubuntu/anaconda3/envs/mntjulip-env/lib/python3.9/site-packages/dask/dataframe/utils.py", line 193, in raise_on_meta_error
    yield
  File "/local/home/ubuntu/anaconda3/envs/mntjulip-env/lib/python3.9/site-packages/dask/dataframe/core.py", line 6470, in elemwise
    meta = partial_by_order(*parts, function=op, other=other)
  File "/local/home/ubuntu/anaconda3/envs/mntjulip-env/lib/python3.9/site-packages/dask/utils.py", line 1327, in partial_by_order
    return function(*args2, **kwargs)
  File "/local/home/ubuntu/anaconda3/envs/mntjulip-env/lib/python3.9/site-packages/pandas/core/ops/common.py", line 81, in new_method
    return method(self, other)
  File "/local/home/ubuntu/anaconda3/envs/mntjulip-env/lib/python3.9/site-packages/pandas/core/arraylike.py", line 60, in __ge__
    return self._cmp_method(other, operator.ge)
  File "/local/home/ubuntu/anaconda3/envs/mntjulip-env/lib/python3.9/site-packages/pandas/core/series.py", line 6097, in _cmp_method
    res_values = ops.comparison_op(lvalues, rvalues, op)
  File "/local/home/ubuntu/anaconda3/envs/mntjulip-env/lib/python3.9/site-packages/pandas/core/ops/array_ops.py", line 286, in comparison_op
    res_values = comp_method_OBJECT_ARRAY(op, lvalues, rvalues)
  File "/local/home/ubuntu/anaconda3/envs/mntjulip-env/lib/python3.9/site-packages/pandas/core/ops/array_ops.py", line 75, in comp_method_OBJECT_ARRAY
    result = libops.scalar_compare(x.ravel(), y, op)
  File "pandas/_libs/ops.pyx", line 107, in pandas._libs.ops.scalar_compare
TypeError: '>=' not supported between instances of 'str' and 'int'

I'm using GENCODE v41 now. Is it related to some format problem with the annotation file? Thank you so much for your help.

commented

Hi camelest, thanks for using MntJULiP! I can reproduce the error message when the input bam-list contains non-existent paths. Could you please double-check your input bam-list?
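(For anyone hitting the same TypeError: a quick way to spot non-existent paths in a bam-list before running MntJULiP is sketched below. This is not part of the tool and assumes the whitespace-delimited, header-first bam-list format shown later in this thread.)

import os
import sys

def missing_bams(bam_list_path):
    """Return BAM paths from a whitespace-delimited bam-list that do not exist on disk."""
    missing = []
    with open(bam_list_path) as fh:
        next(fh, None)  # skip the "sample condition" header line
        for line in fh:
            fields = line.split()
            if fields and not os.path.exists(fields[0]):
                missing.append(fields[0])
    return missing

if __name__ == "__main__":
    for path in missing_bams(sys.argv[1]):
        print("missing:", path)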

Thank you so much for your quick response.

I double-checked it and it all seems fine. It actually prints these two lines before the errors above:
mnt-JULiP: 07-Apr-23 14:37:36: Generating splice files (or reusing splice files if save-tmp set to true and splice files exist) ...
mnt-JULiP: 07-Apr-23 14:37:36: Processing 6 samples ...
which corresponds to the number of samples I listed in the bam list.

commented

Thanks, got it. Would it be possible to share your input bam-list and example splice/BAM files so we can reproduce the error?

Thank you so much. Let me check, since it's collaborative data. I will also check whether I get the same error with the example data.

commented

Appreciate it. A simulated bam-list and splice/BAM files would work too.

I fixed the error that @camelest mentioned. However, I got something else afterward:
Traceback (most recent call last):
  File "run.py", line 153, in <module>
    main()
  File "run.py", line 113, in main
    df, index_df, anno_intron_dict = process_introns_with_annotation(out_data_dir, num_samples, anno_intron_dict, start_site_genes_dict,
  File "/home/melisa/MntJULiP/utils.py", line 184, in process_introns_with_annotation
    df['index'] = df[[coord_columns]].apply(lambda x: tuple(x), axis=1)
  File "/home/melisa/.local/lib/python3.8/site-packages/pandas/core/frame.py", line 3767, in __getitem__
    indexer = self.columns._get_indexer_strict(key, "columns")[1]
  File "/home/melisa/.local/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 5877, in _get_indexer_strict
    self._raise_if_missing(keyarr, indexer, axis_name)
  File "/home/melisa/.local/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 5938, in _raise_if_missing
    raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Index([('chromosome', 'strand', 'start', 'end')], dtype='object')] are in the [columns]"

A GTF file does not have those column names. Do you have any idea what the reason is? Thank you.

Hi melisa-r,

You are right: 'chromosome', 'strand', 'start', 'end' are the dataframe column names used for the splice-file input. It's probably an issue with the splice files; it would be great if you could check them, or share your command along with some example input files.
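(For context, the KeyError above is pandas reporting that none of the expected coordinate columns were found in the dataframe built from the splice files. A minimal illustration of the message, using a made-up dataframe rather than MntJULiP's actual loader:)

import pandas as pd

# columns MntJULiP expects when it loads splice files into a dataframe
coord_columns = ["chromosome", "strand", "start", "end"]

# a dataframe parsed from malformed or stale splice files will not have them
df = pd.DataFrame({"field_0": [1, 2], "field_1": [3, 4]})

try:
    df[coord_columns]
except KeyError as err:
    # prints: "None of [Index(['chromosome', 'strand', 'start', 'end'], dtype='object')] are in the [columns]"
    print(err)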

Hello Ed,

I didn't use any splice files. I used bam files with the command:

python run.py --bam-list bam_list.txt --anno-file /home/melisa/Mus_musculus.GRCm39.110.gtf --num-threads 8

bam_list.txt is like this:
sample condition
/home/melisa/MntJULiP/S1.bam uninjured
/home/melisa/MntJULiP/S2.bam injured

Then it generates the splice files in the out/data folder; they look like this:
sample_1.splice.gz sample_2.splice.gz
and the data inside them is:

1 4068912 4928042 1 ? 1 0 7 0
1 4816651 4928017 1 ? 0 1 0 0
1 4816651 4928035 1 ? 0 1 0 0
1 4844739 4847748 2 ? 2 0 2 0

There are no column names at all. So either it cannot process the BAM files properly for some reason, or is the issue with the GTF file? Thank you.
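(To sanity-check what ended up in out/data, the splice files can be loaded into pandas. The sketch below uses guessed column names, since the files are headerless: only chromosome/start/end and the '?' strand field are evident from the rows above, and the remaining integer fields are given generic names.)

import gzip
import pandas as pd

# guessed names for inspection only; the meaning of the count fields is not documented in this thread
names = ["chromosome", "start", "end", "field_4", "strand",
         "count_1", "count_2", "count_3", "count_4"]

with gzip.open("out/data/sample_1.splice.gz", "rt") as fh:
    splice = pd.read_csv(fh, sep=r"\s+", header=None, names=names)

print(splice.head())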

Hi again. I realized that the out/data folder had been generated earlier from the incorrect input files I had given, and apparently those incorrectly generated splice files were being reused over and over. So the column-names error is fixed now. Sorry, you can ignore my previous question. However, I now get this error:

File "run.py", line 153, in
main()
File "run.py", line 127, in main
diff_nb_intron_dict, pred_intron_dict = NB_model(df, conditions, model_dir,
File "/home/melisa/MntJULiP/models.py", line 72, in NB_model
ys.append(np.array(row_list[1:-1], dtype=np.int))
File "/home/melisa/.local/lib/python3.8/site-packages/numpy/init.py", line 305, in getattr
raise AttributeError(former_attrs[attr])
AttributeError: module 'numpy' has no attribute 'int'.
np.int was a deprecated alias for the builtin int. To avoid this error in existing code, use int by itself. Doing this will not modify any behavior and is safe. When replacing np.int, you may wish to use e.g. np.int64 or np.int32 to specify the precision. If you wish to review your current use, check the release note link for additional information.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
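(The fix follows numpy's own guidance: on the models.py line quoted in the traceback, replace the removed np.int alias with the builtin int or an explicit width. A sketch with made-up row values:)

import numpy as np

# stand-in for one parsed row: an identifier, some integer counts, and a label
row_list = ["intron_1", 3, 0, 7, 0, "annotated"]

ys = []
# before (raises AttributeError on NumPy >= 1.24, where the np.int alias was removed):
#   ys.append(np.array(row_list[1:-1], dtype=np.int))
# after:
ys.append(np.array(row_list[1:-1], dtype=np.int32))
print(ys[0])  # [3 0 7 0]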

I am using Python 3.8. I fixed the error according to the recommendation (replaced np.int with np.int32 in models.py). Unfortunately, I now have a new error:

File "run.py", line 153, in
main()
File "run.py", line 127, in main
diff_nb_intron_dict, pred_intron_dict = NB_model(df, conditions, model_dir,
File "/home/melisa/MntJULiP/models.py", line 84, in NB_model
results_batches = list(compute(delayed_results, traverse=False, num_workers=num_workers, scheduler="processes"))
File "/home/melisa/.local/lib/python3.8/site-packages/dask/base.py", line 599, in compute
results = schedule(dsk, keys, **kwargs)
File "/home/melisa/.local/lib/python3.8/site-packages/dask/multiprocessing.py", line 233, in get
result = get_async(
File "/home/melisa/.local/lib/python3.8/site-packages/dask/local.py", line 511, in get_async
raise_exception(exc, tb)
File "/home/melisa/.local/lib/python3.8/site-packages/dask/local.py", line 319, in reraise
raise exc
File "/home/melisa/.local/lib/python3.8/site-packages/dask/local.py", line 224, in execute_task
result = _execute_task(task, data)
File "/home/melisa/.local/lib/python3.8/site-packages/dask/core.py", line 119, in _execute_task
return func(
(_execute_task(a, cache) for a in args))
File "/home/melisa/MntJULiP/models.py", line 109, in batch_run_NB_model
results.append(run_NB_model(y, conditions, count, null_model, alt_model, aggressive_mode))
File "/home/melisa/MntJULiP/models.py", line 158, in run_NB_model
fit_alt = alt_model.optimizing(data=alt_data_dict, as_vector=False, init_alpha=1e-5)
File "/home/melisa/.local/lib/python3.8/site-packages/pystan/model.py", line 542, in optimizing
fit = self.fit_class(data, seed)
File "stanfit4anon_model_23ba00928262ba4aeb4432082f7437b8_1046099302502438532.pyx", line 474, in stanfit4anon_model_23ba00928262ba4aeb4432082f7437b8_1046099302502438532.StanFit4Model.cinit
File "/home/melisa/.local/lib/python3.8/site-packages/pystan/misc.py", line 412, in _split_data
raise ValueError(msg.format(k))
ValueError: Variable z is neither int nor float nor list/array thereof

In models.py, I see that z = conditions. Are these the conditions from bam_list.txt?
Any suggestions? Thank you very much for your assistance.

Hi melisa-r,

Thanks for your prompt feedback and all the sample data provided! Yes, with your input, the conditions are converted from bam_list.txt into:

array([[1, 0],
[0, 1]], dtype=uint8)

However, I could not reproduce your error with the data. What versions of numpy and pandas are you using, please?
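(For reference, a minimal sketch of how condition labels like those in bam_list.txt can be turned into a one-hot matrix like the one above; whether MntJULiP does exactly this internally is an assumption.)

import numpy as np
import pandas as pd

# condition labels, one per sample, as listed in bam_list.txt
labels = pd.Series(["uninjured", "injured"])

# factorize preserves order of first appearance; index into an identity matrix for one-hot rows
codes, categories = pd.factorize(labels)
conditions = np.eye(len(categories), dtype=np.uint8)[codes]

print(conditions)
# [[1 0]
#  [0 1]]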

Hello Ed,
Thank you for your reply.

Pandas is 2.0.3
NumPy is 1.24.4

Hello Ed,

I resolved the issue by changing z to:

z = conditions.astype(np.int32)

I guess it is because of the NumPy version I have. Thank you for your help!
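(For anyone hitting the same ValueError, here is the workaround in context. Whether the uint8 dtype of the one-hot matrix is what this pystan version rejects may depend on the installed numpy/pystan versions, so treat this as the fix that worked in this particular setup.)

import numpy as np

# one-hot condition matrix as shown earlier in the thread; it comes out as uint8
conditions = np.array([[1, 0],
                       [0, 1]], dtype=np.uint8)

# cast to a plain integer dtype before it is passed to the Stan model as variable z
z = conditions.astype(np.int32)
print(z.dtype)  # int32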