pinellolab / haystack_bio

Haystack: Epigenetic Variability and Transcription Factor Motifs Analysis Pipeline

KeyError while running tests

micdonato opened this issue · comments

Hi, I am trying to install haystack on our server and I am running into an error when running the tests.
The tests report that they completed successfully, but at the end I get this:

INFO  @ Wed, 07 Apr 2021 16:10:54:
	 Analyzing MA0724.1 from:/home/user/haystack_test_output/HAYSTACK_PIPELINE_RESULTS/HAYSTACK_MOTIFS/HAYSTACK_MOTIFS_on_K562/genes_lists/MA0724.1_motif_region_in_target.tss.bed
/home/user/.conda/envs/hotspots/lib/python2.7/site-packages/haystack/generate_tf_activity_plane.py:189:FutureWarning: read_table is deprecated, use read_csv instead, passing sep='\t'.
  mapped_genes = map(str.upper, list(pd.read_table(motif_gene_filename,keep_default_na=False,na_values='null').dropna()['Symbol'].values.astype(str)))
Traceback (most recent call last):
  File "/home/users/.conda/envs/hotspots/bin/haystack_tf_activity_plane", line 10, in <module>
    sys.exit(main())
  File "/home/user/.conda/envs/hotspots/lib/python2.7/site-packages/haystack/generate_tf_activity_plane.py", line 193, in main
    ds_values = zscore_series(gene_ranking.ix[mapped_genes, :].mean())
  File "/home/user/.conda/envs/hotspots/lib/python2.7/site-packages/pandas/core/indexing.py", line 120, in __getitem__
    return self._getitem_tuple(key)
  File "/home/user/.conda/envs/hotspots/lib/python2.7/site-packages/pandas/core/indexing.py", line 888, in _getitem_tuple
    retval = getattr(retval, self.name)._getitem_axis(key, axis=i)
  File "/home/user/.conda/envs/hotspots/lib/python2.7/site-packages/pandas/core/indexing.py", line 1088, in _getitem_axis
    return self._getitem_iterable(key, axis=axis)
  File "/home/user/.conda/envs/hotspots/lib/python2.7/site-packages/pandas/core/indexing.py", line 1205, in _getitem_iterable
    raise_missing=False)
  File "/home/user/.conda/envs/hotspots/lib/python2.7/site-packages/pandas/core/indexing.py", line 1161, in _get_listlike_indexer
    raise_missing=raise_missing)
  File "/home/user/.conda/envs/hotspots/lib/python2.7/site-packages/pandas/core/indexing.py", line 1252, in _validate_read_indexer
    raise KeyError("{} not in index".format(not_found))
KeyError: "['BAGE5', 'GRIK1-AS2'] not in index"
INFO  @ Wed, 07 Apr 2021 16:10:54:
	 Test completed successfully

Should I be worried?

An update to pandas is causing this. I am not sure whether it is a cause for worry, but to be on the safe side, I would pin pandas (and potentially other packages) to the versions listed here: https://github.com/pinellolab/haystack_bio/blob/master/Dockerfile#L35. Alternatively, you can use the Docker container.
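For anyone wondering what changed: the traceback comes from a label lookup (`gene_ranking.ix[mapped_genes, :]`) where some gene symbols are missing from the index. Below is a minimal sketch of that behavior, assuming a recent pandas where `.ix` no longer exists and `.loc` raises `KeyError` for missing labels. The DataFrame contents are toy data, and the `Index.intersection` workaround is only an illustration, not the fix adopted in haystack.

```python
import pandas as pd

# Toy stand-in for the gene ranking table; the values are made up for illustration.
gene_ranking = pd.DataFrame(
    {"K562": [1.0, 2.0, 3.0], "HepG2": [3.0, 1.0, 2.0]},
    index=["GATA1", "TAL1", "SPI1"],
)

# Two of these symbols are absent from the index, as in the reported KeyError.
mapped_genes = ["GATA1", "BAGE5", "SPI1", "GRIK1-AS2"]

# A strict list lookup fails as soon as any label is missing.
try:
    gene_ranking.loc[mapped_genes, :]
    strict_lookup_failed = False
except KeyError:
    strict_lookup_failed = True

# Keeping only the labels that exist avoids the error.
present = gene_ranking.index.intersection(mapped_genes)
mean_rank = gene_ranking.loc[present, :].mean()

print(strict_lookup_failed)
print(mean_rank)
```

Whether silently dropping the missing genes is acceptable for the analysis is a separate question; pinning pandas as suggested above keeps the original behavior instead.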

Rick,

Thanks, that makes sense! I think I will use the Docker container, but I was also considering building a Singularity container, and pinning pandas will help with that.

I have no experience building Singularity containers but I think it would be a great solution for people running the pipeline on HPC clusters. Maybe @lucapinello knows more about these types of containers. I'll ask him.

They work roughly the same as Docker containers; it's just a matter of creating the right recipe for building them. Usually I install a package locally to see if I am able to build everything before going the container way.

My reason to use Singularity is mostly the root/user issue for Docker and to deal with filesystem isolation, but there are other differences as well.

Thanks!

Hi all, and thanks!

That is what I tried at first. Unfortunately, it seems that Singularity fails to actually build the image, as packages that should be installed are missing:

The command:
singularity run docker://pinellolab/haystack_bio haystack_pipeline data/data_h3k27ac_6cells/samples_names.txt hg19 --blacklist hg19

The result:

INFO:    Using cached SIF image
Traceback (most recent call last):
  File "/usr/local/bin/haystack_pipeline", line 6, in <module>
    from pkg_resources import load_entry_point
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 2927, in <module>
    @_call_aside
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 2913, in _call_aside
    f(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 2940, in _initialize_master_working_set
    working_set = WorkingSet._build_master()
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 635, in _build_master
    ws.require(__requires__)
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 943, in require
    needed = self.resolve(parse_requirements(requirements))
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 829, in resolve
    raise DistributionNotFound(req, requirers)
pkg_resources.DistributionNotFound: The 'scipy>=1.0.0' distribution was not found and is required by haystack-bio

That is why I wanted to try rebuilding the Singularity image from scratch.

I can reproduce your error, and it seems there is no simple way to use the Docker image directly with Singularity. You may want to explore this tool to convert the Docker image: docker2singularity.

I have tested the docker image on my machine and it is still working as expected but I understand that this may not be a viable option for you.

You can try to downgrade pandas in the conda environment you created previously and, if necessary, the other packages as well:

numpy==1.13.3
scipy==1.0.0
matplotlib==2.1.0
pandas==0.21.0

and, installed with pip:

bx-python==0.7.3
Jinja2==2.9.6
tqdm==4.19.4
weblogo==3.5.0
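To see which of the pins listed above an existing environment already satisfies, something like the following sketch could be run with that environment's Python. The helper names `installed_version` and `find_mismatches` are hypothetical and not part of haystack. The script tries `importlib.metadata` first and falls back to `pkg_resources`, so it should also work on the Python 2.7 interpreter haystack uses.

```python
# Look up installed versions with whatever metadata API this interpreter offers.
try:
    from importlib.metadata import PackageNotFoundError, version  # Python 3.8+

    def installed_version(name):
        """Return the installed version of a package, or None if it is absent."""
        try:
            return version(name)
        except PackageNotFoundError:
            return None
except ImportError:
    import pkg_resources  # older interpreters, including Python 2.7

    def installed_version(name):
        """Return the installed version of a package, or None if it is absent."""
        try:
            return pkg_resources.get_distribution(name).version
        except pkg_resources.DistributionNotFound:
            return None

# Pins copied from the list in this thread.
PINS = {
    "numpy": "1.13.3",
    "scipy": "1.0.0",
    "matplotlib": "2.1.0",
    "pandas": "0.21.0",
    "bx-python": "0.7.3",
    "Jinja2": "2.9.6",
    "tqdm": "4.19.4",
    "weblogo": "3.5.0",
}


def find_mismatches(pins, lookup=installed_version):
    """Map each package whose installed version differs to (pinned, installed)."""
    mismatches = {}
    for name, wanted in pins.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in sorted(find_mismatches(PINS).items()):
        print("{}: pinned {}, installed {}".format(name, wanted, found or "not installed"))
```

An empty output means every pin matches. The comparison is an exact string match, so a version reported as `0.21` instead of `0.21.0` would be flagged as a mismatch.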

@rfarouni do you have the bandwidth to pin pandas in the bioconda package in the next few days and resubmit it, so we can fix this for other users trying the package through bioconda? Of course, this will require creating a separate conda env just for haystack.

@lucapinello I will look into this as soon as I can

I found that the easiest way to deal with this error is to run conda install pandas==0.21 after running conda install haystack_bio. The test runs fine after that.

	 The expression values of the gene TEST1 are not present. Skipping it. 

WARNING @ Thu, 22 Apr 2021 18:10:50:
	 The expression values of the gene SCIP are not present. Skipping it. 

INFO  @ Thu, 22 Apr 2021 18:10:50:
	 Gene:POU3F1 TF z-score:0.73 Targets z-score:1.58  Correlation:0.48 

WARNING @ Thu, 22 Apr 2021 18:10:50:
	 The expression values of the gene TST-1 are not present. Skipping it. 

WARNING @ Thu, 22 Apr 2021 18:10:50:
	 The expression values of the gene OCT6 are not present. Skipping it. 

WARNING @ Thu, 22 Apr 2021 18:10:50:
	 The expression values of the gene OTF-6 are not present. Skipping it. 

WARNING @ Thu, 22 Apr 2021 18:10:50:
	 The expression values of the gene OTF6 are not present. Skipping it. 

WARNING @ Thu, 22 Apr 2021 18:10:50:
	 The expression values of the gene OCT-6 are not present. Skipping it. 

WARNING @ Thu, 22 Apr 2021 18:10:50:
	 The expression values of the gene TST1 are not present. Skipping it. 

INFO  @ Thu, 22 Apr 2021 18:10:50:
	 All done! Ciao! 

INFO  @ Thu, 22 Apr 2021 18:10:50:
	 Test completed successfully

This seems to work as well