AllenInstitute / cell_type_mapper

Repository for storing prototype functionality implementations for the BKP

Running the 2nd step of the cell type mapping pipeline

mitsiask opened this issue

Greetings,

Congratulations on this very useful pipeline.

After completing the first step successfully, I'm stuck in the "Encoding marker genes" step.

The output of python -m cell_type_mapper.cli.marker_cache_from_csv_dir --help indicates that a "Directory containing the lists of marker genes produced by the science team's R code (default=None)" is required.

Is the respective R code available to the community? And if not, can you propose alternatives so I can proceed promptly?

Thanks in advance,
Dimitris

Hi @mitsiask,

The cell_type_mapper.cli.marker_cache_from_csv_dir utility was primarily meant for internal Allen Institute users who need to ingest a new cell type taxonomy into the online MapMyCells app.

May I ask what you are ultimately trying to accomplish? If you want to map new data onto the Yao et al. 2023 Whole Mouse Brain taxonomy, then I think just using the MapMyCells app linked above is the most straightforward way to proceed (unless there is something stopping you...?)

If you want to map unlabeled data onto a different cell type taxonomy, there is a series of commands you can run which I haven't gotten around to fully documenting yet.

Let me know what your goal is and I will try to determine the most direct way forward with this codebase. Thank you very much for your interest.

Hi @danielsf,

Thank you for your response.

My primary intention is to map a new dataset against Yao et al. 2023 Whole Mouse Brain taxonomy, which I have successfully applied using the MapMyCells app, as you proposed.

On top of this analysis, I would like to know:

(a) Which genes are driving the mapping of our unlabeled data onto the Yao et al. 2023 taxonomy? Are these the markers referenced in step 2 of this description?
(b) Additionally, if I want to use a different cell type taxonomy, what steps would I need to follow?

Cheers,
Dimitris

Regarding (a): in the extended output JSON file, which is documented here, there is an element 'marker_genes' which lists, at each level of the taxonomy, which marker genes were used to do the mapping. Note: these are the marker genes that were used to map your specific dataset. There is a lookup table of desired marker genes which MapMyCells uses. Any genes on that list that are missing from your dataset are ignored. The lookup table reported in the output JSON file is what remains after ignoring those missing genes (if any).
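
For anyone who wants to inspect that element programmatically, here is a minimal Python sketch. It runs on a toy dictionary rather than a real MapMyCells output file. The layout it assumes under 'marker_genes' (one list of gene names per taxonomy level) is a guess for illustration, as are the level names, the gene names, and the helper functions markers_by_level and filter_to_dataset. Check the output documentation for the real layout before relying on it.

```python
import json

# Toy stand-in for the extended output JSON file. The key layout under
# "marker_genes" is an ASSUMPTION for illustration, not the documented format.
toy_output = json.loads("""
{
  "marker_genes": {
    "level_A": ["gene_2", "gene_1"],
    "level_B": ["gene_2", "gene_3"]
  }
}
""")


def markers_by_level(output):
    """Return {level: sorted marker genes} from the 'marker_genes' element."""
    return {level: sorted(genes)
            for level, genes in output["marker_genes"].items()}


def filter_to_dataset(desired, dataset_genes):
    """Drop desired markers that are absent from the dataset's gene list."""
    present = set(dataset_genes)
    return {level: [g for g in genes if g in present]
            for level, genes in desired.items()}


used = markers_by_level(toy_output)
for level, genes in used.items():
    print(level, len(genes), genes)

# Hypothetical "desired" lookup table: gene_9 is not in the dataset,
# so it is ignored and only the remaining genes are kept.
desired = {"level_A": ["gene_1", "gene_2", "gene_9"]}
kept = filter_to_dataset(desired, ["gene_1", "gene_2", "gene_3"])
print(kept)
```

With a real file, the toy dictionary would be replaced by json.load on the extended output JSON, and the loop adjusted to whatever structure the file actually contains.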

Regarding (b): there is a lot of discussion going on here at the Allen Institute about using this code base to map unlabeled data onto different taxonomies. It is possible, I just haven't documented it yet. Can I get back to you next week (probably around December 20th) once the documentation is ready?

Hi again,

(a) Thank you for the clarification, things are much clearer now.

(b) Great, I will be waiting for your updates then!

Thanks in advance.

@mitsiask

I have added the first draft of the documentation to this branch

https://github.com/AllenInstitute/cell_type_mapper/tree/rc/1.1.8/231213

(there are also some code changes that might make defining your taxonomy easier). Specifically, I think this page

https://github.com/AllenInstitute/cell_type_mapper/blob/rc/1.1.8/231213/docs/ingesting_new_taxonomies.md

and its links should get you started towards running this code with another taxonomy. Please let me know if anything is unclear about the documentation. You are probably going to be the first user to try to engage with this functionality. I will be slow to respond before January 2024. I will merge this branch into main once I get a sense that I'm not totally speaking nonsense in the documentation.

Thank you for your patience.

The documentation has been merged into main.

Dear Scott,

I finally found some time and I went through your new documentation instructions.

I have successfully created the precompute_stats.h5 file using the cell_type_mapper.cli.precompute_stats_abc.py script, as instructed here.

But when I proceed to create the "marker gene lookup table" using the cell_type_mapper.cli.reference_markers command, as instructed here, I get an error:

python -m cell_type_mapper.cli.reference_markers --precomputed_path_list precomputed_path_list/precompute_stats.h5 --n_processors 20 --tmp_dir temp --output_dir reference_markers

Traceback (most recent call last):
  File "/home/opt/mambaforge/envs/allen_cell_map/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/opt/mambaforge/envs/allen_cell_map/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/tdi/Downloads/cell_type_mapper-main/src/cell_type_mapper/cli/reference_markers.py", line 163, in <module>
    runner = ReferenceMarkerRunner()
  File "/home/opt/mambaforge/envs/allen_cell_map/lib/python3.9/site-packages/argschema/argschema_parser.py", line 160, in __init__
    argsdict = utils.args_to_dict(argsobj, self.schema)
  File "/home/opt/mambaforge/envs/allen_cell_map/lib/python3.9/site-packages/argschema/utils.py", line 138, in args_to_dict
    raise mm.ValidationError(json.dumps(errors, indent=2))
marshmallow.exceptions.ValidationError: {
  "precomputed_path_list": [
    "Command-line argument can't cast to List"
  ]
}

I get the same error if I provide just the path "precomputed_path_list". When I provide the absolute path instead, I get a different error:

Traceback (most recent call last):
  File "/home/opt/mambaforge/envs/allen_cell_map/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/opt/mambaforge/envs/allen_cell_map/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/tdi/Downloads/cell_type_mapper-main/src/cell_type_mapper/cli/reference_markers.py", line 163, in <module>
    runner = ReferenceMarkerRunner()
  File "/home/opt/mambaforge/envs/allen_cell_map/lib/python3.9/site-packages/argschema/argschema_parser.py", line 160, in __init__
    argsdict = utils.args_to_dict(argsobj, self.schema)
  File "/home/opt/mambaforge/envs/allen_cell_map/lib/python3.9/site-packages/argschema/utils.py", line 128, in args_to_dict
    value = get_type_from_field(field_def)(value)
  File "/home/opt/mambaforge/envs/allen_cell_map/lib/python3.9/ast.py", line 62, in literal_eval
    node_or_string = parse(node_or_string, mode='eval')
  File "/home/opt/mambaforge/envs/allen_cell_map/lib/python3.9/ast.py", line 50, in parse
    return compile(source, filename, mode, flags,
  File "<unknown>", line 1
    /lustre/projects/Allen_mapping_tools/precomputed_path_list/precompute_stats.h5
    ^
SyntaxError: invalid syntax

Any help on this will be appreciated.

Cheers,
Dimitris

Sorry. This is something that is unclear in the documentation (one of our internal users recently stumbled over it). precomputed_path_list has to be a string that can be decoded into a list of strings, so it needs to be formatted like

python -m cell_type_mapper.cli.reference_markers \
--precomputed_path_list '["precomputed_path_list/precompute_stats.h5"]' \
--n_processors 20 \
--tmp_dir temp \
--output_dir reference_markers

Note the nested quotation marks.

I'll figure out a way to make the docs a little clearer on this question.
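
For readers who hit the same two errors, the Python sketch below illustrates why a bare path fails and the quoted list works. It relies only on what the second traceback shows, namely that the argument string reaches ast.literal_eval. The helper decode_list_argument is hypothetical and is not part of cell_type_mapper or argschema. Building the argument with json.dumps is one convenient way to get the nesting right; it is an assumption that this suits your setup. It works here because a JSON list of plain strings is also a valid Python literal.

```python
import ast
import json
import shlex


def decode_list_argument(raw):
    """Hypothetical helper: decode a command-line string into a list, or None."""
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # A bare path is not a Python literal, so decoding fails.
        return None
    return value if isinstance(value, list) else None


relative = "precomputed_path_list/precompute_stats.h5"
absolute = ("/lustre/projects/Allen_mapping_tools/"
            "precomputed_path_list/precompute_stats.h5")

print(decode_list_argument(relative))  # bare relative path: not a list
print(decode_list_argument(absolute))  # bare absolute path: not a list

# Encode the list of paths as a string that can be decoded back into a list.
encoded = json.dumps([relative])
print(decode_list_argument(encoded))

# Assemble the command from the post above; shlex.join adds the outer
# single quotes that the shell needs around the encoded list.
command = [
    "python", "-m", "cell_type_mapper.cli.reference_markers",
    "--precomputed_path_list", encoded,
    "--n_processors", "20",
    "--tmp_dir", "temp",
    "--output_dir", "reference_markers",
]
print(shlex.join(command))
```

Passing the command list directly to subprocess.run avoids shell quoting altogether, since each list element is delivered to the program as a single argument.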