spcl / dace

DaCe - Data Centric Parallel Programming

Home Page: http://dace.is/fast

v15 regression: OptionalArrayInference and InlineSDFGs pass fails

FlorianDeconinck opened this issue · comments

NASA's and NOAA's climate model code running on DaCe fails simplify in the OptionalArrayInference and InlineSDFGs passes. Deactivating those two passes still fails at codegen with a similar error.

It seems linked to a deep copy inside dace.sdfg.utils.postdominators.
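The failure mode in the backtrace can be reproduced with the standard library alone. networkx's DiGraph.reverse() deep-copies every edge-data dict, and copy.deepcopy falls back to __reduce_ex__ for objects it has no dispatch for, which is where pickling a PyCapsule fails. In this minimal sketch, datetime.datetime_CAPI (a genuine PyCapsule that CPython's stdlib exposes) stands in for whatever capsule ends up in the SDFG's edge data:

```python
import copy
import datetime

# datetime.datetime_CAPI is a real PyCapsule exposed by CPython's stdlib.
capsule = datetime.datetime_CAPI

# Mirrors what networkx's DiGraph.reverse() does internally: it deep-copies
# every edge-data dict, so any unpicklable value stored in an edge
# attribute (here, a PyCapsule) triggers the TypeError from the trace.
edge_data = {"payload": capsule}

err = None
try:
    copy.deepcopy(edge_data)
except TypeError as e:
    err = e

print(err)  # cannot pickle 'PyCapsule' object
```

deepcopy has no handler for PyCapsule, so it tries `__reduce_ex__(4)` (the copy.py:146 frame in the trace), and that raises the TypeError seen above.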

Backtrace:

dsl/pace/dsl/dace/orchestration.py:505: in __call__
    return wrapped(*arg, **kwarg)
dsl/pace/dsl/dace/orchestration.py:412: in __call__
    return _call_sdfg(
dsl/pace/dsl/dace/orchestration.py:261: in _call_sdfg
    res = _build_sdfg(daceprog, sdfg, config, args, kwargs)
dsl/pace/dsl/dace/orchestration.py:155: in _build_sdfg
    _simplify(sdfg, validate=False, verbose=True)
dsl/pace/dsl/dace/orchestration.py:115: in _simplify
    return SimplifyPass(
external/dace/dace/transformation/passes/simplify.py:106: in apply_pass
    result = super().apply_pass(sdfg, pipeline_results)
external/dace/dace/transformation/pass_pipeline.py:547: in apply_pass
    newret = super().apply_pass(sdfg, state)
external/dace/dace/transformation/pass_pipeline.py:502: in apply_pass
    r = self.apply_subpass(sdfg, p, state)
external/dace/dace/transformation/passes/simplify.py:83: in apply_subpass
    ret = p.apply_pass(sdfg, state)
external/dace/dace/transformation/passes/optional_arrays.py:65: in apply_pass
    for state in self.traverse_unconditional_states(sdfg):
external/dace/dace/transformation/passes/optional_arrays.py:102: in traverse_unconditional_states
    ipostdom = sdutil.postdominators(sdfg)
external/dace/dace/sdfg/utils.py:1541: in postdominators
    ipostdom: Dict[SDFGState, SDFGState] = nx.immediate_dominators(sdfg._nx.reverse(), sink)
.venv/lib/python3.8/site-packages/networkx/classes/digraph.py:1219: in reverse
    H.add_edges_from((v, u, deepcopy(d)) for u, v, d in self.edges(data=True))
.venv/lib/python3.8/site-packages/networkx/classes/digraph.py:676: in add_edges_from
    for e in ebunch_to_add:
.venv/lib/python3.8/site-packages/networkx/classes/digraph.py:1219: in <genexpr>
    H.add_edges_from((v, u, deepcopy(d)) for u, v, d in self.edges(data=True))
/home/fgdeconi/.pyenv/versions/3.8.10/lib/python3.8/copy.py:146: in deepcopy
[...]
E                           TypeError: cannot pickle 'PyCapsule' object

/home/fgdeconi/.pyenv/versions/3.8.10/lib/python3.8/copy.py:161: TypeError

Self-contained reproducer, pulling DaCe v0.15 into pace/external/dace. It fetches the model and executes a small numerical regression test that fails with the above stack trace. The relevant code is referenced in the script's comments.

# Repro runs the FiniteVolumeTransport regression test
# Original code: fv3core/pace/fv3core/stencils/fvtp2d.py
# DaCe is applied to the FiniteVolumeTransport.__call__ function
# The failing DaCe checkout is in "pace/external/dace"

HOME=$PWD

# Get Pace repository
git clone git@github.com:GEOS-ESM/pace
cd pace
git checkout 911368
git submodule update --recursive --init

# Setup the venv
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install external/gt4py/
pip install external/dace/
pip install -r requirements_dev.txt -c constraints.txt
cd external/dace/
git checkout v0.15
cd $HOME/pace

# Download data
mkdir -p test_data
cd test_data
wget https://portal.nccs.nasa.gov/datashare/astg/smt/pace-regression-data/8.1.3_c12_6_ranks_standard.FvTp2d.tar.gz
tar -xzvf 8.1.3_c12_6_ranks_standard.FvTp2d.tar.gz
cd $HOME/pace

# Run test of FvTp2d
export FV3_DACEMODE=BuildAndRun
export PACE_CONSTANTS=GFS
pytest -v -s --data_path=./test_data/8.1.3/c12_6ranks_standard/dycore \
       --backend=dace:cpu --which_modules=FvTp2d --which_rank=0 \
       --threshold_overrides_file=./fv3core/tests/savepoint/translate/overrides/standard.yaml \
       ./fv3core/tests/savepoint

@FlorianDeconinck @alexnick83 After some digging, this is caused by MPIResolver adding new fields to the AST of code that ends up in the SDFG code blocks: c224013

Adding a parent field to AST nodes is dangerous; we should replace it with a dictionary that does not outlive preprocessing, which I have now done in #1446.

Tested 7ea43c3 and confirmed it clears the original deep-copy issue.