ibis-project / ibis-substrait

Ibis Substrait Compiler

bug: project rels are not correct

cpcloud opened this issue · comments

According to the substrait spec, project rels always append, and removal of columns is done through emit. Right now, ibis-substrait only keeps the most recently computed expressions around.

We should change the implementation to always keep the columns from the previous relation and drop them implicitly using emit.
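As a rough illustration of the semantics described above (this is not ibis-substrait's actual code), a project relation can be modelled as appending the computed expressions after the input columns, with emit's output mapping then choosing which positions survive. The name `apply_project` and the row-as-list representation are assumptions made for this sketch.

```python
def apply_project(input_row, expressions, output_mapping=None):
    """Model a Substrait-style project: append expressions, then apply emit."""
    # Each expression is evaluated against the input row and appended after it.
    full_row = list(input_row) + [expr(input_row) for expr in expressions]
    if output_mapping is None:
        # No emit: every input column and every new expression is kept.
        return full_row
    # Emit: keep only the listed positions, in the listed order.
    return [full_row[i] for i in output_mapping]


row = [1.0, 2.0]  # columns a, b


def add_ab(r):
    """Expression c = a + b."""
    return r[0] + r[1]


kept_all = apply_project(row, [add_ab])        # columns a, b, c
only_new = apply_project(row, [add_ab], [2])   # column c only
print(kept_all, only_new)
```

Under this model, dropping a column never happens inside the project itself; it only happens because the emit mapping leaves that position out.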

Thanks @cpcloud, does this explain why

t = t.mutate(c=t['a'] + t['b'])

doesn't work either?

Most likely. mutate is just sugar for a project.

I'm working on fixing this up -- should hopefully have a draft up soon.

Hi - I tested this against Acero but still got an error.

I was testing this with https://github.com/wjones127/ibis-substrait/ and it does pick up the fix below but still doesn't work.

The Error I got is:

pyarrow.lib.ArrowInvalid: Invalid column index to add field
from
pyarrow._substrait.run_query

The code I have is something like:

table = table[['a', 'b', 'c']]

Hey @icexelloss -- I'm not sure what's up, but I can run code like that without issue against a recent build of pyarrow. If you have a reproducer I'm happy to take a look.

In [1]: import pyarrow.substrait as ps

In [2]: import pyarrow as pa

In [3]: pa.__version__
Out[3]: '10.0.0.dev998+ge0ca46598'

In [4]: import ibis

In [5]: t = ibis.table(
   ...:     [("a", "float64"), ("b", "str"), ("c", "int64"), ("d", "int64")], name="table0"
   ...: )

In [6]: t = t[["a", "b", "c"]]

In [7]: from ibis_substrait.compiler.core import SubstraitCompiler

In [8]: compiler = SubstraitCompiler()

In [9]: plan = compiler.compile(t)

In [10]: table0 = pa.Table.from_pydict(
    ...:     {
    ...:         "a": [1.1, 2.2, 3.3],
    ...:         "b": ["a", "b", "c"],
    ...:         "c": [4, 5, 6],
    ...:         "d": [0, 0, 0],
    ...:     }
    ...: )
    ...: 
    ...: 
    ...: def table_provider(names):
    ...:     return table0

In [11]: ps.run_query(
    ...:     pa.py_buffer(plan.SerializeToString()), table_provider=table_provider
    ...: ).read_pandas()
Out[11]: 
     a  b  c
0  1.1  a  4
1  2.2  b  5
2  3.3  c  6

In [12]: t = ibis.table(
    ...:     [("a", "float64"), ("b", "str"), ("c", "int64"), ("d", "int64")], name="table0"
    ...: )

In [13]: t = t[["a", "b", "d"]]

In [14]: plan = compiler.compile(t)

In [15]: ps.run_query(
    ...:     pa.py_buffer(plan.SerializeToString()), table_provider=table_provider
    ...: ).read_pandas()
Out[15]: 
     a  b  d
0  1.1  a  0
1  2.2  b  0
2  3.3  c  0

In [16]: ibis.__version__
Out[16]: '3.0.2'

@icexelloss What version of pyarrow are you running against?

I am using pyarrow nightly from 2022-09-29

I can try bumping the pyarrow / arrow version to the latest to see if that helps.

I think that's definitely worth checking -- I know there were some recent upstream fixes in Arrow for substrait stuff.

I think it would be most helpful when posting these to provide the generated substrait plan. That makes it easier to triage whether it's the consumer or the producer that's causing the issue.

Script to produce failing query:

'''Write queries for each '''
import ibis
from ibis_substrait.compiler.core import SubstraitCompiler
from google.protobuf.json_format import MessageToJson

compiler = SubstraitCompiler()

t = ibis.table([("a", "double"), ("b", "int64"), ("c", "double"), ("d", "string")], name="table0")

queries = [
    # simple filter
    # Fails due to nullable literal: https://github.com/apache/arrow/blob/master/cpp/src/arrow/engine/substrait/expression_internal.cc#L299-L302
    # Either of these would fix:
    #  * Issue in Arrow: https://issues.apache.org/jira/browse/ARROW-15540
    #  * Issue in Substrait: https://github.com/ibis-project/ibis-substrait/issues/369
    t[t['a'] == 1],
    # simple column selection
    # Will be fixed by: https://github.com/ibis-project/ibis-substrait/issues/365
    t[['a', 'c']],
    # Column assignment and arithmetic
    # Will be fixed by https://issues.apache.org/jira/browse/ARROW-17994
    t.mutate(v=t['a'] + t['c']),
    # Column assignment and math (not yet mapped in ibis_substrait)
    # https://github.com/ibis-project/ibis-substrait/blob/main/ibis_substrait/compiler/mapping.py
    # https://github.com/ibis-project/ibis-substrait/issues/368
    # Also might not yet be supported in Arrow: https://issues.apache.org/jira/browse/ARROW-15538
    # t.mutate(v=t['a'].pow(2) / t['b'].sqrt()), 

    # Case when: works if project/emit and literals are fixed
    t.mutate(v = ibis.case().when(t['a'] > 1, t['a']).else_(0).end()),

    # ifelse: doesn't work for number
    # t.mutate(valid=(t['a'] > 0).ifelse(1, 0)),

    # ifelse on string
    t.aggregate(valid=(t['d'].like("%test%")).ifelse(1, 0).sum()),
]

substrait_queries = [compiler.compile(query) for query in queries]

for i, query in enumerate(queries):
    substrait = compiler.compile(query)
    query_bytes = substrait.SerializeToString()
    query_json = MessageToJson(substrait)

    with open(f"query-{i}.substrait", "wb") as f:
        f.write(query_bytes)
    
    with open(f"query-{i}.json", "w") as f:
        f.write(query_json)

Failing query: query-1.json

{
  "extensionUris": [
    {
      "extensionUriAnchor": 1
    }
  ],
  "extensions": [
    {
      "extensionFunction": {
        "extensionUriReference": 1,
        "functionAnchor": 1,
        "name": "equal"
      }
    },
    {
      "extensionFunction": {
        "extensionUriReference": 1,
        "functionAnchor": 2,
        "name": "add"
      }
    },
    {
      "extensionFunction": {
        "extensionUriReference": 1,
        "functionAnchor": 3,
        "name": "gt"
      }
    },
    {
      "extensionFunction": {
        "extensionUriReference": 1,
        "functionAnchor": 4,
        "name": "sum"
      }
    },
    {
      "extensionFunction": {
        "extensionUriReference": 1,
        "functionAnchor": 5,
        "name": "like"
      }
    }
  ],
  "relations": [
    {
      "root": {
        "input": {
          "project": {
            "common": {
              "emit": {
                "outputMapping": [
                  4,
                  5
                ]
              }
            },
            "input": {
              "read": {
                "common": {
                  "direct": {}
                },
                "baseSchema": {
                  "names": [
                    "a",
                    "b",
                    "c",
                    "d"
                  ],
                  "struct": {
                    "types": [
                      {
                        "fp64": {
                          "nullability": "NULLABILITY_NULLABLE"
                        }
                      },
                      {
                        "i64": {
                          "nullability": "NULLABILITY_NULLABLE"
                        }
                      },
                      {
                        "fp64": {
                          "nullability": "NULLABILITY_NULLABLE"
                        }
                      },
                      {
                        "string": {
                          "nullability": "NULLABILITY_NULLABLE"
                        }
                      }
                    ],
                    "nullability": "NULLABILITY_REQUIRED"
                  }
                },
                "namedTable": {
                  "names": [
                    "table0"
                  ]
                }
              }
            },
            "expressions": [
              {
                "selection": {
                  "directReference": {
                    "structField": {}
                  },
                  "rootReference": {}
                }
              },
              {
                "selection": {
                  "directReference": {
                    "structField": {
                      "field": 2
                    }
                  },
                  "rootReference": {}
                }
              }
            ]
          }
        },
        "names": [
          "a",
          "c"
        ]
      }
    }
  ]
}

Script to try queries in Acero:

"""Verify substrait queries can be executed"""
import pyarrow as pa
import pyarrow.substrait
import glob

def table_provider(names):
    if not names:
        raise Exception("No names provided")
    elif names[0] == 'table0':
        return test_table_0
    else:
        raise Exception(f"Unknown table name {names}")

test_table_0 = pa.Table.from_pydict({"a": [1, 2, 3], "b": [4, 5, 6], "c": [0, 2, 4]})


# for path in glob.glob("*.substrait"):
#     with open(path, "rb") as f:
#         query_bytes = f.read()
for path in glob.glob("*.json"):
    with open(path, "rb") as f:
        query_json = f.read()
        try:
            query_bytes = pa._substrait._parse_json_plan(query_json)
            pa.substrait.run_query(pa.py_buffer(query_bytes), table_provider)
            # pa.substrait.run_query(pa.py_buffer(query_bytes), table_provider)
        except Exception as e:
            print(f"Failed query {path} with exception: {str(e)}")

Failure I see (on master branch of Arrow):

Failed query query-1.json with exception: No match for FieldRef.FieldPath(3) in a: int64
b: int64
c: int64
/Users/willjones/Documents/arrows/arrow/cpp/src/arrow/type.h:1796  CheckNonEmpty(matches, root)
/Users/willjones/Documents/arrows/arrow/cpp/src/arrow/compute/exec/expression.cc:437  ref->FindOne(in)
/Users/willjones/Documents/arrows/arrow/cpp/src/arrow/compute/exec/project_node.cc:68  expr.Bind(*inputs[0]->output_schema(), plan->exec_context())
/Users/willjones/Documents/arrows/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:535  MakeExecNode(this->factory_name, plan, std::move(inputs), *this->options, registry)
/Users/willjones/Documents/arrows/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:530  std::get<Declaration>(input).AddToPlan(plan, registry)
/Users/willjones/Documents/arrows/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:530  std::get<Declaration>(input).AddToPlan(plan, registry)
/Users/willjones/Documents/arrows/arrow/cpp/src/arrow/engine/substrait/util.cc:80  decl.AddToPlan(plan_.get()).status()
/Users/willjones/Documents/arrows/arrow/cpp/src/arrow/engine/substrait/util.cc:131  executor.Execute()

I think it would be most helpful when posting these to provide the generated substrait plan. That makes it easier to triage whether it's the consumer or the producer that's causing the issue.

Definitely!

test_table_0 = pa.Table.from_pydict({"a": [1, 2, 3], "b": [4, 5, 6], "c": [0, 2, 4]})

t = ibis.table([("a", "double"), ("b", "int64"), ("c", "double"), ("d", "string")], name="table0")

There's no column d here to match the column d in the Ibis unbound table

Ah, you are correct. After fixing that the query runs fine 👍
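One way to catch this kind of mismatch before calling run_query is to compare the column names declared in the plan's base schema with the columns of the table the provider returns. The helper below is only a sketch; `missing_columns` is a hypothetical name, and the two name lists are copied by hand from the script above.

```python
def missing_columns(plan_names, table_names):
    """Return the columns the plan expects that the provided table lacks."""
    provided = set(table_names)
    return [name for name in plan_names if name not in provided]


plan_names = ["a", "b", "c", "d"]  # baseSchema names from the generated plan
table_names = ["a", "b", "c"]      # columns of test_table_0 as first posted

print(missing_columns(plan_names, table_names))
```

With a real pyarrow table, the second argument could be taken from the table's `column_names` attribute instead of being typed out.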

So I updated pyarrow to the 2022-10-20 nightly build and I can run this code now!

However, I observed a correctness issue - when I did

t = t[['a', 'b','c']]
and
t = t[['a', 'b', 'd']]

It gave me the same values for c and d. I checked the substrait_plan and the direct reference field seems correct (2 in the first case and 3 in the second case), but run_query still gives me the same result for these two queries. I wonder if this could be a bug in the Acero consumer? (I am using a custom named_table.)

@icexelloss Could you post the Substrait JSON?

Here are the two plans that give the same result:

extension_uris {
  extension_uri_anchor: 1
}
relations {
  root {
    input {
      project {
        common {
          emit {
            output_mapping: 0
            output_mapping: 1
            output_mapping: 2
          }
        }
        input {
          read {
            base_schema {
              names: "time"
              names: "tid"
              names: "SQRT_DOLLAR_VOLUME"
              names: "ANNUAL_VARIANCE"
              names: "ANNUAL_RESIDUAL_VARIANCE"
              names: "RAW_PREV_63DAY_LAGGING"
              names: "IND_PREV_63DAY_LAGGING"
              names: "MKT_PREV_63DAY_LAGGING"
              names: "SEC_PREV_63DAY_LAGGING"
              names: "RES_PREV_63DAY_LAGGING"
              names: "ZMR_PREV_63DAY_LAGGING"
              struct {
                types {
                  timestamp_tz {
                    nullability: NULLABILITY_REQUIRED
                  }
                }
                types {
                  i32 {
                    nullability: NULLABILITY_NULLABLE
                  }
                }
                types {
                  fp64 {
                    nullability: NULLABILITY_NULLABLE
                  }
                }
                types {
                  fp64 {
                    nullability: NULLABILITY_NULLABLE
                  }
                }
                types {
                  fp64 {
                    nullability: NULLABILITY_NULLABLE
                  }
                }
                types {
                  fp64 {
                    nullability: NULLABILITY_NULLABLE
                  }
                }
                types {
                  fp64 {
                    nullability: NULLABILITY_NULLABLE
                  }
                }
                types {
                  fp64 {
                    nullability: NULLABILITY_NULLABLE
                  }
                }
                types {
                  fp64 {
                    nullability: NULLABILITY_NULLABLE
                  }
                }
                types {
                  fp64 {
                    nullability: NULLABILITY_NULLABLE
                  }
                }
                types {
                  fp64 {
                    nullability: NULLABILITY_NULLABLE
                  }
                }
                nullability: NULLABILITY_REQUIRED
              }
            }
            named_table {
              names: "smooth:/research/user/ljin/test?begin=20100101&end=20100110"
            }
          }
        }
        expressions {
          selection {
            direct_reference {
              struct_field {
              }
            }
            root_reference {
            }
          }
        }
        expressions {
          selection {
            direct_reference {
              struct_field {
                field: 1
              }
            }
            root_reference {
            }
          }
        }
        expressions {
          selection {
            direct_reference {
              struct_field {
                field: 3
              }
            }
            root_reference {
            }
          }
        }
      }
    }
    names: "time"
    names: "tid"
    names: "ANNUAL_VARIANCE"
  }
}

And

extension_uris {
  extension_uri_anchor: 1
}
relations {
  root {
    input {
      project {
        common {
          emit {
            output_mapping: 0
            output_mapping: 1
            output_mapping: 2
          }
        }
        input {
          read {
            base_schema {
              names: "time"
              names: "tid"
              names: "SQRT_DOLLAR_VOLUME"
              names: "ANNUAL_VARIANCE"
              names: "ANNUAL_RESIDUAL_VARIANCE"
              names: "RAW_PREV_63DAY_LAGGING"
              names: "IND_PREV_63DAY_LAGGING"
              names: "MKT_PREV_63DAY_LAGGING"
              names: "SEC_PREV_63DAY_LAGGING"
              names: "RES_PREV_63DAY_LAGGING"
              names: "ZMR_PREV_63DAY_LAGGING"
              struct {
                types {
                  timestamp_tz {
                    nullability: NULLABILITY_REQUIRED
                  }
                }
                types {
                  i32 {
                    nullability: NULLABILITY_NULLABLE
                  }
                }
                types {
                  fp64 {
                    nullability: NULLABILITY_NULLABLE
                  }
                }
                types {
                  fp64 {
                    nullability: NULLABILITY_NULLABLE
                  }
                }
                types {
                  fp64 {
                    nullability: NULLABILITY_NULLABLE
                  }
                }
                types {
                  fp64 {
                    nullability: NULLABILITY_NULLABLE
                  }
                }
                types {
                  fp64 {
                    nullability: NULLABILITY_NULLABLE
                  }
                }
                types {
                  fp64 {
                    nullability: NULLABILITY_NULLABLE
                  }
                }
                types {
                  fp64 {
                    nullability: NULLABILITY_NULLABLE
                  }
                }
                types {
                  fp64 {
                    nullability: NULLABILITY_NULLABLE
                  }
                }
                types {
                  fp64 {
                    nullability: NULLABILITY_NULLABLE
                  }
                }
                nullability: NULLABILITY_REQUIRED
              }
            }
            named_table {
              names: "smooth:/research/user/ljin/test?begin=20100101&end=20100110"
            }
          }
        }
        expressions {
          selection {
            direct_reference {
              struct_field {
              }
            }
            root_reference {
            }
          }
        }
        expressions {
          selection {
            direct_reference {
              struct_field {
                field: 1
              }
            }
            root_reference {
            }
          }
        }
        expressions {
          selection {
            direct_reference {
              struct_field {
                field: 2
              }
            }
            root_reference {
            }
          }
        }
      }
    }
    names: "time"
    names: "tid"
    names: "SQRT_DOLLAR_VOLUME"
  }
}

I noticed the first field is empty - I don't know if this is an issue or not:

selection {
  direct_reference {
    struct_field {
    }
  }
  root_reference {
  }
}

Sorry, I don't have a fully reproducible script because this is connected to our internal data source systems.

That output_mapping is wrong -- it shouldn't be 0, 1, 2 for either of those plans. What's the ibis code that generated these?

What is output_mapping generated from?

And what is the correct value for output_mapping? I can maybe try to trace it internally.

It's supposed to be a zero-indexed reference to the output columns, but also including the input columns, so we count the number of columns in the unbound tables that the expression is built on.

For three output columns, with the 11 columns in your base schema, the output_mapping should be [11, 12, 13]

The reason your second example works is just the luck that the first 3 columns in the input table are also the three you are selecting (via indices [0, 1, 2])
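To make the arithmetic above concrete, here is a small sketch using the 11 column names from the posted plans. The function names are hypothetical and the model is deliberately simplified (names only, no data); it is not taken from ibis-substrait.

```python
INPUT_NAMES = [
    "time", "tid", "SQRT_DOLLAR_VOLUME", "ANNUAL_VARIANCE",
    "ANNUAL_RESIDUAL_VARIANCE", "RAW_PREV_63DAY_LAGGING",
    "IND_PREV_63DAY_LAGGING", "MKT_PREV_63DAY_LAGGING",
    "SEC_PREV_63DAY_LAGGING", "RES_PREV_63DAY_LAGGING",
    "ZMR_PREV_63DAY_LAGGING",
]


def expected_output_mapping(num_input_columns, num_expressions):
    """Indices of the projected expressions, counted after the input columns."""
    return list(range(num_input_columns, num_input_columns + num_expressions))


def emitted_names(input_names, expression_names, output_mapping):
    """Names a consumer would emit for a given output mapping."""
    combined = list(input_names) + list(expression_names)
    return [combined[i] for i in output_mapping]


selected = ["time", "tid", "ANNUAL_VARIANCE"]  # the first posted plan

good = expected_output_mapping(len(INPUT_NAMES), len(selected))
print(good)
print(emitted_names(INPUT_NAMES, selected, good))
print(emitted_names(INPUT_NAMES, selected, [0, 1, 2]))
```

With the mapping [0, 1, 2], the emitted columns are simply the first three input columns, which is why both plans produced the same result regardless of which third column was selected.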

I see, can you point me to where output_mapping is computed?

I traced a bit deeper. Here is what I found, and I wonder what is wrong.
In this function:
https://github.com/ibis-project/ibis-substrait/blob/main/ibis_substrait/compiler/translate.py#L697

child_rel_field_offsets is {}
relation.project is None (so it didn't enter the branch in https://github.com/ibis-project/ibis-substrait/blob/main/ibis_substrait/compiler/translate.py#L746)
mapping_counter is 0

I found that our internal ibis-substrait rule to translate the custom named table is missing this line:
https://github.com/ibis-project/ibis-substrait/blob/main/ibis_substrait/compiler/translate.py#L657

So I added it, but it doesn't seem to make a difference - relation.project is still None

Sorry, my comment about relation.project is wrong - It is not None but relation.project.common.ListFields() returns an empty list

The ListFields() call should return an empty list unless you have nested projections

Hmm. Ok so where should the mapping_counter be increased then if ListFields() is empty?

mapping_counter should be set to a value equal to the number of columns in your underlying table.
Does your table object not have a schema attribute?

Oh I know what the issue is - my ibis node class is not a subclass of "UnboundTable"

Yep - our node is only a subclass of _ops.TableNode and not UnboundTable

Yeah, I think the issue is that our data source node doesn't inherit from the ibis UnboundTable (I think originally because "name" doesn't quite make sense for all of our internal data source nodes, and we override the schema to be a property for reasons I don't remember).

Is there a way to make this code more flexible so it can handle custom source nodes in addition to UnboundTable? Maybe provide a hook so that I can register my table class as an "unbound table", or provide a trait that I can inherit from?
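For discussion purposes, a registration hook of the kind asked about could look roughly like the sketch below. Everything here is hypothetical: `register_source_table`, `is_source_table`, `base_column_count` and the two stand-in classes are invented for illustration and are not part of ibis or ibis-substrait.

```python
_SOURCE_TABLE_TYPES = []


def register_source_table(cls):
    """Class decorator: treat instances of cls as leaf (source) tables."""
    _SOURCE_TABLE_TYPES.append(cls)
    return cls


def is_source_table(node):
    """True if the node's class (or a parent class) has been registered."""
    return isinstance(node, tuple(_SOURCE_TABLE_TYPES))


def base_column_count(node):
    """Number of input columns, used as the starting emit offset."""
    if not is_source_table(node):
        raise TypeError(f"{type(node).__name__} is not a registered source table")
    return len(node.column_names)


@register_source_table
class UnboundTableStandIn:
    """Stand-in for a built-in unbound table (not the real ibis class)."""

    def __init__(self, column_names):
        self.column_names = list(column_names)


@register_source_table
class CustomSource:
    """Stand-in for a user-defined data source node."""

    def __init__(self, column_names):
        self.column_names = list(column_names)


print(base_column_count(CustomSource(["time", "tid", "ANNUAL_VARIANCE"])))
```

The point of the design is that the compiler would ask "is this a registered source table?" rather than checking for one specific class, so a custom node would get the same column-offset handling without having to subclass UnboundTable.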

Let me open a new issue for this because I think this is out of the scope of the original ticket.

I opened this ticket to discuss options for supporting custom data sources: #393