AttributeError: 'pyarrow.lib.DataType' object has no attribute '_type_marker'
noname77 opened this issue · comments
Hi,
I'm trying to load a list of nested objects (a list of structs in pyarrow). I tried both pymongoarrow 0.7.0 and 25a8832, and both fail with AttributeError: 'pyarrow.lib.DataType' object has no attribute '_type_marker'
In case it matters, I am installing with pip
EDIT: the issue turned out to be a list of float32 within a struct; once I changed the schema to use float64, things worked as expected.
rtfm with more attention if you don't want to waste hours 🤦
it would be nice if the library failed earlier with a more human-friendly message, such as "pa.float32 type is not supported"
here is my original issue
among other things (simple types) that work as expected, here is a (simplified) object that gets parsed correctly (a list of structs containing simple types only)
{
    "_id" : ObjectId("someId"),
    "parent" : {
        "child" : [
            {
                "fieldA" : "valueA"
            },
            {
                "fieldA" : "valueB"
            }
        ]
    }
}
note that I am escaping the dots with underscores by projecting the nested fields in an aggregation pipeline, so that my actual input to pymongoarrow looks like:
{
    "_id" : ObjectId("someId"),
    "parent_child" : [
        {
            "fieldA" : "valueA"
        },
        {
            "fieldA" : "valueB"
        }
    ]
}
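The projection described above is not shown in the issue, so the following is only a guess at its shape: a `$project` stage that maps each dotted path to an underscore-separated top-level name. `flatten_projection` is a hypothetical helper, not a pymongoarrow or PyMongo function.

```python
def flatten_projection(dotted_paths):
    """Build a pipeline whose $project stage renames 'a.b' to 'a_b'."""
    stage = {"_id": 1}
    for dotted in dotted_paths:
        # "$parent.child" is a field-path expression that reads the nested value.
        stage[dotted.replace(".", "_")] = "$" + dotted
    return [{"$project": stage}]


pipeline = flatten_projection(["parent.child"])
```

The resulting `pipeline` is what would be passed to `aggregate_pandas_all(collection, pipeline, schema=schema)`.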
and here is the corresponding schema definition
import pyarrow as pa
from bson import ObjectId
from pymongoarrow.api import Schema

parent_child_fields = [
    pa.field("fieldA", pa.string())
]
parent_child_schema = pa.list_(pa.struct(parent_child_fields))
schema_dict = {
    "_id": ObjectId,
    "parent_child": parent_child_schema
}
schema = Schema(schema_dict)
finally, note that this is a simplified example; in reality I have more fields (also nested) that I would like to include in the parent_child_schema as nested structs
is this scenario supported?
what am I missing?
here is a full trace
File /opt/venv/src/pymongoarrow/bindings/python/pymongoarrow/api.py:193, in aggregate_pandas_all(collection, pipeline, schema, **kwargs)
175 def aggregate_pandas_all(collection, pipeline, *, schema=None, **kwargs):
176 """Method that returns the results of an aggregation pipeline as a
177 :class:`pandas.DataFrame` instance.
178
(...)
191 An instance of class:`pandas.DataFrame`.
192 """
--> 193 return _arrow_to_pandas(aggregate_arrow_all(collection, pipeline, schema=schema, **kwargs))
File /opt/venv/src/pymongoarrow/bindings/python/pymongoarrow/api.py:118, in aggregate_arrow_all(collection, pipeline, schema, **kwargs)
100 def aggregate_arrow_all(collection, pipeline, *, schema=None, **kwargs):
101 """Method that returns the results of an aggregation pipeline as a
102 :class:`pyarrow.Table` instance.
103
(...)
116 An instance of class:`pyarrow.Table`.
117 """
--> 118 context = PyMongoArrowContext.from_schema(schema, codec_options=collection.codec_options)
120 if pipeline and ("$out" in pipeline[-1] or "$merge" in pipeline[-1]):
121 raise ValueError(
122 "Aggregation pipelines containing a '$out' or '$merge' stage are "
123 "not supported by PyMongoArrow"
124 )
File /opt/venv/src/pymongoarrow/bindings/python/pymongoarrow/context.py:97, in PyMongoArrowContext.from_schema(cls, schema, codec_options)
95 elif builder_cls == ListBuilder:
96 arrow_type = schema.typemap[fname]
---> 97 builder_map[encoded_fname] = ListBuilder(arrow_type, tzinfo)
98 elif builder_cls == BinaryBuilder:
99 subtype = schema.typemap[fname].subtype
File pymongoarrow/lib.pyx:806, in pymongoarrow.lib.ListBuilder.__cinit__()
File pymongoarrow/lib.pyx:716, in pymongoarrow.lib.get_field_builder()
File pymongoarrow/lib.pyx:750, in pymongoarrow.lib.DocumentBuilder.__cinit__()
File pymongoarrow/lib.pyx:716, in pymongoarrow.lib.get_field_builder()
File pymongoarrow/lib.pyx:750, in pymongoarrow.lib.DocumentBuilder.__cinit__()
File pymongoarrow/lib.pyx:716, in pymongoarrow.lib.get_field_builder()
File pymongoarrow/lib.pyx:750, in pymongoarrow.lib.DocumentBuilder.__cinit__()
File pymongoarrow/lib.pyx:718, in pymongoarrow.lib.get_field_builder()
File pymongoarrow/lib.pyx:806, in pymongoarrow.lib.ListBuilder.__cinit__()
File pymongoarrow/lib.pyx:719, in pymongoarrow.lib.get_field_builder()
AttributeError: 'pyarrow.lib.DataType' object has no attribute '_type_marker'
thank you,
wiktor
the issue turned out to be a list of float32 within a struct; once I changed the schema to use float64, things work as expected