mongodb-labs / mongo-arrow

MongoDB integrations for Apache Arrow. Export MongoDB documents to numpy array, parquet files, and pandas dataframes in one line of code.

Home Page: https://mongo-arrow.readthedocs.io

AttributeError: 'pyarrow.lib.DataType' object has no attribute '_type_marker'

noname77 opened this issue · comments

commented

Hi,

I'm trying to load a list of nested objects (a list of structs in pyarrow). I tried both pymongoarrow 0.7.0 and 25a8832, and both fail with AttributeError: 'pyarrow.lib.DataType' object has no attribute '_type_marker'.
In case it matters, I am installing with pip.

EDIT: the issue turned out to be a list of float32 within a struct. Once I changed the schema to use float64, things worked as expected.
RTFM more carefully if you don't want to waste hours 🤦

It would be nice if the library failed earlier with a more human-friendly message, such as "pa.float32 type is not supported".

Here is my original issue.

Among other things (simple types) that work as expected, here is the (simplified) object that gets parsed correctly (a list of structs containing simple types only):

{
    "_id" : ObjectId("someId"),
    "parent" : {
        "child" : [ 
            {
                "fieldA" : "valueA"
            }, 
            {
                "fieldA" : "valueB"
            }
        ]
    }
}

Note that I am escaping the dots with underscores by projecting the nested fields in an aggregation pipeline, so my actual input to pymongoarrow looks like:

{
    "_id" : ObjectId("someId"),
    "parent_child" :  [ 
        {
            "fieldA" : "valueA"
        }, 
        {
            "fieldA" : "valueB"
        }
    ]
}
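The projection described above could be built along these lines. This is only a sketch of one way to do it, since the actual pipeline is not shown in the issue; `flatten_projection` is a hypothetical helper, and it relies on the standard behaviour that `$project` keeps `_id` unless it is explicitly excluded.

```python
def flatten_projection(paths, sep="_"):
    """Build a $project stage exposing dotted paths under dot-free names.

    Hypothetical helper: "parent.child" becomes the top-level field
    "parent_child", whose value is taken from "$parent.child".
    """
    projection = {}
    for path in paths:
        alias = path.replace(".", sep)
        if alias in projection:
            # Two different paths would collapse onto the same output name.
            raise ValueError(f"alias collision for {alias!r}")
        projection[alias] = "$" + path
    return {"$project": projection}


pipeline = [flatten_projection(["parent.child"])]
print(pipeline)
# [{'$project': {'parent_child': '$parent.child'}}]
```

The resulting `pipeline` is what would be passed as the `pipeline` argument of `aggregate_pandas_all`, together with a schema keyed by the same underscore names.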

And here is the corresponding schema definition:

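import pyarrow as pa
from bson import ObjectId
from pymongoarrow.api import Schema

# Each element of "parent_child" is a struct with a single string field.
parent_child_fields = [
    pa.field("fieldA", pa.string())
]
parent_child_schema = pa.list_(pa.struct(parent_child_fields))

schema_dict = {
    "_id": ObjectId,
    "parent_child": parent_child_schema
}

schema = Schema(schema_dict)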

Finally, note that this is a simplified example. In reality I have more fields (some of them nested) that I would like to include in parent_child_schema as nested structs.

Is this scenario supported?
What am I missing?

Here is the full trace:
File /opt/venv/src/pymongoarrow/bindings/python/pymongoarrow/api.py:193, in aggregate_pandas_all(collection, pipeline, schema, **kwargs)
    175 def aggregate_pandas_all(collection, pipeline, *, schema=None, **kwargs):
    176     """Method that returns the results of an aggregation pipeline as a
    177     :class:`pandas.DataFrame` instance.
    178 
   (...)
    191       An instance of class:`pandas.DataFrame`.
    192     """
--> 193     return _arrow_to_pandas(aggregate_arrow_all(collection, pipeline, schema=schema, **kwargs))

File /opt/venv/src/pymongoarrow/bindings/python/pymongoarrow/api.py:118, in aggregate_arrow_all(collection, pipeline, schema, **kwargs)
    100 def aggregate_arrow_all(collection, pipeline, *, schema=None, **kwargs):
    101     """Method that returns the results of an aggregation pipeline as a
    102     :class:`pyarrow.Table` instance.
    103 
   (...)
    116       An instance of class:`pyarrow.Table`.
    117     """
--> 118     context = PyMongoArrowContext.from_schema(schema, codec_options=collection.codec_options)
    120     if pipeline and ("$out" in pipeline[-1] or "$merge" in pipeline[-1]):
    121         raise ValueError(
    122             "Aggregation pipelines containing a '$out' or '$merge' stage are "
    123             "not supported by PyMongoArrow"
    124         )

File /opt/venv/src/pymongoarrow/bindings/python/pymongoarrow/context.py:97, in PyMongoArrowContext.from_schema(cls, schema, codec_options)
     95 elif builder_cls == ListBuilder:
     96     arrow_type = schema.typemap[fname]
---> 97     builder_map[encoded_fname] = ListBuilder(arrow_type, tzinfo)
     98 elif builder_cls == BinaryBuilder:
     99     subtype = schema.typemap[fname].subtype

File pymongoarrow/lib.pyx:806, in pymongoarrow.lib.ListBuilder.__cinit__()

File pymongoarrow/lib.pyx:716, in pymongoarrow.lib.get_field_builder()

File pymongoarrow/lib.pyx:750, in pymongoarrow.lib.DocumentBuilder.__cinit__()

File pymongoarrow/lib.pyx:716, in pymongoarrow.lib.get_field_builder()

File pymongoarrow/lib.pyx:750, in pymongoarrow.lib.DocumentBuilder.__cinit__()

File pymongoarrow/lib.pyx:716, in pymongoarrow.lib.get_field_builder()

File pymongoarrow/lib.pyx:750, in pymongoarrow.lib.DocumentBuilder.__cinit__()

File pymongoarrow/lib.pyx:718, in pymongoarrow.lib.get_field_builder()

File pymongoarrow/lib.pyx:806, in pymongoarrow.lib.ListBuilder.__cinit__()

File pymongoarrow/lib.pyx:719, in pymongoarrow.lib.get_field_builder()

AttributeError: 'pyarrow.lib.DataType' object has no attribute '_type_marker'

Thank you,
wiktor

commented

The issue turned out to be a list of float32 within a struct. Once I changed the schema to use float64, things worked as expected.