mongodb-labs / mongo-arrow

MongoDB integrations for Apache Arrow. Export MongoDB documents to NumPy arrays, Parquet files, and pandas DataFrames in one line of code.

Home Page: https://mongo-arrow.readthedocs.io

Does Schema support the pyarrow.string() type?

ubntelton opened this issue:

  import pyarrow
  from pymongoarrow.api import Schema, find_pandas_all
  from datetime import datetime
  from pymongo import MongoClient

  client = MongoClient()
  client.db.data.delete_many({})
  client.db.data.insert_many([
      {'_id': 1, 'amount': 21, 'mac':'eee', 'last_updated': datetime(2020, 12, 10, 1, 3, 1)},
      {'_id': 2, 'amount': 16,  'mac':'ddd', 'last_updated': datetime(2020, 7, 23, 6, 7, 11)},
      {'_id': 3, 'amount': 3,  'mac':'eeeeee', 'last_updated': datetime(2021, 3, 10, 18, 43, 9)},
      {'_id': 4, 'amount': 0, 'mac':'aaa',  'last_updated': datetime(2021, 2, 25, 3, 50, 31)}])

  schema = Schema({'_id': pyarrow.int32(), 'amount': pyarrow.float64(), 'mac': pyarrow.string(), 'last_updated': datetime})
  df = find_pandas_all(client.db.data, {'amount': {'$gt': 5}}, schema=schema)

    df = find_pandas_all(client.db.data, {'amount': {'$gt': 5}}, schema=schema)
  File "/home/fcdlab/.local/lib/python3.7/site-packages/pymongoarrow/api.py", line 142, in find_pandas_all
    find_arrow_all(collection, query, schema=schema, **kwargs))
  File "/home/fcdlab/.local/lib/python3.7/site-packages/pymongoarrow/api.py", line 59, in find_arrow_all
    schema, codec_options=collection.codec_options)
  File "/home/fcdlab/.local/lib/python3.7/site-packages/pymongoarrow/context.py", line 53, in from_schema
    str_type_map = _get_internal_typemap(schema.typemap)
  File "/home/fcdlab/.local/lib/python3.7/site-packages/pymongoarrow/types.py", line 70, in _get_internal_typemap
    assert len(internal_typemap) == len(typemap)
AssertionError

Hi @ubntelton - thank you for opening this ticket!

Currently, PyMongoArrow doesn't support any variable-width types, but we intend to add support for them in a subsequent release of the project. You can follow our progress by watching this ticket - https://jira.mongodb.org/browse/PYTHON-2783. I also agree that the above error message is very cryptic and does not make it clear that your specified type is unsupported, so I have opened a second ticket to improve it - https://jira.mongodb.org/browse/PYTHON-2785.

Note that if you have numerical data that is stored as a string in your documents, you can still use PyMongoArrow to read this data by using an aggregation that employs one of the following type conversion operators:
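As a minimal sketch of this workaround, assuming the numbers live in a string field named `amount` (a hypothetical layout, not taken from the report above) and using `$toDouble` as one example of a MongoDB type conversion operator, a pipeline could be built and handed to `aggregate_pandas_all` roughly like this. The helper names `numeric_string_pipeline` and `load_amounts` are illustrative and not part of PyMongoArrow.

```python
def numeric_string_pipeline(field):
    """Build a pipeline that overwrites a string field with its double value.

    `field` is an illustrative parameter; this helper is not a PyMongoArrow API.
    """
    # $addFields replaces the field in place, so the schema can declare it
    # as a fixed-width numeric type.
    return [{"$addFields": {field: {"$toDouble": f"${field}"}}}]


def load_amounts(collection):
    """Read the converted field into a DataFrame (needs a running MongoDB)."""
    # Imported lazily so the pipeline builder above works without these packages.
    import pyarrow
    from pymongoarrow.api import Schema, aggregate_pandas_all

    # Only fixed-width types appear in the schema.
    schema = Schema({"_id": pyarrow.int32(), "amount": pyarrow.float64()})
    return aggregate_pandas_all(
        collection, numeric_string_pipeline("amount"), schema=schema
    )
```

Calling `load_amounts(client.db.data)` would then return the converted column as `float64`, provided every stored string parses as a number; `$toDouble` raises a server-side error on values it cannot convert.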

Please let us know if you have any other questions.

@prashantmital On that point, does it support datetime? For example, if I have a date "2021-07-10" stored as a string and the schema reads it as "$todatetime", does that work?

@krishpn datetime is supported natively provided it is stored in MongoDB as a BSON UTC datetime - https://docs.mongodb.com/manual/reference/bson-types/#date.

If your dates are stored as strings, you will need to transform them so that PyMongoArrow can read them as Python datetimes. To do so, you can use an aggregation pipeline with the $dateFromString operator (https://docs.mongodb.com/manual/reference/operator/aggregation/dateFromString/). The schema passed to PyMongoArrow then simply declares the field as datetime, as shown in the tutorial.

Hi @krishpn, did @prashantmital's comment help resolve your issue? Please let us know how things are going!

@prashantmital, thanks for the suggestions. Before asking the question I had tried $dateFromString to convert the datetime format, but I got an error. I can't remember the error anymore, but it was cryptic, and I have not touched it since. Is there a workflow example where it works? @agolin95 fyi

@krishpn We don't have an example in the docs yet, but I am happy to post one here if you can tell me what format your datetimes are in.

@krishpn Since we haven't heard from you in a while, we're closing this out. We've opened https://jira.mongodb.org/browse/PYTHON-2894 to add an example for this use case to our documentation.
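
A minimal sketch of that use case, assuming the dates are stored as "YYYY-MM-DD" strings (like the "2021-07-10" mentioned earlier in the thread) in a field named `last_updated`. The field name, the format default, and the helper names `date_string_pipeline` and `load_dates` are assumptions for illustration and not part of PyMongoArrow.

```python
from datetime import datetime


def date_string_pipeline(field, fmt="%Y-%m-%d"):
    """Build a pipeline that overwrites a date string with a BSON datetime.

    `field` and `fmt` are illustrative parameters; `fmt` uses MongoDB's
    $dateFromString format specifiers.
    """
    convert = {"$dateFromString": {"dateString": f"${field}", "format": fmt}}
    return [{"$addFields": {field: convert}}]


def load_dates(collection):
    """Read the converted dates into a DataFrame (needs a running MongoDB)."""
    # Imported lazily so the pipeline builder above works without PyMongoArrow.
    from pymongoarrow.api import Schema, aggregate_pandas_all

    # The schema simply declares the field as datetime.
    schema = Schema({"last_updated": datetime})
    return aggregate_pandas_all(
        collection, date_string_pipeline("last_updated"), schema=schema
    )
```

If the stored strings use a different layout, only the `fmt` argument should need to change; strings that do not match the format cause `$dateFromString` to raise a server-side error unless its `onError` option is set.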