Does Schema support the pyarrow.string() type?
ubntelton opened this issue · comments
import pyarrow
from pymongoarrow.api import Schema, find_pandas_all
from datetime import datetime
from pymongo import MongoClient
client = MongoClient()
client.db.data.delete_many({})
client.db.data.insert_many([
{'_id': 1, 'amount': 21, 'mac':'eee', 'last_updated': datetime(2020, 12, 10, 1, 3, 1)},
{'_id': 2, 'amount': 16, 'mac':'ddd', 'last_updated': datetime(2020, 7, 23, 6, 7, 11)},
{'_id': 3, 'amount': 3, 'mac':'eeeeee', 'last_updated': datetime(2021, 3, 10, 18, 43, 9)},
{'_id': 4, 'amount': 0, 'mac':'aaa', 'last_updated': datetime(2021, 2, 25, 3, 50, 31)}])
schema = Schema({'_id': pyarrow.int32(), 'amount': pyarrow.float64(), 'mac': pyarrow.string(), 'last_updated': datetime})
df = find_pandas_all(client.db.data, {'amount': {'$gt': 5}}, schema=schema)
Schema({'_id': pyarrow.int32(), 'amount': pyarrow.float64(), 'mac': pyarrow.string(), 'last_updated': datetime})
df = find_pandas_all(client.db.data, {'amount': {'$gt': 5}}, schema=schema)
File "/home/fcdlab/.local/lib/python3.7/site-packages/pymongoarrow/api.py", line 142, in find_pandas_all
find_arrow_all(collection, query, schema=schema, **kwargs))
File "/home/fcdlab/.local/lib/python3.7/site-packages/pymongoarrow/api.py", line 59, in find_arrow_all
schema, codec_options=collection.codec_options)
File "/home/fcdlab/.local/lib/python3.7/site-packages/pymongoarrow/context.py", line 53, in from_schema
str_type_map = _get_internal_typemap(schema.typemap)
File "/home/fcdlab/.local/lib/python3.7/site-packages/pymongoarrow/types.py", line 70, in _get_internal_typemap
assert len(internal_typemap) == len(typemap)
AssertionError
Hi @ubntelton - thank you for opening this ticket!
Currently, PyMongoArrow doesn't support any variable-width types, but we intend to add support for them in a subsequent release of the project. You can follow our progress by watching this ticket - https://jira.mongodb.org/browse/PYTHON-2783. Also, I agree that the above error message is very cryptic and does not make it clear that your specified type is unsupported, so I have also opened a ticket to improve this - https://jira.mongodb.org/browse/PYTHON-2785.
Note that if you have numerical data that is stored as a string in your documents, you can still use PyMongoArrow to read this data by using an aggregation that employs one of the following type conversion operators:
Please let us know if you have any other questions.
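As a rough sketch of the workaround described above (not an official example): assuming a hypothetical field `amount_str` that stores numbers as strings, a `$toDouble` stage can convert it on the server before PyMongoArrow reads the result. The field name and the `load_amounts` helper are made up for illustration; `aggregate_pandas_all` is PyMongoArrow's aggregation counterpart to `find_pandas_all`.

```python
# Hypothetical field name: 'amount_str' holds values such as "21.5" as strings.
amount_pipeline = [
    # Only consider documents that actually have the string field.
    {'$match': {'amount_str': {'$exists': True}}},
    # $toDouble (MongoDB 4.0+) converts the string to a BSON double server-side.
    {'$project': {'_id': 1, 'amount': {'$toDouble': '$amount_str'}}},
]


def load_amounts(collection):
    """Run the pipeline and return a pandas DataFrame (needs a live MongoDB)."""
    # Imported lazily so the pipeline above can be inspected without the library.
    from pymongoarrow.api import Schema, aggregate_pandas_all

    # The schema describes the pipeline's output, so 'amount' is already numeric.
    schema = Schema({'_id': int, 'amount': float})
    return aggregate_pandas_all(collection, amount_pipeline, schema=schema)
```

Usage would look like `df = load_amounts(client.db.data)`. The schema only ever sees fixed-width types because the conversion happens inside the aggregation.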
@prashantmital On that point, does it support datetime? For example, if I have a date "2021-07-10" stored as a string and the schema reads it as "$todatetime", does that work?
@krishpn datetime is supported natively provided it is stored in MongoDB as a BSON UTC datetime - https://docs.mongodb.com/manual/reference/bson-types/#date.
If your dates are stored as strings, then you will need to transform them in order for them to be read as Python datetimes via PyMongoArrow. To do so, you can use an aggregation pipeline with the $dateFromString operator (https://docs.mongodb.com/manual/reference/operator/aggregation/dateFromString/). The schema passed to PyMongoArrow will simply read datetime, as shown in the tutorial.
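A minimal sketch of that approach, under stated assumptions: the dates live in a hypothetical string field `last_updated_str` and use the "2021-07-10" layout mentioned earlier in this thread. The `load_dates` helper is illustrative only and has not been run against a real deployment.

```python
from datetime import datetime

# Assumed layout of the stored strings, e.g. "2021-07-10".
DATE_FORMAT = '%Y-%m-%d'

date_pipeline = [
    {'$addFields': {
        # $dateFromString parses the string into a BSON UTC datetime server-side.
        # The 'format' option requires MongoDB 4.0+.
        'last_updated': {
            '$dateFromString': {
                'dateString': '$last_updated_str',  # hypothetical field name
                'format': DATE_FORMAT,
            }
        }
    }},
    # Keep only the fields that the schema below describes.
    {'$project': {'_id': 1, 'last_updated': 1}},
]


def load_dates(collection):
    """Run the pipeline and return a pandas DataFrame (needs a live MongoDB)."""
    from pymongoarrow.api import Schema, aggregate_pandas_all

    # The schema simply reads datetime, because the pipeline output is a real date.
    schema = Schema({'_id': int, 'last_updated': datetime})
    return aggregate_pandas_all(collection, date_pipeline, schema=schema)


# Local sanity check of the format string; Python's strptime uses the same
# %Y, %m and %d specifiers for this layout.
parsed = datetime.strptime('2021-07-10', DATE_FORMAT)
```

If the stored strings use a different layout, only `DATE_FORMAT` should need to change.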
Hi @krishpn, did @prashantmital's comment help resolve your issue? Please let us know how things are going!
@prashantmital, thanks for the suggestions. Before asking the question, I had tried $dateFromString to convert the datetime format, but I got an error. I can't remember the error anymore, but it was cryptic. I have not touched it since then. Is there a workflow example where it works? @agolin95 fyi
@krishpn We don't have an example in the docs yet, but I am happy to post one here if you can tell me what format your datetimes are in.
@krishpn Since we haven't heard from you in a while, we're closing this out. We've opened https://jira.mongodb.org/browse/PYTHON-2894 to add an example for this use case to our documentation.