chdb-io / chdb

chDB is an in-process OLAP SQL Engine 🚀 powered by ClickHouse

Home Page:https://clickhouse.com/docs/en/chdb

Cannot read parquet files from S3 using "*.parquet"

neiblegy opened this issue

I have 41 Parquet files stored in S3 and need to run this SQL against them:
chdb.query(f"select ais_image_path from s3('http://ENDPOINT_URL/BUCKET/KEY_PREFIX/*.parquet', 'USER', 'PWD', Parquet) where ais_image_path = '{path}'", 'Dataframe')

I then got this error:

Code: 636. DB::Exception: Cannot extract table structure from Parquet format file, because there are no files with provided path in S3 or all files are empty. You must specify table structure manually: Cannot extract table structure from Parquet format file. You can specify the structure manually. (CANNOT_EXTRACT_TABLE_STRUCTURE)

I'm sure that all the Parquet files are at the path I gave, and these files are handled correctly when they are local files.

I changed "Dataframe" to "Debug" and got this traceback:

2023.12.04 18:20:42.599295 [ 181933 ] {} <Debug> Application: Working directory created: /tmp/clickhouse-local-181933-1701685242-3725016050319400010
Setting up /tmp/clickhouse-local-181933-1701685242-3725016050319400010/tmp/ to store temporary data in it
Added users_xml access storage 'users_xml', path:
00000000-0000-0000-0000-00000002c6ad Authenticating user 'default' from 127.0.0.1:0
00000000-0000-0000-0000-00000002c6ad Authenticated with global context as user 94309d50-4f52-5250-31bd-74fecac179db
00000000-0000-0000-0000-00000002c6ad Creating session context with user_id: 94309d50-4f52-5250-31bd-74fecac179db
Settings: readonly = 0, allow_ddl = true, allow_introspection_functions = true
List of all grants: GRANT SHOW, SELECT, INSERT, ALTER, CREATE, DROP, UNDROP TABLE, TRUNCATE, OPTIMIZE, BACKUP, KILL QUERY, KILL TRANSACTION, MOVE PARTITION BETWEEN SHARDS, SYSTEM, dictGet, displaySecretsInShowAndSelect, INTROSPECTION, SOURCES, CLUSTER ON *.*
List of all grants including implicit: GRANT SHOW, SELECT, INSERT, ALTER, CREATE, DROP, UNDROP TABLE, TRUNCATE, OPTIMIZE, BACKUP, KILL QUERY, KILL TRANSACTION, MOVE PARTITION BETWEEN SHARDS, SYSTEM, dictGet, displaySecretsInShowAndSelect, INTROSPECTION, SOURCES, CLUSTER ON *.*
select ais_image_path from s3('http://xxxxxx/sg-mlp-mfp-mlp-dataset-anno/datasets/ryan_test/*.parquet', 'xxxxxx', 'xxxxxx', Parquet) where ais_image_path = 'images/sg-11134201-23010-chbgxn0pt0lv12'
00000000-0000-0000-0000-00000002c6ad Creating query context from session context, user_id: 94309d50-4f52-5250-31bd-74fecac179db, parent context user: default
Settings: readonly = 0, allow_ddl = true, allow_introspection_functions = true
List of all grants: GRANT SHOW, SELECT, INSERT, ALTER, CREATE, DROP, UNDROP TABLE, TRUNCATE, OPTIMIZE, BACKUP, KILL QUERY, KILL TRANSACTION, MOVE PARTITION BETWEEN SHARDS, SYSTEM, dictGet, displaySecretsInShowAndSelect, INTROSPECTION, SOURCES, CLUSTER ON *.*
List of all grants including implicit: GRANT SHOW, SELECT, INSERT, ALTER, CREATE, DROP, UNDROP TABLE, TRUNCATE, OPTIMIZE, BACKUP, KILL QUERY, KILL TRANSACTION, MOVE PARTITION BETWEEN SHARDS, SYSTEM, dictGet, displaySecretsInShowAndSelect, INTROSPECTION, SOURCES, CLUSTER ON *.*
(from 0.0.0.0:0, user: ) SELECT ais_image_path FROM s3('http://xxxxxx/sg-mlp-mfp-mlp-dataset-anno/datasets/ryan_test/*.parquet', 'xxxxxx', '[HIDDEN]', Parquet) WHERE ais_image_path = 'images/sg-11134201-23010-chbgxn0pt0lv12' (stage: Complete)
Access granted: CREATE TEMPORARY TABLE, S3 ON *.*
2023.12.04 18:20:42.641070 [ 181933 ] {3df1eaee-b172-4aab-a491-68b4b8427a57} <Trace> S3Client: Provider type: Unknown
2023.12.04 18:20:42.641096 [ 181933 ] {3df1eaee-b172-4aab-a491-68b4b8427a57} <Trace> S3Client: API mode of the S3 client: AWS
2023.12.04 18:20:42.649473 [ 182237 ] {3df1eaee-b172-4aab-a491-68b4b8427a57} <Trace> HTTPSessionAdapter: Created HTTP(S) session with ceph-c105-sg-drt-aip.s3.sto.shopee.io:80 (10.188.6.18:80)
Code: 636. DB::Exception: Cannot extract table structure from Parquet format file, because there are no files with provided path in S3 or all files are empty. You must specify table structure manually: Cannot extract table structure from Parquet format file. You can specify the structure manually. (CANNOT_EXTRACT_TABLE_STRUCTURE) (version 23.10.1.1) (from 0.0.0.0:0) (in query: SELECT ais_image_path FROM s3('http://xxxxxx/sg-mlp-mfp-mlp-dataset-anno/datasets/ryan_test/*.parquet', 'xxxxxx', '[HIDDEN]', Parquet) WHERE ais_image_path = 'images/sg-11134201-23010-chbgxn0pt0lv12'), Stack trace (when copying this message, always include the lines below):

0. Poco::Exception::Exception(String const&, int) @ 0x0000000019a11479 in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
1. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x0000000010f7e779 in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
2. DB::Exception::Exception<String const&>(int, FormatStringHelperImpl<std::type_identity<String const&>::type>, String const&) @ 0x000000000c2384e3 in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
3. DB::(anonymous namespace)::ReadBufferIterator::next() @ 0x0000000016f3f603 in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
4. DB::readSchemaFromFormat(String const&, std::optional<DB::FormatSettings> const&, DB::IReadBufferIterator&, bool, std::shared_ptr<DB::Context const>&, std::unique_ptr<DB::ReadBuffer, std::default_delete<DB::ReadBuffer>>&) @ 0x000000001757a6ec in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
5. DB::readSchemaFromFormat(String const&, std::optional<DB::FormatSettings> const&, DB::IReadBufferIterator&, bool, std::shared_ptr<DB::Context const>&) @ 0x000000001757be7f in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
6. DB::StorageS3::getTableStructureFromDataImpl(DB::StorageS3::Configuration const&, std::optional<DB::FormatSettings> const&, std::shared_ptr<DB::Context const>) @ 0x0000000016f36db1 in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
7. DB::StorageS3::StorageS3(DB::StorageS3::Configuration const&, std::shared_ptr<DB::Context const>, DB::StorageID const&, DB::ColumnsDescription const&, DB::ConstraintsDescription const&, String const&, std::optional<DB::FormatSettings>, bool, std::shared_ptr<DB::IAST>) @ 0x0000000016f363a4 in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
8. std::shared_ptr<DB::StorageS3> std::allocate_shared[abi:v15000]<DB::StorageS3, std::allocator<DB::StorageS3>, DB::StorageS3::Configuration&, std::shared_ptr<DB::Context const>&, DB::StorageID, DB::ColumnsDescription&, DB::ConstraintsDescription, String, std::nullopt_t const&, void>(std::allocator<DB::StorageS3> const&, DB::StorageS3::Configuration&, std::shared_ptr<DB::Context const>&, DB::StorageID&&, DB::ColumnsDescription&, DB::ConstraintsDescription&&, String&&, std::nullopt_t const&) @ 0x0000000015200ed0 in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
9. DB::TableFunctionS3::executeImpl(std::shared_ptr<DB::IAST> const&, std::shared_ptr<DB::Context const>, String const&, DB::ColumnsDescription, bool) const @ 0x00000000151fbb4b in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
10. DB::ITableFunction::execute(std::shared_ptr<DB::IAST> const&, std::shared_ptr<DB::Context const>, String const&, DB::ColumnsDescription, bool, bool) const @ 0x000000001547747f in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
11. DB::Context::executeTableFunction(std::shared_ptr<DB::IAST> const&, DB::ASTSelectQuery const*) @ 0x0000000015cfe896 in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
12. DB::JoinedTables::getLeftTableStorage() @ 0x00000000165dee61 in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
13. DB::InterpreterSelectQuery::InterpreterSelectQuery(std::shared_ptr<DB::IAST> const&, std::shared_ptr<DB::Context> const&, std::optional<DB::Pipe>, std::shared_ptr<DB::IStorage> const&, DB::SelectQueryOptions const&, std::vector<String, std::allocator<String>> const&, std::shared_ptr<DB::StorageInMemoryMetadata const> const&, std::shared_ptr<DB::PreparedSets>) @ 0x0000000016524a52 in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
14. DB::InterpreterSelectQuery::InterpreterSelectQuery(std::shared_ptr<DB::IAST> const&, std::shared_ptr<DB::Context> const&, DB::SelectQueryOptions const&, std::vector<String, std::allocator<String>> const&) @ 0x0000000016523a97 in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
15. DB::InterpreterSelectWithUnionQuery::buildCurrentChildInterpreter(std::shared_ptr<DB::IAST> const&, std::vector<String, std::allocator<String>> const&) @ 0x00000000165c06b2 in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
16. DB::InterpreterSelectWithUnionQuery::InterpreterSelectWithUnionQuery(std::shared_ptr<DB::IAST> const&, std::shared_ptr<DB::Context>, DB::SelectQueryOptions const&, std::vector<String, std::allocator<String>> const&) @ 0x00000000165bea5f in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
17. std::__unique_if<DB::InterpreterSelectWithUnionQuery>::__unique_single std::make_unique[abi:v15000]<DB::InterpreterSelectWithUnionQuery, std::shared_ptr<DB::IAST>&, std::shared_ptr<DB::Context>&, DB::SelectQueryOptions const&>(std::shared_ptr<DB::IAST>&, std::shared_ptr<DB::Context>&, DB::SelectQueryOptions const&) @ 0x00000000168d38f7 in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
18. DB::InterpreterFactory::get(std::shared_ptr<DB::IAST>&, std::shared_ptr<DB::Context>, DB::SelectQueryOptions const&) @ 0x00000000168d29d5 in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
19. DB::executeQueryImpl(char const*, char const*, std::shared_ptr<DB::Context>, bool, DB::QueryProcessingStage::Enum, DB::ReadBuffer*) @ 0x00000000168b1995 in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
20. DB::executeQuery(String const&, std::shared_ptr<DB::Context>, bool, DB::QueryProcessingStage::Enum) @ 0x00000000168aeb41 in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
21. DB::LocalConnection::sendQuery(DB::ConnectionTimeouts const&, String const&, std::unordered_map<String, String, std::hash<String>, std::equal_to<String>, std::allocator<std::pair<String const, String>>> const&, String const&, unsigned long, DB::Settings const*, DB::ClientInfo const*, bool, std::function<void (DB::Progress const&)>) @ 0x0000000017548050 in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
22. DB::ClientBase::processOrdinaryQuery(String const&, std::shared_ptr<DB::IAST>) @ 0x00000000174ef2a4 in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
23. DB::ClientBase::processParsedSingleQuery(String const&, String const&, std::shared_ptr<DB::IAST>, std::optional<bool>, bool) @ 0x00000000174edf9a in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
24. DB::ClientBase::executeMultiQuery(String const&) @ 0x00000000174f6f54 in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
25. DB::ClientBase::processQueryText(String const&) @ 0x00000000174f7bb7 in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
26. DB::ClientBase::runNonInteractive() @ 0x00000000174fa9bb in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
27. DB::LocalServer::main(std::vector<String, std::allocator<String>> const&) @ 0x0000000011004c77 in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
28. Poco::Util::Application::run() @ 0x00000000198fd306 in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
29. pyEntryClickHouseLocal(int, char**) @ 0x000000001100f13d in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
30. query_stable @ 0x000000001100f44a in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so
31. queryToBuffer(String const&, String const&, String const&, String const&) @ 0x000000001bb82cc6 in /home/protoss.gao/.local/lib/python3.9/site-packages/chdb/_chdb.cpython-39-x86_64-linux-gnu.so

2023.12.04 18:20:42.935293 [ 181933 ] {3df1eaee-b172-4aab-a491-68b4b8427a57} <Debug> MemoryTracker: Peak memory usage (for query): 74.91 MiB.
00000000-0000-0000-0000-00000002c6ad Logout, user_id: 94309d50-4f52-5250-31bd-74fecac179db
Shutting down UDFs loader
Shutting down named sessions
Shutting down database catalog
Shutting down database INFORMATION_SCHEMA
Shutting down database _local
Shutting down database information_schema
Shutting down system databases
Shutting down DDLWorker
Shutting down caches
2023.12.04 18:20:42.937242 [ 181933 ] {} <Debug> Application: Removing temporary directory: /tmp/clickhouse-local-181933-1701685242-3725016050319400010
Code: 636. DB::Exception: Cannot extract table structure from Parquet format file, because there are no files with provided path in S3 or all files are empty. You must specify table structure manually: Cannot extract table structure from Parquet format file. You can specify the structure manually. (CANNOT_EXTRACT_TABLE_STRUCTURE)
2023.12.04 18:20:42.937474 [ 181933 ] {} <Debug> Application: Uninitializing subsystem: Logging Subsystem

Any string shown as "xxxxxx" in the backtrace is a redaction; the actual values are correct.

code environment:
x86_64
python3.9
chdb:1.0.0
s3: ceph-s3

My endpoint is "http://ceph-c105-sg-drt-aip.s3.sto.xxxx.io". It seems the "s3" in it is handled by some hard-coded logic, so the wrong part of the URL is treated as the BUCKET.

ClickHouse offers many settings for the s3 function and the S3 engine that influence the driver and might apply to your case.

This happens because some S3 implementations do not fully follow the S3 specification.
ClickHouse uses the part of the domain that precedes "s3" as the bucket name.
The regex is R"((.+)\.(s3|cos|obs|oss)([.\-][a-z0-9\-.:]+))"
In this issue, however, the bucket name is given in the path after the domain name.
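
To make the effect of that regex concrete, here is a minimal Python sketch that applies the same pattern to the endpoint from this issue. It is only an illustration: `split_bucket_and_key` is a hypothetical helper, not ClickHouse's actual code, and it assumes the pattern is matched against the whole host part of the URL. The second URL (an IP-style endpoint) is also made up for contrast.

```python
import re
from urllib.parse import urlsplit

# Pattern quoted in the thread; Python's `re` accepts it unchanged.
VIRTUAL_HOST_RE = re.compile(r"(.+)\.(s3|cos|obs|oss)([.\-][a-z0-9\-.:]+)")


def split_bucket_and_key(url: str) -> tuple[str, str]:
    """Return (bucket, key) the way a virtual-hosted-style parser might.

    Hypothetical helper for illustration only, not ClickHouse's real code.
    """
    parts = urlsplit(url)
    path = parts.path.lstrip("/")
    match = VIRTUAL_HOST_RE.fullmatch(parts.netloc)
    if match:
        # Virtual-hosted style: the bucket is taken from the host name,
        # and the whole path is treated as the object key.
        return match.group(1), path
    # Path style: the first path segment is the bucket, the rest is the key.
    bucket, _, key = path.partition("/")
    return bucket, key


# Endpoint from this issue: the host contains ".s3.", so the regex matches.
print(split_bucket_and_key(
    "http://ceph-c105-sg-drt-aip.s3.sto.xxxx.io/BUCKET/KEY_PREFIX/*.parquet"))
# → ('ceph-c105-sg-drt-aip', 'BUCKET/KEY_PREFIX/*.parquet')

# Made-up endpoint without "s3" in the host: falls back to path style.
print(split_bucket_and_key("http://10.0.0.1:9000/BUCKET/KEY_PREFIX/*.parquet"))
# → ('BUCKET', 'KEY_PREFIX/*.parquet')
```

Under these assumptions, the first call shows the host prefix being taken as the bucket and the real bucket being folded into the key, which would plausibly explain why the glob finds no files.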

Typically, I would not fix this. What's weird, though, is that the official awscli can handle this kind of misconfigured S3 storage when the endpoint and the S3 URL are specified separately, like:

aws s3 ls s3://bucket/datasets/ryan_test/ --endpoint http://some-irrelevant-name.s3.xxx.io

I will check it later.

Won't fix; it's a misconfigured S3 issue.