runsascoded / parquet-diff-test

Demonstrate differences in Parquet files generated by pyarrow on macOS vs. {Ubuntu, Windows}.

parquet-diff-test

Demonstrate differences in Parquet files generated by pyarrow on macOS vs. {Ubuntu, Windows} (see arrow#39399).

CLI

For each {engine, compression codec}:

parquet-diff-test writes a simple Parquet file:

df = pd.DataFrame([{ 'a': 111 }])
empty_df = df.iloc[:0]  # subset the dataset to have 0 rows
out_dir = f'out/{engine}/{compression}'
parquet_path = f'{out_dir}/empty.parquet'
empty_df.to_parquet(parquet_path, engine=engine, compression=compression)

In the same directory, it also writes:

  • metadata.json, which includes:
    • the pyarrow.ParquetFile.metadata dictionary
    • file size
    • file sha256 hash
  • xxd.txt: ASCII representation of every byte in empty.parquet
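
The size, hash, and hex dump can be reproduced with the standard library alone. The following is a minimal sketch, not the tool's actual code: `summarize` and `xxd_lines` are hypothetical names, the hex layout imitates the default `xxd` output seen in the diffs further down, and the pyarrow metadata part of metadata.json is left out.

```python
import hashlib


def summarize(data: bytes) -> dict:
    """Return the file size and SHA-256 hex digest for a file's raw bytes."""
    return {"size": len(data), "sha256": hashlib.sha256(data).hexdigest()}


def xxd_lines(data: bytes, width: int = 16) -> list:
    """Format bytes like default `xxd`: offset, 2-byte hex groups, ASCII column."""
    lines = []
    hex_width = width * 2 + width // 2 - 1  # 39 columns for 16 bytes per line
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        groups = [chunk[i:i + 2].hex() for i in range(0, len(chunk), 2)]
        hex_col = " ".join(groups).ljust(hex_width)  # pad a short final line
        ascii_col = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}: {hex_col}  {ascii_col}")
    return lines


if __name__ == "__main__":
    # First 16 bytes of the {pyarrow, gzip} file, taken from the diff below.
    sample = bytes.fromhex("5041 5231 1504 1500 1528 4c15 0015 0012")
    print(summarize(sample))
    print("\n".join(xxd_lines(sample)))
```

Hashing and dumping the raw bytes, rather than the parsed metadata, is what makes byte-level differences visible even when the logical contents match.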

Results

The test.yml workflow runs parquet-diff-test on Ubuntu, macOS, and Windows, and pushes the results of each to a branch.

Here are the macos and windows branches compared to ubuntu:

Summary

  • ✅ In all cases, Parquet files generated by fastparquet are identical across OSes.
  • 🤔 In many cases, those generated by pyarrow differ between OSes.

pyarrow

Codec   Ubuntu  Windows  macOS
brotli  ✅      ✅       ❌
gzip    ⚠️      ⚠️       ❌
lz4     ✅      ✅       ❌
snappy  ✅      ✅       ❌
zstd    ✅      ✅       ❌

fastparquet

Codec   Ubuntu  Windows  macOS
brotli  ✅      ✅       ✅
gzip    ✅      ✅       ✅
lz4     ✅      ✅       ✅
snappy  ✅      ✅       ✅
zstd    ✅      ✅       ✅

Full diffs

For example, here's the diff for {pyarrow, snappy}:

git diff ubuntu..macos -- out/pyarrow/snappy/xxd.txt
 00000280: 7741 4141 4145 4141 6741 4367 4141 414e  wAAAAEAAgACgAAAN
 00000290: 7742 4141 4145 4141 4141 4151 4141 4141  wBAAAEAAAAAQAAAA
 000002a0: 7741 4141 4149 4141 7741 4241 4149 4141  wAAAAIAAwABAAIAA
-000002b0: 6741 4141 4149 4141 4141 4541 4141 4141  gAAAAIAAAAEAAAAA
-000002c0: 5941 4141 4277 5957 356b 5958 4d41 414b  YAAABwYW5kYXMAAK
-000002d0: 5942 4141 4237 496d 6c75 5a47 5634 5832  YBAAB7ImluZGV4X2
-000002e0: 4e76 6248 5674 626e 4d69 4f69 4262 6579  NvbHVtbnMiOiBbey
-000002f0: 4a72 6157 356b 496a 6f67 496e 4a68 626d  JraW5kIjogInJhbm
-00000300: 646c 4969 7767 496d 3568 6257 5569 4f69  dlIiwgIm5hbWUiOi
-00000310: 4275 6457 7873 4c43 4169 6333 5268 636e  BudWxsLCAic3Rhcn
-00000320: 5169 4f69 4177 4c43 4169 6333 5276 6343  QiOiAwLCAic3RvcC
-00000330: 4936 4944 4173 4943 4a7a 6447 5677 496a  I6IDAsICJzdGVwIj
-00000340: 6f67 4d58 3164 4c43 4169 5932 3973 6457  ogMX1dLCAiY29sdW
-00000350: 3175 5832 6c75 5a47 5634 5a58 4d69 4f69  1uX2luZGV4ZXMiOi
-00000360: 4262 6579 4a75 5957 316c 496a 6f67 626e  BbeyJuYW1lIjogbn
-00000370: 5673 6243 7767 496d 5a70 5a57 786b 5832  VsbCwgImZpZWxkX2
-00000380: 3568 6257 5569 4f69 4275 6457 7873 4c43  5hbWUiOiBudWxsLC
-00000390: 4169 6347 4675 5a47 467a 5833 5235 6347  AicGFuZGFzX3R5cG
-000003a0: 5569 4f69 4169 6457 3570 5932 396b 5a53  UiOiAidW5pY29kZS
-000003b0: 4973 4943 4a75 6457 3177 6556 3930 6558  IsICJudW1weV90eX
-000003c0: 426c 496a 6f67 496d 3969 616d 566a 6443  BlIjogIm9iamVjdC
-000003d0: 4973 4943 4a74 5a58 5268 5a47 4630 5953  IsICJtZXRhZGF0YS
-000003e0: 4936 4948 7369 5a57 356a 6232 5270 626d  I6IHsiZW5jb2Rpbm
-000003f0: 6369 4f69 4169 5656 5247 4c54 6769 6658  ciOiAiVVRGLTgifX
-00000400: 3164 4c43 4169 5932 3973 6457 3175 6379  1dLCAiY29sdW1ucy
-00000410: 4936 4946 7437 496d 3568 6257 5569 4f69  I6IFt7Im5hbWUiOi
-00000420: 4169 5953 4973 4943 4a6d 6157 5673 5a46  AiYSIsICJmaWVsZF
-00000430: 3975 5957 316c 496a 6f67 496d 4569 4c43  9uYW1lIjogImEiLC
-00000440: 4169 6347 4675 5a47 467a 5833 5235 6347  AicGFuZGFzX3R5cG
-00000450: 5569 4f69 4169 6157 3530 4e6a 5169 4c43  UiOiAiaW50NjQiLC
-00000460: 4169 626e 5674 6348 6c66 6448 6c77 5a53  AibnVtcHlfdHlwZS
-00000470: 4936 4943 4a70 626e 5132 4e43 4973 4943  I6ICJpbnQ2NCIsIC
-00000480: 4a74 5a58 5268 5a47 4630 5953 4936 4947  JtZXRhZGF0YSI6IG
-00000490: 3531 6247 7839 5853 7767 496d 4e79 5a57  51bGx9XSwgImNyZW
-000004a0: 4630 6233 4969 4f69 4237 496d 7870 596e  F0b3IiOiB7ImxpYn
-000004b0: 4a68 636e 6b69 4f69 4169 6348 6c68 636e  JhcnkiOiAicHlhcn
-000004c0: 4a76 6479 4973 4943 4a32 5a58 4a7a 6157  JvdyIsICJ2ZXJzaW
-000004d0: 3975 496a 6f67 496a 4530 4c6a 4175 4d69  9uIjogIjE0LjAuMi
-000004e0: 4a39 4c43 4169 6347 4675 5a47 467a 5833  J9LCAicGFuZGFzX3
-000004f0: 5a6c 636e 4e70 6232 3469 4f69 4169 4d69  ZlcnNpb24iOiAiMi
-00000500: 3478 4c6a 5169 6651 4141 4151 4141 4142  4xLjQifQAAAQAAAB
+000002b0: 6741 4141 4330 4151 4141 4241 4141 414b  gAAAC0AQAABAAAAK
+000002c0: 5942 4141 4237 496d 6c75 5a47 5634 5832  YBAAB7ImluZGV4X2
+000002d0: 4e76 6248 5674 626e 4d69 4f69 4262 6579  NvbHVtbnMiOiBbey
+000002e0: 4a72 6157 356b 496a 6f67 496e 4a68 626d  JraW5kIjogInJhbm
+000002f0: 646c 4969 7767 496d 3568 6257 5569 4f69  dlIiwgIm5hbWUiOi
+00000300: 4275 6457 7873 4c43 4169 6333 5268 636e  BudWxsLCAic3Rhcn
+00000310: 5169 4f69 4177 4c43 4169 6333 5276 6343  QiOiAwLCAic3RvcC
+00000320: 4936 4944 4173 4943 4a7a 6447 5677 496a  I6IDAsICJzdGVwIj
+00000330: 6f67 4d58 3164 4c43 4169 5932 3973 6457  ogMX1dLCAiY29sdW
+00000340: 3175 5832 6c75 5a47 5634 5a58 4d69 4f69  1uX2luZGV4ZXMiOi
+00000350: 4262 6579 4a75 5957 316c 496a 6f67 626e  BbeyJuYW1lIjogbn
+00000360: 5673 6243 7767 496d 5a70 5a57 786b 5832  VsbCwgImZpZWxkX2
+00000370: 3568 6257 5569 4f69 4275 6457 7873 4c43  5hbWUiOiBudWxsLC
+00000380: 4169 6347 4675 5a47 467a 5833 5235 6347  AicGFuZGFzX3R5cG
+00000390: 5569 4f69 4169 6457 3570 5932 396b 5a53  UiOiAidW5pY29kZS
+000003a0: 4973 4943 4a75 6457 3177 6556 3930 6558  IsICJudW1weV90eX
+000003b0: 426c 496a 6f67 496d 3969 616d 566a 6443  BlIjogIm9iamVjdC
+000003c0: 4973 4943 4a74 5a58 5268 5a47 4630 5953  IsICJtZXRhZGF0YS
+000003d0: 4936 4948 7369 5a57 356a 6232 5270 626d  I6IHsiZW5jb2Rpbm
+000003e0: 6369 4f69 4169 5656 5247 4c54 6769 6658  ciOiAiVVRGLTgifX
+000003f0: 3164 4c43 4169 5932 3973 6457 3175 6379  1dLCAiY29sdW1ucy
+00000400: 4936 4946 7437 496d 3568 6257 5569 4f69  I6IFt7Im5hbWUiOi
+00000410: 4169 5953 4973 4943 4a6d 6157 5673 5a46  AiYSIsICJmaWVsZF
+00000420: 3975 5957 316c 496a 6f67 496d 4569 4c43  9uYW1lIjogImEiLC
+00000430: 4169 6347 4675 5a47 467a 5833 5235 6347  AicGFuZGFzX3R5cG
+00000440: 5569 4f69 4169 6157 3530 4e6a 5169 4c43  UiOiAiaW50NjQiLC
+00000450: 4169 626e 5674 6348 6c66 6448 6c77 5a53  AibnVtcHlfdHlwZS
+00000460: 4936 4943 4a70 626e 5132 4e43 4973 4943  I6ICJpbnQ2NCIsIC
+00000470: 4a74 5a58 5268 5a47 4630 5953 4936 4947  JtZXRhZGF0YSI6IG
+00000480: 3531 6247 7839 5853 7767 496d 4e79 5a57  51bGx9XSwgImNyZW
+00000490: 4630 6233 4969 4f69 4237 496d 7870 596e  F0b3IiOiB7ImxpYn
+000004a0: 4a68 636e 6b69 4f69 4169 6348 6c68 636e  JhcnkiOiAicHlhcn
+000004b0: 4a76 6479 4973 4943 4a32 5a58 4a7a 6157  JvdyIsICJ2ZXJzaW
+000004c0: 3975 496a 6f67 496a 4530 4c6a 4175 4d69  9uIjogIjE0LjAuMi
+000004d0: 4a39 4c43 4169 6347 4675 5a47 467a 5833  J9LCAicGFuZGFzX3
+000004e0: 5a6c 636e 4e70 6232 3469 4f69 4169 4d69  ZlcnNpb24iOiAiMi
+000004f0: 3478 4c6a 5169 6651 4141 4267 4141 4148  4xLjQifQAABgAAAH
+00000500: 4268 626d 5268 6377 4141 4151 4141 4142  BhbmRhcwAAAQAAAB
 00000510: 5141 4141 4151 4142 5141 4341 4147 4141  QAAAAQABQACAAGAA
 00000520: 6341 4441 4141 4142 4141 4541 4141 4141  cADAAAABAAEAAAAA
 00000530: 4141 4151 4951 4141 4141 4841 4141 4141  AAAQIQAAAAHAAAAA

The pyarrow metadata is the same for both; I can't tell what explains the difference.
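
The ASCII column of the differing region looks like base64 text. pyarrow stores a base64-encoded Arrow schema under an `ARROW:schema` key in the Parquet key-value metadata, so that is a plausible source for these bytes, though nothing in this repo confirms it. The sketch below decodes a few windows copied from the diff above. The 2-character skip is an assumption, chosen only because it makes the decoded bytes readable, and `decode_window` is a hypothetical helper.

```python
import base64

# ASCII columns copied from the xxd diff above (consecutive 16-byte lines).
UBUNTU_HEAD = "gAAAAIAAAAEAAAAA" "YAAABwYW5kYXMAAK" "YBAAB7ImluZGV4X2"  # 0x2b0-0x2df
MACOS_HEAD = "gAAAC0AQAABAAAAK" "YBAAB7ImluZGV4X2"                      # 0x2b0-0x2cf
MACOS_TAIL = "4xLjQifQAABgAAAH" "BhbmRhcwAAAQAAAB"                      # 0x4f0-0x50f


def decode_window(text: str, skip: int = 2) -> bytes:
    """Decode a slice of base64 text, assuming 4-char groups start at `skip`."""
    body = text[skip:]
    body = body[: len(body) - len(body) % 4]  # drop a trailing partial group
    return base64.b64decode(body)


if __name__ == "__main__":
    for name, window in [("ubuntu head", UBUNTU_HEAD),
                         ("macos head", MACOS_HEAD),
                         ("macos tail", MACOS_TAIL)]:
        print(name, decode_window(window))
```

In these windows, the string `pandas` appears just before the JSON blob on ubuntu and just after it on macos. If this reading is right, both files carry the same strings in a different order, which would be consistent with the parsed metadata being equal. It does not say why the order differs.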

Comparing the windows branch to ubuntu:

  • All fastparquet Parquet files are identical.
  • pyarrow Parquet files are identical except for the gzip codec, where a single byte in the compressed data's header differs:

git diff ubuntu..windows -- out/pyarrow/gzip/xxd.txt
 00000000: 5041 5231 1504 1500 1528 4c15 0015 0012  PAR1.....(L.....
-00000010: 0000 1f8b 0800 0000 0000 0003 0300 0000  ................
+00000010: 0000 1f8b 0800 0000 0000 000a 0300 0000  ................
 00000020: 0000 0000 0000 264c 1c15 0419 2500 0619  ......&L....%...
 00000030: 1801 6115 0416 0016 1c16 4426 0026 0829  ..a.......D&.&.)
 00000040: 1c15 0415 0015 0200 0000 1504 192c 3500  .............,5.

Discussion

The discrepancy between macOS and Ubuntu has made some tests inconvenient; it would be nice to understand why it occurs.

Docker

Interestingly, I see the same macOS diffs when running run.sh in an ubuntu Docker image on a macOS host machine.
