xitongsys / parquet-go

pure golang library for reading/writing parquet file

issue with dot "." in field name

pwmcintyre opened this issue · comments


I know the trouble with using "." in field names has been briefly mentioned in another issue, but I'm hoping you can help.

Using the Java parquet-tools to inspect the schema of an existing Parquet file I have, I can see that it contains "." in the field names, and that works fine:

$ docker run -it --rm -v ${PWD}:/data  nathanhowell/parquet-tools schema /data/part-00001-d82e5581-88f1-4203-85db-861c8d907350.c000.snappy.parquet

message spark_schema {
  optional binary version (STRING);
  optional binary meta.format (STRING);
  optional binary meta.id (STRING);
}

whereas with your tool I get the following:

$ parquet-tools -cmd schema -file ./part-00001-d82e5581-88f1-4203-85db-861c8d907350.c000.snappy.parquet

----- Go struct -----
Spark_schema struct {
  Version *string
  Meta46format *string
  Meta46id *string
}
----- Json schema -----
{
  "Tag": "name=Spark_schema, repetitiontype=REQUIRED",
  "Fields": [
    {
      "Tag": "name=Version, type=UTF8, repetitiontype=OPTIONAL",
      "Fields": null
    },
    {
      "Tag": "name=Meta46format, type=UTF8, repetitiontype=OPTIONAL",
      "Fields": null
    },
    {
      "Tag": "name=Meta46id, type=UTF8, repetitiontype=OPTIONAL",
      "Fields": null
    }
  ]
}

I'm similarly having trouble writing files with "." in the key, e.g. with this struct:

type Event struct {
	Version *string `parquet:"name=version, type=BYTE_ARRAY, convertedtype=UTF8, encoding=PLAIN_DICTIONARY"`
	MetaID  *string `parquet:"name=meta.id, type=BYTE_ARRAY, convertedtype=UTF8, encoding=PLAIN_DICTIONARY"`
}

I get the following error when attempting to read it:

$ docker run -it --rm -v ${PWD}:/data  nathanhowell/parquet-tools schema /data/output_test/struct/output.parquet

org.apache.parquet.io.InvalidRecordException: meta not found in message parquet_go_root {
  optional binary version (STRING) = 0;
  optional binary meta.id (STRING) = 0;
}

any ideas?

hi, @pwmcintyre
Go does not allow a dot in an identifier, so you need to give the Go struct field a legal name.
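For context, the `46` in `Meta46format` and `Meta46id` in the Go struct output above is the decimal byte value of ".". The sketch below shows one way such a name could be derived. `toGoFieldName` is a hypothetical helper written for illustration, not the library's actual function, and the exact rules the tool applies are an assumption inferred from that output.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// toGoFieldName is a hypothetical helper (an assumption, not parquet-go's API).
// It keeps ASCII letters, digits, and underscores, replaces every other byte
// with its decimal value, and upper-cases the first character so that the
// resulting struct field is exported.
func toGoFieldName(name string) string {
	var b strings.Builder
	for i := 0; i < len(name); i++ {
		c := name[i]
		switch {
		case c >= 'a' && c <= 'z', c >= 'A' && c <= 'Z', c >= '0' && c <= '9', c == '_':
			b.WriteByte(c)
		default:
			b.WriteString(strconv.Itoa(int(c))) // '.' is byte 46
		}
	}
	s := b.String()
	if s == "" {
		return s
	}
	return strings.ToUpper(s[:1]) + s[1:]
}

func main() {
	for _, n := range []string{"version", "meta.format", "meta.id"} {
		fmt.Println(n, "->", toGoFieldName(n))
	}
	// Prints:
	// version -> Version
	// meta.format -> Meta46format
	// meta.id -> Meta46id
}
```

A mapping like this only affects the Go-side name. The Parquet-side name still has to be carried separately, which is what the `name=` tag is for.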
The following example writes and reads a Parquet file with a field whose name contains a ".":

package main

import (
	"log"

	"github.com/xitongsys/parquet-go-source/local"
	"github.com/xitongsys/parquet-go/parquet"
	"github.com/xitongsys/parquet-go/reader"
	"github.com/xitongsys/parquet-go/writer"
)

type Student struct {
	// name is the parquet field name; inname is the Go variable name
	Name string `parquet:"name=student.name, inname=name, type=BYTE_ARRAY, convertedtype=UTF8, encoding=PLAIN_DICTIONARY"`
	Age  int32  `parquet:"name=age, type=INT32, encoding=PLAIN"`
}

func main() {
	var err error
	fw, err := local.NewLocalFileWriter("output/flat.parquet")
	if err != nil {
		log.Println("Can't create local file", err)
		return
	}

	pw, err := writer.NewParquetWriter(fw, new(Student), 4)
	if err != nil {
		log.Println("Can't create parquet writer", err)
		return
	}

	pw.RowGroupSize = 128 * 1024 * 1024 //128M
	pw.PageSize = 8 * 1024              //8K
	pw.CompressionType = parquet.CompressionCodec_SNAPPY
	num := 10
	for i := 0; i < num; i++ {
		stu := Student{
			Name: "StudentName",
			Age:  int32(20 + i%5),
		}
		if err = pw.Write(stu); err != nil {
			log.Println("Write error", err)
		}
	}
	// WriteStop flushes the remaining rows and writes the file footer
	if err = pw.WriteStop(); err != nil {
		log.Println("WriteStop error", err)
		return
	}
	log.Println("Write Finished")
	fw.Close()

	fr, err := local.NewLocalFileReader("output/flat.parquet")
	if err != nil {
		log.Println("Can't open file")
		return
	}

	pr, err := reader.NewParquetReader(fr, new(Student), 4)
	if err != nil {
		log.Println("Can't create parquet reader", err)
		return
	}
	num = int(pr.GetNumRows())
	stus := make([]Student, num) //read 10 rows
	if err = pr.Read(&stus); err != nil {
		log.Println("Read error", err)
	}
	log.Println(stus)

	pr.ReadStop()
	fr.Close()
}



running result:

2021/01/28 08:38:46 Write Finished
2021/01/28 08:38:46 [{StudentName 20} {StudentName 21} {StudentName 22} {StudentName 23} {StudentName 24} {StudentName 20} {StudentName 21} {StudentName 22} {StudentName 23} {StudentName 24}]

@xitongsys — appreciate your time, thank you

I have reproduced your result above, but as with my earlier example, when I try to read this new Parquet file with my existing systems (I'm using AWS Athena), I get an error similar to the one below from parquet-tools:

$ docker run -it --rm -v ${PWD}:/data  nathanhowell/parquet-tools schema /data/output.parquet
org.apache.parquet.io.InvalidRecordException: student not found in message parquet_go_root {
  required binary student.name (STRING) = 0;
  required int32 age = 0;
}

similarly, using another Go implementation, i still cannot read this file:

$ parquet-tool schema output.parquet
panic: line 2: expected ;, got unknown start of token '46' instead

and so i suspect there may be an issue in the handling of the "." in the output file?

hi, @pwmcintyre
Could you provide a sample file like "/data/part-00001-d82e5581-88f1-4203-85db-861c8d907350.c000.snappy.parquet"?

@xitongsys — emailed, and while not sensitive, we would prefer it not shared publicly :)

hi @xitongsys ... did your post about the Java implementation get deleted? did you find the answer?

hi, @pwmcintyre
I have found the reason. Parquet-go uses "." as the delimiter between field names in a path, so a "." inside a field name is treated as a path separator, which caused this issue. I'm considering how to fix it while keeping compatibility with existing behavior.
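To illustrate the ambiguity, here is a minimal sketch that assumes a column path is held as a single string with "." between components. `splitDotPath` is a hypothetical helper, not a parquet-go function. The root and field names are taken from the error message earlier in the thread.

```go
package main

import (
	"fmt"
	"strings"
)

// splitDotPath is a hypothetical helper that models a column path stored as
// one string with "." between components.
func splitDotPath(path string) []string {
	return strings.Split(path, ".")
}

func main() {
	// Two real components: the root message and one field named "student.name".
	path := strings.Join([]string{"parquet_go_root", "student.name"}, ".")

	parts := splitDotPath(path)
	fmt.Println(len(parts), parts)
	// Prints: 3 [parquet_go_root student name]
}
```

Under that assumption, a reader that receives the three-component path would look for a field called `student` under the root, which is consistent with the "student not found in message parquet_go_root" error reported by the Java tool.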

@xitongsys — thanks for the update, please let me know if there's anything I can help with

hi, @pwmcintyre
Fixed in this pull.
Actually I just use \x01 as the delimiter instead of ".".
You can find an example file here.
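A rough sketch of why a control byte works as the delimiter: it is very unlikely to appear in a real field name, so joining and splitting round-trips cleanly even when names contain dots. `joinPath` and `splitPath` are hypothetical helpers for illustration. Only the `\x01` value comes from the comment above; how the library applies it internally is not shown here.

```go
package main

import (
	"fmt"
	"strings"
)

// pathDelimiter is the control byte mentioned in the fix.
const pathDelimiter = "\x01"

// joinPath and splitPath are hypothetical helpers (assumptions, not the
// library's API) that build and take apart an internal path string.
func joinPath(parts ...string) string {
	return strings.Join(parts, pathDelimiter)
}

func splitPath(path string) []string {
	return strings.Split(path, pathDelimiter)
}

func main() {
	path := joinPath("parquet_go_root", "student.name")

	parts := splitPath(path)
	fmt.Printf("%d %q\n", len(parts), parts)
	// Prints: 2 ["parquet_go_root" "student.name"]
}
```

The trade-off is that the joined string is no longer human-readable, so it suits internal bookkeeping rather than anything shown to users.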

@xitongsys — well done! thanks again

I can confirm AWS Athena is happy with this change 👌 (ignore the nulls, it's just a test)

ok, I will close this issue.