xitongsys / parquet-go

pure golang library for reading/writing parquet file

issue with dot "." in field name

pwmcintyre opened this issue

hi

I know the trouble with using "." in field names has been briefly mentioned in other issues, but I'm hoping you can help.

Using the Java parquet-tools to inspect the schema of an existing Parquet file i have, i can see it contains "." in the field names, but works fine:

$ docker run -it --rm -v ${PWD}:/data  nathanhowell/parquet-tools schema /data/part-00001-d82e5581-88f1-4203-85db-861c8d907350.c000.snappy.parquet

message spark_schema {
  optional binary version (STRING);
  optional binary meta.format (STRING);
  optional binary meta.id (STRING);
}

and while using your tool i get the following:

$ parquet-tools -cmd schema -file ./part-00001-d82e5581-88f1-4203-85db-861c8d907350.c000.snappy.parquet

----- Go struct -----
Spark_schema struct {
  Version *string
  Meta46format *string
  Meta46id *string
}
----- Json schema -----
{
  "Tag": "name=Spark_schema, repetitiontype=REQUIRED",
  "Fields": [
    {
      "Tag": "name=Version, type=UTF8, repetitiontype=OPTIONAL",
      "Fields": null
    },
    {
      "Tag": "name=Meta46format, type=UTF8, repetitiontype=OPTIONAL",
      "Fields": null
    },
    {
      "Tag": "name=Meta46id, type=UTF8, repetitiontype=OPTIONAL",
      "Fields": null
    }
  ]
}
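
The generated names above look like each "." was replaced by 46, the decimal ASCII code of ".", with the first letter upper-cased so the field is exported. Below is a minimal sketch of that kind of mapping, inferred only from the output shown here. `toGoIdentifier` is a hypothetical helper for illustration and is not claimed to be parquet-go's actual function.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// toGoIdentifier is a hypothetical helper (not parquet-go's API).
// It keeps ASCII letters, digits and '_', replaces every other byte
// with its decimal code, and upper-cases the first character.
func toGoIdentifier(name string) string {
	var b strings.Builder
	for i := 0; i < len(name); i++ {
		c := name[i]
		switch {
		case c >= 'a' && c <= 'z', c >= 'A' && c <= 'Z', c >= '0' && c <= '9', c == '_':
			b.WriteByte(c)
		default:
			// '.' is byte 46, so "meta.id" becomes "meta46id".
			b.WriteString(strconv.Itoa(int(c)))
		}
	}
	s := b.String()
	if s == "" {
		return s
	}
	return strings.ToUpper(s[:1]) + s[1:]
}

func main() {
	for _, n := range []string{"version", "meta.format", "meta.id"} {
		fmt.Printf("%s -> %s\n", n, toGoIdentifier(n))
	}
}
```

The mapping only affects the Go-side name. The column name stored in the file is a separate matter, which is why the "name=" tag matters when writing.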

I'm similarly having trouble writing files with "." in the key — eg with this struct:

type Event struct {
	Version *string `parquet:"name=version, type=BYTE_ARRAY, convertedtype=UTF8, encoding=PLAIN_DICTIONARY"`
	MetaID *string `parquet:"name=meta.id, type=BYTE_ARRAY, convertedtype=UTF8, encoding=PLAIN_DICTIONARY"`
}

I get the following error when attempting to read it:

$ docker run -it --rm -v ${PWD}:/data  nathanhowell/parquet-tools schema /data/output_test/struct/output.parquet

org.apache.parquet.io.InvalidRecordException: meta not found in message parquet_go_root {
  optional binary version (STRING) = 0;
  optional binary meta.id (STRING) = 0;
}

any ideas?

hi, @pwmcintyre
Go doesn't allow a dot in an identifier, so you need to give the Go struct field a legal name.
The following example writes and reads a parquet file with a field whose name contains a ".":

package main

import (
	"log"

	"github.com/xitongsys/parquet-go-source/local"
	"github.com/xitongsys/parquet-go/parquet"
	"github.com/xitongsys/parquet-go/reader"
	"github.com/xitongsys/parquet-go/writer"
)

type Student struct {
	// name is the parquet field name. inname is the variable name
	Name    string  `parquet:"name=student.name, inname=name, type=BYTE_ARRAY, convertedtype=UTF8, encoding=PLAIN_DICTIONARY"`
	Age     int32   `parquet:"name=age, type=INT32, encoding=PLAIN"`
}

func main() {
	var err error
	fw, err := local.NewLocalFileWriter("output/flat.parquet")
	if err != nil {
		log.Println("Can't create local file", err)
		return
	}

	//write
	pw, err := writer.NewParquetWriter(fw, new(Student), 4)
	if err != nil {
		log.Println("Can't create parquet writer", err)
		return
	}

	pw.RowGroupSize = 128 * 1024 * 1024 //128M
	pw.PageSize = 8 * 1024 //8K
	pw.CompressionType = parquet.CompressionCodec_SNAPPY
	num := 10
	for i := 0; i < num; i++ {
		stu := Student{
			Name:   "StudentName",
			Age:    int32(20 + i%5),
		}
		if err = pw.Write(stu); err != nil {
			log.Println("Write error", err)
		}
	}
	if err = pw.WriteStop(); err != nil {
		log.Println("WriteStop error", err)
		return
	}
	log.Println("Write Finished")
	fw.Close()

	///read
	fr, err := local.NewLocalFileReader("output/flat.parquet")
	if err != nil {
		log.Println("Can't open file", err)
		return
	}

	pr, err := reader.NewParquetReader(fr, new(Student), 4)
	if err != nil {
		log.Println("Can't create parquet reader", err)
		return
	}
	num = int(pr.GetNumRows())
	stus := make([]Student, num) //read 10 rows
	if err = pr.Read(&stus); err != nil {
		log.Println("Read error", err)
	}
	log.Println(stus)

	pr.ReadStop()
	fr.Close()

}

running result:

2021/01/28 08:38:46 Write Finished
2021/01/28 08:38:46 [{StudentName 20} {StudentName 21} {StudentName 22} {StudentName 23} {StudentName 24} {StudentName 20} {StudentName 21} {StudentName 22} {StudentName 23} {StudentName 24}]

@xitongsys — appreciate your time, thank you

i have reproduced your result above — but similar to my example earlier, when attempting to read this new parquet file with my existing systems (i'm using AWS Athena), i get an error similar to the below error from parquet-tools:

$ docker run -it --rm -v ${PWD}:/data  nathanhowell/parquet-tools schema /data/output.parquet
org.apache.parquet.io.InvalidRecordException: student not found in message parquet_go_root {
  required binary student.name (STRING) = 0;
  required int32 age = 0;
}

similarly, using another Go implementation, i still cannot read this file:

$ parquet-tool schema output.parquet
panic: line 2: expected ;, got unknown start of token '46' instead

and so i suspect there may be an issue in the handling of the "." in the output file?

hi, @pwmcintyre
Could you provide a sample file like "/data/part-00001-d82e5581-88f1-4203-85db-861c8d907350.c000.snappy.parquet"?

@xitongsys — emailed, and while not sensitive, we would prefer it not shared publicly :)

hi @xitongsys ... did your post about the Java implementation get deleted? did you find the answer?

hi, @pwmcintyre
I have found the reason. Parquet-go uses "." as the field delimiter in column paths, which caused this issue. I'm considering how to fix it while keeping compatibility with earlier versions.

@xitongsys — thanks for the update, please let me know if there's anything I can help with

hi, @pwmcintyre
Fixed in this pull request.
Actually I just use \x01 as the delimiter instead of ".".
You can find an example file here
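
To illustrate why the delimiter matters, here is a small sketch, assuming a column path is flattened into a single string and later split again. `joinPath` and `splitPath` are hypothetical helpers, not parquet-go functions. With "." as the delimiter, a field named "meta.id" comes back as two path elements, which is consistent with the "meta not found" error reported earlier. With \x01 the name survives the round trip.

```go
package main

import (
	"fmt"
	"strings"
)

// joinPath and splitPath are hypothetical helpers that flatten a schema
// path into one string and recover it again using the given delimiter.
func joinPath(path []string, delim string) string {
	return strings.Join(path, delim)
}

func splitPath(s, delim string) []string {
	return strings.Split(s, delim)
}

func main() {
	path := []string{"parquet_go_root", "meta.id"}

	// "." collides with the dot inside the field name.
	dot := splitPath(joinPath(path, "."), ".")
	// "\x01" is a control character that is very unlikely in a field name.
	ctl := splitPath(joinPath(path, "\x01"), "\x01")

	fmt.Printf("%q\n", dot) // ["parquet_go_root" "meta" "id"]
	fmt.Printf("%q\n", ctl) // ["parquet_go_root" "meta.id"]
}
```

A control character only makes a collision unlikely, not impossible. Keeping the path as a slice of strings would avoid the ambiguity altogether, at the cost of a larger change.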

@xitongsys — well done! thanks again

I can confirm AWS Athena is happy with this change 👌 (ignore the nulls, it's just a test)
image

ok, I will close this issue.