apache / incubator-fury

A blazingly fast multi-language serialization framework powered by JIT and zero-copy.

Home Page: https://fury.apache.org/

[Java] serialize to output stream is limited to 2GB

Neiko2002 opened this issue

Search before asking

  • I had searched in the issues and found no similar issues.

Version

Version: 0.4.1
OS: Windows
JDK: 21

Component(s)

Java

Minimal reproduce step

import io.fury.Fury;

import java.io.BufferedOutputStream;
import java.io.OutputStream;
import java.nio.file.Files;

public class Reproduce {
	public static void main(String[] args) throws Exception {
		Fury fury = Fury.builder().requireClassRegistration(false).build();
		try (OutputStream output = new BufferedOutputStream(
				Files.newOutputStream(Files.createTempFile(null, null)))) {
			fury.serialize(output, new BigObj());
		}
	}

	// Two ~1 GiB byte arrays; together just under 2 GiB of payload.
	public static class BigObj {
		public byte[] b1 = new byte[Integer.MAX_VALUE / 2];
		public byte[] b2 = new byte[Integer.MAX_VALUE / 2];
	}
}

What did you expect to see?

I was hoping to get a file with roughly 2147483646 bytes of data, all zeros.

What did you see instead?

Exception in thread "main" java.lang.NegativeArraySizeException: -2147483510
	at io.fury.memory.MemoryBuffer.ensure(MemoryBuffer.java:1980)
	at io.fury.memory.MemoryBuffer.writePrimitiveArrayWithSizeEmbedded(MemoryBuffer.java:1946)
	at io.fury.serializer.ArraySerializers$ByteArraySerializer.write(ArraySerializers.java:290)
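
If I read the numbers right (not verified against the MemoryBuffer source), the requested capacity is the two 1073741823-byte arrays plus a bit over a hundred bytes of headers, roughly 2147483786 bytes in total. That does not fit in a signed 32-bit int and wraps around to -2147483510, which then ends up as a negative array size in the buffer-growth code.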

Anything Else?

I think that when an OutputStream is provided to the serialize method, the intermediate MemoryBuffer should behave like the buffer inside BufferedOutputStream: when the buffer is full, it should flush its content to the underlying OutputStream in order to free up its bytes.
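
Roughly what I have in mind, as a sketch only (the class and method names below are made up for illustration and are not part of Fury's API):

import java.io.IOException;
import java.io.OutputStream;

// Hypothetical illustration only (not Fury's MemoryBuffer): a fixed-size
// buffer that drains to the underlying OutputStream whenever it fills up.
final class FlushingBuffer {
	private final OutputStream out;
	private final byte[] chunk;
	private int position;

	FlushingBuffer(OutputStream out, int chunkSize) {
		this.out = out;
		this.chunk = new byte[chunkSize];
	}

	void writeBytes(byte[] data, int offset, int length) throws IOException {
		while (length > 0) {
			if (position == chunk.length) {
				flush(); // full: hand the buffered bytes to the stream and reuse the space
			}
			int n = Math.min(length, chunk.length - position);
			System.arraycopy(data, offset, chunk, position, n);
			position += n;
			offset += n;
			length -= n;
		}
	}

	void flush() throws IOException {
		out.write(chunk, 0, position);
		position = 0;
	}
}

With a buffer like this, the total serialized size would only be limited by the underlying OutputStream rather than by a single in-memory byte array.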

Are you willing to submit a PR?

  • I'm willing to submit a PR!

Fury needs to go back in the buffer to update some headers in certain situations. In such cases, flushing ahead is not possible. In the long run, we may be able to support streaming writes if we provide an option to disable such look-back.
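
To show why, here is a made-up example of such a look-back write (not Fury's real write path): the size slot is reserved first, the body is written, and only then can the slot be patched.

import java.nio.ByteBuffer;

class LookBackExample {
	// Simplified, made-up layout: a length prefix that is only known after the
	// body has been written. Until the prefix is patched, the bytes before it
	// cannot be handed off to the OutputStream.
	static void writeLengthPrefixed(ByteBuffer buffer, byte[] body) {
		int sizePos = buffer.position();
		buffer.putInt(0);                     // reserve a 4-byte size slot
		buffer.put(body);                     // write the payload
		buffer.putInt(sizePos, body.length);  // look back and patch the reserved slot
	}
}

Until that last putInt runs, none of the bytes written so far can be safely flushed, which is why a simple flush-when-full buffer does not work as-is.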

But could you share in which cases you need to serialize such a big object? It's rare in a production environment, and protobuf doesn't support it either.

We have quite large files on disk and cannot use protobuf because of its 2GB limitation. That's why we were looking for alternatives: fast serialization with cross-language support. I feel that in the future, if we store machine learning embeddings in a column-based style in a file, the 2GB limit will be a problem quite often.

This is interesting. If embeddings are stored, we may need a larger limit. Could we split a big object into several smaller objects for serialization? I mean, you can serialize like this:

Fury fury = xxx;
OutputStream stream = xxx;
fury.serialize(stream, o1);
fury.serialize(stream, o2);
fury.serialize(stream, o3);

Then for deserialization, you can:

Fury fury = xxx;
FuryInputStream stream = xxx;
Object o1 = fury.deserialize(stream);
Object o2 = fury.deserialize(stream);
Object o3 = fury.deserialize(stream);

If we can't split an object graph into multiple serialization calls, then we do need to support a larger size limit.
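
Putting those two snippets together, a rough end-to-end sketch could look like below. I'm assuming here that FuryInputStream is io.fury.io.FuryInputStream and that it can wrap a plain InputStream; adjust to the version you use:

import io.fury.Fury;
import io.fury.io.FuryInputStream;

import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class ChunkedSerialization {
	public static void main(String[] args) throws Exception {
		Fury fury = Fury.builder().requireClassRegistration(false).build();
		Path file = Files.createTempFile("embeddings", ".fury");

		// Write: each chunk stays far below the 2GB buffer limit.
		byte[][] chunks = {new byte[1 << 20], new byte[1 << 20], new byte[1 << 20]};
		try (OutputStream out = Files.newOutputStream(file)) {
			for (byte[] chunk : chunks) {
				fury.serialize(out, chunk);
			}
		}

		// Read: deserialize the same number of objects back, in order.
		try (InputStream in = Files.newInputStream(file)) {
			FuryInputStream stream = new FuryInputStream(in);
			for (int i = 0; i < chunks.length; i++) {
				byte[] chunk = (byte[]) fury.deserialize(stream);
				System.out.println("chunk " + i + ": " + chunk.length + " bytes");
			}
		}
	}
}

Each call produces an independent serialization, so no single buffer ever has to hold more than one chunk.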

In our case we have a file with meta information and embeddings of several million images. All the embeddings are stored in a column-based style for fast access and distance calculations. The embeddings have around 1000 dimensions, which means we can only store 2 million images in one file; otherwise the embeddings alone are too large for Fury.

Why not split this file into smaller files?

It is just a hassle. Right now the meta information of all images is stored in row-based style, followed by the embedding information of all the images in column-based style. I see three options with the current implementation of Fury to handle large files:

  1. Store meta information and embeddings in separate files, and split the embedding file into smaller files to circumvent the 2GB limit
  2. Try to keep everything in one file, but create additional files if it breaks the 2GB barrier
  3. Store everything in row-based format (meta information and embedding per image) and split the files if needed

For options 2 and 3 we would need to keep track of how big the file already is in order to make reasonable splits (not splitting the metadata or embedding of an image across two files). Finding a good splitting point for option 1 is more straightforward, since the number of embeddings fitting into one file can be calculated in advance. In all cases, ideal memory allocation and ordering of data would need more consideration.
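
For illustration, the advance calculation for option 1 could look like the sketch below (all names and numbers are made up for the example; the real column would hold millions of embeddings and a per-file budget of around 1.5GB):

import io.fury.Fury;

import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

public class SplitEmbeddingFiles {
	public static void main(String[] args) throws Exception {
		Fury fury = Fury.builder().requireClassRegistration(false).build();

		// Demo-sized numbers: in reality the byte budget would be chosen well
		// below the 2GB buffer limit and the column would be much larger.
		int bytesPerEmbedding = 1000;                                // ~1000-dimensional embeddings
		long byteBudget = 4_000_000L;                                // per-file budget for this demo
		int imagesPerFile = (int) (byteBudget / bytesPerEmbedding);  // known in advance

		byte[][] embeddingColumn = new byte[10_000][];               // placeholder column data
		Arrays.setAll(embeddingColumn, i -> new byte[bytesPerEmbedding]);

		for (int start = 0, fileIndex = 0; start < embeddingColumn.length; start += imagesPerFile, fileIndex++) {
			int end = Math.min(start + imagesPerFile, embeddingColumn.length);
			Path path = Paths.get("embeddings-" + fileIndex + ".fury");
			try (OutputStream out = Files.newOutputStream(path)) {
				// Each file gets an independently deserializable slice of the column.
				fury.serialize(out, Arrays.copyOfRange(embeddingColumn, start, end));
			}
		}
	}
}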