Exception when writing big indices

Question

gnodet opened this issue 6 years ago

Caused by: java.io.UTFDataFormatException: encoded string too long: 68271 bytes
        at java.io.DataOutputStream.writeUTF(DataOutputStream.java:364)
        at java.io.DataOutputStream.writeUTF(DataOutputStream.java:323)
        at org.jboss.jandex.IndexWriterV2.writeStringTable(IndexWriterV2.java:216)
        at org.jboss.jandex.IndexWriterV2.write(IndexWriterV2.java:193)
        at org.jboss.jandex.IndexWriter.write(IndexWriter.java:107)
        at org.jboss.jandex.IndexWriter.write(IndexWriter.java:74)

Jason T. Greene · Answer 1 · Sat Feb 23 2019 06:49:24 GMT+0800 (China Standard Time)

Do you have a reproducer somewhere? That is a very very large String entry in the table. I could add support for Strings larger than 64K but I would love to see an example to understand why/how it happens.

Jason T. Greene · Answer 2 · Sun Jan 24 2021 01:48:49 GMT+0800 (China Standard Time)

Closing due to inactivity

Michael Edgar · Answer 3 · Sat Sep 25 2021 18:15:26 GMT+0800 (China Standard Time)

This came up in smallrye/smallrye-open-api#924. Indexing GAV org.jetbrains.kotlin:kotlin-stdlib:1.5.21 reproduces the exception. Please re-open this issue.
cc: @Ladicek

Ladislav Thon · Answer 4 · Mon Sep 27 2021 16:10:50 GMT+0800 (China Standard Time)

Reopened. If you have a fix, I'll be glad to review, but I can't get to it myself any time soon.

Jason T. Greene · Answer 5 · Mon Sep 27 2021 20:56:03 GMT+0800 (China Standard Time)

Thanks for the example BTW. The fix will have to change the index format. The solution should be simple: write out our own packed int length indicator, followed by the UTF-8 byte array, when version > blah.
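For readers following along, the idea of a packed int length followed by a UTF-8 byte array could look roughly like the sketch below. This is only an illustration under assumptions: the class and method names are hypothetical, the 7-bits-per-byte layout is one common way to pack an int, and none of it is taken from Jandex's actual index format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not Jandex's actual writer: a length-prefixed string
// encoding that is not bound by writeUTF's 65,535-byte limit.
class PackedStringCodec {

    // Writes an int as 1 to 5 bytes, 7 bits per byte, low bits first.
    // The high bit of each byte signals that more bytes follow.
    static void writePackedInt(DataOutputStream out, int value) throws IOException {
        while ((value & ~0x7F) != 0) {
            out.writeByte((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.writeByte(value);
    }

    // Reverses writePackedInt.
    static int readPackedInt(DataInputStream in) throws IOException {
        int result = 0;
        for (int shift = 0; shift < 35; shift += 7) {
            int b = in.readUnsignedByte();
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
        }
        throw new IOException("Packed int is longer than 5 bytes");
    }

    // Length (in bytes) first, then the standard UTF-8 bytes.
    static void writeString(DataOutputStream out, String s) throws IOException {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        writePackedInt(out, bytes.length);
        out.write(bytes);
    }

    static String readString(DataInputStream in) throws IOException {
        byte[] bytes = new byte[readPackedInt(in)];
        in.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Convenience for trying the codec out in memory.
    static String roundTrip(String s) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            writeString(out, s);
        }
        try (DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(buffer.toByteArray()))) {
            return readString(in);
        }
    }
}
```

With this layout a 70,000-character string round-trips without hitting the 2-byte length limit, at the cost of a format change that older readers would not understand.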

Guillaume Nodet · Answer 6 · Mon Sep 27 2021 21:37:37 GMT+0800 (China Standard Time)

I have a fix for that, I just need to write a test case...

Ladislav Thon · Answer 7 · Mon Sep 27 2021 21:40:37 GMT+0800 (China Standard Time)

I have also submitted a fix here: #146. Adding a test is a good point though, let me think about that :-)

Guillaume Nodet · Answer 8 · Mon Sep 27 2021 21:57:31 GMT+0800 (China Standard Time)

Here's the test then:

    @Test
    public void testWriteRead() throws IOException {
        Indexer indexer = new Indexer();

        // Locate the JAR that contains Repeatable (presumably kotlin-stdlib, given the dependency below)
        String url = getClass().getClassLoader()
                .getResource(Repeatable.class.getName().replace('.', '/') + ".class").toString();
        String jarFile = url.substring("jar:file:".length(), url.indexOf("!/"));

        // Index every class in the JAR; try-with-resources closes the JAR and each stream
        try (JarFile jar = new JarFile(jarFile)) {
            Enumeration<JarEntry> entries = jar.entries();
            while (entries.hasMoreElements()) {
                JarEntry entry = entries.nextElement();
                if (entry.getName().endsWith(".class")) {
                    try (InputStream stream = jar.getInputStream(entry)) {
                        indexer.index(stream);
                    }
                }
            }
        }

        Index index = indexer.complete();

        // Writing the index is the step that threw UTFDataFormatException
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        new IndexWriter(baos).write(index);

        // Read the index back to check that the serialized form is readable
        index = new IndexReader(new ByteArrayInputStream(baos.toByteArray())).read();
    }

You just need to add the following dependency:

        <dependency>
            <groupId>org.jetbrains.kotlin</groupId>
            <artifactId>kotlin-stdlib</artifactId>
            <version>1.5.21</version>
            <scope>test</scope>
        </dependency>

Ladislav Thon · Answer 9 · Mon Sep 27 2021 21:58:49 GMT+0800 (China Standard Time)

Thanks, I'll take a look!

Also thanks @n1hility, I didn't think it would be this straightforward :-)

Ladislav Thon · Answer 10 · Fri Oct 01 2021 23:03:32 GMT+0800 (China Standard Time)

I closed my PR (for now) as it can't really solve the underlying problem. The issue is not that Kotlin somehow generates unexpectedly long strings in the constant pool. The JVM spec clearly says that a string in the constant pool may be at most 65,535 bytes long (just under 64 kB), because the length field is 2 bytes long. It isn't possible to generate a valid class file whose constant pool contains a longer string.
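The same 2-byte length limit is what java.io.DataOutputStream.writeUTF enforces, which is where the exception in the original stack trace comes from. A small sketch (the class and method names are made up for illustration) that probes the boundary:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.util.Arrays;

// Illustrative only: probes the size limit of DataOutputStream.writeUTF.
class WriteUtfLimitDemo {

    // Returns true if writeUTF accepts a string of `length` ASCII characters.
    // Each ASCII character other than NUL takes exactly 1 byte in modified UTF-8.
    static boolean canWriteUtf(int length) throws IOException {
        char[] chars = new char[length];
        Arrays.fill(chars, 'a');
        try (DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream())) {
            out.writeUTF(new String(chars));
            return true;
        } catch (UTFDataFormatException e) {
            // Thrown when the encoded form would not fit the 2-byte length prefix
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(canWriteUtf(65_535)); // prints "true"
        System.out.println(canWriteUtf(65_536)); // prints "false"
    }
}
```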

The problem is that when we're interpreting the constant pool data as strings (Indexer.decodeUtf8Entry), we assume they are UTF-8 encoded, which they are not. They use the "modified UTF-8" encoding. In other words, we convert the bytes to a String incorrectly. Of course, when we want to serialize such a wrong string back, we don't get the original form but something malformed. The fact that it's longer than 64 kB is just a coincidence.
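To make the mismatch concrete, here is a minimal sketch using the best-known difference between the two encodings: modified UTF-8 stores the character U+0000 as the two bytes 0xC0 0x80, which is not a valid sequence in standard UTF-8. The class and method names are hypothetical, and the exact replacement behaviour of the standard decoder is an assumption about current JDKs rather than something stated in this thread.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Illustrative only: shows how a wrongly decoded string grows when re-encoded.
class ModifiedUtf8Mismatch {

    // Modified UTF-8 encoding of the single character U+0000, as found in class files
    static final byte[] NUL_MODIFIED_UTF8 = { (byte) 0xC0, (byte) 0x80 };

    // The incorrect interpretation: treats the bytes as standard UTF-8.
    // Malformed input is silently replaced, so information is lost here.
    static String decodeAsStandardUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Number of payload bytes writeUTF produces for s (excluding the 2-byte length prefix).
    static int writeUtfPayloadSize(String s) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        }
        return buffer.size() - 2;
    }

    public static void main(String[] args) throws IOException {
        String wrong = decodeAsStandardUtf8(NUL_MODIFIED_UTF8);
        System.out.println("original bytes:   " + NUL_MODIFIED_UTF8.length);
        System.out.println("re-encoded bytes: " + writeUtfPayloadSize(wrong));
    }
}
```

If the decoder substitutes replacement characters for the malformed bytes, the re-encoded form is larger than the 2 original bytes, so a constant pool string with enough such characters can grow past what writeUTF accepts even though the original entry was within the limit.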

Kotlin's kotlin/collections/ArraysKt___ArraysKt.class class file contains a string entry in the constant pool that is 65,534 bytes long: the 0th element of the d1 array member of the kotlin.Metadata annotation. We need to make sure that we roundtrip that value correctly.

Jason T. Greene · Answer 11 · Sat Oct 02 2021 00:05:42 GMT+0800 (China Standard Time)

Great catch. The pool length limit is why I was curious about how this happens, as it shouldn't. I just assumed from the example that they had some encoding oddities, but the indexing code is clearly wrong and should be using DataInput. Oops! Well, the good news is that's an even simpler fix :)
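As a rough sketch of what "using DataInput" means for a single constant pool entry: a CONSTANT_Utf8 entry is a tag byte (1), a 2-byte length, and that many bytes of modified UTF-8, which is the layout DataInput.readUTF expects after the tag. The class and method names below are hypothetical and this is not the code from the actual fix.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataOutputStream;
import java.io.IOException;

// Illustrative only: reads and writes one CONSTANT_Utf8 constant pool entry.
class ConstantPoolUtf8Reader {

    static final int CONSTANT_UTF8_TAG = 1;

    // Reads u1 tag, then lets readUTF consume the u2 length and the modified UTF-8 bytes.
    static String readUtf8Entry(DataInput in) throws IOException {
        int tag = in.readUnsignedByte();
        if (tag != CONSTANT_UTF8_TAG) {
            throw new IOException("Not a CONSTANT_Utf8 entry, tag=" + tag);
        }
        return in.readUTF();
    }

    // Writes the entry back; writeUTF emits the same length-prefixed modified UTF-8 form.
    static byte[] writeUtf8Entry(String s) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeByte(CONSTANT_UTF8_TAG);
            out.writeUTF(s);
        }
        return buffer.toByteArray();
    }
}
```

Because reading and writing use the same encoding, an entry containing U+0000 comes back byte for byte identical, and a string that fit into the class file also fits into writeUTF.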

Ladislav Thon · Answer 12 · Mon Oct 04 2021 18:06:27 GMT+0800 (China Standard Time)

Submitted #150 with a proper fix.

Please let me know if you also need this fixed in 2.4 and I'll do a backport.

Ladislav Thon · Answer 13 · Mon Oct 04 2021 19:05:35 GMT+0800 (China Standard Time)

Fixed in #150.

Michael Edgar · Answer 14 · Mon Oct 04 2021 19:38:31 GMT+0800 (China Standard Time)

> Please let me know if you also need this fixed in 2.4 and I'll do a backport.

A backport would be appreciated for sure. I assume it's going to be a while until we get 3.0 released/integrated with the Smallrye libraries.

Ladislav Thon · Answer 15 · Mon Oct 04 2021 20:19:06 GMT+0800 (China Standard Time)

OK, backport is in #151.