smallrye / jandex

Java Annotation Indexer

Exception when writing big indices

gnodet opened this issue:

Caused by: java.io.UTFDataFormatException: encoded string too long: 68271 bytes
        at java.io.DataOutputStream.writeUTF(DataOutputStream.java:364)
        at java.io.DataOutputStream.writeUTF(DataOutputStream.java:323)
        at org.jboss.jandex.IndexWriterV2.writeStringTable(IndexWriterV2.java:216)
        at org.jboss.jandex.IndexWriterV2.write(IndexWriterV2.java:193)
        at org.jboss.jandex.IndexWriter.write(IndexWriter.java:107)
        at org.jboss.jandex.IndexWriter.write(IndexWriter.java:74)

Do you have a reproducer somewhere? That is a very, very large String entry in the table. I could add support for strings larger than 64K, but I would love to see an example to understand why and how it happens.

Closing due to inactivity

This came up in smallrye/smallrye-open-api#924. Indexing GAV org.jetbrains.kotlin:kotlin-stdlib:1.5.21 reproduces the exception. Please re-open this issue.
cc: @Ladicek

Reopened. If you have a fix, I'll be glad to review, but I can't get to it myself any time soon.

Thanks for the example, BTW. The fix will have to change the index format. The solution should be simple: when the index version is new enough, write out our own packed-int length indicator followed by the UTF-8 byte array.
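The length-prefixed layout proposed above could look roughly like the following. This is a minimal sketch, not Jandex's actual index format: the class `LongStringCodec`, its method names, and the 7-bits-per-byte packed-int layout are all assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Jandex.
final class LongStringCodec {

    // Writes a non-negative int 7 bits at a time, low bits first.
    // A set high bit means "more bytes follow".
    static void writePackedInt(DataOutputStream out, int value) throws IOException {
        while ((value & ~0x7F) != 0) {
            out.writeByte((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.writeByte(value);
    }

    // Reads an int written by writePackedInt.
    static int readPackedInt(DataInputStream in) throws IOException {
        int result = 0;
        int shift = 0;
        int b;
        do {
            b = in.readUnsignedByte();
            result |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return result;
    }

    // Packed byte length, then the standard UTF-8 bytes.
    // Unlike writeUTF, the length is not capped at 65535.
    static void writeString(DataOutputStream out, String s) throws IOException {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        writePackedInt(out, bytes.length);
        out.write(bytes);
    }

    static String readString(DataInputStream in) throws IOException {
        byte[] bytes = new byte[readPackedInt(in)];
        in.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Convenience: writes a string to memory and reads it back.
    static String roundTrip(String s) {
        try {
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            writeString(new DataOutputStream(baos), s);
            return readString(new DataInputStream(new ByteArrayInputStream(baos.toByteArray())));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this layout, a string of 70,000 ASCII characters round-trips without hitting the 2-byte length limit of `writeUTF`.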

I have a fix for that, I just need to write a test case...

I have also submitted a fix here: #146. Adding a test is a good point though, let me think about that :-)

Here's the test then:

    @Test
    public void testWriteRead() throws IOException {
        Indexer indexer = new Indexer();

        // Locate the kotlin-stdlib JAR via a class it contains
        // (Repeatable here is presumably kotlin.annotation.Repeatable).
        String url = getClass().getClassLoader()
                .getResource(Repeatable.class.getName().replace('.', '/') + ".class").toString();
        String jarFile = url.substring("jar:file:".length(), url.indexOf("!/"));

        // Index every class in the JAR; try-with-resources closes the JAR and each stream.
        try (JarFile jar = new JarFile(jarFile)) {
            Enumeration<JarEntry> entries = jar.entries();
            while (entries.hasMoreElements()) {
                JarEntry entry = entries.nextElement();
                if (entry.getName().endsWith(".class")) {
                    try (InputStream stream = jar.getInputStream(entry)) {
                        indexer.index(stream);
                    }
                }
            }
        }

        Index index = indexer.complete();

        // Round-trip the index; writing is the step that throws UTFDataFormatException.
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        new IndexWriter(baos).write(index);

        index = new IndexReader(new ByteArrayInputStream(baos.toByteArray())).read();
    }

You just need to add the following dependency:

        <dependency>
            <groupId>org.jetbrains.kotlin</groupId>
            <artifactId>kotlin-stdlib</artifactId>
            <version>1.5.21</version>
            <scope>test</scope>
        </dependency>

Thanks, I'll take a look!

Also thanks @n1hility, I didn't think it would be this straightforward :-)

I closed my PR (for now) as that can't really solve the underlying problem. The issue is not that Kotlin somehow generates unexpectedly long strings in the constant pool. The JVM spec clearly says that a string in the constant pool may be at most 65,535 bytes long (just under 64 KiB), because the length field is 2 bytes long. It isn't possible to generate a valid class file whose constant pool would contain a longer string.
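`DataOutputStream.writeUTF` uses the same 2-byte length prefix, which is why the stack trace at the top of this issue ends in `UTFDataFormatException`. A small sketch of that boundary follows; the class `WriteUtfLimit` and its method names are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical demo class, not part of Jandex.
final class WriteUtfLimit {

    // Returns true if writeUTF accepts the string, false if it rejects it as too long.
    static boolean fitsInWriteUtf(String s) {
        try {
            new DataOutputStream(new ByteArrayOutputStream()).writeUTF(s);
            return true;
        } catch (UTFDataFormatException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Builds an ASCII string of the given length (one encoded byte per character).
    static String ascii(int length) {
        char[] chars = new char[length];
        Arrays.fill(chars, 'a');
        return new String(chars);
    }

    public static void main(String[] args) {
        System.out.println(fitsInWriteUtf(ascii(65535))); // prints "true"
        System.out.println(fitsInWriteUtf(ascii(65536))); // prints "false"
    }
}
```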

The problem is that when we interpret the constant pool data as strings (Indexer.decodeUtf8Entry), we assume they are UTF-8 encoded, which they are not. They use the "modified UTF-8" encoding. In other words, we convert the bytes to a String incorrectly. Of course, when we later serialize such an incorrectly decoded string, we don't get the original bytes back, but something malformed. The fact that the result is longer than 64 kB is just a coincidence.
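To make the mismatch concrete, here is a minimal sketch that encodes a string as modified UTF-8 and then decodes the bytes as standard UTF-8. It uses `new String(bytes, UTF_8)` as a stand-in for any decoder that assumes standard UTF-8; it is not Jandex's `decodeUtf8Entry`, and the class `ModifiedUtf8Mismatch` is hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of Jandex.
final class ModifiedUtf8Mismatch {

    // Encodes s as modified UTF-8 (the class-file encoding) and strips
    // the 2-byte length prefix that writeUTF puts in front.
    static byte[] modifiedUtf8(String s) {
        try {
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            new DataOutputStream(baos).writeUTF(s);
            byte[] all = baos.toByteArray();
            return Arrays.copyOfRange(all, 2, all.length);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Wrong on purpose: treats modified UTF-8 bytes as standard UTF-8.
    static String decodeAsStandardUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "a\u0000b";
        byte[] raw = modifiedUtf8(original); // 61 C0 80 62: NUL becomes the two bytes C0 80
        String wrong = decodeAsStandardUtf8(raw);

        System.out.println(original.equals(wrong)); // prints "false"
        // C0 80 is not valid standard UTF-8, so it is replaced on decoding,
        // and re-encoding the result yields more bytes than the input had.
        System.out.println(raw.length + " -> " + wrong.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

Modified UTF-8 also encodes supplementary characters as two 3-byte surrogate halves, which a standard UTF-8 decoder rejects in the same way, so a long entry that contains such sequences can grow past the 65,535-byte limit once it is decoded and re-encoded.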

Kotlin's kotlin/collections/ArraysKt___ArraysKt.class class file contains a string entry in the constant pool that is 65,534 bytes long: the 0th element of the d1 array member of the kotlin.Metadata annotation. We need to make sure that we roundtrip that value correctly.

Great catch. The pool length limit is why I was curious about how this happens, as it shouldn't. I just assumed from the example that they had some encoding oddities, but the indexing code is clearly wrong and should be using DataInput. Oops! Well, the good news is that's an even simpler fix :)
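Decoding through `DataInput`, as suggested above, could be sketched as follows. `DataInput.readUTF` reads a 2-byte length followed by modified UTF-8 bytes, which matches the body of a CONSTANT_Utf8 entry once its tag byte has been consumed. The class `ConstantPoolUtf8Reader` and its methods are hypothetical; the thread does not show whether #150 does exactly this.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper, not part of Jandex.
final class ConstantPoolUtf8Reader {

    // Reads the body of one CONSTANT_Utf8 entry: u2 length + modified UTF-8 bytes.
    // Assumes the 1-byte tag has already been read by the caller.
    static String readUtf8Entry(DataInput in) throws IOException {
        return in.readUTF();
    }

    // Produces the same byte layout (u2 length + modified UTF-8) for a string.
    static byte[] encodeEntry(String s) {
        try {
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            new DataOutputStream(baos).writeUTF(s);
            return baos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Decodes bytes produced by encodeEntry (or copied from a class file).
    static String decodeEntry(byte[] entry) {
        try {
            return readUtf8Entry(new DataInputStream(new ByteArrayInputStream(entry)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because both directions use modified UTF-8, a string that fits in 65,535 encoded bytes comes back unchanged and re-encodes to the same number of bytes.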

Submitted #150 with a proper fix.

Please let me know if you also need this fixed in 2.4 and I'll do a backport.

Fixed in #150.

A backport would be appreciated for sure. I assume it's going to be a while until we get 3.0 released and integrated with the SmallRye libraries.

OK, backport is in #151.