libarchive / libarchive

Multi-format archive and compression library

Home Page: http://www.libarchive.org

Truncated 7-Zip file body (error code: -30) on archive_read_data with 7z archives containing larger (>= 32 MiB) files when skipping entries

mxmlnkn opened this issue

Hello,

Thanks for this widely useful project! I'm trying to incorporate it into ratarmount via python-libarchive-c.

After making the new backend work successfully with smaller archives, I stumbled upon a weird problem with a larger test file.

Create test files:

# Large archive with two files to test seekability and independence of opened files, which reproduces the bug.
> spaces-32-MiB.txt; for i in $( seq $(( 32 * 1024 )) ); do printf '%1024s' $'\n' >> spaces-32-MiB.txt; done
> zeros-32-MiB.txt; for i in $( seq $(( 32 * 1024 )) ); do printf '%01023d\n' 0 >> zeros-32-MiB.txt; done
7z a two-large-files.7z spaces-32-MiB.txt zeros-32-MiB.txt

# Slightly smaller file that I accidentally created before because of a bug, which for some reason works fine!
> spaces-32-MiB.txt; for i in $( seq $(( 32 * 1024 )) ); do printf '%1023s' $'\n' >> spaces-32-MiB.txt; done
> zeros-32-MiB.txt; for i in $( seq $(( 32 * 1024 )) ); do printf '%01023d\n' 0 >> zeros-32-MiB.txt; done
7z a two-slightly-less-large-files.7z spaces-32-MiB.txt zeros-32-MiB.txt

(The slightly smaller version calls printf '%1023s' $'\n' instead of printf '%1024s' $'\n')
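
For readers without a POSIX shell, the same inputs can be sketched in Python. This is only an approximate equivalent of the shell loops above and not part of the original report: the helper names `write_lines` and `create_test_files` and the `spaces_width` parameter are hypothetical, and the byte layout assumes bash's `printf` padding behavior.

```python
import os


def write_lines(path, line, count):
    """Write `count` copies of the bytes `line` to `path` and return the file size."""
    with open(path, "wb") as file:
        for _ in range(count):
            file.write(line)
    return os.path.getsize(path)


def create_test_files(directory, spaces_width=1024, line_count=32 * 1024):
    """Recreate spaces-32-MiB.txt and zeros-32-MiB.txt; return {file name: size in bytes}."""
    lines = {
        # printf '%<width>s' $'\n' right-aligns the newline: (width - 1) spaces, then "\n".
        "spaces-32-MiB.txt": b" " * (spaces_width - 1) + b"\n",
        # printf '%01023d\n' 0 prints 1023 zero digits followed by a newline.
        "zeros-32-MiB.txt": b"0" * 1023 + b"\n",
    }
    return {
        name: write_lines(os.path.join(directory, name), line, line_count)
        for name, line in lines.items()
    }
```

With the defaults, both files should come out at 32 * 1024 * 1024 = 33554432 bytes. Passing `spaces_width=1023` shrinks the spaces file to 32 * 1024 * 1023 = 33521664 bytes, which is the size reported for the archive that works. The files still have to be packed with `7z a` as shown above.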

Python code triggering the issue:
import libarchive

def listFiles(path):
    print("\nList all entries of:", path)
    with libarchive.file_reader(path) as archive:
        for entry in archive:
            print(entry)
            for block in entry.get_blocks():
                assert len(block) > 0

def readNthEntry(path, entryIndex):
    print(f"\nGet contents of file {entryIndex} of archive: {path}")
    with libarchive.file_reader(path) as archive:
        entryCount = 0
        for entry in archive:
            if entryCount == entryIndex:
                print(entry)
                readSize = 0
                for block in entry.get_blocks():
                    readSize += len(block)
                print(f"  Read file contents: {readSize} B")
            entryCount += 1

filePath = "two-large-files.7z"
filePath2 = "two-slightly-less-large-files.7z"

listFiles(filePath)  # No error
readNthEntry(filePath2, 1)  # No error
# libarchive.exception.ArchiveError: Truncated 7-Zip file body (errno=84, retcode=-30, archive_p=...)
readNthEntry(filePath, 1)

C++ code triggering the issue:
#include <array>
#include <iostream>
#include <set>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>

#include <archive.h>
#include <archive_entry.h>


class Libarchive
{
public:
    Libarchive( const std::string& path )
    {
        archive_read_support_filter_all( m_archive );
        archive_read_support_format_all( m_archive );

        auto returnCode = archive_read_open_filename( m_archive, path.c_str(), 10240 );
        if ( returnCode != ARCHIVE_OK ) {
            std::stringstream message;
            message << "[Libarchive] Open " << path << " failed with: " << archive_error_string( m_archive )
                    << " (error code: " << std::to_string( returnCode ) << ")";
            throw std::runtime_error( std::move( message ).str() );
        }
    }

    ~Libarchive()
    {
        const auto returnCode = archive_read_free( m_archive );
        if ( returnCode != ARCHIVE_OK ) {
            std::cerr << "Freeing archive failed with: " << returnCode << "\n";
        }
    }

    [[nodiscard]] archive*
    pointer() const noexcept
    {
        return m_archive;
    }

private:
    archive* const m_archive{ archive_read_new() };
};


class LibarchiveEntry
{
public:
    ~LibarchiveEntry()
    {
        archive_entry_free( m_entry );
    }

    [[nodiscard]] archive_entry*
    pointer() const noexcept
    {
        return m_entry;
    }

private:
    archive_entry* const m_entry{ archive_entry_new() };
};


void
listFiles( const std::string& path )
{
    Libarchive archive{ path };

    archive_entry* entry{ nullptr };
    while ( archive_read_next_header( archive.pointer(), &entry ) == ARCHIVE_OK ) {
        std::cout << archive_entry_pathname( entry ) << "\n";
        //archive_read_data_skip( archive.pointer() );  // Not necessary according to the wiki.
    }
}


void
readNthEntries( const std::string&      path,
                const std::set<size_t>& entryIndexes )
{
    std::cout << "\nGet contents of files";
    for ( const auto i : entryIndexes ) {
        std::cout << " " << i;
    }
    std::cout << " in archive: " << path << "\n";

    Libarchive archive{ path };

    size_t entryCount{ 0 };
    LibarchiveEntry entry;
    while ( true ) {
        /* I also tried with archive_read_next_header, but the bug persists. */
        if ( archive_read_next_header2( archive.pointer(), entry.pointer() ) != ARCHIVE_OK ) {
            break;
        }

        if ( entryIndexes.contains( entryCount ) ) {
            std::cout << archive_entry_pathname( entry.pointer() ) << "\n";

            std::array<char, 32 * 1024> buffer{};
            size_t readSize{ 0 };
            while ( true ) {
                const auto readSizePerCall = archive_read_data( archive.pointer(), buffer.data(), buffer.size() );
                if ( readSizePerCall < 0 ) {
                    std::stringstream message;
                    message << "[Libarchive] Read data failed with: " << archive_error_string( archive.pointer() )
                            << " (error code: " << std::to_string( readSizePerCall ) << ")";
                    //continue;  // Works fine (amount of returned data is correct) to simply ignore the error!?
                    throw std::runtime_error( std::move( message ).str() );
                }
                if ( readSizePerCall == 0 ) {
                    break;
                }
                readSize += readSizePerCall;
            }
            std::cout << "  Read file contents: " << readSize << " B\n";
        } else {
            //archive_read_data_skip( archive.pointer() );  // Uncommenting this does not help.
        }
        ++entryCount;
    }
}


int main()
{
    static const std::string filePath = "two-large-files.7z";
    static const std::string filePath2 = "two-slightly-less-large-files.7z";

    std::cout << "\nList all entries of: " << filePath << "\n";
    listFiles( filePath );

    /* Works fine with the slightly smaller file. */
    readNthEntries( filePath2, { 0 } );
    readNthEntries( filePath2, { 1 } );

    /* Works fine when not skipping any entry. */
    readNthEntries( filePath2, { 0, 1 } );
    readNthEntries( filePath, { 0, 1 } );

    readNthEntries( filePath, { 0 } );
    /* Read data failed with: Truncated 7-Zip file body (error code: -30) */
    readNthEntries( filePath, { 1 } );

    return 0;
}

Compiled with:

g++ -Wall -Wextra -Wshadow -std=c++20 -o libarchive-entry-skipping-issue{,.cpp} -larchive && ./libarchive-entry-skipping-issue

Output:

List all entries of: two-large-files.7z
spaces-32-MiB.txt
zeros-32-MiB.txt

Get contents of files 0 in archive: two-slightly-less-large-files.7z
spaces-32-MiB.txt
  Read file contents: 33521664 B

Get contents of files 1 in archive: two-slightly-less-large-files.7z
zeros-32-MiB.txt
  Read file contents: 33554432 B

Get contents of files 0 1 in archive: two-slightly-less-large-files.7z
spaces-32-MiB.txt
  Read file contents: 33521664 B
zeros-32-MiB.txt
  Read file contents: 33554432 B

Get contents of files 0 1 in archive: two-large-files.7z
spaces-32-MiB.txt
  Read file contents: 33554432 B
zeros-32-MiB.txt
  Read file contents: 33554432 B

Get contents of files 0 in archive: two-large-files.7z
spaces-32-MiB.txt
  Read file contents: 33554432 B

Get contents of files 1 in archive: two-large-files.7z
zeros-32-MiB.txt
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Libarchive] Read data failed with: Truncated 7-Zip file body (error code: -30)
Aborted

Observations:

  • I nearly reported this at python-libarchive-c instead of here because I could not reproduce the bug with the C++ code at first. It turned out that I had forgotten to check the return code of archive_read_data. It also turned out that ignoring that error (see the commented-out continue) seems to result in the correct amount of data being returned by subsequent archive_read_data calls!
  • I had a slightly smaller file at first because of printf peculiarities. Everything works fine with that file, two-slightly-less-large-files.7z. The error only occurs with two-large-files.7z.
  • The error also does not occur when no entries are skipped, i.e., when archive_read_data is called for all entries.
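
The last observation suggests a possible stopgap: read and discard the data of every entry that precedes the wanted one instead of skipping it. Below is a minimal sketch, assuming entries that expose `get_blocks()` the way the python-libarchive-c entries in the code above do. The function name `read_nth_entry_size` is hypothetical, the approach has only the two-entry results above to support it, and it pays for decompressing everything before the target.

```python
def read_nth_entry_size(entries, entry_index):
    """Return the byte count of entry `entry_index`, or None if there is no such entry.

    Workaround sketch: the blocks of all earlier entries are read and discarded
    so that no entry body is ever skipped by the library.
    """
    for index, entry in enumerate(entries):
        # Consume every block, even for entries that are not wanted.
        size = sum(len(block) for block in entry.get_blocks())
        if index == entry_index:
            return size
    return None


# Intended usage with python-libarchive-c (not executed here):
#
#     import libarchive
#     with libarchive.file_reader("two-large-files.7z") as archive:
#         print(read_nth_entry_size(archive, 1))
```

Because the function only relies on iteration and `get_blocks()`, it can be exercised with stand-in objects without an archive at hand.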

Do you see the same issue with this?

bsdtar -tvf two-large-files.7z

Note: The -t option to bsdtar skips the entry bodies to produce its listing.

@kientzle So, it works the same as my listFiles implementations, i.e., archive_read_data is not even called, and therefore this bug should not be triggered. I tried it, and it works without error, same as my implementations. The error only occurs when the first entry is skipped and the second entry is then read.

bsdtar -tvf two-large-files.7z
# -rwx------  0 0      0    33554432 Apr  1 13:16 spaces-32-MiB.txt
# -rwx------  0 0      0    33554432 Mar 31 23:27 zeros-32-MiB.txt

I can reproduce the bug with bsdtar like this:

bsdtar -x --exclude spaces-32-MiB.txt -f two-large-files.7z
# zeros-32-MiB.txt: Truncated 7-Zip file body: File exists
# bsdtar: Error exit delayed from previous errors.

While it works fine when excluding the other file:

bsdtar -x --exclude zeros-32-MiB.txt -f two-large-files.7z