OSGeo / shapelib

Official repository of shapelib

Need unittests directly in C or C++

schwehr opened this issue

The current testing setup only runs command-line programs and compares their output to golden files. shapelib needs unit tests that directly call the C functions with a much more diverse range of inputs, e.g. inputs that exercise the various error conditions. There are lots of frameworks that would work well for this.

I'm thinking of going with Catch2. It's got a (slower) option to use a single header and source file to get started. And I've wanted to give it a try. Having these tests in C++ (probably >= C++17) will mean that some platforms will only be able to use the original shell-script-based testing, but that should be okay, as these tests will exercise the code on at least the 4 configurations currently set up for CI in the project.

See the C and C++ sections of Wikipedia's List of unit testing frameworks. Probably anything reasonably maintained and open source would be fine. Some of the options are:

| Test Framework       | Framework Language | Example usage                     |
|----------------------|--------------------|-----------------------------------|
| Catch2               | C++                | users                             |
| Criterion            | C; C++ optional    | ?                                 |
| googletest and gmock | C++                | PDAL/test/unit, PROJ/test/unit    |
| tut                  | C++                | geos/tests/unit, gdal/autotest/cpp |

Example starter test in dbfopen_test.cc. Apologies for code that isn't totally clean; SetContents in particular isn't very good.

Catch2:

#include <filesystem>
#include <fstream>
#include <iostream>
#include <string>
#include <string_view>

#include "catch.hpp"
#include "shapefil.h"

namespace {

constexpr char kTestData[] = "testdata/";

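// Writes `content` to `file_name`, replacing any existing file.
bool SetContents(std::string_view file_name, std::string_view content) {
  // std::ofstream has no std::string_view constructor, so copy to std::string.
  std::ofstream file{std::string(file_name)};
  if (!file.is_open()) return false;
  file << content;
  file.close();
  return !file.fail();
}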

TEST_CASE("DBFOpen", "[dbfopen]") {
  SECTION("Open does not exist - rb") {
    auto handle = DBFOpen("/does/not/exist.dbf", "rb");
    REQUIRE(handle == nullptr);
  }

  SECTION("Open does not exist - rb+") {
    auto handle = DBFOpen("/does/not/exist2.dbf", "rb+");
    REQUIRE(handle == nullptr);
  }

  SECTION("Open not a dbf") {
    const std::string filename = kTestData + std::string("not_a_dbf.dbf");
    auto handle = DBFOpen(filename.c_str(), "rb");
    REQUIRE(handle == nullptr);
  }

  SECTION("Open and close a.dbf") {
    const std::string filename = kTestData + std::string("a.dbf");
    auto handle = DBFOpen(filename.c_str(), "rb");
    REQUIRE(handle != nullptr);
    DBFClose(handle);
  }
}

TEST_CASE("DBFCreate", "[dbfcreate]") {
  SECTION("DoesNotExist") {
    auto handle = DBFCreate("/does/not/exist");
    REQUIRE(nullptr == handle);
  }

  SECTION("CreateAlreadyExists") {
    const std::string filename = kTestData + std::string("in-the-way.dbf");
    REQUIRE(SetContents(filename, "some content"));
    auto handle = DBFCreate(filename.c_str());
    // TODO(schwehr): Seems like a bug to overwrite an existing file.
    REQUIRE(nullptr != handle);
    DBFClose(handle);
    auto size = std::filesystem::file_size(filename);
    REQUIRE(34 == size);
  }

  SECTION("Create and close") {
    const std::string filename = kTestData + std::string("empty.dbf");
    auto handle = DBFCreate(filename.c_str());
    DBFClose(handle);
    auto size = std::filesystem::file_size(filename);
    REQUIRE(34 == size);
  }
}

}  // namespace

Almost the same thing written with GoogleTest:

#include <filesystem>
#include <fstream>
#include <iostream>
#include <string>
#include <string_view>

#include "gtest/gtest.h"
#include "shapefil.h"

namespace {

constexpr char kTestData[] = "testdata/";

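// Writes `content` to `file_name`, replacing any existing file.
bool SetContents(std::string_view file_name, std::string_view content) {
  // std::ofstream has no std::string_view constructor, so copy to std::string.
  std::ofstream file{std::string(file_name)};
  if (!file.is_open()) return false;
  file << content;
  file.close();
  return !file.fail();
}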

TEST(DbfOpenTest, testDoesNotExist) {
  auto handle = DBFOpen("/does/not/exist", "rb");
  EXPECT_EQ(nullptr, handle);
}

TEST(DbfOpenTest, testOpenNotDbf) {
  const std::string filename = kTestData + std::string("not_a_dbf.dbf");
  auto handle = DBFOpen(filename.c_str(), "rb");
  EXPECT_EQ(nullptr, handle);
}

TEST(DbfOpenTest, testOpenClose) {
  const std::string filename = kTestData + std::string("a.dbf");
  auto handle = DBFOpen(filename.c_str(), "rb");
  EXPECT_NE(nullptr, handle);
  DBFClose(handle);
}

TEST(DbfCreateTest, testDoesNotExist) {
  auto handle = DBFCreate("/does/not/exist");
  EXPECT_EQ(nullptr, handle);
}

TEST(DbfCreateTest, testCreateAlreadyExists) {
  const std::string filename = kTestData + std::string("in-the-way.dbf");
  ASSERT_TRUE(SetContents(filename, "some content"));
  auto handle = DBFCreate(filename.c_str());
  // TODO(schwehr): Seems like a bug to overwrite an existing file.
  EXPECT_NE(nullptr, handle);
  DBFClose(handle);
  auto size = std::filesystem::file_size(filename);
  EXPECT_EQ(34, size);
}

TEST(DbfCreateTest, testCreateClose) {
  const std::string filename = kTestData + std::string("empty.dbf");
  auto handle = DBFCreate(filename.c_str());
  DBFClose(handle);
  auto size = std::filesystem::file_size(filename);
  EXPECT_EQ(34, size);
}

}  // namespace
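Both versions above expect a freshly created, field-less .dbf to be 34 bytes, without saying where that number comes from. A minimal sketch of one plausible breakdown, assuming the usual dBASE III layout (a 32-byte fixed header, one 32-byte descriptor per field, a one-byte 0x0D header terminator, and a one-byte 0x1A end-of-file marker). The constant names and ExpectedEmptyDbfSize are hypothetical and not part of shapelib, and whether shapelib always writes the end-of-file marker is an assumption here:

```cpp
#include <cstddef>

// Assumed dBASE III layout sizes; these names are illustrative and are not
// taken from shapefil.h.
constexpr std::size_t kDbfFixedHeaderBytes = 32;      // version, date, counts
constexpr std::size_t kDbfFieldDescriptorBytes = 32;  // one per field
constexpr std::size_t kDbfHeaderTerminatorBytes = 1;  // 0x0D
constexpr std::size_t kDbfEofMarkerBytes = 1;         // 0x1A (assumed present)

// Hypothetical helper: expected size of a DBF file that has `field_count`
// fields and no records, under the layout assumed above.
constexpr std::size_t ExpectedEmptyDbfSize(std::size_t field_count) {
  return kDbfFixedHeaderBytes + field_count * kDbfFieldDescriptorBytes +
         kDbfHeaderTerminatorBytes + kDbfEofMarkerBytes;
}

// With zero fields this matches the literal used in the tests above.
static_assert(ExpectedEmptyDbfSize(0) == 34, "32 + 0 + 1 + 1");
```

If that breakdown holds, a named constant or a helper like this could replace the bare 34 in the assertions, so a failure points at the part of the layout that changed.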

In recent years, or at least this matches my own interests, standalone shapelib has mostly been a by-product of GDAL's internal copy of shapelib rather than a project moving by itself. I'm not sure we want to invest too much in adding and maintaining more code (tests) in standalone shapelib when it is already quite extensively tested through GDAL.

You've said this before, so it's worth a moment for me to lay out why I'm doing this. First, the users that I see:

  • shapelib is in Debian, Ubuntu, and Fedora as a stand alone package.
  • On my debian-testing work machine, I see 20 binaries with dpkg -L shapelib | grep bin/ | wc -l
  • I have users who directly access the shapelib code from inside gdal without using anything else (except gdal's port)

It is great that GDAL has extensive testing of the shapefile driver with >130 tests (grep 'def test' ogr_*shape*.py | wc -l ) in these files:

Being that high level, I have a hard time translating what I see in those to knowing what shapelib itself is supposed to be doing when I read shapelib code. Most of its behavior is documented only implicitly in how it's used by GDAL. That's very hard for me to follow.

In the GDAL tree I manage, I plan to split off shapelib as a separate thing that just uses the separate gdal:port_lib target I have. I'd like to be able to see tests that directly correspond to the files in question, without all the intermediate code of GDAL and SWIG/Python between the tests and the actual code.

Code like shapelib is going to be with us for a long time, despite so many in the community feeling frustrated by the limitations it imposes. The longer shapelib goes on in the state that it's in, the more drag it places on the community. shapelib keeps coming up again and again for me, and without cleanup I don't have much hope of pushing it out of my awareness and back to where it belongs... just the thing we have to use (directly or indirectly through gdal). I know nobody else is interested in getting shapelib in shape, but knowing that its low-level behavior is documented and tested will help me sleep better. I'm sure it's got plenty of bugs waiting to be found, and when they are, code that is cleaner, documented, and directly tested will be much easier to work on for people who don't already know it.

Benefits of removing the tech debt from shapelib:

  • Easier/quicker understanding for users and software engineers who don't spend all their time with GDAL and family
    • There are users of just shapelib the library
    • There are users of shapelib the 20 binaries that are super shaky
  • Easier to fix bugs when they come up
  • It is easier to find and fix bugs in cleaner, well-tested code
  • Static analyzers and sanitizers do better with cleaner code
  • New warnings will stand out from the large number of warnings that happen even without -Wall -Wextra
  • We could turn on -Werror
  • Distributions could stop packaging their own man pages
  • Putting fuzzers right into the APIs of shapelib will undoubtedly find more things I can fix
  • I still see random crashes in GDAL code that I don't have the energy to track down. They are often in sandboxes or involve user input that I should not access. That's why I started just walking through the gdal tree trying to clean up anything I could

And for my own internal use case:

I have a huge number of targets that bring in GDAL just for some small part of it. We build (almost) all binaries statically to be hermetic and hardened. We often bump into the maximum binary ceiling (which is large). Being able to take the sizable chunk of shapelib only users and have them only link in port_lib and shapelib will be a huge win. Developers should be able to count on the libraries that are below them when they are writing code that depends on them. The tests in GDAL of shape don't make me comfortable that shapelib is actually solid and not going to someday have unexpected behavior changes in corner cases that someone probably counts on somewhere.

shapelib is one thing that I have a chance of getting to a reasonable level of tech debt, such that I think we will be able to mostly ignore it with confidence for another decade or two. I look at GRIB and I just feel hopeless. I poked at libtiff, and two of the core developers seem opposed to the kind of cleanup that makes me confident about code functioning as intended and being maintainable by the community as a whole (as opposed to the specialists who have to deeply understand these libs now). I took a large swing at MB-System, but that didn't change the behavior of any of the coders working on the system. They've only deleted some of the tests I've added and added none of their own. But shapelib is tractable for me with the little bits of time I get between normal work that takes big continuous blocks of time and my kids jumping on me, destroying my ability to think at any sort of level beyond "Room on the Broom" or "Baby Shark".

Shapelib is still in a state of broken window syndrome, in my opinion.

FWIW, I just landed here after hitting a crash in R's built-in DBF reader, which turned out to be a vendored copy of shapelib. PostGIS also had a vendored copy for the shp2pgsql tool. I think it's fair to assume this code is in quite a few major projects outside of GDAL.

It would be good to have a list of the key places where copies exist.

I've been running a dbf fuzzer for about 7000 core hours so far. These 179 files might be helpful when writing unit tests to trigger particular code paths.

corpus-dbf.zip

My stripped-down CMakeLists.txt works. It will work with C++11 or C++14 if dbfopen_test.cc doesn't use #include <filesystem> or #include <string_view>. Setting CMAKE_CXX_EXTENSIONS to False isn't required.

cmake_minimum_required(VERSION 3.10)

set (PROJECT_VERSION_MAJOR 1)
set (PROJECT_VERSION_MINOR 5)
set (PROJECT_VERSION_PATCH 0)
set (PROJECT_VERSION
  "${PROJECT_VERSION_MAJOR}.${PROJECT_VERSION_MINOR}.${PROJECT_VERSION_PATCH}")
project(
  shapelib
  LANGUAGES C CXX
  VERSION ${PROJECT_VERSION})

set(CMAKE_C_FLAGS "-Wall -Wextra -O2")

set(CMAKE_CXX_FLAGS "-Wall -Wextra -O2")
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED True)
set(CMAKE_CXX_EXTENSIONS False)

set(PACKAGE shp)
set(lib_SRC
  dbfopen.c
  safileio.c
  sbnsearch.c
  shpopen.c
  shptree.c
  )

add_library(${PACKAGE} ${lib_SRC})

enable_testing()

find_package(Catch2 REQUIRED)
add_executable(dbfopen_test dbfopen_test.cc)
target_link_libraries(dbfopen_test Catch2::Catch2 shp)

include(CTest)
include(Catch)
catch_discover_tests(dbfopen_test)

With just these files in the tree:

find . -type f | sort
./catch.hpp
./CMakeLists.txt
./dbfopen.c
./dbfopen_test.cc
./safileio.c
./sbnsearch.c
./shapefil.h
./shpopen.c
./shptree.c
./testdata/a.dbf
./testdata/empty.dbf
./testdata/in-the-way.dbf
./testdata/not_a_dbf.dbf
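The starter test leans on C++17 for only two things: std::string_view in SetContents and std::filesystem::file_size for the size checks. A minimal sketch of C++11-friendly stand-ins follows, in case building with an older standard matters. FileSize is a hypothetical helper that is not in the original test, and returning -1 on failure is a choice made here for illustration:

```cpp
#include <fstream>
#include <ios>
#include <string>

// C++11 variant of the test helper: takes std::string instead of
// std::string_view and reports write failures as well as open failures.
bool SetContents(const std::string& file_name, const std::string& content) {
  std::ofstream file(file_name, std::ios::binary | std::ios::trunc);
  if (!file.is_open()) return false;
  file << content;
  file.close();
  return !file.fail();
}

// Hypothetical stand-in for std::filesystem::file_size.
// Returns the size in bytes, or -1 if the file cannot be opened.
long long FileSize(const std::string& file_name) {
  // Opening with `ate` positions the stream at the end of the file,
  // so tellg() reports the file size.
  std::ifstream file(file_name, std::ios::binary | std::ios::ate);
  if (!file.is_open()) return -1;
  const std::streamoff size = file.tellg();
  return static_cast<long long>(size);
}
```

With helpers along these lines, the size assertions could compare against FileSize(filename) and the two C++17 includes could be dropped.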

> I've been running a dbf fuzzer for about 7000 core hours so far. These 179 files might be helpful when writing unit tests to trigger particular code paths.

All 174 files behave as expected:

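// Assumes `namespace fs = std::filesystem;` and an enclosing TEST_CASE.
SECTION("Open bad DBF files")
{
    for (const auto& entry : fs::directory_iterator(fs::path{ "corpus-dbf" }))
    {
        // Every fuzzer-generated file is expected to be rejected by DBFOpen.
        const auto handle = DBFOpen(entry.path().string().c_str(), "rb");
        REQUIRE(handle == nullptr);
        DBFClose(handle);
    }
}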

@schwehr Note that I first went with the Catch2 unit testing framework but switched to GTest afterwards.

Thanks for getting that into the code base!