gitpython-developers / GitPython

GitPython is a python library used to interact with Git repositories.

Home Page:http://gitpython.readthedocs.org

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Unhandled IndexError when calling .read() on a malformed config file

DaveLak opened this issue · comments

Hi, I noticed the fuzzing tests that OSS-Fuzz runs on this project are broken and while I was working on fixing them I believe I came across a minor bug:

The Bug

An Uncaught Python exception: IndexError: string index out of range can be triggered if when trying to call .read() on GitConfigParser if it was initialized with a malformed config file.

Current Behavior

It's easiest to demonstrate, so please consider this example:

import io
from git.config import GitConfigParser


def reproduce_issue():

    malformed_config_content_bytestring = b'[-]\nk:"v\n"'

    problematic_config_file = io.BytesIO(malformed_config_content_bytestring)

    # problematic_config_file looks now like this:
    """
    [-]
    k:"v
    "
    """

    # We have to name the file otherwise we'll trigger
    # `AttributeError: '_io.BytesIO' object has no attribute 'name'` here:
    # https://github.com/gitpython-developers/GitPython/blob/c7675d2cedcd737f20359a4a786e213510452413/git/config.py#L623
    problematic_config_file.name = "fuzzedconfig.config"

    # This is fine
    git_config = GitConfigParser(problematic_config_file)
    
    # The next line raised an unhandled `IndexError: string index out of range`
    git_config.read()


if __name__ == "__main__":
    reproduce_issue()

Assuming that code is in /path/to/example/config_indexerror_reproduction.py then

python config_indexerror_reproduction.py

produces something akin to:

Traceback (most recent call last):
  File "/path/to/example/config_indexerror_reproduction.py", line 30, in <module>
    reproduce_issue()
  File "/path/to/example/config_indexerror_reproduction.py", line 26, in reproduce_issue
    git_config.read()
  File "/path/to/example/.venv/lib/python3.12/site-packages/git/config.py", line 607, in read
    self._read(file_path, file_path.name)
  File "/path/to/example/.venv/lib/python3.12/site-packages/git/config.py", line 514, in _read
    cursect.setlast(optname, optval + string_decode(line))
                                      ^^^^^^^^^^^^^^^^^^^
  File "/path/to/example/.venv/lib/python3.12/site-packages/git/config.py", line 441, in string_decode
    if v[-1] == "\\":
       ~^^^^
IndexError: string index out of range
My Reproduction Environment Details

The reproduction code above was tested on:

Python Version: 3.12.1 (main, Feb  5 2024, 16:23:00) [Clang 15.0.0 (clang-1500.1.0.2.5)]
OS Information: macOS-14.4.1-x86_64-i386-64bit
Installed Packages:
Package   Version
--------- -------
gitdb     4.0.11
GitPython 3.1.42
pip       24.0
smmap     5.0.1

And the fuzzer environment was:

Python Version: 3.8.3 (default, Mar 17 2024, 03:21:27)
[Clang 15.0.0 (https://github.com/llvm/llvm-project.git bf7f8d6fa6f460bf0a16ffe
OS Information: Linux-6.6.16-linuxkit-x86_64-with-glibc2.2.5
Installed Packages:
Package                   Version
------------------------- -------
altgraph                  0.17.4
atheris                   2.3.0
coverage                  6.3.2
importlib_metadata        7.0.2
gitdb                     4.0.11
GitPython                 3.1.42
pip                       24.0
smmap                     5.0.1
packaging                 24.0
pyinstaller               5.0.1
setuptools                41.0.1
six                       1.15.0
zipp                      3.18.1

So, if I'm reading the source correct, it seems like the combination of some header section ([-] above) followed by a key/value assignment that has a value consisting of a double quoted string with a new line inside it confuses the check here which strips the " on the new line:

GitPython/git/config.py

Lines 521 to 528 in c7675d2

else:
line = line.rstrip()
if line.endswith('"'):
is_multi_line = False
line = line[:-1]
# END handle quotations
optval = cursect.getlast(optname)
cursect.setlast(optname, optval + string_decode(line))

and passes an empty string to string_decode which isn't expecting that when it indexes into it's arg:

GitPython/git/config.py

Lines 454 to 455 in c7675d2

def string_decode(v: str) -> str:
if v[-1] == "\\":

Expected Behavior

I'd expect an explicitly raised ParsingError similar to how it's handled a little further up:

GitPython/git/config.py

Lines 514 to 518 in c7675d2

else:
# Check if it's an option with no value - it's just ignored by git.
if not self.OPTVALUEONLY.match(line):
if not e:
e = cp.ParsingError(fpname)

On a related note, do the maintainers of this project have any interest in bringing the fuzzing test code into this repo? I searched the repository for prior discussions about it but didn't come across anything.

I'm happy to:

  • open a new issue or GH discussion thread with more context if it'd be helpful
  • perform the migration/integration work; since I have the necessary changes locally already it's just a matter of what repo I make the PR against with the test improvements
    • (edit: I ended up putting the fix into PR while I wait to hear back as I figured it's better fixed than broken regardless of where it lived 😅 google/oss-fuzz#11763)
  • help support the fuzz integration if desired

Thanks a lot for this wonderfully complete bug report, it's much appreciated.

Personally, I think the config-parser implementation here is probably very broken and non-conforming, and the fuzzer will have a very good time with it. A problem I see is that reproductions are usually hidden behind a login, so it requires extra work to make these available here so the community has a chance to fix them. However, I am open to give it a shot and see how it goes.

I definitely welcome you to help with maintaining the fuzzer integration, and would appreciate if that would also entail the maintenance of the issues it finds such that all information for reproduction would be contained within them so it's not only maintainers who can fix them.

Thanks for the prompt reply 🙂

Silly me for not realizing earlier, but I see gitoxide has been integrated for a while with OSS-Fuzz issues mirrored in GH issues and some good results. Given that context, and this project's maintenance mode status I definitely hear where your coming from.

I think it probably is worth opening up a discussion thread to continue the fuzzing convo for the sake of keeping this issue on topic as I do have some follow-up questions for you related to what you think a fix for this might look like (if any). I went ahead and replied to you here: #1889 (comment)

I started to take a closer look at this and now have a better idea of what's going on here as well as a proof-of-concept fix, but I wanted to get your thoughts on this @Byron, before I put it into PR. CC @EliahKagan in case you have thoughts here as well.

Initially, my instinct was to refer to the git-scm.com docs on Git config syntax and use that to inform the fix here, but (ignoring the fact that there isn't much guidance on value formats relevant to this issue) this portion of @Byron's comment above caught my attention and made re-think things a bit:

I think the config-parser implementation here is probably very broken and non-conforming

Specifically, the "non-conforming" part caught my eye. If GitConfigParser's behavior is already known (and to some degree expected) to be non-conforming to canonical Git's behavior, then I don't think I agree with my initial assessment of the "Expected Behavior" that I described in the PR description (i.e., this should throw an exception) regardless of how canonical Git would handle it. Introducing a "fix" that throws an exception seems unnecessarily disruptive to users that may have otherwise working code that was written around this "non-conforming implementation". Preserving the existing behavior and just making it safer around exception cases seems like a better approach.

So instead, I think a more considerate and generally simpler fix would be to maintain the existing parsing behavior, and just make its checks a little more robust to eliminate the risk of the unhandled IndexError described in this issue.

The diff below shows the change I am considering which would:

  • Improve the guarding if check in the string_decode function to safely handle empty strings and prevent index errors when accessing string elements (this resolves the root cause of this issue.)
  • Update the option value parsing logic in the _read method to more robustly handle quoted values, ensuring correct handling even when quotes are not closed properly or when the line continues beyond a quoted segment.
  • Add checks to prevent passing empty strings to string_decode, which could occur in multi-line value scenarios.

Here's the diff:

diff --git a/git/config.py b/git/config.py
index 3ce9b123..91b69531 100644
--- a/git/config.py
+++ b/git/config.py
@@ -452,7 +452,7 @@ class GitConfigParser(cp.RawConfigParser, metaclass=MetaParserBuilder):
         e = None  # None, or an exception.
 
         def string_decode(v: str) -> str:
-            if v[-1] == "\\":
+            if v and v[-1] == "\\":
                 v = v[:-1]
             # END cut trailing escapes to prevent decode error
 
@@ -508,6 +508,8 @@ class GitConfigParser(cp.RawConfigParser, metaclass=MetaParserBuilder):
                     if len(optval) > 1 and optval[0] == '"' and optval[-1] != '"':
                         is_multi_line = True
                         optval = string_decode(optval[1:])
+                    elif len(optval) > 1 and optval[0] == '"' and optval[-1] == '"':
+                        optval = string_decode(optval[1:-1])  # safely decode fully quoted value
                     # END handle multi-line
                     # Preserves multiple values for duplicate optnames.
                     cursect.add(optname, optval)
@@ -524,8 +526,10 @@ class GitConfigParser(cp.RawConfigParser, metaclass=MetaParserBuilder):
                     is_multi_line = False
                     line = line[:-1]
                 # END handle quotations
+                if line:  # Ensure line is not empty before decoding
+                    line = string_decode(line)
                 optval = cursect.getlast(optname)
-                cursect.setlast(optname, optval + string_decode(line))
+                cursect.setlast(optname, optval + line)
             # END parse section or option
         # END while reading

What do you think? Does this sound like a reasonable approach here, or is there some context I might be missing?

If this does sound like something worth introducing, let me know and:

  1. I'll add a Unit test.
  2. Remove the workaround for this error in the fuzz_config.py test so it's covered there as well.
  3. Open a PR.

So instead, I think a more considerate and generally simpler fix would be to maintain the existing parsing behavior, and just make its checks a little more robust to eliminate the risk of the unhandled IndexError described in this issue.

The above hits home, and perfectly describes how I imagine handling this. It's an improvement, without going overboard and try to fix something that is systematically broken - a conforming parser can't be made based on an INI parser.

If this does sound like something worth introducing, let me know and:

  1. I'll add a Unit test.
  2. Remove the workaround for this error in the fuzz_config.py test so it's covered there as well.
  3. Open a PR.

This sounds like a plan! Thanks again.

PR with the fix: #1908

Only the changes to string_decode were necessary, so that's all I changed.