thblt / write-yourself-a-git

Learn Git by reimplementing it from scratch

Home Page: https://wyag.thb.lt


Errors in regexes for hash resolution

jstern103 opened this issue

Section 7.6.1 rewrites object_resolve() to resolve object hashes using regular expressions. However, the new implementation appears to be flawed in a few ways.

First, the function cannot properly identify full-length hashes. If name is neither empty nor "HEAD", it gets matched against hashRE, which accepts any hex string between 1 and 16 characters long. If the match succeeds, the code tries to determine whether the hash is complete by checking whether its length is 40, which can never be true for a string of at most 16 characters. A full 40-character hash does not match hashRE at all, and since there is no else or elif clause for that match, the function will always return an empty list of candidates when passed a full hash.

Second, the function defines a second regex, smallHashRE, which is identical to hashRE. In all likelihood, this is a side-effect of the previous problem, since hashRE clearly has an erroneous definition.

Third, smallHashRE never gets used. Instead, to identify short hashes, the function checks any string that matched hashRE to see if it's at least 4 characters long.
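
To make the first problem concrete, here is a minimal sketch that assumes hashRE is defined with the 1 to 16 character range described above. The two sample hashes are made up for illustration and do not refer to real objects.

```python
import re

# Pattern as described in this issue: 1 to 16 hex characters.
hashRE = re.compile(r"^[0-9A-Fa-f]{1,16}$")

# Made-up sample values, not real object IDs.
full_hash = "0123456789abcdef0123456789abcdef01234567"  # 40 hex characters
short_hash = "0123abcd"                                  # 8 hex characters

full_match = hashRE.match(full_hash)
short_match = hashRE.match(short_hash)

# The full hash is too long for the pattern, so it is rejected before
# the length check for 40 characters is ever reached.
print("full hash matches: ", full_match is not None)
print("short hash matches:", short_match is not None)
```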

My proposed solution is to modify the code as follows:

hashRE = re.compile(r"^[0-9A-Fa-f]{40}$")
smallHashRE = re.compile(r"^[0-9A-Fa-f]{4,16}$")
# ...
if hashRE.match(name):
    return [ name.lower() ]
elif smallHashRE.match(name):
    name = name.lower()
    prefix = name[0:2]
# ...
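
For anyone who wants to try the two branches in isolation, here is a rough sketch using the patterns proposed above. classify_name is a hypothetical helper that only reports which branch a name would take; it is not part of wyag and does not look anything up in the object store.

```python
import re

hashRE = re.compile(r"^[0-9A-Fa-f]{40}$")
smallHashRE = re.compile(r"^[0-9A-Fa-f]{4,16}$")

def classify_name(name):
    """Hypothetical helper: return "full", "short" or None for a name."""
    if hashRE.match(name):
        # Exactly 40 hex characters: treat as a complete hash.
        return "full"
    elif smallHashRE.match(name):
        # 4 to 16 hex characters: treat as an abbreviated hash.
        return "short"
    # Anything else (branch names, tags, too-short strings) is not a hash.
    return None
```

With these patterns, a 40-character hex string takes the first branch, "abcd" takes the second, and "abc" or "master" take neither.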

Good suggestions. One small improvement: smallHashRE should use {4,40}; otherwise an abbreviated hash of 17 to 39 characters won't match either regex. :)
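
A quick way to see the gap, as a sketch with made-up values: the names narrow and wide are only illustrative labels for the {4,16} and {4,40} variants of smallHashRE.

```python
import re

narrow = re.compile(r"^[0-9A-Fa-f]{4,16}$")  # range from the original proposal
wide = re.compile(r"^[0-9A-Fa-f]{4,40}$")    # range suggested in this comment

prefix17 = "0123456789abcdef0"   # 17 hex characters, made up
full_hash = "0123456789abcdef0123456789abcdef01234567"  # 40 hex characters, made up

narrow_hit = narrow.match(prefix17) is not None
wide_hit = wide.match(prefix17) is not None
wide_hits_full = wide.match(full_hash) is not None

print("narrow matches 17 chars:", narrow_hit)
print("wide matches 17 chars:  ", wide_hit)
print("wide matches 40 chars:  ", wide_hits_full)
```

Since the wider pattern also accepts 40-character strings, the full-hash check presumably has to stay first in the if/elif chain so that complete hashes are still returned directly.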

Just a different suggestion, which would also remove the need for regex completely:

is_hash = True
for ch in name.lower():
    if not ('a' <= ch <= 'f' or '0' <= ch <= '9'):
        is_hash = False
        break

if is_hash:
    if len(name) == 40:
        ...
    elif len(name) >= 4:
        ...
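
The same idea can be written a bit more compactly with string.hexdigits from the standard library. This is only a sketch: is_hex and classify_name are hypothetical helper names, and the upper bound of 40 on short hashes is an assumption added here, not something the snippet above enforces.

```python
import string

# string.hexdigits is '0123456789abcdefABCDEF'
HEX_DIGITS = set(string.hexdigits)

def is_hex(name):
    """Return True if name is non-empty and contains only hex digits."""
    return len(name) > 0 and all(ch in HEX_DIGITS for ch in name)

def classify_name(name):
    """Hypothetical helper: return "full", "short" or None for a name."""
    if not is_hex(name):
        return None
    if len(name) == 40:
        return "full"
    if 4 <= len(name) < 40:
        # Assumption: anything longer than 40 characters is not a hash.
        return "short"
    return None
```

For example, classify_name("abcd") gives "short", while a hex string of 41 characters gives None instead of being treated as an abbreviated hash.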