github / codeql

CodeQL: the libraries and queries that power security researchers around the world, as well as code scanning in GitHub Advanced Security

Home Page:https://codeql.github.com

CodeQL: regexpCapture() lazy search doesn't work but Java Pattern can

ReDistant opened this issue · comments

The regexpCapture() description says:

When the given regular expression matches the entire receiver, 
returns the substring matched by the given capture group (starting at 1). 
The regex format used is Java's Pattern.

I tried a QL select like the one below:

select "{tableName} where id=#{tableId}   ".regexpCapture("\\{(.*)}\\s*", 1) as cmp1,
  "{tableName} where id=#{tableId}   ".regexpCapture("\\{(.*?)}\\s*", 1) as cmp2  // lazy match

Its result is:

cmp1: tableName} where id=#{tableId
cmp2: tableName} where id=#{tableId

When I use Java's Pattern directly, I get a different result. The code is:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Class name is arbitrary; the original post showed only the main method.
public class RegexTest {
    public static void main(String[] args) {
        String input = "{tableName} where id=#{tableId}   ";   // same string
        String patternString1 = "\\{(.*)}\\s*";   // greedy match
        String patternString2 = "\\{(.*?)}\\s*";  // lazy match

        Matcher matcher1 = Pattern.compile(patternString1).matcher(input);
        Matcher matcher2 = Pattern.compile(patternString2).matcher(input);

        // find() looks for a match anywhere in the input, not a whole-string match
        if (matcher1.find()) {
            System.out.println("cmp1: " + matcher1.group(1));
        }
        if (matcher2.find()) {
            System.out.println("cmp2: " + matcher2.group(1));
        }
    }
}

Its result is:

cmp1: tableName} where id=#{tableId
cmp2: tableName

I'm a little confused. Is there something wrong with my usage?
Why do the two approaches give different regex match results?
Why does the lazy match in CodeQL's regexpCapture() not seem to behave the way it does in Java?

My test environment:
Java:

openjdk version "1.8.0_402"
OpenJDK Runtime Environment Corretto-8.402.08.1 (build 1.8.0_402-b08)
OpenJDK 64-Bit Server VM Corretto-8.402.08.1 (build 25.402-b08, mixed mode)

codeql-cli:

PS C:\Users\xxx> codeql -V
CodeQL command-line toolchain release 2.16.6.
Copyright (C) 2019-2024 GitHub, Inc.
Unpacked in: D:\develop-kits\codeql
   Analysis results depend critically on separately distributed query and
   extractor modules. To list modules that are visible to the toolchain,
   use 'codeql resolve qlpacks' and 'codeql resolve languages'.

codeql:

PS D:\codeql> git log
commit 7e1dd38623f822046bda8e7d7652cb41f638d417 (HEAD -> main, origin/main, origin/HEAD)

Hi @ReDistant,

It's a bit subtle, but the documentation for regexpCapture states that the regexp has to match the entire string:

When the given regular expression matches the entire receiver…

Looking at your non-greedy regexp:

select "{tableName} where id=#{tableId}   ".regexpCapture("\\{(.*?)}\\s*", 1)

Let's consider what happens when the regexp engine attempts to match the first part of the regexp \{(.*?)} against {tableName} such that group 1 is tableName.

This can only succeed if the rest of the regexp (\s*) matches all of the rest of the string (where id=#{tableId} ), but it doesn't. That means tableName cannot be captured as group 1. So the regexp engine will instead try a different, longer match.

{tableName} where id=#{tableId} also matches \{(.*?)}, but this time the remainder is only the trailing whitespace, which does match \s*. That's why you see group 1 captured as tableName} where id=#{tableId.

You can reproduce this behavior in your Java example by using matches() instead of find().
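
As a minimal sketch of that comparison, the class below runs the same lazy pattern through both find() and matches(). The class and method names (MatchesVsFind, captureWithFind, captureWithMatches) are made up for illustration; only Pattern and Matcher are real Java APIs.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchesVsFind {
    // Returns group 1 of the first match found anywhere in the input, or null.
    static String captureWithFind(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    // Returns group 1 only if the pattern matches the entire input, or null.
    static String captureWithMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "{tableName} where id=#{tableId}   ";
        String lazy = "\\{(.*?)}\\s*";
        // prints "find:    tableName"
        System.out.println("find:    " + captureWithFind(lazy, input));
        // prints "matches: tableName} where id=#{tableId"
        System.out.println("matches: " + captureWithMatches(lazy, input));
    }
}
```

With matches(), the lazy quantifier still prefers the shortest capture, but it has to keep expanding until the rest of the pattern can consume everything up to the end of the string.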

To fix your QL code, you can modify your regexp so that it will still match the entire string when tableName is captured as group 1. For example:

select "{tableName} where id=#{tableId}   ".regexpCapture("\\{(.*?)}.*", 1) as cmp2

This produces the result you were hoping for:

cmp2: tableName
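
The same fix can be checked on the Java side. The sketch below assumes, based on the documentation quoted above, that regexpCapture behaves like Matcher.matches() followed by group(n); the class WholeStringCapture and its capture method are hypothetical helpers, not part of CodeQL or the JDK.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeStringCapture {
    // Assumed model of regexpCapture: the regex must match the whole input,
    // and the requested group is returned; null stands in for "no result".
    static String capture(String input, String regex, int group) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(group) : null;
    }

    public static void main(String[] args) {
        String input = "{tableName} where id=#{tableId}   ";
        // Original pattern: \s* must consume the rest, so the capture grows.
        // prints "tableName} where id=#{tableId"
        System.out.println(capture(input, "\\{(.*?)}\\s*", 1));
        // Fixed pattern: the trailing .* consumes the rest of the string.
        // prints "tableName"
        System.out.println(capture(input, "\\{(.*?)}.*", 1));
    }
}
```

Because the trailing .* can absorb whatever follows the first closing brace, the lazy group is free to stop at the shortest possible capture.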

Hi @nickrolfe

Thank you very much for your guidance. I misunderstood the usage of the regexpCapture predicate. With your help, I can now correctly obtain the string value I want.