CodeQL: regexpCapture() lazy match doesn't work, but Java Pattern does
ReDistant opened this issue · comments
According to the regexpCapture() description:
When the given regular expression matches the entire receiver,
returns the substring matched by the given capture group (starting at 1).
The regex format used is Java's Pattern.
I tried a select query like the one below:
select "{tableName} where id=#{tableId} ".regexpCapture("\\{(.*)}\\s*", 1) as cmp1,
"{tableName} where id=#{tableId} ".regexpCapture("\\{(.*?)}\\s*", 1) as cmp2 // lazy match
Its result is:
cmp1: tableName} where id=#{tableId
cmp2: tableName} where id=#{tableId
When I use Java's Pattern instead, I get a different result. The code looks like this:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Wrapper class added so the snippet compiles; the name is arbitrary.
public class RegexCompare {
    public static void main(String[] args) {
        String input = "{tableName} where id=#{tableId} "; // same string
        String patternString1 = "\\{(.*)}\\s*";  // greedy match
        String patternString2 = "\\{(.*?)}\\s*"; // lazy match
        Pattern pattern1 = Pattern.compile(patternString1);
        Matcher matcher1 = pattern1.matcher(input);
        Pattern pattern2 = Pattern.compile(patternString2);
        Matcher matcher2 = pattern2.matcher(input);
        // find() looks for the first match anywhere in the input
        if (matcher1.find()) {
            String group = matcher1.group(1);
            System.out.println("cmp1: " + group);
        }
        if (matcher2.find()) {
            String group = matcher2.group(1);
            System.out.println("cmp2: " + group);
        }
    }
}
Its result is:
cmp1: tableName} where id=#{tableId
cmp2: tableName
I'm a little confused. Is there something wrong with my usage?
Why do the two approaches give different regex match results?
Why doesn't the lazy match in CodeQL's regexpCapture() behave the way it does in Java?
My test environment:
Java:
openjdk version "1.8.0_402"
OpenJDK Runtime Environment Corretto-8.402.08.1 (build 1.8.0_402-b08)
OpenJDK 64-Bit Server VM Corretto-8.402.08.1 (build 25.402-b08, mixed mode)
codeql-cli:
PS C:\Users\xxx> codeql -V
CodeQL command-line toolchain release 2.16.6.
Copyright (C) 2019-2024 GitHub, Inc.
Unpacked in: D:\develop-kits\codeql
Analysis results depend critically on separately distributed query and
extractor modules. To list modules that are visible to the toolchain,
use 'codeql resolve qlpacks' and 'codeql resolve languages'.
codeql:
PS D:\codeql> git log
commit 7e1dd38623f822046bda8e7d7652cb41f638d417 (HEAD -> main, origin/main, origin/HEAD)
Hi @ReDistant,
It's a bit subtle, but the documentation for regexpCapture states that the regexp has to match the entire string:
When the given regular expression matches the entire receiver…
Looking at your non-greedy regexp:
select "{tableName} where id=#{tableId} ".regexpCapture("\\{(.*?)}\\s*", 1)
Let's consider what happens when the regexp engine attempts to match the first part of the regexp, \{(.*?)}, against {tableName}, such that group 1 is tableName.
This can only succeed if the rest of the regexp (\s*) matches all of the rest of the string (" where id=#{tableId} "), but it doesn't. That means tableName cannot be captured as group 1, so the regexp engine will instead try a different, longer match.
{tableName} where id=#{tableId} also matches \{(.*?)}, but this time the remainder is a single trailing space, which does match \s*. That's why you see group 1 captured as tableName} where id=#{tableId.
You can reproduce this behavior in your Java example by using matches() instead of find().
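As a rough illustration of that point, here is a minimal Java sketch comparing find() with matches() on the same lazy pattern. The class name MatchesVsFind and the two helper methods are made up for this example; only Pattern and Matcher come from the JDK. It assumes, as the documentation quoted above says, that regexpCapture requires a whole-string match, which is what matches() does in Java.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchesVsFind {
    // Hypothetical helper: group 1 when the regex matches the ENTIRE input, else null.
    static String captureWholeString(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    // Hypothetical helper: group 1 of the first match found anywhere in the input, else null.
    static String captureFirstOccurrence(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "{tableName} where id=#{tableId} ";
        String lazy = "\\{(.*?)}\\s*";
        // find() may stop at the first "}", so the lazy group stays short.
        System.out.println("find():    " + captureFirstOccurrence(lazy, input));
        // matches() must consume the whole string, so the lazy group is forced to grow.
        System.out.println("matches(): " + captureWholeString(lazy, input));
    }
}
```

With find(), group 1 is tableName; with matches(), it is tableName} where id=#{tableId, the same value the QL query returned.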
To fix your QL code, you can modify your regexp so that it will still match the entire string when tableName is captured as group 1. For example:
select "{tableName} where id=#{tableId} ".regexpCapture("\\{(.*?)}.*", 1) as cmp2
This produces the result you were hoping for:
cmp2: tableName
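The same fix can be checked on the Java side. The sketch below assumes matches() is a fair stand-in for regexpCapture's whole-string requirement; the class name WholeStringLazyCapture and its helper are hypothetical. It runs the original lazy pattern and the one with a trailing .* through matches().

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WholeStringLazyCapture {
    // Hypothetical helper: group 1 when the regex matches the ENTIRE input, else null.
    static String captureWholeString(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "{tableName} where id=#{tableId} ";
        // Original pattern: \s* cannot absorb " where id=...", so the lazy group must grow.
        System.out.println("original: " + captureWholeString("\\{(.*?)}\\s*", input));
        // Fixed pattern: the trailing .* absorbs the rest, so the lazy group can stay short.
        System.out.println("fixed:    " + captureWholeString("\\{(.*?)}.*", input));
    }
}
```

One caveat for the Java version: by default . does not match line terminators, so an input containing newlines would need Pattern.DOTALL (or an inline (?s) flag) for the trailing .* to reach the end of the string.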
Hi @nickrolfe
Thank you very much for your guidance. I misunderstood the usage of the regexpCapture predicate. With your help, I can now correctly obtain the string value I want.