file-icons / atom

Atom file-specific icons for improved visual grepping.

Problematic regular expression in caseKludge function

ericcornelissen opened this issue · comments

TL;DR: There is a (potentially) problematic regular expression in the caseKludge function which either 1) makes the code more difficult to read than it needs to be, or 2) is a minor bug w.r.t. the regular expressions generated for file-icons.

Note: This is not a high priority issue at all, just something I wanted to bring to your attention and potentially get resolved.


I have been investigating an alert for this project on LGTM regarding a regular expression in the caseKludge function, in particular this one:

if(/[\W_ \t]?/.test(s)) return "[\\W_ \\t]?";

As per the alert, the regular expression in the if-statement (/[\W_ \t]?/) will match any non-empty string - and hence test will always return true as the string s is always one character. You can see this in action in this regular expression on regex101.com, which shows zero-length matches. Alternatively, you can see this by running /[\W_ \t]?/.test("1") and /[\W_ \t]?/.test("") in your Atom or browser console; both return true.
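To make the zero-length match concrete, here is a small sketch comparing the pattern from the if-statement with the same pattern minus the trailing ?. The sample inputs and the variable names (optional, required, samples, results) are made up for illustration and are not taken from the project.

```javascript
// The trailing `?` makes the character class optional, so the pattern can
// always succeed with a zero-length match, whatever the input is.
const optional = /[\W_ \t]?/;
// Without the `?`, the input must actually contain a matching character.
const required = /[\W_ \t]/;

// Illustrative inputs only (not from config.cson).
const samples = ["1", "a", "-", "_", " ", ""];

const results = samples.map(input => ({
	input,
	optional: optional.test(input), // true for every sample
	required: required.test(input), // true only for "-", "_" and " "
}));

console.table(results);
```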

As a result the return-statement on the next line is never executed. Now, I'm not sure what the intended behaviour here is, as the git history for this project does not provide any insights (the line was added in e05a36a without a relevant explanation, and I didn't take the time to look through the history of alhadis/utils).

From my point of view, there are two possible solutions w.r.t. the dead code on line 38:

  1. Remove line 37 and change line 38 to return "[\\W_ \\t]?";. This would leave the behaviour unchanged but make the code clearer.
  2. Change the regular expression on line 37 to /[\W_ \t]/ (removing the ?). Then the test only succeeds if the character actually matches the regular expression. However, I'm not 100% sure if it is actually possible for any character not to match that regular expression, and if that is in fact impossible the first solution would be just fine.

Furthermore, the return value of line 37 has the same potential problem (i.e. it matches empty strings). I did try to compile without the trailing ? in the return value of line 37 and found that only two lines in .icondb.js change after recompilation, namely:

- ["asp-icon",["dark-blue","dark-blue"],/\.asp$/i,,false,,/\.asp$/i,/^[Aa][Ss][Pp][\W_ \t]?[Nn][Ee][Tt]$|^aspx(?:-vb)?$/],
+ ["asp-icon",["dark-blue","dark-blue"],/\.asp$/i,,false,,/\.asp$/i,/^[Aa][Ss][Pp][\W_ \t][Nn][Ee][Tt]$|^aspx(?:-vb)?$/],

and

- ["kx-icon",["medium-blue","medium-blue"],/\.q$/i,,false,,/^source\.q$/,/^[Qq][\W_ \t]?[Kk][Dd][Bb][\W_ \t]?$|^Kdb\s*\+$/],
+ ["kx-icon",["medium-blue","medium-blue"],/\.q$/i,,false,,/^source\.q$/,/^[Qq][\W_ \t][Kk][Dd][Bb][\W_ \t]$|^Kdb\s*\+$/],

from which it follows that, for example in the case of ASP.net, currently both asp.net and aspnet would match, but by changing the return value of line 37 aspnet no longer matches. Since I'm unfamiliar with ASP.net and Q/Kdb+ myself, I don't know if the changes in the diffs above are problematic or not...
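The effect on the ASP.net rule can be checked directly with the two patterns from the diff above. The list of candidate names is only an illustrative assumption; the regular expressions are copied verbatim from the two .icondb.js lines.

```javascript
// Pattern currently generated (separator is optional).
const current  = /^[Aa][Ss][Pp][\W_ \t]?[Nn][Ee][Tt]$|^aspx(?:-vb)?$/;
// Pattern generated after dropping the `?` from the return value.
const proposed = /^[Aa][Ss][Pp][\W_ \t][Nn][Ee][Tt]$|^aspx(?:-vb)?$/;

// Illustrative names only.
const names = ["asp.net", "ASP.NET", "asp_net", "aspnet", "aspx-vb"];

const comparison = names.map(name => ({
	name,
	current:  current.test(name),
	proposed: proposed.test(name),
}));

// Only "aspnet" differs: it matches `current` but not `proposed`.
console.table(comparison);
```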

Ugh. This was recently refactored to remove pointless crap that added nothing but wasted CPU cycles. Honestly, I've no fucking idea what I was thinking when I wrote the icon-compiler… the caseKludge is a particularly glaring example of ugly code I can't remove without manually checking/updating every rule defined in config.cson.

As per the alert, the regular expression in the if-statement (/[\W_ \t]?/) will match any non-empty string.
[…]
However, I'm not 100% sure if it is actually possible for any character not to match that regular expression, and if that is in fact impossible the first solution would be just fine.

Oops. Yeah, erh, I'd better explain. Or try to, rather.

First, the reason it's not causing an issue is that strategies don't perform lookups for falsey input (querying "" should always return null, for presumably obvious reasons).

Second, the /[\W_ \t]?/ part was thoughtlessly copied from code that attempted to generate regular expressions for matching all potential variations of a name that contained word boundaries (e.g., taking "coffee-script" as input and returning a RegExp matching Coffee Script, coffee-script, coffee_script or CoffeeScript). Needless to say, this was a stupid idea with very little practical benefit.
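For readers wondering what such a generator might look like, below is a minimal sketch of the idea. It is not the original code that the fragment was copied from; the function name fuzzyNamePattern and its behaviour are assumptions made purely for illustration.

```javascript
// Hypothetical helper (not part of file-icons): builds a RegExp matching
// common spellings of a name that contains word boundaries.
function fuzzyNamePattern(name){
	const source = name
		.split(/[\W_]+/)       // Break the name at separators such as "-", "_" or " "
		.filter(Boolean)       // Drop empty pieces from leading/trailing separators
		.join("[\\W_ \\t]?");  // Allow at most one optional separator between words
	return new RegExp("^" + source + "$", "i");
}

const pattern = fuzzyNamePattern("coffee-script");
const spellings = ["Coffee Script", "coffee-script", "coffee_script", "CoffeeScript"];
console.log(spellings.map(s => pattern.test(s))); // [true, true, true, true]
console.log(pattern.test("coffee--script"));      // false: only one separator is allowed
```

Here the optional separator is useful because it sits between two required words; applied to a single character on its own, as in the if-statement, the same fragment can never fail.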

I say we nuke the ? to eliminate the dead-code. I'd welcome a PR (you provided a fantastic write-up, it's only fair you do the honours. 🙇 )

You mean eliminate the ? from the if-statement of line 37 only (to eliminate the dead code) or from the return-statement of line 37 as well? Based on "Needless to say, this was a stupid idea with very little practical benefit." I would guess both 🤔

Also, I did some more testing using fuzz-like testers for JavaScript (in particular Fuzzy, which isn't a real fuzzer but close enough for this use-case I think) and it seems that it is in fact impossible to reach line 38 even if we change line 37. This lines up with my understanding of the regular expression on line 37.

So, in the interest of "nothing but wasted CPU cycles", would you be okay with replacing lines 37 and 38 by a single line return "[\\W_ \\t]?"; or return "[\\W_ \\t]"; (and which of the two)? I tried to compile using both of these options: using return "[\\W_ \\t]?"; the .icondb.js is unchanged, and using return "[\\W_ \\t]"; the .icondb.js is changed in the same way as I originally reported.

Just replace the function with this. Ugh…

/**
 * Synthesise case-insensitivity for a regexp string.
 *
 * JavaScript doesn't support scoped modifiers like (?i),
 * so this function attempts to approximate the closest thing.
 *
 * @param {String} input - Case-insensitive text
 * @return {String}
 */
function caseKludge(input){
	return input.split("").map(match =>
		/[A-Z]/.test(match) ? "[" + match + match.toLowerCase() + "]" :
		/[a-z]/.test(match) ? "[" + match + match.toUpperCase() + "]" :
		match
	).join("").replace(/(\[\w{2,3}\])(\1+)/g, (match, first, rest) =>
		first + "{" + ((rest.length / first.length) + 1) + "}");
}
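A quick way to see what the simplified function produces is to run it on a few inputs. The function body below is copied from the snippet above; the sample inputs and the observation about metacharacters are illustrative, not part of the project's test suite.

```javascript
function caseKludge(input){
	return input.split("").map(match =>
		/[A-Z]/.test(match) ? "[" + match + match.toLowerCase() + "]" :
		/[a-z]/.test(match) ? "[" + match + match.toUpperCase() + "]" :
		match
	).join("").replace(/(\[\w{2,3}\])(\1+)/g, (match, first, rest) =>
		first + "{" + ((rest.length / first.length) + 1) + "}");
}

console.log(caseKludge("DEF"));     // [Dd][Ee][Ff]
console.log(caseKludge("ASP.net")); // [Aa][Ss][Pp].[nN][eE][tT]
console.log(caseKludge("aaa"));     // [aA]{3} (repeated classes are collapsed)

// Non-letters pass through untouched, so regex metacharacters in the input
// keep their special meaning: the "." here matches any single character.
const asp = new RegExp("^" + caseKludge("ASP.net") + "$");
console.log(asp.test("asp.NET")); // true
console.log(asp.test("aspxnet")); // true
console.log(asp.test("aspnet"));  // false
```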

I mean, I've written some shit code in the past 16 years, but this is one incorrigible abortion I'm literally ashamed of bringing into existence.

Also, I did some more testing using fuzz-like testers for JavaScript (in particular Fuzzy, which isn't a real fuzzer but close enough for this use-case I think)

To be frank, fuzzing was never necessary in the first place. I was just trying to do as much as possible with as little config.cson code as possible (which it does well… at least I got that part right 😀). The only two entries which use this feature are ASP.net and Q/Kdb+… which, BTW, should have actually been the opposite. Some stupid refactoring must've passed the inverse value to caseKludge's fuzz parameter.

😂 Sooooo... not only was the fuzzy-matching feature doing nothing all along, it was actually being used on patterns that didn't need it (which is virtually all of them). Amazing, right??? I've got no alibi for this mess other than the severe depression I was going through in late-2016 when this shit was written. 😓

Actually, wait. I'll paste you the exact diff. I could make the changes myself, but I think after all your effort, you deserve the honour of committing them:

diff --git a/.eslintrc.json b/.eslintrc.json
index a299bb0..7fc5d27 100644
--- a/.eslintrc.json
+++ b/.eslintrc.json
@@ -9,5 +9,8 @@
 	"overrides": [{
 		"files": ["lib/{main,ui}.js"],
 		"rules": {"space-before-function-paren": 0}
+	},{
+		"files": ["lib/utils.js"],
+		"rules": {"multiline-ternary": 0}
 	}]
 }
diff --git a/lib/icons/.icondb.js b/lib/icons/.icondb.js
index 65bc224..a734cc7 100644
--- a/lib/icons/.icondb.js
+++ b/lib/icons/.icondb.js
@@ -464,7 +464,7 @@ module.exports = [
 ["arc-icon",["medium-blue","medium-blue"],/\.arc$/i],
 ["arduino-icon",["dark-cyan","dark-cyan"],/\.ino$/i,,false,,/\.arduino$/i,/^Arduin[0o]$/i],
 ["asciidoctor-icon",["medium-blue","medium-blue"],/\.(?:ad|adoc|asc|asciidoc)$/i,,false,,/\.asciidoc$/i,/^Ascii[\W_ \t]?D[0o]c$/i],
-["asp-icon",["dark-blue","dark-blue"],/\.asp$/i,,false,,/\.asp$/i,/^[Aa][Ss][Pp][\W_ \t]?[Nn][Ee][Tt]$|^aspx(?:-vb)?$/],
+["asp-icon",["dark-blue","dark-blue"],/\.asp$/i,,false,,/\.asp$/i,/^[Aa][Ss][Pp].[nN][eE][tT]$|^aspx(?:-vb)?$/],
 ["asp-icon",["medium-maroon","medium-maroon"],/\.asax$/i],
 ["asp-icon",["dark-green","dark-green"],/\.ascx$/i],
 ["asp-icon",["medium-green","medium-green"],/\.ashx$/i],
@@ -1462,7 +1462,7 @@ module.exports = [
 ["python-icon",["dark-green","dark-green"],/^(?:SConstruct|SConscript)$/],
 ["python-icon",["medium-green","medium-green"],/^(?:Snakefile|WATCHLISTS)$/],
 ["python-icon",["dark-maroon","dark-maroon"],/^wscript$/],
-["kx-icon",["medium-blue","medium-blue"],/\.q$/i,,false,,/^source\.q$/,/^[Qq][\W_ \t]?[Kk][Dd][Bb][\W_ \t]?$|^Kdb\s*\+$/],
+["kx-icon",["medium-blue","medium-blue"],/\.q$/i,,false,,/^source\.q$/,/^[Qq]\/[Kk][dD][bB]+$|^Kdb\s*\+$/],
 ["kx-icon",["dark-purple","dark-purple"],/\.k$/i,,false,,/^source\.k4$/,/^Q\/?Kdb\+?$/i],
 ["qiskit-icon",["dark-blue","dark-blue"],/\.qasm$/i,,false,,/\.qasm$/i,/^Qasm$|^[0o]pen[\W_ \t]?Qasm$/i],
 ["qlik-icon",["medium-green","medium-green"],/\.qvw$/i],
diff --git a/lib/utils.js b/lib/utils.js
index edb217f..10a205f 100644
--- a/lib/utils.js
+++ b/lib/utils.js
@@ -19,28 +19,14 @@ module.exports = {
  * so this function attempts to approximate the closest thing.
  *
  * @param {String} input - Case-insensitive text
- * @param {Boolean} fuzz - Apply {@link fuzzyRegExp} to input
  * @return {String}
  */
-function caseKludge(input, fuzz = false){
-	let output = input.split("").map((s, index, array) => {
-		if(/[A-Z]/.test(s)){
-			const output = "[" + s + s.toLowerCase() + "]";
-			const prev   = array[index - 1];
-			if(fuzz && prev && /[a-z]/.test(prev))
-				return "[\\W_\\S]*" + output;
-			return output;
-		}
-		if(/[a-z]/.test(s))     return "[" + s.toUpperCase() + s + "]";
-		if(!fuzz)               return escapeRegExp(s);
-		if("0" === s)           return "[0Oo]";
-		if(/[\W_ \t]?/.test(s)) return "[\\W_ \\t]?";
-		return s;
-	}).join("");
-	
-	if(fuzz)
-		output = output.replace(/\[Oo\]/g, "[0Oo]");
-	return output.replace(/(\[\w{2,3}\])(\1+)/g, (match, first, rest) =>
+function caseKludge(input){
+	return input.split("").map(match =>
+		/[A-Z]/.test(match) ? "[" + match + match.toLowerCase() + "]" :
+		/[a-z]/.test(match) ? "[" + match + match.toUpperCase() + "]" :
+		match
+	).join("").replace(/(\[\w{2,3}\])(\1+)/g, (match, first, rest) =>
 		first + "{" + ((rest.length / first.length) + 1) + "}");
 }
 
diff --git a/test/2-utils.js b/test/2-utils.js
index 6597420..6d94744 100644
--- a/test/2-utils.js
+++ b/test/2-utils.js
@@ -48,30 +48,6 @@ describe("Utility functions", () => {
 			});
 		});
 		
-		describe("caseKludge()", () => {
-			const {caseKludge} = utils;
-			it("generates case-insensitive regex source", () => {
-				const pattern = new RegExp(`^(ABC|${caseKludge("DEF")})`);
-				expect("dEf").to.match(pattern);
-				expect("aBc").not.to.match(pattern);
-			});
-			
-			it("fuzzes word boundaries", () => {
-				const source = caseKludge("camelCase", true);
-				const pattern = new RegExp(`^abc: ${source}$`);
-				expect("abc: camelCASE").to.match(pattern);
-				expect("abc: camel-CASE").to.match(pattern);
-				expect("ABC: camel-CASE").not.to.match(pattern);
-			});
-			
-			it("allows multiple separators between fuzzed boundaries", () => {
-				const source = caseKludge("camelCase", true);
-				const pattern = new RegExp(`^abc: ${source}$`);
-				expect("abc: camel----CASE").to.match(pattern);
-				expect("abc: camel--CA").not.to.match(pattern);
-			});
-		});
-
 		describe("escapeRegExp()", () => {
 			const {escapeRegExp} = utils;
 			it("escapes backslashes",       () => void expect(escapeRegExp("\\")).to.equal("\\\\"));

Actually, wait. I'll paste you the exact diff. I could make the changes myself, but I think after all your effort, you deserve the honour of committing them:

I wouldn't have minded if you did, or you could have used commits with multiple authors 😄 Anyway, a PR for the changes can be found at #824 😃

To be frank, fuzzing was never necessary in the first place. I was just trying to do as much as possible with as little config.cson code as possible (which it does well… at least I got that part right 😀). The only two entries which use this feature are ASP.net and Q/Kdb+… which, BTW, should have actually been the opposite. Some stupid refactoring must've passed the inverse value to caseKludge's fuzz parameter.

😂 Sooooo... not only was the fuzzy-matching feature doing nothing all along, it was actually being used on patterns that didn't need it (which is virtually all of them). Amazing, right??? I've got no alibi for this mess other than the severe depression I was going through in late-2016 when this shit was written. 😓

I think we got confused here, I meant fuzzing as a testing technique not the fuzz parameter which is fuzzy in the sense of fuzzy search.

Getting warmer. I actually meant "fuzz" in the sense of fuzzy logic (think old-school FORTRAN). I sensed we were on two different pages, but decided to ramble crap anyway.

Sorry, I didn't catch sleep last night, so my brain's on auto-pilot tonight. 😁

Ah okay, thanks for the clarification! And no problem at all 😄