ameenmaali / urldedupe

Pass in a list of URLs with query strings, get back a unique list of URLs and query string combinations

Why are we decoding the URLs before parsing?

larskraemer opened this issue · comments

Before doing any work in the parser, we are decoding the URL, i.e. replacing "%ab" with '\xab'.
I think it would be better to do this after parsing the URL, since the following URL, for example, would produce incorrect results:

https://example.com/test%3Ftest (Note: 0x3F is ASCII for '?')

If this URL is decoded first, then parsed, it will be parsed as "https://example.com/test" with a query string of "test". This is not the behavior any browser is going to give you, and I believe it should not be the behavior of this program.
Instead, I believe we should decode the parts of the URL separately after parsing, probably even while assembling the URL key.
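The ordering issue above can be illustrated with a small Python sketch (the tool itself is C++; this just uses `urllib.parse` to show the two orders of operations):

```python
from urllib.parse import urlsplit, unquote

url = "https://example.com/test%3Ftest"

# Decode-then-parse: %3F becomes a literal '?', so the parser splits
# the path there and invents a query string that was never in the URL.
decoded_first = urlsplit(unquote(url))
print(decoded_first.path, decoded_first.query)  # /test test

# Parse-then-decode: the encoded byte stays inside the path component,
# matching how a browser would interpret this URL.
parsed_first = urlsplit(url)
print(parsed_first.path, parsed_first.query)  # /test%3Ftest (empty query)
```

With decode-first, the path is `/test` and the query is `test`; with parse-first, the whole `/test%3Ftest` remains a path with no query string.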

Hey @larskraemer, thanks for the note. This is a great point. I built this with pulling URLs from multiple different tools in mind, some of which I know may produce encoded or decoded results. However, you make a good point that there’s a risk of getting incorrect results here with something like the following example:

https://site.com/redirect?url=https://google.com%3Furl=anothersite.com%26code=302&code=200

After thinking about it, I’m not even sure there’s much value in decoding at all. I’m going to check some of the popular tools out there that generate these URL lists (such as waybackurls & gau) to understand their output and determine whether encoding and decoding is even relevant anymore.
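The redirect example above shows the same failure mode on the query string: a hypothetical sketch (again using Python's `urllib.parse` for illustration) of how decoding first changes the number of query parameters the parser sees:

```python
from urllib.parse import urlsplit, unquote

url = ("https://site.com/redirect"
       "?url=https://google.com%3Furl=anothersite.com%26code=302&code=200")

# Parse first: the encoded '&' and '?' stay inside the 'url' value,
# so the query string holds exactly two parameters.
raw_params = urlsplit(url).query.split("&")
print(len(raw_params))  # 2

# Decode first: %26 becomes a real '&', splitting the 'url' value and
# producing an extra parameter the original URL never had.
decoded_params = urlsplit(unquote(url)).query.split("&")
print(len(decoded_params))  # 3
```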

Yeah, I thought it might have something to do with interoperability. I think in general we should be handling URLs as they would be typed into the browser, and maybe add a switch to decode first, or even a separate decoder tool. I just reworked the parsing a bit in anticipation of #6 and got a pretty significant performance boost from just not decoding at all. (Also about 5x faster than the regex approach, yay.)
Decoding might be sensible for these kinds of URLs:

example.com/test?%61=b
example.com/test?a=b

These query strings are absolutely identical to a server, so we should treat them as identical too. Same for unnecessarily encoded parts of the path.
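The equivalence claimed here can be checked directly: decoding each component after parsing would canonicalize the two forms to the same key (a Python illustration, not the tool's actual code path):

```python
from urllib.parse import unquote

# '%61' is the percent-encoding of 'a', so a server sees the same
# parameter name in both query strings.
q1 = "%61=b"
q2 = "a=b"

# Decoding the parsed query component canonicalizes both forms,
# so a key-based deduplicator would treat them as identical.
assert unquote(q1) == unquote(q2) == "a=b"
```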

I just noticed, this decoding business is going to mess up the deduplication too, even if we decode only the parts separately:
example.com/test?a=b
example.com/test?a=b%26c%3Dd

The second query string decodes to ?a=b&c=d, which looks like two parameters (a and c) instead of one, so the URL is treated as distinct and appears in the deduped results.
At this point, we might have to completely change the way we are eliminating duplicates.
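The effect on dedup can be sketched like this (Python illustration; the actual deduplicator is C++):

```python
from urllib.parse import urlsplit, unquote

u1 = "http://example.com/test?a=b"
u2 = "http://example.com/test?a=b%26c%3Dd"

# Without decoding, both queries have the single key 'a',
# so a key-based deduplicator would collapse them into one URL.
raw_keys = lambda u: {p.split("=")[0] for p in urlsplit(u).query.split("&")}
assert raw_keys(u1) == raw_keys(u2) == {"a"}

# Decoding u2's query turns its single encoded value into what looks
# like two parameters, so the two URLs no longer dedupe together.
assert unquote(urlsplit(u2).query) == "a=b&c=d"
```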

Decoding & encoding logic was removed, closing out.