ameenmaali / urldedupe

Pass in a list of URLs with query strings, get back a unique list of URLs and query string combinations

Why are we decoding the URLs before parsing?

larskraemer opened this issue · comments

Before doing any work in the parser, we are decoding the URL, i.e. replacing "%ab" with '\xab'.
I think it would be better to do this after parsing the URL, since the following URL, for example, would produce incorrect results:

https://example.com/test%3Ftest (Note: 0x3F is ASCII for '?')

If this URL is decoded first, then parsed, it will be parsed as "https://example.com/test" with a query string of "test". This is not the behavior any browser is going to give you, and I believe it should not be the behavior of this program.
Instead, I believe we should decode the parts of the URL separately after parsing, probably even while assembling the URL key.
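The ordering issue above can be illustrated with a small Python sketch (the tool itself is C++; this just uses `urllib.parse` to show the two orders of operations):

```python
from urllib.parse import urlsplit, unquote

url = "https://example.com/test%3Ftest"

# Decode-then-parse: %3F becomes a literal '?', so the parser splits
# the path there and invents a query string that was never in the URL.
decoded_first = urlsplit(unquote(url))
print(decoded_first.path, decoded_first.query)  # /test test

# Parse-then-decode: the encoded byte stays inside the path component,
# matching how a browser would interpret this URL.
parsed_first = urlsplit(url)
print(parsed_first.path, parsed_first.query)  # /test%3Ftest (empty query)
```

With decode-first, the path is `/test` and the query is `test`; with parse-first, the whole `/test%3Ftest` remains a path with no query string.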

Hey @larskraemer, thanks for the note. This is a great point. I built this with pulling URLs from multiple different tools in mind, some of which I know may produce encoded or decoded results. However, you make a good point that there’s a risk of getting incorrect results here with something like the following example:

https://site.com/redirect?url=https://google.com%3Furl=anothersite.com%26code=302&code=200

After thinking about it, I’m not even sure there’s much value in decoding at all. I’m going to check some of the popular tools out there that generate these URL lists (such as waybackurls & gau) to understand their output and determine whether encoding and decoding is even relevant anymore.
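The redirect example above shows the same failure mode on the query string: a hypothetical sketch (again using Python's `urllib.parse` for illustration) of how decoding first changes the number of query parameters the parser sees:

```python
from urllib.parse import urlsplit, unquote

url = ("https://site.com/redirect"
       "?url=https://google.com%3Furl=anothersite.com%26code=302&code=200")

# Parse first: the encoded '&' and '?' stay inside the 'url' value,
# so the query string holds exactly two parameters.
raw_params = urlsplit(url).query.split("&")
print(len(raw_params))  # 2

# Decode first: %26 becomes a real '&', splitting the 'url' value and
# producing an extra parameter the original URL never had.
decoded_params = urlsplit(unquote(url)).query.split("&")
print(len(decoded_params))  # 3
```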

Yeah, I thought it might have something to do with interoperability. I think in general we should be handling URLs as they would be typed into the browser, and maybe add a switch to decode first, or even a separate decoder tool. I just reworked the parsing a bit in anticipation of #6 and got a pretty significant performance boost from just not decoding at all. (Also about 5x faster than the regex approach, yay.)
Decoding might be sensible for these kinds of URLs:

example.com/test?%61=b
example.com/test?a=b

These query strings are absolutely identical to a server, so we should treat them as identical too. Same for unnecessarily encoded parts of the path.
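The equivalence claimed here can be checked directly: decoding each component after parsing would canonicalize the two forms to the same key (a Python illustration, not the tool's actual code path):

```python
from urllib.parse import unquote

# '%61' is the percent-encoding of 'a', so a server sees the same
# parameter name in both query strings.
q1 = "%61=b"
q2 = "a=b"

# Decoding the parsed query component canonicalizes both forms,
# so a key-based deduplicator would treat them as identical.
assert unquote(q1) == unquote(q2) == "a=b"
```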

I just noticed, this decoding business is going to mess up the deduplication too, even if we decode only the parts separately:
example.com/test?a=b
example.com/test?a=b%26c%3Dd

The second query string decodes to ?a=b&c=d, which looks like two parameters (a and c) instead of one, so the URL is treated as distinct and appears in the deduped results.
At this point, we might have to completely change the way we are eliminating duplicates.
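The effect on dedup can be sketched like this (Python illustration; the actual deduplicator is C++):

```python
from urllib.parse import urlsplit, unquote

u1 = "http://example.com/test?a=b"
u2 = "http://example.com/test?a=b%26c%3Dd"

# Without decoding, both queries have the single key 'a',
# so a key-based deduplicator would collapse them into one URL.
raw_keys = lambda u: {p.split("=")[0] for p in urlsplit(u).query.split("&")}
assert raw_keys(u1) == raw_keys(u2) == {"a"}

# Decoding u2's query turns its single encoded value into what looks
# like two parameters, so the two URLs no longer dedupe together.
assert unquote(urlsplit(u2).query) == "a=b&c=d"
```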

Decoding & encoding logic was removed, closing out.