projectdiscovery / katana

A next-generation crawling and spidering framework.

Add `-no-clobber` to avoid overwriting files that already exist

xqbumu opened this issue

Please describe your feature request:

In the current logic, setting `-srd` lets Katana save the crawled content to a directory. However, when running it a second time, the contents of that directory are cleared. I hope to see support for incremental crawling, which means:

  1. The directory should not be cleared during the second execution;
  2. When encountering a request that has already been saved, skip crawling that link.

Describe the use case of this feature:

Replacing the `-nc` (no-clobber) option in wget, for example:

wget -P ./output -nc -i urls.txt

Refer: https://github.com/projectdiscovery/katana/blob/main/pkg/output/output.go#L120
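
For illustration, a minimal sketch of what a no-clobber write could look like in the output writer referenced above. The `writeResponseNoClobber` helper and the sha256-based file naming are assumptions for this example, not Katana's actual implementation:

```go
// Sketch only: skip writing (and re-crawling) a URL whose response file
// already exists on disk from a previous run.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// writeResponseNoClobber stores body under dir, keyed by a hash of the URL.
// It returns false when the file already exists so the caller can skip the URL.
func writeResponseNoClobber(dir, url string, body []byte) (bool, error) {
	sum := sha256.Sum256([]byte(url))
	path := filepath.Join(dir, hex.EncodeToString(sum[:])+".txt")

	if _, err := os.Stat(path); err == nil {
		return false, nil // already saved in a previous run
	} else if !errors.Is(err, os.ErrNotExist) {
		return false, err
	}
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return false, err
	}
	return true, os.WriteFile(path, body, 0o644)
}

func main() {
	written, err := writeResponseNoClobber("./output", "https://example.com/", []byte("<html></html>"))
	if err != nil {
		panic(err)
	}
	fmt.Println("written:", written) // prints false on the second run
}
```

On the first run the file is written; on later runs the existing file is left untouched and the caller can skip re-crawling that URL.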

Thanks for opening this issue. I don't remember the specifics, but if -resume is specified, previously crawled content should not be removed. I'll look into this.

Thank you for your reply. I have used this switch (-resume), but it only works for resuming an interrupted crawl. When I modify my urls.txt file, Katana is not able to perform incremental crawling.

@xqbumu,
Makes sense!

@Mzack9999,
Thoughts? - "Incremental Crawling" sounds good to me 💭

This is certainly an interesting feature, but I'm not sure it can be fully applied to the crawling process. While it's easy to mimic by avoiding overwriting existing files, skipping parts of the crawl needs more thought: the decision can't be based simply on the existence of a file, since, for example, the crawl would end right at the start because the root branch already exists. A better strategy could be adopted, for example:

  • Crawl normally up to a minimum depth (2?)
  • Beyond that depth, if a crawled page is the same as the existing one (or the share of unchanged children of the parent node is above a certain threshold), break out of that branch (see the sketch below)
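
A rough sketch of that break-out rule, purely for illustration; the `shouldStopBranch` helper, the thresholds, and the hash comparison are hypothetical, not part of Katana:

```go
// Sketch of the proposed break-out rule; names and thresholds are made up.
package crawl

const (
	minDepth            = 2    // always crawl at least this deep
	similarityThreshold = 0.95 // fraction of unchanged children that prunes a branch
)

// shouldStopBranch decides whether to stop descending into a branch.
// pageHash/storedHash are content hashes of the freshly fetched page and the
// copy saved by a previous run; identical/total count unchanged child pages.
func shouldStopBranch(depth int, pageHash, storedHash string, identical, total int) bool {
	if depth <= minDepth {
		return false // below the minimum depth, always keep crawling
	}
	if storedHash != "" && pageHash == storedHash {
		return true // page unchanged since the previous run
	}
	if total > 0 && float64(identical)/float64(total) >= similarityThreshold {
		return true // nearly all children unchanged: prune this subtree
	}
	return false
}
```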

What do you think?

@Mzack9999

Thank you for your response. My initial expectation was to be able to continue crawling the remaining links after an interruption. The re-crawling strategy you describe here goes further and enhances the ability to resume crawling.

As for the re-crawling strategy, I feel that in addition to limiting by link depth, it could also take into account the modification time of the crawled files, since it is easier to detect data updates based on time.
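
For example, a minimal sketch of such a time-based freshness check; `isStale` and the `maxAge` threshold are hypothetical, for illustration only:

```go
// Sketch: treat a stored response as stale, and worth re-crawling, when it
// is missing or older than maxAge. The path layout is an assumption.
package crawl

import (
	"os"
	"time"
)

// isStale reports whether the file at path is missing or older than maxAge.
func isStale(path string, maxAge time.Duration) bool {
	info, err := os.Stat(path)
	if err != nil {
		return true // missing or unreadable: crawl it again
	}
	return time.Since(info.ModTime()) > maxAge
}
```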

The above is just my personal opinion, and I welcome your guidance.

@Mzack9999,

My initial expectation was to be able to continue crawling the remaining links after an interruption.

Let's begin with this idea and then gradually develop it further. What do you say?