DedSecInside / TorBot

Dark Web OSINT Tool

Only get the original link when crawling onion sites

0xEnders opened this issue

Hi guys,

I was following the guide step by step. However, when I try crawling a particular link, I only get that link returned, even though manually navigating in Tor shows that there are multiple other links. I have tried a few different websites but still have the same issue. I am unsure whether it's because of my settings or a bug.

Please advise.

What's the link, so that I can try to reproduce it? Also, can you provide more information, such as:

  • Operating System
  • Which version of TorBot are you using?
  • How are you executing the application?
  • Tor configuration

Thanks for the quick reply!

I am trying these links:

http://alphvmmm27o3abo3r2mlmjrpdmzle3rykajqc5xsj7j7ejksbpsa36ad.onion/
http://noescapemsqxvizdxyl7f7rmg5cdjwp33pg2wpmiaaibilb4btwzttad.onion/

Operating System: Ubuntu 22
Which version of TorBot are you using?: the current dev version (I git cloned it)

How are you executing the application?
python3 torbot -u http://website.onion --depth 2

Tor configuration: default config
sudo apt install tor
sudo service tor start
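
(For what it's worth, that default setup leaves Tor's SOCKS proxy listening on 127.0.0.1:9050. A quick way to confirm the proxy is up, using check.torproject.org simply as a convenient test endpoint:

curl --socks5-hostname 127.0.0.1:9050 https://check.torproject.org/
)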

Also, is there a way to crawl based on a text file of email addresses?

You're welcome, and thanks for providing the information; I'll look into it later today or sometime this week. There is no feature to crawl email addresses. The current program operates on HTML retrieved from sites, so I don't know how that would be possible with email addresses, but if you have a suggestion for a new feature, feel free to submit a ticket and it'll be looked into. If you already know how the feature should be implemented, you can take a crack at it and submit a pull request to the repo.

Correction: a text file of websites, not email addresses. And thanks for looking into it; I'll go mess around with the settings and see what happens. Two other things:

  1. Is it recommended to amend the torrc config file? I didn't touch that at all.
  2. Can I get a link to the Slack channel? The link on the main page has expired.

Thanks once again!

  1. It's your choice. I've created CLI flags to dynamically define the SOCKS5 proxy when instantiating the HTTPS client (see the note after this list).
  2. The link should still work, but the Slack channel is not highly used. If you have suggestions, thoughts, or problems, you'll likely get the quickest response by posting here.
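
For reference, the default torrc already exposes the SOCKS proxy that TorBot needs, so there's usually nothing to amend. The relevant directive (often shipped commented out, since 9050 is the built-in default) is:

SocksPort 9050

The exact names of TorBot's proxy flags should be listed by python3 torbot --help.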

There's no way for us to crawl multiple websites at once, right?

Not currently; it'd probably be a fairly straightforward feature to implement, but no one has requested it. If you want to know what's possible, check the README. If you have ideas or suggestions, create a new ticket.

Or build it out yourself and submit it if you're capable.
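
In the meantime, a simple workaround is to drive the CLI from a shell loop. This sketch assumes a file named sites.txt with one URL per line:

while read -r url; do python3 torbot -u "$url" --depth 2; done < sites.txt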

I checked the URLs, and the reason it's only returning the host domain is that all of the links on those pages are paths within the same domain, not links to different sites. The scraper only collects fully qualified URIs with unique host domains, so relative paths get skipped.

I'll look into modifying the feature to identify paths.
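
Roughly, the change amounts to resolving relative hrefs against the URL of the page they were found on before collecting them. A minimal sketch of the idea with urllib.parse (the names here are illustrative, not TorBot's actual internals):

from urllib.parse import urljoin, urlparse

def resolve_links(page_url, hrefs):
    # Turn every href (relative path or absolute URI) into a fully
    # qualified URI by resolving it against the page it was found on.
    return {urljoin(page_url, href) for href in hrefs}

def same_host(page_url, link):
    # True when a resolved link stays on the same host,
    # i.e. it is a path within the crawled site.
    return urlparse(link).netloc == urlparse(page_url).netloc

A relative link like /posts/1 then resolves to a full URI on the same onion host and can be queued for crawling alongside links to other domains.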