Accessing and downloading (XML, PDF) files from the handelsregister API

Question

timtensor opened this issue a year ago

timtensor commented a year ago

Hi @wirthual,
I am trying the following workflow.

Using the advanced search at https://www.handelsregister.de/rp_web/erweitertesuche.xhtml, I do the following:
a) Choose federal states: Berlin / Bavaria
b) Company / search words: "Hallo"
c) Choose the option "contain all words"
d) Type of register: "HRA"
e) Choose 100 hits per list
This gives a search list as shown in the screenshot.
I then want to download the SI or DK content of all the hits.
Is that possible?
Screenshot:

Please let me know what the best way forward would be.

Melih Sünbül · Answer 1 · Tue Feb 20 2024 19:45:45 GMT+0800 (China Standard Time)

Hello, were you able to figure out a way?

timtensor · Answer 2 · Tue Feb 20 2024 23:11:34 GMT+0800 (China Standard Time)

Hi, unfortunately not. Since it has to come from the team, it is currently kind of blocked.

Melih Sünbül · Answer 3 · Wed Feb 21 2024 04:24:58 GMT+0800 (China Standard Time)

> Hi, unfortunately not. Since it has to come from the team, it is currently kind of blocked.

Thanks

monkeygopro · Answer 4 · Wed Feb 21 2024 06:44:21 GMT+0800 (China Standard Time)

Interested in other solutions? Or do you want to use this one specifically?

Melih Sünbül · Answer 5 · Wed Feb 21 2024 06:50:03 GMT+0800 (China Standard Time)

> Interested in other solutions? Or do you want to use this one specifically?

I am open to hearing other alternatives, if there are any.

monkeygopro · Answer 6 · Wed Feb 21 2024 08:43:14 GMT+0800 (China Standard Time)

I don't know what your project is about or what you are gonna do with the data, but Selenium looks like a very good option. Contact me if you want further information.

timtensor · Answer 7 · Wed Feb 21 2024 17:44:08 GMT+0800 (China Standard Time)

As I am using a Colab notebook, I think an API would be best, right? I think the Colab notebook doesn't support Selenium.

The project basically collects data to analyze different companies.

Melih Sünbül · Answer 8 · Wed Feb 21 2024 17:59:12 GMT+0800 (China Standard Time)

Yeah, I agree. I already tried Playwright as an alternative to Selenium, but it has the limitations that come with scraping the website.

monkeygopro · Answer 9 · Wed Feb 21 2024 22:53:37 GMT+0800 (China Standard Time)

What limitations do you mean?

Melih Sünbül · Answer 10 · Wed Feb 21 2024 23:29:29 GMT+0800 (China Standard Time)

IP blocks, for example.

monkeygopro · Answer 11 · Thu Feb 22 2024 03:48:36 GMT+0800 (China Standard Time)

Even if you are using this API, your IP will be blocked if you send more than 60 requests per hour. To avoid getting blocked, you have to rotate your proxies. How many requests are you planning to send in your project?
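
A minimal sketch of staying under such a limit from a single IP by spacing requests evenly. It assumes the figure of 60 requests per hour quoted above is accurate (it comes from this comment, not from official documentation). `MinIntervalThrottle` is a hypothetical helper name, not part of handelsregister.py.

```python
import time


class MinIntervalThrottle:
    """Space calls evenly so that roughly `max_per_hour` happen per hour."""

    def __init__(self, max_per_hour=60, clock=time.monotonic, sleep=time.sleep):
        # 60 requests per hour is the limit claimed in the thread (an assumption).
        self.min_interval = 3600.0 / max_per_hour  # seconds between calls
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        """Block until the next request is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self._last + self.min_interval - now)
            if delay > 0:
                self._sleep(delay)
        self._last = now + delay
        return delay
```

Calling `throttle.wait()` before each request keeps consecutive requests at least one minute apart with the default setting. At that pace, a periodic batch of about 200 requests takes a little over three hours.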

Melih Sünbül · Answer 12 · Thu Feb 22 2024 03:54:14 GMT+0800 (China Standard Time)

I already tried rotating proxies, but in that case the website does not load in a reasonable time. I am planning to send more than 200 requests periodically.

monkeygopro · Answer 13 · Thu Feb 22 2024 05:22:04 GMT+0800 (China Standard Time)

Hmm, you can modify the code in handelsregister.py and send POST requests for the form "ergebnisseform" in .../ergebnisse.xhtml. But you also have to deal with javax.faces.ViewState: it is read-only and you can't control it.
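
A minimal sketch of the usual way to handle a server-issued `javax.faces.ViewState`: since the value cannot be chosen by the client, read it from the hidden inputs of the most recently received page and echo it back in the POST body. `HiddenInputParser`, `build_form_body`, and the `hypothetical_button` field are illustrative names; the real field names of the results form are not given in this thread. As a simplification, the parser collects hidden inputs from the whole page, not only from one form.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenInputParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        a = dict(attrs)
        if (a.get("type") or "").lower() == "hidden" and a.get("name"):
            self.fields[a["name"]] = a.get("value") or ""


def build_form_body(page_html, extra_fields):
    """Return a URL-encoded POST body that echoes the page's hidden fields."""
    parser = HiddenInputParser()
    parser.feed(page_html)
    if "javax.faces.ViewState" not in parser.fields:
        raise ValueError("no javax.faces.ViewState field found in page")
    payload = dict(parser.fields)   # includes the server-issued ViewState
    payload.update(extra_fields)    # fields for the action being triggered
    return urlencode(payload)
```

The ViewState generally has to be re-read from each new response before the next POST, and the request has to be sent within the same session that produced the page.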

timtensor · Answer 14 · Thu Feb 22 2024 17:20:24 GMT+0800 (China Standard Time)

Oh OK. I think the API just makes it a bit easier and is there for a reason, but I guess it is not maintained.

Muhammad Tayyab · Answer 15 · Sun Feb 25 2024 05:07:38 GMT+0800 (China Standard Time)

I tried and found a solution: I found their API for getting the document, but that API requires a cookie (SESSIONID), which is not accessible using JavaScript. Do you have any solution for getting the cookie?
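
A cookie that page JavaScript cannot read is consistent with it being flagged `HttpOnly`, though that is an assumption here. Such a cookie is still sent in the `Set-Cookie` response header, so an HTTP client outside the browser can read it. A minimal sketch using only the standard library; `read_session_cookie` is a hypothetical helper, and the cookie name is taken as written in the comment above.

```python
from http.cookies import SimpleCookie


def read_session_cookie(set_cookie_header, name="SESSIONID"):
    """Return (value, is_http_only) for `name`, or None if it is absent."""
    jar = SimpleCookie()
    jar.load(set_cookie_header)
    if name not in jar:
        return None
    morsel = jar[name]
    # The "httponly" attribute is an empty string unless the flag was present.
    return morsel.value, bool(morsel["httponly"])
```

In practice, a client that keeps a cookie jar across requests (for example a `requests.Session`) stores and resends such cookies automatically, so manual parsing like this is mostly useful for checking what the server actually sets.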

timtensor · Answer 16 · Sun Feb 25 2024 09:27:06 GMT+0800 (China Standard Time)

No, sorry, I actually have no clue. You managed to get the documents via the API? As in, all the documents related to a search? If so, it would be great if you could share your method.

Hai Nguyen Mau · Answer 17 · Sun Mar 10 2024 08:34:29 GMT+0800 (China Standard Time)

I have built a complete solution for this problem and will see whether and how I can share the code and approach. As mentioned above, it uses full browser rendering as opposed to this API, which I think is rather a dead end when it comes to actually downloading the documents.

I do hope that the Handelsregister will at some point publish a proper API.

timtensor · Answer 18 · Tue Apr 23 2024 21:22:45 GMT+0800 (China Standard Time)

> I tried and found a solution: I found their API for getting the document, but that API requires a cookie (SESSIONID), which is not accessible using JavaScript. Do you have any solution for getting the cookie?

Did you find a way to resolve this issue? I was looking into Playwright but so far have not managed to solve it.
How are you using the API to download the docs?

monkeygopro · Answer 19 · Thu Apr 25 2024 08:38:09 GMT+0800 (China Standard Time)

For which project do you want to use the API? If you are working on a large project (downloading millions of documents), I can offer you a paid solution.

timtensor · Answer 20 · Thu Apr 25 2024 17:10:31 GMT+0800 (China Standard Time)

> For which project do you want to use the API? If you are working on a large project (downloading millions of documents), I can offer you a paid solution.

A small project: about 15-45 documents for NLP analysis.