bundesAPI / handelsregister

Accessing and downloading (XML, PDF) files from the Handelsregister API

timtensor opened this issue · comments

Hi @wirthual,
I am trying the following workflow:

  • Using the advanced search at https://www.handelsregister.de/rp_web/erweitertesuche.xhtml, I do the following:
    a) Choose federal states: Berlin / Bavaria
    b) Company / search words: "Hallo"
    c) Choose the option "contain all words"
    d) Type of register: "HRA"
    e) Choose 100 hits per list

  • This gives a search result list, as shown in the screenshot.

  • I then want to download the SI or DK content for all the hits.
    Is that possible?
    Screenshot: [image]

Please let me know what the best way forward could be.

Hello, were you able to figure out a way?

Hi, unfortunately not. Since it has to come from the team, it is currently kind of blocked.

Thanks

Are you interested in other solutions, or do you want to use this one specifically?

I am open to hearing about alternatives if there are any.

I don't know what your project is about or what you are going to do with the data, but Selenium looks like a very good option. Contact me if you want further information.

As I am using a Colab notebook, I think an API would be best, right? I think the Colab notebook doesn't support Selenium.

The project basically collects data to analyze different companies.

Yeah, I agree. I already tried Playwright as an alternative to Selenium, but it has the limitations that come with scraping the website.

What limitations do you mean?

IP blocks, for example.

Even if you are using this API, your IP will be blocked if you send more than 60 requests per hour. To avoid getting blocked, you have to rotate your proxies. How many requests are you planning to send in your project?
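For anyone who would rather stay under the limit than rotate proxies, here is a minimal sketch of a client-side throttle. The 60 requests per hour figure is the one quoted above; the class name `SlidingWindowLimiter` and its parameters are made up for illustration and are not part of handelsregister.py.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_calls` calls within any `period` seconds."""

    def __init__(self, max_calls=60, period=3600.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self._clock = clock      # injectable so the limiter can be tested
        self._sleep = sleep
        self._calls = deque()    # timestamps of calls still inside the window

    def wait_time(self):
        """Return how many seconds to wait before the next call is allowed."""
        now = self._clock()
        # Forget calls that have already left the window.
        while self._calls and now - self._calls[0] >= self.period:
            self._calls.popleft()
        if len(self._calls) < self.max_calls:
            return 0.0
        return self.period - (now - self._calls[0])

    def acquire(self):
        """Block until a call is allowed, then record it."""
        delay = self.wait_time()
        if delay > 0:
            self._sleep(delay)
        self.wait_time()  # purge expired timestamps after sleeping
        self._calls.append(self._clock())


# Usage (hypothetical fetch function):
#   limiter = SlidingWindowLimiter(max_calls=60, period=3600)
#   for query in queries:
#       limiter.acquire()
#       fetch(query)
```

With these numbers, a batch of 200 requests would take roughly three hours from a single IP, so this only helps if the job is not time critical.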

I already tried rotating proxies, but in that case the website does not load in a reasonable time. I am planning to send more than 200 requests periodically.

Hmm, you can modify the code in handelsregister.py and send POST requests for the form "ergebnisseform" in .../ergebnisse.xhtml. But you also have to deal with javax.faces.ViewState; it is read-only and you can't control it.
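Since the ViewState value can't be chosen by the client, the usual approach with JSF pages is to read it from the page that was just loaded and send it back unchanged with the next POST. A minimal sketch of the extraction step, assuming the token sits in a hidden input named `javax.faces.ViewState` (the standard JSF field name). The sample HTML and the helper names are made up for illustration; the real page may look different.

```python
from html.parser import HTMLParser


class ViewStateParser(HTMLParser):
    """Collect the value of the first javax.faces.ViewState input."""

    def __init__(self):
        super().__init__()
        self.view_state = None

    def handle_starttag(self, tag, attrs):
        if tag != "input" or self.view_state is not None:
            return
        attr = dict(attrs)
        if attr.get("name") == "javax.faces.ViewState":
            self.view_state = attr.get("value")


def extract_view_state(html):
    """Return the ViewState token found in `html`, or None if absent."""
    parser = ViewStateParser()
    parser.feed(html)
    return parser.view_state


def build_payload(html, fields):
    """Copy `fields` and add the ViewState token taken from `html`."""
    payload = dict(fields)
    payload["javax.faces.ViewState"] = extract_view_state(html)
    return payload


# Hypothetical page fragment; the real markup is an assumption.
SAMPLE_PAGE = """
<form id="ergebnisseform" method="post">
  <input type="hidden" name="javax.faces.ViewState" value="e1s2" />
</form>
"""
```

The token has to come from the same session as the POST, so every request needs a fresh page load (or the previous response) to read it from.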

Oh OK. I think the API just makes it a bit easier and is there for a reason, but I guess it is not maintained.

I tried and found a solution: I found their API for getting the document, but that API requires a cookie (SESSIONID) which is not accessible from JavaScript. Do you have any way to get the cookie?
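If the cookie is hidden from JavaScript because it carries the HttpOnly flag (an assumption; the thread does not confirm it), that flag only affects scripts running in the browser. An HTTP client outside the browser still receives the cookie in the Set-Cookie response header. A minimal sketch with the standard library; the header value below is invented for illustration.

```python
from http.cookies import SimpleCookie


def parse_set_cookie(header_value):
    """Parse a Set-Cookie header value into {name: {value, httponly}}."""
    cookie = SimpleCookie()
    cookie.load(header_value)
    return {
        name: {"value": morsel.value, "httponly": bool(morsel["httponly"])}
        for name, morsel in cookie.items()
    }


# Hypothetical header value; the real cookie attributes are an assumption.
EXAMPLE_HEADER = "SESSIONID=abc123; Path=/; HttpOnly"
```

In practice a client that keeps a cookie jar between requests stores and resends such cookies automatically, so the value rarely needs to be parsed by hand.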

No, sorry, I actually have no clue. You managed to get the documents via the API? As in, all the documents related to a search? If so, it would be great if you could share your method.

I have built a complete solution for this problem and will see how and whether I can share the code and approach. As mentioned above, it uses full browser rendering as opposed to this API, which I think is rather a dead end when it comes to actually downloading the documents.

I do hope that the Handelsregister will at some point publish a proper API.

Did you find a way to resolve this issue? I was looking into Playwright but so far have not managed to solve it.
How are you using the API to download the docs?

For which project do you want to use the API? If you are working on a large project (downloading millions of documents), I can offer you a paid solution.

A small project with about 15-45 documents for NLP analysis.