epfLLM / meditron

Meditron is a suite of open-source medical Large Language Models (LLMs).

Home Page: https://huggingface.co/epfl-llm

Errors with three of the scrapers

jpcorb20 opened this issue

commented

Hello,

I was trying to scrape Magic, Drugs and GuidelineCentral without success, while some of the other scrapers ran fine. Any idea how to make them work? The Drugs scraper seemed to run, but the resulting JSONL contained 0 articles. GuidelineCentral ran into click issues. Finally, Magic printed errors for every article but one.

Thanks in advance,

commented

It looks like the Drugs.com scraper works after changing "class='ContentBox'" to "class='ddc-main-content'". The content variable also needs to be read with "content.get_attribute('innerHTML')" so that markdownify receives an HTML string.
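A rough sketch of that Drugs.com change, in case it helps anyone hitting the same problem. The function name `extract_markdown` and its parameters are hypothetical and not taken from the repository; `driver` is assumed to be a Selenium WebDriver already on an article page, and the converter is passed in so the snippet does not depend on how the pipeline imports markdownify.

```python
from typing import Callable

# Locator strategy string; this is the value of Selenium's By.CLASS_NAME.
CLASS_NAME = "class name"


def extract_markdown(
    driver,
    to_markdown: Callable[[str], str],
    content_class: str = "ddc-main-content",  # was "ContentBox" before the site changed
) -> str:
    """Return the main content of the current page as Markdown.

    Hypothetical helper: `driver` is assumed to be a Selenium WebDriver
    (or any object with a compatible find_element method).
    """
    # Look up the container element by its class name.
    content = driver.find_element(CLASS_NAME, content_class)
    # A WebElement is not a string, so read its inner HTML before converting.
    html = content.get_attribute("innerHTML")
    return to_markdown(html)
```

With the real libraries this would presumably be called as `extract_markdown(driver, markdownify)` after `from markdownify import markdownify`. The class name is kept as a parameter because it is the part most likely to change again.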

commented

For GuidelineCentral, I first had some general issues with the Chrome driver on WSL and changed some options, but it looks like the "--headless" option must not be enabled for this scraper to work.
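One way to make the headless setting switchable per scraper is sketched below. The helper names `build_chrome_arguments` and `apply_arguments` are hypothetical and not from the repository, and `options` is assumed to be a Selenium Chrome `Options` object (anything with an `add_argument` method works). The WSL-specific options mentioned above are not reproduced here; they would go in through `extra`.

```python
from typing import Iterable, List


def build_chrome_arguments(headless: bool, extra: Iterable[str] = ()) -> List[str]:
    """Return the Chrome command-line arguments for one scraper run."""
    arguments = list(extra)
    if headless:
        # Leave this out for GuidelineCentral, which reportedly needs a visible browser.
        arguments.append("--headless")
    return arguments


def apply_arguments(options, arguments: Iterable[str]):
    """Add each argument to a Selenium-style options object and return it."""
    for argument in arguments:
        options.add_argument(argument)
    return options
```

With Selenium this would presumably be used as `apply_arguments(Options(), build_chrome_arguments(headless=False))`, with `Options` imported from `selenium.webdriver.chrome.options`, and the result passed to `webdriver.Chrome(options=...)`.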

Hi @jpcorb20,

Thanks a lot for your interest in the guidelines scraping pipeline.

As described in the user notice, these scrapers are very fickle and aren't built to withstand changes to the websites they target. They worked in November 2023, but there's no guarantee that they will keep working in the future.

If you'd be interested in updating the scrapers in the pipeline, please open a pull request. We'd be grateful for your help.

commented

Hello @AGBonnet, thanks for your reply! I will definitely open a PR with updates to two of the scrapers (Drugs and GuidelineCentral).