alison985 / ahrq_ngc

Archive of AHRQ National Guideline Clearinghouse guidelines.

What can you tell us about the format of the data?

JesseTG opened this issue

This weekend I'm looking to scrape these summaries and convert them to Markdown so that they can easily be hosted and read elsewhere (possibly through a GitHub Pages site). Specifically, the following info would help me out:

  • How exactly did you capture and preprocess these NGC summaries before they went offline?
  • Are these pages all valid HTML or XML?
  • Are these all of the summaries?
  • Does each summary contain all of the original text?
  • Are there other pages in this dump besides the summaries?

Anything you can tell me would be helpful.

They should be XML, though some of them will have HTML before the XML starts. From what I can tell, the XML starts with <version>.
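For anyone else processing these files, here is a minimal Python sketch of dropping the leading HTML, assuming the XML really does begin at the first `<version` marker as described above. The function name `strip_html_preamble` and the sample string are made up for illustration, and the marker has not been checked against every file in the archive.

```python
from typing import Optional

# Marker reported in this thread as the start of the XML.
# Assumption: it holds for every file; verify on your own copy.
XML_START = "<version"


def strip_html_preamble(text: str) -> Optional[str]:
    """Return the text from the first XML_START marker onward.

    Returns None when the marker is absent, so the caller can
    set that file aside for manual inspection.
    """
    start = text.find(XML_START)
    if start == -1:
        return None
    return text[start:]


if __name__ == "__main__":
    # Hypothetical file content: stray HTML followed by the XML payload.
    sample = "<html><body>banner</body></html>\n<version>1</version>"
    print(strip_html_preamble(sample))
```

This only trims the preamble; it does not check that what remains is well-formed XML, so a parser should still be run on the result.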

These are what I and friends were able to get. I think it is safe to assume it is NOT all of them.

I don't know if it's all of the text - I just hope so.

There shouldn't be anything in the /ngc directory except guideline XML files.
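A quick way to check that claim on a local clone is to list anything in the directory that does not look like an XML file. This is only a sketch: the helper name `non_xml_files` is hypothetical, and it assumes the guideline files carry an `.xml` extension, which this thread does not confirm.

```python
from pathlib import Path
from typing import List


def non_xml_files(directory: str) -> List[Path]:
    """List files under `directory` whose extension is not .xml.

    Assumption: guideline files end in .xml. If they use another
    extension (or none), adjust the comparison below.
    """
    root = Path(directory)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() != ".xml"
    )
```

Calling `non_xml_files("ngc")` from the repository root would return an empty list if the directory holds only `.xml` files.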

I found a better-formatted dump in the meantime (and will be publishing a nicely-formatted website with its contents this weekend), but thank you all the same.