openchatai / OpenCopilot

🤖 🔥 Language-to-actions engine

Home Page: https://opencopilot.so


Advice on how to ingest whole websites

tpmccallum opened this issue

Is there a way to ingest an entire website, for example based on a sitemap file?
Alternatively, can you please tell me the API endpoint for submitting a single HTML page? I can then write the web crawler myself and pass each page into the system.

(Screenshot attached.) Does this not solve your issue? This is just a BFS (breadth-first search) crawler; sitemap.xml seems like a good starting point for ingesting the whole website.

What would be a good approach, though, since you might not want to index everything in the sitemap? Suggestions?

BFS

Absolutely, a BFS crawler (for example, one built with Beautiful Soup) is a great approach; it seems to follow the links present on each page automatically.
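
For reference, here is a minimal sketch of what such a breadth-first crawl might look like, using the requests and beautifulsoup4 packages. The bfs_crawl helper, the max_pages limit and the start URL are illustrative assumptions, not OpenCopilot's actual crawler:

# Minimal breadth-first crawler sketch (illustrative only, not OpenCopilot's implementation).
# Assumes the requests and beautifulsoup4 packages are installed.
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def bfs_crawl(start_url, max_pages=50):
    """Breadth-first crawl of pages on the same host as start_url."""
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    visited = set()
    pages = []

    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)

        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        pages.append(url)

        # Queue every same-host link found on the page.
        soup = BeautifulSoup(response.text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"]).split("#")[0]
            if urlparse(link).netloc == host and link not in visited:
                queue.append(link)

    return pages

# Example usage (sitemaps.org, as in the sitemap example below):
print(bfs_crawl("https://www.sitemaps.org/", max_pages=10))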

Sitemap.xml

Another approach would be to follow the sitemap convention and index each page listed in the site's sitemap.xml file (whose location is typically advertised in the site's robots.txt file).

An example in Python

Save the following Python to a new file called web-crawler.py:

# Imports
import requests
import xml.dom.minidom as minidom

# Fetch sitemap's text (ironically using sitemaps.org's sitemap for this example)
website_sitemap = requests.get('https://www.sitemaps.org/sitemap.xml', allow_redirects=True).text

# Parse the sitemap's text to obtain the list of pages
parsed_website_sitemap_document = minidom.parseString(website_sitemap)

# Cherry pick just the loc elements from the XML
website_sitemap_loc_elements = parsed_website_sitemap_document.getElementsByTagName('loc')

# Declare a blank list of page URLs for the website being indexed
website_page_urls = []

# Iterate over loc elements (of the sitemap) and add to the site's list of pages
for website_sitemap_loc_element in website_sitemap_loc_elements:
    website_page_urls.append(website_sitemap_loc_element.firstChild.data.strip())

# This is the list of pages that will be indexed
print("Number of pages to process is {}\nFirst page to process is {} and the last page to process is {}".format(
    len(website_page_urls), website_page_urls[0], website_page_urls[-1]))

Then running the web-crawler.py file produces the following results:

$ python3 web-crawler.py 
Number of pages to process is 84
First page to process is https://www.sitemaps.org/ and the last page to process is https://www.sitemaps.org/zh_TW/terms.html

I guess the best approach would be for the user to just paste in a base URL (e.g. sitemaps.org) and then let the Python code read the robots.txt and sitemap.xml files automatically. The above is just a starting point, but a rough sketch of that idea follows below.
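
As a rough sketch of that idea, again using requests: the discover_sitemaps helper below is hypothetical (not part of OpenCopilot); it fetches robots.txt, scans it for Sitemap: lines, and falls back to the conventional /sitemap.xml location if none are listed:

# Hypothetical helper: discover a site's sitemap URLs from its robots.txt.
import requests
from urllib.parse import urljoin

def discover_sitemaps(base_url):
    """Return the sitemap URLs advertised in robots.txt, or the conventional
    /sitemap.xml location if robots.txt does not list any."""
    robots_url = urljoin(base_url, "/robots.txt")
    sitemaps = []
    try:
        robots_txt = requests.get(robots_url, allow_redirects=True, timeout=10).text
        for line in robots_txt.splitlines():
            # The sitemap convention is a "Sitemap: <url>" line in robots.txt.
            if line.lower().startswith("sitemap:"):
                sitemaps.append(line.split(":", 1)[1].strip())
    except requests.RequestException:
        pass
    return sitemaps or [urljoin(base_url, "/sitemap.xml")]

# Example usage with the base URL pasted in by the user:
print(discover_sitemaps("https://www.sitemaps.org"))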

Would love to know what you think (perhaps a choice could be presented to the user, e.g. a radio button for BFS vs. sitemap).

The BFS will work for me for now; I see it has indexed many pages overnight. Thanks for the response. Let me know if you have any other questions.