UMCarpentries / intro-curriculum-r

Custom curriculum for teaching R, the Unix Shell, & Git with an integrated workflow and reproducible research practices.

Home Page: https://umcarpentries.org/intro-curriculum-r/


Auto-detect broken links

MrFlick opened this issue · comments

I wrote up a rough R script to locate broken links in the course content. In total it found 952 links, and 43 of those were broken. That's a p-value of 0.0451, so it's kind of significant 😄. Here's the code:

library(rvest)
library(purrr)
library(dplyr)

# Drop the file name from a URL, keeping the directory it lives in
base_url <- Vectorize(function(url) {
  r <- httr::parse_url(url)
  r$path <- dirname(httr::parse_url(url)$path)
  httr::build_url(r)
}, USE.NAMES = FALSE)

# GET each URL and record its HTTP status code, sleeping between requests;
# requests that error out entirely are recorded as status 999
polite_GET_status <- function(urls, delay=1) {
  statuses <- rep(0, length(urls))
  safe_status <- possibly(~httr::status_code(httr::GET(.x)), otherwise=999)
  for(i in seq_along(urls)) {
    statuses[i] <- safe_status(urls[i])
    Sys.sleep(delay)
  }
  statuses
}

# Check every link and anchor on a parsed page, skipping URLs already in `seen`
check_page_links <- function(page, base="", seen=data.frame(url=character(0)), delay=1) {
  stopifnot(inherits(page, "xml_node"))
  stopifnot("url" %in% names(seen))
  page_ids <- page %>% html_nodes("*[id]") %>% html_attr("id") %>% paste0("#", .)
  page_hrefs <- page %>% html_nodes("a") %>% html_attr("href") %>% unique()
  grep_filter <- function(x, pattern, negate=FALSE, ...) x[grepl(pattern, x, ...) != negate]
  links <- page_hrefs %>% 
    grep_filter("^(#|mailto)", negate=TRUE) %>% 
    tibble(href=.) %>% 
    mutate(url = url_absolute(href, base=base))
  new_links <- links %>% 
    anti_join(seen %>% select(-any_of("href")) %>% filter(!is.na(url)), by="url") %>% 
    mutate(status=polite_GET_status(url, delay=delay),
     result = case_when(status==200~"OK", status==404~"Not Found", TRUE~"Error"))
  seen_links <- links %>% 
    inner_join(seen %>% select(-any_of("href")) %>% filter(!is.na(url)), by="url")
  dups <- function(x) unique(x[duplicated(x)])
  dup_paged_ids <- dups(page_ids)
  anchors <- page_hrefs %>% 
    grep_filter("^#") %>% 
    tibble(href=.) %>% 
    mutate(result = case_when(
      href %in% dup_paged_ids ~ "Duplicate ID",
      href %in% page_ids ~ "OK",
      TRUE ~ "Not Found"))
  bind_rows(new_links, seen_links, anchors)
}

# Crawl every page under root_url, starting from seed_url, and check the links on each one
crawl_pages <- function(seed_url, root_url=base_url(seed_url), delay=.5) {
  page <- read_html(seed_url)
  results <- check_page_links(page, base=seed_url, delay=delay) %>% 
    mutate(page=seed_url)

  # keep only URLs that live under the parent, i.e. pages on this site
  filter_children <- function(x, parent) { unique(x[which(startsWith(x, parent))]) }
  scanned <- seed_url
  toscan <- setdiff(filter_children(results$url, root_url), scanned)
  seen <- results %>% select(url, status, result) %>% unique()
  while(length(toscan) > 0) {
    url <- toscan[1]
    page <- possibly(read_html, otherwise=NULL)(url)
    if (!is.null(page)) {
      pageresults <- check_page_links(page, base=url, seen=seen, delay=delay) %>% 
        mutate(page=.env$url)
      
      candidates <- filter_children(pageresults %>% filter(status==200) %>% pull(url), root_url)
      toscan <- append(toscan, setdiff(setdiff(candidates, scanned), toscan))
      
      results <- results %>% bind_rows(pageresults)
      seen <- results %>% select(url, status, result) %>% unique()
    }
    toscan <- toscan[-1]
    scanned <- c(scanned, url)
    Sys.sleep(delay)
  }
  results
}

url <- "https://umcarpentries.org/intro-curriculum-r/index.html"
all_links <- crawl_pages(url)
all_links %>% filter(result != "OK")

This will return a table showing the page that was searched, the href used in the <a> tag on that page, the "full" URL it evaluates to (turning relative URLs into absolute URLs), the HTTP status code returned from a GET request to that URL, and the result, which is "OK" for good links and non-OK for any potential problems.
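For example (just a sketch; the output file name is arbitrary), the results can be tallied and the problem links saved for review:

# tally links by result, then save the non-OK rows for follow-up
all_links %>% count(result)
broken <- all_links %>% filter(result != "OK")
write.csv(broken, "broken-links.csv", row.names = FALSE)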

Would it be possible to incorporate something like this into the build workflow so it can check for broken links automatically?

It does take some time to run the code, but that's mostly because I've added a delay between HTTP requests in order to be a "polite" web scraper and not bombard any one server with too many requests in a short period of time. The current delay between requests is 0.5 seconds. The code is written to not query the same URL twice, but there are still roughly 160 unique URLs to check.
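If the crawl ever needs to be gentler, the delay is just an argument to crawl_pages, e.g.:

all_links <- crawl_pages(url, delay = 1)  # wait a full second between requests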

It checks both URL links and anchor-style links. So if an href starts with "#", it will search the IDs on the page to make sure the target is there. Note that since anchors rely on element IDs, and those IDs need to be unique on a page, it will also report if one of those IDs has been duplicated, which would interfere with the links.
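As a small illustration of the anchor check (the HTML here is made up for the example):

# two elements share the id "setup", so the "#setup" anchor gets flagged
demo <- read_html('<div id="setup"></div><p id="setup"></p><a href="#setup">setup</a>')
demo_ids <- demo %>% html_nodes("*[id]") %>% html_attr("id") %>% paste0("#", .)
demo_ids[duplicated(demo_ids)]
#> [1] "#setup"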

Awesome, we should definitely incorporate this into the build workflow.

Since this checks the rendered site at https://umcarpentries.org/intro-curriculum-r/, it won't catch a link someone breaks in a PR until after the PR is merged. Links outside our control (e.g. links to the cheatsheets) could also break at any time. Rather than putting this in the build-website workflow, maybe it should be its own GitHub Actions workflow that runs on a cron schedule?
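For example, a scheduled workflow could run a small R script with Rscript and fail the job whenever broken links turn up. A minimal sketch (the script name and exact failure behaviour are just assumptions):

# check-links.R (hypothetical) -- assumes the crawler functions above have been source()d
all_links <- crawl_pages("https://umcarpentries.org/intro-curriculum-r/index.html")
broken <- dplyr::filter(all_links, result != "OK")
if (nrow(broken) > 0) {
  print(as.data.frame(broken))
  quit(status = 1)  # a non-zero exit status makes the Actions job fail
}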

Those are good points. I don't actually have much experience with GitHub Actions myself, so I wasn't sure what was possible. But if the code runs once a month or so to check links, that would be helpful.