UMCarpentries / intro-curriculum-r

Custom curriculum for teaching R, the Unix Shell, & Git with an integrated workflow and reproducible research practices.

Home Page: https://umcarpentries.org/intro-curriculum-r/


Auto-detect broken links

MrFlick opened this issue · comments

I wrote up a rough R script to locate broken links in the course content. In total it found 952 links, and 43 of those were broken. That's a p-value of 0.0451, so it's kind of significant 😄. Here's the code:

library(rvest)
library(purrr)
library(dplyr)

# Drop the file name from a URL, keeping the directory it lives in
base_url <- Vectorize(function(url) {
  r <- httr::parse_url(url)
  r$path <- dirname(httr::parse_url(url)$path)
  httr::build_url(r)
}, USE.NAMES = FALSE)

# GET each URL and record its HTTP status code, sleeping between requests;
# requests that error out entirely are recorded as status 999
polite_GET_status <- function(urls, delay=1) {
  statuses <- rep(0, length(urls))
  safe_status <- possibly(~httr::status_code(httr::GET(.x)), otherwise=999)
  for(i in seq_along(urls)) {
    statuses[i] <- safe_status(urls[i])
    Sys.sleep(delay)
  }
  statuses
}

# Check every link and anchor on a parsed page, skipping URLs already in `seen`
check_page_links <- function(page, base="", seen=data.frame(url=character(0)), delay=1) {
  stopifnot(inherits(page, "xml_node"))
  stopifnot("url" %in% names(seen))
  page_ids <- page %>% html_nodes("*[id]") %>% html_attr("id") %>% paste0("#", .)
  page_hrefs <- page %>% html_nodes("a") %>% html_attr("href") %>% unique()
  grep_filter <- function(x, pattern, negate=FALSE, ...) x[grepl(pattern, x, ...) != negate]
  links <- page_hrefs %>% 
    grep_filter("^(#|mailto)", negate=TRUE) %>% 
    tibble(href=.) %>% 
    mutate(url = url_absolute(href, base=base))
  new_links <- links %>% 
    anti_join(seen %>% select(-any_of("href")) %>% filter(!is.na(url)), by="url") %>% 
    mutate(status=polite_GET_status(url, delay=delay),
     result = case_when(status==200~"OK", status==404~"Not Found", TRUE~"Error"))
  seen_links <- links %>% 
    inner_join(seen %>% select(-any_of("href")) %>% filter(!is.na(url)), by="url")
  dups <- function(x) unique(x[duplicated(x)])
  dup_paged_ids <- dups(page_ids)
  anchors <- page_hrefs %>% 
    grep_filter("^#") %>% 
    tibble(href=.) %>% 
    mutate(result = case_when(
      href %in% dup_paged_ids ~ "Duplicate ID",
      href %in% page_ids ~ "OK",
      TRUE ~ "Not Found"))
  bind_rows(new_links, seen_links, anchors)
}

# Crawl every page under root_url, starting from seed_url, and check the links on each one
crawl_pages <- function(seed_url, root_url=base_url(seed_url), delay=.5) {
  page <- read_html(seed_url)
  results <- check_page_links(page, base=seed_url, delay=delay) %>% 
    mutate(page=seed_url)

  # keep only URLs that live under the parent, i.e. pages on this site
  filter_children <- function(x, parent) { unique(x[which(startsWith(x, parent))]) }
  scanned <- seed_url
  toscan <- setdiff(filter_children(results$url, root_url), scanned)
  seen <- results %>% select(url, status, result) %>% unique()
  while(length(toscan) > 0) {
    url <- toscan[1]
    page <- possibly(read_html, otherwise=NULL)(url)
    if (!is.null(page)) {
      pageresults <- check_page_links(page, base=url, seen=seen, delay=delay) %>% 
        mutate(page=.env$url)
      
      candidates <- filter_children(pageresults %>% filter(status==200) %>% pull(url), root_url)
      toscan <- append(toscan, setdiff(setdiff(candidates, scanned), toscan))
      
      results <- results %>% bind_rows(pageresults)
      seen <- results %>% select(url, status, result) %>% unique()
    }
    toscan <- toscan[-1]
    scanned <- c(scanned, url)
    Sys.sleep(delay)
  }
  results
}

url <- "https://umcarpentries.org/intro-curriculum-r/index.html"
all_links <- crawl_pages(url)
all_links %>% filter(result != "OK")

This will return a table showing the page that was searched, the href used in the <a> tag on that page, the "full" URL it evaluates to (turning relative URLs into absolute URLs), the HTTP status code returned from a GET request to that URL, and the result, which is "OK" for good links and non-OK for any potential problems.
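For example (just a sketch; the output file name is arbitrary), the results can be tallied and the problem links saved for review:

# tally links by result, then save the non-OK rows for follow-up
all_links %>% count(result)
broken <- all_links %>% filter(result != "OK")
write.csv(broken, "broken-links.csv", row.names = FALSE)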

Would it be possible to incorporate something like this into the build workflow so it can check for broken links automatically?

It does take some time to run the code, but that's mostly because I've added a delay between HTTP requests in order to be a "polite" web scraper and not bombard any one server with too many requests in a short period of time. The current delay between requests is 0.5 seconds. The code is written to not query the same URL twice, but there are still roughly 160 unique URLs to check.
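If the crawl ever needs to be gentler, the delay is just an argument to crawl_pages, e.g.:

all_links <- crawl_pages(url, delay = 1)  # wait a full second between requests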

It checks both URL links and anchor-style links. So if an href starts with "#", it will search the IDs on the page to make sure the target is there. Note that since anchors rely on element IDs, and those IDs need to be unique on a page, it will also report if one of those IDs has been duplicated, which would interfere with the links.
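As a small illustration of the anchor check (the HTML here is made up for the example):

# two elements share the id "setup", so the "#setup" anchor gets flagged
demo <- read_html('<div id="setup"></div><p id="setup"></p><a href="#setup">setup</a>')
demo_ids <- demo %>% html_nodes("*[id]") %>% html_attr("id") %>% paste0("#", .)
demo_ids[duplicated(demo_ids)]
#> [1] "#setup"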

Awesome, we should definitely incorporate this into the build workflow.

Since this checks the rendered site at https://umcarpentries.org/intro-curriculum-r/, it won't catch a link someone breaks in a PR until after the PR is merged. Links outside our control (e.g. links to the cheatsheets) could also break at any time. Rather than putting this in the build-website workflow, maybe it should be its own GitHub Actions workflow that runs on a cron schedule?
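For example, a scheduled workflow could run a small R script with Rscript and fail the job whenever broken links turn up. A minimal sketch (the script name and exact failure behaviour are just assumptions):

# check-links.R (hypothetical) -- assumes the crawler functions above have been source()d
all_links <- crawl_pages("https://umcarpentries.org/intro-curriculum-r/index.html")
broken <- dplyr::filter(all_links, result != "OK")
if (nrow(broken) > 0) {
  print(as.data.frame(broken))
  quit(status = 1)  # a non-zero exit status makes the Actions job fail
}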

Those are good points. I don't actually have much experience with GitHub Actions myself, so I wasn't sure what was possible. But if the code runs once a month or so to check links, that would be helpful.