dmi3kno / polite

Be nice on the web

Home Page: https://dmi3kno.github.io/polite/

Pathways not scraping / updating with nod()

cnice018 opened this issue · comments

I am trying to grab search results from coded form submissions for work. I am testing the process on widely used sites before applying it to our local library search page.

My question is how to handle the search results. For example, let's say I need to reference the Moon for some reason:
Generate query results

library(rvest)
library(polite)  
library(tidyverse)

#establish the search engine where the form is located
gBow <- bow("https://www.google.com/")

#fill out form
gSearchForm <- scrape(gBow) %>%
  html_node("form") %>%
  html_form() %>%
  set_values(q = "Moon")

#get results of query
results <- submit_form(gBow, gSearchForm, submit = "btnG")

Format and display query results

#display in browser works fine
resultsPath <- results %>% .[["url"]]
resultsPath %>% browseURL()

#scrape results throws an error "No scraping allowed here"
scrape(results)

#Nod does not allow "access" to results page, which was my first thought
gSearchNod <- nod(gBow, resultsPath)
#Resulting session URL is still www.google.com, not the updated URL.
scrape(gSearchNod)

#And yet I can still navigate the results page with the rvest commands just fine
results %>% 
  follow_link("Moon - Wikipedia") %>% 
  html_node(".infobox") %>% 
  html_table(fill = TRUE) %>% 
  select("Stat"=X1, "Dist"=X2) %>% 
  filter(Stat %in% c("Perigee", "Apogee"))

So, now let's try to eliminate a step and query Wikipedia directly. Note the difference in behaviour at the display-URL step, and the otherwise overall similarity to the problem above.

#establish the search engine where the form is located
wikiBow <- bow("https://en.wikipedia.org/wiki/Main_Page")

#fill out form
wikiSearchForm <- scrape(wikiBow) %>%
  html_node("form") %>%
  html_form() %>%
  set_values(search = "Moon")

#get results of query
results <- submit_form(wikiBow, wikiSearchForm, submit = "fulltext")

Displaying the result in a browser takes you to the internal search page; it does not forward you to the page named "wiki/Moon".

That isn't actually a problem, as long as I can parse this page.

resultsPath <- results %>% .[["url"]]
resultsPath %>% browseURL()

#scrape results throws an error "No scraping allowed here", same error
scrape(results)

#Nod does not allow "access" to results page, which was my first thought
wSearchNod <- nod(wikiBow, resultsPath)
scrape(wSearchNod)

#And yet I can still navigate the results page with the rvest commands just fine
results %>% 
  follow_link("Moon") %>% 
  html_node(".infobox") %>% 
  html_table(fill = TRUE) %>% 
  select("Stat"=X1, "Dist"=X2) %>% 
  filter(Stat %in% c("Perigee", "Apogee"))

I can reproduce the error reliably across search engines, including the internal one I am designing this for. I assume the error is related to permissions on the results pages, but I can't seem to find a workaround that isn't hard-coding URLs directly.

# the goal (functionally)

searchFunction <- function(searchTerm){
  s <- bow("www.internalLibrary.com")

  scrape(s) %>%
    html_node("form") %>%
    html_form() %>%
    set_values(q = searchTerm) %>%
    submit_form(s, .) %>%
    ??????????????????????????????????????
    nod() %>%
    scrape() %>%
    consider_life_choices_that_led_me_here() %>%
    ??????????????????????????????????????
    html_node("#results") %>%
    html_table() %>%
    select("FY19" = cost, "Date" = approval) %>%
    data.frame() %>%
    return()
}

Any input would be appreciated.

First things first

robotstxt::paths_allowed("https://www.google.com/search")
#> [1] FALSE

robotstxt::paths_allowed("https://en.wikipedia.org/w/")
#> [1] FALSE

{polite} follows the scraping permissions stipulated in robots.txt. In this case, both the Google search results and the wikipedia.org/w/ path are not scrapable.
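For contrast, the same check on an ordinary article page (a quick sketch, assuming Wikipedia's robots.txt has not changed) shows the block is specific to the search endpoint rather than the site as a whole:

# article pages under /wiki/ are not disallowed for generic crawlers
robotstxt::paths_allowed("https://en.wikipedia.org/wiki/Moon")
# expected to return TRUE, unlike the /w/ search endpoint above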

nod() merely reuses the session established by bow() and re-checks the permissions for the new path before attempting to scrape. It does not access the content of the web response - that all happens in scrape().
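As a minimal sketch of the intended workflow (assuming /wiki/ article pages remain allowed, and using the same rvest calls as in your examples): bow to the host once, nod() to an allowed path, and scrape() then succeeds.

library(polite)
library(rvest)

# bow once to the host
wiki_session <- bow("https://en.wikipedia.org/")

# nod() re-checks permissions for the new path on the same session
moon_page <- nod(wiki_session, "https://en.wikipedia.org/wiki/Moon")

# scrape() should succeed here, since /wiki/ article pages are not disallowed
moon_page %>%
  scrape() %>%
  html_node(".infobox") %>%
  html_table(fill = TRUE)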

I may need to add more explicit documentation to both functions to make their intention clear.

Ah, that makes sense. I did not realize I could check permissions manually, thank you. I may need to rethink the fundamental approach of using forms at all.
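Concretely, I suppose the manual check could sit in front of the goal function, something like this (a rough sketch; the /search path on our internal site is hypothetical):

library(robotstxt)

# hypothetical path on the internal catalogue; substitute the real search endpoint
search_ok <- robotstxt::paths_allowed("https://www.internalLibrary.com/search")

if (isTRUE(search_ok)) {
  # safe to bow()/nod()/scrape() the results page as in the goal function above
} else {
  message("robots.txt disallows this path, so polite will refuse to scrape it.")
}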