Pathways not scraping / updating with nod()
cnice018 opened this issue · comments
I am trying to grab search results from coded form submissions for work. I am attempting the process on widely usable sites before application to our local library search page.
My question is how to handle the search results. For example, let's say I need to reference the moon for some reason:
Generate query results
library(rvest)
library(polite)
library(tidyverse)
# establish the search engine where the form is located
gBow <- bow("https://www.google.com/")
# fill out form
gSearchForm <- scrape(gBow) %>%
  html_node("form") %>%
  html_form() %>%
  set_values(q = "Moon")
# get results of query
results <- submit_form(gBow, gSearchForm, submit = "btnG")
Format and display query results
# display in browser works fine
resultsPath <- results %>% .[["url"]]
resultsPath %>% browseURL()
# scrape results throws an error "No scraping allowed here"
scrape(results)
# nod() does not allow "access" to the results page, which was my first thought
gSearchNod <- nod(gBow, resultsPath)
# resulting session URL is still www.google.com, not the updated URL
scrape(gSearchNod)
# and yet I can still navigate the results page with the rvest commands just fine
results %>%
  follow_link("Moon - Wikipedia") %>%
  html_node(".infobox") %>%
  html_table(fill = TRUE) %>%
  select("Stat" = X1, "Dist" = X2) %>%
  filter(Stat %in% c("Perigee", "Apogee"))
So, now let's try to eliminate a step and query Wikipedia directly. Note the difference in behavior at the display-URL step, and the overall similarity to the problem above.
# establish the search engine where the form is located
wikiBow <- bow("https://en.wikipedia.org/wiki/Main_Page")
# fill out form
wikiSearchForm <- scrape(wikiBow) %>%
  html_node("form") %>%
  html_form() %>%
  set_values(search = "Moon")
# get results of query
results <- submit_form(wikiBow, wikiSearchForm, submit = "fulltext")
Displaying the result in a browser takes you to the internal search page rather than forwarding you to the page named "wiki/Moon".
That isn't actually a problem, as long as I can parse this page.
resultsPath <- results %>% .[["url"]]
resultsPath %>% browseURL()
# scrape results throws an error "No scraping allowed here", same error
scrape(results)
# nod() does not allow "access" to the results page, which was my first thought
wSearchNod <- nod(wikiBow, resultsPath)
scrape(wSearchNod)
# and yet I can still navigate the results page with the rvest commands just fine
results %>%
  follow_link("Moon") %>%
  html_node(".infobox") %>%
  html_table(fill = TRUE) %>%
  select("Stat" = X1, "Dist" = X2) %>%
  filter(Stat %in% c("Perigee", "Apogee"))
I can reproduce the error reliably across search engines, including the internal one I am designing this for. I assume the error is related to the permissions of the results pages, but I can't seem to find a workaround that doesn't involve hard-coding URLs directly.
# the goal (functionally)
searchFunction <- function(searchTerm) {
  s <- bow("www.internalLibrary.com")
  scrape(s) %>%
    html_node("form") %>%
    html_form() %>%
    set_values(q = searchTerm) %>%
    submit_form(s, .) %>%
    ??????????????????????????????????????
    nod() %>%
    scrape() %>%
    consider_life_choices_that_led_me_here() %>%
    ??????????????????????????????????????
    html_node("#results") %>%
    html_table() %>%
    select("FY19" = cost, "Date" = approval) %>%
    data.frame() %>%
    return()
}
Any input would be appreciated.
First things first
robotstxt::paths_allowed("https://www.google.com/search")
#> [1] FALSE
robotstxt::paths_allowed("https://en.wikipedia.org/w/")
#> [1] FALSE
{polite} follows the scraping permissions stipulated in robots.txt. In this case both Google search results and wikipedia.org/w/ are not scrapable.
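A small helper along these lines might make the permission check reusable before any scraping is attempted. This is only a sketch: `can_scrape` and its `check` argument are hypothetical names introduced here for illustration, and the only library call relied on is `robotstxt::paths_allowed()` as used above.

```r
library(robotstxt)

# Hypothetical helper (not part of {polite} or {robotstxt}): returns TRUE only
# when robots.txt permits `bot` to fetch `url`. The `check` argument defaults
# to robotstxt::paths_allowed() and exists only so the lookup can be swapped out.
can_scrape <- function(url, bot = "*", check = robotstxt::paths_allowed) {
  isTRUE(check(paths = url, bot = bot))
}

# Usage (needs network access, since robots.txt is downloaded):
# can_scrape("https://www.google.com/search")
# can_scrape("https://en.wikipedia.org/w/")
```

Wrapping the result in `isTRUE()` means anything other than a clear `TRUE` is treated as "do not scrape", which errs on the side of caution.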
nod() merely reuses the session established by bow() and re-checks the permissions before attempting to scrape. It does not access the content of the web response; that all happens in scrape().
I may need to add more explicit documentation to both functions to make their intention clear.
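The division of labour between the three functions, as described above, could be sketched roughly as follows. `polite_fetch` is a hypothetical wrapper written for illustration, and the path `wiki/Moon` in the usage comment is an assumption that has not been checked against Wikipedia's robots.txt.

```r
library(polite)

# Hypothetical wrapper illustrating the bow -> nod -> scrape sequence.
polite_fetch <- function(base_url, path) {
  # bow(): introduce the client to the host and establish the session
  session <- bow(base_url)
  # nod(): point the same session at a new path and re-check permissions;
  # no page content is accessed at this stage
  session <- nod(session, path = path)
  # scrape(): perform the actual request, subject to the permissions above
  scrape(session)
}

# Usage (needs network access; "wiki/Moon" is an assumed, unverified path):
# moon <- polite_fetch("https://en.wikipedia.org/", "wiki/Moon")
```

If the path is permitted, the object returned by `scrape()` can be passed on to the usual rvest functions such as `html_node()` and `html_table()`.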
Ah, that makes sense. I did not realize I could check permissions manually, thank you. I may need to rethink the fundamental approach of using forms at all.