discrepancy between rvest and bow + scrape
njtierney opened this issue · comments
Hi there!
Great package! I'm just about to teach it in a class (hooray for responsible web scraping! :) ). I've run into this problem - I'm wondering if you happen to have a solution, or if you would prefer that I posted this on Stack Overflow?
Details below - let me know if you have any questions - and thanks again for this great package!
library(polite)
library(rvest)
#> Loading required package: xml2
site <- "http://stats.espncricinfo.com/ci/engine/stats/index.html?class=10;page=1;team=289;template=results;type=batting;wrappertype=print"
check_site <- bow(site, force = TRUE)
check_site
#> <polite session> http://stats.espncricinfo.com/ci/engine/stats/index.html?class=10;page=1;team=289;template=results;type=batting;wrappertype=print
#> User-agent: polite R package - https://github.com/dmi3kno/polite
#> robots.txt: 6 rules are defined for 1 bots
#> Crawl delay: 15 sec
#> The path is scrapable for this user-agent
scraped_site <- scrape(check_site)
#> Warning: Client error: (400) Bad Request http://
#> stats.espncricinfo.com/ci/engine/stats/index.html?
#> class=10%3Bpage%3D1%3Bteam%3D289%3Btemplate%3Dresults%3Btype%3Dbatting%3Bwrappertype%3Dprint
scraped_site
#> NULL
rvest_site <- read_html(site)
rvest_site
#> {html_document}
#> <html xmlns="http://www.w3.org/1999/xhtml">
#> [1] <head>\n<title>Batting records | Women's Twenty20 Internationals | C ...
#> [2] <body onload="return guruStart();">\n<div id="ciMainContainer">\n <d ...
Created on 2019-08-29 by the reprex package (v0.3.0)
I cannot reproduce this issue.
library(polite)
library(rvest)
#> Loading required package: xml2
site <- "http://stats.espncricinfo.com/ci/engine/stats/index.html?class=10;page=1;team=289;template=results;type=batting;wrappertype=print"
check_site <- bow(site, force = TRUE)
check_site
#> <polite session> http://stats.espncricinfo.com/ci/engine/stats/index.html?class=10;page=1;team=289;template=results;type=batting;wrappertype=print
#> User-agent: polite R package - https://github.com/dmi3kno/polite
#> robots.txt: 6 rules are defined for 1 bots
#> Crawl delay: 15 sec
#> The path is scrapable for this user-agent
scraped_site <- scrape(check_site)
scraped_site
#> {html_document}
#> <html xmlns="http://www.w3.org/1999/xhtml">
#> [1] <head>\n<title>Batting records | Women's Twenty20 Internationals | C ...
#> [2] <body onload="return guruStart();">\n<div id="ciMainContainer">\n <d ...
rvest_site <- read_html(site)
rvest_site
#> {html_document}
#> <html xmlns="http://www.w3.org/1999/xhtml">
#> [1] <head>\n<title>Batting records | Women's Twenty20 Internationals | C ...
#> [2] <body onload="return guruStart();">\n<div id="ciMainContainer">\n <d ...
Created on 2019-08-29 by the reprex package (v0.3.0)
Session info
devtools::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────
#> setting value
#> version R version 3.5.3 (2019-03-11)
#> os Ubuntu 18.04.2 LTS
#> system x86_64, linux-gnu
#> ui X11
#> language (EN)
#> collate en_AU.UTF-8
#> ctype en_AU.UTF-8
#> tz Australia/Melbourne
#> date 2019-08-29
#>
#> ─ Packages ───────────────────────────────────────────────────────────────
#> package * version date lib source
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.5.1)
#> backports 1.1.4 2019-04-10 [1] CRAN (R 3.5.1)
#> callr 3.3.1 2019-07-18 [1] CRAN (R 3.5.3)
#> cli 1.1.0 2019-03-19 [1] CRAN (R 3.5.1)
#> crayon 1.3.4 2017-09-16 [1] CRAN (R 3.5.1)
#> curl 4.0 2019-07-22 [1] CRAN (R 3.5.3)
#> desc 1.2.0 2018-05-01 [1] CRAN (R 3.5.1)
#> devtools 2.1.0 2019-07-06 [1] CRAN (R 3.5.3)
#> digest 0.6.20 2019-07-04 [1] CRAN (R 3.5.3)
#> evaluate 0.14 2019-05-28 [1] CRAN (R 3.5.3)
#> fs 1.3.1 2019-05-06 [1] CRAN (R 3.5.3)
#> glue 1.3.1 2019-03-12 [1] CRAN (R 3.5.1)
#> here 0.1 2017-05-28 [1] CRAN (R 3.5.1)
#> highr 0.8 2019-03-20 [1] CRAN (R 3.5.1)
#> htmltools 0.3.6 2017-04-28 [1] CRAN (R 3.5.1)
#> httr 1.4.1 2019-08-05 [1] CRAN (R 3.5.3)
#> knitr 1.23 2019-05-18 [1] CRAN (R 3.5.3)
#> magrittr 1.5 2014-11-22 [1] CRAN (R 3.5.1)
#> memoise 1.1.0 2017-04-21 [1] CRAN (R 3.5.1)
#> mime 0.7 2019-06-11 [1] CRAN (R 3.5.3)
#> pkgbuild 1.0.4 2019-08-05 [1] CRAN (R 3.5.3)
#> pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.5.1)
#> polite * 0.0.0.9005 2019-02-05 [1] Github (dmi3kno/polite@445bf49)
#> prettyunits 1.0.2 2015-07-13 [1] CRAN (R 3.5.1)
#> processx 3.4.1 2019-07-18 [1] CRAN (R 3.5.3)
#> ps 1.3.0 2018-12-21 [1] CRAN (R 3.5.1)
#> R6 2.4.0 2019-02-14 [1] CRAN (R 3.5.1)
#> ratelimitr 0.4.1 2018-10-07 [1] CRAN (R 3.5.1)
#> Rcpp 1.0.2 2019-07-25 [1] CRAN (R 3.5.3)
#> remotes 2.1.0 2019-06-24 [1] CRAN (R 3.5.3)
#> rlang 0.4.0.9002 2019-08-27 [1] Github (r-lib/rlang@15e799c)
#> rmarkdown 1.14 2019-07-12 [1] CRAN (R 3.5.3)
#> robotstxt 0.6.2 2018-07-18 [1] CRAN (R 3.5.1)
#> rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.5.1)
#> rvest * 0.3.4 2019-05-15 [1] CRAN (R 3.5.3)
#> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.5.1)
#> spiderbar 0.2.1 2017-11-17 [1] CRAN (R 3.5.1)
#> stringi 1.4.3 2019-03-12 [1] CRAN (R 3.5.1)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 3.5.1)
#> testthat 2.2.1 2019-07-25 [1] CRAN (R 3.5.3)
#> triebeard 0.3.0 2016-08-04 [1] CRAN (R 3.5.1)
#> urltools 1.7.3 2019-04-14 [1] CRAN (R 3.5.3)
#> usethis 1.5.1 2019-07-04 [1] CRAN (R 3.5.3)
#> withr 2.1.2 2018-03-15 [1] CRAN (R 3.5.1)
#> xfun 0.8 2019-06-25 [1] CRAN (R 3.5.3)
#> xml2 * 1.2.1 2019-07-29 [1] CRAN (R 3.5.3)
#> yaml 2.2.0 2018-07-25 [1] CRAN (R 3.5.1)
#>
#> [1] /home/mitchell/R/x86_64-pc-linux-gnu-library/3.5
#> [2] /usr/local/lib/R/library
You guys need to include sessionInfo() in order for me to tell who's right and who's wrong :)
But in this case I tend to agree that polite does not seem to be prepared for the ";" separator between the query parameters.
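A minimal base R sketch of what probably goes wrong (an illustration only, not polite's actual code; it assumes the query is split on "&" and each value is then percent-encoded):

```r
# Query string from the failing URL, using ";" as the separator
query <- "class=10;page=1;team=289;template=results;type=batting;wrappertype=print"

# Splitting on "&" finds no separator, so the whole string is one pair
pairs <- strsplit(query, "&", fixed = TRUE)[[1]]
length(pairs)

# Everything after the first "=" is treated as the value of "class"
key   <- sub("=.*$", "", pairs)
value <- sub("^[^=]*=", "", pairs)

# Percent-encoding the value turns ";" into %3B and "=" into %3D
encoded <- paste0(key, "=", utils::URLencode(value, reserved = TRUE))
encoded
```

The encoded string has the same shape as the query in the 400 Bad Request warning from the original report.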
This works fine for me.
site <- "http://stats.espncricinfo.com/ci/engine/stats/index.html?class=10&page=1&team=289&template=results&type=batting&wrappertype=print"
bow(site) %>% scrape()
So does this one:
site <- "http://stats.espncricinfo.com/ci/engine/stats/index.html"
bow(site) %>% scrape(query = list(class = "10",
                                  page = "1",
                                  team = "289",
                                  template = "results",
                                  type = "batting",
                                  wrappertype = "print"))
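If the URL is only available in its semicolon form, the query list above could be derived rather than typed by hand. A rough sketch in base R, where split_semicolon_query() is a hypothetical helper and not part of polite:

```r
# Hypothetical helper: split a URL with a ";"-separated query into
# a base URL and a named list of query parameters
split_semicolon_query <- function(url) {
  parts  <- strsplit(url, "?", fixed = TRUE)[[1]]
  pairs  <- strsplit(parts[2], ";", fixed = TRUE)[[1]]
  keys   <- sub("=.*$", "", pairs)
  values <- sub("^[^=]*=", "", pairs)
  list(base = parts[1], query = as.list(stats::setNames(values, keys)))
}

site <- "http://stats.espncricinfo.com/ci/engine/stats/index.html?class=10;page=1;team=289;template=results;type=batting;wrappertype=print"
pieces <- split_semicolon_query(site)
str(pieces$query)

# Then, with polite loaded (not run here because it needs network access):
# bow(pieces$base) %>% scrape(query = pieces$query)
```

The sketch assumes a single "?" in the URL and no parameter values that themselves contain ";".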
I will look at the (recently revised) code for scrape() and come back to you, although I am hesitant to claim that the ";" separator is "standard".
Added the session info!
Thanks for your input.
I found a typo in the code and pushed the update to the package. Can you try reinstalling with
remotes::install_github("dmi3kno/polite")
Thank you for reporting this issue!
@njtierney and @mitchelloharawild would you test and report if this fixes the semicolon-separated arguments issue for you?
@mitchelloharawild your version of polite is a little old. You will need to reinstall it as indicated above.
Hi @dmi3kno! Sorry I didn't get back to you with the session info. Good news - it works for me now! Thanks for the fix!
library(polite)
packageVersion("polite")
#> [1] '0.1.0'
library(rvest)
#> Loading required package: xml2
site <- "http://stats.espncricinfo.com/ci/engine/stats/index.html?class=10;page=1;team=289;template=results;type=batting;wrappertype=print"
check_site <- bow(site, force = TRUE)
check_site
#> <polite session> http://stats.espncricinfo.com/ci/engine/stats/index.html?class=10;page=1;team=289;template=results;type=batting;wrappertype=print
#> User-agent: polite R package - https://github.com/dmi3kno/polite
#> robots.txt: 6 rules are defined for 1 bots
#> Crawl delay: 15 sec
#> The path is scrapable for this user-agent
scraped_site <- scrape(check_site)
scraped_site
#> {html_document}
#> <html xmlns="http://www.w3.org/1999/xhtml">
#> [1] <head>\n<title>Batting records | Women's Twenty20 Internationals | C ...
#> [2] <body onload="return guruStart();">\n<div id="ciMainContainer">\n <d ...
rvest_site <- read_html(site)
rvest_site
#> {html_document}
#> <html xmlns="http://www.w3.org/1999/xhtml">
#> [1] <head>\n<title>Batting records | Women's Twenty20 Internationals | C ...
#> [2] <body onload="return guruStart();">\n<div id="ciMainContainer">\n <d ...
Created on 2019-09-02 by the reprex package (v0.3.0)
Session info
devtools::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────
#> setting value
#> version R version 3.6.1 (2019-07-05)
#> os macOS Mojave 10.14.6
#> system x86_64, darwin15.6.0
#> ui X11
#> language (EN)
#> collate en_AU.UTF-8
#> ctype en_AU.UTF-8
#> tz Australia/Melbourne
#> date 2019-09-02
#>
#> ─ Packages ───────────────────────────────────────────────────────────────
#> package * version date lib source
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.6.0)
#> backports 1.1.4 2019-04-10 [1] CRAN (R 3.6.0)
#> callr 3.3.1 2019-07-18 [1] CRAN (R 3.6.1)
#> cli 1.1.0 2019-03-19 [1] CRAN (R 3.6.0)
#> crayon 1.3.4 2017-09-16 [1] CRAN (R 3.6.0)
#> curl 4.0 2019-07-22 [1] CRAN (R 3.6.0)
#> desc 1.2.0 2018-05-01 [1] CRAN (R 3.6.0)
#> devtools 2.1.0 2019-07-06 [1] CRAN (R 3.6.0)
#> digest 0.6.20 2019-07-04 [1] CRAN (R 3.6.0)
#> evaluate 0.14 2019-05-28 [1] CRAN (R 3.6.0)
#> fs 1.3.1 2019-05-06 [1] CRAN (R 3.6.0)
#> glue 1.3.1.9000 2019-07-29 [1] Github (tidyverse/glue@423b7e5)
#> here 0.1 2017-05-28 [1] standard (@0.1)
#> highr 0.8 2019-03-20 [1] CRAN (R 3.6.0)
#> htmltools 0.3.6 2017-04-28 [1] CRAN (R 3.6.0)
#> httr 1.4.1 2019-08-05 [1] CRAN (R 3.6.0)
#> knitr 1.24 2019-08-08 [1] CRAN (R 3.6.0)
#> magrittr 1.5 2014-11-22 [1] CRAN (R 3.6.0)
#> memoise 1.1.0 2017-04-21 [1] CRAN (R 3.6.0)
#> mime 0.7 2019-06-11 [1] CRAN (R 3.6.0)
#> pkgbuild 1.0.4 2019-08-05 [1] CRAN (R 3.6.0)
#> pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.6.0)
#> polite * 0.1.0 2019-08-30 [1] Github (dmi3kno/polite@def0def)
#> prettyunits 1.0.2 2015-07-13 [1] CRAN (R 3.6.0)
#> processx 3.4.1 2019-07-18 [1] CRAN (R 3.6.1)
#> ps 1.3.0 2018-12-21 [1] CRAN (R 3.6.0)
#> R6 2.4.0 2019-02-14 [1] CRAN (R 3.6.0)
#> ratelimitr 0.4.1 2018-10-07 [1] CRAN (R 3.6.0)
#> Rcpp 1.0.2 2019-07-25 [1] CRAN (R 3.6.0)
#> remotes 2.1.0 2019-06-24 [1] CRAN (R 3.6.0)
#> rlang 0.4.0 2019-06-25 [1] CRAN (R 3.6.0)
#> rmarkdown 1.14 2019-07-12 [1] CRAN (R 3.6.0)
#> robotstxt 0.6.2 2018-07-18 [1] CRAN (R 3.6.0)
#> rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.6.0)
#> rvest * 0.3.4 2019-05-15 [1] CRAN (R 3.6.0)
#> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.6.0)
#> spiderbar 0.2.2 2019-08-19 [1] CRAN (R 3.6.0)
#> stringi 1.4.3 2019-03-12 [1] CRAN (R 3.6.0)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 3.6.0)
#> testthat 2.2.1 2019-07-25 [1] CRAN (R 3.6.0)
#> usethis 1.5.1 2019-07-04 [1] CRAN (R 3.6.0)
#> withr 2.1.2 2018-03-15 [1] CRAN (R 3.6.0)
#> xfun 0.8 2019-06-25 [1] CRAN (R 3.6.0)
#> xml2 * 1.2.2 2019-08-09 [1] CRAN (R 3.6.0)
#> yaml 2.2.0 2018-07-25 [1] CRAN (R 3.6.0)
#>
#> [1] /Library/Frameworks/R.framework/Versions/3.6/Resources/library
Thank you. Issue closed.
Thank you for solving the issue so quickly - love using this package. Looking forward to seeing it on CRAN soon :)