dmi3kno / polite

Be nice on the web

Home Page: https://dmi3kno.github.io/polite/

discrepancy between rvest and bow + scrape

njtierney opened this issue

Hi there!

Great package! I'm just about to teach it in a class (hooray for responsible web scraping! :) ). I've run into this problem - I'm wondering if you happen to have a solution, or if you would prefer that I posted this on Stack Overflow?

Details below - let me know if you have any questions - and thanks again for this great package! 🎉

library(polite)
library(rvest)
#> Loading required package: xml2
site <- "http://stats.espncricinfo.com/ci/engine/stats/index.html?class=10;page=1;team=289;template=results;type=batting;wrappertype=print"
check_site <- bow(site, force = TRUE)
check_site
#> <polite session> http://stats.espncricinfo.com/ci/engine/stats/index.html?class=10;page=1;team=289;template=results;type=batting;wrappertype=print
#>      User-agent: polite R package - https://github.com/dmi3kno/polite
#>      robots.txt: 6 rules are defined for 1 bots
#>     Crawl delay: 15 sec
#>   The path is scrapable for this user-agent
scraped_site <- scrape(check_site)
#> Warning: Client error: (400) Bad Request http://
#> stats.espncricinfo.com/ci/engine/stats/index.html?
#> class=10%3Bpage%3D1%3Bteam%3D289%3Btemplate%3Dresults%3Btype%3Dbatting%3Bwrappertype%3Dprint
scraped_site
#> NULL

rvest_site <- read_html(site)
rvest_site
#> {html_document}
#> <html xmlns="http://www.w3.org/1999/xhtml">
#> [1] <head>\n<title>Batting records | Women's Twenty20 Internationals | C ...
#> [2] <body onload="return guruStart();">\n<div id="ciMainContainer">\n <d ...

Created on 2019-08-29 by the reprex package (v0.3.0)

I cannot reproduce this issue.

library(polite)
library(rvest)
#> Loading required package: xml2

site <- "http://stats.espncricinfo.com/ci/engine/stats/index.html?class=10;page=1;team=289;template=results;type=batting;wrappertype=print"
check_site <- bow(site, force = TRUE)
check_site
#> <polite session> http://stats.espncricinfo.com/ci/engine/stats/index.html?class=10;page=1;team=289;template=results;type=batting;wrappertype=print
#>      User-agent: polite R package - https://github.com/dmi3kno/polite
#>      robots.txt: 6 rules are defined for 1 bots
#>     Crawl delay: 15 sec
#>   The path is scrapable for this user-agent

scraped_site <- scrape(check_site)
scraped_site
#> {html_document}
#> <html xmlns="http://www.w3.org/1999/xhtml">
#> [1] <head>\n<title>Batting records | Women's Twenty20 Internationals | C ...
#> [2] <body onload="return guruStart();">\n<div id="ciMainContainer">\n <d ...

rvest_site <- read_html(site)
rvest_site
#> {html_document}
#> <html xmlns="http://www.w3.org/1999/xhtml">
#> [1] <head>\n<title>Batting records | Women's Twenty20 Internationals | C ...
#> [2] <body onload="return guruStart();">\n<div id="ciMainContainer">\n <d ...

Created on 2019-08-29 by the reprex package (v0.3.0)

Session info
devtools::session_info()
#> ─ Session info ──────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 3.5.3 (2019-03-11)
#>  os       Ubuntu 18.04.2 LTS          
#>  system   x86_64, linux-gnu           
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_AU.UTF-8                 
#>  ctype    en_AU.UTF-8                 
#>  tz       Australia/Melbourne         
#>  date     2019-08-29                  
#> 
#> ─ Packages ──────────────────────────────────────────────────────────────
#>  package     * version    date       lib source                         
#>  assertthat    0.2.1      2019-03-21 [1] CRAN (R 3.5.1)                 
#>  backports     1.1.4      2019-04-10 [1] CRAN (R 3.5.1)                 
#>  callr         3.3.1      2019-07-18 [1] CRAN (R 3.5.3)                 
#>  cli           1.1.0      2019-03-19 [1] CRAN (R 3.5.1)                 
#>  crayon        1.3.4      2017-09-16 [1] CRAN (R 3.5.1)                 
#>  curl          4.0        2019-07-22 [1] CRAN (R 3.5.3)                 
#>  desc          1.2.0      2018-05-01 [1] CRAN (R 3.5.1)                 
#>  devtools      2.1.0      2019-07-06 [1] CRAN (R 3.5.3)                 
#>  digest        0.6.20     2019-07-04 [1] CRAN (R 3.5.3)                 
#>  evaluate      0.14       2019-05-28 [1] CRAN (R 3.5.3)                 
#>  fs            1.3.1      2019-05-06 [1] CRAN (R 3.5.3)                 
#>  glue          1.3.1      2019-03-12 [1] CRAN (R 3.5.1)                 
#>  here          0.1        2017-05-28 [1] CRAN (R 3.5.1)                 
#>  highr         0.8        2019-03-20 [1] CRAN (R 3.5.1)                 
#>  htmltools     0.3.6      2017-04-28 [1] CRAN (R 3.5.1)                 
#>  httr          1.4.1      2019-08-05 [1] CRAN (R 3.5.3)                 
#>  knitr         1.23       2019-05-18 [1] CRAN (R 3.5.3)                 
#>  magrittr      1.5        2014-11-22 [1] CRAN (R 3.5.1)                 
#>  memoise       1.1.0      2017-04-21 [1] CRAN (R 3.5.1)                 
#>  mime          0.7        2019-06-11 [1] CRAN (R 3.5.3)                 
#>  pkgbuild      1.0.4      2019-08-05 [1] CRAN (R 3.5.3)                 
#>  pkgload       1.0.2      2018-10-29 [1] CRAN (R 3.5.1)                 
#>  polite      * 0.0.0.9005 2019-02-05 [1] Github (dmi3kno/polite@445bf49)
#>  prettyunits   1.0.2      2015-07-13 [1] CRAN (R 3.5.1)                 
#>  processx      3.4.1      2019-07-18 [1] CRAN (R 3.5.3)                 
#>  ps            1.3.0      2018-12-21 [1] CRAN (R 3.5.1)                 
#>  R6            2.4.0      2019-02-14 [1] CRAN (R 3.5.1)                 
#>  ratelimitr    0.4.1      2018-10-07 [1] CRAN (R 3.5.1)                 
#>  Rcpp          1.0.2      2019-07-25 [1] CRAN (R 3.5.3)                 
#>  remotes       2.1.0      2019-06-24 [1] CRAN (R 3.5.3)                 
#>  rlang         0.4.0.9002 2019-08-27 [1] Github (r-lib/rlang@15e799c)   
#>  rmarkdown     1.14       2019-07-12 [1] CRAN (R 3.5.3)                 
#>  robotstxt     0.6.2      2018-07-18 [1] CRAN (R 3.5.1)                 
#>  rprojroot     1.3-2      2018-01-03 [1] CRAN (R 3.5.1)                 
#>  rvest       * 0.3.4      2019-05-15 [1] CRAN (R 3.5.3)                 
#>  sessioninfo   1.1.1      2018-11-05 [1] CRAN (R 3.5.1)                 
#>  spiderbar     0.2.1      2017-11-17 [1] CRAN (R 3.5.1)                 
#>  stringi       1.4.3      2019-03-12 [1] CRAN (R 3.5.1)                 
#>  stringr       1.4.0      2019-02-10 [1] CRAN (R 3.5.1)                 
#>  testthat      2.2.1      2019-07-25 [1] CRAN (R 3.5.3)                 
#>  triebeard     0.3.0      2016-08-04 [1] CRAN (R 3.5.1)                 
#>  urltools      1.7.3      2019-04-14 [1] CRAN (R 3.5.3)                 
#>  usethis       1.5.1      2019-07-04 [1] CRAN (R 3.5.3)                 
#>  withr         2.1.2      2018-03-15 [1] CRAN (R 3.5.1)                 
#>  xfun          0.8        2019-06-25 [1] CRAN (R 3.5.3)                 
#>  xml2        * 1.2.1      2019-07-29 [1] CRAN (R 3.5.3)                 
#>  yaml          2.2.0      2018-07-25 [1] CRAN (R 3.5.1)                 
#> 
#> [1] /home/mitchell/R/x86_64-pc-linux-gnu-library/3.5
#> [2] /usr/local/lib/R/library

You guys need to include sessionInfo() output for me to tell who's right and who's wrong :)

But in this case I tend to agree that polite does not seem to be prepared for a ";" separator between the query parameters.
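
The encoded URL in the warning above is consistent with the whole query being read as a single parameter, class, whose value is everything after the first "=". A minimal base R sketch of that reading follows. It is an assumption about the cause, not polite's actual code, and the variable names are illustrative only.

```r
# Assumption: the query is split on "&" only, so the semicolon-separated
# string is treated as one parameter named "class".
query_string <- "class=10;page=1;team=289;template=results;type=batting;wrappertype=print"

# Split at the first "=" only: the name is on the left, the rest is the value.
eq_pos      <- regexpr("=", query_string, fixed = TRUE)
param_name  <- substr(query_string, 1, eq_pos - 1)
param_value <- substr(query_string, eq_pos + 1, nchar(query_string))

# Percent-encode the value, reserved characters (";" and "=") included.
encoded <- paste0(param_name, "=", utils::URLencode(param_value, reserved = TRUE))
encoded
#> [1] "class=10%3Bpage%3D1%3Bteam%3D289%3Btemplate%3Dresults%3Btype%3Dbatting%3Bwrappertype%3Dprint"
```

The result matches the query string in the 400 Bad Request warning, which suggests the server rejected the request because ";" and "=" had been escaped.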

This works fine for me.

site <- "http://stats.espncricinfo.com/ci/engine/stats/index.html?class=10&page=1&team=289&template=results&type=batting&wrappertype=print"
bow(site) %>% scrape()

As well as this one

site <- "http://stats.espncricinfo.com/ci/engine/stats/index.html"
bow(site) %>% scrape(query = list(class="10",
                                  page="1",
                                  team="289",
                                  template="results",
                                  type="batting",
                                  wrappertype="print"))
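
For a version of polite that predates the fix, the named list above could be built from the original URL rather than typed by hand. The sketch below uses only base R; split_query() is a hypothetical helper, not part of polite, and it assumes a single "?" in the URL and no "=" inside parameter values.

```r
site <- "http://stats.espncricinfo.com/ci/engine/stats/index.html?class=10;page=1;team=289;template=results;type=batting;wrappertype=print"

# Hypothetical helper: split a URL into its base and a named list of query
# parameters, accepting either ";" or "&" as the separator.
split_query <- function(url) {
  parts <- strsplit(url, "?", fixed = TRUE)[[1]]
  if (length(parts) < 2) {
    return(list(base = parts[1], query = list()))
  }
  pairs <- strsplit(parts[2], "[;&]")[[1]]
  kv    <- strsplit(pairs, "=", fixed = TRUE)
  # A parameter without "=" gets an empty string as its value.
  values <- vapply(kv, function(p) if (length(p) > 1) p[2] else "", character(1))
  names(values) <- vapply(kv, `[`, character(1), 1)
  list(base = parts[1], query = as.list(values))
}

parts <- split_query(site)
parts$base
str(parts$query)

# The pieces can then be passed on as in the example above (needs network):
# bow(parts$base) %>% scrape(query = parts$query)
```

Passing the parameters as a list leaves the encoding of the query to scrape(), so the separator used in the original URL no longer matters.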

I will look at the (recently revised) code for scrape() and come back to you, although I am hesitant to claim that the ";" separator is "standard".

Added the session info! 👍
Thanks for your inputs.

I found a typo in the code and pushed the update to the package. Can you try reinstalling with

remotes::install_github("dmi3kno/polite")

Thank you for reporting this issue!

@njtierney and @mitchelloharawild could you test and report whether this fixes the semicolon-separated arguments issue for you?
@mitchelloharawild your version of polite is a little old. You will need to reinstall it as indicated above.

Hi @dmi3kno! Sorry I didn't get back to you with the session info. Good news - it works for me now! Thanks for the fix! 🎉

library(polite)
packageVersion("polite")
#> [1] '0.1.0'
library(rvest)
#> Loading required package: xml2
site <- "http://stats.espncricinfo.com/ci/engine/stats/index.html?class=10;page=1;team=289;template=results;type=batting;wrappertype=print"
check_site <- bow(site, force = TRUE)
check_site
#> <polite session> http://stats.espncricinfo.com/ci/engine/stats/index.html?class=10;page=1;team=289;template=results;type=batting;wrappertype=print
#>     User-agent: polite R package - https://github.com/dmi3kno/polite
#>     robots.txt: 6 rules are defined for 1 bots
#>    Crawl delay: 15 sec
#>   The path is scrapable for this user-agent
scraped_site <- scrape(check_site)
scraped_site
#> {html_document}
#> <html xmlns="http://www.w3.org/1999/xhtml">
#> [1] <head>\n<title>Batting records | Women's Twenty20 Internationals | C ...
#> [2] <body onload="return guruStart();">\n<div id="ciMainContainer">\n <d ...
rvest_site <- read_html(site)
rvest_site
#> {html_document}
#> <html xmlns="http://www.w3.org/1999/xhtml">
#> [1] <head>\n<title>Batting records | Women's Twenty20 Internationals | C ...
#> [2] <body onload="return guruStart();">\n<div id="ciMainContainer">\n <d ...

Created on 2019-09-02 by the reprex package (v0.3.0)

Session info
devtools::session_info()
#> ─ Session info ──────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 3.6.1 (2019-07-05)
#>  os       macOS Mojave 10.14.6        
#>  system   x86_64, darwin15.6.0        
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_AU.UTF-8                 
#>  ctype    en_AU.UTF-8                 
#>  tz       Australia/Melbourne         
#>  date     2019-09-02                  
#> 
#> ─ Packages ──────────────────────────────────────────────────────────────
#>  package     * version    date       lib source                         
#>  assertthat    0.2.1      2019-03-21 [1] CRAN (R 3.6.0)                 
#>  backports     1.1.4      2019-04-10 [1] CRAN (R 3.6.0)                 
#>  callr         3.3.1      2019-07-18 [1] CRAN (R 3.6.1)                 
#>  cli           1.1.0      2019-03-19 [1] CRAN (R 3.6.0)                 
#>  crayon        1.3.4      2017-09-16 [1] CRAN (R 3.6.0)                 
#>  curl          4.0        2019-07-22 [1] CRAN (R 3.6.0)                 
#>  desc          1.2.0      2018-05-01 [1] CRAN (R 3.6.0)                 
#>  devtools      2.1.0      2019-07-06 [1] CRAN (R 3.6.0)                 
#>  digest        0.6.20     2019-07-04 [1] CRAN (R 3.6.0)                 
#>  evaluate      0.14       2019-05-28 [1] CRAN (R 3.6.0)                 
#>  fs            1.3.1      2019-05-06 [1] CRAN (R 3.6.0)                 
#>  glue          1.3.1.9000 2019-07-29 [1] Github (tidyverse/glue@423b7e5)
#>  here          0.1        2017-05-28 [1] standard (@0.1)                
#>  highr         0.8        2019-03-20 [1] CRAN (R 3.6.0)                 
#>  htmltools     0.3.6      2017-04-28 [1] CRAN (R 3.6.0)                 
#>  httr          1.4.1      2019-08-05 [1] CRAN (R 3.6.0)                 
#>  knitr         1.24       2019-08-08 [1] CRAN (R 3.6.0)                 
#>  magrittr      1.5        2014-11-22 [1] CRAN (R 3.6.0)                 
#>  memoise       1.1.0      2017-04-21 [1] CRAN (R 3.6.0)                 
#>  mime          0.7        2019-06-11 [1] CRAN (R 3.6.0)                 
#>  pkgbuild      1.0.4      2019-08-05 [1] CRAN (R 3.6.0)                 
#>  pkgload       1.0.2      2018-10-29 [1] CRAN (R 3.6.0)                 
#>  polite      * 0.1.0      2019-08-30 [1] Github (dmi3kno/polite@def0def)
#>  prettyunits   1.0.2      2015-07-13 [1] CRAN (R 3.6.0)                 
#>  processx      3.4.1      2019-07-18 [1] CRAN (R 3.6.1)                 
#>  ps            1.3.0      2018-12-21 [1] CRAN (R 3.6.0)                 
#>  R6            2.4.0      2019-02-14 [1] CRAN (R 3.6.0)                 
#>  ratelimitr    0.4.1      2018-10-07 [1] CRAN (R 3.6.0)                 
#>  Rcpp          1.0.2      2019-07-25 [1] CRAN (R 3.6.0)                 
#>  remotes       2.1.0      2019-06-24 [1] CRAN (R 3.6.0)                 
#>  rlang         0.4.0      2019-06-25 [1] CRAN (R 3.6.0)                 
#>  rmarkdown     1.14       2019-07-12 [1] CRAN (R 3.6.0)                 
#>  robotstxt     0.6.2      2018-07-18 [1] CRAN (R 3.6.0)                 
#>  rprojroot     1.3-2      2018-01-03 [1] CRAN (R 3.6.0)                 
#>  rvest       * 0.3.4      2019-05-15 [1] CRAN (R 3.6.0)                 
#>  sessioninfo   1.1.1      2018-11-05 [1] CRAN (R 3.6.0)                 
#>  spiderbar     0.2.2      2019-08-19 [1] CRAN (R 3.6.0)                 
#>  stringi       1.4.3      2019-03-12 [1] CRAN (R 3.6.0)                 
#>  stringr       1.4.0      2019-02-10 [1] CRAN (R 3.6.0)                 
#>  testthat      2.2.1      2019-07-25 [1] CRAN (R 3.6.0)                 
#>  usethis       1.5.1      2019-07-04 [1] CRAN (R 3.6.0)                 
#>  withr         2.1.2      2018-03-15 [1] CRAN (R 3.6.0)                 
#>  xfun          0.8        2019-06-25 [1] CRAN (R 3.6.0)                 
#>  xml2        * 1.2.2      2019-08-09 [1] CRAN (R 3.6.0)                 
#>  yaml          2.2.0      2018-07-25 [1] CRAN (R 3.6.0)                 
#> 
#> [1] /Library/Frameworks/R.framework/Versions/3.6/Resources/library

Thank you. Issue closed.

Thank you for solving the issue so quickly - love using this package. Looking forward to seeing it on CRAN soon :)