shrysr / webscrape-nomatic

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Simple project to scrape product information off the Nomatic site

I think they have a fantastic range of products and this is just an experiment of a fan of the company and product, and a hobby project.

Reference URL: https://www.nomatic.com/products/the-nomatic-travel-pack

This mini project was inspired by a question posed by the awesome Max Humber via Linked in.

Environment

Renv can be used to quickly install the required libraries. Use renv::install()

General plan/tasks

  • beginnings of a function to pull in all the products on multiple pages and pull in key descriptions for individual pages. This
  • [X] individual functions have to be plugged into a map function for obtaining all the product URL’s
  • [ ] Product URL’s have to be reviewed since there are some spurious products like insurance
  • [ ] table formation and info extract need to be fleshed out and improved
  • [ ] function to be created for individual product feature extraction

Code demo

Load Libraries and Source functions

## Loading Libraries ---
library(dplyr)
library(rvest)
library(purrr)
library(fs)
library(xopen)
library(furrr)
library(stringr)

## Source functions
source("functions.R")

Init Variables and URL

## Prefix URLs
all_products_prefix_url = "https://www.nomatic.com/collections/all-products?page="
individual_product_prefix_url = "https://www.nomatic.com/products/"


## Number of pages to scrape 6
## The number of pages was manually obtained by inspection
page_range <- seq(1:6)

Obtain All product URLs

all_product_url_tbl <- page_range %>% map_df(
                                        ~ consolidate_all_links_on_a_page(
  all_products_prefix_url = all_products_prefix_url,
  page_no = .x,
  individual_product_prefix_url))

all_product_url_tbl

Obtaining individual product features

Using a single product url for initial exploration

url_test <- all_product_url_tbl[2,2] %>% pull

html <- read_html(url_test)

print(url_test)

Main product features :

main_product_features <- html %>%
  html_nodes(".product-features") %>%
  html_nodes("li") %>%
  html_text() %>%
  str_trim() %>%
  unique()
### Carousel details ----
carousel_details <- html %>%
  html_nodes(".accordion") %>%
  html_nodes(".accordion_body") %>%
  html_text() %>%
  str_trim() %>%
  unique() %>% 
  tibble("detailed_features" = .)

### Carousel headers
carousel_feature_names <- html %>%
  html_nodes(".accordion") %>%
  html_nodes(".accordion_head") %>%
  html_text() %>%
  str_trim() %>%
  unique() %>% 
  tibble("feature_names" = .) 

all_carousel_details_tbl <-
  bind_cols(carousel_feature_names) %>%
  bind_cols(carousel_details)

About

License:MIT License


Languages

Language:R 100.0%