alexleen / scrape-x

Simple .NET library that provides generic web scraping abilities using XPaths.





Basic features:

  • Fluent interface
  • Pagination
  • Throttling
  • HttpClient injection
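
As a rough illustration of how throttling and HttpClient injection can fit together, the sketch below builds an HttpClient whose handler enforces a minimum delay between requests. It uses only standard .NET types. The names MinimumDelayHandler and ThrottledClient are hypothetical and not part of scrape-x, and how the library accepts an injected client is covered in the wiki rather than shown here.

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical handler (not part of scrape-x): spaces out outgoing requests.
public sealed class MinimumDelayHandler : DelegatingHandler
{
    private readonly TimeSpan _minimumDelay;
    private readonly SemaphoreSlim _gate = new SemaphoreSlim(1, 1);
    private DateTime _lastRequestUtc = DateTime.MinValue;

    public MinimumDelayHandler(TimeSpan minimumDelay, HttpMessageHandler inner)
        : base(inner)
    {
        _minimumDelay = minimumDelay;
    }

    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        // Serialize the timing check so concurrent callers queue up.
        await _gate.WaitAsync(cancellationToken);
        try
        {
            TimeSpan wait = _lastRequestUtc + _minimumDelay - DateTime.UtcNow;
            if (wait > TimeSpan.Zero)
            {
                await Task.Delay(wait, cancellationToken);
            }
            _lastRequestUtc = DateTime.UtcNow;
        }
        finally
        {
            _gate.Release();
        }

        return await base.SendAsync(request, cancellationToken);
    }
}

// Hypothetical helper that wires the handler into an HttpClient.
public static class ThrottledClient
{
    public static HttpClient Create(TimeSpan minimumDelay)
    {
        return new HttpClient(new MinimumDelayHandler(minimumDelay, new HttpClientHandler()));
    }
}
```

A client built with ThrottledClient.Create(TimeSpan.FromSeconds(2)) would start at most one request every two seconds, regardless of which code issues the requests.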


For how-tos, examples, and documentation, please see the wiki.

Example Usage

private static void Main(string[] args)
{
    IScraperFactory scraperFactory = new ScraperFactory();

    //Set up a new scraper to scrape Austin's craigslist
    IPaginatingScraper scraper = scraperFactory.CreatePaginatingScraper("");

    //Set the URL for the results page. In this case, "apts/housing for rent".
    scraper
           //Set the XPath for search result nodes
           //Sets the XPath for search result links relative to result node
           //Sets a predicate that decides whether an individual result should be visited.
           //In this case, results are only visited if their "housing" span contains "1br".
           //This saves considerable bandwidth.
           .SetResultVisitPredicate(housing => housing.Contains("1br"), "p/span[2]/span[2]")
           //Sets "Next" button link XPath
           //Sets XPaths used for retrieving data from the target page.
           //Keys are used to identify the data in the callback to the Go method.
           .SetTargetPageXPaths(new Dictionary<string, string>
           {
               { "latitude", "//*[@id=\"map\"]/@data-latitude" },
               { "longitude", "//*[@id=\"map\"]/@data-longitude" },
               { "price", "/html/body/section/section/h2/span[2]/span[1]" },
               { "br", "/html/body/section/section/section/div[1]/p[1]/span[1]/b[1]" },
               { "sqft", "/html/body/section/section/section/div[1]/p[1]/span[2]/b" }
           })
           //Every time a target page is scraped, this callback is called.
           .Go(OnResultRetrieved);
}

private static void OnResultRetrieved(string link, IDictionary<string, string> results)
{
    //Do something with the results...
}
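
The callback body is left open above. One possible shape for it is sketched below, assuming the dictionary keys match those passed to SetTargetPageXPaths and that a value may be missing or empty when its XPath matches nothing (an assumption, not confirmed here). The ResultHandling class and the ParsePrice helper are hypothetical names introduced for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class ResultHandling
{
    // Same signature as the OnResultRetrieved callback in the example above.
    public static void OnResultRetrieved(string link, IDictionary<string, string> results)
    {
        // TryGetValue avoids a KeyNotFoundException if a key is absent.
        results.TryGetValue("price", out string rawPrice);
        results.TryGetValue("sqft", out string rawSqft);

        decimal? price = ParsePrice(rawPrice);

        Console.WriteLine("{0}: price={1}, sqft={2}",
            link,
            price.HasValue ? price.Value.ToString(CultureInfo.InvariantCulture) : "n/a",
            string.IsNullOrWhiteSpace(rawSqft) ? "n/a" : rawSqft.Trim());
    }

    // Hypothetical helper: turns text such as "$1,250" into a decimal,
    // or returns null when the text is missing or not a number.
    public static decimal? ParsePrice(string raw)
    {
        if (string.IsNullOrWhiteSpace(raw))
        {
            return null;
        }

        string cleaned = raw.Trim().TrimStart('$');
        return decimal.TryParse(cleaned, NumberStyles.Number, CultureInfo.InvariantCulture, out decimal value)
            ? value
            : (decimal?)null;
    }
}
```

Keeping the parsing in a small helper makes it testable without running a scrape, and treating unparseable values as null keeps one malformed listing from stopping the run.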





License: MIT License

Language: C# 100.0%