Scrape as a service
ScrapeAAS integrates existing packages and ASP.NET features into a toolstack that enables you, the developer, to design your scraping service in a familiar environment.
Quickstart
Add ASP.NET Hosting, ScrapeAAS, a validator of your choice (here Dawn.Guard, RIP), an object mapper of your choice (here AutoMapper), and the database/message queue you feel most comfortable with (here EF Core with SQLite).
dotnet add package Microsoft.Extensions.Hosting
dotnet add package ScrapeAAS
dotnet add package Dawn.Guard
dotnet add package AutoMapper.Extensions.Microsoft.DependencyInjection
dotnet add package Microsoft.EntityFrameworkCore.Sqlite
Full example of scraping the r/dotnet subreddit.
Create a crawler, a service that periodically triggers scraping.
var builder = Host.CreateApplicationBuilder(args);
builder.Services
    .AddAutoMapper(typeof(Program)) // scan this assembly for mapping profiles
    .AddScrapeAAS()
    .AddHostedService<RedditSubredditCrawler>()
    .AddDataFlow<RedditPostSpider>()
    .AddDataFlow<RedditSqliteSink>();
builder.Build().Run();
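The stages exchange plain record messages. Their definitions are not shown in this example; the following is a minimal sketch inferred from how the messages are constructed in the snippets below, so all names and shapes are assumptions:
// Hypothetical message records inferred from usage in this example;
// Url is AngleSharp's Url type, and the real definitions may differ.
sealed record RedditSubreddit(string Name, Url Url);
sealed record RedditUser(string Id);
sealed record RedditPost(
    Url PostUrl,
    string Title,
    long Upvotes,
    long Comments,
    Url CommentsUrl,
    DateTimeOffset PostedAt,
    RedditUser PostedBy);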
sealed class RedditSubredditCrawler : BackgroundService {
    private readonly IServiceScopeFactory _serviceScopeFactory;
    private readonly ILogger _logger;
    ...
    protected override async Task ExecuteAsync(CancellationToken stoppingToken) {
        ... execute CrawlAsync in a service scope periodically
    }
private async Task CrawlAsync(IDataflowPublisher<RedditSubreddit> publisher, CancellationToken stoppingToken)
{
_logger.LogInformation("Crawling /r/dotnet");
await publisher.PublishAsync(new("dotnet", new("https://old.reddit.com/r/dotnet")), stoppingToken);
_logger.LogInformation("Crawling complete");
}
}
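The elided ExecuteAsync body runs CrawlAsync inside a fresh service scope so the scoped IDataflowPublisher<RedditSubreddit> can be resolved per run. A minimal sketch of such a loop, assuming an hourly interval and the IServiceScopeFactory field from above:
protected override async Task ExecuteAsync(CancellationToken stoppingToken)
{
    // Assumed schedule: crawl once per hour until the host shuts down.
    using PeriodicTimer timer = new(TimeSpan.FromHours(1));
    do
    {
        await using var scope = _serviceScopeFactory.CreateAsyncScope();
        var publisher = scope.ServiceProvider
            .GetRequiredService<IDataflowPublisher<RedditSubreddit>>();
        await CrawlAsync(publisher, stoppingToken);
    }
    while (await timer.WaitForNextTickAsync(stoppingToken));
}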
Implement your spiders, services that collect and normalize data.
sealed class RedditPostSpider : IDataflowHandler<RedditSubreddit> {
    private readonly IAngleSharpBrowserPageLoader _browserPageLoader;
    private readonly IDataflowPublisher<RedditPost> _publisher;
    private readonly ILogger _logger;
    ...
private async Task ParseRedditTopLevelPosts(RedditSubreddit subreddit, CancellationToken stoppingToken)
{
Url root = new("https://old.reddit.com/");
_logger.LogInformation("Parsing top level posts from {RedditSubreddit}", subreddit);
var document = await _browserPageLoader.LoadAsync(subreddit.Url, stoppingToken);
_logger.LogInformation("Request complete");
var queriedContent = document
.QuerySelectorAll("div.thing")
.AsParallel()
.Select(div => new
{
PostUrl = div.QuerySelector("a.title")?.GetAttribute("href"),
Title = div.QuerySelector("a.title")?.TextContent,
Upvotes = div.QuerySelector("div.score.unvoted")?.GetAttribute("title"),
Comments = div.QuerySelector("a.comments")?.TextContent,
CommentsUrl = div.QuerySelector("a.comments")?.GetAttribute("href"),
PostedAt = div.QuerySelector("time")?.GetAttribute("datetime"),
PostedBy = div.QuerySelector("a.author")?.TextContent,
})
.Select(queried => new RedditPost(
new(root, Guard.Argument(queried.PostUrl).NotEmpty()),
Guard.Argument(queried.Title).NotEmpty(),
long.Parse(queried.Upvotes.AsSpan()),
Regex.Match(queried.Comments ?? "", "^\\d+") is { Success: true } commentCount ? long.Parse(commentCount.Value) : 0,
new(queried.CommentsUrl),
DateTimeOffset.Parse(queried.PostedAt.AsSpan()),
new(Guard.Argument(queried.PostedBy).NotEmpty())
), IExceptionHandler.Handle((ex, item) => _logger.LogInformation(ex, "Failed to parse {RedditTopLevelPostBrief}", item)));
foreach (var item in queriedContent)
{
await _publisher.PublishAsync(item, stoppingToken);
}
_logger.LogInformation("Parsing complete");
}
}
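The spider's IDataflowHandler<RedditSubreddit> entry point is elided above; presumably it only forwards the message to the parsing routine, roughly:
public async ValueTask HandleAsync(RedditSubreddit message, CancellationToken cancellationToken = default)
{
    // Each subreddit published by the crawler triggers one scrape run.
    await ParseRedditTopLevelPosts(message, cancellationToken);
}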
Add a sink, a service that commits the scraped data to disk or network.
sealed class RedditSqliteSink : IAsyncDisposable, IDataflowHandler<RedditSubreddit>, IDataflowHandler<RedditPost>
{
private readonly RedditPostSqliteContext _context;
private readonly IMapper _mapper;
...
public async ValueTask DisposeAsync()
{
    // Commit all batched changes once when the scope owning the sink ends.
    await _context.Database.EnsureCreatedAsync();
    await _context.SaveChangesAsync();
}
public async ValueTask HandleAsync(RedditSubreddit message, CancellationToken cancellationToken = default)
{
var messageDto = _mapper.Map<RedditSubredditDto>(message);
await _context.Database.EnsureCreatedAsync(cancellationToken);
await _context.Subreddits.AddAsync(messageDto, cancellationToken);
}
public async ValueTask HandleAsync(RedditPost message, CancellationToken cancellationToken = default)
{
var messageDto = _mapper.Map<RedditPostDto>(message);
if (await _context.Users.FindAsync(new object[] { message.PostedBy.Id }, cancellationToken) is { } existingUser)
{
messageDto.PostedById = existingUser.Id;
messageDto.PostedBy = existingUser;
}
await _context.Database.EnsureCreatedAsync(cancellationToken);
await _context.Posts.AddAsync(messageDto, cancellationToken);
}
}
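The sink relies on an EF Core context and AutoMapper configuration that are not shown. A minimal sketch of what they could look like; the DTO types, connection string, and profile are assumptions:
// Hypothetical EF Core context backing the sink. Register it with e.g.
// builder.Services.AddDbContext<RedditPostSqliteContext>(o => o.UseSqlite("Data Source=reddit.db"));
sealed class RedditPostSqliteContext : DbContext
{
    public RedditPostSqliteContext(DbContextOptions<RedditPostSqliteContext> options) : base(options) { }

    public DbSet<RedditSubredditDto> Subreddits => Set<RedditSubredditDto>();
    public DbSet<RedditUserDto> Users => Set<RedditUserDto>();
    public DbSet<RedditPostDto> Posts => Set<RedditPostDto>();
}

// Hypothetical AutoMapper profile mapping messages to entities.
sealed class RedditProfile : Profile
{
    public RedditProfile()
    {
        CreateMap<RedditSubreddit, RedditSubredditDto>();
        CreateMap<RedditPost, RedditPostDto>();
    }
}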
Why not WebReaper or DotnetSpider?
I have tried both toolstacks and found them wanting, so I tried to do better by delegating as much work as reasonable to existing projects. In addition to my own goals, I evaluated both libraries aiming to keep all their pros and discard all their cons. The verbosity of this library sits comfortably between WebReaper and DotnetSpider, leaning towards the DotnetSpider end.
- Integration into ASP.NET Hosting.
- No dependencies at the core of the project; instead, a reasonable set of addons is packaged by default.
- Use and expose integrated NuGet packages in addons where possible, so developers can benefit from existing ecosystems.
DotnetSpider
The overall data flow in ScrapeAAS is adopted from DotnetSpider: Crawler --> Spider --> Sink.
- Pro: Pub/Sub event handling for decoupled data flow.
- Pro: Easy extensibility by tapping into events.
- Con: Terrible debugging experience when using model annotations.
- Con: Smelly dynamic-riddled design when storing to a database.
- Con: Missing retry policies.
- Con: Requires much boilerplate.
WebReaper
The Puppeteer browser handling is a mixture of the lifetime-tracking HTTP handler and the WebReaper Puppeteer integration.
- Pro: Simple declarative builder API. No boilerplate needed.
- Pro: Easy extensibility by implementing interfaces.
- Pro: Puppeteer browser.
- Con: Unable to control data flow.
- Con: Unable to parse data.
- Con: No ASP.NET or any DI integration possible.
- Con: Dependencies for optional extensions, such as Redis, MySql, and RabbitMq, are always included in the package.