IonicaBizau / scrape-it

🔮 A Node.js scraper for humans.

Home Page: http://ionicabizau.net/blog/30-how-to-write-a-web-scraper-in-node-js


[Feature request] "OR" query

KaKi87 opened this issue

Hello,

Consider the following HTML:

<a class="itemA" href="https://example.com"><span>example.com</span></a>
<a class="itemA" href="https://example.net"><span>example.net</span></a>
<div class="itemB"><a href="https://example.org"><span>example.org</span></a></div>
<div class="itemB"><a href="https://example.edu"><span>example.edu</span></a></div>

From it, I would like to get the following output:

{
  "items": [
    {
      "title": "example.com",
      "url": "https://example.com"
    },
    {
      "title": "example.net",
      "url": "https://example.net"
    },
    {
      "title": "example.org",
      "url": "https://example.org"
    },
    {
      "title": "example.edu",
      "url": "https://example.edu"
    }
  ]
}

However, because the .itemA and .itemB elements are structured differently despite containing the same data, the only way to parse them at the moment is with two separate lists:

{
  'itemsA': {
    listItem: '.itemA',
    data: {
      'title': {
        selector: 'span'
      },
      'url': {
        attr: 'href'
      }
    }
  },
  'itemsB': {
    listItem: '.itemB',
    data: {
      'title': {
        selector: 'span'
      },
      'url': {
        selector: 'a',
        attr: 'href'
      }
    }
  }
}

(JSFiddle demo)

The two resulting arrays then have to be merged manually with [...itemsA, ...itemsB].
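
For completeness, here is roughly what that extra merge step looks like. The `result` object is a hand-written stand-in for what the two-list schema returns (an assumption for illustration, not actual scraper output), and `merged` is just a name I picked:

```javascript
// Hand-written stand-in for the object returned by the two-list schema
// (assumption: typed in for illustration, not produced by running the scraper).
const result = {
  itemsA: [
    { title: 'example.com', url: 'https://example.com' },
    { title: 'example.net', url: 'https://example.net' }
  ],
  itemsB: [
    { title: 'example.org', url: 'https://example.org' },
    { title: 'example.edu', url: 'https://example.edu' }
  ]
};

// Merge both lists into the single "items" array I actually want,
// keeping the .itemA entries first and the .itemB entries after them.
const { itemsA, itemsB } = result;
const merged = { items: [...itemsA, ...itemsB] };

console.log(JSON.stringify(merged, null, 2));
```

It works, but every consumer of the schema has to remember to do it, and the ordering is by list rather than by position in the page.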

I would therefore like to suggest adding an "OR" query, for example:

{
  'items': {
    listItem: '.itemA, .itemB',
    data: {
      'title': {
        selector: 'span'
      },
      'url': {
        or: [
          {
            attr: 'href'
          },
          {
            selector: 'a',
            attr: 'href'
          }
        ]
      }
    }
  }
}

This would return the desired output directly.
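
To make the semantics I have in mind more concrete, here is a rough sketch: the alternatives are tried in order and the first one that yields a non-empty value wins. Everything in it is hypothetical: `resolveOr` and `extractFrom` are not part of scrape-it, and the elements are modelled as plain objects instead of real DOM nodes so that the snippet runs on its own.

```javascript
// Hypothetical helper, NOT part of scrape-it: try each alternative in order
// and return the first value that is not undefined, null, or an empty string.
function resolveOr(alternatives, extract) {
  for (const alt of alternatives) {
    const value = extract(alt);
    if (value !== undefined && value !== null && value !== '') {
      return value;
    }
  }
  return undefined;
}

// Plain-object stand-ins for the two element shapes in the HTML above
// (assumption: simplified models, not real DOM or cheerio nodes).
const itemA = { attrs: { href: 'https://example.com' }, children: {} };
const itemB = {
  attrs: {},
  children: { a: { attrs: { href: 'https://example.org' }, children: {} } }
};

// Minimal extractor: follow an optional `selector` to a child, then read `attr`.
const extractFrom = el => alt => {
  const target = alt.selector ? el.children[alt.selector] : el;
  return target ? target.attrs[alt.attr] : undefined;
};

// Same alternatives as in the proposed `or` array.
const urlAlternatives = [
  { attr: 'href' },
  { selector: 'a', attr: 'href' }
];

const urlA = resolveOr(urlAlternatives, extractFrom(itemA));
const urlB = resolveOr(urlAlternatives, extractFrom(itemB));
console.log(urlA, urlB);
```

For an .itemA element the first alternative already matches; for an .itemB element it yields nothing, so the second one is used.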

I also have a question: how could I get the title property if there were no span element?

Thanks


PS: Thank you for this library. For years I dreamt of finding or creating something like this, until I stumbled on it yesterday.