Is there a way to match against hostnames/domains?

Question

jahamed opened this issue a year ago · comments

I'm trying to analyze which domains show up in adblock filter lists, and would love to be able to scan the lists and check whether just a given hostname/domain appears. Simple text searches get quite inefficient, so I was wondering whether any of the algorithms in this library could help.

ex:
instead of ||easypic.com/js/easypicads.js I would like to match against easypic.com

Is this possible with this library? Thanks!

Rémi · Answer 1 · Fri Jan 20 2023 19:57:50 GMT+0800 (China Standard Time)

Hi @jahamed,

Trying to rephrase what you mean, let me know if I get this right. You have a domain in mind like easypic.com and you would like to know the list of filters that match this domain?

For example ||easypic.com/js/easypicads.js (exact hostname match), ||cdn.easypic.com/js/easypicads.js (hostname with a subdomain), or ||example.com/js/script.js$domain=easypic.com (hostname appears in the domain option).

But it would not find ||example.com/js/easypicads.js (easypic appears, but without the .com suffix) or ||example.com/easypic.com/js (easypic.com appears, but in the path, where it cannot match the hostname of a request).

Is that correct? If so you can probably use this library to do the parsing of rules and filtering with the domains you're interested in. Does this need to be very fast? Otherwise, simpler approaches might also work.
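The matching rules listed above can be sketched in plain TypeScript without the library. This is only a rough illustration: `hostnamesInFilter` and `filterMentionsDomain` are hypothetical helpers, not part of @cliqz/adblocker, and they assume that only the leading `||hostname` anchor and the `$domain=` option are worth inspecting. Other filter syntax (regex filters, cosmetic filters, wildcards inside hostnames) is ignored or only crudely skipped.

```typescript
// Hypothetical helper (not a library API): collect the hostnames that a
// network filter names, via a leading `||hostname` anchor or a `$domain=` option.
function hostnamesInFilter(filter: string): string[] {
  const line = filter.trim();
  // Crude skip of empty lines, comments, list headers and cosmetic filters.
  if (line === '' || line.startsWith('!') || line.startsWith('[') || line.includes('##')) {
    return [];
  }
  // Exception rules start with `@@`; the rest is parsed like a normal filter.
  const body = line.startsWith('@@') ? line.slice(2) : line;
  const hostnames: string[] = [];

  // `||` anchors the pattern at a hostname; the hostname ends at the first
  // character that cannot be part of one (`/`, `^`, `$`, `*`, ...).
  const anchored = /^\|\|([a-z0-9.-]+)/i.exec(body);
  const anchoredHost = anchored?.[1];
  if (anchoredHost) hostnames.push(anchoredHost.toLowerCase());

  // Options follow the last `$` and are comma-separated; `domain=` holds a
  // `|`-separated list whose entries may be negated with `~`.
  const optionsStart = body.lastIndexOf('$');
  if (optionsStart !== -1) {
    for (const option of body.slice(optionsStart + 1).split(',')) {
      if (!option.startsWith('domain=')) continue;
      for (const entry of option.slice('domain='.length).split('|')) {
        const name = entry.replace(/^~/, '').toLowerCase();
        if (name !== '') hostnames.push(name);
      }
    }
  }
  return hostnames;
}

// True when the filter names `domain` itself or one of its subdomains.
function filterMentionsDomain(filter: string, domain: string): boolean {
  const target = domain.toLowerCase();
  return hostnamesInFilter(filter).some(
    (hostname) => hostname === target || hostname.endsWith('.' + target),
  );
}
```

With these assumptions, the three positive examples for easypic.com are reported as matches, while a filter that only carries the name in its path is not, because the path is never inspected.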

I hope that helps,

Javed Ahamed · Answer 2 · Sat Jan 21 2023 01:58:24 GMT+0800 (China Standard Time)

Hi @remusao sorry for the late response,

Yes, all the cases you listed are ones I want to detect! I would like to see whether a domain simply exists in an adblock filter list (basically, either the entire website or some script from it was considered bad at some point). It needs to be reasonably fast, since there are a ton of filter lists to look at. Is there a way to load multiple filter lists into one engine? Is there a limit on that?

Currently I have

import { FiltersEngine, Request } from '@cliqz/adblocker'
import fs from 'fs'

// Build an engine from a single filter list stored on disk.
const engine = FiltersEngine.parse(fs.readFileSync('test.txt', 'utf-8'))

// `match` is true when a network filter in the engine would block this request.
const { match } = engine.match(
  Request.fromRawDetails({
    url: 'http://exampledomain.com',
  })
)
console.log(match)

Does this look correct? What type should I use in the request for a URL? Also, is there a way to combine multiple filter lists into one FiltersEngine? Basically, I'm hoping to use whatever parsing/searching algorithms you have built into this library, since a simple linear search is too slow for me 😃
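For the "does this domain appear in any list" use case, one alternative to a linear text search is to build a lookup table once over all lists. The following is a minimal sketch under stated assumptions, not the library's algorithm: `extractHostnames` and `buildDomainIndex` are hypothetical names, and only `||hostname` anchors and `$domain=` options are considered.

```typescript
// Hypothetical helper: hostnames named by one filter line, taken from a
// leading `||hostname` anchor (optionally after `@@`) and a `$domain=` option.
function extractHostnames(line: string): string[] {
  const found: string[] = [];
  const anchoredHost = /^(?:@@)?\|\|([a-z0-9.-]+)/i.exec(line)?.[1];
  if (anchoredHost) found.push(anchoredHost.toLowerCase());
  const domainList = /\$(?:.*,)?domain=([^,]+)/i.exec(line)?.[1];
  if (domainList) {
    for (const entry of domainList.split('|')) {
      const name = entry.replace(/^~/, '').toLowerCase();
      if (name !== '') found.push(name);
    }
  }
  return found;
}

// Index every filter under each hostname it names and under every parent
// domain of that hostname, so `cdn.easypic.com` is also found via `easypic.com`.
function buildDomainIndex(lists: string[]): Map<string, Set<string>> {
  const index = new Map<string, Set<string>>();
  for (const list of lists) {
    for (const rawLine of list.split(/\r?\n/)) {
      const line = rawLine.trim();
      if (line === '' || line.startsWith('!')) continue; // blank or comment
      for (const hostname of extractHostnames(line)) {
        const labels = hostname.split('.');
        for (let i = 0; i < labels.length; i += 1) {
          const key = labels.slice(i).join('.');
          if (key === '') continue;
          let bucket = index.get(key);
          if (bucket === undefined) {
            bucket = new Set<string>();
            index.set(key, bucket);
          }
          bucket.add(line);
        }
      }
    }
  }
  return index;
}

// Example with two tiny in-memory lists (contents are made up for illustration).
const listA = '! Title: A\n||easypic.com/js/easypicads.js\n||example.com/easypic.com/js';
const listB = '||cdn.easypic.com/js/easypicads.js\n||example.com/js/script.js$domain=easypic.com';
const domainIndex = buildDomainIndex([listA, listB]);
const filtersForEasypic = domainIndex.get('easypic.com') ?? new Set<string>();
console.log(filtersForEasypic.size);
```

After the one-time build, each domain check is a single `Map` lookup regardless of how many lists were loaded. The trade-off is memory: short keys such as a bare top-level domain collect very large buckets, so a real implementation might skip them.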

Thanks!

Rémi · Answer 3 · Sun Feb 26 2023 18:35:47 GMT+0800 (China Standard Time)

Hi @jahamed,

Sorry for the delayed answer. Are you still interested in this? If so, I wanted to clarify one last point: are you giving as input a list of domains that you would like to find in the lists, or a single one?

Best,

Javed Ahamed · Answer 4 · Sun Feb 26 2023 20:05:32 GMT+0800 (China Standard Time)

@remusao no worries! I don't really need an answer to this anymore, since the requirements changed slightly. Thank you!