ghostery / adblocker

Efficient embeddable adblocker library

Home Page: https://www.ghostery.com

support filter list pre-processor rules, eg. ifFirefox

seia-soto opened this issue · comments

Containing environmental information in Filter

By storing environment information such as env_<extension_name> as a bit mask in the Filter object, we can choose at runtime which filters to enable or disable.

Problem: Exceeding 32 bits to include all variables

In this case, we need to be prepared to expand the current structure. According to the uBlock Origin wiki, there are 24 preprocessor variables in total. However, the NetworkFilter mask already occupies bit positions 0 through 29 (with only bit 16 free), and the CosmeticFilter mask uses 8 bits.

The mask of `NetworkFilter`
/**
 * Masks used to store options of network filters in a bitmask.
 */
export const enum NETWORK_FILTER_MASK {
  // Request Type
  fromDocument = 1 << 0,
  fromFont = 1 << 1,
  fromHttp = 1 << 2,
  fromHttps = 1 << 3,
  fromImage = 1 << 4,
  fromMedia = 1 << 5,
  fromObject = 1 << 6,
  fromOther = 1 << 7,
  fromPing = 1 << 8,
  fromScript = 1 << 9,
  fromStylesheet = 1 << 10,
  fromSubdocument = 1 << 11,
  fromWebsocket = 1 << 12, // e.g.: ws, wss
  fromXmlHttpRequest = 1 << 13,

  // Partiness
  firstParty = 1 << 14,
  thirdParty = 1 << 15,

  // Options
  // FREE - 1 << 16
  isBadFilter = 1 << 17,
  isCSP = 1 << 18,
  isGenericHide = 1 << 19,
  isImportant = 1 << 20,
  isSpecificHide = 1 << 21,

  // Kind of patterns
  isFullRegex = 1 << 22,
  isRegex = 1 << 23,
  isUnicode = 1 << 24,
  isLeftAnchor = 1 << 25,
  isRightAnchor = 1 << 26,
  isException = 1 << 27,
  isHostnameAnchor = 1 << 28,
  isRedirectRule = 1 << 29,
}
The mask of `CosmeticFilter`
/**
 * Masks used to store options of cosmetic filters in a bitmask.
 */
const enum COSMETICS_MASK {
  unhide = 1 << 0,
  scriptInject = 1 << 1,
  isUnicode = 1 << 2,
  isClassSelector = 1 << 3,
  isIdSelector = 1 << 4,
  isHrefSelector = 1 << 5,
  remove = 1 << 6,
  extended = 1 << 7,
}

https://github.com/gorhill/uBlock/wiki/Static-filter-syntax#if-condition

Some variables can be merged into one variable: e.g. adguard.

I would recommend not storing this information in the filter objects themselves, but in some other data structure of the FiltersEngine class or the network/cosmetic buckets. One reason is the one you mention: the need to add more attributes. This would add overhead to all filters, even though only a very small minority of them are affected by the pre-processor rules. A second reason is more conceptual: the pre-processor directives are not part of the filters themselves (in the list definitions) but sit outside of them, indicating which filters should be included depending on external conditions. Lastly, and this is a more minor point, these directives could potentially be resolved statically at engine build time, in which case extra attributes on every filter would be pure overhead with no function (in this sense an optional data structure stored outside of the filters makes more sense to me; see below).

An alternative approach could be to have an optional set of filter IDs per environment at the FiltersEngine level, which we can then use to discard matching filters that do not belong to the current environment.
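A minimal sketch of that idea, assuming filters are identified by numeric IDs. The names `EnvironmentIndex`, `exclude`, and `isAllowed` are hypothetical and not part of the library; the sketch only illustrates keeping the per-environment information outside of the filters.

```typescript
// Hypothetical helper, not part of @ghostery/adblocker.
type FilterId = number;

class EnvironmentIndex {
  // For each environment name, the IDs of filters that must NOT apply there.
  private readonly excluded = new Map<string, Set<FilterId>>();

  exclude(env: string, id: FilterId): void {
    let ids = this.excluded.get(env);
    if (ids === undefined) {
      ids = new Set<FilterId>();
      this.excluded.set(env, ids);
    }
    ids.add(id);
  }

  // Filters without any entry are allowed everywhere, so the common
  // case costs one map lookup and no per-filter storage.
  isAllowed(env: string, id: FilterId): boolean {
    const ids = this.excluded.get(env);
    return ids === undefined || !ids.has(id);
  }
}

// Usage: discard matched filters that do not belong to the current environment.
const envIndex = new EnvironmentIndex();
envIndex.exclude('env_firefox', 42);

const matchedIds: FilterId[] = [7, 42, 99];
const keptIds = matchedIds.filter((id) => envIndex.isAllowed('env_firefox', id));
```

Because the index is optional, an engine built without any pre-processor directives would carry an empty map and pay no extra cost per filter.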

Having an optional set of filter IDs that defines which filters should be disabled looks like a good approach to me. Also, given the nature of this project as an embeddable library, the environment information should be supplied from outside.

I'm not sure preprocessor flags can be applied at build time, for a few reasons:

  • Capability conditions, like cap_html_filtering, cannot be resolved at build time.
  • In the future we want users to load custom lists and create their own filters, so preprocessor support can be useful at runtime.
  • With 24 flags supported by uBO, we would have to generate 24 more engines. Given that we produce engines for many versions of the adblocker library, that would cost a lot of effort. Basically, every flag is a cost multiplier.

So a separate data structure may be the best compromise. We may want to reserve one bit to mark filters that have preprocessor conditions, so we can skip the runtime checks (and their cost) for the majority of filters.
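The reserved-bit idea could look roughly like the sketch below. Using bit 16 (marked FREE in NETWORK_FILTER_MASK) is an assumption made for illustration, and `HAS_PREPROCESSOR`, `FilterLike`, and `isFilterEnabled` are hypothetical names.

```typescript
// Assumption: bit 16 is the free slot in NETWORK_FILTER_MASK.
const HAS_PREPROCESSOR = 1 << 16;

// Hypothetical minimal shape of a filter for this sketch.
interface FilterLike {
  id: number;
  mask: number;
}

function isFilterEnabled(filter: FilterLike, disabledIds: ReadonlySet<number>): boolean {
  // Fast path: most filters carry no preprocessor condition, so a single
  // bit test avoids the set lookup entirely.
  if ((filter.mask & HAS_PREPROCESSOR) === 0) {
    return true;
  }
  // Slow path: consult the separate data structure.
  return !disabledIds.has(filter.id);
}

// Usage
const disabledForEnv = new Set<number>([2]);
const plainFilter: FilterLike = { id: 1, mask: 0 };
const conditionalFilter: FilterLike = { id: 2, mask: HAS_PREPROCESSOR };
const plainEnabled = isFilterEnabled(plainFilter, disabledForEnv);
const conditionalEnabled = isFilterEnabled(conditionalFilter, disabledForEnv);
```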

I have an idea for this case, and I think we can implement the preprocessor at both build time and runtime.

First, we need one bit and one byte to express:

  • A bit indicating whether the IFilter object (NetworkFilter or CosmeticFilter) has an additional byte field expressing compatibility
  • A byte holding the compatibility table (bit window)

For example, for a network filter:

export const enum NETWORK_FILTER_MASK {
  ...
  // Internals
  hasCompatibilityTable = 1 << 30,
}

If the 1 << 30 bit is set, the NetworkFilter deserializer reads the next single byte and parses it as the compatibility table.

MASK [ 4 bytes ]
COMPAT_MASK [ 1 byte ] (optional, present when bit 30, the 31st bit, of MASK is set)
...
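As a rough sketch of that layout, assuming the mask is written as a 32-bit unsigned integer (bit 30 does not fit in anything smaller) and using a plain `DataView` instead of the library's own buffer class. The function names are hypothetical.

```typescript
const HAS_COMPATIBILITY_TABLE = 1 << 30;

// Writes MASK (4 bytes) followed by COMPAT_MASK (1 byte) only when provided.
function serializeMask(mask: number, compat?: number): Uint8Array {
  const bytes = new Uint8Array(compat !== undefined ? 5 : 4);
  const view = new DataView(bytes.buffer);
  if (compat !== undefined) {
    view.setUint32(0, (mask | HAS_COMPATIBILITY_TABLE) >>> 0);
    view.setUint8(4, compat);
  } else {
    view.setUint32(0, (mask & ~HAS_COMPATIBILITY_TABLE) >>> 0);
  }
  return bytes;
}

// Reads MASK, then looks at bit 30 to decide whether one more byte follows.
function deserializeMask(bytes: Uint8Array): { mask: number; compat?: number; size: number } {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const mask = view.getUint32(0);
  if ((mask & HAS_COMPATIBILITY_TABLE) === 0) {
    return { mask, size: 4 };
  }
  return { mask, compat: view.getUint8(4), size: 5 };
}

// Usage
const withTable = deserializeMask(serializeMask(0b101, 0b11));
const withoutTable = deserializeMask(serializeMask(0b101));
```

Filters without a compatibility table keep their current size, which is the point of gating the extra byte behind a single bit.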

Also, we'll have an option that determines whether this information is parsed at build time. To minimize the impact on the existing user base, I want this behavior to be opt-in.

For example, we can add the following option to Config:

...
loadAdditionalCompatibilityTable: boolean;

If config.loadAdditionalCompatibilityTable is set to false, the filter parser will skip the line.
Otherwise, the filter parser will parse it and save an additional field in the filter object: COMPAT_MASK, after MASK.

With this method, we only need to decide the behavior of the runtime implementation, because the build-time step won't parse the filter at all.

After thinking it over, I have a more detailed solution and I'm working on it right now. First, I made a PREPROCESSOR_MASK enum type to express conditionals:

export const enum PREPROCESSOR_MASK {
  isUnsupportedPlatform = 1 << 0,
  isManifestV3 = 1 << 1,
  isMobile = 1 << 2,
  // RESERVE = 1 << 3,

  // Browser specs
  isBrowserChromium = 1 << 4,
  isBrowserFirefox = 1 << 5,
  isBrowserSafari = 1 << 6,
  isBrowserOpera = 1 << 7,

  // Capabilities
  hasHtmlFilteringCapability = 1 << 8,
  hasUserStylesheetCapability = 1 << 9,
  // RESERVE = 1 << {10...12}

  // Else
  false = 1 << 13,
  invalid = 1 << 14,
  // RESERVE = 1 << 15
}

For now, I expect we'll need at most 16 bits (uint16) of preprocessor masks. This makes it possible to support two operators. I'm going to allocate the upper 16 bits (from the left side) to the OR operator and the remaining lower 16 bits to the AND operator.

In other words, we're going to have MUST-have bits and OPTIONAL bits.

[OR]
1 << 31
...
1 << 16
[AND]
1 << 15
...
1 << 0
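The split could be sketched as follows. The bit values mirror PREPROCESSOR_MASK above; the helper names and the evaluation rule in `matchesEnvironment` (all AND bits required, and at least one OR bit when any is set) are assumptions, since the exact semantics of the OR half are not fixed here.

```typescript
// Values copied from PREPROCESSOR_MASK for a self-contained example.
const IS_MOBILE = 1 << 2;
const IS_BROWSER_CHROMIUM = 1 << 4;
const IS_BROWSER_FIREFOX = 1 << 5;

// Upper 16 bits hold the OR (optional) half, lower 16 bits the AND (must-have) half.
function packConditions(andBits: number, orBits: number): number {
  return (((orBits & 0xffff) << 16) | (andBits & 0xffff)) >>> 0;
}

function unpackConditions(packed: number): { andBits: number; orBits: number } {
  return { andBits: packed & 0xffff, orBits: packed >>> 16 };
}

// Assumed rule: every AND bit must be present in the environment, and if
// any OR bit is set, at least one of them must be present as well.
function matchesEnvironment(packed: number, env: number): boolean {
  const { andBits, orBits } = unpackConditions(packed);
  const allRequired = (env & andBits) === andBits;
  const anyOptional = orBits === 0 || (env & orBits) !== 0;
  return allRequired && anyOptional;
}

// Usage: "mobile AND (chromium OR firefox)"
const packedCondition = packConditions(IS_MOBILE, IS_BROWSER_CHROMIUM | IS_BROWSER_FIREFOX);
const onMobileFirefox = matchesEnvironment(packedCondition, IS_MOBILE | IS_BROWSER_FIREFOX);
const onDesktopFirefox = matchesEnvironment(packedCondition, IS_BROWSER_FIREFOX);
```

The `>>> 0` keeps the packed value an unsigned 32-bit integer even when the top bit of the OR half is set.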

Also, looking at uBlock Origin's source code, I found that it always evaluates tokens from left to right, so the leftmost operations always bind first.

The following two are equivalent:

$token $op $token $op $token

((($token) $op $token) $op $token)

So we can parse the expression backwards, starting from the last token.

If the last $op is AND, we put the last $token into the AND span of the mask, which means the engine MUST have the $token capability to use the filter.

Otherwise, if the last $op is OR, we put the last $token into the OR span of the mask, which means the engine accepts the filter whether or not it has the $token capability.

The only problematic case is an expression with OR in the middle and AND at the end:

capA _AND_ capB _OR_ capC _AND_ capD
  • Are capA and capB MUST-have in this case? We can't store this information.
  • capC is optional in this case.
  • capD is MUST-have in this case.

However, I expect this to be a rare case that we won't run into for now.
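To make the left-to-right rule and the mixed expression above concrete, here is a small evaluator sketch. It is not uBlock Origin's code; `Op`, `evaluateLeftToRight`, and the capability names are illustrative only.

```typescript
type Op = 'AND' | 'OR';

// Evaluates `t0 op0 t1 op1 t2 ...` strictly left to right, i.e. as
// (((t0 op0 t1) op1 t2) ...), with no operator precedence.
function evaluateLeftToRight(tokens: string[], ops: Op[], env: ReadonlySet<string>): boolean {
  if (tokens.length === 0 || ops.length !== tokens.length - 1) {
    throw new Error('malformed expression');
  }
  let result = false;
  tokens.forEach((token, i) => {
    const value = env.has(token);
    if (i === 0) {
      result = value;
    } else {
      result = ops[i - 1] === 'AND' ? result && value : result || value;
    }
  });
  return result;
}

// capA AND capB OR capC AND capD
const exprTokens = ['capA', 'capB', 'capC', 'capD'];
const exprOps: Op[] = ['AND', 'OR', 'AND'];

// Only capA and capB available: ((true && true) || false) && false
const withAB = evaluateLeftToRight(exprTokens, exprOps, new Set(['capA', 'capB']));
// Only capC and capD available: ((false && false) || true) && true
const withCD = evaluateLeftToRight(exprTokens, exprOps, new Set(['capC', 'capD']));
```

Under this rule `withAB` is false even though capA and capB are both present, because the trailing AND makes capD mandatory. That is exactly the information a flat pair of AND/OR bit spans cannot represent for capA and capB.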

Alternative: have an optional reference to Preprocessor from IFilter and evaluate the conditional at runtime

Another alternative is to have an optional reference from IFilter to a Preprocessor.

export default interface Preprocessor {
  id: number;
  condition: number[]; // masks
  operators: boolean[]; // corresponding operator per condition (mask)
}

import { StaticDataView } from '../data-view';
import Preprocessor from '../somewhere';

export default interface IFilter {
  mask: number;
  preprocessorRef?: Preprocessor['id'];
  getPreprocessor: () => Preprocessor | undefined; // let me assume this feature is opt-in
  getId: () => number;
  getTokens: () => Uint32Array[];
  serialize: (buffer: StaticDataView) => void;
  getSerializedSize(compression: boolean): number;
}

The Preprocessor reference from IFilter is created optionally at build time.

The advantage of this solution over the previous one is that the filter needs no extra fields that grow with the number of conditions and operators in the Preprocessor, just one optional byte.

Also, serializing the Preprocessor poses no problem.
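A sketch of how such a reference might be resolved at runtime. The interpretation of `operators` is an assumption: here `operators[i] === true` means AND and `false` means OR, joining `condition[i]` to the running result, with `operators[0]` ignored. `PreprocessorTable` and `evaluatePreprocessor` are hypothetical names.

```typescript
interface Preprocessor {
  id: number;
  condition: number[]; // masks
  operators: boolean[]; // assumed: true = AND, false = OR; index 0 unused
}

// Evaluates the conditions left to right against an environment bit mask.
function evaluatePreprocessor(p: Preprocessor, env: number): boolean {
  let result = true; // an empty condition list enables the filter
  p.condition.forEach((mask, i) => {
    const value = (env & mask) === mask;
    if (i === 0) {
      result = value;
    } else {
      result = p.operators[i] ? result && value : result || value;
    }
  });
  return result;
}

class PreprocessorTable {
  private readonly byId = new Map<number, Preprocessor>();

  add(p: Preprocessor): void {
    this.byId.set(p.id, p);
  }

  // Filters without a reference (the vast majority) are always enabled.
  isEnabled(ref: number | undefined, env: number): boolean {
    if (ref === undefined) return true;
    const p = this.byId.get(ref);
    return p === undefined ? true : evaluatePreprocessor(p, env);
  }
}

// Usage: "firefox OR mobile", with bit values as in PREPROCESSOR_MASK.
const FIREFOX_BIT = 1 << 5;
const MOBILE_BIT = 1 << 2;
const CHROMIUM_BIT = 1 << 4;

const preprocessors = new PreprocessorTable();
preprocessors.add({ id: 1, condition: [FIREFOX_BIT, MOBILE_BIT], operators: [true, false] });

const enabledOnMobile = preprocessors.isEnabled(1, MOBILE_BIT);
const enabledOnChromium = preprocessors.isEnabled(1, CHROMIUM_BIT);
```

Many filters inside the same conditional block can share one Preprocessor entry, so each filter only stores the ID.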

Update 1: I made minimal changes to the filter parser to show how the last alternative would work: https://github.com/ghostery/adblocker/compare/master...seia-soto:adblocker:add-preprocessor?expand=1

The core changes are in lists.ts and preprocessor.ts.