mozilla / fathom

A framework for extracting meaning from web pages

Home Page: http://mozilla.github.io/fathom/

Warn when not all target elements are vectorized

erikrose opened this issue

The Vectorizer should at least warn if not all the target elements are picked up by the ruleset (and thus vectorized). This happens when the dom() calls are too selective. Right now, you don't realize this until your training run goes badly, and then you have to do a bunch of backtracking to figure out why.

Can you clarify what you mean by "target elements" (I see from the glossary that it's one that is given a "type"), and "when the dom() calls are too selective"? What is the problem downstream?

Ah, I think I understand what you mean, now that I'm filing all these issues from our centralized document. A target element is one that has been labeled with the data-fathom="${type}" attribute. You're suggesting that if such an element doesn't get pulled into the ruleset, that's a flag we can and should raise as early as possible.

Exactly.
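
For illustration, here is a minimal sketch of the kind of check being discussed. It is not Fathom's actual Vectorizer code: the function names, the `{id, type}` shape for labeled elements, and the list of vectorized ids are all hypothetical stand-ins. The idea is to compare the elements labeled with data-fathom against the ones the ruleset actually picked up, and to warn as soon as any are missing.

```javascript
// Hypothetical helpers for illustration only; none of these names are Fathom APIs.
// labeled: array of {id, type}, one per element carrying data-fathom="<type>".
// vectorizedIds: ids of the elements the ruleset matched and vectorized.

// Return the labeled elements that never made it into the vectors.
function findUnvectorizedTargets(labeled, vectorizedIds) {
  const seen = new Set(vectorizedIds);
  return labeled.filter((el) => !seen.has(el.id));
}

// Emit a single warning if any labeled element was missed, and return the misses.
function warnOnUnvectorizedTargets(labeled, vectorizedIds, warn = console.warn) {
  const missing = findUnvectorizedTargets(labeled, vectorizedIds);
  if (missing.length > 0) {
    const types = [...new Set(missing.map((el) => el.type))].join(', ');
    warn(
      `${missing.length} of ${labeled.length} labeled target element(s) ` +
      `were not vectorized (types: ${types}). ` +
      `The ruleset's dom() calls may be too selective.`
    );
  }
  return missing;
}
```

Running a check like this right after vectorization, rather than after training, would surface the problem before any backtracking is needed.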

"Right now, you don't realize this until your training run goes badly, and then you have to do a bunch of backtracking to figure out why."

And in the new tag-based world of 3.0, you don't even necessarily notice that anything is wrong.