[Feature Request]: Allow trafilatura kwargs for additional options
wheynelau opened this issue · comments
Wayne Lau commented
Feature Description
Trafilatura's extract function has options that tune what gets extracted, such as favoring recall or precision (favor_recall, favor_precision).
# trafilatura source code
def extract(filecontent, url=None, record_id=None, no_fallback=False,
            favor_precision=False, favor_recall=False,
            include_comments=True, output_format="txt",
            tei_validation=False, target_language=None,
            include_tables=True, include_images=False, include_formatting=False,
            include_links=False, deduplicate=False,
            date_extraction_params=None,
            only_with_metadata=False, with_metadata=False,
            max_tree_size=None, url_blacklist=None, author_blacklist=None,
            settingsfile=None, prune_xpath=None,
            config=DEFAULT_CONFIG, options=None,
            **kwargs):
# llama index source code:
response = trafilatura.extract(
    downloaded,
    include_comments=include_comments,
    output_format=output_format,
    include_tables=include_tables,
    include_images=include_images,
    include_formatting=include_formatting,
    include_links=include_links,
)
I propose accepting **kwargs in the reader and forwarding them to trafilatura.extract, so that the other arguments can be passed through. This is a simple quality-of-life change, and I will open a pull request soon.
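A minimal sketch of the forwarding idea, not the actual patch: the wrapper name extract_text and its extract_fn parameter are hypothetical, introduced only so the example runs without trafilatura installed. The six named options mirror the call shown above, with defaults assumed from the trafilatura signature; anything else is passed through untouched.

```python
from typing import Any, Callable, Optional


def extract_text(
    extract_fn: Callable[..., Optional[str]],  # hypothetical: e.g. trafilatura.extract
    downloaded: str,
    include_comments: bool = True,
    output_format: str = "txt",
    include_tables: bool = True,
    include_images: bool = False,
    include_formatting: bool = False,
    include_links: bool = False,
    **kwargs: Any,
) -> Optional[str]:
    """Forward the explicit options plus any extra keyword arguments."""
    return extract_fn(
        downloaded,
        include_comments=include_comments,
        output_format=output_format,
        include_tables=include_tables,
        include_images=include_images,
        include_formatting=include_formatting,
        include_links=include_links,
        # Extra options such as favor_recall or favor_precision end up here.
        **kwargs,
    )
```

With trafilatura installed, a call might then look like extract_text(trafilatura.extract, downloaded, favor_recall=True). Because the six existing options are named parameters, they never appear in kwargs, so forwarding cannot pass the same keyword twice.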
Reason
The existing approach works, but for some websites it would be useful to extract more text. My current workaround is to edit the source code.
Value of Feature
No response