run-llama / llama_index

LlamaIndex is a data framework for your LLM applications

Home Page: https://docs.llamaindex.ai

[Feature Request]: Allow trafilatura kwargs for additional options

wheynelau opened this issue

Feature Description

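Trafilatura's extract function accepts options that tune extraction, such as favouring recall or precision (favor_recall, favor_precision). LlamaIndex currently passes only a subset of these arguments through.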

# trafilatura source code
def extract(filecontent, url=None, record_id=None, no_fallback=False,
            favor_precision=False, favor_recall=False,
            include_comments=True, output_format="txt",
            tei_validation=False, target_language=None,
            include_tables=True, include_images=False, include_formatting=False,
            include_links=False, deduplicate=False,
            date_extraction_params=None,
            only_with_metadata=False, with_metadata=False,
            max_tree_size=None, url_blacklist=None, author_blacklist=None,
            settingsfile=None, prune_xpath=None,
            config=DEFAULT_CONFIG, options=None,
            **kwargs):

# llama index source code:
response = trafilatura.extract(
    downloaded,
    include_comments=include_comments,
    output_format=output_format,
    include_tables=include_tables,
    include_images=include_images,
    include_formatting=include_formatting,
    include_links=include_links,
)

I propose accepting **kwargs and forwarding them to trafilatura.extract so that the remaining arguments can be passed through. This is a simple quality-of-life change, and I will open a pull request soon.
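A minimal sketch of the forwarding pattern, not the actual LlamaIndex change: the helper name extract_with_options and its extract_fn parameter are hypothetical, introduced only so the example runs without trafilatura installed. In real use, extract_fn would be trafilatura.extract.

```python
from typing import Any, Callable, Optional


def extract_with_options(
    downloaded: str,
    extract_fn: Callable[..., Optional[str]],  # hypothetical: e.g. trafilatura.extract
    include_comments: bool = True,
    output_format: str = "txt",
    include_tables: bool = True,
    include_images: bool = False,
    include_formatting: bool = False,
    include_links: bool = False,
    **kwargs: Any,
) -> Optional[str]:
    """Call the extractor with the existing options plus any extra kwargs."""
    return extract_fn(
        downloaded,
        include_comments=include_comments,
        output_format=output_format,
        include_tables=include_tables,
        include_images=include_images,
        include_formatting=include_formatting,
        include_links=include_links,
        # Anything not named above (favor_recall, favor_precision,
        # deduplicate, ...) is forwarded to the extractor unchanged.
        **kwargs,
    )
```

With trafilatura installed, a call such as extract_with_options(html, trafilatura.extract, favor_recall=True) would reach the favor_recall option shown in the signature above. One caveat of this design: a misspelled option is only rejected if the underlying function rejects it, and since trafilatura.extract itself accepts **kwargs, a typo may go unnoticed.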

Reason

The existing options work, but on some websites it would be useful to extract more text, for example by favouring recall. My current workaround is to edit the LlamaIndex source code directly.

Value of Feature

No response