VinciGit00 / Scrapegraph-ai

Python scraper based on AI

Home Page: https://scrapegraphai.com

How could I remove part of the page content before sending it to the LLM?

davideuler opened this issue

Is your feature request related to a problem? Please describe.
Lots of LLMs support only 32k tokens, and many web pages have content that exceeds 32k tokens.
When I send the page content to an LLM such as Qwen or DeepSeek, the request always fails.

Describe the solution you'd like
A way to clean the HTML before sending it to the LLM. If I could remove some parts of the page HTML, the size could be reduced so that it would not exceed the maximum tokens for a model.

Describe alternatives you've considered
I tried different models and tried setting max_tokens.

Additional context
I am scraping some pages that have a lot of HTML content, but the important content is far less than 32k.
The header and footer of the page take up a lot of tokens in the HTML. I hope I can remove them before sending the page to the LLM.
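
For reference, the kind of pre-cleaning I have in mind looks roughly like this: strip tags such as header, footer, nav and script before the HTML is handed to the model. This is only a sketch using BeautifulSoup; the exact tags or selectors to drop are an assumption and depend on the site.

# Sketch: shrink a page before sending it to the LLM.
# The list of tags to drop is an assumption and depends on the site.
from bs4 import BeautifulSoup

def strip_page_chrome(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Drop elements that rarely carry the content we want to extract.
    for tag in soup(["script", "style", "header", "footer", "nav", "aside"]):
        tag.decompose()
    return str(soup)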

Hi, when it exceeds 32k our algorithm creates another API call, so it should not be a problem.
If you have errors, please send us the script and we can take a look.
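
(Roughly, "creates another API call" means the content is split into chunks that each fit the context window, and each chunk is sent in its own request. The snippet below only illustrates that idea with a crude characters-per-token heuristic; it is not the splitter scrapegraphai actually uses.)

# Illustration of chunking: split text so each piece fits a token budget.
# The 4-characters-per-token ratio is a rough heuristic, not an exact count.
def split_into_chunks(text: str, max_tokens: int = 32000, chars_per_token: int = 4):
    max_chars = max_tokens * chars_per_token
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

# Each chunk would then be sent to the model in its own API call.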


Thanks for your help, Vinci.

The following is the script. It is a boundary case: the user input is smaller than, but very close to, 32k, and I specified max_tokens as 4096. When max_tokens is not specified, it outputs only part of the expected links because it runs out of tokens. When I specify max_tokens as 4096, it fails with this error:

File "/Users/david/.pyenv/versions/3.10.13/envs/scraper/lib/python3.10/site-packages/openai/_base_client.py", line 921, in request
    return self._request(
  File "/Users/david/.pyenv/versions/3.10.13/envs/scraper/lib/python3.10/site-packages/openai/_base_client.py", line 1020, in _request
    raise self._make_status_error_from_response(err.response) from None
openai.BadRequestError: Error code: 400 - {'detail': "This model's maximum context length is 32768 tokens. However, you requested 36317 tokens (32221 in the messages, 4096 in the completion). Please reduce the length of the messages or completion."}
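
The numbers in the error add up exactly: a request must satisfy prompt tokens + max_tokens <= context window, and 32221 + 4096 = 36317 > 32768, so the API rejects the call before generating anything.

# The request must satisfy: prompt_tokens + max_tokens <= context_window.
context_window = 32768
prompt_tokens = 32221   # taken from the error message
max_tokens = 4096       # the completion budget set in graph_config
print(prompt_tokens + max_tokens)                    # 36317
print(prompt_tokens + max_tokens <= context_window)  # False -> 400 BadRequestError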

The script:

import os
from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info

load_dotenv()

deepseek_key = os.getenv("DEEPSEEK_KEY")

graph_config = {
    "llm": {
        "api_key": deepseek_key,
        "model": "deepseek-chat",
        "temperature": 0.7,
        "max_tokens": 4096,  # max output tokens limited to 4k, as for gpt-4o / gpt-4-turbo
        "base_url": "https://api.deepseek.com/v1",
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "temperature": 0,
        "base_url": "http://localhost:11434",  # set ollama URL
    },
    "headless": False,
    "verbose": True,
}

# ************************************************
# Create the SmartScraperGraph instance and run it
# ************************************************

smart_scraper_graph = SmartScraperGraph(
    prompt="extract all page title and links under,footage-card-wrapper as format of: [{\"title\": \"xxx\", \"link\":\"xxx\" }] ",
    # also accepts a string with the already downloaded HTML code
    source="https://stock.xinpianchang.com/footages/2997636.html",
    config=graph_config,
)

result = smart_scraper_graph.run()
print(result)
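
A possible workaround, since (as the comment in the script notes) source also accepts a string with already downloaded HTML: fetch and trim the page first, then pass the cleaned string in. This is only a sketch; strip_page_chrome is the hypothetical helper sketched earlier in this thread, and SmartScraperGraph and graph_config come from the script above.

# Workaround sketch: fetch and trim the page ourselves, then hand the
# cleaned HTML string to SmartScraperGraph as `source`.
# `strip_page_chrome` is the hypothetical helper sketched earlier;
# `graph_config` is the same config used in the script above.
import requests

page_html = requests.get("https://stock.xinpianchang.com/footages/2997636.html").text
cleaned_html = strip_page_chrome(page_html)

smart_scraper_graph = SmartScraperGraph(
    prompt="extract all page title and links under,footage-card-wrapper as format of: [{\"title\": \"xxx\", \"link\":\"xxx\" }] ",
    source=cleaned_html,  # a string with already downloaded HTML is accepted
    config=graph_config,
)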

please update to the new version

Thanks for your help. I updated to 1.7.3, and still got the same error:

openai.BadRequestError: Error code: 400 - {'detail': "This model's maximum context length is 32768 tokens. However, you requested 36294 tokens (32198 in the messages, 4096 in the completion). Please reduce the length of the messages or completion."}

Why an OpenAI error? Have you changed the provider?

The code is pasted above. The model I use is "deepseek-chat". I wonder what causes it to show OpenAI errors.

Two notes:

  • there is probably something in the request-chunking module that broke in a recent update; this is the second issue of this type (exceeding the token limit even though the model is supported) in less than a week
  • the OpenAI errors appear because DeepSeek is invoked through the OpenAI module in LangChain: LangChain does not provide direct support for DeepSeek, but DeepSeek models expose an OpenAI-like API, so they are called through the OpenAI client (see the sketch below)
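
For illustration, this is the usual pattern for reaching an OpenAI-compatible endpoint such as DeepSeek through LangChain's OpenAI integration, which is why the traceback goes through openai._base_client; the exact wrapper scrapegraphai builds internally may differ.

# Sketch of how an OpenAI-compatible provider like DeepSeek is typically
# reached through LangChain's OpenAI integration (hence the openai.* errors).
import os
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="deepseek-chat",
    api_key=os.getenv("DEEPSEEK_KEY"),
    base_url="https://api.deepseek.com/v1",  # OpenAI-like endpoint exposed by DeepSeek
    temperature=0.7,
    max_tokens=4096,
)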