predibase / lorax

Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs

Home Page: https://loraexchange.ai

JSON errors in `generate()` happen for certain base models (but not for others)

alexsherstinsky opened this issue · comments

System Info

@jeffreyftang I am finding that LoRAX runs into JSON errors when used (via the Predibase SDK) to prompt the "gemma-2b" and "mistral-7b" base models, but not "phi-2" or "zephyr-7b-beta" (although on one occasion "phi-2" failed as well), so which model triggers the error is inconsistent. When the error happens, the stack trace is:

>       result: GeneratedResponse = base_llm_deployment.generate(
            prompt=prompt,
            options=options,
        )

sdk/python/langchain/libs/community/langchain_community/llms/predibase.py:61: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <predibase.resource.llm.interface.LLMDeployment object at 0x3782fee60>
prompt = "Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that yo...ific tasks and log results.\nInstruction:\n\nQuestion: What are the approaches to Task Decomposition?\nHelpful Answer:"
options = {'details': False, 'max_new_tokens': 256, 'temperature': 0.1}

    def generate(
        self,
        prompt: str,
        options: Optional[Dict[str, Union[str, float]]] = None,
    ) -> GeneratedResponse:
        if not options:
            options = dict()
    
        # Need to do this since the lorax client sets this to True by default
        if "details" not in options:
            options["details"] = False
        options = self._override_adapter_options(options)
>       res = self.lorax_client.generate(prompt=prompt, **options)

sdk/python/predibase/resource/llm/interface.py:307: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <lorax.client.Client object at 0x3782fcca0>
prompt = "Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that yo...ific tasks and log results.\nInstruction:\n\nQuestion: What are the approaches to Task Decomposition?\nHelpful Answer:"
adapter_id = None, adapter_source = None, merged_adapters = None, api_token = None, do_sample = False, max_new_tokens = 256, ignore_eos_token = False, best_of = None, repetition_penalty = None, return_full_text = False, seed = None
stop_sequences = None, temperature = 0.1, top_k = None, top_p = None, truncate = None, typical_p = None, watermark = False, response_format = None, decoder_input_details = False, return_k_alternatives = None, details = False

    def generate(
        self,
        prompt: str,
        adapter_id: Optional[str] = None,
        adapter_source: Optional[str] = None,
        merged_adapters: Optional[MergedAdapters] = None,
        api_token: Optional[str] = None,
        do_sample: bool = False,
        max_new_tokens: Optional[int] = None,
        ignore_eos_token: bool = False,
        best_of: Optional[int] = None,
        repetition_penalty: Optional[float] = None,
        return_full_text: bool = False,
        seed: Optional[int] = None,
        stop_sequences: Optional[List[str]] = None,
        temperature: Optional[float] = None,
        top_k: Optional[int] = None,
        top_p: Optional[float] = None,
        truncate: Optional[int] = None,
        typical_p: Optional[float] = None,
        watermark: bool = False,
        response_format: Optional[Union[Dict[str, Any], ResponseFormat]] = None,
        decoder_input_details: bool = False,
        return_k_alternatives: Optional[int] = None,
        details: bool = True,
    ) -> Response:
        """
        Given a prompt, generate the following text
    
        Args:
            prompt (`str`):
                Input text
            adapter_id (`Optional[str]`):
                Adapter ID to apply to the base model for the request
            adapter_source (`Optional[str]`):
                Source of the adapter (hub, local, s3)
            merged_adapters (`Optional[MergedAdapters]`):
                Merged adapters to apply to the base model for the request
            api_token (`Optional[str]`):
                API token for accessing private adapters
            do_sample (`bool`):
                Activate logits sampling
            max_new_tokens (`Optional[int]`):
                Maximum number of generated tokens
            ignore_eos_token (`bool`):
                Whether to ignore EOS tokens during generation
            best_of (`int`):
                Generate best_of sequences and return the one with the highest token logprobs
            repetition_penalty (`float`):
                The parameter for repetition penalty. 1.0 means no penalty. See [this
                paper](https://arxiv.org/pdf/1909.05858.pdf) for more details.
            return_full_text (`bool`):
                Whether to prepend the prompt to the generated text
            seed (`int`):
                Random sampling seed
            stop_sequences (`List[str]`):
                Stop generating tokens if a member of `stop_sequences` is generated
            temperature (`float`):
                The value used to modulate the logits distribution.
            top_k (`int`):
                The number of highest probability vocabulary tokens to keep for top-k-filtering.
            top_p (`float`):
                If set to < 1, only the smallest set of most probable tokens with probabilities that add up to `top_p` or
                higher are kept for generation.
            truncate (`int`):
                Truncate inputs tokens to the given size
            typical_p (`float`):
                Typical Decoding mass
                See [Typical Decoding for Natural Language Generation](https://arxiv.org/abs/2202.00666) for more information
            watermark (`bool`):
                Watermarking with [A Watermark for Large Language Models](https://arxiv.org/abs/2301.10226)
            response_format (`Optional[Union[Dict[str, Any], ResponseFormat]]`):
                Optional specification of a format to impose upon the generated text, e.g.,:
                ```
                {
                    "type": "json_object",
                    "schema": {
                        "type": "string",
                        "title": "response"
                    }
                }
                ```
            decoder_input_details (`bool`):
                Return the decoder input token logprobs and ids
            return_k_alternatives (`int`):
                The number of highest probability vocabulary tokens to return as alternative tokens in the generation result
            details (`bool`):
                Return the token logprobs and ids for generated tokens
    
        Returns:
            Response: generated response
        """
        # Validate parameters
        parameters = Parameters(
            adapter_id=adapter_id,
            adapter_source=adapter_source,
            merged_adapters=merged_adapters,
            api_token=api_token,
            best_of=best_of,
            details=details,
            do_sample=do_sample,
            max_new_tokens=max_new_tokens,
            ignore_eos_token=ignore_eos_token,
            repetition_penalty=repetition_penalty,
            return_full_text=return_full_text,
            seed=seed,
            stop=stop_sequences if stop_sequences is not None else [],
            temperature=temperature,
            top_k=top_k,
            top_p=top_p,
            truncate=truncate,
            typical_p=typical_p,
            watermark=watermark,
            response_format=response_format,
            decoder_input_details=decoder_input_details,
            return_k_alternatives=return_k_alternatives
        )
        request = Request(inputs=prompt, stream=False, parameters=parameters)
    
        resp = requests.post(
            self.base_url,
            json=request.dict(by_alias=True),
            headers=self.headers,
            cookies=self.cookies,
            timeout=self.timeout,
        )
    
        # TODO: expose better error messages for 422 and similar errors
>       payload = resp.json()

/opt/homebrew/anaconda3/envs/predibase/lib/python3.10/site-packages/lorax/client.py:190: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <Response [503]>, kwargs = {}

    def json(self, **kwargs):
        r"""Returns the json-encoded content of a response, if any.
    
        :param \*\*kwargs: Optional arguments that ``json.loads`` takes.
        :raises requests.exceptions.JSONDecodeError: If the response body does not
            contain valid json.
        """
    
        if not self.encoding and self.content and len(self.content) > 3:
            # No encoding set. JSON RFC 4627 section 3 states we should expect
            # UTF-8, -16 or -32. Detect which one to use; If the detection or
            # decoding fails, fall back to `self.text` (using charset_normalizer to make
            # a best guess).
            encoding = guess_json_utf(self.content)
            if encoding is not None:
                try:
                    return complexjson.loads(self.content.decode(encoding), **kwargs)
                except UnicodeDecodeError:
                    # Wrong UTF codec detected; usually because it's not UTF-8
                    # but some other 8-bit codec.  This is an RFC violation,
                    # and the server didn't bother to tell us what codec *was*
                    # used.
                    pass
                except JSONDecodeError as e:
                    raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
    
        try:
            return complexjson.loads(self.text, **kwargs)
        except JSONDecodeError as e:
            # Catch JSON-related errors and raise as requests.JSONDecodeError
            # This aliases json.JSONDecodeError and simplejson.JSONDecodeError
>           raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
E           requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

/opt/homebrew/anaconda3/envs/predibase/lib/python3.10/site-packages/requests/models.py:975: JSONDecodeError

requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
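
Note the `self = <Response [503]>` at the decode site above: the server appears to return a 503 whose body is not JSON, and `resp.json()` then fails with the generic "Expecting value" message (the `TODO` in `lorax/client.py` already hints at this). A defensive decode along these lines (a sketch only, not the actual lorax client code) would surface the underlying server error instead:

```python
# Sketch only (not the actual lorax client code): guard the JSON decode so a
# non-JSON body, such as the 503 seen above, reports the server error instead
# of a bare JSONDecodeError.
import requests


def post_generate(url: str, payload: dict, headers=None, timeout: float = 60.0) -> dict:
    resp = requests.post(url, json=payload, headers=headers, timeout=timeout)
    try:
        body = resp.json()
    except requests.exceptions.JSONDecodeError:
        # Body is not JSON (empty body, HTML error page, etc.): include the
        # HTTP status and a snippet of the raw text in the error.
        raise RuntimeError(
            f"Server returned HTTP {resp.status_code} with a non-JSON body: {resp.text[:200]!r}"
        ) from None
    if resp.status_code != 200:
        raise RuntimeError(f"Server returned HTTP {resp.status_code}: {body}")
    return body
```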

Thank you.

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Run `generate()` on "mistral-7b".
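
A minimal sketch of the failing call follows. The `PredibaseClient` construction and deployment lookup are assumptions based on the Predibase SDK paths in the traceback; the `options` mirror the values captured there (`max_new_tokens=256`, `temperature=0.1`, with `details=False` added by the SDK wrapper):

```python
# Sketch of the failing call. The client/deployment lookup below is an
# assumption; the generate() options mirror those captured in the traceback.
from predibase import PredibaseClient

pc = PredibaseClient()  # assumes PREDIBASE_API_TOKEN is set in the environment
deployment = pc.LLMDeployment("pb://deployments/mistral-7b")  # serverless base model

result = deployment.generate(
    prompt="Question: What are the approaches to Task Decomposition?\nHelpful Answer:",
    options={"max_new_tokens": 256, "temperature": 0.1},
)
print(result)  # expected: a GeneratedResponse; instead the call raises JSONDecodeError
```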

Expected behavior

There should be no errors when running `generate()` on any supported Predibase serverless model.

Moving to predibase internal.