Add support for combined image and text embeddings using CLIP

Question

prmaxim opened this issue 4 months ago

Description

We use CLIP for product recommendations in e-commerce. We generate two vectors per product (one from the image, one from the name), concatenate them, and store the result in the Typesense (TS) embedding field. This gives us more accurate recommendations than the image embedding alone.

The CLIP API accepts multiple fields in one request and returns an array of embeddings:
[[image embedding], [text embedding]]
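
As a rough illustration of the client-side step described above, this Python sketch flattens a response of that shape into a single vector. The function name `combine_embeddings` and the sample values are hypothetical; the only thing taken from this issue is the `[[image embedding], [text embedding]]` layout.

```python
from typing import List, Sequence


def combine_embeddings(clip_response: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate the image and text embeddings into one vector.

    Assumes clip_response has the shape [[image embedding], [text embedding]].
    """
    if len(clip_response) != 2:
        raise ValueError("expected exactly two embeddings: image and text")
    image_embedding, text_embedding = clip_response
    return list(image_embedding) + list(text_embedding)


# Toy 3-dimensional vectors for illustration; real CLIP embeddings are much longer.
response = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
combined = combine_embeddings(response)
print(len(combined))
```

The combined vector is what we currently write to the embedding field ourselves.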

TS now natively supports CLIP for image embeddings, but it does not allow creating an embedding from multiple fields.

Note: issue #1291 looks broader and also covers this specific case of combining CLIP embeddings.

Steps to reproduce

Create a collection with an image field and a text field:

{
  "name": "Images",
  "fields": [
    {
      "name": "name",
      "type": "string"
    },
    {
      "name": "image",
      "type": "image",
      "store": false
    },
    {
      "name": "embedding",
      "type": "float[]",
      "embed": {
        "from": [
          "image",
          "name"
        ],
        "model_config": {
          "model_name": "ts/clip-vit-b-p32"
        }
      }
    }
  ]
}

Actual Behavior

Error: Only one field can be used in the embed.from property of an embed field when embedding from an image field.
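
For comparison, here is a minimal sketch of a schema for the client-side workaround mentioned in the description: the embedding field is a plain vector field with no `embed` block, and the concatenated vector is sent with each document. The `num_dim` value is an assumption (512 dimensions per CLIP ViT-B/32 embedding, so 1024 after concatenation) and should be checked against the actual model output.

```python
import json

# Assumption: each CLIP ViT-B/32 embedding has 512 dimensions, so the
# concatenated image + text vector has 1024. Verify against your model output.
CLIP_DIM = 512

schema = {
    "name": "Images",
    "fields": [
        {"name": "name", "type": "string"},
        # Plain vector field without an "embed" block: the vector is computed
        # client-side and supplied with each document.
        {"name": "embedding", "type": "float[]", "num_dim": 2 * CLIP_DIM},
    ],
}

print(json.dumps(schema, indent=2))
```

This avoids the error but moves embedding generation out of Typesense, which is what this request is trying to eliminate.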

Metadata

Typesense Version: 0.26.0.rc58