wangyuxinwhy / uniem

unified embedding model


Does SentenceTransformer apply any optimizations in its encode method?

CopyNinja1999 opened this issue

I'm calling the m3e model from Hugging Face directly:

import numpy as np
import torch
from tqdm import tqdm
from typing import cast


def generate_batch(data: list[str], batch_size: int):
    # Yield successive slices of `batch_size` sentences
    # (defined here so the snippet is self-contained).
    for i in range(0, len(data), batch_size):
        yield data[i : i + batch_size]


class HuggingfaceModel:
    def __init__(
        self,
        model_name: str = 'moka-ai/m3e-small',
        device: str | None = None,
    ) -> None:
        from transformers import AutoModel, AutoTokenizer  # type: ignore

        if device is None:
            self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
        else:
            self.device = device
        self.model = AutoModel.from_pretrained(model_name)
        self.model.to(self.device)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model.eval()

    def encode(self, sentences: list[str], batch_size: int = 32, **kwargs) -> list[np.ndarray]:
        all_embeddings: list[np.ndarray] = []
        # Round up so the progress bar also counts a final partial batch.
        total = (len(sentences) + batch_size - 1) // batch_size
        for batch_texts in tqdm(generate_batch(sentences, batch_size), total=total):
            inputs = self.tokenizer(
                batch_texts,
                padding=True,
                truncation=True,
                return_tensors='pt',
                max_length=512,
            )
            inputs = inputs.to(self.device)
            with torch.no_grad():
                outputs = self.model(**inputs, output_hidden_states=True)
                # CLS pooling: first token of the last hidden layer. The original
                # .squeeze() is dropped: with a batch of one sentence it collapses
                # the tensor to 1-D and breaks the extend() below.
                embeddings = outputs.hidden_states[-1][:, 0, :]
            embeddings = cast(torch.Tensor, embeddings)
            all_embeddings.extend(embeddings.cpu().numpy())
        return all_embeddings

I've noticed that calling and encoding the model this way scores about 0.2% lower than calling the same model through SentenceTransformer. What does SentenceTransformer improve on compared with this encoding approach? I've looked through it myself and couldn't spot the trick. Thanks!
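One concrete difference worth ruling out: the snippet above takes the CLS vector (hidden_states[-1][:, 0, :]), whereas SentenceTransformer applies whatever pooling module is stored alongside the checkpoint (in 1_Pooling/config.json), which for many embedding models is mean pooling over non-padding tokens. A minimal mean-pooling sketch for comparison (the helper name is mine, not from either library):

import torch

def mean_pool(last_hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # Average the last-layer token states, excluding padding positions.
    mask = attention_mask.unsqueeze(-1).to(last_hidden.dtype)
    summed = (last_hidden * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts

If the checkpoint's pooling config specifies mean pooling, swapping this in for the CLS slice should close most of the gap; if it specifies CLS, pooling can be ruled out.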

Honestly, I haven't studied SentenceTransformer in much depth either, so I don't have any leads. Is 0.2% statistically significant?
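As an aside, if both pipelines were scored per example on the same evaluation set, a paired bootstrap over the score differences is one way to judge whether a 0.2% gap is noise (a generic sketch, not tied to any particular benchmark):

import numpy as np

def paired_bootstrap(scores_a: np.ndarray, scores_b: np.ndarray, n_resamples: int = 10_000) -> float:
    # Approximate one-sided p-value: the fraction of resamples in which
    # pipeline A's mean advantage over pipeline B disappears.
    rng = np.random.default_rng(0)
    diffs = scores_a - scores_b
    wins = 0
    for _ in range(n_resamples):
        sample = rng.choice(diffs, size=len(diffs), replace=True)
        if sample.mean() > 0:
            wins += 1
    return 1 - wins / n_resamples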

@wangyuxinwhy In theory they should be identical, which is why I'm checking whether the encode method uses some trick. The drop shows up on every task, so I suspect something systematic.
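A direct way to test the "in theory identical" assumption is to compare the two pipelines' vectors on the same inputs: cosine similarities noticeably below 1.0 would point to a systematic difference (pooling or normalization) rather than noise. A sketch, assuming the HuggingfaceModel class above:

import numpy as np
from sentence_transformers import SentenceTransformer

texts = ['这是一个测试句子', 'this is a test sentence']
st_emb = SentenceTransformer('moka-ai/m3e-small').encode(texts)         # (n, dim) ndarray
hf_emb = np.stack(HuggingfaceModel('moka-ai/m3e-small').encode(texts))  # (n, dim) ndarray

cos = (st_emb * hf_emb).sum(-1) / (
    np.linalg.norm(st_emb, axis=-1) * np.linalg.norm(hf_emb, axis=-1)
)
print(cos)  # values near 1.0 mean the two encode paths agree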

Understood. If you reach any conclusions, feel free to share~