Does SentenceTransformer apply any optimizations in its encode method?
CopyNinja1999 opened this issue · comments
CopyNinja1999 commented
I call the m3e model from Hugging Face directly:
```python
from typing import cast

import numpy as np
import torch
from tqdm import tqdm
from transformers import AutoModel, AutoTokenizer  # type: ignore


def generate_batch(items: list[str], batch_size: int):
    # Minimal stand-in for the batching helper used in the original snippet.
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]


class HuggingfaceModel:
    def __init__(
        self,
        model_name: str = 'moka-ai/m3e-small',
        device: str | None = None,
    ) -> None:
        if device is None:
            self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
        else:
            self.device = device
        self.model = AutoModel.from_pretrained(model_name)
        self.model.to(self.device)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model.eval()

    def encode(self, sentences: list[str], batch_size: int = 32, **kwargs) -> list[np.ndarray]:
        all_embeddings: list[np.ndarray] = []
        for batch_texts in tqdm(
            generate_batch(sentences, batch_size),
            total=(len(sentences) + batch_size - 1) // batch_size,  # ceil, so the bar is exact
        ):
            inputs = self.tokenizer(
                batch_texts,
                padding=True,
                truncation=True,
                return_tensors='pt',
                max_length=512,
            )
            inputs = inputs.to(self.device)
            with torch.no_grad():
                outputs = self.model(**inputs, output_hidden_states=True)
            # [CLS] token of the last hidden layer; keep the batch dimension
            # (the original `.squeeze()` breaks when the final batch has one item).
            embeddings = cast(torch.Tensor, outputs.hidden_states[-1][:, 0, :])
            all_embeddings.extend(embeddings.cpu().numpy())
        return all_embeddings
```
I observe that calling and encoding this way gives roughly a 0.2 percentage-point score gap compared to calling the model through SentenceTransformer. I'd like to ask what improvements SentenceTransformer makes over this kind of encoding; I looked through it myself but couldn't spot the trick. Thanks!
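One candidate explanation (an assumption, not confirmed in this thread): a SentenceTransformer pipeline bundles a pooling module with the transformer, and many embedding models pool with an attention-mask-weighted mean over all token states rather than taking the `[CLS]` token as the snippet above does. A minimal sketch of that pooling, with hypothetical tensor names:

```python
import torch

def mean_pool(last_hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings, ignoring padded positions.

    last_hidden:    [batch, seq_len, hidden]
    attention_mask: [batch, seq_len], 1 for real tokens, 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).type_as(last_hidden)  # [batch, seq_len, 1]
    summed = (last_hidden * mask).sum(dim=1)                  # padded positions contribute zero
    counts = mask.sum(dim=1).clamp(min=1e-9)                  # avoid division by zero
    return summed / counts
```

If the reference pipeline pools this way, swapping it for `[CLS]` extraction would shift scores slightly on every task, which matches the symptom described.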
yuxin.wang commented
I honestly haven't studied SentenceTransformer very deeply either, so I don't have any leads. Is a 0.2 percentage-point gap statistically significant?
CopyNinja1999 commented
@wangyuxinwhy In theory the two should be identical, so I'm comparing them to see whether there is some trick in the encode step. The scores drop on every task, which is why I'm suspicious.
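For what it's worth, one trick commonly found inside an `encode` method (I'm assuming SentenceTransformer does something similar; this is not verified here) is sorting sentences by length before batching so that each batch pads to a similar length, then restoring the original order afterwards. A sketch with a hypothetical `encode_batch` callable:

```python
import numpy as np

def encode_sorted(sentences: list[str], encode_batch, batch_size: int = 32) -> list:
    """Batch by descending length to minimize padding; return results in input order.

    encode_batch: any callable mapping a list of texts to a list of embeddings.
    """
    order = np.argsort([-len(s) for s in sentences])  # longest first
    embeddings: dict[int, object] = {}
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        batch = [sentences[i] for i in idx]
        for i, emb in zip(idx, encode_batch(batch)):
            embeddings[int(i)] = emb
    return [embeddings[i] for i in range(len(sentences))]  # original input order
```

Note this mainly changes throughput, not scores, so by itself it would not explain a quality gap; differences in pooling, normalization, or truncation length are likelier culprits for that.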
yuxin.wang commented
Got it. If you reach any conclusions, please share them~