How to Use Custom SpaCy Model (beki/en_spacy_pii_distilbert) with Anonymize and Sensitive Scanners

Question

rakendd opened this issue 4 months ago

Hello llm_guard Team,

I've been exploring the use of custom models with the Anonymize and Sensitive scanners within the llm_guard library, as mentioned in the changelog for the latest release. Specifically, I'm interested in integrating the SpaCy model beki/en_spacy_pii_distilbert for PII detection tasks.

Objective
My goal is to leverage the beki/en_spacy_pii_distilbert model, which is not a traditional Hugging Face Transformer model but rather a SpaCy model, for enhanced PII detection accuracy and reduced latency as highlighted in your changelog.

Issue
I encountered difficulties when attempting to load and use this SpaCy model with the Anonymize scanner. Typically, the process for integrating models relies on specifying a model path or configuration that is compatible with Hugging Face's Transformer models. However, given that beki/en_spacy_pii_distilbert is a SpaCy model, the standard approach doesn't seem to apply.

Attempts
Here's an outline of my approach so far, based on the available documentation and examples:

Model Specification: Attempted to specify beki/en_spacy_pii_distilbert directly as a model path or through a configuration dictionary.
Custom Recognizer: Explored creating a custom recognizer to wrap the SpaCy model loading and analysis logic.
Adapter Pattern: Considered using an adapter to bridge the gap between the expected input/output formats of the llm_guard scanners and the SpaCy model.
The last approach partly works, but I would like to know the best practice for using this model inside llm_guard.

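from llm_guard.input_scanners import Anonymize
from llm_guard.vault import Vault

# CustomSpacyRecognizer and CustomRecognizerAdapter are my own classes (definitions not shown here).
custom_recognizer = CustomSpacyRecognizer()
adapter = CustomRecognizerAdapter(custom_recognizer=custom_recognizer)

vault = Vault()
scanner = Anonymize(
    vault=vault,
    language="en",
    use_faker=True,
    custom_recognizer=adapter  # Passing the adapter as the custom recognizer
)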

Could you provide guidance or examples on how to correctly integrate a SpaCy model like beki/en_spacy_pii_distilbert with the Anonymize and Sensitive scanners?

Thank you for developing llm_guard and for your support in enhancing its capabilities. I look forward to your advice on integrating SpaCy models for improved PII detection.

Best regards,
Rakend

Oleksandr Yaremchuk · Answer 1 · Fri Mar 22 2024 16:32:15 GMT+0800 (China Standard Time)

Hey @rakendd, thanks for reaching out. We used to have this model but then realized that it blocked updates to the latest transformers because of its dependency on "spacy-transformers>=1.1.8,<1.2.0".

https://llm-guard.com/changelog/#030-2023-10-14

I think that if this model can be updated, we could make another custom recognizer or just use the spaCy one as we did before.
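
For anyone experimenting with the custom-recognizer route in the meantime, here is a rough, dependency-free sketch of what such an adapter might look like. It is not llm_guard's or Presidio's API: `SpacyPiiAdapter`, `PiiSpan`, `label_map` and `default_score` are hypothetical names made up for illustration. The only spaCy-specific assumptions are that calling the pipeline returns an object with `.ents`, and that each entity exposes `.label_`, `.start_char` and `.end_char`. spaCy's NER does not attach a confidence score to entities by default, so a fixed placeholder score is used.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass(frozen=True)
class PiiSpan:
    """One detected PII span (hypothetical result type, not a library class)."""
    entity_type: str
    start: int
    end: int
    score: float


class SpacyPiiAdapter:
    """Hypothetical adapter that turns spaCy-style entities into PII spans."""

    def __init__(
        self,
        nlp: Callable[[str], Any],
        label_map: Optional[Dict[str, str]] = None,
        default_score: float = 0.85,
    ) -> None:
        self.nlp = nlp                      # any callable returning an object with .ents
        self.label_map = label_map or {}    # model label -> scanner entity name (assumed mapping)
        self.default_score = default_score  # placeholder: no per-entity score is available

    def analyze(self, text: str, entities: Optional[List[str]] = None) -> List[PiiSpan]:
        """Run the pipeline and return spans, optionally filtered by entity type."""
        doc = self.nlp(text)
        spans: List[PiiSpan] = []
        for ent in doc.ents:
            # Unmapped labels are passed through unchanged.
            entity_type = self.label_map.get(ent.label_, ent.label_)
            if entities is not None and entity_type not in entities:
                continue
            spans.append(
                PiiSpan(entity_type, ent.start_char, ent.end_char, self.default_score)
            )
        return spans
```

With spaCy and the model package installed, `nlp` would presumably be the pipeline returned by `spacy.load(...)`. The spans would still need to be converted into whatever result type the scanner expects, which depends on the llm_guard version and is not shown here.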