Adding a text processor to process entire documents?
terion-name opened this issue · comments
Hello
First of all, the service/translate demo is outdated and doesn't work; it would be great to update it. But I referred to the ingestor and figured out how to start a service.
class ServiceWorker(Worker):
    def _analyze(self, dataset, task):
        entity_ids = set(task.payload.get("entity_ids"))
        analyzer = None
        for entity in dataset.partials(entity_id=entity_ids):
            log.info(entity)
            print(entity)
        return list(entity_ids)

    def handle(self, task):
        apply_task_context(task)
        name = task.context.get("ftmstore", task.job.dataset.name)
        dataset = get_dataset(name, task.stage.stage)
        log.info(task.stage.stage)
        log.info("PROC: %r", task.payload)
        entity_ids = self._analyze(dataset, task)
        payload = {"entity_ids": entity_ids}
        self.dispatch_pipeline(task, payload)
But...
I've added it to the pipeline with ALEPH_INGEST_PIPELINE=analyze:myproc, and from the logs I see that my processor does not get entire documents, only parts already processed by the default analyzer.
For example, I uploaded an .eml file, and my processor does not receive the entire email content, only some stripped parts of it (not even all of them).
Moreover, task.payload contains an array of 14 entity IDs (for a single .eml file), while iterating with for entity in dataset.partials(entity_id=entity_ids): I get only 4 entries.
What I need is to receive the entire extracted/parsed text and image data from files, so that I can extract entities (FtM models) that the default analyzer cannot. E.g. if it is an email, the entire email text; if it is a PDF or Word file, the entire document text.
How can this be done?
Hi @terion-name, when you upload a document, multiple different entities will be emitted in addition to the entity representing the email (for example entities representing the sender and recipient of the email).
You might want to test for the entity type to ensure that your custom processor handles Email entities only. Something like this (untested):
def _analyze(self, dataset, task):
    entity_ids = set(task.payload.get("entity_ids"))
    for entity in dataset.partials(entity_id=entity_ids):
        if not entity.schema.is_a("Email"):
            continue
        print(entity)
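If partials() yields individual fragments rather than complete entities (an assumption based on its name and on the stripped parts described above), one way to get the full text is to merge the fragments that share an entity ID before reading text properties. The following is only a minimal, self-contained sketch of that idea: Fragment, merge_fragments and collect_text are hypothetical stand-ins, not part of ftmstore or followthemoney, and it assumes the text lives in a bodyText property.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Fragment:
    """Stand-in for one partial entity; NOT the real ftmstore/followthemoney type."""
    id: str
    schema: str
    properties: dict = field(default_factory=dict)


def merge_fragments(fragments):
    """Combine fragments that share an ID into one property map per entity."""
    merged = {}
    for frag in fragments:
        entry = merged.setdefault(
            frag.id, {"schema": frag.schema, "properties": defaultdict(list)}
        )
        for name, values in frag.properties.items():
            for value in values:
                # Keep insertion order, skip duplicate values.
                if value not in entry["properties"][name]:
                    entry["properties"][name].append(value)
    return merged


def collect_text(fragments, schema="Email", prop="bodyText"):
    """Return {entity_id: joined text} for entities of the given schema.

    Simplification: the schema is compared by exact name here, whereas the
    real code would use entity.schema.is_a(...) to include sub-schemata.
    """
    texts = {}
    for entity_id, entry in merge_fragments(fragments).items():
        if entry["schema"] != schema:
            continue
        texts[entity_id] = "\n".join(entry["properties"].get(prop, []))
    return texts


# Made-up sample data: three fragments of one email plus an unrelated entity.
fragments = [
    Fragment("mail-1", "Email", {"subject": ["Hello"]}),
    Fragment("mail-1", "Email", {"bodyText": ["First part."]}),
    Fragment("person-1", "Person", {"name": ["Alice"]}),
    Fragment("mail-1", "Email", {"bodyText": ["Second part."]}),
]
result = collect_text(fragments)
print(result)
```

In a real worker the same grouping would be applied to whatever the store returns for the task's entity IDs; whether the store already offers a merged view is worth checking before hand-rolling this.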
@tillprochaska thank you!
One more question, if you don't mind: how do I get the contents of media files (image, audio, video, etc.)?
Check out the implementation in ingest-file here: https://github.com/alephdata/ingest-file/blob/main/ingestors/manager.py#L161-L165
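As a rough illustration of the general pattern, assuming the linked code resolves the entity's contentHash property against a file archive and gets back a local path: the sketch below uses a hypothetical LocalArchive class and a plain dict in place of the real archive and entity objects, and SHA-1 is used only for the example.

```python
import hashlib
import tempfile
from pathlib import Path


class LocalArchive:
    """Hypothetical content-addressed store; stands in for the real file archive."""

    def __init__(self, root):
        self.root = Path(root)

    def archive_bytes(self, data):
        # Store the blob under its hash and return that hash.
        content_hash = hashlib.sha1(data).hexdigest()
        (self.root / content_hash).write_bytes(data)
        return content_hash

    def load_file(self, content_hash):
        # Return a local path for the hash, or None if it is unknown.
        path = self.root / content_hash
        return path if path.exists() else None


def read_media(archive, entity):
    """Return raw bytes for the first resolvable contentHash, else None."""
    for content_hash in entity.get("contentHash", []):
        path = archive.load_file(content_hash)
        if path is not None:
            return path.read_bytes()
    return None


with tempfile.TemporaryDirectory() as tmp:
    archive = LocalArchive(tmp)
    digest = archive.archive_bytes(b"fake image bytes")
    # A plain dict stands in for an Image entity carrying a contentHash value.
    entity = {"schema": "Image", "contentHash": [digest]}
    data = read_media(archive, entity)
    missing = read_media(archive, {"schema": "Image", "contentHash": ["0" * 40]})
```

The point of the design is that the entity only carries a hash; the bytes are fetched on demand, so a processor that needs image, audio or video content has to go through the archive rather than the entity properties.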
I hope this helps! I’m going to close this issue to make it easier to keep track of bugs and feature requests vs. questions. Feel free to add further comments though.
@tillprochaska thank you for help)