alephdata / aleph

Search and browse documents and data; find the people and companies you look for.

Home Page: http://docs.aleph.occrp.org

Adding a text processor to process entire documents?

terion-name opened this issue

Hello
First of all, service/translatedemo is outdated and doesn't work; it would be great to update it. But I referred to the ingestor and figured out how to start a service.

class ServiceWorker(Worker):
    def _analyze(self, dataset, task):
        # IDs of the entities handed over by the previous pipeline stage
        entity_ids = set(task.payload.get("entity_ids"))
        for entity in dataset.partials(entity_id=entity_ids):
            log.info(entity)
        return list(entity_ids)

    def handle(self, task):
        apply_task_context(task)
        name = task.context.get("ftmstore", task.job.dataset.name)
        dataset = get_dataset(name, task.stage.stage)
        log.info(task.stage.stage)
        log.info("PROC: %r", task.payload)
        entity_ids = self._analyze(dataset, task)
        payload = {"entity_ids": entity_ids}
        self.dispatch_pipeline(task, payload)

But...
I've added it to the pipeline with ALEPH_INGEST_PIPELINE=analyze:myproc, and from the logs I see that my processor does not get entire documents, only parts already processed by the default analyzer.
For example, I uploaded an .eml file, and my processor does not receive the entire email content, only some stripped parts of it (and not even all of them).
Moreover, task.payload contains an array of 14 entity IDs (for a single .eml file), while the loop for entity in dataset.partials(entity_id=entity_ids): yields only 4 entries.

What I need is to receive the entire extracted/parsed text and image data from files, so that I can extract entities (FtM models) that the default analyzer cannot. E.g. if it is an email, the entire email text; if it is a PDF or Word file, the entire document text.

How can this be done?

Hi @terion-name, when you upload a document, multiple different entities will be emitted in addition to the entity representing the email (for example entities representing the sender and recipient of the email).

You might want to test for the entity type to ensure that your custom processor handles Email entities only. Something like this (untested):

def _analyze(self, dataset, task):
    entity_ids = set(task.payload.get("entity_ids"))
    for entity in dataset.partials(entity_id=entity_ids):
        if not entity.schema.is_a("Email"):
            continue

        print(entity)
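
To illustrate the filtering idea in isolation, here is a minimal sketch that runs without Aleph. StubSchema, StubEntity and collect_text are hypothetical stand-ins written for this example; they only mimic the entity.schema.is_a(...) and entity.get(...) calls. The property name "bodyText" and the merging of several fragments per entity ID are assumptions about how the stored data looks, so check them against your own dataset.

```python
from dataclasses import dataclass, field


@dataclass
class StubSchema:
    """Hypothetical stand-in for a schema object with an is_a() check."""
    name: str
    ancestors: frozenset = frozenset()

    def is_a(self, other: str) -> bool:
        # True for the schema itself or any of its parent schemata
        return other == self.name or other in self.ancestors


@dataclass
class StubEntity:
    """Hypothetical stand-in for an entity fragment."""
    id: str
    schema: StubSchema
    properties: dict = field(default_factory=dict)

    def get(self, prop: str) -> list:
        # Missing properties yield an empty list
        return list(self.properties.get(prop, []))


def collect_text(entities, schema_name="Email", prop="bodyText"):
    """Join the text values of matching entities, grouped by entity ID."""
    texts = {}
    for entity in entities:
        if not entity.schema.is_a(schema_name):
            continue  # skip senders, recipients and other emitted entities
        texts.setdefault(entity.id, []).extend(entity.get(prop))
    return {eid: "\n".join(parts) for eid, parts in texts.items()}
```

Grouping by ID matters if the same entity arrives as more than one fragment, which may be one reason the number of yielded entries differs from the number of IDs in the payload.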

@tillprochaska thank you!
One more question, if you don't mind: how do I get the contents of media files (image, audio, video, etc.)?

Check out the implementation in ingest-file here: https://github.com/alephdata/ingest-file/blob/main/ingestors/manager.py#L161-L165

I hope this helps! I’m going to close this issue to make it easier to keep track of bugs and feature requests vs. questions. Feel free to add further comments though.

@tillprochaska thank you for the help)