microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities

Home Page: https://aka.ms/GeneralAI

Does LayoutLMv3 have to depend on Detectron2?

7fantasysz opened this issue · comments

I am using LayoutLMv3.

If I am not interested in the layout analysis/object detection task, but only in the form recognition and document classification tasks, could I be spared the Detectron2 installation? Detectron2 is hard to install on a VM without a direct public internet connection, and the installation adds unnecessary overhead to our pipeline, which runs every day.

Hi, thanks for your question!

The current version of the unilm/layoutlmv3 implementation uses Detectron2 for two purposes:

  1. To load images in datasets (e.g., FUNSD, CORD).
     You can avoid installing Detectron2 (reference) by changing the following code

     from detectron2.data.detection_utils import read_image
     from detectron2.data.transforms import ResizeTransform, TransformList

     def load_image(image_path):
         image = read_image(image_path, format="BGR")
         h = image.shape[0]
         w = image.shape[1]
         img_trans = TransformList([ResizeTransform(h=h, w=w, new_h=224, new_w=224)])
         image = torch.tensor(img_trans.apply_image(image).copy()).permute(2, 0, 1)  # copy to make it writeable
         return image, (w, h)

     to

     from PIL import Image

     def load_image(image_path):
         image = Image.open(image_path).convert("RGB")
         w, h = image.size
         return image, (w, h)

  2. To support detection tasks.
     The current version of the unilm/layoutlmv3 implementation sets detection=False, so the detection components are not used. Removing all detection-related code from modeling_layoutlmv3.py will also work. For example, @NielsRogge removed the is_detection logic in this PR.

@HYPJUDY

Your answer is really helpful! That's what I was looking for. Two follow-up questions:

  1. After removing this dependency, would it affect the accuracy on the different tasks (form/receipt understanding, image classification, DocVQA)? If so, do you have metrics on how much the accuracy differs from the numbers in the paper?

  2. From reading the paper, my understanding is that Detectron2 is only needed to fine-tune and compare with other models on the PubLayNet dataset, and is not a fundamental part of LayoutLMv3 for other tasks, right?

Thanks for your help!

I'm glad it helped.

  1. The two code snippets should be equivalent, so switching from one to the other should not affect accuracy. I haven't verified this experimentally myself, but @NielsRogge's experimental results (e.g., on FUNSD) support this conclusion.
  2. You are right.
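As an aside, if you want to keep both code paths around, a guarded import makes Detectron2 a purely optional dependency, so the module still imports cleanly on machines where it isn't installed. This is a hypothetical sketch of the pattern, not code from the repo:

```python
# Hypothetical sketch: treat Detectron2 as an optional dependency.
try:
    from detectron2.data.detection_utils import read_image  # only needed for detection
    HAS_DETECTRON2 = True
except ImportError:
    HAS_DETECTRON2 = False

def load_image(image_path, use_detectron2=False):
    if use_detectron2:
        if not HAS_DETECTRON2:
            raise ImportError(
                "Detectron2 is required for detection tasks; "
                "install it or call load_image with use_detectron2=False."
            )
        image = read_image(image_path, format="BGR")
        return image, (image.shape[1], image.shape[0])
    # Lazy import so PIL is only needed on this code path
    from PIL import Image
    image = Image.open(image_path).convert("RGB")
    return image, image.size
```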

That's great to know. Also appreciate your insightful research work, which is the key enabler of our project. Thank you!

My pleasure : ) Good luck with your project!