microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities

Home Page: https://aka.ms/GeneralAI

Does LayoutLMv3 have to depend on Detectron2?

7fantasysz opened this issue · comments

I am using LayoutLMv3.

If I am not interested in the layout analysis/object detection task, but only in the form recognition and document classification tasks, could I be spared the Detectron2 installation? Detectron2 is hard to install on a VM without a direct public internet connection, and the installation adds unnecessary overhead to our pipeline, which runs every day.

Hi, thanks for your question!

The current version of the unilm/layoutlmv3 implementation uses Detectron2 for two purposes:

  1. To load images in datasets (e.g., FUNSD, CORD).
     You can avoid installing Detectron2 (reference) by changing the following code

     from detectron2.data.detection_utils import read_image
     from detectron2.data.transforms import ResizeTransform, TransformList

     def load_image(image_path):
         image = read_image(image_path, format="BGR")
         h = image.shape[0]
         w = image.shape[1]
         img_trans = TransformList([ResizeTransform(h=h, w=w, new_h=224, new_w=224)])
         image = torch.tensor(img_trans.apply_image(image).copy()).permute(2, 0, 1)  # copy to make it writeable
         return image, (w, h)

     to

     from PIL import Image

     def load_image(image_path):
         image = Image.open(image_path).convert("RGB")
         w, h = image.size
         return image, (w, h)

  2. To support detection tasks.
     The current version of the unilm/layoutlmv3 implementation sets detection=False, so the detection components are not used. Removing all detection-related code from modeling_layoutlmv3.py will also work. For example, @NielsRogge removed the is_detection logic in this PR.

@HYPJUDY

Your answer is really helpful! That's what I was looking for. Two follow-up questions:

  1. After removing this dependency, would it affect the accuracy on the different tasks (form/receipt understanding, image classification, DocVQA)? If so, do you have metrics on how much the accuracy differs from the numbers in the paper?

  2. From reading the paper, my understanding is that Detectron2 is only needed to fine-tune and compare with other models on the PubLayNet dataset, and is not a fundamental part of LayoutLMv3 for other tasks, right?

Thanks for your help!

I'm glad it helped.

  1. The two code snippets should be equivalent, so switching from one to the other should not affect accuracy. I haven't verified this experimentally myself, but @NielsRogge's experimental results (e.g., on FUNSD) support this conclusion.
  2. You are right.
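As an aside, if you want to keep both code paths around, a guarded import makes Detectron2 a purely optional dependency, so the module still imports cleanly on machines where it isn't installed. This is a hypothetical sketch of the pattern, not code from the repo:

```python
# Hypothetical sketch: treat Detectron2 as an optional dependency.
try:
    from detectron2.data.detection_utils import read_image  # only needed for detection
    HAS_DETECTRON2 = True
except ImportError:
    HAS_DETECTRON2 = False

def load_image(image_path, use_detectron2=False):
    if use_detectron2:
        if not HAS_DETECTRON2:
            raise ImportError(
                "Detectron2 is required for detection tasks; "
                "install it or call load_image with use_detectron2=False."
            )
        image = read_image(image_path, format="BGR")
        return image, (image.shape[1], image.shape[0])
    # Lazy import so PIL is only needed on this code path
    from PIL import Image
    image = Image.open(image_path).convert("RGB")
    return image, image.size
```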

That's great to know. Also appreciate your insightful research work, which is the key enabler of our project. Thank you!

My pleasure : ) Good luck with your project!