Add documentlayoutsegmentation_YOLOv8_ondoclaynet
Maxzurek opened this issue
Model description
I originally wanted to train a document layout analysis model using dit-base, but since I didn't have enough compute power I looked for alternatives.
I stumbled upon this model and got great results when testing it. The model uses the YOLOv8x (extra-large) architecture.
Prerequisites
- The model is supported in Transformers (i.e., listed here) (I am unsure)
- The model can be exported to ONNX with Optimum (i.e., listed here)
Additional information
I successfully exported the model to ONNX and tested it using the following code:
from ultralytics import YOLO

# Load the PyTorch checkpoint and export it to ONNX
model = YOLO('yolov8x-doclaynet.pt')
model.export(format='onnx')

# Run inference with the exported ONNX model
onnx_model = YOLO('yolov8x-doclaynet.onnx')
results = onnx_model('sample1.png', boxes=True)

for entry in results:
    print(entry.names)
    print(entry.boxes.data.numpy())
Results (each row in the array below is [x1, y1, x2, y2, confidence, class-id]):
WARNING Unable to automatically guess model task, assuming 'task=detect'. Explicitly define task for your model, i.e. 'task=detect', 'segment', 'classify', or 'pose'.
Loading yolov8x-doclaynet.onnx for ONNX Runtime inference...
image 1/1 C:\Users\mcesa\miniconda3\envs\yolo\convert_script\sample1.png: 640x640 5 List-items, 1 Page-footer, 1 Page-header, 4 Section-headers, 12 Texts, 1 Title, 794.2ms
Speed: 5.0ms preprocess, 794.2ms inference, 42.5ms postprocess per image at shape (1, 3, 640, 640)
{0: 'Caption', 1: 'Footnote', 2: 'Formula', 3: 'List-item', 4: 'Page-footer', 5: 'Page-header', 6: 'Picture', 7: 'Section-header', 8: 'Table', 9: 'Text', 10: 'Title'}
[[ 54.104 478.5 625.66 561.51 0.96691 9]
[ 309.46 10.659 606.84 34.639 0.95411 5]
[ 54.101 593.76 610.9 637.21 0.95381 9]
[ 56.099 658.09 617.05 701.72 0.95212 3]
[ 54.254 76.034 553.98 160.44 0.95189 10]
[ 53.743 715.95 624.31 759.17 0.94778 9]
[ 55.52 761.81 608.76 791.86 0.94533 3]
[ 53.804 169.69 95.741 186.52 0.92905 9]
[ 54.597 247.58 613.28 277.48 0.92579 9]
[ 54.05 195.19 221.24 211.95 0.91027 9]
[ 54.129 338.93 101.89 355.2 0.90539 7]
[ 53.955 221.73 99.105 238.28 0.9047 9]
[ 55.766 638.88 384.13 655.06 0.90207 3]
[ 55.492 794.23 522.05 810.18 0.89753 3]
[ 308.84 908.9 373.95 921.43 0.88651 4]
[ 55.702 812.87 447.08 829.36 0.87433 3]
[ 53.422 567.88 74.302 584.58 0.87283 7]
[ 54.287 452.57 143.76 469.43 0.87249 7]
[ 54.106 429.43 414.41 446.01 0.86868 9]
[ 56.377 310.21 316.57 325.97 0.84993 9]
[ 55.328 289.34 417.56 306.33 0.8446 9]
[ 54.093 365.53 626.07 408.28 0.75924 9]
[ 53.334 841.17 73.789 859.76 0.67448 7]
[ 54.386 364.58 298.45 380.64 0.38518 9]]
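As an extra sanity check (not from the original thread), the exported file can also be run with plain onnxruntime-node, bypassing the ultralytics wrapper. A minimal sketch, assuming the default YOLOv8 export settings (a 3-channel input named 'images' with shape [1, 3, 640, 640] and RGB values scaled to [0, 1], and an output named 'output0'):

import * as ort from 'onnxruntime-node';
import sharp from 'sharp';

const SIZE = 640;

// Resize to the model's input size and get raw RGB bytes (assumes a
// 3-channel image; removeAlpha() drops an alpha channel if present).
const { data } = await sharp('sample1.png')
  .resize(SIZE, SIZE, { fit: 'fill' })
  .removeAlpha()
  .raw()
  .toBuffer({ resolveWithObject: true });

// Convert HWC uint8 -> CHW float32 in [0, 1].
const chw = new Float32Array(3 * SIZE * SIZE);
for (let i = 0; i < SIZE * SIZE; ++i) {
  for (let c = 0; c < 3; ++c) {
    chw[c * SIZE * SIZE + i] = data[i * 3 + c] / 255;
  }
}

const session = await ort.InferenceSession.create('yolov8x-doclaynet.onnx');
const feeds = { images: new ort.Tensor('float32', chw, [1, 3, SIZE, SIZE]) };
const { output0 } = await session.run(feeds);
console.log(output0.dims); // [1, 15, 8400] for this 11-class model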
Your contribution
I am willing to provide help if needed.
Hi there! 👋 This should be possible :) Could you upload your converted ONNX models to the HF hub? If you structure it like https://huggingface.co/Xenova/yolov9-c, it should already be able to work (even without any additions to the library).
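For reference, the layout that repo uses looks roughly like this (a sketch; the file roles are inferred, so treat the yolov9-c repo itself as authoritative):

Oblix/yolov8x-doclaynet_ONNX/
├── config.json               <- model metadata, including the id2label mapping
├── preprocessor_config.json  <- resize/rescale settings read by AutoProcessor
└── onnx/
    ├── model.onnx            <- full-precision weights
    └── model_quantized.onnx  <- optional 8-bit version (loaded when quantized: true)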
I already uploaded the ONNX model to the HF Hub (I'm not sure whether the config.json and the preprocessor_config.json are set up correctly, and I haven't figured out how to quantize the model yet).
I then tested it using this code:
import { AutoModel, AutoProcessor, RawImage } from '@xenova/transformers';

const model = await AutoModel.from_pretrained("Oblix/yolov8x-doclaynet_ONNX");
const processor = await AutoProcessor.from_pretrained("Oblix/yolov8x-doclaynet_ONNX");

// `blob` is an image Blob obtained elsewhere (e.g., from a file input)
const rawImage = await RawImage.fromBlob(blob);
const { pixel_values } = await processor(rawImage);
const output = await model({ images: pixel_values });
console.log(output);
I'm not sure what to do with the tensor object.
Permuting the output makes it far easier to understand:
// `output.output0` has shape [1, 15, 8400]; drop the batch dimension and
// transpose so each row corresponds to one candidate box:
const permuted = output.output0[0].transpose(1, 0);
// `permuted` is a Tensor of shape [ 8400, 15 ]:
// - 8400 potential bounding boxes
// - 15 parameters for each box:
//   - the first 4 are the box coordinates (x-center, y-center, width, height)
//   - the remaining 11 are the per-class probabilities
Here's some example code for you to get started:
import { AutoModel, AutoProcessor, RawImage } from '@xenova/transformers';
const model = await AutoModel.from_pretrained(
"Oblix/yolov8x-doclaynet_ONNX",
{
quantized: false,
}
);
const processor = await AutoProcessor.from_pretrained("Oblix/yolov8x-doclaynet_ONNX");
const url = 'https://huggingface.co/DILHTWD/documentlayoutsegmentation_YOLOv8_ondoclaynet/resolve/main/sample1.png';
const rawImage = await RawImage.fromURL(url);
const { pixel_values } = await processor(rawImage);
const output = await model({ images: pixel_values });
// Post-process:
const permuted = output.output0[0].transpose(1, 0);
// `permuted` is a Tensor of shape [ 8400, 15 ]:
// - 8400 potential bounding boxes
// - 15 parameters for each box:
// - first 4 are coordinates for the bounding boxes (x-center, y-center, width, height)
// - the remaining 11 are the probabilities for each class
// Example code to format it nicely:
const result = [];
const threshold = 0.5;
const [scaledHeight, scaledWidth] = pixel_values.dims.slice(-2);
for (const [xc, yc, w, h, ...scores] of permuted.tolist()) {
// Get pixel values, taking into account the original image size
const x1 = (xc - w/2) / scaledWidth * rawImage.width;
const y1 = (yc - h/2) / scaledHeight * rawImage.height;
const x2 = (xc + w/2) / scaledWidth * rawImage.width;
const y2 = (yc + h/2) / scaledHeight * rawImage.height;
// Get best class
const argmax = scores.reduce((maxIndex, currentVal, currentIndex, arr) => currentVal > arr[maxIndex] ? currentIndex : maxIndex, 0);
const score = scores[argmax];
if (score < threshold) continue; // Not confident enough
const label = model.config.id2label[argmax];
result.push({
x1, x2, y1, y2, score, label, index: argmax,
});
}
console.log('result', result);
The first element in result is:
{
x1: 54.511123010516165,
x2: 95.3523416787386,
y1: 169.54515953063967,
y2: 186.98096866607668,
score: 0.922849178314209,
label: 'Text',
index: 9
},
NOTE: This produces many duplicates, so you will need to do some additional filtering based on IoU (intersection over union) scores to remove duplicates.
That worked really well! I used your code and added some filtering:
import { AutoModel, AutoProcessor, RawImage } from '@xenova/transformers';
const model = await AutoModel.from_pretrained(
"Oblix/yolov8x-doclaynet_ONNX",
{
quantized: false,
}
);
const processor = await AutoProcessor.from_pretrained("Oblix/yolov8x-doclaynet_ONNX");
const url = 'https://huggingface.co/DILHTWD/documentlayoutsegmentation_YOLOv8_ondoclaynet/resolve/main/sample1.png';
const rawImage = await RawImage.fromURL(url);
const { pixel_values } = await processor(rawImage);
const output = await model({ images: pixel_values });
// Post-process:
const permuted = output.output0[0].transpose(1, 0);
// `permuted` is a Tensor of shape [ 8400, 15 ]:
// - 8400 potential bounding boxes
// - 15 parameters for each box:
// - first 4 are coordinates for the bounding boxes (x-center, y-center, width, height)
// - the remaining 11 are the probabilities for each class
// Example code to format it nicely:
const result = [];
const threshold = 0.5;
const [scaledHeight, scaledWidth] = pixel_values.dims.slice(-2);
for (const [xc, yc, w, h, ...scores] of permuted.tolist()) {
// Get pixel values, taking into account the original image size
const x1 = (xc - w/2) / scaledWidth * rawImage.width;
const y1 = (yc - h/2) / scaledHeight * rawImage.height;
const x2 = (xc + w/2) / scaledWidth * rawImage.width;
const y2 = (yc + h/2) / scaledHeight * rawImage.height;
// Get best class
const argmax = scores.reduce((maxIndex, currentVal, currentIndex, arr) => currentVal > arr[maxIndex] ? currentIndex : maxIndex, 0);
const score = scores[argmax];
if (score < threshold) continue; // Not confident enough
const label = model.config.id2label[argmax];
result.push({
x1, x2, y1, y2, score, label, index: argmax,
});
}
const iouThreshold = 0.5; // Adjust the threshold as needed
const filteredResults = removeDuplicates(result, iouThreshold);
console.log(filteredResults);
// Greedy duplicate removal: keep a detection unless it overlaps an
// already-kept one; on overlap, keep whichever has the higher score.
function removeDuplicates(detections, iouThreshold) {
const filteredDetections = [];
for (const detection of detections) {
let isDuplicate = false;
let duplicateIndex = -1;
let maxIoU = 0;
for (let i = 0; i < filteredDetections.length; i++) {
const filteredDetection = filteredDetections[i];
const iou = calculateIoU(detection, filteredDetection);
if (iou > iouThreshold) {
isDuplicate = true;
if (iou > maxIoU) {
maxIoU = iou;
duplicateIndex = i;
}
}
}
if (!isDuplicate) {
filteredDetections.push(detection);
} else if (duplicateIndex !== -1) {
if (detection.score > filteredDetections[duplicateIndex].score) {
filteredDetections[duplicateIndex] = detection;
}
}
}
return filteredDetections;
}
// Intersection-over-union of two axis-aligned {x1, y1, x2, y2} boxes
function calculateIoU(detection1, detection2) {
const xOverlap = Math.max(0, Math.min(detection1.x2, detection2.x2) - Math.max(detection1.x1, detection2.x1));
const yOverlap = Math.max(0, Math.min(detection1.y2, detection2.y2) - Math.max(detection1.y1, detection2.y1));
const overlapArea = xOverlap * yOverlap;
const area1 = (detection1.x2 - detection1.x1) * (detection1.y2 - detection1.y1);
const area2 = (detection2.x2 - detection2.x1) * (detection2.y2 - detection2.y1);
const unionArea = area1 + area2 - overlapArea;
return overlapArea / unionArea;
}
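A side note (not from the thread): classic non-maximum suppression sorts the detections by score first, so the greedy pass always keeps the best box of each overlapping cluster and the replace-on-higher-score bookkeeping above becomes unnecessary. A minimal sketch reusing calculateIoU:

// Classic greedy NMS: visit detections in descending score order and keep a
// box only if it does not overlap an already-kept box above the threshold.
function nonMaxSuppression(detections, iouThreshold) {
  const sorted = [...detections].sort((a, b) => b.score - a.score);
  const kept = [];
  for (const det of sorted) {
    if (kept.every((k) => calculateIoU(det, k) <= iouThreshold)) {
      kept.push(det);
    }
  }
  return kept;
}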
Result from the original model:
[[ 54.104 478.5 625.66 561.51 0.96691 9]
[ 309.46 10.659 606.84 34.639 0.95411 5]
[ 54.101 593.76 610.9 637.21 0.95381 9]
[ 56.099 658.09 617.05 701.72 0.95212 3]
[ 54.254 76.034 553.98 160.44 0.95189 10]
[ 53.743 715.95 624.31 759.17 0.94778 9]
[ 55.52 761.81 608.76 791.86 0.94533 3]
[ 53.804 169.69 95.741 186.52 0.92905 9]
[ 54.597 247.58 613.28 277.48 0.92579 9]
[ 54.05 195.19 221.24 211.95 0.91027 9]
[ 54.129 338.93 101.89 355.2 0.90539 7]
[ 53.955 221.73 99.105 238.28 0.9047 9]
[ 55.766 638.88 384.13 655.06 0.90207 3]
[ 55.492 794.23 522.05 810.18 0.89753 3]
[ 308.84 908.9 373.95 921.43 0.88651 4]
[ 55.702 812.87 447.08 829.36 0.87433 3]
[ 53.422 567.88 74.302 584.58 0.87283 7]
[ 54.287 452.57 143.76 469.43 0.87249 7]
[ 54.106 429.43 414.41 446.01 0.86868 9]
[ 56.377 310.21 316.57 325.97 0.84993 9]
[ 55.328 289.34 417.56 306.33 0.8446 9]
[ 54.093 365.53 626.07 408.28 0.75924 9]
[ 53.334 841.17 73.789 859.76 0.67448 7]
[ 54.386 364.58 298.45 380.64 0.38518 9]]
Formatted result from transformers.js, unquantized:
[[ 54.284 478.254 624.995 562.459 0.957 9]
[ 310.025 10.204 606.074 35.195 0.968 5]
[ 54.525 593.898 610.501 638.059 0.949 9]
[ 57.024 658.039 615.768 702.723 0.922 3]
[ 54.666 74.860 554.239 161.227 0.977 10]
[ 54.642 715.615 624.618 759.548 0.925 9]
[ 56.071 761.896 607.785 792.188 0.914 3]
[ 54.419 169.526 95.195 186.826 0.939 9]
[ 54.664 247.617 610.514 278.153 0.910 9]
[ 54.374 195.259 221.504 212.460 0.929 9]
[ 54.665 338.692 103.830 355.555 0.909 7]
[ 54.337 221.946 98.782 238.693 0.911 9]
[ 56.134 638.949 385.122 656.201 0.931 3]
[ 56.299 794.350 520.735 811.010 0.916 3]
[ 309.155 908.863 373.089 922.589 0.914 4]
[ 56.693 812.905 448.480 829.919 0.890 3]
[ 54.096 567.609 73.830 584.788 0.891 7]
[ 54.359 452.377 145.368 469.684 0.914 7]
[ 54.095 429.799 414.876 446.489 0.897 9]
[ 56.867 309.951 316.465 326.458 0.873 9]
[ 56.664 289.387 417.009 306.707 0.886 9]
[ 54.401 365.587 624.608 408.853 0.878 9]
[ 53.807 840.939 70.752 859.954 0.620 7]]
The box coordinates from the original model and the ONNX model are slightly different, but despite this discrepancy, the final result is great!
That's amazing! 🔥 For what it's worth, these minor discrepancies are almost certainly due to the different algorithms used for resizing images. We use the Canvas API when running in-browser, or sharp.js when running in Node.js, and both produce slightly different results than each other and than Python's PIL library.
It would be great if you could update the model card with this example usage, as I'm sure many others would find it useful! 🤗 I can also post a few tweets about it, if you're okay with that?
Great work on this!
I updated the model card and added the quantized model.
I can also post a few tweets about it, if you're okay with that?
Absolutely, the more eyes on the project the better!