Add documentlayoutsegmentation_YOLOv8_ondoclaynet
Maxzurek opened this issue
Model description
I originally wanted to train a document layout analysis model using dit-base, but since I didn't have enough compute power I looked for alternatives.
I stumbled upon this model and got great results when testing it. The model uses the YOLOv8x (extra-large) architecture.
Prerequisites
- The model is supported in Transformers (i.e., listed here) (I am unsure)
- The model can be exported to ONNX with Optimum (i.e., listed here)
Additional information
I successfully exported the model to ONNX and tested it using the following code:
from ultralytics import YOLO

# Load the PyTorch checkpoint and export it to ONNX
model = YOLO('yolov8x-doclaynet.pt')
model.export(format='onnx')

# Run inference with the exported ONNX model
onnx_model = YOLO('yolov8x-doclaynet.onnx')
results = onnx_model('sample1.png', boxes=True)

for entry in results:
    print(entry.names)
    print(entry.boxes.data.numpy())
Results (each row in the array below is [x1, y1, x2, y2, confidence, class-id]):
WARNING Unable to automatically guess model task, assuming 'task=detect'. Explicitly define task for your model, i.e. 'task=detect', 'segment', 'classify', or 'pose'.
Loading yolov8x-doclaynet.onnx for ONNX Runtime inference...
image 1/1 C:\Users\mcesa\miniconda3\envs\yolo\convert_script\sample1.png: 640x640 5 List-items, 1 Page-footer, 1 Page-header, 4 Section-headers, 12 Texts, 1 Title, 794.2ms
Speed: 5.0ms preprocess, 794.2ms inference, 42.5ms postprocess per image at shape (1, 3, 640, 640)
{0: 'Caption', 1: 'Footnote', 2: 'Formula', 3: 'List-item', 4: 'Page-footer', 5: 'Page-header', 6: 'Picture', 7: 'Section-header', 8: 'Table', 9: 'Text', 10: 'Title'}
[[ 54.104 478.5 625.66 561.51 0.96691 9]
[ 309.46 10.659 606.84 34.639 0.95411 5]
[ 54.101 593.76 610.9 637.21 0.95381 9]
[ 56.099 658.09 617.05 701.72 0.95212 3]
[ 54.254 76.034 553.98 160.44 0.95189 10]
[ 53.743 715.95 624.31 759.17 0.94778 9]
[ 55.52 761.81 608.76 791.86 0.94533 3]
[ 53.804 169.69 95.741 186.52 0.92905 9]
[ 54.597 247.58 613.28 277.48 0.92579 9]
[ 54.05 195.19 221.24 211.95 0.91027 9]
[ 54.129 338.93 101.89 355.2 0.90539 7]
[ 53.955 221.73 99.105 238.28 0.9047 9]
[ 55.766 638.88 384.13 655.06 0.90207 3]
[ 55.492 794.23 522.05 810.18 0.89753 3]
[ 308.84 908.9 373.95 921.43 0.88651 4]
[ 55.702 812.87 447.08 829.36 0.87433 3]
[ 53.422 567.88 74.302 584.58 0.87283 7]
[ 54.287 452.57 143.76 469.43 0.87249 7]
[ 54.106 429.43 414.41 446.01 0.86868 9]
[ 56.377 310.21 316.57 325.97 0.84993 9]
[ 55.328 289.34 417.56 306.33 0.8446 9]
[ 54.093 365.53 626.07 408.28 0.75924 9]
[ 53.334 841.17 73.789 859.76 0.67448 7]
[ 54.386 364.58 298.45 380.64 0.38518 9]]
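As an extra sanity check (not from the original thread), the exported file can also be run with plain onnxruntime-node, bypassing the ultralytics wrapper. A minimal sketch, assuming the default YOLOv8 export settings (a 3-channel input named 'images' with shape [1, 3, 640, 640] and RGB values scaled to [0, 1], and an output named 'output0'):

import * as ort from 'onnxruntime-node';
import sharp from 'sharp';

const SIZE = 640;

// Resize to the model's input size and get raw RGB bytes (assumes a
// 3-channel image; removeAlpha() drops an alpha channel if present).
const { data } = await sharp('sample1.png')
  .resize(SIZE, SIZE, { fit: 'fill' })
  .removeAlpha()
  .raw()
  .toBuffer({ resolveWithObject: true });

// Convert HWC uint8 -> CHW float32 in [0, 1].
const chw = new Float32Array(3 * SIZE * SIZE);
for (let i = 0; i < SIZE * SIZE; ++i) {
  for (let c = 0; c < 3; ++c) {
    chw[c * SIZE * SIZE + i] = data[i * 3 + c] / 255;
  }
}

const session = await ort.InferenceSession.create('yolov8x-doclaynet.onnx');
const feeds = { images: new ort.Tensor('float32', chw, [1, 3, SIZE, SIZE]) };
const { output0 } = await session.run(feeds);
console.log(output0.dims); // [1, 15, 8400] for this 11-class model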
Your contribution
I am willing to provide help if needed.
Hi there! 👋 This should be possible :) Could you upload your converted ONNX models to the HF hub? If you structure it like https://huggingface.co/Xenova/yolov9-c, it should already be able to work (even without any additions to the library).
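For reference, the layout that repo uses looks roughly like this (a sketch; the file roles are inferred, so treat the yolov9-c repo itself as authoritative):

Oblix/yolov8x-doclaynet_ONNX/
├── config.json               <- model metadata, including the id2label mapping
├── preprocessor_config.json  <- resize/rescale settings read by AutoProcessor
└── onnx/
    ├── model.onnx            <- full-precision weights
    └── model_quantized.onnx  <- optional 8-bit version (loaded when quantized: true)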
I already uploaded the ONNX model to the HF Hub (I'm not sure whether the config.json and the preprocessor_config.json are set up correctly, and I haven't figured out how to quantize the model yet).
I then tested it using this code:
import { AutoModel, AutoProcessor, RawImage } from '@xenova/transformers';

const model = await AutoModel.from_pretrained("Oblix/yolov8x-doclaynet_ONNX");
const processor = await AutoProcessor.from_pretrained("Oblix/yolov8x-doclaynet_ONNX");

// `blob` is an image Blob obtained elsewhere (e.g., from a file input)
const rawImage = await RawImage.fromBlob(blob);
const { pixel_values } = await processor(rawImage);
const output = await model({ images: pixel_values });
console.log(output);
I'm not sure what to do with the tensor object.
Permuting the output makes it far easier to understand:
// `output.output0` has shape [1, 15, 8400]; drop the batch dimension and
// transpose so each row corresponds to one candidate box:
const permuted = output.output0[0].transpose(1, 0);
// `permuted` is a Tensor of shape [ 8400, 15 ]:
// - 8400 potential bounding boxes
// - 15 parameters for each box:
//   - the first 4 are the box coordinates (x-center, y-center, width, height)
//   - the remaining 11 are the per-class probabilities
Here's some example code for you to get started:
import { AutoModel, AutoProcessor, RawImage } from '@xenova/transformers';
const model = await AutoModel.from_pretrained(
"Oblix/yolov8x-doclaynet_ONNX",
{
quantized: false,
}
);
const processor = await AutoProcessor.from_pretrained("Oblix/yolov8x-doclaynet_ONNX");
const url = 'https://huggingface.co/DILHTWD/documentlayoutsegmentation_YOLOv8_ondoclaynet/resolve/main/sample1.png';
const rawImage = await RawImage.fromURL(url);
const { pixel_values } = await processor(rawImage);
const output = await model({ images: pixel_values });
// Post-process:
const permuted = output.output0[0].transpose(1, 0);
// `permuted` is a Tensor of shape [ 8400, 15 ]:
// - 8400 potential bounding boxes
// - 15 parameters for each box:
// - first 4 are coordinates for the bounding boxes (x-center, y-center, width, height)
// - the remaining 11 are the probabilities for each class
// Example code to format it nicely:
const result = [];
const threshold = 0.5;
const [scaledHeight, scaledWidth] = pixel_values.dims.slice(-2);
for (const [xc, yc, w, h, ...scores] of permuted.tolist()) {
// Get pixel values, taking into account the original image size
const x1 = (xc - w/2) / scaledWidth * rawImage.width;
const y1 = (yc - h/2) / scaledHeight * rawImage.height;
const x2 = (xc + w/2) / scaledWidth * rawImage.width;
const y2 = (yc + h/2) / scaledHeight * rawImage.height;
// Get best class
const argmax = scores.reduce((maxIndex, currentVal, currentIndex, arr) => currentVal > arr[maxIndex] ? currentIndex : maxIndex, 0);
const score = scores[argmax];
if (score < threshold) continue; // Not confident enough
const label = model.config.id2label[argmax];
result.push({
x1, x2, y1, y2, score, label, index: argmax,
});
}
console.log('result', result);
The first element in result is:
{
x1: 54.511123010516165,
x2: 95.3523416787386,
y1: 169.54515953063967,
y2: 186.98096866607668,
score: 0.922849178314209,
label: 'Text',
index: 9
},
NOTE: This produces many duplicates, so you will need to do some additional filtering based on IoU (intersection over union) scores to remove duplicates.
That worked really well! I used your code and added some filtering:
import { AutoModel, AutoProcessor, RawImage } from '@xenova/transformers';
const model = await AutoModel.from_pretrained(
"Oblix/yolov8x-doclaynet_ONNX",
{
quantized: false,
}
);
const processor = await AutoProcessor.from_pretrained("Oblix/yolov8x-doclaynet_ONNX");
const url = 'https://huggingface.co/DILHTWD/documentlayoutsegmentation_YOLOv8_ondoclaynet/resolve/main/sample1.png';
const rawImage = await RawImage.fromURL(url);
const { pixel_values } = await processor(rawImage);
const output = await model({ images: pixel_values });
// Post-process:
const permuted = output.output0[0].transpose(1, 0);
// `permuted` is a Tensor of shape [ 8400, 15 ]:
// - 8400 potential bounding boxes
// - 15 parameters for each box:
// - first 4 are coordinates for the bounding boxes (x-center, y-center, width, height)
// - the remaining 11 are the probabilities for each class
// Example code to format it nicely:
const result = [];
const threshold = 0.5;
const [scaledHeight, scaledWidth] = pixel_values.dims.slice(-2);
for (const [xc, yc, w, h, ...scores] of permuted.tolist()) {
// Get pixel values, taking into account the original image size
const x1 = (xc - w/2) / scaledWidth * rawImage.width;
const y1 = (yc - h/2) / scaledHeight * rawImage.height;
const x2 = (xc + w/2) / scaledWidth * rawImage.width;
const y2 = (yc + h/2) / scaledHeight * rawImage.height;
// Get best class
const argmax = scores.reduce((maxIndex, currentVal, currentIndex, arr) => currentVal > arr[maxIndex] ? currentIndex : maxIndex, 0);
const score = scores[argmax];
if (score < threshold) continue; // Not confident enough
const label = model.config.id2label[argmax];
result.push({
x1, x2, y1, y2, score, label, index: argmax,
});
}
const iouThreshold = 0.5; // Adjust the threshold as needed
const filteredResults = removeDuplicates(result, iouThreshold);
console.log(filteredResults);
// Greedy duplicate removal: keep a detection unless it overlaps an
// already-kept one; on overlap, keep whichever has the higher score.
function removeDuplicates(detections, iouThreshold) {
const filteredDetections = [];
for (const detection of detections) {
let isDuplicate = false;
let duplicateIndex = -1;
let maxIoU = 0;
for (let i = 0; i < filteredDetections.length; i++) {
const filteredDetection = filteredDetections[i];
const iou = calculateIoU(detection, filteredDetection);
if (iou > iouThreshold) {
isDuplicate = true;
if (iou > maxIoU) {
maxIoU = iou;
duplicateIndex = i;
}
}
}
if (!isDuplicate) {
filteredDetections.push(detection);
} else if (duplicateIndex !== -1) {
if (detection.score > filteredDetections[duplicateIndex].score) {
filteredDetections[duplicateIndex] = detection;
}
}
}
return filteredDetections;
}
// Intersection-over-union of two axis-aligned {x1, y1, x2, y2} boxes
function calculateIoU(detection1, detection2) {
const xOverlap = Math.max(0, Math.min(detection1.x2, detection2.x2) - Math.max(detection1.x1, detection2.x1));
const yOverlap = Math.max(0, Math.min(detection1.y2, detection2.y2) - Math.max(detection1.y1, detection2.y1));
const overlapArea = xOverlap * yOverlap;
const area1 = (detection1.x2 - detection1.x1) * (detection1.y2 - detection1.y1);
const area2 = (detection2.x2 - detection2.x1) * (detection2.y2 - detection2.y1);
const unionArea = area1 + area2 - overlapArea;
return overlapArea / unionArea;
}
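A side note (not from the thread): classic non-maximum suppression sorts the detections by score first, so the greedy pass always keeps the best box of each overlapping cluster and the replace-on-higher-score bookkeeping above becomes unnecessary. A minimal sketch reusing calculateIoU:

// Classic greedy NMS: visit detections in descending score order and keep a
// box only if it does not overlap an already-kept box above the threshold.
function nonMaxSuppression(detections, iouThreshold) {
  const sorted = [...detections].sort((a, b) => b.score - a.score);
  const kept = [];
  for (const det of sorted) {
    if (kept.every((k) => calculateIoU(det, k) <= iouThreshold)) {
      kept.push(det);
    }
  }
  return kept;
}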
Result from the original model:
[[ 54.104 478.5 625.66 561.51 0.96691 9]
[ 309.46 10.659 606.84 34.639 0.95411 5]
[ 54.101 593.76 610.9 637.21 0.95381 9]
[ 56.099 658.09 617.05 701.72 0.95212 3]
[ 54.254 76.034 553.98 160.44 0.95189 10]
[ 53.743 715.95 624.31 759.17 0.94778 9]
[ 55.52 761.81 608.76 791.86 0.94533 3]
[ 53.804 169.69 95.741 186.52 0.92905 9]
[ 54.597 247.58 613.28 277.48 0.92579 9]
[ 54.05 195.19 221.24 211.95 0.91027 9]
[ 54.129 338.93 101.89 355.2 0.90539 7]
[ 53.955 221.73 99.105 238.28 0.9047 9]
[ 55.766 638.88 384.13 655.06 0.90207 3]
[ 55.492 794.23 522.05 810.18 0.89753 3]
[ 308.84 908.9 373.95 921.43 0.88651 4]
[ 55.702 812.87 447.08 829.36 0.87433 3]
[ 53.422 567.88 74.302 584.58 0.87283 7]
[ 54.287 452.57 143.76 469.43 0.87249 7]
[ 54.106 429.43 414.41 446.01 0.86868 9]
[ 56.377 310.21 316.57 325.97 0.84993 9]
[ 55.328 289.34 417.56 306.33 0.8446 9]
[ 54.093 365.53 626.07 408.28 0.75924 9]
[ 53.334 841.17 73.789 859.76 0.67448 7]
[ 54.386 364.58 298.45 380.64 0.38518 9]]
Formatted result from transformers.js, unquantized:
[[ 54.284 478.254 624.995 562.459 0.957 9]
[ 310.025 10.204 606.074 35.195 0.968 5]
[ 54.525 593.898 610.501 638.059 0.949 9]
[ 57.024 658.039 615.768 702.723 0.922 3]
[ 54.666 74.860 554.239 161.227 0.977 10]
[ 54.642 715.615 624.618 759.548 0.925 9]
[ 56.071 761.896 607.785 792.188 0.914 3]
[ 54.419 169.526 95.195 186.826 0.939 9]
[ 54.664 247.617 610.514 278.153 0.910 9]
[ 54.374 195.259 221.504 212.460 0.929 9]
[ 54.665 338.692 103.830 355.555 0.909 7]
[ 54.337 221.946 98.782 238.693 0.911 9]
[ 56.134 638.949 385.122 656.201 0.931 3]
[ 56.299 794.350 520.735 811.010 0.916 3]
[ 309.155 908.863 373.089 922.589 0.914 4]
[ 56.693 812.905 448.480 829.919 0.890 3]
[ 54.096 567.609 73.830 584.788 0.891 7]
[ 54.359 452.377 145.368 469.684 0.914 7]
[ 54.095 429.799 414.876 446.489 0.897 9]
[ 56.867 309.951 316.465 326.458 0.873 9]
[ 56.664 289.387 417.009 306.707 0.886 9]
[ 54.401 365.587 624.608 408.853 0.878 9]
[ 53.807 840.939 70.752 859.954 0.620 7]]
The box coordinates from the original model and the ONNX model are slightly different, but despite this discrepancy, the final result is great!
That's amazing! 🔥 For what it's worth, these minor discrepancies are almost certainly due to the different algorithms used for resizing images. We use the Canvas API when running in-browser, or sharp.js when running in Node.js, and both produce slightly different results than each other and than Python's PIL library.
It would be great if you could update the model card with this example usage, as I'm sure many others would find it useful! 🤗 I can also post a few tweets about it, if you're okay with that?
Great work on this!
I updated the model card and added the quantized model.
I can also post a few tweets about it, if you're okay with that?
Absolutely, the more eyes on the project the better!