HumanSignal / label-studio

Label Studio is a multi-type data labeling and annotation tool with standardized output format

Home Page: https://labelstud.io

Image Export does not work when using S3 Source Storage

StellaASchlotter opened this issue

Describe the bug
I use S3 as source storage. I can label the images, but I can't export them via the UI, the Python SDK, or curl: the image folder in the export is always empty. All the other data is there and looks correct. If I upload data directly via the UI instead of using S3, the export works as expected.

To Reproduce
Steps to reproduce the behavior:

  1. Create a project for object detection
  2. Add an S3 bucket as source storage and sync
  3. Make some annotations
  4. Export in any format that would normally include images

Expected behavior
I expect to be able to export the images, or at least to find a note in the documentation about this limitation if it is expected behavior.

Environment (please complete the following information):

  • OS: Kubernetes v1.28.2, self-hosted
  • Label Studio version: 1.11.0

This comment suggests that support for downloading full datasets was deliberately dropped for cloud storage: HumanSignal/label-studio-converter#47 (comment)

I also checked that the download function in https://github.com/HumanSignal/label-studio-converter/blob/39d308d31e8a9bd77ef5ef4005a099918bd22ab8/label_studio_converter/converter.py#L559 is never called when you use cloud storage as the source.

This should be highlighted when exporting via the UI and should also be documented; it is very surprising behavior that took hours to track down.

To me it also makes Label Studio less appealing. Now I need to write my own code that creates the YOLO dataset structure before I can start training models, and I need to educate everyone on my team about the limitation. I had hoped this would be faster and easier.

I ran into the same problem. I had also hoped this would be faster and easier.

Same problem

@StellaASchlotter

Now I need to write my own code that creates the yolo dataset structure

Why do you need to write code that creates the YOLO dataset structure? As I understand it, you only have to copy the images from the storage to your local directory.
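
For example, something roughly like this would mirror the source-storage bucket locally (a minimal sketch using boto3; the bucket name, prefix, and output directory are placeholders for your setup):

import os
import boto3  # pip install boto3

s3 = boto3.client('s3')  # credentials come from your environment / AWS config
bucket = 'my-source-storage-bucket'   # placeholder: your S3/MinIO bucket
prefix = 'images/'                    # placeholder: key prefix used as source storage
out_dir = 'local_images'
os.makedirs(out_dir, exist_ok=True)

# List every object under the prefix and download it into the local folder
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get('Contents', []):
        key = obj['Key']
        s3.download_file(bucket, key, os.path.join(out_dir, os.path.basename(key)))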

@somunslotus1 @al91liwo

Downloading storage files can take a very long time. This is especially critical for the open-source version of Label Studio, because it doesn't have background RQ workers and the maximum time to process a request is 90 seconds (WSGI limit). So if we tried to download the files during export, you would almost never get the exported annotations.

This script can download images from your storage on your side, using the LS instance to get pre-signed URLs for downloading:

pip install label-studio-sdk
pip install label-studio-converter
pip install label-studio-tools

import os
import subprocess
import time
from label_studio_sdk import Client
from label_studio_tools.core.utils.io import get_local_path

# Initialize the Label Studio SDK client
LABEL_STUDIO_URL = 'https://your-label-studio-instance.com/'
API_KEY = 'your_api_key'
PROJECT_ID = 123  # Replace with your actual project ID

client = Client(url=LABEL_STUDIO_URL, api_key=API_KEY)
project = client.get_project(PROJECT_ID)

# 1. Export JSON snapshot
snapshot = project.export_snapshot_create()
export_id = snapshot['id']

# Wait until the snapshot is ready
while not project.export_snapshot_status(export_id).is_completed():
    time.sleep(1)  # Sleep to avoid excessive requests

# Download the snapshot
status, json_file_path = project.export_snapshot_download(export_id, export_type='JSON')

# 2. Convert JSON to YOLO dataset using label-studio-converter
label_config_xml = project.params['label_config']
xml_file_path = 'label_config.xml'
with open(xml_file_path, 'w') as xml_file:
    xml_file.write(label_config_xml)

# Run label-studio-converter CLI
subprocess.run([
    'label-studio-converter', 'convert',
    '-i', json_file_path,
    '-o', 'output_yolo',
    '-c', xml_file_path,
    '-f', 'YOLO'
])

# 3. Download all images and copy to YOLO images folder
yolo_images_dir = os.path.join('output_yolo', 'images')
os.makedirs(yolo_images_dir, exist_ok=True)

# Assuming the JSON structure contains a list of tasks with image URLs
for task in project.get_tasks():
    image_url = task['data'].get('image')
    if image_url:
        local_image_path = get_local_path(
            url=image_url,
            hostname=LABEL_STUDIO_URL,
            access_token=API_KEY,
            download_resources=True,
            task_id=task['id']
        )
        # Copy the image to the YOLO images directory
        target_path = os.path.join(yolo_images_dir, os.path.basename(local_image_path))
        os.rename(local_image_path, target_path)

print("Conversion and image preparation complete.")

@makseq you are right. I was confusing things there, sorry. I needed a different YOLO format anyway, the one used by Ultralytics, so I built a custom exporter with the SDK functionality. For me it works like this now:

  • Get all labeled tasks
  • Get the storage filename from the task
  • Use the storage filename to download the image from my MinIO bucket
  • Put everything in the folder structure I want

I changed the Label Studio code so that when the export button is clicked, a k8s pod is spawned that does this.
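
A minimal sketch of that kind of exporter (not the exact code running in the pod; the MinIO endpoint and credentials are placeholders, and it assumes the synced tasks carry an s3:// URL in the "image" data field):

import os
from urllib.parse import urlparse

from label_studio_sdk import Client
from minio import Minio  # pip install minio

LABEL_STUDIO_URL = 'https://your-label-studio-instance.com/'
API_KEY = 'your_api_key'
PROJECT_ID = 123  # Replace with your actual project ID

project = Client(url=LABEL_STUDIO_URL, api_key=API_KEY).get_project(PROJECT_ID)

# MinIO connection details are placeholders
minio_client = Minio('minio.example.com:9000',
                     access_key='minio_access_key',
                     secret_key='minio_secret_key',
                     secure=True)

images_dir = os.path.join('dataset', 'images')
os.makedirs(images_dir, exist_ok=True)

# 1. Get all labeled tasks
for task in project.get_labeled_tasks():
    # 2. Get the storage filename from the task (assumes an s3:// URL in 'image')
    s3_url = task['data'].get('image', '')
    if not s3_url.startswith('s3://'):
        continue
    parsed = urlparse(s3_url)                        # s3://<bucket>/<key>
    bucket, key = parsed.netloc, parsed.path.lstrip('/')

    # 3. Download the image from the MinIO bucket
    target = os.path.join(images_dir, os.path.basename(key))
    minio_client.fget_object(bucket, key, target)

    # 4. Write the labels from task['annotations'] into the folder structure
    #    you want here (e.g. the Ultralytics YOLO layout).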

Regarding your comment about having no RQ workers: is this something the enterprise version has? We noticed that the community version can be quite slow when multiple people are working on it. Can this be explained by the lack of background workers?

it can be quite slow when multiple people are working on it

What slowness do you mean?

E.g. in the enterprise version there are no timeouts on exports for large projects, because the enterprise version uses RQ workers.

Currently we find that manually submitting an annotation for one image is sometimes quite slow (up to 10 seconds). In our case the annotation consists of bounding boxes. The time varies between 0.3 seconds and 10 seconds. If we set annotations via the API, the time it takes is always the same. I haven't figured out what causes this. Image loading is fast; for example, I can click each image in the list view one after another without a visible delay. Only when submitting an annotation do I experience slowness.

I am running in a k8s cluster, and the node this is running on has 128 cores and 512 GB RAM, yet I see barely any load. I am not sure how to debug this further. Do you have any suggestions?

It sounds like DB issues; most likely the database is not optimized. You need to check Postgres buffer sizes, run VACUUM, etc. Also, you can easily try the enterprise version here: https://app.heartex.com/user/trial

Thank you. I'll have a look at both.