HumanSignal / label-studio

Label Studio is a multi-type data labeling and annotation tool with standardized output format

Home Page: https://labelstud.io

Image Export does not work when using S3 Source Storage

StellaASchlotter opened this issue

Describe the bug
I use S3 as source storage. I can label the images, but I can't export them via the UI, the Python SDK, or curl: the image folder in the export is always empty. All the other data is there and looks correct. If I upload data directly via the UI instead of using S3, the export works as expected.

To Reproduce
Steps to reproduce the behavior:

  1. Create a project for object detection
  2. Add an S3 bucket as source storage and sync
  3. Make some annotations
  4. Export in any format that would normally include images

Expected behavior
I expect to be able to export the images, or at least to find a note in the documentation about this limitation if it is expected behavior.

Environment (please complete the following information):

  • OS: Kubernetes v1.28.2, self-hosted
  • Label Studio version: 1.11.0

This comment suggests that support for downloading full datasets was deliberately dropped for cloud storage: HumanSignal/label-studio-converter#47 (comment)

I also checked that the download function in https://github.com/HumanSignal/label-studio-converter/blob/39d308d31e8a9bd77ef5ef4005a099918bd22ab8/label_studio_converter/converter.py#L559 is never called when you use cloud storage as the source.

This should be highlighted when exporting via the UI and should also be documented; it is very surprising behavior that took hours to track down.

To me it also makes Label Studio less appealing. Now I need to write my own code that creates the YOLO dataset structure before I can start training models, and I need to educate everyone on my team about the limitation. I had hoped this would be faster and easier.

I ran into the same problem. I had also hoped this would be faster and easier.

Same problem

@StellaASchlotter

Now I need to write my own code that creates the yolo dataset structure

Why do you need to write code that creates the YOLO dataset structure? As I understand it, you only have to copy the images from the storage to your local directory.
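
For example, something roughly like this would mirror the source-storage bucket locally (a minimal sketch using boto3; the bucket name, prefix, and output directory are placeholders for your setup):

import os
import boto3  # pip install boto3

s3 = boto3.client('s3')  # credentials come from your environment / AWS config
bucket = 'my-source-storage-bucket'   # placeholder: your S3/MinIO bucket
prefix = 'images/'                    # placeholder: key prefix used as source storage
out_dir = 'local_images'
os.makedirs(out_dir, exist_ok=True)

# List every object under the prefix and download it into the local folder
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get('Contents', []):
        key = obj['Key']
        s3.download_file(bucket, key, os.path.join(out_dir, os.path.basename(key)))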

@somunslotus1 @al91liwo

Downloading storage files can take a very long time. This is especially critical for the open-source version of Label Studio, because it doesn't have background RQ workers and the maximum time to process a request is 90 seconds (WSGI limit). So if we tried to download the files during export, you would almost never get the exported annotations.

This script can download images from your storage on your side, using the LS instance to get pre-signed URLs for downloading:

pip install label-studio-sdk
pip install label-studio-converter
pip install label-studio-tools

import os
import subprocess
import time
from label_studio_sdk import Client
from label_studio_tools.core.utils.io import get_local_path

# Initialize the Label Studio SDK client
LABEL_STUDIO_URL = 'https://your-label-studio-instance.com/'
API_KEY = 'your_api_key'
PROJECT_ID = 123  # Replace with your actual project ID

client = Client(url=LABEL_STUDIO_URL, api_key=API_KEY)
project = client.get_project(PROJECT_ID)

# 1. Export JSON snapshot
snapshot = project.export_snapshot_create()
export_id = snapshot['id']

# Wait until the snapshot is ready
while not project.export_snapshot_status(export_id).is_completed():
    time.sleep(1)  # Sleep to avoid excessive requests

# Download the snapshot
status, json_file_path = project.export_snapshot_download(export_id, export_type='JSON')

# 2. Convert JSON to YOLO dataset using label-studio-converter
label_config_xml = project.params['label_config']
xml_file_path = 'label_config.xml'
with open(xml_file_path, 'w') as xml_file:
    xml_file.write(label_config_xml)

# Run label-studio-converter CLI
subprocess.run([
    'label-studio-converter', 'convert',
    '-i', json_file_path,
    '-o', 'output_yolo',
    '-c', xml_file_path,
    '-f', 'YOLO'
])

# 3. Download all images and copy to YOLO images folder
yolo_images_dir = os.path.join('output_yolo', 'images')
os.makedirs(yolo_images_dir, exist_ok=True)

# Assuming the JSON structure contains a list of tasks with image URLs
for task in project.get_tasks():
    image_url = task['data'].get('image')
    if image_url:
        local_image_path = get_local_path(
            url=image_url,
            hostname=LABEL_STUDIO_URL,
            access_token=API_KEY,
            download_resources=True,
            task_id=task['id']
        )
        # Copy the image to the YOLO images directory
        target_path = os.path.join(yolo_images_dir, os.path.basename(local_image_path))
        os.rename(local_image_path, target_path)

print("Conversion and image preparation complete.")

@makseq you are right. I was confusing things there, sorry. I needed a different YOLO format anyway, the one used by Ultralytics, so I built a custom exporter with the SDK functionality. For me it works like this now:

  • Get all labeled tasks
  • Get the storage filename from the task
  • Use the storage filename to download the image from my MinIO bucket
  • Put everything in the folder structure I want

I changed the Label Studio code so that when the export button is clicked, a k8s pod is spawned that does this.
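
A minimal sketch of that kind of exporter (not the exact code running in the pod; the MinIO endpoint and credentials are placeholders, and it assumes the synced tasks carry an s3:// URL in the "image" data field):

import os
from urllib.parse import urlparse

from label_studio_sdk import Client
from minio import Minio  # pip install minio

LABEL_STUDIO_URL = 'https://your-label-studio-instance.com/'
API_KEY = 'your_api_key'
PROJECT_ID = 123  # Replace with your actual project ID

project = Client(url=LABEL_STUDIO_URL, api_key=API_KEY).get_project(PROJECT_ID)

# MinIO connection details are placeholders
minio_client = Minio('minio.example.com:9000',
                     access_key='minio_access_key',
                     secret_key='minio_secret_key',
                     secure=True)

images_dir = os.path.join('dataset', 'images')
os.makedirs(images_dir, exist_ok=True)

# 1. Get all labeled tasks
for task in project.get_labeled_tasks():
    # 2. Get the storage filename from the task (assumes an s3:// URL in 'image')
    s3_url = task['data'].get('image', '')
    if not s3_url.startswith('s3://'):
        continue
    parsed = urlparse(s3_url)                        # s3://<bucket>/<key>
    bucket, key = parsed.netloc, parsed.path.lstrip('/')

    # 3. Download the image from the MinIO bucket
    target = os.path.join(images_dir, os.path.basename(key))
    minio_client.fget_object(bucket, key, target)

    # 4. Write the labels from task['annotations'] into the folder structure
    #    you want here (e.g. the Ultralytics YOLO layout).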

Regarding your comment about having no RQ workers: is this something the enterprise version has? We noticed that the community version can be quite slow when multiple people are working on it. Can this be explained by the lack of background workers?

it can be quite slow when multiple people are working on it

What slowness do you mean?

E.g. in the enterprise version there are no timeouts on exports for large projects, because the enterprise version uses RQ workers.

Currently we find that manually submitting an annotation for one image is sometimes quite slow (up to 10 seconds). In our case the annotation consists of bounding boxes. The time varies between 0.3 seconds and 10 seconds. If we set annotations via the API, the time it takes is always the same. I haven't figured out what causes this. Image loading is fast; for example, I can click each image in the list view one after another without a visible delay. Only when submitting an annotation do I experience slowness.

I am running in a k8s cluster, and the node this is running on has 128 cores and 512 GB RAM, yet I see barely any load. I am not sure how to debug this further. Do you have any suggestions?

It sounds like DB issues; most likely the database is not optimized. You need to check Postgres buffer sizes, run VACUUM, etc. Also, you can easily try the enterprise version here: https://app.heartex.com/user/trial

Thank you. I'll have a look at both.