Multimedia Commons Meilisearch Uploader

A high-performance Rust application with a concurrent pipeline architecture that streams images from an S3 bucket, processes them in parallel, and uploads them to Meilisearch with optimal throughput and constant memory usage.

Features

  • Concurrent Pipeline Architecture: Three-stage pipeline with S3 listing, image processing, and uploading running concurrently
  • True Streaming: Processing starts immediately as images are discovered; there is no waiting for a full S3 scan
  • Memory Efficient: Constant memory usage regardless of dataset size using channels and bounded queues
  • S3 Integration: Recursively scans S3 buckets using rusty-s3 with pagination
  • Parallel Image Processing: Configurable concurrent image downloads and processing
  • Intelligent Filtering: Advanced monocolor detection with compression artifact tolerance
  • Intelligent Batching: Dynamic batch uploading with adaptive sizing - never skips large images
  • Batch Deletion: Efficiently removes low-color images using Meilisearch's batch delete API
  • Built-in Resilience: Retry logic with exponential backoff for transient failures
  • Base64 Encoding: Converts images to base64 for Meilisearch storage
  • Highly Configurable: Command-line options for all performance parameters
  • Dry Run Mode: Test configuration without uploading to Meilisearch
  • Real-time Monitoring: Live progress tracking and detailed statistics

Installation

Make sure you have Rust installed, then build the project:

cargo build --release

The binary will be available at target/release/multimedia-commons-meilisearch-uploader.

Usage

Basic Usage

./target/release/multimedia-commons-meilisearch-uploader

This will use the default configuration:

  • S3 Bucket: multimedia-commons
  • S3 Region: us-west-2
  • S3 Prefix: data/images/
  • Meilisearch URL: https://ms-66464012cf08-103.fra.meilisearch.io

Custom Configuration

./target/release/multimedia-commons-meilisearch-uploader \
    --bucket my-bucket \
    --region us-east-1 \
    --prefix images/ \
    --meilisearch-url https://my-meilisearch.com \
    --meilisearch-key your-api-key \
    --max-downloads 100 \
    --max-uploads 20 \
    --batch-size 50

Dry Run

Test the configuration without uploading to Meilisearch:

./target/release/multimedia-commons-meilisearch-uploader --dry-run

Command Line Options

Option             Default                                         Description
--bucket           multimedia-commons                              S3 bucket name
--region           us-west-2                                       S3 region
--prefix           data/images/                                    S3 prefix path
--meilisearch-url  https://ms-66464012cf08-103.fra.meilisearch.io  Meilisearch URL
--meilisearch-key  (default provided)                              Meilisearch API key
--max-downloads    50                                              Maximum concurrent downloads
--max-uploads      10                                              Maximum concurrent uploads
--batch-size       100                                             Number of documents per batch
--max-batch-bytes  104857600                                       Maximum batch size in bytes (100 MB)
--dry-run          false                                           Don't upload to Meilisearch
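
The options above map naturally onto clap's derive API. The following sketch shows how a subset of them could be declared (defaults copied from the table; the Config struct is illustrative, not necessarily the repository's actual definition):

use clap::Parser;

/// Illustrative CLI definition covering a subset of the options above.
#[derive(Parser, Debug)]
struct Config {
    /// S3 bucket name
    #[arg(long, default_value = "multimedia-commons")]
    bucket: String,

    /// S3 region
    #[arg(long, default_value = "us-west-2")]
    region: String,

    /// S3 prefix path
    #[arg(long, default_value = "data/images/")]
    prefix: String,

    /// Maximum concurrent downloads
    #[arg(long, default_value_t = 50)]
    max_downloads: usize,

    /// Number of documents per batch
    #[arg(long, default_value_t = 100)]
    batch_size: usize,

    /// Maximum batch size in bytes (100 MB)
    #[arg(long, default_value_t = 104_857_600)]
    max_batch_bytes: usize,

    /// Don't upload to Meilisearch
    #[arg(long)]
    dry_run: bool,
}

fn main() {
    let config = Config::parse();
    println!("{config:?}");
}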

Output Format

Each image is converted to a JSON document with the following structure:

{
  "id": "filename_without_extension",
  "base64": "base64_encoded_image_data",
  "url": "https://bucket.s3-region.amazonaws.com/path/to/image.jpg"
}
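
A minimal sketch of how such a document could be assembled in Rust with serde and base64 (the ImageDocument type and to_document helper are illustrative names, not the repository's actual API):

use base64::Engine;
use serde::Serialize;

/// Illustrative document shape matching the JSON above.
#[derive(Serialize)]
struct ImageDocument {
    id: String,
    base64: String,
    url: String,
}

/// Build a document from an S3 key and the downloaded image bytes.
fn to_document(key: &str, bytes: &[u8], bucket: &str, region: &str) -> ImageDocument {
    // "data/images/abc.jpg" -> id "abc"
    let file_name = key.rsplit('/').next().unwrap_or(key);
    let id = file_name
        .rsplit_once('.')
        .map(|(stem, _ext)| stem)
        .unwrap_or(file_name);
    ImageDocument {
        id: id.to_string(),
        base64: base64::engine::general_purpose::STANDARD.encode(bytes),
        url: format!("https://{bucket}.s3-{region}.amazonaws.com/{key}"),
    }
}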

AWS Credentials

The application uses AWS credentials from the environment. You have several options:

Option 1: Environment Variables

export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key

Option 2: AWS Credentials File

Create ~/.aws/credentials:

[default]
aws_access_key_id = your_access_key
aws_secret_access_key = your_secret_key

Option 3: IAM Roles (for EC2 instances)

If running on EC2, the application will automatically use IAM roles.

Option 4: Anonymous Access

For public buckets, the application will attempt anonymous access if no credentials are found.

Note: The multimedia-commons bucket is publicly accessible, so you can run the application without AWS credentials for read-only access.

Performance

The application uses a concurrent pipeline architecture for maximum performance:

Pipeline Architecture

S3 Lister → [Channel] → Image Processor → [Channel] → Batch Uploader
    ↓                         ↓                             ↓
Discovers images         Downloads and                 Uploads to
continuously             processes images              Meilisearch
                         in parallel                   in batches

  • Concurrent Execution: All three stages run simultaneously for maximum throughput
  • No Blocking: Image processing starts immediately as images are discovered
  • Constant Memory: Bounded channels prevent memory buildup regardless of dataset size
  • Configurable Parallelism: Control concurrent downloads and uploads independently
  • Efficient Resource Usage: CPU, network, and memory optimally utilized
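
A minimal sketch of this shape using tokio's bounded mpsc channels; the stage bodies and channel capacities below are illustrative placeholders, not the uploader's tuned values:

use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    // Bounded channels provide back-pressure: when a queue is full the upstream
    // stage waits, so memory stays constant no matter how many objects S3 returns.
    let (key_tx, mut key_rx) = mpsc::channel::<String>(1_000);
    let (doc_tx, mut doc_rx) = mpsc::channel::<String>(100);

    // Stage 1: list S3 objects and push keys as they are discovered.
    let lister = tokio::spawn(async move {
        for key in ["data/images/a.jpg", "data/images/b.jpg"] {
            key_tx.send(key.to_string()).await.ok();
        }
    });

    // Stage 2: download and process images, forwarding finished documents.
    let processor = tokio::spawn(async move {
        while let Some(key) = key_rx.recv().await {
            // download + decode + base64-encode would happen here
            doc_tx.send(format!("document for {key}")).await.ok();
        }
    });

    // Stage 3: accumulate documents into batches and upload them.
    let uploader = tokio::spawn(async move {
        while let Some(doc) = doc_rx.recv().await {
            println!("would upload: {doc}");
        }
    });

    let _ = tokio::join!(lister, processor, uploader);
}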

Performance Features

  • S3 Streaming: Continuous S3 object discovery with pagination
  • Parallel Processing: Up to N concurrent image downloads/processing (configurable)
  • Adaptive Batching: Intelligent batch sizing that handles large images by sending them separately
  • Batch Deletion: Groups low-color image deletions into batches of 50 for efficient API usage
  • Advanced Filtering: Grid-based monocolor detection with compression tolerance
  • Retry Logic: Exponential backoff for transient failures
  • Real-time Stats: Live progress monitoring across all pipeline stages
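
For the batch deletion step, Meilisearch exposes a POST /indexes/{index_uid}/documents/delete-batch route that accepts a JSON array of document ids. A hedged sketch of grouping ids into batches of 50 with reqwest (the URL, index name, and key below are placeholders):

use reqwest::Client;

/// Delete low-color image ids from the index in batches of 50 (illustrative sketch).
async fn delete_in_batches(client: &Client, ids: &[String]) -> reqwest::Result<()> {
    for chunk in ids.chunks(50) {
        client
            .post("https://my-meilisearch.example/indexes/images/documents/delete-batch")
            .bearer_auth("your-api-key")
            .json(chunk)
            .send()
            .await?
            .error_for_status()?;
    }
    Ok(())
}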

Performance Tuning

Throughput Optimization:

  • --max-downloads: Controls concurrent image processing (default: 50)
    • Higher values = more parallel processing but more memory/CPU usage
    • Tune based on your system resources and S3 rate limits
  • --max-uploads: Controls Meilisearch upload concurrency (default: 10)
    • Tune based on your Meilisearch instance capacity
  • --batch-size: Target documents per upload batch (default: 100)
    • Larger batches = fewer API calls but more memory per batch
    • Large images are automatically sent in separate batches to ensure no data loss
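
A sketch of the adaptive sizing rule described above, assuming documents are already serialized to strings: a batch is flushed when the count or byte limit is reached, and a single document larger than the byte limit ships as a batch of one rather than being skipped (the Batcher type is illustrative):

/// Illustrative adaptive batcher: flush by document count or by total bytes,
/// and never drop an oversized document - it simply becomes a batch of one.
struct Batcher {
    batch: Vec<String>,   // serialized documents
    bytes: usize,
    batch_size: usize,    // --batch-size
    max_bytes: usize,     // --max-batch-bytes
}

impl Batcher {
    /// Add one serialized document; returns any batches that are now ready to upload.
    fn push(&mut self, doc: String) -> Vec<Vec<String>> {
        let mut ready = Vec::new();
        // Flush the current batch if this document would push it past the byte limit.
        if !self.batch.is_empty() && self.bytes + doc.len() > self.max_bytes {
            ready.push(std::mem::take(&mut self.batch));
            self.bytes = 0;
        }
        self.bytes += doc.len();
        self.batch.push(doc);
        // Flush when the count limit is reached, or when this single document is
        // itself larger than the byte limit (it ships alone).
        if self.batch.len() >= self.batch_size || self.bytes >= self.max_bytes {
            ready.push(std::mem::take(&mut self.batch));
            self.bytes = 0;
        }
        ready
    }

    /// Flush whatever remains once the input stream ends.
    fn finish(&mut self) -> Option<Vec<String>> {
        self.bytes = 0;
        if self.batch.is_empty() { None } else { Some(std::mem::take(&mut self.batch)) }
    }
}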

Memory Management:

  • Channel buffer sizes are automatically tuned for optimal memory usage
  • The pipeline maintains constant memory regardless of dataset size
  • Processing memory scales with --max-downloads setting only

Monitoring:

  • Watch the live output to see pipeline balance
  • Optimal setup: S3 discovery stays ahead of processing, and processing stays ahead of uploads

Image Processing

  • Supported Formats: JPEG, PNG (detected by file extension)
  • Advanced Color Analysis: Counts unique colors per image (minimum 40 colors to be considered rich content)
  • Base64 Encoding: All valid images are encoded to base64 for Meilisearch storage
  • Pipeline Processing: Images flow through the pipeline as discovered - no batching in memory
  • Concurrent Downloads: Multiple images processed simultaneously with semaphore-based rate limiting
  • Adaptive Upload Strategy: Large images that exceed batch size limits are sent in separate batches
  • Batch Deletion Strategy: Low-color images are queued and deleted in batches of 50 using Meilisearch's batch API
  • Zero Data Loss: All images are processed - rich images uploaded, simple images deleted from index
  • Graceful Error Handling: Failed downloads/processing are logged and counted but don't stop the pipeline
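
A minimal sketch of the unique-color check, assuming the image crate and the 40-color threshold mentioned above; the sampling stride is illustrative rather than the exact grid used by the uploader:

use std::collections::HashSet;

/// Returns true when the decoded image contains at least `min_colors` distinct
/// RGB values; monocolor or near-monocolor images fall below the threshold.
fn is_rich(bytes: &[u8], min_colors: usize) -> bool {
    let Ok(img) = image::load_from_memory(bytes) else {
        return false; // undecodable images are treated as not rich
    };
    let rgb = img.to_rgb8();
    let mut colors: HashSet<[u8; 3]> = HashSet::new();
    // Sample every 4th pixel in both directions; a coarse grid is enough to
    // separate flat placeholder images from real photographs.
    for y in (0..rgb.height()).step_by(4) {
        for x in (0..rgb.width()).step_by(4) {
            colors.insert(rgb.get_pixel(x, y).0);
            if colors.len() >= min_colors {
                return true;
            }
        }
    }
    false
}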

Dependencies

  • anyhow - Error handling
  • base64 - Base64 encoding
  • clap - Command line parsing
  • futures - Async utilities
  • image - Image processing
  • reqwest - HTTP client
  • rusty-s3 - S3 client
  • serde - Serialization
  • tokio - Async runtime
  • url - URL parsing

Error Handling

The application includes comprehensive error handling:

  • Automatic retries for transient failures
  • Graceful handling of invalid images
  • Logging of errors without stopping the entire process
  • Final summary of errors encountered
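
A sketch of the retry-with-exponential-backoff pattern for a generic async operation; the attempt count and base delay here are illustrative defaults, not values taken from the uploader:

use std::time::Duration;

/// Retry an async operation up to `max_attempts` times, doubling the delay each time.
async fn with_retry<T, E, F, Fut>(mut op: F, max_attempts: u32) -> Result<T, E>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<T, E>>,
{
    let mut delay = Duration::from_millis(500);
    let mut attempt = 1;
    loop {
        match op().await {
            Ok(value) => return Ok(value),
            Err(err) if attempt >= max_attempts => return Err(err),
            Err(_) => {
                tokio::time::sleep(delay).await; // back off before the next attempt
                delay *= 2;
                attempt += 1;
            }
        }
    }
}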

Troubleshooting

Common Issues

  1. Certificate Errors: Make sure your system time is correct and your CA certificates are up to date
  2. Access Denied: Verify your AWS credentials have S3 read permissions
  3. Out of Memory: Reduce --max-downloads if processing very large images (the pipeline itself uses constant memory)
  4. Slow Processing: Increase --max-downloads for more parallel processing, but watch system resources
  5. Large Image Handling: Very large images are automatically sent in separate batches with logging
  6. Meilisearch Errors: Check that your Meilisearch URL and API key are correct
  7. Pipeline Stalls: If one stage becomes a bottleneck, tune the related concurrency parameters

Testing

Use the --dry-run flag to test the pipeline without uploading to Meilisearch:

# Test with lower concurrency to see pipeline stages clearly
./target/release/multimedia-commons-meilisearch-uploader --dry-run --max-downloads 5 --batch-size 10

# Test with higher concurrency for performance evaluation
./target/release/multimedia-commons-meilisearch-uploader --dry-run --max-downloads 20 --batch-size 50

# Test batch handling with smaller limits to see adaptive batching
./target/release/multimedia-commons-meilisearch-uploader --dry-run --max-batch-bytes 100000 --batch-size 3

# See what would be deleted vs uploaded
./target/release/multimedia-commons-meilisearch-uploader --dry-run --max-downloads 50

License

This project is licensed under the MIT License.
