A production-ready HTTP server for converting documents to Markdown using Microsoft's MarkItDown library.
- Stream-based processing - No temporary files, minimal memory footprint
- Production-ready - Uses Gunicorn WSGI server with worker management
- Memory-safe - Built-in file size limits, request timeouts, and worker recycling
- Secure - Runs as non-root user, multi-stage Docker build
- Observable - Health checks, structured logging, and request metrics
- Docker-optimized - Multi-stage build for minimal image size
Supported input formats:

- PDF documents
- Microsoft Office (DOCX, XLSX, PPTX)
- Images (JPEG, PNG, etc.) with EXIF metadata
- Audio files (with transcription)
- HTML, CSV, JSON, XML
- ZIP archives
- EPUB books
- And more...
Run with Docker Compose:

```bash
docker-compose up -d
```

Or build and run the image directly:

```bash
# Build the image
docker build -t markitdown-server .

# Run the container
docker run -d -p 8080:8080 --name markitdown-server markitdown-server
```

To run locally for development:

```bash
# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Run development server
python server.py
```

To convert a document, POST it as multipart form data to `/convert`:

```bash
curl -X POST http://localhost:8080/convert \
  -F "file=@document.pdf" \
  -o output.md
```
Check server health:

```bash
curl http://localhost:8080/health
```

The server is configured through environment variables:

| Variable | Default | Description |
|---|---|---|
| `GUNICORN_WORKERS` | (CPU cores * 2) + 1 | Number of worker processes |
| `LOG_LEVEL` | `info` | Logging level (`debug`, `info`, `warning`, `error`) |
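These map onto standard Gunicorn settings. A minimal `gunicorn.conf.py` along these lines is sketched below; it is an illustration, and the file shipped in this repo may differ:

```python
# gunicorn.conf.py (sketch; the repo's actual config may differ)
import multiprocessing
import os

bind = "0.0.0.0:8080"

# GUNICORN_WORKERS overrides the (CPU cores * 2) + 1 default
workers = int(os.environ.get(
    "GUNICORN_WORKERS", multiprocessing.cpu_count() * 2 + 1))

# LOG_LEVEL maps straight onto Gunicorn's loglevel setting
loglevel = os.environ.get("LOG_LEVEL", "info")

# Recycle each worker after 1,000 requests to contain memory leaks
max_requests = 1000
max_requests_jitter = 50  # stagger restarts across workers

# Kill conversions that run longer than 120 seconds
timeout = 120
```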
The default maximum file size is 50 MB. To change it, edit `MAX_FILE_SIZE` in `server.py`:

```python
MAX_FILE_SIZE = 100 * 1024 * 1024  # 100 MB
```

Docker Compose sets a 2 GB memory limit by default. Adjust it in `docker-compose.yml`:
```yaml
deploy:
  resources:
    limits:
      memory: 4G  # Increase for larger files
```

Several optimizations keep the server fast and lean:

- Uses `convert_stream()` to process files directly from memory (see the server sketch after this list)
- No temporary file creation, which means less I/O and faster conversions
- Workers restart after 1,000 requests to prevent memory leaks (`max_requests` in the Gunicorn sketch above)
- Worker count is configurable and scales with CPU cores
- A 120-second timeout on long-running PDF conversions prevents workers from hanging indefinitely
- The multi-stage Docker build separates build dependencies from the runtime image, yielding smaller, more secure images
- A single converter instance is reused across requests, eliminating per-request initialization overhead
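Put together, the stream-based, singleton-converter flow might look like the following (a minimal sketch, assuming a Flask app in `server.py`; the exact `convert_stream()` keyword arguments can vary between markitdown versions):

```python
# server.py (sketch; the actual implementation may differ)
import io
import os

from flask import Flask, jsonify, request
from markitdown import MarkItDown

app = Flask(__name__)

MAX_FILE_SIZE = 50 * 1024 * 1024  # 50 MB

# Singleton converter: created once at import time, reused by every request
converter = MarkItDown()

@app.route("/convert", methods=["POST"])
def convert():
    uploaded = request.files.get("file")
    if uploaded is None:
        return jsonify(error="no file provided"), 400

    data = uploaded.read()
    if len(data) > MAX_FILE_SIZE:
        return jsonify(error="file too large"), 413

    # Stream-based conversion: nothing is written to disk
    ext = os.path.splitext(uploaded.filename or "")[1]
    result = converter.convert_stream(io.BytesIO(data), file_extension=ext)
    return result.text_content, 200, {"Content-Type": "text/markdown"}
```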
This server implements several strategies to prevent memory leaks common with PDF processing:
- File size limits - Reject files larger than 50 MB
- Worker recycling - Restart workers after 1,000 requests
- Request timeouts - Kill long-running conversions
- Stream processing - No disk I/O, minimal memory usage
- Memory limits - Docker enforces hard memory caps
If you're processing very large PDFs (>50 MB):
- Increase `MAX_FILE_SIZE` in `server.py`
- Increase the Docker memory limit in `docker-compose.yml`
- Reduce `GUNICORN_WORKERS` to limit concurrent processing
- Consider a dedicated PDF parser like Docling
If conversions are timing out:
- Increase `timeout` in `gunicorn.conf.py`
- Check the logs for specific errors:

```bash
docker logs markitdown-server
```
If the Docker build fails with missing dependencies, clear the cache and rebuild:

```bash
# Clear Docker cache and rebuild
docker-compose build --no-cache
```

The server includes:
- A health check endpoint (`/health`) for readiness probes
- Liveness checks via Docker `HEALTHCHECK`
- Graceful shutdown handling
- Structured JSON logging
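Continuing the server sketch from earlier, the readiness endpoint can be as small as this (illustrative; the real handler may report more detail):

```python
# Sketch: readiness probe endpoint (extends the server sketch above)
@app.route("/health", methods=["GET"])
def health():
    return jsonify(status="ok"), 200
```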
When deploying behind Nginx/Traefik, ensure:
- Client max body size matches your file limit
- Proxy timeouts exceed worker timeouts (120s)
Example Nginx config:

```nginx
location /convert {
    proxy_pass http://markitdown-server:8080;
    client_max_body_size 50M;
    proxy_read_timeout 180s;
}
```

Request flow through the stack:

```
┌─────────────┐
│ Client │
└──────┬──────┘
│ HTTP POST /convert
▼
┌─────────────────────┐
│ Gunicorn Workers │
│ (4 processes) │
└──────┬──────────────┘
│
▼
┌──────────────────────┐
│ MarkItDown Library │
│ (Singleton Instance)│
└──────┬───────────────┘
│
▼
┌──────────────────────┐
│ Format Converters │
│ (PDF, DOCX, etc.) │
└──────────────────────┘
```

Tests are not yet included:

```bash
# TODO: Add tests
pytest tests/
```
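A starting point for that test suite might look like this (a sketch, assuming `server.py` exposes the Flask `app` and the endpoints behave as described above):

```python
# tests/test_server.py (sketch; endpoint behavior assumed, not confirmed)
from server import app

def test_health_returns_ok():
    client = app.test_client()
    response = client.get("/health")
    assert response.status_code == 200

def test_convert_rejects_missing_file():
    client = app.test_client()
    response = client.post("/convert", data={})
    assert response.status_code == 400
```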
To set up pre-commit hooks:

```bash
pip install pre-commit
pre-commit install
pre-commit run --all-files
```

MIT License. See Microsoft's MarkItDown for the underlying library's license.
- Built on Microsoft MarkItDown
- Optimizations inspired by production best practices and community feedback