A modern, fully typed Python library for converting HTML to Markdown. This library is a completely rewritten fork of markdownify with a modernized codebase, strict type safety and support for Python 3.9+.
If you find html-to-markdown useful, please consider sponsoring its development. Your support helps maintain and improve this library for the community.
- Full HTML5 Support: Comprehensive support for all modern HTML5 elements including semantic, form, table, ruby, interactive, structural, SVG, and math elements
- Table Support: Advanced handling of complex tables with rowspan/colspan support
- Type Safety: Strict MyPy adherence with comprehensive type hints
- Metadata Extraction: Automatic extraction of document metadata (title, meta tags) as comment headers
- Streaming Support: Memory-efficient processing for large documents with progress callbacks
- Highlight Support: Multiple styles for highlighted text (<mark> elements)
- Task List Support: Converts HTML checkboxes to GitHub-compatible task list syntax
- Flexible Configuration: Comprehensive configuration options for customizing conversion behavior
- CLI Tool: Full-featured command-line interface with complete API parity
- Custom Converters: Extensible converter system for custom HTML tag handling
- List Formatting: Configurable list indentation with Discord/Slack compatibility
- HTML Preprocessing: Clean messy HTML with configurable aggressiveness levels
- Whitespace Control: Normalized or strict whitespace preservation modes
- BeautifulSoup Integration: Support for pre-configured BeautifulSoup instances
- Robustly Tested: Comprehensive unit tests and integration tests covering all conversion scenarios
pip install html-to-markdown

For improved performance, you can install with the optional lxml parser:

pip install html-to-markdown[lxml]

The lxml parser offers faster HTML parsing and better handling of malformed HTML compared to the default html.parser.
The library automatically uses lxml when available. You can explicitly specify a parser using the parser parameter.
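When a parser has to be chosen explicitly, the "use lxml when available" behaviour can be approximated in user code. The following is a minimal sketch, not the library's internal logic: `pick_parser` is a hypothetical helper that only checks whether the preferred parser package is importable.

```python
from importlib.util import find_spec


def pick_parser(preferred: str = "lxml", fallback: str = "html.parser") -> str:
    """Return `preferred` if its package is importable, else `fallback`.

    Hypothetical helper: it mirrors the documented behaviour (prefer lxml
    when installed) but is not part of html-to-markdown.
    """
    # find_spec returns None when the top-level package cannot be found.
    return preferred if find_spec(preferred) is not None else fallback


parser = pick_parser()
print(parser)  # "lxml" if it is installed, otherwise "html.parser"
```

The returned name can then be passed through the parser parameter, for example `convert_to_markdown(html, parser=parser)`.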
Convert HTML to Markdown with a single function call:
from html_to_markdown import convert_to_markdown
html = """
<!DOCTYPE html>
<html>
<head>
<title>Sample Document</title>
<meta name="description" content="A sample HTML document">
</head>
<body>
<article>
<h1>Welcome</h1>
<p>This is a <strong>sample</strong> with a <a href="https://example.com">link</a>.</p>
<p>Here's some <mark>highlighted text</mark> and a task list:</p>
<ul>
<li><input type="checkbox" checked> Completed task</li>
<li><input type="checkbox"> Pending task</li>
</ul>
</article>
</body>
</html>
"""
markdown = convert_to_markdown(html)
print(markdown)

Output:
<!--
title: Sample Document
meta-description: A sample HTML document
-->
# Welcome
This is a **sample** with a [link](https://example.com).
Here's some ==highlighted text== and a task list:
* [x] Completed task
* [ ] Pending task

If you need more control over HTML parsing, you can pass a pre-configured BeautifulSoup instance:
from bs4 import BeautifulSoup
from html_to_markdown import convert_to_markdown
# Configure BeautifulSoup with your preferred parser
soup = BeautifulSoup(html, "lxml") # Note: lxml requires additional installation
markdown = convert_to_markdown(soup)

Discord and Slack require 2-space indentation for nested lists:
Python:
from html_to_markdown import convert_to_markdown
html = "<ul><li>Item 1<ul><li>Nested item</li></ul></li></ul>"
markdown = convert_to_markdown(html, list_indent_width=2)
# Output: * Item 1\n  + Nested item

CLI:
html_to_markdown --list-indent-width 2 input.html

Remove navigation, advertisements, and forms from scraped content:
Python:
markdown = convert_to_markdown(html, preprocess_html=True, preprocessing_preset="aggressive")

CLI:
html_to_markdown --preprocess-html --preprocessing-preset aggressive input.html

Maintain exact whitespace for code documentation or technical content:
Python:
markdown = convert_to_markdown(html, whitespace_mode="strict")

CLI:
html_to_markdown --whitespace-mode strict input.html

Some editors and platforms prefer tab-based indentation:
Python:
markdown = convert_to_markdown(html, list_indent_type="tabs")

CLI:
html_to_markdown --list-indent-type tabs input.html

from html_to_markdown import convert_to_markdown
markdown = convert_to_markdown(
    html,
    # Headers and formatting
    heading_style="atx",
    strong_em_symbol="*",
    bullets="*+-",
    highlight_style="double-equal",
    # List indentation
    list_indent_type="spaces",
    list_indent_width=4,
    # Whitespace handling
    whitespace_mode="normalized",
    # HTML preprocessing
    preprocess_html=True,
    preprocessing_preset="standard",
)

Custom converters allow you to override the default conversion behavior for any HTML tag. This is particularly useful for customizing header formatting or implementing domain-specific conversion rules.
from bs4.element import Tag
from html_to_markdown import convert_to_markdown
def custom_h1_converter(*, tag: Tag, text: str, **kwargs) -> str:
    """Convert h1 tags with custom formatting."""
    return f"### {text.upper()} ###\n\n"


def custom_h2_converter(*, tag: Tag, text: str, **kwargs) -> str:
    """Convert h2 tags with underline."""
    return f"{text}\n{'=' * len(text)}\n\n"
html = "<h1>Title</h1><h2>Subtitle</h2><p>Content</p>"
markdown = convert_to_markdown(html, custom_converters={"h1": custom_h1_converter, "h2": custom_h2_converter})
print(markdown)
# Output:
# ### TITLE ###
#
# Subtitle
# ========
#
# Content

def smart_link_converter(*, tag: Tag, text: str, **kwargs) -> str:
    """Convert links based on their attributes."""
    href = tag.get("href", "")
    title = tag.get("title", "")
    # Handle different link types
    if href.startswith("http"):
        # External link
        return f"[{text}]({href} \"{title or 'External link'}\")"
    elif href.startswith("#"):
        # Anchor link
        return f"[{text}]({href})"
    elif href.startswith("mailto:"):
        # Email link
        return f"[{text}]({href})"
    else:
        # Relative link
        return f"[{text}]({href})"


html = '<a href="https://example.com">External</a> <a href="#section">Anchor</a>'
markdown = convert_to_markdown(html, custom_converters={"a": smart_link_converter})

All converter functions must follow this signature:
def converter(*, tag: Tag, text: str, **kwargs) -> str:
    """
    Args:
        tag: BeautifulSoup Tag object with access to all HTML attributes
        text: Pre-processed text content of the tag
        **kwargs: Additional context passed through from conversion

    Returns:
        Markdown formatted string
    """
    pass

Custom converters take precedence over built-in converters and can be used alongside other configuration options.
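As a further illustration of the signature above, here is a small sketch of a converter that appends the expansion of an abbreviation. `abbr_converter` and the `TagLike` protocol are hypothetical names introduced for this example; the protocol only stands in for the single `Tag` method the converter uses, so the function can be exercised without BeautifulSoup.

```python
from typing import Any, Protocol


class TagLike(Protocol):
    """Minimal stand-in for bs4.element.Tag: only attribute lookup is used."""

    def get(self, key: str, default: Any = None) -> Any: ...


def abbr_converter(*, tag: TagLike, text: str, **kwargs: Any) -> str:
    """Render <abbr title="...">X</abbr> as "X (expansion)"."""
    title = tag.get("title", "")
    return f"{text} ({title})" if title else text


# A plain dict exposes the same .get() interface, which is enough for a quick check.
print(abbr_converter(tag={"title": "HyperText Markup Language"}, text="HTML"))
# HTML (HyperText Markup Language)
```

Following the pattern shown earlier, such a function would be registered with `custom_converters={"abbr": abbr_converter}`.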
For processing large documents with memory constraints, use the streaming API:
from html_to_markdown import convert_to_markdown_stream
# Process large HTML in chunks
with open("large_document.html", "r") as f:
    html_content = f.read()

# Returns a generator that yields markdown chunks
for chunk in convert_to_markdown_stream(html_content, chunk_size=2048):
    print(chunk, end="")

With progress tracking:

def show_progress(processed: int, total: int):
    if total > 0:
        percent = (processed / total) * 100
        print(f"\rProgress: {percent:.1f}%", end="")

# Stream with progress callback
markdown = convert_to_markdown(html_content, stream_processing=True, chunk_size=4096, progress_callback=show_progress)

Based on comprehensive performance analysis, here are our recommendations:
Use Regular Processing When:
- Files < 100KB (simplicity preferred)
- Simple scripts and one-off conversions
- Memory is not a concern
- You want the simplest API
Use Streaming Processing When:
- Files > 100KB (memory efficiency)
- Processing many files in batch
- Memory is constrained
- You need progress reporting
- You want to process results incrementally
- Running in production environments
Specific Recommendations by File Size:
| File Size | Recommendation | Reason |
|---|---|---|
| < 50KB | Regular (simplicity) or Streaming (3-5% faster) | Either works well |
| 50KB-100KB | Either (streaming slightly preferred) | Minimal difference |
| 100KB-1MB | Streaming preferred | Better performance + memory efficiency |
| > 1MB | Streaming strongly recommended | Significant memory advantages |
Configuration Recommendations:
- Default chunk_size: 2048 bytes (optimal performance balance)
- For very large files (>10MB): Consider chunk_size=4096
- For memory-constrained environments: Use smaller chunks (chunk_size=1024)
Performance Benefits:
Streaming provides a consistent 3-5% performance improvement across all file sizes:
- Streaming throughput: ~0.47-0.48 MB/s
- Regular throughput: ~0.44-0.47 MB/s
- Memory usage: Streaming uses less peak memory for large files
- Latency: Streaming allows processing results before completion
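The size-based guidance above can be captured in a small helper. This is only a sketch: `conversion_options` is a hypothetical function, and the thresholds and chunk sizes are simply the ones quoted in this section, not values read from the library.

```python
from typing import Any

KB = 1024
MB = 1024 * KB


def conversion_options(size_bytes: int, memory_constrained: bool = False) -> dict[str, Any]:
    """Map a document size to keyword arguments following the guidance above.

    Hypothetical helper; thresholds are the ones quoted in this section.
    """
    if size_bytes < 100 * KB:
        # Small inputs: regular processing keeps the call simple.
        return {"stream_processing": False}
    if memory_constrained:
        chunk_size = 1024
    elif size_bytes > 10 * MB:
        chunk_size = 4096
    else:
        chunk_size = 2048  # documented default
    return {"stream_processing": True, "chunk_size": chunk_size}


print(conversion_options(20 * KB))   # {'stream_processing': False}
print(conversion_options(500 * KB))  # {'stream_processing': True, 'chunk_size': 2048}
print(conversion_options(50 * MB))   # {'stream_processing': True, 'chunk_size': 4096}
```

The resulting dictionary could be unpacked into the call, for example `convert_to_markdown(html, **conversion_options(len(html.encode("utf-8"))))`.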
The library provides functions for preprocessing HTML before conversion, useful for cleaning messy or complex HTML:
from html_to_markdown import convert_to_markdown, create_preprocessor, preprocess_html

# Direct preprocessing with custom options
cleaned_html = preprocess_html(
    raw_html,
    remove_navigation=True,
    remove_forms=True,
    remove_scripts=True,
    remove_styles=True,
    remove_comments=True,
    preserve_semantic_structure=True,
    preserve_tables=True,
    preserve_media=True,
)
markdown = convert_to_markdown(cleaned_html)

# Create a preprocessor configuration from presets
config = create_preprocessor(
    preset="aggressive",  # or "minimal", "standard"
    preserve_tables=False,  # Override preset settings
)
markdown = convert_to_markdown(html, **config)

The library provides specific exception classes for better error handling:
from html_to_markdown import (
    convert_to_markdown,
    HtmlToMarkdownError,
    EmptyHtmlError,
    InvalidParserError,
    ConflictingOptionsError,
    MissingDependencyError,
)

try:
    markdown = convert_to_markdown(html, parser='lxml')
except MissingDependencyError:
    # lxml not installed
    markdown = convert_to_markdown(html, parser='html.parser')
except EmptyHtmlError:
    print("No HTML content to convert")
except InvalidParserError as e:
    print(f"Parser error: {e}")
except ConflictingOptionsError as e:
    print(f"Conflicting options: {e}")
except HtmlToMarkdownError as e:
    print(f"Conversion error: {e}")
## CLI Usage
Convert HTML files directly from the command line with full access to all API options:
```shell
# Convert a file
html_to_markdown input.html > output.md

# Process stdin
cat input.html | html_to_markdown > output.md

# Use custom options
html_to_markdown --heading-style atx --wrap --wrap-width 100 input.html > output.md

# Discord-compatible lists with HTML preprocessing
html_to_markdown \
    --list-indent-width 2 \
    --preprocess-html \
    --preprocessing-preset aggressive \
    input.html > output.md
```

Most Common Options:
--list-indent-width WIDTH # Spaces per indent (default: 4, use 2 for Discord)
--list-indent-type {spaces,tabs} # Indentation type (default: spaces)
--preprocess-html # Enable HTML cleaning for web scraping
--whitespace-mode {normalized,strict} # Whitespace handling (default: normalized)
--heading-style {atx,atx_closed,underlined} # Header style
--no-extract-metadata # Disable metadata extraction
--br-in-tables # Use <br> tags for line breaks in table cells
--source-encoding ENCODING # Override auto-detected encoding (rarely needed)

File Encoding:
The CLI automatically detects file encoding in most cases. Use --source-encoding only when automatic detection fails (typically on some Windows systems or with unusual encodings):
# Override auto-detection for Latin-1 encoded file
html_to_markdown --source-encoding latin-1 input.html > output.md
# Force UTF-16 encoding when auto-detection fails
html_to_markdown --source-encoding utf-16 input.html > output.md

All Available Options:
The CLI supports all Python API parameters. Use html_to_markdown --help to see the complete list.
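Because the CLI flags mirror the Python parameters, a script that shells out to the tool can build its argument list from the same option names. The sketch below is illustrative only: `build_cli_args` is a hypothetical helper, and it covers just a few of the flags listed above.

```python
import shlex
from typing import Optional


def build_cli_args(
    input_path: str,
    *,
    list_indent_width: Optional[int] = None,
    preprocessing_preset: Optional[str] = None,
    whitespace_mode: Optional[str] = None,
    extract_metadata: bool = True,
) -> list[str]:
    """Translate a few keyword options into html_to_markdown CLI arguments."""
    args = ["html_to_markdown"]
    if list_indent_width is not None:
        args += ["--list-indent-width", str(list_indent_width)]
    if preprocessing_preset is not None:
        # A preset only takes effect when preprocessing is enabled.
        args += ["--preprocess-html", "--preprocessing-preset", preprocessing_preset]
    if whitespace_mode is not None:
        args += ["--whitespace-mode", whitespace_mode]
    if not extract_metadata:
        args.append("--no-extract-metadata")
    args.append(input_path)
    return args


args = build_cli_args("input.html", list_indent_width=2, preprocessing_preset="aggressive")
print(shlex.join(args))
# html_to_markdown --list-indent-width 2 --preprocess-html --preprocessing-preset aggressive input.html
```

The list can be handed to `subprocess.run(args, capture_output=True, text=True, check=True)` to capture the Markdown from standard output.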
For existing projects using Markdownify, a compatibility layer is provided:
# Old code
from markdownify import markdownify as md
# New code - works the same way
from html_to_markdown import markdownify as md

The markdownify function is an alias for convert_to_markdown and provides identical functionality.
Note: While the compatibility layer ensures existing code continues to work, new projects should use convert_to_markdown directly as it provides better type hints and clearer naming.
- list_indent_width (int, default: 4): Number of spaces per indentation level (use 2 for Discord/Slack)
- list_indent_type (str, default: 'spaces'): Use 'spaces' or 'tabs' for list indentation
- heading_style (str, default: 'underlined'): Header style ('underlined', 'atx', 'atx_closed')
- whitespace_mode (str, default: 'normalized'): Whitespace handling ('normalized' or 'strict')
- preprocess_html (bool, default: False): Enable HTML preprocessing to clean messy HTML
- extract_metadata (bool, default: True): Extract document metadata as comment header

- highlight_style (str, default: 'double-equal'): Style for highlighted text ('double-equal', 'html', 'bold')
- strong_em_symbol (str, default: '*'): Symbol for strong/emphasized text ('*' or '_')
- bullets (str, default: '*+-'): Characters to use for bullet points in lists
- newline_style (str, default: 'spaces'): Style for handling newlines ('spaces' or 'backslash')
- sub_symbol (str, default: ''): Custom symbol for subscript text
- sup_symbol (str, default: ''): Custom symbol for superscript text
- br_in_tables (bool, default: False): Use <br> tags for line breaks in table cells instead of spaces

- parser (str, default: auto-detect): BeautifulSoup parser to use ('lxml', 'html.parser', 'html5lib')
- preprocessing_preset (str, default: 'standard'): Preprocessing level ('minimal', 'standard', 'aggressive')
- remove_forms (bool, default: True): Remove form elements during preprocessing
- remove_navigation (bool, default: True): Remove navigation elements during preprocessing

- convert_as_inline (bool, default: False): Treat content as inline elements only
- strip_newlines (bool, default: False): Remove newlines from HTML input before processing
- convert (list, default: None): List of HTML tags to convert (None = all supported tags)
- strip (list, default: None): List of HTML tags to remove from output
- custom_converters (dict, default: None): Mapping of HTML tag names to custom converter functions

- escape_asterisks (bool, default: True): Escape * characters to prevent unintended formatting
- escape_underscores (bool, default: True): Escape _ characters to prevent unintended formatting
- escape_misc (bool, default: True): Escape miscellaneous characters to prevent Markdown conflicts

- autolinks (bool, default: True): Automatically convert valid URLs to Markdown links
- default_title (bool, default: False): Use default titles for elements like links
- keep_inline_images_in (list, default: None): Tags where inline images should be preserved

- code_language (str, default: ''): Default language identifier for fenced code blocks
- code_language_callback (callable, default: None): Function to dynamically determine code block language

- wrap (bool, default: False): Enable text wrapping
- wrap_width (int, default: 80): Width for text wrapping

- parser (str, default: auto-detect): BeautifulSoup parser to use ('lxml', 'html.parser', 'html5lib')
- whitespace_mode (str, default: 'normalized'): How to handle whitespace ('normalized' intelligently cleans whitespace, 'strict' preserves original)
- preprocess_html (bool, default: False): Enable HTML preprocessing to clean messy HTML
- preprocessing_preset (str, default: 'standard'): Preprocessing aggressiveness ('minimal' for basic cleaning, 'standard' for balanced, 'aggressive' for heavy cleaning)
- remove_forms (bool, default: True): Remove form elements during preprocessing
- remove_navigation (bool, default: True): Remove navigation elements during preprocessing
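The code_language_callback option listed above is described only as a function that determines the code block language. As a rough sketch of what such a callback might look like, the example below derives the language from a `language-xxx` or `lang-xxx` class. The exact argument the library passes is an assumption here: `language_from_class` is a hypothetical name and expects a tag-like object with a `.get("class")` method.

```python
from typing import Any


def language_from_class(el: Any) -> str:
    """Return the language named by a `language-xxx` or `lang-xxx` class, else ''.

    Assumption: the callback is called with a tag-like object whose
    .get("class") yields a list of class names (BeautifulSoup's behaviour)
    or a single space-separated string.
    """
    classes = el.get("class") or []
    if isinstance(classes, str):
        classes = classes.split()
    for name in classes:
        for prefix in ("language-", "lang-"):
            if name.startswith(prefix):
                return name[len(prefix):]
    return ""


# Dicts stand in for tags here; both expose .get().
print(language_from_class({"class": ["highlight", "language-python"]}))  # python
print(language_from_class({"class": "lang-rust"}))                       # rust
```

Under that assumption it would be supplied as `convert_to_markdown(html, code_language_callback=language_from_class)`.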
This library is open to contribution. Feel free to open issues or submit PRs. It's better to discuss issues before submitting PRs to avoid disappointment.
- Clone the repo
- Install system dependencies (requires Python 3.9+)
- Install the project dependencies: uv sync --all-extras --dev
- Install pre-commit hooks: uv run pre-commit install
- Run tests to ensure everything works: uv run pytest
- Run code quality checks: uv run pre-commit run --all-files
- Make your changes and submit a PR
# Run tests with coverage
uv run pytest --cov=html_to_markdown --cov-report=term-missing
# Lint and format code
uv run ruff check --fix .
uv run ruff format .
# Type checking
uv run mypy
# Test CLI during development
uv run python -m html_to_markdown input.html
# Build package
uv build

This library uses the MIT license.
This library provides comprehensive support for all modern HTML5 elements:
- <article>, <aside>, <figcaption>, <figure>, <footer>, <header>, <hgroup>, <main>, <nav>, <section>
- <abbr>, <bdi>, <bdo>, <cite>, <data>, <dfn>, <kbd>, <mark>, <samp>, <small>, <time>, <var>
- <del>, <ins> (strikethrough and insertion tracking)

- <form>, <fieldset>, <legend>, <label>, <input>, <textarea>, <select>, <option>, <optgroup>
- <button>, <datalist>, <output>, <progress>, <meter>
- Task list support: <input type="checkbox"> converts to - [x] / - [ ]

- <table>, <thead>, <tbody>, <tfoot>, <tr>, <th>, <td>, <caption>
- Merged cell support: Handles rowspan and colspan attributes for complex table layouts
- Smart cleanup: Automatically handles table styling elements for clean Markdown output

- <details>, <summary>, <dialog>, <menu>

- <ruby>, <rb>, <rt>, <rtc>, <rp> (for East Asian typography)

- <img>, <picture>, <audio>, <video>, <iframe>
- SVG support with data URI conversion

- <math> (MathML support)
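Markdown tables have no notion of merged cells, so rowspan and colspan have to be flattened into a rectangular grid before rendering. The sketch below shows one common way to do that, leaving the covered slots empty. It illustrates the general idea only and is not taken from this library's implementation; `expand_spans` and the `(text, rowspan, colspan)` cell tuples are hypothetical.

```python
Cell = tuple[str, int, int]  # (text, rowspan, colspan)


def expand_spans(rows: list[list[Cell]]) -> list[list[str]]:
    """Expand merged cells into a rectangular grid of strings."""
    reserved: set[tuple[int, int]] = set()  # slots covered by an earlier rowspan
    grid: list[list[str]] = []
    for r, row in enumerate(rows):
        line: list[str] = []
        col = 0
        for text, rowspan, colspan in row:
            # Skip slots reserved by a rowspan that started in an earlier row.
            while (r, col) in reserved:
                line.append("")
                col += 1
            for dc in range(colspan):
                # Only the first slot of a merged cell carries the text.
                line.append(text if dc == 0 else "")
                for dr in range(1, rowspan):
                    reserved.add((r + dr, col + dc))
            col += colspan
        # Reserved slots may also trail the last explicit cell of the row.
        while (r, col) in reserved:
            line.append("")
            col += 1
        grid.append(line)
    return grid


rows = [
    [("Name", 2, 1), ("Score", 1, 2)],
    [("Math", 1, 1), ("Art", 1, 1)],
]
for line in expand_spans(rows):
    print(line)
# ['Name', 'Score', '']
# ['', 'Math', 'Art']
```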
Special thanks to the original markdownify project creators and contributors.