WikitextParser is a dependency-free .NET 9 library for parsing Wikipedia's wikitext markup into a structured object model. It provides both a low-level API for direct element access and a high-level API for parsing a full page into a semantic structure of sections, subsections, and an infobox.
The entire library was developed to parse real-world Wikipedia page and can handle a variety of common wikitext syntax.
- Dual-Level API:
- High-Level API (
Parser.ParsePage): Parses an entire page into a structuredPageobject with a lead section, an infobox, and a nested hierarchy of sections. - Low-Level API (
Parser.Parse): Parses wikitext into a flatIEnumerable<WikitextElement>for custom processing and analysis.
- High-Level API (
- Content Conversion: Convert any parsed element, section, or the entire page into either HTML or plain text.
- No Dependencies: The core library is self-contained and does not require any external packages.
This library will be available via NuGet. You can install it using the .NET CLI:
dotnet add package WikitextParserOr through the NuGet Package Manager in Visual Studio.
Using the parser is straightforward. The recommended approach is to use the high-level ParsePage API for most use cases.
This is the easiest way to get a structured representation of a wiki page.
using WikitextParser;
using WikitextParser.Models;
string wikitext = File.ReadAllText("my_wikitext_file.txt");
// 1. Parse the entire page into a Page object
Page page = Parser.ParsePage(wikitext);
// 2. Access the Infobox
if (page.Infobox != null)
{
var creatorParam = page.Infobox.Parameters.FirstOrDefault(p => p.Key == "creator");
if (creatorParam != null)
{
Console.WriteLine($"Creator: {creatorParam.Value.ConvertToText()}");
}
}
// 3. Access the Lead Content (content before the first heading)
foreach (var element in page.LeadContent)
{
// Ignores metadata templates like {{Short description|...}}
if (element is not WikiKeyValuePairElement)
{
Console.WriteLine(element.ConvertToText());
}
}
// 4. Recursively process all sections and subsections
void PrintSection(Section section, int indent = 0)
{
string indentStr = new string(' ', indent * 2);
Console.WriteLine($"{indentStr}{section.Heading.ConvertToText().Trim()}");
// Print main article links for the section
foreach (var mainLink in section.MainArticleLinks)
{
Console.WriteLine($"{indentStr} (See also: {mainLink.Parameters.First().Value.ConvertToText()})");
}
// Print the section's direct content
foreach(var content in section.ContentElements)
{
Console.WriteLine($"{indentStr} [Content: {content.GetType().Name}]");
}
// Recurse into subsections
foreach (var sub in section.Subsections)
{
PrintSection(sub, indent + 1);
}
}
Console.WriteLine("\n--- PAGE STRUCTURE ---");
foreach (var section in page.Sections)
{
PrintSection(section);
}
// 5. Convert the entire page to HTML or Text
string fullHtml = page.ConvertToHtml();
string fullText = page.ConvertToText();If you need fine-grained control or want to process elements as a simple stream, you can use the low-level Parse method.
using WikitextParser;
using WikitextParser.Elements;
string wikitext = "== Section 1 ==\nThis is ''italic'' text.\n\n[[Category:My Category]]";
var elements = Parser.Parse(wikitext);
foreach (var element in elements)
{
switch (element.Type)
{
case WikitextElementType.Heading:
var heading = (HeadingElement)element;
Console.WriteLine($"Found H{heading.HeadingLevel}: {heading.ConvertToText().Trim()}");
break;
case WikitextElementType.Paragraph:
var paragraph = (ParagraphElement)element;
Console.WriteLine($"Found Paragraph: {paragraph.ConvertToText().Trim()}");
break;
case WikitextElementType.Category:
var category = (CategoryElement)element;
Console.WriteLine($"Found Category: {category.CategoryName}");
break;
default:
Console.WriteLine($"Found other element type: {element.Type}");
break;
}
}The parser currently supports the following wikitext syntax:
| Element | Wikitext Syntax | Notes |
|---|---|---|
| Paragraphs | Some text. |
Paragraphs are separated by one or more blank lines (\n\n). Single newlines are treated as line breaks within a paragraph. |
| Headings | == H2 ==, === H3 ===, etc. |
Parses heading levels 2 through 6. |
| Bold | '''Bold Text''' |
|
| Italic | ''Italic Text'' |
|
| Bold & Italic | '''''Bold Italic''''' |
|
| Internal Links | [[Page Name]], `[[Page Name |
Display Text]]` |
| Templates | `{{TemplateName | param1 |
| Infoboxes | {{Infobox ... }} |
Recognized as a special TemplateElement and separated in the high-level API. |
| Simple KVP | `{{Short description | ...}}` |
| Tables | `{ | ... |
| References | <ref>...</ref>, <ref name="..."/>, <ref name="...">...</ref> |
Parses standard, named, and self-closing reference tags. |
| File/Image | `[[File:Name.jpg | thumb |
| Category Links | `[[Category:Category Name | Sort Key]]` |
| HTML Comments | <!-- A comment --> |
Comments are parsed and preserved. |
Every parsed element (WikitextElement), as well as the high-level Page and Section objects, includes two methods: ConvertToHtml() and ConvertToText().
ConvertToHtml(): Produces a basic HTML representation of the content.ConvertToText(): Produces a plain text representation, stripping out all markup.
Page page = Parser.ParsePage(wikitext);
// Convert the entire page
string fullHtml = page.ConvertToHtml();
string fullText = page.ConvertToText();Notice: The built-in conversion methods are designed to be simple and provide a reasonable default output. They are not customizable. For more advanced formatting, custom logic, or to handle specific templates in a unique way, it is recommended to traverse the parsed element tree and build your own custom conversion logic.
To build the project yourself:
- Install dotnet sdk (9 or newer)
- Clone the repository.
- Navigate to the root directory.
- Run
dotnet build.
This project is licensed under the MIT License.
- 0.0.1 - Initial release
- 0.1.2 - Fix "plainlist" template not parsing correctly