microsoft / XmlNotepad

XML Notepad provides a simple intuitive User Interface for browsing and editing XML documents.

Home Page: https://microsoft.github.io/XmlNotepad/

Incomplete schema validation of large XML files ( > ~20 MB)

AikenBM opened this issue

I'm using XML Notepad 2.9.0.5 on Windows 10 Enterprise 22H2 19045.3208.

I've discovered that XML Notepad's validation gives up around the 20 MB mark. The last entry it adds to the error list shows a line number and column number of 0, and after that the program stops reporting any further schema errors.

I am attaching a zipped 42 MB XML file that contains 100 schema validation errors. This example uses sample data from the state of Michigan's Department of Education state reporting system because that's what I was doing when I found the problem.

SchemaErrors.zip

XML Notepad validates and identifies the first 43 or so errors, but the last one listed doesn't appear to populate the table the same way, and it stops after that error. Even exporting the list doesn't show the remaining errors. Here's a screenshot of the error list:

[image: screenshot of the error list]

Based on my incidental testing, any schema validation error after roughly the 20 millionth character or 565,000th line in the file will fail in this way. Only the first error in that range shows up in the error list, and its entry does not display accurate information. I don't see any setting in the application's options to raise this apparent limit.

I've also included a PowerShell function below that uses System.Xml.XmlReader to validate the file against its schema, and it correctly identifies all 100 schema errors.

function Validate-XmlFile {
    [CmdletBinding()]
    param (
        # The path to the XML file
        [Parameter(Mandatory = $true, Position = 0)]
        [String]
        $Path,

        # Instead of outputting the warnings, just return true if valid and false if invalid
        [Switch]
        $IsValid,

        # Force the process to download the schema again instead of potentially using a cached version
        [Switch]
        $ForceSchemaRefresh
    )

    process {
        $XmlFileName = Get-Item $Path -ErrorAction Stop

        $XmlReaderSettings = New-Object -TypeName System.Xml.XmlReaderSettings
        $XmlReaderSettings.ValidationType = [System.Xml.ValidationType]::Schema
        $XmlReaderSettings.ValidationFlags = ([System.Xml.Schema.XmlSchemaValidationFlags]::ProcessInlineSchema -bor
            [System.Xml.Schema.XmlSchemaValidationFlags]::ProcessSchemaLocation -bor
            [System.Xml.Schema.XmlSchemaValidationFlags]::ReportValidationWarnings -bor
            [System.Xml.Schema.XmlSchemaValidationFlags]::ProcessIdentityConstraints)

        $XmlUrlResolver = [System.Xml.XmlUrlResolver]::new()
        # Some versions of Powershell require credentials. Use anonymous credentials to satisfy the class.
        $XmlUrlResolver.Credentials = [System.Net.NetworkCredential]::new('anonymous','anonymous@example.com')
        $XmlUrlResolver.CachePolicy = [System.Net.Cache.RequestCacheLevel]::Revalidate
        if ($true -eq $ForceSchemaRefresh) {
            $XmlUrlResolver.CachePolicy = [System.Net.Cache.RequestCacheLevel]::Reload
        }
        $XmlReaderSettings.XmlResolver = $XmlUrlResolver

        # Create the validation handler to capture the validation errors and warnings
        $script:ValidationOutput = [System.Collections.Generic.List[String]]::new()
        $ValidationEventHandler = [System.Xml.Schema.ValidationEventHandler] {
            # $_ is the second argument of type System.Xml.Schema.ValidationEventArgs
            $script:ValidationOutput.Add(("{0} on line {2}: {1}" -f $_.Severity, $_.Message, $_.Exception.LineNumber))
        }
        $XmlReaderSettings.add_ValidationEventHandler($ValidationEventHandler)

        $XmlReader = $null
        try {
            $XmlReader = [System.Xml.XmlReader]::Create($XmlFileName.FullName, $XmlReaderSettings)
            # Loading the document drives the reader to the end, firing the handler for each validation event
            [System.Xml.XmlDocument]::new().Load($XmlReader)
            if (!$IsValid) {
                if ($script:ValidationOutput.Count -eq 0) {
                    Write-Host "No validation errors in file '$($XmlFileName.FullName)'."
                }
                else {
                    # Write-Host "Validation errors written to '$ValidationErrorFile'"
                    # $script:ValidationOutput | Set-Content -Path $ValidationErrorFile -Encoding ascii
                    $script:ValidationOutput
                    Write-Warning ("{0:n0} errors detected in '{1}'" -f $script:ValidationOutput.Count, $XmlFileName.FullName)
                }
            }
            else {
                if ($script:ValidationOutput.Count -eq 0) {
                    return $true
                }
                else {
                    return $false
                }
            }
        }
        finally {
            # Only dispose if the reader was actually created
            if ($null -ne $XmlReader) {
                $XmlReader.Dispose()
            }
        }
    }
}

Note that I typically use PowerShell v7.3 with the above function. I'm not sure whether it also works with Windows PowerShell v5.1.
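For anyone who wants to see the same XmlReader validation technique without the 42 MB attachment, here is a minimal, self-contained sketch. The schema and the document in it are made-up examples rather than the Michigan data, and names such as `$xsd`, `$xml`, and `$errors` are only illustrative. It is a sketch under those assumptions, not the function above: it reads the document with a plain `Read()` loop instead of loading an `XmlDocument`, which should avoid holding the whole file in memory, and it records the column as well as the line for each event.

```powershell
# Hypothetical schema: <items> contains one or more integer <count> elements.
$xsd = @'
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="items">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="count" type="xs:int" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
'@

# Hypothetical document: two of the four <count> values are not integers.
$xml = @'
<items>
  <count>1</count>
  <count>two</count>
  <count>3</count>
  <count>four</count>
</items>
'@

# Parse the schema from the string and attach it to the reader settings.
$schema = [System.Xml.Schema.XmlSchema]::Read([System.IO.StringReader]::new($xsd), $null)
$settings = [System.Xml.XmlReaderSettings]::new()
$settings.ValidationType = [System.Xml.ValidationType]::Schema
[void]$settings.Schemas.Add($schema)

# Collect every validation event together with its line and column.
$errors = [System.Collections.Generic.List[string]]::new()
$handler = [System.Xml.Schema.ValidationEventHandler] {
    param($source, $e)
    $errors.Add(('{0} at line {1}, column {2}: {3}' -f
        $e.Severity, $e.Exception.LineNumber, $e.Exception.LinePosition, $e.Message))
}
$settings.add_ValidationEventHandler($handler)

$reader = [System.Xml.XmlReader]::Create([System.IO.StringReader]::new($xml), $settings)
try {
    # Reading to the end validates node by node without building a DOM.
    while ($reader.Read()) { }
}
finally {
    $reader.Dispose()
}

# One entry per invalid <count> value.
$errors
```

The function above is called in a similar spirit, for example `Validate-XmlFile -Path .\SchemaErrors.xml` to list the errors or `Validate-XmlFile -Path .\SchemaErrors.xml -IsValid` for a true/false result; the file name here is only a placeholder for wherever the attachment is unzipped.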

Very excellent bug report, I'll check it out, thanks.