microcosm-cc / bluemonday

bluemonday: a fast golang HTML sanitizer (inspired by the OWASP Java HTML Sanitizer) to scrub user generated content of XSS

Home Page:https://github.com/microcosm-cc/bluemonday

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

SanitizeReaderToWriter is REALLY slow

natefinch opened this issue · comments

I haven't looked into this too deeply, but I went straight for this method because I assumed it might be better than reading the whole HTML body into memory first. All I did was

func main() {
    in, err := os.Open("input.html")
    if err != nil {
        panic(err)
    }
    defer in.Close()

    out, err := os.Create("output.html")
    if err != nil {
        panic(err)
    }
    defer out.Close()
    
    p := bluemonday.UGCPolicy()
    err = p.SanitizeReaderToWriter(in, out)
    if err != nil {
        panic(err)
    }
}

The difference between the above and reading the input into memory and writing the output to a buffer, and then writing the buffer to disk is HUGE. For a large (6mb) HTML file on my 2021 Macbook, SanitizeReaderToWriter took 4.5s, and Sanitize took 0.15s.

I haven't looked too far into this, and I get there could be some I/O buffering issue with reading and writing directly to disk, but even then, the fact that it's 30x slower seems really weird.

yeah...

commented

I'm confused by this because it's all the same under the hood:

func (p *Policy) sanitize(r io.Reader, w io.Writer) error {

SanitizeReaderToWriter is in fact the only one that goes directly: https://github.com/microcosm-cc/bluemonday/blob/main/sanitize.go#L68-L96

I found this puzzling, so I've added a test:

: go test -run TestIssue134 -v
=== RUN   TestIssue134
=== RUN   TestIssue134/Sanitize
=== RUN   TestIssue134/SanitizeReader
=== RUN   TestIssue134/SanitizeBytes
=== RUN   TestIssue134/SanitizeReaderToWriter
--- PASS: TestIssue134 (0.00s)
    --- PASS: TestIssue134/Sanitize (0.00s)
    --- PASS: TestIssue134/SanitizeReader (0.00s)
    --- PASS: TestIssue134/SanitizeBytes (0.00s)
    --- PASS: TestIssue134/SanitizeReaderToWriter (0.00s)
PASS
ok      github.com/microcosm-cc/bluemonday      0.002s

Please re-open if you can reproduce the test showing the method is slower than others.