SanitizeReaderToWriter is REALLY slow
natefinch opened this issue · comments
I haven't looked into this too deeply, but I went straight for this method because I assumed it might be better than reading the whole HTML body into memory first. All I did was
func main() {
in, err := os.Open("input.html")
if err != nil {
panic(err)
}
defer in.Close()
out, err := os.Create("output.html")
if err != nil {
panic(err)
}
defer out.Close()
p := bluemonday.UGCPolicy()
err = p.SanitizeReaderToWriter(in, out)
if err != nil {
panic(err)
}
}
The difference between the above and reading the input into memory and writing the output to a buffer, and then writing the buffer to disk is HUGE. For a large (6mb) HTML file on my 2021 Macbook, SanitizeReaderToWriter took 4.5s, and Sanitize took 0.15s.
I haven't looked too far into this, and I get there could be some I/O buffering issue with reading and writing directly to disk, but even then, the fact that it's 30x slower seems really weird.
yeah...
I'm confused by this because it's all the same under the hood:
func (p *Policy) sanitize(r io.Reader, w io.Writer) error {
SanitizeReaderToWriter is in fact the only one that goes directly: https://github.com/microcosm-cc/bluemonday/blob/main/sanitize.go#L68-L96
I found this puzzling, so I've added a test:
: go test -run TestIssue134 -v
=== RUN TestIssue134
=== RUN TestIssue134/Sanitize
=== RUN TestIssue134/SanitizeReader
=== RUN TestIssue134/SanitizeBytes
=== RUN TestIssue134/SanitizeReaderToWriter
--- PASS: TestIssue134 (0.00s)
--- PASS: TestIssue134/Sanitize (0.00s)
--- PASS: TestIssue134/SanitizeReader (0.00s)
--- PASS: TestIssue134/SanitizeBytes (0.00s)
--- PASS: TestIssue134/SanitizeReaderToWriter (0.00s)
PASS
ok github.com/microcosm-cc/bluemonday 0.002s
Please re-open if you can reproduce the test showing the method is slower than others.