microcosm-cc / bluemonday

bluemonday: a fast golang HTML sanitizer (inspired by the OWASP Java HTML Sanitizer) to scrub user generated content of XSS

Home Page: https://github.com/microcosm-cc/bluemonday

Allow Formatted Email Addresses

teschste-reyrey opened this issue

I am not using bluemonday for a website; I am using it only as an HTML tag stripper, with StrictPolicy(), to generate searchable text without interference from the display tags. However, I have come across one anomaly: when a user enters a formatted email address in their text, such as "John Smith <JohnSmith@abc.com>", the <JohnSmith@abc.com> part is removed from the resulting text because it appears to be some type of tag.

I have tried various regex patterns with the AllowElementsMatching modifier but I have not been able to come up with a way to allow an email address in that format to remain in the result text.

Any help on how to get around this would be appreciated!

commented

Ah... interesting.

So you're getting John Smith <JohnSmith@abc.com> as input, and it's seeing the < and > around the email as an HTML tag.

In essence, this is the problem of using an HTML-aware sanitizer on non-HTML.

I would not be trying to solve this through this library, but would instead try to look at another way to preserve this.

I don't know your input... but have you considered treating the input as Markdown prior to sanitization? In Markdown an email is written as <email@domain.com> and is rendered as an HTML anchor with the email inside, <a href="mailto:email@domain.com">email@domain.com</a>, so when the strict policy is applied it would preserve the text inside the anchor. For that you could look at running https://github.com/russross/blackfriday before bluemonday.
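The autolink step described above can be illustrated without a Markdown library. The sketch below is only an approximation under stated assumptions: the helper name autolinkEmails and the simplified email pattern are hypothetical, and it mimics just the <email> to anchor rewrite rather than blackfriday's actual behavior.

```go
package main

import (
	"fmt"
	"regexp"
)

// emailInBrackets is a deliberately simplified pattern (an assumption, not
// blackfriday's own rule) for an email address wrapped in angle brackets.
// The address itself is captured in group 1.
var emailInBrackets = regexp.MustCompile(`<([A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,})>`)

// autolinkEmails is a hypothetical helper that rewrites <user@host> into a
// mailto anchor, roughly the shape a Markdown autolink produces.
func autolinkEmails(s string) string {
	return emailInBrackets.ReplaceAllString(s, `<a href="mailto:${1}">${1}</a>`)
}

func main() {
	fmt.Println(autolinkEmails("John Smith <JohnSmith@abc.com>"))
}
```

Once the address sits inside an anchor like this, a tag stripper should remove the <a> element and leave the address as plain text.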

Actually, my input is HTML from an HTML text editor. However, I also let the user search the text they entered, and I need to ensure the search ignores the HTML tags or the results get weird. I will look into blackfriday. Thanks for the quick response!

I played with blackfriday and was originally hopeful because when I passed it just the string that had the email address in it, such as Send email to John Smith <JohnSmith@abc.com>., it worked perfectly. However when I used the actual HTML from the text editor, for example <p>Send email to John Smith <JohnSmith@abc.com>.</p>, blackfriday appeared to ignore the text entirely and simply returned the exact same string, so it seems that solution will not work for my scenario.

I was able to find a solution for my scenario and thought I would share it in case anyone else has the issue. Before I process the text with bluemonday, I replace any <JohnSmith@abc.com> with the same text minus the < and > (i.e. JohnSmith@abc.com). The code I am using is as follows:

  var r = regexp.MustCompile(`(?i)<\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b>`)
  var ranges [][]int
  var temp string

  ranges = r.FindAllIndex([]byte(text), -1)

  if len(ranges) == 0 {
    // No formatted email address exists, process the text as-is.
    temp = text
  } else {
    // Start with any text that precedes the first formatted email address
    // (this is empty when the address is at the beginning of the string).
    temp = text[:ranges[0][0]]

    // Loop through all occurrences of formatted email addresses in the text.
    for idx := 0; idx < len(ranges); idx++ {
      // Add the formatted email address to the temp string, dropping the < and >.
      temp += text[ranges[idx][0]+1:ranges[idx][1]-1]

      if idx < len(ranges)-1 && ranges[idx][1] < ranges[idx+1][0] {
        // There is at least one more range; grab any text between the
        // current and next occurrence.
        temp += text[ranges[idx][1]:ranges[idx+1][0]]
      }
    }

    if ranges[len(ranges)-1][1] < len(text) {
      // The formatted email address is not at the end of the text, so grab the rest of the text
      // after the final occurrence.
      temp += text[ranges[len(ranges)-1][1]:]
    }
  }

  // Strip all remaining HTML, reversing any characters bluemonday escaped (to provide clean
  // searchable text).
  fmt.Println(html.UnescapeString(bluemonday.StrictPolicy().Sanitize(temp)))
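The same bracket-dropping pre-pass can probably be written more compactly with a capture group and ReplaceAllString. This is a sketch rather than the code used above: the helper name dropEmailBrackets is hypothetical, the pattern is the same one with the address wrapped in a group, and the result would still need to go through bluemonday.StrictPolicy().Sanitize and html.UnescapeString as in the snippet above.

```go
package main

import (
	"fmt"
	"regexp"
)

// formattedEmail uses the same pattern as the snippet above, with the
// address (everything between < and >) captured in group 1.
var formattedEmail = regexp.MustCompile(`(?i)<(\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b)>`)

// dropEmailBrackets is a hypothetical helper that replaces every
// <user@host> with user@host and leaves all other text untouched.
func dropEmailBrackets(text string) string {
	return formattedEmail.ReplaceAllString(text, "${1}")
}

func main() {
	in := "<p>Send email to John Smith <JohnSmith@abc.com>.</p>"
	// The real <p> tags are left in place for the sanitizer to strip later.
	fmt.Println(dropEmailBrackets(in))
}
```

Because ReplaceAllString copies the unmatched text through unchanged, the manual bookkeeping for text before, between, and after the matches is no longer needed.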

This issue can be closed.