If a profanity has a 1 on the end it isn't detected

Question

DavidJBerman opened this issue 4 years ago

    var mf = samueljacksonsfavoriteword + "1"; // "mtherf**er1"
    bool isProfanity = ProfanityFilter.IsProfanity(); // isProfanity is true
    var censored = ProfanityFilter.CensorString(mf);
    bool areSame = (censored == mf); // Returns true.

Censor does not censor m**********r out of the string even though ProfanityFilter recognizes the word as a profanity.

stephenhaunts · Answer 1 · Wed Apr 22 2020 15:28:41 GMT+0800 (China Standard Time)

Hi, thanks for this. I will try and get this resolved this week.

stephenhaunts · Answer 2 · Mon May 04 2020 17:22:31 GMT+0800 (China Standard Time)

This is a tricky one. The logic it falls into is the mitigation for the Scunthorpe problem, which is a common problem for profanity detectors. The word Scunthorpe (a town in the UK) is not a profanity, but because it contains the word c*nt it normally gets flagged. One common way of resolving this is to whitelist Scunthorpe, but that isn't a great solution, as you then have to try to whitelist every case like this.

My library tries to be a bit more intelligent about it. It detects the word c*nt, then looks at the surrounding letters to find the whole word that contains it, and checks whether that whole word is profane. In this example, the surrounding word is Scunthorpe, which is not rude. The town of Penistone has the same problem.

In your example of motherf*cker1, it is the same issue. The logic detects motherf*cker, then looks for the surrounding word, which is motherf*cker1, and that word is not profane in terms of the profanity list.
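For readers following along, here is a minimal sketch of the enclosing-word check described above. It is not the library's actual code: the class name `EnclosingWordSketch`, the one-entry list, and the choice to treat digits as part of a word are all assumptions made for illustration, with the mild stand-in "heck" used in place of a real profanity.

```csharp
using System;
using System.Collections.Generic;

public static class EnclosingWordSketch
{
    // Hypothetical one-entry list; a real profanity list would be far larger.
    private static readonly HashSet<string> Profanities =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "heck" };

    // Masks a listed word only when the whole enclosing word is also listed.
    public static string Censor(string text)
    {
        var chars = text.ToCharArray();
        foreach (var profanity in Profanities)
        {
            int index = text.IndexOf(profanity, StringComparison.OrdinalIgnoreCase);
            while (index >= 0)
            {
                // Expand left and right to the edges of the enclosing word.
                // Assumption: letters and digits both count as word characters.
                int start = index;
                while (start > 0 && char.IsLetterOrDigit(text[start - 1])) start--;
                int end = index + profanity.Length;
                while (end < text.Length && char.IsLetterOrDigit(text[end])) end++;

                string enclosing = text.Substring(start, end - start);
                if (Profanities.Contains(enclosing))
                {
                    for (int i = start; i < end; i++) chars[i] = '*';
                }

                // Continue searching after the enclosing word.
                index = text.IndexOf(profanity, end, StringComparison.OrdinalIgnoreCase);
            }
        }
        return new string(chars);
    }
}
```

Under these assumptions, `EnclosingWordSketch.Censor("The heckler said heck and heck1")` leaves "heckler" alone (the Scunthorpe case) and masks the bare "heck", but it also leaves "heck1" untouched, which is the behaviour reported in this issue.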

In the spirit of the solution, this is "as designed" behaviour, but I also see that from your point of view we have a rogue motherf*cker running around, which isn't great.

This one needs a little more thought to solve it in a way that doesn't completely stink.

stephenhaunts · Answer 3 · Mon May 04 2020 18:39:07 GMT+0800 (China Standard Time)

I have checked in a fix and updated the NuGet package to version 0.1.4.

This was an odd one to fix as technically the side effect you were seeing was as designed, but your use case was also valid. So a bug that's not a bug that needs to be treated as a bug.

I have made the fix user selectable for the moment as I am still trying to decide how much it smells.

For now, I have added a new overload to CensorString that takes a bool telling it to ignore numbers in the string, as demonstrated in the following unit test.

    [TestMethod]
    public void CensoredStringReturnsCensoredStringMotherfucker()
    {
        var filter = new ProfanityFilter();

        var censored = filter.CensorString("You are a motherfucker1", '*', true);
        var result = "You are a *************";

        Assert.AreEqual(censored, result);
    }

I have tested lots of edge cases around it too such as:

    [TestMethod]
    public void CensoredStringReturnsCensoredStringMotherfucker11()
    {
        var filter = new ProfanityFilter();

        var censored = filter.CensorString("You are a motherfucker1 and a 'fucking twat3'.", '*', true);
        var result = "You are a ************* and a '******* *****'.";

        Assert.AreEqual(censored, result);
    }

I class this as a temporary fix at the moment until I decide the best thing to do with it. What are your thoughts?

Thanks

Steve

stephenhaunts · Answer 4 · Mon May 04 2020 18:45:02 GMT+0800 (China Standard Time)

When I say it ignores numbers, it is a little smarter than that. It ignores numbers that are joined to another word, so you can still have numbers in a sentence, as illustrated in the following test.

    [TestMethod]
    public void CensoredStringReturnsCensoredStringMotherfucker12()
    {
        var filter = new ProfanityFilter();

        var censored = filter.CensorString("I've had 10 beers, and you are a motherfucker1 and a 'fucking twat3'.", '*', true);
        var result = "I've had 10 beers, and you are a ************* and a '******* *****'.";

        Assert.AreEqual(censored, result);
    }
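As a rough illustration of the "numbers joined to a word" rule, the sketch below trims digits from the ends of each word before looking it up, and masks the whole token (digits included) on a match. This is only a guess at one way to get the behaviour shown in the tests, not the library's implementation. The names `IgnoreNumbersSketch` and `ignoreNumbers` are hypothetical, "heck" stands in for a real profanity, and stripping leading digits as well as trailing ones is an assumption the thread does not confirm.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class IgnoreNumbersSketch
{
    // Hypothetical one-entry list standing in for a real profanity list.
    private static readonly HashSet<string> Profanities =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "heck" };

    private static readonly char[] Digits = "0123456789".ToCharArray();

    public static string Censor(string text, char mask = '*', bool ignoreNumbers = false)
    {
        // Visit each run of letters and digits; punctuation and spaces are kept as-is.
        return Regex.Replace(text, "[A-Za-z0-9]+", match =>
        {
            string word = match.Value;

            // With the flag set, digits attached to either end are ignored for the lookup.
            // A standalone number such as "10" trims to an empty string and never matches.
            string core = ignoreNumbers ? word.Trim(Digits) : word;

            return Profanities.Contains(core) ? new string(mask, word.Length) : word;
        });
    }
}
```

With the flag set, `IgnoreNumbersSketch.Censor("I've had 10 beers, heck1", '*', true)` masks "heck1" and leaves "10" alone. With the flag left at its default, the same input comes back unchanged.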

David Berman · Answer 5 · Mon May 04 2020 21:14:27 GMT+0800 (China Standard Time)

Hi Stephen, I hope you are doing ok. Thanks a lot for fixing the issue causing the exceptions. This was a big deal! As you are the author of the library you have the best expertise about how to handle the profanity. That’s why I didn’t try to make a change myself and submit a pull request. So I do agree you need to decide what the business rules are for it and that’s part of the value add of your filter.

Wildcards. Your strategy of having a bigger list but no wildcards makes sense.

Numbers. If they are leading or trailing it seems to me we would want to remove the profanity. If they are in the middle it’s trickier. Also, different environments will have different amounts of strictness to get the desired false positive false negative balance. Big companies will be very strict and will be very upset to see anything that’s obviously profanity to a human. If the user slips through “m0thërfvcker” I’m not sure that’s even profanity any more, it gets into the grey zone. Even when they bleep on tv you might hear the beginning or end of the word, and you can still see their mouth moving.

I think your solution of accepting an optional parameter for numbers is good and I don’t think it smells because you need a way to have different rules for different scenarios just like you need different profanity lists for different scenarios.

D.