carpii / unzalgo

Transforms ť͈̓̆h̏̔̐̑ì̭ͯ͞s̈́̄̑͋ into this without breaking internationalization

Home Page: https://github.kdex.de/unzalgo/

unzalgo

Transforms ť͈̓̆h̏̔̐̑ì̭ͯ͞s̈́̄̑͋ into this without breaking internationalization.

Installation

$ npm install -D unzalgo

About

You can use unzalgo to both detect Zalgo text and transform it back into normal text without breaking internationalization. For example, you could transform:

T͘H͈̩̬̺̩̭͇I͏̼̪͚̪͚S͇̬̺ ́E̬̬͈̮̻̕V҉̙I̧͖̜̹̩̞̱L͇͍̝ ̺̮̟̙̘͎U͝S̞̫̞͝E͚̘͝R IṊ͍̬͞P̫Ù̹̳̝͓̙̙T̜͕̺̺̳̘͝

into

THIS EVIL USER INPUT

while also having

thiŝ te̅xt displây normally, since some lângûaĝes aĉtuallŷ uŝe thêse sŷmbo̅ls,

and, at the same time, keep all diacritics in

Z nich ovšem pouze předposlední sdílí s výše uvedenou větou příliš žluťoučký kůň úpěl […]

which remains unchanged after a transformation.

Is there a demo?

Yes! You can check it out here. You can edit the text at the top; the lower part shows the result of running clean on it with the default threshold.

How does it work?

In Unicode, every character is assigned a general category. Zalgo text is built from characters that belong to the categories Mn (Mark, Nonspacing) or Me (Mark, Enclosing).
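As a minimal sketch of how these two categories can be detected in JavaScript, the snippet below uses Unicode property escapes in a regular expression. The helper name countMarks is hypothetical and is not part of unzalgo; the sketch also assumes that text should be canonically decomposed (NFD) first, so that precomposed letters such as "ç" are split into a base letter plus a combining mark.

```javascript
// Matches any code point in the Unicode general categories Mn or Me.
const COMBINING_MARK = /[\p{Mn}\p{Me}]/gu;

// Hypothetical helper (not part of unzalgo): counts combining marks in a string.
// Assumption: the text is decomposed with NFD before counting.
function countMarks(text) {
	const matches = text.normalize("NFD").match(COMBINING_MARK);
	return matches === null ? 0 : matches.length;
}
```

Under these assumptions, "this" contains no marks, while "français" contains exactly one (the cedilla) once it has been decomposed.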

First, the text is divided into words; each word is then assigned a score that reflects how heavily it uses the categories above, combined with some light statistics. If the score exceeds a threshold, the text is treated as Zalgo text, which allows us to strip away all characters from the above categories.
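The word-by-word idea described above can be sketched roughly as follows. This is an illustration under stated assumptions, not unzalgo's actual implementation: the names scoreWord and stripNoisyWords are hypothetical, the score is assumed to be the plain fraction of Mn/Me code points in a decomposed word, and the statistical step mentioned above is left out entirely.

```javascript
const MARK = /[\p{Mn}\p{Me}]/u;       // tests a single code point
const ALL_MARKS = /[\p{Mn}\p{Me}]/gu; // removes every mark in a string

// Hypothetical score: fraction of code points in a word that are combining marks.
function scoreWord(word) {
	const codePoints = [...word.normalize("NFD")];
	if (codePoints.length === 0) {
		return 0;
	}
	return codePoints.filter(cp => MARK.test(cp)).length / codePoints.length;
}

// Strips marks only from words whose score exceeds the threshold (assumed rule).
function stripNoisyWords(text, threshold = 0.5) {
	return text
		.split(/(\s+)/) // the capture group keeps whitespace runs in the result
		.map(part => scoreWord(part) > threshold
			? part.normalize("NFD").replace(ALL_MARKS, "").normalize("NFC")
			: part)
		.join("");
}
```

With this simplified score, a word such as "français" scores about 0.11 and is left alone at the default threshold, while a word whose letters each carry several stacked marks is stripped. A bare ratio like this can still misfire on very short words with several legitimate diacritics, which is one reason a real implementation may want more than a simple fraction.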

Getting started

import assert from "node:assert";
import { clean, isZalgo } from "unzalgo";
/* Regular cleaning */
assert(clean("ť͈̓̆h̏̔̐̑ì̭ͯ͞s̈́̄̑͋") === "this");
/* Clean only if there are no "normal" characters in the word (t, h, i and s are "normal") */
assert(clean("ť͈̓̆h̏̔̐̑ì̭ͯ͞s̈́̄̑͋", 1) === "ť͈̓̆h̏̔̐̑ì̭ͯ͞s̈́̄̑͋");
/* Clean only if there is at least one combining character */
assert(clean("français", 0) === "francais");
/* "français" is not a Zalgo text, of course */
assert(isZalgo("français") === false);
/* Unless you define the Zalgo property as containing combining characters */
assert(isZalgo("français", 0) === true);
/* You can also define the Zalgo property as consisting of nothing but combining characters */
assert(isZalgo("français", 1) === false);

Threshold

Unzalgo functions accept a threshold option that lets you configure how sensitively unzalgo behaves. The threshold is a number between 0 and 1. A threshold of 0 indicates that the string should be classified as Zalgo text if more than 0% of its codepoints belong to the Mn or Me categories. A threshold of 1 indicates that every codepoint in the string must be categorized as either Mn or Me. The threshold defaults to 0.5.
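To make the two endpoints concrete, here is a small sketch that classifies a whole string by its share of Mn/Me codepoints. The names markRatio and exceedsThreshold are hypothetical, and the comparison rules (strictly greater than the threshold, except that 1 means "marks only") are an assumed reading of the description above rather than unzalgo's actual logic.

```javascript
// Hypothetical helper: share of code points (after NFD) that are in Mn or Me.
function markRatio(text) {
	const codePoints = [...text.normalize("NFD")];
	if (codePoints.length === 0) {
		return 0;
	}
	const marks = codePoints.filter(cp => /[\p{Mn}\p{Me}]/u.test(cp)).length;
	return marks / codePoints.length;
}

// Assumed reading of the endpoints: 0 means "any mark at all", 1 means "marks only".
function exceedsThreshold(text, threshold = 0.5) {
	const ratio = markRatio(text);
	return threshold >= 1 ? ratio >= 1 : ratio > threshold;
}
```

Under this reading, "français" has a ratio of 1/9: it is flagged at a threshold of 0, but not at the default of 0.5 or at 1, which lines up with the isZalgo results shown in Getting started.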

Exports

isZalgo(string, threshold)

Returns true if string is a Zalgo text, else false.

clean(string, threshold) [default export]

Removes all Zalgo text characters from every "likely Zalgo" word in string. Returns a representation of string without Zalgo text.

License: GNU General Public License v3.0


Languages

Language:JavaScript 100.0%