emojicode / emojicode

😀😜🔂 World’s only programming language that’s bursting with emojis

Home Page: https://emojicode.org

Enable all valid emoji per latest Unicode spec to be used in Emojicode

joeskeen opened this issue

I tried doing this:

πŸ‡ πŸ§‡ πŸ‡
πŸ‰

But got this:
🚨 error: Expected Identifier but instead found Variable(πŸ§‡).

If I change the code to this:

πŸ‡ 🐹 πŸ‡
πŸ‰

It works fine.

I used the search feature of the docs to see where 🧇 is defined or used. There are several results, but none of them mention 🧇. I searched the Emojicode GitHub organization: no code or issue results. I'm very confused 😕. Why can't I make a 🧇 class?

It would be nice if there was a single list someplace that defined all the reserved emoji, and ideally, what they were reserved for.

commented

The issue is that the compiler doesn't recognize 🧇 as an emoji but just as some plain non-emoji character. As the error message says, this leads to the compiler classifying it as a variable name.

This shows that we need to update the compiler to the latest Unicode emoji standard. Hopefully, I'll be able to tackle this next week alongside some other open issues.

But it is not a documentation issue: 🧇 (code point U+1F9C7) does not match the emoji grammar rule as specified here.
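To make the failure mode concrete, here is a minimal sketch of the classification described above. It is not the compiler's actual code: `isEmojiOutdated` and `classify` are hypothetical names, and the single range is a made-up stand-in for a table that predates U+1F9C7.

```javascript
// Hypothetical, simplified stand-in for an emoji table that predates U+1F9C7.
// The single range is invented for illustration only.
function isEmojiOutdated(codePoint) {
  return codePoint >= 0x1f300 && codePoint <= 0x1f9c0;
}

// Classify a token by its first code point: known emoji become identifiers,
// anything else falls through to a variable name.
function classify(token) {
  const codePoint = token.codePointAt(0);
  return isEmojiOutdated(codePoint) ? "Identifier" : "Variable";
}

console.log(classify("\u{1F439}")); // hamster, U+1F439 -> Identifier
console.log(classify("\u{1F9C7}")); // waffle, U+1F9C7 -> Variable
```

Under these assumptions the waffle is rejected only because the table is too old, which matches the error message in the report.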

I did find this JSON file that seems to be a complete list of all supported emoji currently in the Emoji Unicode spec:

https://unpkg.com/unicode-emoji-json/data-by-emoji.json

I think a script could be written that transforms this list into code that could be inserted here: Compiler/Lex/EmojiTokenization.cpp so all valid emoji could be recognized.
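A rough sketch of the first step of such a script, assuming the JSON file is an object keyed by emoji string (the inline `sampleData` stands in for the downloaded file, and `toCodePoints` and `toHexSequences` are hypothetical helper names):

```javascript
// Assumed input shape: an object keyed by emoji string; the values are not used here.
const sampleData = {
  "\u{1F9C7}": {},           // waffle, one code point
  "\u{1F44B}\u{1F3FB}": {},  // waving hand + light skin tone, two code points
};

// Array.from iterates a string by code point, not by UTF-16 code unit
function toCodePoints(emoji) {
  return Array.from(emoji, (ch) => ch.codePointAt(0));
}

// Hypothetical helper: hex code point sequences, ready for a code generator
function toHexSequences(data) {
  return Object.keys(data).map((emoji) =>
    toCodePoints(emoji).map((cp) => cp.toString(16))
  );
}

console.log(toHexSequences(sampleData));
```

Turning these sequences into C++ for Compiler/Lex/EmojiTokenization.cpp is a separate step that this sketch does not cover.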

commented

I have a script that can generate the code for Compiler/Lex/EmojiTokenization.cpp from the official Unicode Emoji data. It requires some manual labor nonetheless as parsing emojis for our purposes is very complex. I'll also push the script to the repository when I make the changes.

@thbwd looks like another year has come and gone, and another version of the Emoji spec has been published: https://unicode.org/Public/emoji/. Could you please share the script you used to generate Compiler/Lex/EmojiTokenization.cpp? I'd love to look into making updates to this file painless and free of manual labor.

I took some time and wrote a script to generate the dynamic parts of Compiler/Lex/EmojiTokenization.cpp:

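// Run as an ES module (top-level await) in a runtime that provides a global fetch.
const response = await fetch("https://unicode.org/Public/emoji/latest/emoji-test.txt");
const text = await response.text();
// Captures: (1) space-separated hex code points, (2) the emoji itself, (3) version and name
const emojiPattern = /^([0-9A-F ]+?)\s+;.*?# ([^A-Za-z0-9 ]+) (.*)$/gm;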
const emoji = [];
let match;
while ((match = emojiPattern.exec(text)) !== null) {
  emoji.push({
    hex: match[1].split(" "),
    grapheme: match[2],
    description: match[3],
  });
}

const singlePointEmoji = emoji.filter((e) => e.hex.length === 1).map(e => e.hex[0]);
const isEmojiFunction = createCppFunctionToCheckIfCharIsInRange(
    singlePointEmoji,
    "isEmoji"
);
console.log(isEmojiFunction);

const multiPointEmojiModifierBases = emoji
    .filter((e) => e.hex.length > 1)
    .map(e => e.hex[0])
    .filter((value, index, self) => self.indexOf(value) === index);
const isEmojiModifierBaseFunction = createCppFunctionToCheckIfCharIsInRange(
    multiPointEmojiModifierBases,
    "isEmojiModifierBase"
);
console.log(isEmojiModifierBaseFunction);

function createCppFunctionToCheckIfCharIsInRange(codePoints, functionName) {
  const sortedCodePoints = codePoints
    .map((h) => parseInt(h, 16))
    .sort((a, b) => a - b);
  // create a list of ranges of consecutive code points
  const rangeLists = [];
  let range = [];
  for (const codePoint of sortedCodePoints) {
    if (range.length === 0 || codePoint === range[range.length - 1] + 1) {
      range.push(codePoint);
    } else {
      rangeLists.push(range);
      range = [codePoint];
    }
  }
  // flush the final range; without this the highest code points are dropped
  if (range.length > 0) {
    rangeLists.push(range);
  }
  const ranges = rangeLists.map((r) => ({
    start: r[0],
    end: r[r.length - 1],
  }));

  return `bool ${functionName}(char32_t c) {
    switch(c) {
${ranges
  .map(
    (r) =>
      `        case 0x${
        r.start === r.end
          ? r.start.toString(16)
          : `${r.start.toString(16)} ... 0x${r.end.toString(16)}`
      }:`
  )
  .join("\n")}
            return true;
        default: return false;
    }
}`;
}

Running this results in the following C++ code:

bool isEmoji(char32_t c) {
    switch(c) {
        case 0xa9:
        case 0xae:
        case 0x203c:
        case 0x2049:
        case 0x2122:
        case 0x2139:
        case 0x2194 ... 0x2199:
        case 0x21a9 ... 0x21aa:
        case 0x231a ... 0x231b:
        case 0x2328:
        case 0x23cf:
        case 0x23e9 ... 0x23f3:
        case 0x23f8 ... 0x23fa:
        case 0x24c2:
        case 0x25aa ... 0x25ab:
        case 0x25b6:
        case 0x25c0:
        case 0x25fb ... 0x25fe:
        case 0x2600 ... 0x2604:
        case 0x260e:
        case 0x2611:
        case 0x2614 ... 0x2615:
        case 0x2618:
        case 0x261d:
        case 0x2620:
        case 0x2622 ... 0x2623:
        case 0x2626:
        case 0x262a:
        case 0x262e ... 0x262f:
        case 0x2638 ... 0x263a:
        case 0x2640:
        case 0x2642:
        case 0x2648 ... 0x2653:
        case 0x265f ... 0x2660:
        case 0x2663:
        case 0x2665 ... 0x2666:
        case 0x2668:
        case 0x267b:
        case 0x267e ... 0x267f:
        case 0x2692 ... 0x2697:
        case 0x2699:
        case 0x269b ... 0x269c:
        case 0x26a0 ... 0x26a1:
        case 0x26a7:
        case 0x26aa ... 0x26ab:
        case 0x26b0 ... 0x26b1:
        case 0x26bd ... 0x26be:
        case 0x26c4 ... 0x26c5:
        case 0x26c8:
        case 0x26ce ... 0x26cf:
        case 0x26d1:
        case 0x26d3 ... 0x26d4:
        case 0x26e9 ... 0x26ea:
        case 0x26f0 ... 0x26f5:
        case 0x26f7 ... 0x26fa:
        case 0x26fd:
        case 0x2702:
        case 0x2705:
        case 0x2708 ... 0x270d:
        case 0x270f:
        case 0x2712:
        case 0x2714:
        case 0x2716:
        case 0x271d:
        case 0x2721:
        case 0x2728:
        case 0x2733 ... 0x2734:
        case 0x2744:
        case 0x2747:
        case 0x274c:
        case 0x274e:
        case 0x2753 ... 0x2755:
        case 0x2757:
        case 0x2763 ... 0x2764:
        case 0x2795 ... 0x2797:
        case 0x27a1:
        case 0x27b0:
        case 0x27bf:
        case 0x2934 ... 0x2935:
        case 0x2b05 ... 0x2b07:
        case 0x2b1b ... 0x2b1c:
        case 0x2b50:
        case 0x2b55:
        case 0x3030:
        case 0x303d:
        case 0x3297:
        case 0x3299:
        case 0x1f004:
        case 0x1f0cf:
        case 0x1f170 ... 0x1f171:
        case 0x1f17e ... 0x1f17f:
        case 0x1f18e:
        case 0x1f191 ... 0x1f19a:
        case 0x1f201 ... 0x1f202:
        case 0x1f21a:
        case 0x1f22f:
        case 0x1f232 ... 0x1f23a:
        case 0x1f250 ... 0x1f251:
        case 0x1f300 ... 0x1f321:
        case 0x1f324 ... 0x1f393:
        case 0x1f396 ... 0x1f397:
        case 0x1f399 ... 0x1f39b:
        case 0x1f39e ... 0x1f3f0:
        case 0x1f3f3 ... 0x1f3f5:
        case 0x1f3f7 ... 0x1f4fd:
        case 0x1f4ff ... 0x1f53d:
        case 0x1f549 ... 0x1f54e:
        case 0x1f550 ... 0x1f567:
        case 0x1f56f ... 0x1f570:
        case 0x1f573 ... 0x1f57a:
        case 0x1f587:
        case 0x1f58a ... 0x1f58d:
        case 0x1f590:
        case 0x1f595 ... 0x1f596:
        case 0x1f5a4 ... 0x1f5a5:
        case 0x1f5a8:
        case 0x1f5b1 ... 0x1f5b2:
        case 0x1f5bc:
        case 0x1f5c2 ... 0x1f5c4:
        case 0x1f5d1 ... 0x1f5d3:
        case 0x1f5dc ... 0x1f5de:
        case 0x1f5e1:
        case 0x1f5e3:
        case 0x1f5e8:
        case 0x1f5ef:
        case 0x1f5f3:
        case 0x1f5fa ... 0x1f64f:
        case 0x1f680 ... 0x1f6c5:
        case 0x1f6cb ... 0x1f6d2:
        case 0x1f6d5 ... 0x1f6d7:
        case 0x1f6dc ... 0x1f6e5:
        case 0x1f6e9:
        case 0x1f6eb ... 0x1f6ec:
        case 0x1f6f0:
        case 0x1f6f3 ... 0x1f6fc:
        case 0x1f7e0 ... 0x1f7eb:
        case 0x1f7f0:
        case 0x1f90c ... 0x1f93a:
        case 0x1f93c ... 0x1f945:
        case 0x1f947 ... 0x1f9ff:
        case 0x1fa70 ... 0x1fa7c:
        case 0x1fa80 ... 0x1fa88:
        case 0x1fa90 ... 0x1fabd:
        case 0x1fabf ... 0x1fac5:
        case 0x1face ... 0x1fadb:
        case 0x1fae0 ... 0x1fae8:
        case 0x1faf0 ... 0x1faf8:
            return true;
        default: return false;
    }
}
bool isEmojiModifierBase(char32_t c) {
    switch(c) {
        case 0x23:
        case 0x2a:
        case 0xa9:
        case 0xae:
        case 0x203c:
        case 0x2049:
        case 0x2122:
        case 0x2139:
        case 0x2194 ... 0x2199:
        case 0x21a9 ... 0x21aa:
        case 0x2328:
        case 0x23cf:
        case 0x23ed ... 0x23ef:
        case 0x23f1 ... 0x23f2:
        case 0x23f8 ... 0x23fa:
        case 0x24c2:
        case 0x25aa ... 0x25ab:
        case 0x25b6:
        case 0x25c0:
        case 0x25fb ... 0x25fc:
        case 0x2600 ... 0x2604:
        case 0x260e:
        case 0x2611:
        case 0x2618:
        case 0x261d:
        case 0x2620:
        case 0x2622 ... 0x2623:
        case 0x2626:
        case 0x262a:
        case 0x262e ... 0x262f:
        case 0x2638 ... 0x263a:
        case 0x2640:
        case 0x2642:
        case 0x265f ... 0x2660:
        case 0x2663:
        case 0x2665 ... 0x2666:
        case 0x2668:
        case 0x267b:
        case 0x267e:
        case 0x2692:
        case 0x2694 ... 0x2697:
        case 0x2699:
        case 0x269b ... 0x269c:
        case 0x26a0:
        case 0x26a7:
        case 0x26b0 ... 0x26b1:
        case 0x26c8:
        case 0x26cf:
        case 0x26d1:
        case 0x26d3:
        case 0x26e9:
        case 0x26f0 ... 0x26f1:
        case 0x26f4:
        case 0x26f7 ... 0x26f9:
        case 0x2702:
        case 0x2708 ... 0x270d:
        case 0x270f:
        case 0x2712:
        case 0x2714:
        case 0x2716:
        case 0x271d:
        case 0x2721:
        case 0x2733 ... 0x2734:
        case 0x2744:
        case 0x2747:
        case 0x2763 ... 0x2764:
        case 0x27a1:
        case 0x2934 ... 0x2935:
        case 0x2b05 ... 0x2b07:
        case 0x3030:
        case 0x303d:
        case 0x3297:
        case 0x3299:
        case 0x1f170 ... 0x1f171:
        case 0x1f17e ... 0x1f17f:
        case 0x1f1e6 ... 0x1f1ff:
        case 0x1f202:
        case 0x1f237:
        case 0x1f321:
        case 0x1f324 ... 0x1f32c:
        case 0x1f336:
        case 0x1f344:
        case 0x1f34b:
        case 0x1f37d:
        case 0x1f385:
        case 0x1f396 ... 0x1f397:
        case 0x1f399 ... 0x1f39b:
        case 0x1f39e ... 0x1f39f:
        case 0x1f3c2 ... 0x1f3c4:
        case 0x1f3c7:
        case 0x1f3ca ... 0x1f3ce:
        case 0x1f3d4 ... 0x1f3df:
        case 0x1f3f3 ... 0x1f3f5:
        case 0x1f3f7:
        case 0x1f408:
        case 0x1f415:
        case 0x1f426:
        case 0x1f43b:
        case 0x1f43f:
        case 0x1f441 ... 0x1f443:
        case 0x1f446 ... 0x1f450:
        case 0x1f466 ... 0x1f469:
        case 0x1f46b ... 0x1f478:
        case 0x1f47c:
        case 0x1f481 ... 0x1f483:
        case 0x1f485 ... 0x1f487:
        case 0x1f48f:
        case 0x1f491:
        case 0x1f4aa:
        case 0x1f4fd:
        case 0x1f549 ... 0x1f54a:
        case 0x1f56f ... 0x1f570:
        case 0x1f573 ... 0x1f57a:
        case 0x1f587:
        case 0x1f58a ... 0x1f58d:
        case 0x1f590:
        case 0x1f595 ... 0x1f596:
        case 0x1f5a5:
        case 0x1f5a8:
        case 0x1f5b1 ... 0x1f5b2:
        case 0x1f5bc:
        case 0x1f5c2 ... 0x1f5c4:
        case 0x1f5d1 ... 0x1f5d3:
        case 0x1f5dc ... 0x1f5de:
        case 0x1f5e1:
        case 0x1f5e3:
        case 0x1f5e8:
        case 0x1f5ef:
        case 0x1f5f3:
        case 0x1f5fa:
        case 0x1f62e:
        case 0x1f635 ... 0x1f636:
        case 0x1f642:
        case 0x1f645 ... 0x1f647:
        case 0x1f64b ... 0x1f64f:
        case 0x1f6a3:
        case 0x1f6b4 ... 0x1f6b6:
        case 0x1f6c0:
        case 0x1f6cb ... 0x1f6cf:
        case 0x1f6e0 ... 0x1f6e5:
        case 0x1f6e9:
        case 0x1f6f0:
        case 0x1f6f3:
        case 0x1f90c:
        case 0x1f90f:
        case 0x1f918 ... 0x1f91f:
        case 0x1f926:
        case 0x1f930 ... 0x1f939:
        case 0x1f93c ... 0x1f93e:
        case 0x1f977:
        case 0x1f9b5 ... 0x1f9b6:
        case 0x1f9b8 ... 0x1f9b9:
        case 0x1f9bb:
        case 0x1f9cd ... 0x1f9cf:
        case 0x1f9d1 ... 0x1f9df:
        case 0x1fac3 ... 0x1fac5:
        case 0x1faf0 ... 0x1faf8:
            return true;
        default: return false;
    }
}

This should match any emoji in the current Unicode spec. I'm going to do some more testing, then I'll put in a PR to update it.
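One way to spot-check the result is to run the same pattern over a couple of sample lines and test the parsed code point against the generated ranges. The sketch below assumes two lines in the layout used by emoji-test.txt (typed in by hand as sample data) and copies only three ranges from the generated isEmoji; `inRanges` is a hypothetical helper, not part of the compiler.

```javascript
// Same pattern as in the script above
const emojiPattern = /^([0-9A-F ]+?)\s+;.*?# ([^A-Za-z0-9 ]+) (.*)$/gm;

// Two lines in the layout used by emoji-test.txt (sample data for this sketch)
const sample = [
  "1F9C7                                                  ; fully-qualified     # \u{1F9C7} E12.0 waffle",
  "1F44B 1F3FB                                            ; fully-qualified     # \u{1F44B}\u{1F3FB} E1.0 waving hand: light skin tone",
].join("\n");

// Parse each line into its hex code points and description
const parsed = [...sample.matchAll(emojiPattern)].map((m) => ({
  hex: m[1].split(" "),
  description: m[3],
}));

// A few ranges copied from the generated isEmoji function
const ranges = [[0x1f90c, 0x1f93a], [0x1f93c, 0x1f945], [0x1f947, 0x1f9ff]];
const inRanges = (cp) => ranges.some(([start, end]) => cp >= start && cp <= end);

// The waffle from the original report should now be recognized
const waffle = parseInt(parsed[0].hex[0], 16);
console.log(parsed.map((p) => p.hex.join("+")), inRanges(waffle));
```

With these inputs the waffle (U+1F9C7) falls inside the 0x1f947 ... 0x1f9ff range, so a class named 🧇 should no longer be lexed as a variable.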