locutusjs / locutus

Bringing stdlibs of other programming languages to JavaScript for educational purposes

Home Page: https://locutus.io

ord returns different results in JS than PHP

ian opened this issue

Description

According to the PHP docs, the ord function returns an int between 0 and 255. However, with Locutus's ord I'm seeing integer values far outside that range, as high as 65533.

Here's PHP's output munging a binary string:
code: log_message('error', "'".$data[$i]."' -> '".ord($data[$i]));

ERROR - 2020-01-08 16:53:51 --> data 73="�˧ۥ���u
ERROR - 2020-01-08 16:53:51 --> '7' -> '55
ERROR - 2020-01-08 16:53:51 --> '3' -> '51
ERROR - 2020-01-08 16:53:51 --> '=' -> '61
ERROR - 2020-01-08 16:53:51 --> '' -> '15
ERROR - 2020-01-08 16:53:51 --> '"' -> '34
ERROR - 2020-01-08 16:53:51 --> '�' -> '156
ERROR - 2020-01-08 16:53:51 --> '�' -> '203
ERROR - 2020-01-08 16:53:51 --> '�' -> '167
ERROR - 2020-01-08 16:53:51 --> '�' -> '219
ERROR - 2020-01-08 16:53:51 --> '�' -> '165
ERROR - 2020-01-08 16:53:51 --> '�' -> '156
ERROR - 2020-01-08 16:53:51 --> '�' -> '179
ERROR - 2020-01-08 16:53:51 --> '�' -> '149
ERROR - 2020-01-08 16:53:51 --> 'u' -> '117
ERROR - 2020-01-08 16:53:51 --> '
' -> '10
ERROR - 2020-01-08 16:53:51 --> '�' -> '173

Here's JS:
const strings = require("locutus/php/strings")
console.log(`'${data[i]}' -> ${strings.ord(data[i])}`)

console.log lib/crypto.js:122
  data 73="�˧ۥ���u

console.log lib/crypto.js:149
  '7' -> 55

console.log lib/crypto.js:149
  '3' -> 51

console.log lib/crypto.js:149
  '=' -> 61

console.log lib/crypto.js:149
  '' -> 15

console.log lib/crypto.js:149
  '"' -> 34

console.log lib/crypto.js:149
  '�' -> 65533

console.log lib/crypto.js:149
  '˧' -> 743

console.log lib/crypto.js:149
  'ۥ' -> 1765

console.log lib/crypto.js:149
  '�' -> 65533

console.log lib/crypto.js:149
  '�' -> 65533

console.log lib/crypto.js:149
  '�' -> 65533

console.log lib/crypto.js:149
  'u' -> 117

console.log lib/crypto.js:149
  '
  ' -> 10

@kvz any thoughts on what I might be doing wrong? Would appreciate any help or insight you might have.

Hi @ian! ord in JavaScript is an inherently flawed concept. PHP's strings are series of 8-bit bytes and are therefore suitable for both binary data and text (using encodings such as UTF-8). JavaScript's strings are based on 16-bit UTF-16 code units and were not designed for binary data, only for text.

There are multiple ways to store binary data in JS strings and none are good: you either waste memory or get an inconvenient way of accessing individual bytes. You can limit yourself to using only 8 bits per 16-bit element (so two bytes with values 1 and 2 would be "\u0001\u0002" - Locutus's ord handles this correctly) or you can pack two bytes together ("\u0102" or "\u0201" - Locutus's ord is not built for this). They're both valid choices.
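The two storage schemes described above can be sketched as follows. This is only an illustration using the byte values 1 and 2 from the comment. The variable names are made up for the example, and the packed variant assumes big-endian order (high byte first), which is just one of the two possibilities mentioned.

```javascript
// Two example bytes, as in the comment above.
const bytes = [1, 2];

// Scheme A: one byte per 16-bit code unit (half of every unit is unused).
const onePerUnit = String.fromCharCode(...bytes); // "\u0001\u0002"
const readA = [];
for (let i = 0; i < onePerUnit.length; i++) {
  // Each code unit is already a byte value in the range 0-255.
  readA.push(onePerUnit.charCodeAt(i));
}

// Scheme B: two bytes packed into one code unit (assumed big-endian here).
const packed = String.fromCharCode((bytes[0] << 8) | bytes[1]); // "\u0102"
const unit = packed.charCodeAt(0);
// Reading bytes back needs shifting and masking.
const readB = [unit >> 8, unit & 0xff];

console.log(readA); // [ 1, 2 ]
console.log(readB); // [ 1, 2 ]
console.log(packed.length); // 1
```

Scheme A keeps byte access simple at the cost of memory, while scheme B halves the string length but makes every byte read a shift-and-mask operation.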

To answer your question of what you're doing wrong, it's probably that you're storing binary data in JavaScript strings. They're just not made for it. I recommend looking into using Uint8Array instead.
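As a rough sketch of the Uint8Array suggestion, the snippet below takes the byte values from the PHP log in this issue and reads them two ways: directly as bytes, and after decoding them as UTF-8 text. The assumption that the data was decoded as UTF-8 somewhere before reaching ord is not confirmed in the thread. It is inferred from the fact that the decoded values line up with the reported JS output.

```javascript
// Byte values copied from the PHP log in this issue.
const bytes = new Uint8Array([
  55, 51, 61, 15, 34, 156, 203, 167, 219, 165, 156, 179, 149, 117, 10, 173,
]);

// Reading the typed array directly yields the same numbers PHP's ord() printed.
const byteValues = Array.from(bytes);

// Assumption: the binary data was decoded as UTF-8 text at some point.
// Invalid bytes become U+FFFD (65533) and valid two-byte sequences
// collapse into a single character, so the original bytes are lost.
const decoded = new TextDecoder("utf-8").decode(bytes);
const codeUnits = Array.from(decoded, (ch) => ch.charCodeAt(0));

console.log(byteValues.slice(5, 8)); // [ 156, 203, 167 ]
console.log(codeUnits.slice(5, 8)); // [ 65533, 743, 1765 ]
```

Bytes 203 and 167 form the valid UTF-8 sequence for U+02E7 (743), and 219 and 165 form U+06E5 (1765), which matches the values in the JS log above. Keeping the data in a Uint8Array (or a Node.js Buffer) avoids this lossy decoding step entirely.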

The PHP documentation has traditionally described ord as returning the ASCII value of the first character of the string. In practice it returns the value of the string's first byte, which ranges from 0 to 255 (ASCII proper only covers 0 to 127), which is why you're seeing references to that range.

However, the values above 255 that you're seeing (up to 65533) do not come from PHP's ord, which never exceeds 255. They come from the JavaScript side, where the string holds decoded characters rather than raw bytes. Once data is interpreted as UTF-8 or another multi-byte encoding, a single character can span several bytes and its code point can lie well beyond 255.

The JavaScript ord you're calling uses charCodeAt(0), which returns the UTF-16 code unit at the specified index, a value between 0 and 65535. Characters outside the BMP (Basic Multilingual Plane) have code points that require more than 16 bits, so JavaScript represents them as surrogate pairs: two code units that together encode one code point of 65536 or higher. Values such as 65533 are ordinary BMP code units (65533 is U+FFFD, the replacement character) and do not involve surrogate pairs.

The JavaScript ord function is explicitly designed to handle surrogate pairs, which is why it returns values like 65536 for a single Unicode character represented by a surrogate pair (for example \uD800\uDC00). This goes beyond PHP's ord, which only looks at a single byte and has no notion of surrogate pairs.
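The surrogate pair arithmetic mentioned above can be sketched like this. The helper name codePointFromSurrogates is hypothetical and is not taken from the Locutus source. It only applies the standard UTF-16 formula, and the built-in codePointAt is shown for comparison.

```javascript
// Hypothetical helper (not Locutus's actual code): combine a high and a
// low surrogate into one code point using the standard UTF-16 formula.
function codePointFromSurrogates(hi, lo) {
  return (hi - 0xd800) * 0x400 + (lo - 0xdc00) + 0x10000;
}

const s = "\uD800\uDC00"; // one character stored as two UTF-16 code units

// charCodeAt sees only the individual 16-bit code units.
const hi = s.charCodeAt(0);
const lo = s.charCodeAt(1);

const combined = codePointFromSurrogates(hi, lo);

console.log(hi); // 55296
console.log(lo); // 56320
console.log(combined); // 65536
console.log(s.codePointAt(0)); // 65536
```

Neither result corresponds to a byte value, so this mechanism is about text code points and does not help with reading binary data.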

In summary, PHP's ord is documented with a return range of 0-255 because it operates on a single byte. The larger values you're encountering come from the JavaScript side, where the string contains decoded multi-byte characters rather than raw bytes, and where the JavaScript implementation is explicitly designed to account for Unicode characters that may require surrogate pairs, hence the higher values.