See 中文文档 for the Chinese version of this documentation.
mbstring_enhanced is a PHP mbstring extension enhanced for CJK (Chinese, Japanese, Korean) characters.
Sometimes PHP cannot reliably detect the character encoding of a string, as shown below:
<?php
$str = "头痛";
echo mb_detect_encoding($str, "UTF-8, CP936", true); // prints UTF-8
echo "\n";
echo mb_detect_encoding($str, "CP936, UTF-8", true); // prints CP936
echo "\n";
This is not a bug. The byte sequence of the string is valid in both UTF-8 and CP936 (aka GBK), so in the example above mb_detect_encoding returns whichever candidate is listed first.
PHP does its best, but the result is confusing.
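To see why both answers are plausible, one can inspect the raw bytes. The following is a minimal sketch, assuming the source file is saved as UTF-8 and the standard mbstring extension is loaded. It only illustrates the ambiguity and does not use mbstring_enhanced.

```php
<?php
// Illustrative sketch only; requires the standard mbstring extension.
$str = "头痛"; // assumes this file is saved as UTF-8

// The six bytes of the string in hexadecimal.
$hex = bin2hex($str);
echo $hex, "\n"; // prints e5a4b4e7979b

// Read as UTF-8, the bytes form two 3-byte characters (e5a4b4, e7979b).
// Read as CP936, the same bytes form three 2-byte characters
// (e5a4, b4e7, 979b), so both validity checks are expected to pass.
$valid_utf8 = mb_check_encoding($str, "UTF-8");
$valid_gbk = mb_check_encoding($str, "CP936");
var_dump($valid_utf8, $valid_gbk);
```

Because validity alone cannot separate the two candidates, some additional knowledge about the expected content is needed.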
CJK (Chinese, Japanese, Korean) text offers some clues that resolve this ambiguity, and the function mbe_is_utf8cjk uses them to solve the problem.
Yes, GBK (aka CP936) is smaller, faster and much simpler than UTF-8 for Chinese.
But problems arise when storage or databases are encoded in GBK and then receive characters encoded in UTF-8.
We hope the function mbe_strip_utf8_left_cjk can help.
(tested for PHP 5.3, PHP 5.4, PHP 5.5, PHP 5.6)
mbe_is_utf8cjk
- Check whether a string is UTF-8 encoded and contains only ASCII and CJK characters
bool mbe_is_utf8cjk ( string $str )
str
The string to check.
Returns TRUE when the string is encoded in UTF-8 and contains only ASCII characters and CJK characters.
Returns FALSE otherwise.
Example #1 mbe_is_utf8cjk() example
<?php
$is_utf8 = mbe_is_utf8cjk("I had a bad 头痛 last night.");
$encoding = $is_utf8 ? "UTF-8" : "GBK";
echo $encoding; // prints UTF-8
Example #2 mbe_is_utf8cjk() a practical example of detecting Chinese encodings
<?php
function mbe_detect_utf8_or_gbk($str) {
    // Try both candidate orders; if they agree, the result is unambiguous.
    $encoding1 = mb_detect_encoding($str, "UTF-8, CP936", true);
    $encoding2 = mb_detect_encoding($str, "CP936, UTF-8", true);
    if ($encoding1 === $encoding2) {
        return $encoding1;
    }
    // The bytes are valid in both encodings, so let mbe_is_utf8cjk decide.
    return mbe_is_utf8cjk($str) ? "UTF-8" : "GBK";
}
$encoding = mbe_detect_utf8_or_gbk("I had a bad 头痛 last night.");
echo $encoding;
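For readers who want to experiment without installing the extension, the documented behaviour can be roughly approximated in plain PHP. This is only a sketch under stated assumptions: the helper name sketch_is_utf8cjk is hypothetical, and the set of PCRE scripts treated as "CJK" (Han, Hiragana, Katakana, Hangul) is a guess, not the character ranges mbstring_enhanced actually uses.

```php
<?php
// Hypothetical helper, NOT part of mbstring_enhanced. It approximates the
// documented behaviour using PCRE Unicode script properties.
function sketch_is_utf8cjk($str) {
    // Assumption: "CJK" means the Han, Hiragana, Katakana and Hangul scripts.
    // The /u modifier makes preg_match() fail on malformed UTF-8, so
    // invalid input also yields false.
    $pattern = '/^[\x00-\x7F\p{Han}\p{Hiragana}\p{Katakana}\p{Hangul}]*$/u';
    return preg_match($pattern, $str) === 1;
}

var_dump(sketch_is_utf8cjk("I had a bad 头痛 last night.")); // bool(true)
var_dump(sketch_is_utf8cjk("price: 5€"));                    // bool(false)
// Two-byte sequences that are valid UTF-8 but decode to non-CJK code points.
var_dump(sketch_is_utf8cjk("\xCD\xB7\xCD\xB4"));             // bool(false)
```

The last call shows the kind of clue involved: the bytes are well-formed UTF-8, yet the characters they decode to are not CJK, which suggests the string was not UTF-8 CJK text.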
(tested for PHP 5.3, PHP 5.4, PHP 5.5, PHP 5.6)
mbe_strip_utf8_left_cjk
- Strip UTF-8 encoded characters that are neither ASCII nor CJK from a string
string mbe_strip_utf8_left_cjk ( string $str )
Replaces every character in the string that is neither an ASCII character nor a CJK character with blanks.
str
The UTF-8 string to strip.
Note: Passing a string that is not encoded in UTF-8 may cause errors.
Returns the stripped string, in which only ASCII characters and CJK characters are left.
Note: The returned string has the same strlen as the input string.
Example #1 mbe_strip_utf8_left_cjk() example
<?php
$stripped_str = mbe_strip_utf8_left_cjk("abcdefg€\n€zz中文dffh");
echo $stripped_str; // prints "abcdefg \n zz中文dffh"
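The length-preserving replacement can be illustrated in plain PHP. The sketch below is not the extension's implementation: the helper name sketch_strip_utf8_left_cjk is hypothetical, the "CJK" script list is an assumption, and padding each removed character with one space per byte is an assumption inferred from the note that strlen is unchanged.

```php
<?php
// Hypothetical helper, NOT the extension's implementation.
function sketch_strip_utf8_left_cjk($str) {
    // Assumption: "CJK" means the Han, Hiragana, Katakana and Hangul scripts.
    $pattern = '/[^\x00-\x7F\p{Han}\p{Hiragana}\p{Katakana}\p{Hangul}]/u';
    // Replace each unwanted character with one space per byte, so that
    // strlen() of the result equals strlen() of the input.
    // Note: preg_replace_callback() returns NULL for malformed UTF-8.
    return preg_replace_callback($pattern, function ($m) {
        return str_repeat(" ", strlen($m[0]));
    }, $str);
}

$in = "abcdefg€\n€zz中文dffh";
$out = sketch_strip_utf8_left_cjk($in);
var_dump(strlen($in) === strlen($out)); // bool(true)
```

In this sketch each "€" occupies three bytes in UTF-8 and is therefore replaced by three spaces, which keeps byte offsets into the string stable.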
- Install autoconf and php-devel.
cd mbstring_enhanced && phpize
./configure
make
make test
make install
- Add the following line to your php.ini:
extension=mbstring_enhanced.so
mbstring_enhanced.so is installed to the default extension directory, and you can then call the functions mbe_is_utf8cjk and mbe_strip_utf8_left_cjk in your PHP code.
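After restarting PHP, a quick way to confirm that the installation worked is to check that both functions are callable. This is a minimal sketch that relies only on the two function names documented here and on PHP's built-in function_exists.

```php
<?php
// Check that both functions named in this README are callable.
$functions = array("mbe_is_utf8cjk", "mbe_strip_utf8_left_cjk");
$missing = array();
foreach ($functions as $name) {
    if (!function_exists($name)) {
        $missing[] = $name;
    }
}

if (count($missing) === 0) {
    echo "mbstring_enhanced functions are available\n";
} else {
    // Typically means the extension line is missing from the php.ini in use.
    echo "missing: ", implode(", ", $missing), "\n";
}
```

Running the script with the same PHP binary (CLI or web server) that will use the extension matters, since each may read a different php.ini.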
Please file bug reports at https://github.com/chengang/mbstring_enhanced/issues