open-i18n / rust-unic

UNIC: Unicode and Internationalization Crates for Rust

Home Page: https://crates.io/crates/unic

WB3d: Keep horizontal whitespace together.

scottmcm opened this issue

From the examples in https://docs.rs/unic-segment/0.9.0/unic_segment/ it appears that this crate (like unicode-segmentation) places a word boundary between two consecutive spaces:

assert_eq!(
    WordBounds::new("The quick (\"brown\")  fox").collect::<Vec<&str>>(),
    &["The", " ", "quick", " ", "(", "\"", "brown", "\"", ")", " ", " ", "fox"]
);

However, rule WB3d says "Keep horizontal whitespace together.", with the rule "WSegSpace × WSegSpace". The test file https://www.unicode.org/Public/UCD/latest/ucd/auxiliary/WordBreakTest.txt confirms that there should be no break between consecutive spaces:

÷ 0020 × 0020 ÷	#  ÷ [0.2] SPACE (WSegSpace) × [3.4] SPACE (WSegSpace) ÷ [0.3]
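
To make the expected result concrete, here is a minimal sketch of applying WB3d as a post-processing step over segments produced without that rule. It is not part of unic-segment: `is_wsegspace` and `merge_horizontal_whitespace` are hypothetical helpers, the character list is written out by hand from the WSegSpace definition (General_Category Zs, excluding Line_Break Glue), and the segments are assumed to be contiguous slices covering `text` from its start.

```rust
/// Hypothetical helper (not a unic-segment API): true for characters whose
/// Word_Break property is WSegSpace, i.e. General_Category = Zs excluding
/// the Line_Break = Glue spaces U+00A0, U+2007 and U+202F.
fn is_wsegspace(c: char) -> bool {
    matches!(
        c,
        '\u{0020}'
            | '\u{1680}'
            | '\u{2000}'..='\u{2006}'
            | '\u{2008}'..='\u{200A}'
            | '\u{205F}'
            | '\u{3000}'
    )
}

/// Hypothetical helper: removes every boundary that WB3d forbids.
/// Assumes `segments` are contiguous slices that cover `text` from byte 0.
fn merge_horizontal_whitespace<'a>(text: &'a str, segments: &[&str]) -> Vec<&'a str> {
    // Byte ranges of the merged segments within `text`.
    let mut ranges: Vec<(usize, usize)> = Vec::new();
    let mut offset = 0;

    for seg in segments {
        let (start, end) = (offset, offset + seg.len());
        offset = end;

        // WB3d: WSegSpace × WSegSpace. Only the two characters directly
        // adjacent to the candidate boundary matter.
        let starts_with_space = seg.chars().next().map_or(false, is_wsegspace);
        let prev_ends_with_space = ranges.last().map_or(false, |&(s, e)| {
            text[s..e].chars().next_back().map_or(false, is_wsegspace)
        });

        if starts_with_space && prev_ends_with_space {
            // No break allowed here: extend the previous segment.
            if let Some(last) = ranges.last_mut() {
                last.1 = end;
            }
        } else {
            ranges.push((start, end));
        }
    }

    ranges.into_iter().map(|(s, e)| &text[s..e]).collect()
}
```

Fed the segments from the documentation example above, this sketch would leave every boundary in place except the one between the two spaces before "fox", which become a single two-space segment.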

Is this a bug, or am I misunderstanding something?

Ah, this is a duplicate of #259: the rule first appeared in the reissue of UAX #29 for Unicode 11 (http://www.unicode.org/reports/tr29/tr29-33.html#Modifications).