Wrapping based on byte length instead of unicode width

Question

anna-is-cute opened this issue 3 years ago

Hi, thanks for the awesome crate.

I have a situation where I need to pass UTF-8 strings to an external system, each string no more than 500 bytes in length. If the user provides a string longer than 500 bytes, I need to break it into separate messages 500 bytes or shorter and send them individually.

I was hoping that disabling the unicode-width feature would allow me to do this with textwrap, but that doesn't appear to be the case. Is there an easy way to do this using textwrap?

As a really simple example, the Japanese katakana for "ka" is カ, which is 3 bytes in UTF-8. That means that 166 of them could fit in a single line (498 bytes), but I would need to wrap any further カ characters into a new line.
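The arithmetic can be checked with the standard library alone. This is only a minimal sketch: `max_repeats` is a hypothetical helper name introduced for illustration, not part of textwrap.

```rust
/// Hypothetical helper: how many whole copies of `unit` fit into `budget` bytes.
/// Returns 0 for an empty `unit` instead of dividing by zero.
fn max_repeats(unit: &str, budget: usize) -> usize {
    // `str::len` is the UTF-8 byte length, not the number of characters.
    budget.checked_div(unit.len()).unwrap_or(0)
}
```

With a 500-byte budget, `max_repeats("カ", 500)` gives 166, and a 167th カ would bring the total to 501 bytes.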

Martin Geisler · Answer 1 · Wed Jun 09 2021 05:52:37 GMT+0800 (China Standard Time)

Hi @ascclemens, thanks for the kind words!

You're right that disabling the unicode-width feature isn't enough: this simply makes Textwrap count one char as 1 column, regardless of how many bytes the UTF-8 encoding of that char is.
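To make the mismatch concrete, here is a small standard-library sketch. `char_columns` and `byte_length` are hypothetical names, and `char_columns` only approximates the one-column-per-char counting described above.

```rust
/// Roughly the width seen when every `char` counts as one column.
fn char_columns(s: &str) -> usize {
    s.chars().count()
}

/// The width that matters for a byte-limited transport: UTF-8 bytes.
fn byte_length(s: &str) -> usize {
    s.len()
}
```

For ASCII text the two agree, but for カ or 😂 the byte length is three or four times the column count, so a limit expressed in columns says little about the byte size.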

There is no built-in way to use the byte length as the "display width", but you can write your own Fragment implementation and go nuts. This seems to do the trick:

use textwrap::core::{Fragment, Word};
use textwrap::word_separators::{UnicodeBreakProperties, WordSeparator};
use textwrap::wrap_algorithms::wrap_optimal_fit;
use unicode_segmentation::UnicodeSegmentation;

#[derive(Debug)]
struct ByteLengthWord<'a> {
    word: &'a str,
    whitespace: &'a str,
}

impl Fragment for ByteLengthWord<'_> {
    #[inline]
    fn width(&self) -> usize {
        self.word.len() // <- byte length becomes the word width!
    }

    #[inline]
    fn whitespace_width(&self) -> usize {
        self.whitespace.len()
    }

    #[inline]
    fn penalty_width(&self) -> usize {
        0 // <- these words will not be hyphenated, so no "-" inserted anywhere
    }
}

impl<'a> ByteLengthWord<'a> {
    fn from(word: Word<'a>) -> Self {
        ByteLengthWord {
            word: word.word,
            whitespace: word.whitespace,
        }
    }

    fn break_apart(self, max_width: usize) -> Vec<ByteLengthWord<'a>> {
        if self.width() <= max_width {
            return vec![self];
        }

        let mut start = 0;
        let mut parts = Vec::new();
        for (idx, grapheme) in self.word.grapheme_indices(true) {
            let with_grapheme = &self.word[start..idx + grapheme.len()];
            let without_grapheme = &self.word[start..idx];
            if idx > 0 && with_grapheme.len() > max_width {
                parts.push(ByteLengthWord {
                    word: without_grapheme,
                    whitespace: "",
                });
                start = idx;
            }
        }

        parts.push(ByteLengthWord {
            word: &self.word[start..],
            whitespace: self.whitespace,
        });

        parts
    }
}

fn chunk_line(line: &str, chunk_length: usize) -> Vec<&str> {
    let words = UnicodeBreakProperties.find_words(line);
    let byte_length_words = words
        .flat_map(|word| ByteLengthWord::from(word).break_apart(chunk_length))
        .collect::<Vec<_>>();

    let line_lengths = [chunk_length];
    let wrapped_words = wrap_optimal_fit(&byte_length_words, &line_lengths);

    let mut idx = 0;
    let mut chunks = Vec::new();
    for words_in_line in wrapped_words {
        let line_len = words_in_line
            .iter()
            .map(|w| w.word.len() + w.whitespace.len())
            .sum::<usize>();

        // If you want to avoid trailing whitespace, subtract the last
        // whitespace here...
        //
        // let line_len = line_len - words_in_line.last().map_or(0, |w| w.whitespace.len());

        chunks.push(&line[idx..idx + line_len]);
        idx += line_len;

        // ... and then skip over the whitespace here:
        //
        // idx += words_in_line.last().map_or(0, |w| w.whitespace.len());
    }

    chunks
}

fn main() {
    let chunks = chunk_line(
        "To split 😂 or not to split 😂... that is the question!",
        20,
    );
    println!("chunks: {:#?}", chunks);
}

This code is a simplified version of the code in textwrap::wrap. It is also inspired by similar code in the Wasm demo. In short, it uses the building blocks directly, with a kind of "word" whose width is its byte length instead of something more complicated.

Depending on your use case, I would probably not use the above code... instead, see if you can iterate over line.split(' ') and send those substrings. The only real reasons to use the code would be:

- you need to handle "words" longer than 500 bytes. This can be long URLs, it could be someone's cat sleeping on the keyboard, etc... The example handles this by forcibly breaking apart such long words on grapheme boundaries (you could also break on char boundaries if that is good enough).
- you need to handle languages without ' ' as the word separator. The example uses the full Unicode line breaking algorithm, which means that it will break between emojis and East-Asian characters.
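For comparison, the simpler route could look roughly like the sketch below, which uses only the standard library. Both function names are hypothetical. It packs space-separated words greedily into chunks of at most `max_bytes` bytes and falls back to breaking overlong words on char boundaries (not grapheme boundaries). It assumes `max_bytes >= 4` so that any single char fits, it collapses repeated spaces, and it knows nothing about Unicode line breaking, so text without spaces is simply cut at the byte limit.

```rust
/// Hypothetical helper: split `word` into pieces of at most `max_bytes`
/// bytes, cutting only at `char` boundaries.
fn hard_break_on_chars(word: &str, max_bytes: usize) -> Vec<&str> {
    assert!(max_bytes >= 4, "max_bytes must be able to hold any UTF-8 char");
    let mut pieces = Vec::new();
    let mut rest = word;
    while rest.len() > max_bytes {
        // Walk back from the byte limit to the nearest char boundary.
        let mut cut = max_bytes;
        while !rest.is_char_boundary(cut) {
            cut -= 1;
        }
        let (head, tail) = rest.split_at(cut);
        pieces.push(head);
        rest = tail;
    }
    if !rest.is_empty() {
        pieces.push(rest);
    }
    pieces
}

/// Hypothetical helper: greedily pack space-separated words into chunks
/// of at most `max_bytes` UTF-8 bytes.
fn chunk_on_spaces(line: &str, max_bytes: usize) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut current = String::new();
    let words = line.split(' ').filter(|w| !w.is_empty());
    for piece in words.flat_map(|w| hard_break_on_chars(w, max_bytes)) {
        // One extra byte for the joining space, unless the chunk is empty.
        let joined_len = if current.is_empty() {
            piece.len()
        } else {
            current.len() + 1 + piece.len()
        };
        if joined_len > max_bytes {
            // The piece does not fit: finish the current chunk first.
            chunks.push(std::mem::take(&mut current));
        }
        if !current.is_empty() {
            current.push(' ');
        }
        current.push_str(piece);
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}
```

Applied to the sentence from the example above with a limit of 20 bytes, this yields the four chunks "To split 😂 or not", "to split 😂...", "that is the" and "question!". Unlike the textwrap-based version, the result is greedy rather than optimal-fit.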

Please let me know if I can help more!

Anna · Answer 2 · Wed Jun 09 2021 06:44:33 GMT+0800 (China Standard Time)

Thank you for the very detailed response! As it stands, I do need to handle languages that don't use ' ' to separate words, notably CJK languages.

I figured I'd need to implement some trait, but I wasn't sure which ones... so thank you for the example code. Indeed, as you recommended, I am not going to use it, though. I ended up using the below, using unicode-segmentation and unicode-linebreak:

use unicode_segmentation::UnicodeSegmentation; // for `grapheme_indices`

pub(crate) fn inner_wrap(input: &str, width: u32) -> Vec<String> {
    let width = width as usize;
    let mut strings = Vec::new();
    let mut last_break = 0;
    let mut last_idx = 0;
    for (idx, _) in unicode_linebreak::linebreaks(input) {
        if idx == input.len() {
            continue;
        }

        if idx > last_break + width {
            let segment = input[last_break..last_idx].trim();
            if segment.is_empty() {
                // No earlier break opportunity: avoid emitting an empty string.
            } else if segment.len() <= width {
                strings.push(segment.to_owned());
            } else {
                hard_break(&mut strings, segment, width);
            }
            last_break = last_idx;
        }

        last_idx = idx;
    }

    let last = input[last_break..].trim();
    if !last.is_empty() {
        if last.len() <= width {
            strings.push(last.to_owned());
        } else {
            hard_break(&mut strings, last, width);
        }
    }

    strings
}

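fn hard_break(lines: &mut Vec<String>, segment: &str, width: usize) {
    // Byte offset in `segment` where the line being built starts.
    let mut line_start = 0;
    let mut string = String::with_capacity(width);

    for (idx, grapheme) in segment.grapheme_indices(true) {
        // Flush the current line if this grapheme would push it past `width`
        // bytes. The `is_empty` check avoids emitting an empty line when a
        // single grapheme is itself longer than `width`.
        if !string.is_empty() && idx + grapheme.len() > line_start + width {
            lines.push(string.clone());
            string.clear();
            // The new line starts at the current grapheme, not the previous one.
            line_start = idx;
        }

        string.push_str(grapheme);
    }

    if !string.is_empty() {
        lines.push(string);
    }
}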

It gets sent to C, so that's why there's so much copying going on - I don't want to fuss with it. 😆

I'll examine your example code and see if there's some things I can optimise or handle better. Again, thank you so much for the thorough example. I'll close the issue, since you've answered my question and I found a solution; maybe someone else will find this exchange useful, as well.

Martin Geisler · Answer 3 · Sun Jun 13 2021 05:42:35 GMT+0800 (China Standard Time)

> I'll examine your example code and see if there's some things I can optimise or handle better.
> Again, thank you so much for the thorough example. I'll close the issue, since you've answered my question and I found a solution; maybe someone else will find this exchange useful, as well.

Happy to help, I'm glad you found a good solution. Textwrap makes sense if you need flexibility and if you might have weird corner-cases:

- What happens if the line length is zero? Does not apply.
- What happens if the input has colored text via ANSI escape sequences? Textwrap will ignore the color codes; this probably doesn't apply to your situation.
- etc...
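As a rough illustration of the ANSI point: with a byte limit, the escape sequences count toward the budget even though they take up no columns on screen. The sketch below is not textwrap's implementation. `visible_chars` is a hypothetical, simplified helper that assumes escape sequences of the form ESC `[` ... final byte.

```rust
/// Hypothetical helper: count chars while skipping ANSI CSI sequences
/// (ESC '[' parameters, ended by a byte in 0x40..=0x7e). Simplified: any
/// other char directly after ESC is dropped without being counted.
fn visible_chars(s: &str) -> usize {
    let mut count = 0;
    let mut chars = s.chars();
    while let Some(ch) = chars.next() {
        if ch == '\u{1b}' {
            if chars.next() == Some('[') {
                // Consume parameters up to and including the final byte.
                for c in chars.by_ref() {
                    if ('\u{40}'..='\u{7e}').contains(&c) {
                        break;
                    }
                }
            }
            continue;
        }
        count += 1;
    }
    count
}
```

For the string `"\u{1b}[31mred\u{1b}[0m"`, `visible_chars` reports 3 while `len()` reports 12 bytes, so a byte-limited transport has to budget for all 12.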