utf8proc_map_custom parses first codepoint twice?
WillAyd opened this issue
Thank you for the great library. Apologies for any user error, as I am just starting out. I am trying to map a function that title-cases any character that appears at the start of a word, i.e. at the start of the string or after whitespace. With my callback, every character following whitespace was title-cased, but the first character always came back unchanged.
During debugging I noticed that the first character was always processed twice during the map - is this expected behavior? Here is an MRE:
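#include <stdio.h>
#include <stdlib.h>  /* free */
#include <string.h>  /* strlen */
#include "utf8proc.h"

/* Title-cases the codepoint when the flag behind `data` is set, and sets the
   flag again after any separator so that the next codepoint is title-cased. */
static utf8proc_int32_t wrapper(utf8proc_int32_t codepoint, void *data) {
    int *should_capitalize = (int *)data;
    printf("entered my wrapper with 'should_capitalize'==%d\n", *should_capitalize);
    if (*should_capitalize) {
        *should_capitalize = 0;
        return utf8proc_totitle(codepoint);
    }
    const utf8proc_category_t category = utf8proc_category(codepoint);
    switch (category) {
    case UTF8PROC_CATEGORY_ZS:
    case UTF8PROC_CATEGORY_ZL:
    case UTF8PROC_CATEGORY_ZP:
        *should_capitalize = 1;
        break;
    default:
        break;
    }
    return codepoint;
}

int main(void) {
    const utf8proc_uint8_t data[] = "x";
    utf8proc_uint8_t *dst = NULL;
    int should_capitalize = 1;
    utf8proc_ssize_t result = utf8proc_map_custom(
        data, (utf8proc_ssize_t)strlen((const char *)data), &dst,
        UTF8PROC_NULLTERM, wrapper, &should_capitalize);
    if (result < 0) {
        /* dst is not valid on failure, so report the error and stop. */
        fprintf(stderr, "utf8proc_map_custom failed: %s\n", utf8proc_errmsg(result));
        return 1;
    }
    printf("My title string is %s\n", dst);
    free(dst);
    return 0;
}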
Running this yields:
entered my wrapper with 'should_capitalize'==1
entered my wrapper with 'should_capitalize'==0
My title string is x
I can answer my own question here: it looks like it's not just the first character that gets processed twice, and there is no guarantee about how many times, or in what order, the custom function is called for each codepoint.