JuliaStrings / utf8proc

a clean C library for processing UTF-8 Unicode data

Home Page: http://juliastrings.github.io/utf8proc/

utf8proc_map_custom parses first codepoint twice?

WillAyd opened this issue

Thank you for the great library. Apologies for any user error, as I am just starting out. I am trying to map a function that title-cases any character that appears at the start of the string or follows whitespace. With the code below I was able to title-case every character following whitespace, but the first character always came back unchanged.

During debugging I noticed that the first character was always processed twice during the map - is this expected behavior? Here is an MRE:

#include <stdio.h>
#include <stdlib.h>  /* free */
#include <string.h>  /* strlen */
#include "utf8proc.h"

static utf8proc_int32_t wrapper(utf8proc_int32_t codepoint, void *data) {
  int should_capitalize = *((int *)data);
  printf("entered my wrapper with 'should_capitalize'==%d\n", should_capitalize);  
  if (should_capitalize) {
    *((int *)data) = 0;
    return utf8proc_totitle(codepoint);
  }

  const utf8proc_category_t category = utf8proc_category(codepoint);
  switch (category) {
  case UTF8PROC_CATEGORY_ZS:
  case UTF8PROC_CATEGORY_ZL:
  case UTF8PROC_CATEGORY_ZP:
    *((int *)data) = 1;
    break;
  default:
    break;
  }
  
  return codepoint;
}

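int main(void) {
  const unsigned char data[] = "x";
  unsigned char *dst = NULL;

  int should_capitalize = 1;
  /* strlen() takes a char pointer, so cast the unsigned buffer. */
  utf8proc_ssize_t result = utf8proc_map_custom(
      data, (utf8proc_ssize_t)strlen((const char *)data), &dst,
      UTF8PROC_NULLTERM, wrapper, &should_capitalize);
  if (result < 0) {
    /* dst is not valid on failure, so report the error and stop. */
    fprintf(stderr, "utf8proc_map_custom failed: %s\n", utf8proc_errmsg(result));
    return 1;
  }

  printf("My title string is %s\n", (const char *)dst);
  free(dst);
  return 0;
}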

Running this yields:

entered my wrapper with 'should_capitalize'==1
entered my wrapper with 'should_capitalize'==0
My title string is x

I can answer my own question here: it looks like it's not just the first character that gets processed twice, and there is no guarantee about the number of times the callback runs or the order in which it runs.
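For anyone who lands here with the same symptom, here is a minimal sketch of why a stateful callback misbehaves when every character is visited more than once. It does not use utf8proc at all. `two_pass_map` is a hypothetical stand-in that assumes a sizing pass followed by a writing pass, which would be consistent with the output above but is not confirmed in this thread. `stateful_title` mirrors the shape of `wrapper` from the MRE, and `title_single_pass` is one possible workaround that keeps the flag inside a single loop. It is ASCII-only (`toupper`, a literal space) purely to stay self-contained.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>

/* Callback shape modelled on the MRE: (character, user data) -> character. */
typedef int (*map_fn)(int ch, void *data);

/* Hypothetical two-pass mapper (an assumption, NOT utf8proc code):
 * pass 1 only measures the output, pass 2 writes it.
 * Both passes invoke the callback on every character. */
static char *two_pass_map(const char *src, size_t len, map_fn fn, void *data) {
  size_t n = 0;
  for (size_t i = 0; i < len; i++) {
    (void)fn((unsigned char)src[i], data); /* result discarded while sizing */
    n++;
  }
  char *dst = malloc(n + 1);
  if (dst == NULL) return NULL;
  for (size_t i = 0; i < len; i++)
    dst[i] = (char)fn((unsigned char)src[i], data);
  dst[n] = '\0';
  return dst;
}

/* Stateful callback in the style of the MRE's wrapper:
 * the flag is consumed the first time it is seen. */
static int stateful_title(int ch, void *data) {
  int *should_capitalize = data;
  if (*should_capitalize) {
    *should_capitalize = 0;
    return toupper(ch);
  }
  if (ch == ' ') *should_capitalize = 1;
  return ch;
}

/* Possible workaround: do the stateful work in one explicit pass,
 * so the flag never outlives a single traversal. */
static char *title_single_pass(const char *src, size_t len) {
  char *dst = malloc(len + 1);
  if (dst == NULL) return NULL;
  int should_capitalize = 1;
  for (size_t i = 0; i < len; i++) {
    unsigned char ch = (unsigned char)src[i];
    dst[i] = should_capitalize ? (char)toupper(ch) : (char)ch;
    should_capitalize = (ch == ' ');
  }
  dst[len] = '\0';
  return dst;
}
```

Under that two-pass assumption, calling `two_pass_map("a b", 3, stateful_title, &flag)` with `flag` initially 1 uses up the initial flag during the sizing pass, so the written result keeps the first character lowercase while the character after the space is still title-cased. That matches the behavior described at the top of this issue. `title_single_pass` is not affected because nothing outside the loop can consume its state. With real UTF-8 input the same idea would need proper decoding and encoding rather than byte-wise `toupper`.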