Mozer / talk-llama-fast

Port of OpenAI's Whisper model in C/C++ with xtts and wav2lip

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Does it really work with extended ASCII used for Cyrillic symbols?

dmikushin opened this issue · comments

The Cyrillic support added by this project raises an interesting problem.

If the source files are saved in Unicode (which is usual for Linux), the '«' and '»' symbols are represented in UTF-8. Char codes for these symbols are outside of the 0..255 range of 1-byte char, which is strongly pointed out by Clang:

talk-llama-fast.cpp:1248:55: error: character too large for enclosing character literal type
 1248 |                                 text_heard = RemoveTrailingCharacters(text_heard, '«');
      |                                                                                   ^
talk-llama-fast.cpp:1249:55: error: character too large for enclosing character literal type
 1249 |                                 text_heard = RemoveTrailingCharacters(text_heard, '»');

I think what you meant to have was actually the same symbols of extended ASCII table given by 0xab and 0xbb codes:

--- a/src/talk-llama-fast.cpp
+++ b/src/talk-llama-fast.cpp
@@ -1245,8 +1245,8 @@ int run(int argc, const char ** argv) {
                                text_heard = RemoveTrailingCharacters(text_heard, '!');
                                text_heard = RemoveTrailingCharacters(text_heard, ',');
                                text_heard = RemoveTrailingCharacters(text_heard, '.');
-                               text_heard = RemoveTrailingCharacters(text_heard, '«');
-                               text_heard = RemoveTrailingCharacters(text_heard, '»');
+                               text_heard = RemoveTrailingCharacters(text_heard, 0xab); // '«'
+                               text_heard = RemoveTrailingCharacters(text_heard, 0xbb); // '»'
                                if (text_heard[0] == '.') text_heard.erase(0, 1);
                                if (text_heard[0] == '!') text_heard.erase(0, 1);
                                trim(text_heard);

I'm afraid this is not all, yet. The other Cyrillic symbols in the source file also use Unicode. Therefore, they all are likely to be mishandled in the same way. On the other hand, does LLAMA really output extended ASCII, not Unicode or both?? This code clearly assumes that it is always extended ASCII.

Source files are in utf8 without bom as far as I know.
I am compiling using cmake. No errors for me.

Does llama output in utf8? I don't remember, will take a look later, but most likely - yes.

If the source files are in UTF-8, then text_heard = RemoveTrailingCharacters(text_heard, '«'); will never work as intended, because the Unicode symbol as a char literal does not have a valid representation. As I understand, UTF incorporates the ASCII range, that's why all common symbols like '!' or ',' will work this way, but not '«' and '»', because they are outside of the ASCII range.

These '«' and '»' symbols are called lapki (лапки) and are specific to Russian typographic standard.

Does llama output in utf8? I don't remember, will take a look later, but most likely - yes.

I don't know either, I'm new to LLAMA. Best is to ensure that UTF output is always guaranteed. If it is, RemoveTrailingCharacters should accept std::string (not char) as the second arg, and remove trailing substrings.

Yes, llama outputs UTF-8.
Yes, RemoveTrailingCharacters was not working with '«' and '»'. I changed it now to RemoveTrailingCharactersUtf8.

Ugh, one line has been lost in translation:

text_heard = RemoveTrailingCharactersUtf8(text_heard, U"«");

Anyway, I cleaned this up on your behalf in this commit: dmikushin@d7f49fa

Kindly please explain what the rest of the patch is doing:

From ed7ea378177fcad6dfac11163b884dfc7d24c94c Mon Sep 17 00:00:00 2001
From: Mozer <mo3rnt@gmail.com>
Date: Sun, 14 Apr 2024 20:32:28 +0300
Subject: [PATCH] What is this patch about?

---
 examples/talk-llama/talk-llama.cpp | 45 +++++++++++++++++++-----------
 1 file changed, 28 insertions(+), 17 deletions(-)

diff --git a/examples/talk-llama/talk-llama.cpp b/examples/talk-llama/talk-llama.cpp
index a731ffcd8c..4adf461c2b 100644
--- a/examples/talk-llama/talk-llama.cpp
+++ b/examples/talk-llama/talk-llama.cpp
@@ -880,7 +892,7 @@ int run(int argc, const char ** argv) {
 	int reply_part = 0;
 	std::string text_to_speak_arr[150];
 	int reply_part_arr[150];
-	bool last_output_has_username = false;
+	bool last_output_has_username = false;	
 	
     if (whisper_params_parse(argc, argv, params) == false) {
         return 1;
@@ -892,17 +904,17 @@ int run(int argc, const char ** argv) {
         exit(0);
     }
 	
-	const std::string fileName{params.xtts_control_path};
-	std::ifstream readStream{fileName};	
-	if(!readStream.good()){
-		printf("Warning: %s file not found, xtts wont stop on user speech without it\n", params.xtts_control_path.c_str());
-		readStream.close();
-	}
-	else // control file is ok
-	{
-		readStream.close();
-		allow_xtts_file(params.xtts_control_path, 1); // xtts can play
-	}
+	//const std::string fileName{params.xtts_control_path};
+	//std::ifstream readStream{fileName};	
+	//if(!readStream.good()){
+	//	printf("Warning: %s file not found, xtts wont stop on user speech without it\n", params.xtts_control_path.c_str());
+	//	readStream.close();
+	//}
+	//else // control file is ok
+	//{
+	//	readStream.close();
+	//	allow_xtts_file(params.xtts_control_path, 1); // xtts can play
+	//}
 	
 
     // whisper init

xtts_control_path is now shouldn't be passed in params. xtts_play_allowed.txt is now stored in OS /temp dir. The file check at start-up is no longer needed.