perl5-dbi / DBD-MariaDB

Perl MariaDB driver

utf8mb4 problems

Gazoo opened this issue

There seem to be utf8mb4 encoding problems with the DBD::MariaDB driver. I confirmed that the data is scrambled in the database itself, so this isn't just a display problem.

The DB schema column is created with:
subject varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin DEFAULT '',

Switching over to the perl-DBD-MySQL driver fixes the issue. This is a CentOS 8 server running Perl v5.26.3.

[Screenshot: 2020-08-28_01h25_05]

It is more likely that the data is being inserted incorrectly and that DBD::mysql's bugs happen to produce correct output. Freenode IRC could assist in debugging if you are able to show the code involved.

commented

@Grinnz the code is just the standard Amavis 2.12 package from CentOS 8. I'll look at opening a ticket with Amavis too, and then we can see whether the problem is with the DBD::MariaDB driver or with Amavis.

This is the insert code from Amavis:

https://gitlab.com/amavis/amavis/-/blob/v2.12.0/amavisd?expanded=true&viewer=simple#L27529

for ($subj,$from) {  # character set decoding, sanitation
        chomp; s/\n(?=[ \t])//gs; s/^[ \t]+//s; s/[ \t]+\z//s;  # unfold, trim
        eval {  # convert to UTF-8 octets, truncate to 255 bytes
          my $chars  = safe_decode_mime($_);      # to logical characters
          my $octets = safe_encode_utf8($chars);  # to bytes, UTF-8 encoded
          $octets = truncate_utf_8($octets,255);
          # man DBI: Drivers should accept [unicode and non-unicode] strings
          # and, if required, convert them to the character set of the
          # database being used. Similarly, when fetching from the database
          # character data that isn't iso-8859-1 the driver should convert
          # it into UTF-8.
          $_ = $octets; 1;  # pass bytes to SQL, UTF-8, works better
        } or do {
          my $eval_stat = $@ ne '' ? $@ : "errno=$!";  chomp $eval_stat;
          do_log(1,"save_info_final INFO: header field ".
                   "not decodable, keeping raw bytes: %s", $eval_stat);
          substr($_,255) = ''  if length($_) > 255;
          die $eval_stat  if $eval_stat =~ /^timed out\b/; # resignal timeout
        };
}

# update message record with additional information
      $conn_h->execute($upd_msg,
               $content_type, $q_type, $q_to, $dsn_sent,
               0+untaint($min_spam_level), $m_id, $from, $subj,
               untaint($msginfo->client_addr), # we may have a better info now
               $sql_schema_version < 2.007000 ? () : $orig,
               $msginfo->partition_tag, $mail_id);
               # $os_fp, $rfc2822_sender, $rfc2822_from, $checks_performed, ...
      # SQL_CHAR, SQL_VARCHAR, SQL_VARBINARY, SQL_BLOB, SQL_INTEGER, SQL_FLOAT,
      # SQL_TIMESTAMP, SQL_TYPE_TIMESTAMP_WITH_TIMEZONE, ...
      $conn_h->commit;  1;

Yes, this is absolutely broken. Strings should not be encoded to UTF-8 before being passed to DBD::MariaDB; the encoding step is probably there to work around DBD::mysql's bugs.
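To illustrate why pre-encoded octets end up scrambled, here is a minimal sketch that needs no database. It assumes the driver treats every bound scalar as a character string and encodes it to UTF-8 for the connection, and it simulates that step with Encode. The sample subject line and the variable names are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical subject as logical characters; the emoji needs utf8mb4.
my $chars = "Caf\x{e9} \x{1F600}";

# What the quoted Amavis code passes to execute(): UTF-8 encoded octets.
my $octets = encode('UTF-8', $chars);

# Assumption: the driver encodes whatever it receives as if it were characters.
my $once  = encode('UTF-8', $chars);   # characters passed in: encoded once
my $twice = encode('UTF-8', $octets);  # octets passed in: encoded a second time

# Decoding the double-encoded bytes gives back the octets, not the characters.
my $read_back = decode('UTF-8', $twice);

printf "characters passed: %d bytes stored\n", length $once;
printf "octets passed:     %d bytes stored\n", length $twice;
print  "round trip intact: ", ($read_back eq $chars ? "yes" : "no"), "\n";
```

Under that assumption, every byte above 0x7F in the pre-encoded string is expanded again, which matches the scrambled data seen in the table.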

Instead of converting to octets, run utf8::upgrade($chars) (it operates in place) and pass $chars to the driver; that should ensure the character string is interpreted correctly by both drivers.
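A minimal sketch of that suggestion, using only core Perl: utf8::upgrade changes the internal representation of the scalar (it sets the UTF8 flag) without changing the characters it holds. The sample string and the commented-out statement handle are hypothetical, and this is not a tested Amavis patch.

```perl
use strict;
use warnings;

# Hypothetical decoded subject; all characters are below 0x100, so Perl
# initially stores it in its single-byte internal representation.
my $chars = "Caf\x{e9}";
my $copy  = $chars;

my $flag_before = utf8::is_utf8($chars) ? 1 : 0;
utf8::upgrade($chars);                  # in place; the logical string is unchanged
my $flag_after  = utf8::is_utf8($chars) ? 1 : 0;

print "UTF8 flag before: $flag_before, after: $flag_after\n";
print "same characters:  ", ($chars eq $copy ? "yes" : "no"), "\n";

# The upgraded character string, not the encoded octets, would then be bound:
# $sth->execute($chars);               # $sth is a hypothetical statement handle
```

Note that the quoted code truncates to 255 bytes, whereas this approach hands the driver characters, so any length limit would need to be applied to the character string instead.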

commented

OK, thanks @Grinnz. I'll file a bug with Amavis and switch to a patched DBD::mysql (to work around a data type bug), as it will probably take some time for Amavis to support the DBD::MariaDB driver. I thought it would be an easy switch between the two drivers, but I guess not.