rafelafrance / line-align

Multiple sequence alignment on lines of text

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

line-align

line-align CI

  1. Description
  2. Install
  3. Test

Description

The problem: I had several version of the "same" string coming from different sources, Although they were all supposed to be identical they were actually all slightly different, and with no indication of which one, if any, was correct. To find the "correct" string, I decided to use an algorithm bioinformatics and use a multiple sequence alignment (MSA) algorithm on the text lines to find a single representation of the "true" string.

The Multiple Sequence Alignment algorithm I am using is directly analogous to the ones used for biological sequences, but instead of using a PAM or BLOSUM substitution matrix I use a visual similarity matrix. Visual similarity of characters depends on the font so an exact distance is not always feasible. Instead, I use a rough similarity score that ranges from +2 for characters that are identical, to -2 where the characters are wildly different like a period "." and a "W". I also use a gap open penalty and a gap extend penalty just like the bioinformatics algorithm.

These are naive implementations of string algorithms based on Gusfield, 1997. I.e. There's plenty of room for improvement.

NOTE: The functions are geared towards OCR errors and not human errors. OCR engines will often mistake one letter for another or drop/add a character (particularly from the ends) but will seldom transpose characters, which humans do often. Therefore: I do not consider transpositions in the Levenshtein or Needleman Wunsch distances, and substitutions are based on visual similarity, etc.

For example, if given these 4 similar strings:

MOJAVE DESERT, PROVIDENCE MTS.: canyon above
E. MOJAVE DESERT , PROVIDENCE MTS . : canyon above
E MOJAVE DESERT PROVTDENCE MTS. # canyon above
Be ‘MOJAVE DESERT, PROVIDENCE canyon “above

The alignment may look like the following, depending on the MSA parameters. "⋄" is the gap character.

⋄⋄⋄⋄MOJAVE DESERT⋄, PROVIDENCE MTS⋄.⋄: canyon ⋄above
E.⋄ MOJAVE DESERT , PROVIDENCE MTS . : canyon ⋄above
E⋄⋄ MOJAVE DESERT ⋄⋄PROVTDENCE MTS⋄. # canyon ⋄above
Be ‘MOJAVE DESERT⋄, PROVIDENCE ⋄⋄⋄⋄⋄⋄⋄⋄canyon “above

I can use a character (or word) selection algorithm to build a single "correct" string from this alignment. The result may look like:

E. MOJAVE DESERT, PROVIDENCE MTS.: canyon above

API

Install

You will need to have Python3.11+ installed, as well as pip, a package manager for Python. If you have make you can install the requirements into your python environment like so:

git clone https://github.com/rafelafrance/line-align.git
cd line-align
make install

Every time you run any script in this repository, you'll have to activate the virtual environment once at the start of your session.

cd line-align
source .venv/bin/activate

Test

There are tests which you can run like so:

make test

About

Multiple sequence alignment on lines of text

License:MIT License


Languages

Language:Python 67.4%Language:C++ 31.3%Language:Makefile 1.3%