
Word alignments in natural languages
This library and program are designed for the alignment of words in human
languages. Each word is encoded as a list of characters. A single
character, however, is encoded as a list of unicode symbols, not just a
single symbol.
This particular encoding is necessary because there is no general one-to-one
mapping of atomic alphabet symbols to unicode characters. Just consider
character decorations, such as Umlaute (even though Umlaute are available in
unicode).
As user, you send a list to stdin that is formatted as follows (with all fields
mandatory; the tab symbol being the separator between each column):
- word id
- language name
- meaning identifier (this is a string, not a number)
- length of the word
- space-separated characters (each character is actually a string)
(In case you read the plain-text version of this document, the 4 whitespace
characters should be left out.)
0 Albanian_Tosk 1.100 5 \' b o t ə
2 Albanian_Tosk 1.100 10 r̃ o k u lʸ i a\' lʸ e m
The WordAlignment program comes with two modes, twowaysimple and twoway.
TwoWay - Simple
WordAlign twowaysimple
aligns words based on a simple scoring model. Gaps
are scored with linear costs. Matches are scored based on a simple scoring file
handling unigrams of characters. Even the simple model allows for the concept
of equivalence classes. This makes it possible to not only score exact matches,
but to give somewhat high scores to, say, two vowels that should be matched.
The following command-line options are provided:
--scorefile=ITEM
--lpblock=ITEM,ITEM
--showmanual
--prettystupid
--outfile=ITEM
--scorefile
is the simple score file to be used by the mode. the
defaultSimpleScoring file can be used as a template.
--lpblock
expects a pair of language names (Breton,Breton) or a pair of
integers (3,3 or 4,6) and will then align only the given language pairs with
each other. This option should be very helpful in case you want to parallelize
the program.
--showmanual
will show this manual in plain text.
--prettystupid
will show a progress bar of the current language pair.
It's pretty and helpful with smaller tasks but should not be used when you
parallelize on a grid engine.
--outfile
writes to the given output file, not stdout. Actually required
when --prettystupid
is active.
--nobacktrack
disables backtracking. Sometimes just the alignment score
is enough.
TwoWay (Complex Scoring)
The complex scoring model uses linear gap costs. In contrast to the simple
model above, however, character matching is now performed in a bigram context.
The required score file is currently using an in-house format with the
following columns all required (with whitespace, not tab between the entries):
- language name
- character (bigram 1.1)
- character (bigram 1.2)
- language name
- character (bigram 2.1)
- character (bigram 2.2)
- score
Three example lines
Albanian_Tosk \' a Albanian_Tosk \' a 4.25238
Albanian_Tosk \' a Albanian_Tosk \' b 0.402228
Albanian_Tosk \' a Albanian_Tosk \' g 1.07432
In addition to the scoring file, set via --scorefile
, the default score
constant and the gap cost constant need to be set.
--bigramdef=NUM
is the score given to unknown matches.
--gapopen=NUM
is the score given to each gap character (and not just the
opening score)
--lpblock
as above
--showmanual
as above
--prettystupid
as above
--outfile
as above
--nobacktrack
as above
An example output is given below.
The output are four lines for each alignment. An info line with the word ids
(IDS), the alignment score (SCORE), the normalized scores (NSCORE) and the
actual words, started by (WORDS) and interleaved by (WORD). The next two lines
are the alignment, with deletions showning up as minus symbols (---) in the
deletion field. Note that a deletion does not delete a character from the
input, it merely aligns an existing character in one alignment with the symbol
for deletion (--) in the other. The final line provides the per-column score
for the alignment. After the alignment follows one empty line.
The normalized score is defined as SCORE / maximum of input word lengths
cat albanian.input | head -n 3 | ./dist/build/WordAlign/WordAlign twoway -s albanian.scores
yields:
IDS: 1 2 SCORE: 32.85 NSCORE: 5.48 WORDS: ^ d o u a r $ WORD ^ d o u a r $
^ d o u a r $
^ d o u a r $
0.0 4.8 5.9 4.1 6.6 4.6 6.8
IDS: 1 3 SCORE: -12.36 NSCORE: -1.77 WORDS: ^ d o u a r $ WORD ^ p o u l t r $
^ d - o u a r - - $
^ p o - u - l t r $
0.0 0.3 -5.0 -5.0 4.1 -5.0 -20.0 -5.0 -5.0 6.8
Measured on a core i5-3570K @ 3.40 GHz; single-threaded.
The program is not compiled for multi-threading, if you need this consider the
--lpblock
option first. Otherwise, send a mail.
The running time for calculating 100 000 alignments is:
Mode Seconds Alignments per Second
==== ======= =====================
TwoWay Simple: 6.7 s ~ 15 000
TwoWay Complex: 15.8 s ~ 6 300
Disabling backtracking (to get just the alignment scores) improves performance
by roughly a factor of x 2
. Words are very short, backtracking overhead
is quite large!
As in: the number of actual /unicode/ symbols is approximately the same as the
number of forward DP cells to be filled, since each /character/ can be a
combination of multiple unicdoe symbols.
Three- and Four-way Alignments
These are currently disabled. If you need them, consider contacting me.
Christian Hoener zu Siederdissen
Leipzig University, Leipzig, Germany
choener@bioinf.uni-leipzig.de
http://www.bioinf.uni-leipzig.de/~choener/