Lingua::Translit Developer Documentation

This document shows you how to add support for a new transliteration to Lingua::Translit, build a development version and test your transliteration.

Used Conventions

Every non-absolute path is relative to Lingua::Translit's source code directory.

Adding Transliteration Tables

If you want to add a new transliteration to Lingua::Translit, just…

write an XML file (the "transliteration table")
build a development version containing your table
write and run some tests to check if your transliteration is working as expected
integrate your table into the set of upstream tables and consider contributing it

Writing a Transliteration Table

Each XML transliteration table consists of metadata and a set of transliteration rules.

The metadata tags specify the name of the transliteration, a short description, and whether the transliteration can be used in both directions. For example:

<name>DIN 1460</name>
<desc>DIN 1460: Cyrillic to Latin</desc>
<reverse>true</reverse>

The rules can be simple one-to-one mappings:

<rule>
    <from>X</from>
    <to>Y</to>
</rule>

…but you can also specify a context, so that the rule is applied only where that context matches:

<rule>
    <from>A</from>
    <to>B</to>
    <context>
        <after>x</after>
        <before>y</before>
    </context>
</rule>

To get an easy start, you can copy the file xml/template.xml, rename it as needed and edit it right away. Additionally, xml/Common_DEU.xml may be used as a complete example.

Although editing an XML file is technically quite easy, some things have to be considered. The most important thing to keep in mind is that the rules are applied in sequence, one after another. Therefore, the order of the rules matters whenever you specify a context or transliterate multiple characters.

Unicode Notation

To specify non-ASCII characters, use an entity that represents the Unicode code point in hexadecimal notation, and add a comment that names the character.

<rule>
    <!-- CYRILLIC CAPITAL LETTER A -->
    <from>&#x0410;</from>
    <to>A</to>
</rule>

This ensures that the correct character is transformed and that it can be identified unambiguously even if it is not displayed correctly.
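The entity and the comment can be generated rather than looked up by hand. The following is a minimal sketch using only core Perl; the helper names to_entity and to_comment are hypothetical and not part of Lingua::Translit.

```perl
use strict;
use warnings;
use charnames ();    # core module, provides charnames::viacode()

# Hypothetical helper: format one character as a hexadecimal XML entity.
sub to_entity {
    my ($char) = @_;
    return sprintf '&#x%04X;', ord $char;
}

# Hypothetical helper: build the XML comment that names the character.
sub to_comment {
    my ($char) = @_;
    my $name = charnames::viacode(ord $char) // 'UNKNOWN CHARACTER';
    return "<!-- $name -->";
}

my $char = "\x{0410}";
print to_comment($char), "\n";                     # <!-- CYRILLIC CAPITAL LETTER A -->
print '<from>', to_entity($char), "</from>\n";     # <from>&#x0410;</from>
```

The output can be pasted into a rule as its comment and <from> line.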

Specifying a Context

The context is evaluated as a Perl regular expression. Therefore, literal ASCII characters, entities, or metacharacters can be used to specify the context.

If a character has two mappings depending on the context, the context-sensitive rule must be applied before the context-free rule. Otherwise, the context-free rule replaces every occurrence first and the context-sensitive rule never matches.

Rule 1:

<rule>
    <!-- GREEK CAPITAL LETTER GAMMA & SMALL LETTER KAPPA -->
    <from>&#x0393;&#x03BA;</from>

    <to>Gk</to>
    <context>
        <after>\b</after> <!-- word initial -->
    </context>
</rule>

Rule 2:

<rule>
    <!-- GREEK CAPITAL LETTER GAMMA & SMALL LETTER KAPPA -->
    <from>&#x0393;&#x03BA;</from>

    <to>Nk</to>
</rule>

The following pattern matching contexts are available:

<after>: if the transliteration rule should only be applied after a certain character (corresponds to Perl's lookbehind)
<before>: if the rule should only be applied before a certain character (corresponds to Perl's lookahead)
<after> & <before>: if the rule should only be applied if the character is in between two characters
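To illustrate how these contexts relate to Perl's lookaround assertions, here is a simplified model written in plain Perl. It is a sketch for illustration only and not the module's actual implementation: the rule hashes, the apply_rules function, and the choice to treat <from> as a literal string are all assumptions.

```perl
use strict;
use warnings;

# Simplified model of the two rules shown above (assumed structure).
my @rules = (
    { from => "\x{0393}\x{03BA}", to => 'Gk', after => '\b' },    # word initial
    { from => "\x{0393}\x{03BA}", to => 'Nk' },                   # everywhere else
);

# Apply every rule, in order, to the whole text.
sub apply_rules {
    my ($text, @rule_list) = @_;
    for my $rule (@rule_list) {
        my $from   = quotemeta $rule->{from};
        # <after> becomes a lookbehind, <before> becomes a lookahead.
        my $after  = defined $rule->{after}  ? "(?<=$rule->{after})" : '';
        my $before = defined $rule->{before} ? "(?=$rule->{before})" : '';
        $text =~ s/$after$from$before/$rule->{to}/g;
    }
    return $text;
}

# GAMMA KAPPA at the start of a word, then again inside a word.
my $input = "\x{0393}\x{03BA} a\x{0393}\x{03BA}";

print apply_rules($input, @rules), "\n";            # Gk aNk
print apply_rules($input, reverse @rules), "\n";    # Nk aNk
```

With the rules reversed, the context-free rule consumes every occurrence first, so the word-initial rule has nothing left to match.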

Multiple Characters

Because all rules are applied in sequence, the order of the rules matters: all rules concerning multiple characters must precede all single-character rules.

Rule 1:

<rule>
    <!-- GREEK SMALL LETTER ALPHA & SMALL LETTER UPSILON -->
    <from>&#x03B1;&#x03C5;</from>
    <to>au</to>
</rule>

Rule 2:

<rule>
    <!-- GREEK SMALL LETTER ALPHA -->
    <from>&#x03B1;</from>
    <to>a</to>
</rule>

If you switched the order of the rules in the example above, every single "alpha" would be transliterated first and the digraph pattern would never match.
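The effect of the rule order can be reproduced with a few lines of plain Perl. This is a sketch under the assumption that each rule is a simple global substitution applied in sequence; apply_in_order is a hypothetical helper, not part of Lingua::Translit.

```perl
use strict;
use warnings;

# Apply a list of [from, to] pairs in the given order (simplified model).
sub apply_in_order {
    my ($text, @pairs) = @_;
    for my $pair (@pairs) {
        my ($from, $to) = @$pair;
        $text =~ s/\Q$from\E/$to/g;
    }
    return $text;
}

my $digraph = [ "\x{03B1}\x{03C5}", 'au' ];    # alpha + upsilon
my $single  = [ "\x{03B1}",         'a'  ];    # alpha
my $input   = "\x{03B1}\x{03C5}\x{03B1}";      # alpha, upsilon, alpha

my $good = apply_in_order($input, $digraph, $single);    # "aua"
my $bad  = apply_in_order($input, $single, $digraph);    # upsilon is left untouched

binmode STDOUT, ':encoding(UTF-8)';
print "digraph rule first: $good\n";
print "single rule first:  $bad\n";
```

In the second call the alpha is consumed before the digraph rule runs, so the upsilon stays in the output as a Greek letter.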

Building a Development Version

Your new transliteration table has to be converted to a Perl data structure and stored in xml/tables.dump in order to be put to use and tested as a development version of Lingua::Translit.

xml2dump.pl is a tool that processes XML transliteration table definitions and converts them to Perl data structures. Normally, all stable transliteration tables are processed once, stored in xml/tables.dump and included in the Lingua::Translit::Tables module at build time.

Using xml2dump.pl

To accomplish this task the xml2dump.pl tool comes in handy:

$ ./xml2dump.pl -v -o tables.dump mytable.xml
Parsing mytable.xml... (MyTable: rules=2, contexts=1)
1 transliteration table(s) dumped to tables.dump.

It reads an XML definition, processes it and dumps the resulting data structure to a given file (-o switch).

Your transliteration table is now ready to be included by Lingua::Translit::Tables so it can be tested and evaluated.

Building a Temporary Lingua::Translit

Use the standard toolchain to build a temporary development version of Lingua::Translit which contains nothing but your new transliteration table.

$ perl Makefile.PL && make

Given the resulting development version, it's time to test the transliteration table for completeness and correct functionality.

Testing a Transliteration Table

To verify that your set of transliteration rules works correctly, you need to write some tests using your favorite Perl test framework. For an easy and complete example that utilizes the Test::More framework, have a look at t/11_tr_Common_DEU.t.

Lingua::Translit comes with a ready-to-use test template that you can use as a starting point and adapt to your transliteration's specific needs. It is located at t/xx_tr_template.t.pl. To follow Lingua::Translit's naming convention, rename it to NN_tr_NAME.t.

Hints on What to Test

If your transliteration is straightforward (only "1:1" mappings), just test a small text and have a look at the result. If everything is correct, you are done.
If the transliteration is reversible, you should check that both directions are transliterated correctly.
All context-sensitive and multi-character transliterations should be tested explicitly to ensure that these error-prone mappings also work as expected.

Running the Tests

While testing it is convenient to define the environment variable PERL5LIB (have a look at perlrun(1)) so that the Perl interpreter knows where your development version of Lingua::Translit is located. The following example session assumes that you are using bash(1) or a similar shell:

$ export PERL5LIB="blib/lib"
$ perl t/66_tr_mytest.t
1..2
ok 1 - MyTable: not reversible
ok 2 - MyTable: transliteration

If all tests pass and your transliteration table is therefore ready for use, clean up your shell's environment and prepare to integrate your table into the existing set of transliteration tables:

$ unset PERL5LIB

Integrating a New Table

Change to the xml/ directory and let make(1) call xml2dump.pl in order to build a data structure ("tables.dump") from all available XML transliteration tables, including yours:

$ make all-tables

Now, clean up the old files from the development version you used to write your tests. Change into the source directory's root and run:

$ make distclean
$ perl Makefile.PL
$ make

The result is a complete version of Lingua::Translit that contains all upstream tables, as well as your own addition.

$ make test

…ensures that everything is all right and ready for installation or packaging. Congratulations!

Contributing Your Table

If you would like to contribute your transliteration table under the license terms of Lingua::Translit (which uses the same license terms as Perl itself), it can be included in the official upstream version.

To accomplish this, create a patch of your changes and send it to us for review along with a description and comments.
