The Digital Unification of Multilingual Data: Multiple Options

Unicode is a wonderful thing, but it has not solved all the problems of incorporating multiple languages into digital history. I do not wish to sell it short! Before Unicode, there was a choice to be made, with neither option good. Either one wrote languages in Latin script (perhaps with the addition of diacritic dots to distinguish characters mapped onto the same Latin character), or one created a font which displayed Latin characters by substitution as graphemes in a different script. The transliteration option was unintelligible to people who read that language but not a European language. The font option was merely a display trick: nothing in the encoding of the characters indicated whether ‘j’ was an English ‘j’ or a Greek ξ. And the right-to-left languages needed to be input backward so that (when displayed left-to-right through the font’s visual substitution) they came out correctly. Except then line-breaks broke the word order. Et cetera, ad nauseam. Unicode is a wonderful thing. But now our problems with multiple languages are of a different order.
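To make the contrast with the old font trick concrete, here is a minimal Python sketch using only the standard `unicodedata` module. It shows that, under Unicode, each character carries its own identity and its own directionality, independent of any font. The helper name `describe` is my own, chosen for illustration.

```python
import unicodedata

def describe(ch):
    """Return (code point, official Unicode name, bidirectional class) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.bidirectional(ch))

# A Latin 'j' and a Greek xi are now different characters, not one byte shown in two fonts.
# The Syriac and Arabic letters also declare themselves right-to-left (class "AL"),
# so nobody has to type them backward.
for ch in ["j", "ξ", "ܢ", "م"]:
    print(*describe(ch))
```

The bidirectional class is what lets display software lay out right-to-left text and break lines without scrambling word order, which is exactly what the font substitution could not do.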

I am in the initial stages of a project about the medieval Middle East, and I have data in Arabic, Armenian, Hebrew, Persian, and Syriac, as well as some in Greek and Latin!  I wish to make this data available to users who do not necessarily read all of these languages, and the question is how best to do so.  

One option, the Unicode option, is to render every language in its native script.  The user is confronted with a multiplicity of scripts, which is accurate to the society and culture being described!  But a user who reads Arabic but not Armenian would find the string “Մօսլ” unintelligible, and if the reader has the reverse linguistic ability, the name “الموصل” would not be meaningful.  If both are displayed on a page, the implicit assumption is that a user should be able to read both languages.  Users might be able to learn enough of an unfamiliar alphabet to roughly sound out a single word (although this is more difficult with Semitic languages which do not supply vowels!).  But if there is a longer passage quoted in the original language, it will simply not be communicated to users who have not studied it.  And a user who reads neither Arabic nor Armenian would not even have learned that we are talking about the city of Mosul in northern Iraq!

A second option, the opposite extreme, is to translate everything.  As long as users understand English, they can understand everything on the page.  That is a major advantage!  But it also has the disadvantage of flattening the cultural specificity of the different languages.  If a Syriac source uses the name ܢܝܢܘܐ for Mosul, translating it as “Mosul” loses the fact that this is a (probably deliberate) archaism, referring to the city of Mosul by the name of the older city of Nineveh across the Tigris river.  But if you translate the name simply as “Nineveh,” then you lose the fact that the city to which the text refers is in reality Mosul.

Perhaps in a multilingual context, when serving users of uncertain linguistic ability, a third option ought to be considered, reviving the transliteration schemes. Then the Armenian “Mōsl” and the Arabic “al-Mawṣil” are clearly seen to be related, while something different must be going on with the Syriac “Nīnāwā.” One difficulty here will be deciding which transliteration scheme to use, since each Middle Eastern language has different schemes in scholarly use. A further difficulty is that each language’s transliteration scheme caters to that language’s needs, and may be inconsistent with the transliteration schemes for other languages. For example, the West Semitic emphatic voiceless uvular stop (Arabic ق, Hebrew ק, Syriac ܩ) is transliterated ‘q’ in Arabic and Syriac, but often ‘k’ in Hebrew, even though ‘k’ is also used for a different letter in Hebrew and other Semitic languages. The Arabic script was adapted to write Persian and Turkish as well, and this letter ‘q’ is transliterated ‘k’ in modern Turkish, and sometimes ‘gh’ in Persian (which is closer to how it is pronounced in Persian, but ‘gh’ also represents another Arabic and Persian letter). Furthermore, one variety of Armenian transliteration uses ‘gh’ to represent the unrelated letter ղ, which other transliteration schemes represent as ‘ɫ’. So it is basically impossible to achieve a transliteration scheme which works well for several different languages, across multiple language families. But without such a scheme, the same string of Latin characters may represent several different sequences of sounds, depending on which language is being transliterated. There is the danger both of false identifications and of failure to identify two references to the same entity, as an artifact of imperfect transliteration.
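The collisions described above can be found mechanically. The following Python sketch is only an illustration: the tables contain just the handful of letters mentioned in this paragraph (plus Arabic/Persian غ and Hebrew כ, which I take to be the “other” letters written ‘gh’ and ‘k’), and the scheme labels and the function name `ambiguous_latin` are my own inventions, not any published standard.

```python
from collections import defaultdict

# Deliberately tiny, illustrative tables: NOT complete or authoritative schemes.
SCHEMES = {
    "Arabic": {"ق": "q", "غ": "gh"},
    "Persian (one convention)": {"ق": "gh", "غ": "gh"},
    "Hebrew (one convention)": {"ק": "k", "כ": "k"},
    "Armenian (one convention)": {"ղ": "gh"},
}

def ambiguous_latin(schemes):
    """Map each Latin string to its source letters, keeping only strings
    that stand for more than one distinct original letter."""
    sources = defaultdict(set)
    for language, table in schemes.items():
        for letter, latin in table.items():
            sources[latin].add((language, letter))
    return {
        latin: pairs
        for latin, pairs in sources.items()
        if len({letter for _, letter in pairs}) > 1
    }

for latin, pairs in sorted(ambiguous_latin(SCHEMES).items()):
    print(latin, "could be:", sorted(pairs))
```

With these tables, ‘gh’ and ‘k’ each turn out to stand for more than one original letter, while ‘q’ does not. A real project would need full tables for each scheme, and the list of collisions would be correspondingly longer.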

A fourth option is to always provide the original and the transliteration for names, and for longer passages both the original and the translation. This allows users with linguistic ability to benefit from it, while still communicating to users who lack one language or another. But it also requires more screen space than any of the previous three options, as “Մօսլ = Mōsl” substitutes for a single word, and “الموصل = al-Mawṣil” takes its place alongside it. If there are several names given, some in left-to-right scripts and some in right-to-left scripts, a user might easily become lost in the tangled nest of equivalencies and lose track of which transliteration corresponds to which original name. Graphic design would be very important to help users sort out these issues, but that might well require even more screen real estate.
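Part of the tangle with mixed-direction names is technical rather than graphic: when right-to-left and left-to-right strings sit side by side, the display algorithm can reorder the pieces around the “=” sign. One possible mitigation, sketched here in Python, is to wrap each name in Unicode directional isolates (U+2068 FIRST STRONG ISOLATE and U+2069 POP DIRECTIONAL ISOLATE). The function names `isolate` and `pair` are hypothetical, and how well this renders will depend on the browser or terminal doing the display.

```python
FSI = "\u2068"  # FIRST STRONG ISOLATE: direction is taken from the text inside
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE: ends the isolated run

def isolate(text):
    """Wrap text so its direction cannot affect the ordering of its neighbors."""
    return f"{FSI}{text}{PDI}"

def pair(original, transliteration):
    """Build an 'original = transliteration' string with each side isolated."""
    return f"{isolate(original)} = {isolate(transliteration)}"

names = [("Մօսլ", "Mōsl"), ("الموصل", "al-Mawṣil")]
print("; ".join(pair(o, t) for o, t in names))
```

In HTML, the `<bdi>` element or the `dir="auto"` attribute plays a similar role. None of this solves the design problem of helping a reader match each transliteration to its original, but it should at least keep the pairs from being visually shuffled.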

A fifth option is to mix these possibilities, customizing to a particular user’s abilities, by using cookies or something similar. Thus a user could choose to see Arabic and Persian in the original, but see Armenian only transliterated, and Syriac both transliterated and in the original in order to work out the correspondences with the related Arabic script. This might save some screen space over the fourth option, but at the cost of breaking visual parallelism across language boundaries, although admittedly in a way requested by the user. It would also imply that different users might not be looking at the same data, although one could hope that the differences would not be meaningful. Most seriously, it would require a more sophisticated web programmer to enable this per-user display, rather than serving the same HTML to all users, which makes this option more expensive than the others.
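To give a sense of what this customization involves, here is a rough Python sketch of the core logic, leaving aside the web framework entirely. Everything in it is an assumption made for illustration: the preference string format (`"ar=original,hy=translit"`, as might be stored in a cookie value), the language codes, the three display modes, and the names `Name`, `parse_prefs`, and `render`.

```python
from dataclasses import dataclass

MODES = {"original", "translit", "both"}
DEFAULT_MODE = "both"  # fall back to the fourth option when no preference is set

@dataclass
class Name:
    lang: str      # language code, e.g. "ar", "hy", "syc" (assumed codes)
    original: str  # the name in its native script
    translit: str  # the name in Latin transliteration

def parse_prefs(raw):
    """Parse a string like 'ar=original,hy=translit' into {lang: mode},
    silently skipping malformed or unknown entries."""
    prefs = {}
    for item in raw.split(","):
        lang, sep, mode = item.strip().partition("=")
        if sep and mode in MODES:
            prefs[lang] = mode
    return prefs

def render(name, prefs):
    """Display a name according to the user's preference for its language."""
    mode = prefs.get(name.lang, DEFAULT_MODE)
    if mode == "original":
        return name.original
    if mode == "translit":
        return name.translit
    return f"{name.original} = {name.translit}"

prefs = parse_prefs("ar=original,hy=translit")
for n in [Name("ar", "الموصل", "al-Mawṣil"), Name("hy", "Մօսլ", "Mōsl"),
          Name("syc", "ܢܝܢܘܐ", "Nīnāwā")]:
    print(render(n, prefs))
```

The logic itself is short. The expense lies in everything around it: storing and reading the preference, offering users a way to set it, and testing every page in every combination of modes.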

There are probably other solutions, but each of these listed has both benefits and drawbacks.  Unicode is wonderful, and it resolved a specific technical problem.  But engaging with multiple languages, one or more of which may not be familiar to a user, is a human cultural problem more than a computer problem.  And as usual with human problems, there is no complete solution, only multiple available partial ameliorations.  Different options might make more sense in distinct contexts, and for diverse goals.  Digital humanities are necessarily tactical.

