Lingua::DetectCyrillic. Detection of 7 Cyrillic codings and 2 languages


Lingua::DetectCyrillic. The package detects 7 Cyrillic codings as well as the language - Russian or Ukrainian. Uses embedded frequency dictionaries; usually one word is enough for correct detection.


  use Lingua::DetectCyrillic;
  # - or, if you need the translation functions -
  use Lingua::DetectCyrillic qw ( &TranslateCyr &toLowerCyr &toUpperCyr );
  # Create a new Lingua::DetectCyrillic object. By default, no more than 100
  # Cyrillic tokens (words) are analyzed; Ukrainian is not detected.
  $CyrDetector = Lingua::DetectCyrillic ->new();
  # The same, but analyze up to 200 tokens and detect both Russian and
  # Ukrainian.
  $CyrDetector = Lingua::DetectCyrillic ->new( MaxTokens => 200, DetectAllLang => 1 );
  # Detect coding and language
  my ($Coding,$Language,$CharsProcessed,$Algorithm)= $CyrDetector -> Detect( @Data );
  # Write report
  $CyrDetector -> LogWrite(); # write to STDOUT
  $CyrDetector -> LogWrite('report.log'); # write to file
  # Convert to lowercase, assuming the source coding is windows-1251
  $s=toLowerCyr($String, 'win');
  # Convert to uppercase, assuming the source coding is windows-1251
  $s=toUpperCyr($String, 'win');
  # Convert from one coding to another
  # Acceptable coding definitions are win, koi, koi8u, mac, iso, dos, utf
  $s=TranslateCyr('win', 'koi',$String);

See Additional information on usage of this package.


This package automatically detects all Cyrillic codings in live use - windows-1251, koi8-r, koi8-u, iso-8859-5, utf-8, cp866, x-mac-cyrillic - as well as the language, Russian or Ukrainian. It applies 3 detection algorithms: formal analysis of alphabet hits, frequency analysis of words, and frequency analysis of 2-letter combinations.

It also provides routines for conversion between different codings of Cyrillic texts which can be imported if necessary.
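For comparison, the same kind of conversion can be sketched with Perl's core Encode module. This is not how TranslateCyr is implemented; the helper convert_cyr and the mapping from the short names (win, koi, ...) to Encode names are assumptions made only for this illustration.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode from_to);

# Hypothetical mapping from the package's short coding names to Encode names.
my %ENCODE_NAME = (
    win   => 'cp1251',
    koi   => 'koi8-r',
    koi8u => 'koi8-u',
    mac   => 'MacCyrillic',
    iso   => 'iso-8859-5',
    dos   => 'cp866',
    utf   => 'UTF-8',
);

# Convert a byte string from one coding to another; returns the new bytes.
sub convert_cyr {
    my ($from, $to, $bytes) = @_;
    my $from_name = $ENCODE_NAME{$from} or die "Unknown coding: $from";
    my $to_name   = $ENCODE_NAME{$to}   or die "Unknown coding: $to";
    from_to($bytes, $from_name, $to_name);   # converts the local copy in place
    return $bytes;
}

my $win_bytes = encode('cp1251', 'Привет');
my $koi_bytes = convert_cyr('win', 'koi', $win_bytes);
```

The argument order (source coding, target coding, string) mirrors the TranslateCyr call shown in the synopsis.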

The package can detect the coding from only one or two words. Of course, with a single word reliability is lower, especially if the test words are written entirely in lowercase or uppercase, since capitalization is a very important attribute for coding detection. Nevertheless, the package correctly recognizes the coding of a message containing a single word, even all lowercase - 'privet' ('hello' in Russian), 'ivan', 'vodka', 'sputnik'. ;-)))

The language is reported as Ukrainian only if the text contains letters specific to Ukrainian.
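A minimal sketch of such a check, assuming the text has already been decoded to a Unicode string. The function name guess_language and the returned labels are hypothetical; the package's own test and return values may differ.

```perl
use strict;
use warnings;
use utf8;

# Letters used in Ukrainian but not in Russian:
# U+0490/0491 (Ґ ґ), U+0404/0454 (Є є), U+0406/0456 (І і), U+0407/0457 (Ї ї).
my $UKRAINIAN_ONLY =
    qr/[\x{0490}\x{0491}\x{0404}\x{0454}\x{0406}\x{0456}\x{0407}\x{0457}]/;

# Takes a decoded (Unicode) string and returns an illustrative language label.
sub guess_language {
    my ($text) = @_;
    return $text =~ $UKRAINIAN_ONLY ? 'Ukrainian' : 'Russian';
}
```

A Ukrainian text that happens to contain none of these letters would still be labeled Russian, which matches the behaviour described above.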

Performance is good because the analysis runs in two stages: the first performs only a fast, formal analysis of capitalization and alphabet hits, and only if this data is insufficient is the input analyzed a second time, against the frequency dictionaries.


The package currently requires Unicode::String and Unicode::Map8, which must be downloaded and installed separately. See Additional information on packages to be installed.

I plan to implement my own support for character decoding, so these packages will not be required in future releases.

  1. Unicode::Map8
    Basic package for conversion between different one-byte codings. Available at .

    Warning! This module requires preliminary compilation with a C++ compiler. Under Unix this procedure goes smoothly and needs no comment, but under Win32 with ActiveState Perl you must

    1. use MS Visual C++ and
    2. make some manual changes to the source after running Makefile.PL
      Open map8x.c and change line 97 from
          ch = PerlIO_getc(f);

      to

          ch = getc(f);

      In short, you need to replace the Perl wrapper for the C function getc with the function itself. The compiler produces warnings, but the result is a 100% working DLL.

  2. Unicode::String
    Provides support for Unicode::Map8. Available at .



Stage 1. Formal analysis of alphabet hits and capitalization

When I started programming, I proceeded from an obvious fact: a 'human' reader can easily determine the coding and language at a glance, or at least tell that the text is displayed in the wrong coding. The point is that the alphabets, i.e. the letter positions, of most Cyrillic codings do not coincide, so if we display text in the wrong coding we inevitably see messy characters inside words that cannot be typed with a Russian or Ukrainian keyboard layout in the standard way - currency signs, punctuation marks, Serbian letters, sometimes binary characters, etc.

In fact there is only one hard case: the two most popular Cyrillic codings - windows-1251 and koi8-r - have their alphabets in the same range, from 192 to 255, but the uppercase letters of windows-1251 occupy the codes of the lowercase letters of koi8-r and vice versa. So 'Ivan Petrov' in one of these codings will look like 'iVAN pETROV' in the other, i.e. it will have completely wrong capitalization, which can also be easily detected by formal analysis of the characters. And as you may guess, any more or less consistent Cyrillic text must have at least one word starting with a capital letter (I don't take into consideration some weird Internet inhabitants WRITING ALL WITH CAPITAL LETTERS ;-).
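The capitalization swap can be reproduced with Perl's core Encode module. This is only a sketch of the effect, not the package's code; case_pattern is a hypothetical helper. Note that the misread letters themselves also change, so 'iVAN' illustrates the capitalization pattern rather than the exact characters.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode decode);

# Describe a word's capitalization: U = uppercase, l = lowercase, ? = other.
sub case_pattern {
    my ($word) = @_;
    my $pattern = '';
    for my $ch (split //, $word) {
        $pattern .= $ch =~ /\p{Lu}/ ? 'U' : $ch =~ /\p{Ll}/ ? 'l' : '?';
    }
    return $pattern;
}

my $bytes   = encode('cp1251', 'Иван');   # the word as stored in windows-1251
my $correct = decode('cp1251', $bytes);   # read back in the right coding
my $misread = decode('koi8-r', $bytes);   # the same bytes read as koi8-r

binmode STDOUT, ':encoding(UTF-8)';
printf "%s -> %s\n", $correct, case_pattern($correct);
printf "%s -> %s\n", $misread, case_pattern($misread);
```

Read in the right coding the word has the pattern Ulll; read as koi8-r it has the pattern lUUU.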

So at the first stage of analysis the program successively assumes that the given text was written in each of the 6 or 7 Cyrillic codings and calculates:
1. how many tokens contain 'bad' characters which are not part of the Russian or Ukrainian alphabet and cannot be typed with a standard keyboard layout;
2. how many tokens have improper capitalization, i.e. anything other than normal UPPERCASE, lowercase, or Proper-case words.
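The two counts can be sketched as follows, using the core Encode module to try each candidate coding. This is a simplified illustration under my own assumptions (tokens are runs of non-ASCII characters, the winner is the coding with the fewest penalized tokens); formal_score and best_coding are hypothetical names, not part of the package.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode decode);

# Letters of the Russian and Ukrainian alphabets, as Unicode code points.
my $LETTER = qr/[\x{0410}-\x{044F}\x{0401}\x{0451}\x{0404}\x{0454}\x{0406}\x{0456}\x{0407}\x{0457}\x{0490}\x{0491}]/;

# For one candidate coding, count tokens with bad characters and tokens
# with improper capitalization.
sub formal_score {
    my ($bytes, $coding) = @_;
    my $text = decode($coding, $bytes);
    my ($bad, $miscased) = (0, 0);
    for my $token ($text =~ /([^\x00-\x7F]+)/g) {
        if ($token !~ /^$LETTER+$/) {
            $bad++;                       # character outside the alphabet
        }
        elsif ($token !~ /^(?:\p{Lu}+|\p{Ll}+|\p{Lu}\p{Ll}+)$/) {
            $miscased++;                  # not UPPER, lower, or Proper
        }
    }
    return ($bad, $miscased);
}

# Return the candidate coding with the fewest penalized tokens.
sub best_coding {
    my ($bytes, @candidates) = @_;
    my %penalty;
    for my $coding (@candidates) {
        my ($bad, $miscased) = formal_score($bytes, $coding);
        $penalty{$coding} = $bad + $miscased;
    }
    my ($winner) = sort { $penalty{$a} <=> $penalty{$b} || $a cmp $b } @candidates;
    return $winner;
}

my $bytes  = encode('koi8-r', 'Привет, Иван Петров');
my $winner = best_coding($bytes, qw(cp1251 koi8-r cp866 iso-8859-5));
```

For this input, reading the bytes as windows-1251 gives three tokens with reversed capitalization and no bad characters, while koi8-r gives no penalties at all.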

This formal analysis is very fast and suits 99.9% of real texts. Wrong codings are easily filtered out and we are left with a single 'absolute winner'. The method is also reliable: I can hardly imagine a normal person writing in reverse capitalization. But what if we have only a few words and all of them are in uppercase or lowercase?

Stage 2. Frequency analysis of words and 2-letter combinations.

In this case we apply frequency analysis of words and 2-letter combinations, called also hashes (not in Perl sense, certainly ;-).

The package has dictionaries of the 300 most frequent Russian and Ukrainian words and of nearly 600 most frequent Russian and Ukrainian 2-letter combinations, which I built myself (the input texts were perhaps not very typical of Internet authors, but any linguist will assure you this matters little: the first few hundred most popular words in any language are very stable, to say nothing of letter combinations).

So the text is analyzed a second time (this shouldn't take long, as we only get into this situation with a very short text); all Cyrillic letters are analyzed, regardless of capitalization. If at least one dictionary word is found, the coding is determined from it; otherwise, from a comparison of letter hashes.
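A rough sketch of this second stage, under stated assumptions: the word and 2-letter tables below are tiny toy tables invented for the example (the package's real dictionaries are much larger), and frequency_score and pick_by_frequency are hypothetical names.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode decode);

# Toy tables for illustration only; not the package's dictionaries.
my %WORDS   = map { $_ => 1 } qw(и в не на что привет);
my %BIGRAMS = map { $_ => 1 } qw(ст но то на ен ов ни ра во ко);

# Score one candidate coding: dictionary word hits and 2-letter hits.
sub frequency_score {
    my ($bytes, $coding) = @_;
    my $text = lc decode($coding, $bytes);   # capitalization is ignored here
    my ($word_hits, $bigram_hits) = (0, 0);
    for my $token ($text =~ /([^\x00-\x7F]+)/g) {
        $word_hits++ if $WORDS{$token};
        for my $i (0 .. length($token) - 2) {
            $bigram_hits++ if $BIGRAMS{ substr($token, $i, 2) };
        }
    }
    return ($word_hits, $bigram_hits);
}

# Decide on word hits if any coding has one, otherwise on 2-letter hits.
sub pick_by_frequency {
    my ($bytes, @candidates) = @_;
    my %score;
    for my $coding (@candidates) {
        $score{$coding} = [ frequency_score($bytes, $coding) ];
    }
    my $any_word = grep { $score{$_}[0] > 0 } @candidates;
    my $index    = $any_word ? 0 : 1;
    my ($winner) = sort { $score{$b}[$index] <=> $score{$a}[$index] } @candidates;
    return $winner;
}

my $bytes  = encode('cp1251', 'столовая');   # all lowercase, not in %WORDS
my $winner = pick_by_frequency($bytes, qw(cp1251 koi8-r));
```

Here no dictionary word matches, so the decision falls back to the 2-letter combinations, which only match when the bytes are read as windows-1251.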

In some very rare cases (usually in a very artificial situation when we have only one short word written all in lower- or uppercase) the statistics on several codings are equal. In this case we prefer windows-1251 to mac, koi8-r to koi8-u and - if nothing helps - windows-1251 to koi8-r.
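The stated preferences can be expressed as a simple ranking. This is a sketch only: the text fixes three pairs (win over mac, koi over koi8u, win over koi), while the rest of the order below, including where mac sits relative to koi, is my assumption, and break_tie is a hypothetical helper.

```perl
use strict;
use warnings;

# Lower rank wins a tie. Only win < mac, koi < koi8u and win < koi come
# from the text; the remaining positions are assumed for illustration.
my %TIE_RANK = (
    win => 0, koi => 1, koi8u => 2, mac => 3, iso => 4, dos => 5, utf => 6,
);

# Given the codings whose statistics are equal, return the preferred one.
sub break_tie {
    my (@tied) = @_;
    my ($winner) = sort { $TIE_RANK{$a} <=> $TIE_RANK{$b} } @tied;
    return $winner;
}
```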

To find out which algorithm was applied, examine the 4th value returned by the Detect function, $Algorithm. A more detailed explanation is given in the table Algorithm codes explanation.


Modern Cyrillic codings and where they are used

The supported codings are: windows-1251, koi8-r, koi8-u, iso-8859-5, utf-8, cp866, x-mac-cyrillic.

Algorithm codes explanation

Code  Explanation
11    Formal analysis of quantity/capitalization of Cyrillic characters; only one alternative found
21    Formal analysis of quantity/capitalization of Cyrillic characters; two alternatives found (koi8-r and koi8-u); koi8-r chosen
22    Formal analysis of quantity/capitalization of Cyrillic characters; two alternatives found (win1251 and mac); win1251 chosen
31    At least one word from the dictionary found and there is only one alternative
32    At least one hash from the hash dictionary found and there is only one alternative
33    Formally win1251 defined (most probably on analysis of hash)
34    Formally koi8-r defined (most probably on analysis of hash)
40    Most probable results were chosen, but reliability is very low
100   No single Cyrillic character detected
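If you want readable messages in your own reports, the codes above can be kept in a small lookup table. This is a convenience sketch, not part of the package; explain_algorithm is a hypothetical helper and the descriptions are shortened from the table.

```perl
use strict;
use warnings;

# Short descriptions of the $Algorithm codes returned by Detect.
my %ALGORITHM = (
    11  => 'Formal analysis; only one alternative found',
    21  => 'Formal analysis; koi8-r chosen over koi8-u',
    22  => 'Formal analysis; win1251 chosen over mac',
    31  => 'Dictionary word found; only one alternative',
    32  => 'Dictionary hash found; only one alternative',
    33  => 'Formally win1251 (most probably on analysis of hash)',
    34  => 'Formally koi8-r (most probably on analysis of hash)',
    40  => 'Most probable result chosen; reliability is very low',
    100 => 'No single Cyrillic character detected',
);

# Return the description for a code, or a fallback for unknown codes.
sub explain_algorithm {
    my ($code) = @_;
    return exists $ALGORITHM{$code}
        ? $ALGORITHM{$code}
        : "Unknown algorithm code: $code";
}
```

It would typically be called with the 4th value returned by Detect, for example explain_algorithm($Algorithm).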


December 01, 2002 - Extensive Russian documentation added. Version changed to 0.02.

November 19, 2002 - version 0.01 released.


1. Own Unicode support.

2. Option to detect only necessary codings from a list.

What else? Need your feedback!!


The author: Alexei Rudenko, Russia, Moscow. My home phone is (095) 468-95-63


CPAN address:


Copyright (c) 2002 Alexei Rudenko. All rights reserved.
