package Unicode::Transform; require 5.006001; use strict; no warnings 'utf8'; use vars qw($VERSION @ISA @EXPORT @EXPORT_OK %EXPORT_TAGS); require Exporter; require DynaLoader; $VERSION = '0.34'; @ISA = qw(Exporter DynaLoader); our @UTF_names = qw(utf16le utf16be utf32le utf32be utf8 utf8mod utfcp1047); our @Codenames = ("unicode", @UTF_names); our %EXPORT_TAGS = ( 'from' => [ map('unicode_to_'.$_, @UTF_names) ], 'to' => [ map($_.'_to_unicode', @UTF_names) ], 'chr' => [ map('chr_'.$_, @Codenames) ], 'ord' => [ map('ord_'.$_, @Codenames) ], ); for my $a (@Codenames) { for my $b (@Codenames) { push @{ $EXPORT_TAGS{'conv'} }, "${a}_to_${b}"; } } @EXPORT = (map @$_, @EXPORT_TAGS{qw(from to)}); @EXPORT_OK = (map @$_, @EXPORT_TAGS{qw(conv chr ord)}); $EXPORT_TAGS{'all'} = \ @EXPORT_OK; bootstrap Unicode::Transform $VERSION; 1; __END__ =head1 NAME Unicode::Transform - conversion among Unicode Transformation Formats =head1 SYNOPSIS use Unicode::Transform ':all'; $unicode_string = utf16be_to_unicode($utf16be_string); $utf16le_string = unicode_to_utf16le($unicode_string); $utf8_string = utf32be_to_utf8 ($utf32be_string); $utf8_string = utf32be_to_utf8(\&chr_utf8, $utf32be_string); # ill-formed octet sequences are allowed. =head1 DESCRIPTION This module provides some functions to convert a string among some Unicode Transformation Formats (UTF). =head2 Conversion Between UTF (Exporting: C) =over 4 =item CSRC_UTF_NAMEE_to_EDST_UTF_NAMEE([CALLBACK,] STRING)> =back Returns a string in DST_UTF_NAME corresponding to STRING in SRC_UTF_NAME. B A function name consists of SRC_UTF_NAME, a string '_to_', and DST_UTF_NAME. SRC_UTF_NAME and DST_UTF_NAME must be one in the list of hyphen-removed and lowercased names following: unicode (for Perl internal Unicode encoding; see perlunicode) utf16le (for UTF-16LE) utf16be (for UTF-16BE) utf32le (for UTF-32LE) utf32be (for UTF-32BE) utf8 (for UTF-8) utf8mod (for UTF-8-Mod) utfcp1047 (for CP1047-oriented UTF-EBCDIC). In all, 64 (i.e. 8 times 8) functions are available. Available function names include C and C. DST_UTF_NAME may be same as SRC_UTF_NAME like C. Conversions where both SRC_UTF_NAME and DST_UTF_NAME begin at 'utf' are defined well and stably. In contrast to these UTF, the Perl internal Unicode encoding is influenced by the platform-dependent features (e.g. 32bit/64bit, ASCII/EBCDIC). B If the first parameter is a reference, that is regarded as the CALLBACK. Any reference will not allowed as STRING. If CALLBACK is given, the second parameter is STRING; otherwise the first is. Currently, only code references are allowed as CALLBACK. If CALLBACK is omitted, only Unicode scalar values (C<0x0000..0xD7FF> and C<0xE000..0x10FFFF>) are allowed. Ill-formed octet sequences (corresponding to a code point outside the range of Unicode scalar values) and partial octets (which does not correspond to any code point) are deleted, as if a code reference constantly returning an empty string, C, was used as CALLBACK. Examples of partial octets: the first octet without following octets in UTF-8 like C<"\xC2">; the last octet in UTF-16BE,LE with odd number of octets. If CALLBACK is specified, the appearance of an ill-formed octet sequences or a partial octet calls the code reference. The first parameter for CALLBACK is the unsigned integer value of its code point; if the value is lesser than 256, that is a partial octet. The return value from CALLBACK will be inserted there. You may use CDST_UTF_NAMEE()> as CALLBACK (see below). Return value from CALLBACK should be in UTF of DST_UTF_NAME. You can call C or C in CALLBACK when you want to stop the operation if the whole STRING would not be well-formed. =head2 Conversion from Code Point to String (Exporting: C) =over 4 =item chr_CDST_UTF_NAMEE(CODEPOINT)> =back Returns a string in DST_UTF_NAME corresponding to CODEPOINT. CODEPOINT should be an unsigned integer. If CODEPOINT is outside the range of Unicode scalar values, a corresponding ill-formed octet sequence will be returned. If CODEPOINT is greater than the maximum value, returns C. The maximum value of CODEPOINT is: 0x0010_FFFF for chr_utf16le() and chr_utf16be() 0x7FFF_FFFF for chr_utf8(), chr_utf8mod(), chr_utfcp1047() 0xFFFF_FFFF for chr_utf32le(), chr_utf32be() The maximum value of CODEPOINT for C depends on the platform features (e.g. 32bit/64bit, ASCII/EBCDIC). B The full list of functions provided: =over 4 =item C =item C =item C =item C =item C =item C =item C =item C =back =head2 Numeric Value of the First Character (Exporting: C) =over 4 =item ord_CSRC_UTF_NAMEE(STRING)> =back Returns an unsigned integer value of the first character of STRING in SRC_UTF_NAME. STRING may begin at an ill-formed octet sequence corresponding to a surrogate code point (C<0xD800..0xDFFF>) or an out-of-range code point (C<0x110000> and greater). If STRING is empty or begins at a partial octet, returns C. B The full list of functions provided: =over 4 =item C =item C =item C =item C =item C =item C =item C =item C =back =head1 AUTHOR SADAHIRO Tomoyuki Copyright(C) 2002-2005, SADAHIRO Tomoyuki. Japan. All rights reserved. This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself. =head1 SEE ALSO =over 4 =item http://www.unicode.org/reports/tr16/ UTF-EBCDIC (and UTF-8-Mod) =back =cut