NAME
    Unicode::Japanese - Convert encoding of Japanese text

SYNOPSIS
     use Unicode::Japanese;
     use Unicode::Japanese qw(unijp);
 
     # convert utf8 -> sjis
 
     print Unicode::Japanese->new($str)->sjis;
     print unijp($str)->sjis; # same as above.
 
     # convert sjis -> utf8
 
     print Unicode::Japanese->new($str,'sjis')->get;
 
     # convert sjis (imode_EMOJI) -> utf8
 
     print Unicode::Japanese->new($str,'sjis-imode')->get;
 
     # convert zenkaku (utf8) -> hankaku (utf8)
 
     print Unicode::Japanese->new($str)->z2h->get;

DESCRIPTION
    The Unicode::Japanese module converts Japanese text from one character
    encoding to another.

  FEATURES
    * An instance of Unicode::Japanese internally holds a string in UTF-8.

    * This module is implemented in two ways: XS and pure Perl. If
      efficiency matters to you, build and install the XS module. If you
      don't want to, or can't build the XS module, you may use the pure
      Perl module instead. In that case, all you have to do is copy
      Japanese.pm to somewhere in @INC.

    * This module can convert characters from zenkaku (full-width) form to
      hankaku (half-width) form, and vice versa. Conversion between
      hiragana and katakana (the two Japanese phonetic syllabaries) is
      also supported.

    * This module has mapping tables for emoji (graphic characters) defined
      for various Japanese mobile phone services: DoCoMo i-mode, ASTEL
      dot-i and J-PHONE J-Sky. These characters are mapped into the Unicode
      Private Use Area, so the Unicode strings the module outputs remain
      valid even if they contain emoji, and you can safely pass them to
      other software that can handle Unicode.

    * This module can map some emoji from one set to another. Different
      mobile phones define different sets of emoji, so mapping between
      them is not always possible. But since some emoji exist in two or
      more sets with similar appearance, this module treats those emoji as
      the same.

    * This module uses the mapping table for MS-CP932 instead of the
      standard Shift_JIS. The Shift_JIS encoding used by MS-Windows
      (MS-SJIS/MS-CP932) slightly differs from the standard.

    * When the module converts strings from Unicode to Shift_JIS, EUC-JP or
      ISO-2022-JP, Unicode characters that can't be represented in those
      encodings are encoded in "&#dddd;" form (decimal character
      reference). Note, however, that characters in the Unicode Private Use
      Area are replaced with a '?' mark ('QUESTION MARK'; U+003F) instead
      of being encoded. In addition, when encoding to character sets for
      mobile phones, every unrepresentable character becomes a '?' mark.

    * On perl-5.8.0 or later, this module handles the UTF-8 flag: the method
      utf8() returns a UTF-8 *byte* string, and the method getu() returns
      a UTF-8 *character* string.

      Currently the method get() returns a UTF-8 *byte* string, but this
      behavior may change in the future.

      Methods such as sjis(), jis() and utf8() return *byte* strings. The
      new(), set() and getcode() methods simply ignore the UTF-8 flag of
      the strings they take.
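
    The difference between a byte string and a character string can be
    seen with core Perl alone. The following is a minimal sketch, not
    part of the module: it assumes perl 5.8.1 or later for
    utf8::is_utf8(), and the variable names are chosen only for
    illustration. The result of utf8() corresponds to the first form and
    the result of getu() to the second.

    ```perl
    use strict;
    use warnings;

    # "\xE3\x81\x82" is the UTF-8 encoding of U+3042 (HIRAGANA LETTER A).
    my $bytes = "\xE3\x81\x82";     # byte string: three octets, flag off
    my $chars = $bytes;
    utf8::decode($chars);           # character string: one character, flag on

    printf "bytes: length=%d flag=%s\n",
        length($bytes), utf8::is_utf8($bytes) ? 'on' : 'off';
    printf "chars: length=%d flag=%s\n",
        length($chars), utf8::is_utf8($chars) ? 'on' : 'off';
    # prints "bytes: length=3 flag=off" and "chars: length=1 flag=on"
    ```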

REQUIREMENT
    *   perl 5.10.x, 5.8.x, etc. (5.004 and later)

    *   (optional) C compiler. This module supports both XS and pure Perl.
        If no C compiler is available, Unicode::Japanese is installed as a
        pure Perl module.

    *   (optional) Test.pm and Test::More for testing.

    No other modules are required at run time.

METHODS
    $s = Unicode::Japanese->new($str [, $icode [, $encode]])
        Create a new instance of Unicode::Japanese.

        Any given parameters are passed internally to the method set().

    $s = unijp($str [, $icode [, $encode]])
        Same as Unicode::Japanese->new(...).

    $s->set($str [, $icode [, $encode]])

        $str: string
        $icode: optional character encoding (default: 'utf8')
        $encode: optional binary encoding (default: no binary encodings are
        assumed)

        Store a string into the instance.

        Possible character encodings are:

         auto
         utf8 ucs2 ucs4
         utf16-be utf16-le utf16
         utf32-be utf32-le utf32
         sjis cp932 euc euc-jp jis
         sjis-imode sjis-imode1 sjis-imode2
         utf8-imode utf8-imode1 utf8-imode2
         sjis-doti sjis-doti1
         sjis-jsky sjis-jsky1 sjis-jsky2
         jis-jsky  jis-jsky1  jis-jsky2
         utf8-jsky utf8-jsky1 utf8-jsky2
         sjis-au sjis-au1 sjis-au2
         jis-au  jis-au1  jis-au2
         sjis-icon-au sjis-icon-au1 sjis-icon-au2
         euc-icon-au  euc-icon-au1  euc-icon-au2
         jis-icon-au  jis-icon-au1  jis-icon-au2
         utf8-icon-au utf8-icon-au1 utf8-icon-au2
         ascii binary

        (see also "SUPPORTED ENCODINGS".)

        If you want Unicode::Japanese to detect the character encoding of
        the string, you must explicitly specify 'auto' as the second
        argument. In that case, the given string is passed to the method
        getcode() to guess the encoding.

        For binary encodings, only 'base64' is currently supported. If you
        specify 'base64' as the third argument, the given string will be
        decoded using Base64 decoder.

        Specify 'binary' as the second argument if you want your string to
        be stored without modification.

        When you specify 'sjis-imode' or 'sjis-doti' as the character
        encoding, any occurrences of '&#dddd;' (decimal character reference)
        in the string are interpreted and decoded as emoji code points,
        just like emoji embedded in the string in binary form.

        Since the encoded forms of strings in various encodings are not
        clearly distinguishable from each other, it is not always possible
        to detect reliably which encoding is used for a given string.

        When a given string can be interpreted as both Shift_JIS and UTF-8,
        this module considers it to be encoded in Shift_JIS. And if the
        encoding can't be distinguished between 'sjis-au' and 'sjis-doti',
        this module considers it 'sjis-au'.

    $str = $s->get

        $str: string (UTF-8)

        Get the internal string in UTF-8.

        This method currently returns a byte string (whose UTF-8 flag is
        turned off), but this behavior may be changed in the future.

        If you absolutely want a byte string, you should use the method
        utf8() instead. And if you want a character string (whose UTF-8 flag
        is turned on), you have to use the method getu().

    $str = $s->getu

        $str: string (UTF-8)

        Get the internal string in UTF-8.

        On perl-5.8.0 or later, this method returns a character string with
        its UTF-8 flag turned on.

    $code = $s->getcode($str)

        $str: string
        $code: name of character encoding

        Detect the character encoding of a given string.

        Note that this method, exceptionally, doesn't operate on the
        internal string of an instance.

        To guess the encoding, the following algorithm is used:

        (For pure perl implementation)

        1   If the string has a UTF-32 BOM, its encoding is 'utf32'.

        2   If it has a UTF-16 BOM, its encoding is 'utf16'.

        3   If it is valid for UTF-32BE, its encoding is 'utf32-be'.

        4   If it is valid for UTF-32LE, its encoding is 'utf32-le'.

        5   If it contains no ESC characters or bytes whose eighth bit is
            on, its encoding is 'ascii'. All ASCII control characters
            (0x00-0x1F and 0x7F) except ESC (0x1B) are considered to be in
            the range of 'ascii'.

        6   If it contains escape sequences of ISO-2022-JP, its encoding is
            'jis'.

        7   If it contains any emoji defined for J-PHONE, its encoding is
            'sjis-jsky'.

        8   If it is valid for EUC-JP, its encoding is 'euc'.

        9   If it is valid for Shift_JIS, its encoding is 'sjis'.

        10  If it contains any emoji defined for au, and everything else is
            valid for Shift_JIS, its encoding is 'sjis-au'.

        11  If it contains any emoji defined for i-mode, and everything else
            is valid for Shift_JIS, its encoding is 'sjis-imode'.

        12  If it contains any emoji defined for dot-i, and everything else
            is valid for Shift_JIS, its encoding is 'sjis-doti'.

        13  If it is valid for UTF-8, its encoding is 'utf8'.

        14  If no conditions above are fulfilled, its encoding is 'unknown'.

        (For XS implementation)

        1   If the string has a UTF-32 BOM, its encoding is 'utf32'.

        2   If it has a UTF-16 BOM, its encoding is 'utf16'.

        3   Find all possible encodings that might have been applied to the
            string from the following:

            ascii / euc / sjis / jis / utf8 / utf32-be / utf32-le /
            sjis-jsky / sjis-imode / sjis-au / sjis-doti

        4   If any of those encodings is possible, this module picks the
            one with the highest priority among them. The priority order
            is as follows:

            utf32-be / utf32-le / ascii / jis / euc / sjis / sjis-jsky /
            sjis-imode / sjis-au / sjis-doti / utf8

        5   If no conditions above are fulfilled, its encoding is 'unknown'.

        Pay attention to the following pitfalls in the above algorithm:

        * UTF-8 strings might be accidentally considered to be encoded in
          Shift_JIS.

        * UCS-2 strings (sequences of raw UCS-2 characters in big-endian;
          each character always takes 2 bytes) can't be detected because
          they look like nothing but sequences of random bytes whose length
          is an even number.

        * UTF-16 strings must have a BOM to be detected.

        * Emoji are only recognized if they are embedded in the string in
          binary form. If they are written in '&#dddd;' form, they aren't
          considered to be emoji.

        Since the XS and pure Perl implementations use different algorithms
        to guess the encoding, they may guess differently for the same
        string. In particular, the pure Perl implementation detects
        Shift_JIS strings containing an ESC character (0x1B) as Shift_JIS,
        but the XS implementation doesn't. This is because such strings can
        hardly be distinguished from 'sjis-jsky'. For the same reason, the
        XS implementation also rejects EUC-JP strings containing an ESC
        character.
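
        The BOM checks and the priority-based choice described for the XS
        implementation can be sketched in plain Perl as follows. This is
        only an illustration under stated assumptions, not the module's
        code: bom_encoding() and pick_by_priority() are hypothetical
        helpers, the BOM byte sequences are the standard ones for UTF-32
        and UTF-16, and the priority list is assumed to run from highest
        to lowest.

        ```perl
        use strict;
        use warnings;

        # Priority order quoted from the XS description above
        # (assumed to be listed from highest to lowest).
        my @PRIORITY = qw(utf32-be utf32-le ascii jis euc sjis sjis-jsky
                          sjis-imode sjis-au sjis-doti utf8);

        # Hypothetical helper: 'utf32' or 'utf16' if $bytes starts with a BOM.
        sub bom_encoding {
            my ($bytes) = @_;
            # Test the 4-byte UTF-32 BOMs first: the little-endian UTF-32
            # BOM begins with the same two bytes as the UTF-16 one.
            return 'utf32'
                if $bytes =~ /\A(?:\x00\x00\xFE\xFF|\xFF\xFE\x00\x00)/;
            return 'utf16' if $bytes =~ /\A(?:\xFE\xFF|\xFF\xFE)/;
            return;
        }

        # Hypothetical helper: highest-priority name among the candidates.
        sub pick_by_priority {
            my (@candidates) = @_;
            my %possible = map { $_ => 1 } @candidates;
            for my $name (@PRIORITY) {
                return $name if $possible{$name};
            }
            return 'unknown';
        }

        # A string that is valid as both UTF-8 and Shift_JIS is reported
        # as Shift_JIS, which is the first pitfall listed above.
        print pick_by_priority(qw(utf8 sjis)), "\n";    # prints "sjis"
        ```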

    $code = $s->getcodelist($str)

        $str: string
        $code: names of possible character encodings

        Detect the character encoding of a given string.

        Unlike the method getcode(), getcodelist() returns a list of
        possible encodings.

    $str = $s->conv($ocode [, $encode])

        $ocode: character encoding (possible encodings are:)
           utf8 ucs2 ucs4 utf16
           sjis cp932 euc euc-jp jis
           sjis-imode sjis-imode1 sjis-imode2
           utf8-imode utf8-imode1 utf8-imode2
           sjis-doti sjis-doti1
           sjis-jsky sjis-jsky1 sjis-jsky2
           jis-jsky  jis-jsky1  jis-jsky2
           utf8-jsky utf8-jsky1 utf8-jsky2
           sjis-au sjis-au1 sjis-au2
           jis-au  jis-au1  jis-au2
           sjis-icon-au sjis-icon-au1 sjis-icon-au2
           euc-icon-au  euc-icon-au1  euc-icon-au2
           jis-icon-au  jis-icon-au1  jis-icon-au2
           utf8-icon-au utf8-icon-au1 utf8-icon-au2
           binary

          (see also "SUPPORTED ENCODINGS".)

          Some encodings for mobile phones have a trailing digit, like
          'sjis-au2'. The digit is the version number of the encoding.
          Each such encoding also has a variant with no trailing digit,
          like 'sjis-au', which is the same as the latest version among
          its variants.

        $encode: optional binary encoding
        $str: string

        Return the internal string of the instance, encoded in the given
        character encoding.

        If you want the resulting string to be encoded in Base64, specify
        'base64' as the second argument.

        On perl-5.8.0 or later, the UTF-8 flag of the resulting string is
        turned off even if you specify 'utf8' as the first argument.
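
        As a usage sketch, the two arguments can be combined as shown
        below. It assumes Unicode::Japanese is installed and relies only
        on new() and conv() as documented here, plus decode_base64() from
        the core module MIME::Base64. The sample string and variable
        names are illustrative. The Base64 text is decoded and compared
        rather than printed, since its exact line wrapping is not
        specified in this document.

        ```perl
        use strict;
        use warnings;
        use MIME::Base64 qw(decode_base64);
        use Unicode::Japanese;

        # U+3042 U+3044 as UTF-8 bytes; the input encoding defaults to 'utf8'.
        my $utf8 = "\xE3\x81\x82\xE3\x81\x84";
        my $s    = Unicode::Japanese->new($utf8);

        my $sjis   = $s->conv('sjis');              # Shift_JIS byte string
        my $sjis64 = $s->conv('sjis', 'base64');    # the same bytes in Base64

        print unpack('H*', $sjis), "\n";            # hex dump of the bytes
        print "round trip ok\n" if decode_base64($sjis64) eq $sjis;
        ```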

    $s->tag2bin
        Interpret decimal character references (&#dddd;) in the instance,
        and replace them with the single characters they represent.

    $s->z2h
        Replace zenkaku (full-width) letters in the instance with hankaku
        (half-width) letters.

    $s->h2z
        Replace hankaku (half-width) letters in the instance with zenkaku
        (full-width) letters.

    $s->hira2kata
        Replace any hiragana in the instance with katakana.

    $s->kata2hira
        Replace any katakana in the instance with hiragana.
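
        The core of the two kana conversions can be sketched with tr///,
        because the basic hiragana block (U+3041 - U+3096) and the
        corresponding katakana (U+30A1 - U+30F6) sit at a fixed offset of
        0x60 in Unicode. This is a minimal sketch on character strings,
        not the module's implementation: the helper names are
        hypothetical, and the module may handle further cases that this
        sketch ignores.

        ```perl
        use strict;
        use warnings;

        # Hypothetical helpers covering only U+3041-U+3096 / U+30A1-U+30F6.
        sub hira2kata_sketch {
            my ($chars) = @_;                   # character string
            (my $out = $chars) =~ tr/\x{3041}-\x{3096}/\x{30A1}-\x{30F6}/;
            return $out;
        }

        sub kata2hira_sketch {
            my ($chars) = @_;                   # character string
            (my $out = $chars) =~ tr/\x{30A1}-\x{30F6}/\x{3041}-\x{3096}/;
            return $out;
        }

        my $kata = hira2kata_sketch("\x{3042}\x{3044}");
        printf "U+%04X U+%04X\n", map { ord($_) } split //, $kata;
        # prints "U+30A2 U+30A4"
        ```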

    $str = $s->jis
        $str: byte string in ISO-2022-JP

        Return the internal string of the instance, encoded in ISO-2022-JP.

    $str = $s->euc
        $str: byte string in EUC-JP

        Return the internal string of the instance, encoded in EUC-JP.

    $str = $s->utf8
        $str: byte string in UTF-8

        Return the internal UTF-8 string of the instance.

        On perl-5.8.0 or later, the UTF-8 flag of the resulting string is
        turned off.

    $str = $s->ucs2
        $str: byte string in UCS-2

        Return the internal string of the instance as a sequence of raw
        UCS-2 characters in big-endian. Note that this differs from
        UTF-16BE, as a raw UCS-2 sequence has no concept of surrogate
        pairs.

    $str = $s->ucs4
        $str: byte string in UCS-4

        Return the internal string of the instance as a sequence of raw
        UCS-4 characters in big-endian. This is practically the same as
        UTF-32BE.

    $str = $s->utf16
        $str: byte string in UTF-16

        Return the internal string of the instance, encoded in big-endian
        UTF-16 with no BOM prepended.

    $str = $s->sjis
        $str: byte string in Shift_JIS

        Return the internal string of the instance, encoded in Shift_JIS
        (MS-SJIS / MS-CP932).

    $str = $s->sjis_imode
        $str: byte string in 'sjis-imode'

        Return the internal string of the instance, encoded in
        'sjis-imode'.

    $str = $s->sjis_imode1
        $str: byte string in 'sjis-imode1'

        Return the internal string of the instance, encoded in
        'sjis-imode1'.

    $str = $s->sjis_imode2
        $str: byte string in 'sjis-imode2'

        Return the internal string of the instance, encoded in
        'sjis-imode2'.

    $str = $s->sjis_doti
        $str: byte string in 'sjis-doti'

        Return the internal string of the instance, encoded in 'sjis-doti'.

    $str = $s->sjis_jsky
        $str: byte string in 'sjis-jsky'

        Return the internal string of the instance, encoded in 'sjis-jsky'.

    $str = $s->sjis_jsky1
        $str: byte string in 'sjis-jsky1'

        Return the internal string of the instance, encoded in
        'sjis-jsky1'.

    $str = $s->sjis_jsky2
        $str: byte string in 'sjis-jsky2'

        Return the internal string of the instance, encoded in
        'sjis-jsky2'.

    $str = $s->sjis_icon_au
        $str: byte string in 'sjis-icon-au'

        Return the internal string of the instance, encoded in
        'sjis-icon-au'.

    $str_arrayref = $s->strcut($len)

        $len: maximum length of each chunk (in number of full-width
        characters)
        $str_arrayref: reference to array of strings

        Split the internal string of the instance into chunks of a given
        length.

        On perl-5.8.0 or later, the UTF-8 flag of each chunk is turned on.

    $len = $s->strlen
        $len: character width of the internal string

        Calculate the character width of the internal string. Half-width
        characters have width of one unit, and full-width characters have
        width of two units.
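
        The width rule can be illustrated with a small sketch. It is not
        the module's implementation: display_width() is a hypothetical
        helper, and the classification is an assumption made for the
        example (ASCII and the half-width katakana block U+FF61 - U+FF9F
        count as half-width, everything else as full-width). The module's
        own classification may differ.

        ```perl
        use strict;
        use warnings;

        # Hypothetical helper: sum of widths over a character string.
        sub display_width {
            my ($chars) = @_;
            my $width = 0;
            for my $cp (map { ord($_) } split //, $chars) {
                # Assumed rule: ASCII and half-width katakana are narrow.
                my $half = $cp <= 0x7F || ($cp >= 0xFF61 && $cp <= 0xFF9F);
                $width += $half ? 1 : 2;
            }
            return $width;
        }

        # "abc" (3) + U+3042 hiragana (2) + U+FF76 half-width katakana (1)
        print display_width("abc\x{3042}\x{FF76}"), "\n";    # prints 6
        ```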

    $s->join_csv(@values);
        @values: array of strings

        Build a line of CSV from the arguments, and store it into the
        instance. The resulting line has a trailing line break ("\n").

    @values = $s->split_csv;
        @values: array of strings

        Parse a line of CSV in the instance and return its columns. The
        line is chomp()ed before being parsed.

        If the internal string was decoded from 'binary' encoding (see
        methods new() and set()), the UTF-8 flags of the resulting array of
        strings are turned off. Otherwise the flags are turned on.

SUPPORTED ENCODINGS
     +---------------+----+-----+-------+
     |encoding       | in | out | guess |
     +---------------+----+-----+-------+
     |auto           : OK : --  | ----- |
     +---------------+----+-----+-------+
     |utf8           : OK : OK  | OK    |
     |ucs2           : OK : OK  | ----- |
     |ucs4           : OK : OK  | ----- |
     |utf16-be       : OK : --  | ----- |
     |utf16-le       : OK : --  | ----- |
     |utf16          : OK : OK  | OK(#) |
     |utf32-be       : OK : --  | OK    |
     |utf32-le       : OK : --  | OK    |
     |utf32          : OK : --  | OK(#) |
     +---------------+----+-----+-------+
     |sjis           : OK : OK  | OK    |
     |cp932          : OK : OK  | ----- |
     |euc            : OK : OK  | OK    |
     |euc-jp         : OK : OK  | ----- |
     |jis            : OK : OK  | OK    |
     +---------------+----+-----+-------+
     |sjis-imode     : OK : OK  | OK    |
     |sjis-imode1    : OK : OK  | ----- |
     |sjis-imode2    : OK : OK  | ----- |
     |utf8-imode     : OK : OK  | ----- |
     |utf8-imode1    : OK : OK  | ----- |
     |utf8-imode2    : OK : OK  | ----- |
     +---------------+----+-----+-------+
     |sjis-doti      : OK : OK  | OK    |
     |sjis-doti1     : OK : OK  | ----- |
     +---------------+----+-----+-------+
     |sjis-jsky      : OK : OK  | OK    |
     |sjis-jsky1     : OK : OK  | ----- |
     |sjis-jsky2     : OK : OK  | ----- |
     |jis-jsky       : OK : OK  | ----- |
     |jis-jsky1      : OK : OK  | ----- |
     |jis-jsky2      : OK : OK  | ----- |
     |utf8-jsky      : OK : OK  | ----- |
     |utf8-jsky1     : OK : OK  | ----- |
     |utf8-jsky2     : OK : OK  | ----- |
     +---------------+----+-----+-------+
     |sjis-au        : OK : OK  | OK    |
     |sjis-au1       : OK : OK  | ----- |
     |sjis-au2       : OK : OK  | ----- |
     |jis-au         : OK : OK  | ----- |
     |jis-au1        : OK : OK  | ----- |
     |jis-au2        : OK : OK  | ----- |
     |sjis-icon-au   : OK : OK  | ----- |
     |sjis-icon-au1  : OK : OK  | ----- |
     |sjis-icon-au2  : OK : OK  | ----- |
     |euc-icon-au    : OK : OK  | ----- |
     |euc-icon-au1   : OK : OK  | ----- |
     |euc-icon-au2   : OK : OK  | ----- |
     |jis-icon-au    : OK : OK  | ----- |
     |jis-icon-au1   : OK : OK  | ----- |
     |jis-icon-au2   : OK : OK  | ----- |
     |utf8-icon-au   : OK : OK  | ----- |
     |utf8-icon-au1  : OK : OK  | ----- |
     |utf8-icon-au2  : OK : OK  | ----- |
     +---------------+----+-----+-------+
     |ascii          : OK : --  | OK    |
     |binary         : OK : OK  | ----- |
     +---------------+----+-----+-------+
     (#): guessed only when the string has a BOM.

  GUESSING ORDER
     1.  utf32 (#)
     2.  utf16 (#)
     3.  utf32-be
     4.  utf32-le
     5.  ascii
     6.  jis
     7.  sjis-jsky (pp)
     8.  euc
     9.  sjis
     10. sjis-jsky (xs)
     11. sjis-au
     12. sjis-imode
     13. sjis-doti
     14. utf8
     15. unknown

DESCRIPTION OF UNICODE MAPPING
    Transcoding between Unicode encodings and other ones is performed as
    below:

    Shift_JIS
      This module uses the mapping table of MS-CP932.

      <ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT>

      When the module converts a Unicode string to Shift_JIS, it
      represents most characters that aren't available in Shift_JIS as
      decimal character references ('&#dddd;'). There is one exception:
      every graphic character for mobile phones is replaced with a '?'
      mark.

      For the variants of Shift_JIS defined for mobile phones, every
      unrepresentable character is replaced with a '?' mark, unlike plain
      Shift_JIS.
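
      The fallback rule for plain Shift_JIS can be imitated with the core
      Encode module, which is a different library from this one and is
      used here only to illustrate the behavior. This is a sketch under
      assumptions: sjis_with_fallback() is a hypothetical helper, Encode's
      'cp932' table stands in for the module's own table, and Private Use
      Area characters are detected with \p{Co}.

      ```perl
      use strict;
      use warnings;
      use Encode ();

      # Hypothetical helper, not the module's implementation.
      sub sjis_with_fallback {
          my ($chars) = @_;                     # character string
          # Private Use Area characters (general category Co) become '?'.
          (my $copy = $chars) =~ s/\p{Co}/?/g;
          # Anything else CP932 lacks becomes a decimal character reference.
          return Encode::encode('cp932', $copy, Encode::FB_HTMLCREF());
      }

      # U+3042 is in CP932, U+AC00 is not, U+FF800 is a Private Use
      # Area code point (the range this module uses for i-mode emoji).
      my $bytes = sjis_with_fallback("\x{3042}\x{AC00}\x{FF800}");
      print $bytes, "\n";   # bytes 0x82 0xA0, then "&#44032;", then "?"
      ```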

    EUC-JP/ISO-2022-JP
      This module doesn't convert Unicode strings directly from/to EUC-JP
      or ISO-2022-JP: it first converts from/to Shift_JIS and then does
      the rest of the translation. So characters that aren't available in
      Shift_JIS can't be translated properly.

    DoCoMo i-mode
      This module maps emoji in the range of F800 - F9FF to U+0FF800 -
      U+0FF9FF.

    ASTEL dot-i
      This module maps emoji in the range of F000 - F4FF to U+0FF000 -
      U+0FF4FF.

    J-PHONE J-SKY
      The encoding method defined by J-SKY is as follows: first comes an
      escape sequence "\e\$" to indicate the beginning of emoji, then the
      first byte of an emoji, then the second bytes of one or more emoji,
      and finally "\x0f" to indicate the end of emoji. If a string
      contains a series of emoji whose first bytes are identical, the
      sequence can be compressed by cascading their second bytes after
      the single first byte.

      This module considers a pair of those first and second bytes to be one
      character, and maps them from 4500 - 47FF to U+0FFB00 - U+0FFDFF.

      When the module encodes J-SKY emoji, it performs the compression
      automatically.
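
      The framing and compression described above can be sketched as
      follows. This is an illustration, not the module's code:
      jsky_encode() is a hypothetical helper that takes emoji as
      [first byte, second byte] pairs, and the byte values in the example
      are arbitrary values inside the 4500 - 47FF range.

      ```perl
      use strict;
      use warnings;

      # Hypothetical helper: frame a list of [first, second] byte pairs.
      sub jsky_encode {
          my (@emoji) = @_;
          my $out = '';
          my $i   = 0;
          while ($i < @emoji) {
              my $first = $emoji[$i][0];
              $out .= "\e\$" . chr($first);     # start marker + first byte
              # Cascade the second bytes of the run sharing this first byte.
              while ($i < @emoji && $emoji[$i][0] == $first) {
                  $out .= chr($emoji[$i][1]);
                  $i++;
              }
              $out .= "\x0f";                   # end marker
          }
          return $out;
      }

      # Two emoji with first byte 0x47 share one frame; 0x45 starts another.
      my $framed = jsky_encode([0x47, 0x21], [0x47, 0x22], [0x45, 0x21]);
      print unpack('H*', $framed), "\n";   # prints 1b244721220f1b2445210f
      ```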

    AU
      This module maps AU emoji to U+0FF500 - U+0FF6FF.
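
    The ranges listed above for i-mode, dot-i and J-SKY can be read as
    fixed offsets into the Private Use Area. The sketch below makes that
    reading explicit. It assumes the mapping is linear within each range,
    which the ranges suggest but this document does not state, and
    emoji_to_unicode() and its carrier keys are hypothetical names. AU is
    left out because no source range is given for it.

    ```perl
    use strict;
    use warnings;

    # carrier => [ first source code, last source code, first code point ]
    my %RANGE = (
        imode => [0xF800, 0xF9FF, 0x0FF800],
        doti  => [0xF000, 0xF4FF, 0x0FF000],
        jsky  => [0x4500, 0x47FF, 0x0FFB00],
    );

    # Hypothetical helper: undef if $code is outside the carrier's range.
    sub emoji_to_unicode {
        my ($carrier, $code) = @_;
        my $r = $RANGE{$carrier} or return;
        my ($lo, $hi, $base) = @$r;
        return if $code < $lo || $code > $hi;
        return $base + ($code - $lo);          # assumed linear mapping
    }

    printf "U+%06X\n", emoji_to_unicode('imode', 0xF89F);  # prints U+0FF89F
    ```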

PurePerl mode
       use Unicode::Japanese qw(PurePerl);

    To use the pure Perl implementation explicitly, pass 'PurePerl' as an
    argument to the "use" statement.

BUGS
    Please report bugs and requests to "bug-unicode-japanese at rt.cpan.org"
    or <http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Unicode-Japanese>. If
    you report them through the web interface, you will be notified
    automatically of any progress on your report.

    * This module doesn't convert Unicode strings directly from/to EUC-JP
      or ISO-2022-JP: it first converts from/to Shift_JIS and then does
      the rest of the translation. So characters that aren't available in
      Shift_JIS can't be translated properly.

    * The XS implementation of getcode() fails to detect the encoding when
      the given string contains \e while its encoding is EUC-JP or
      Shift_JIS.

    * Japanese.pm is composed of a textual Perl script and a binary
      character conversion table. If you transfer it over FTP in ASCII
      mode, the file will be corrupted.

SUPPORT
    You can find documentation for this module with the perldoc command.

        perldoc Unicode::Japanese

    You can find more information at:

    *   AnnoCPAN: Annotated CPAN documentation

        <http://annocpan.org/dist/Unicode-Japanese>

    *   CPAN Ratings

        <http://cpanratings.perl.org/d/Unicode-Japanese>

    *   RT: CPAN's request tracker

        <http://rt.cpan.org/NoAuth/Bugs.html?Dist=Unicode-Japanese>

    *   Search CPAN

        <http://search.cpan.org/dist/Unicode-Japanese>

CREDITS
    Thanks very much to:

    NAKAYAMA Nao

    SUGIURA Tatsuki & Debian JP Project

COPYRIGHT & LICENSE
    Copyright 2001-2008 SANO Taku (SAWATARI Mikage) and YAMASHINA Hio, all
    rights reserved.

    This program is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself.

