=head1 NAME

perlunitut - Perl Unicode Tutorial

=head1 DESCRIPTION

The days of just flinging strings around are over. It's well established that
modern programs need to be capable of communicating funny accented letters, and
things like euro symbols. This means that programmers need new habits. It's
easy to program Unicode capable software, but it does require discipline to do
it right.

There's a lot to know about character sets and text encodings. It's probably
best to spend a full day learning all this, but the basics can be learned in
minutes.

These are not the very basics, though. It is assumed that you already
know the difference between bytes and characters, and realise (and accept!)
that there are many different character sets and encodings, and that your
program has to be explicit about them. Recommended reading is "The Absolute
Minimum Every Software Developer Absolutely, Positively Must Know About Unicode
and Character Sets (No Excuses!)" by Joel Spolsky, at
L<http://joelonsoftware.com/articles/Unicode.html>.

This tutorial speaks in rather absolute terms, and provides only a limited view
of the wealth of character string related features that Perl has to offer. For
most projects, this information will probably suffice.

=head2 Definitions

It's important to set a few things straight first. This is the most important
part of this tutorial. This view may conflict with other information that you
may have found on the web, but that's mostly because many sources are wrong.

You may have to re-read this entire section a few times...

=head3 Unicode

B<Unicode> is a character set with room for lots of characters. The ordinal
value of a character is called a B<code point>. (In practice, the distinction
between code point and character is blurred, so the terms are often used
interchangeably.)

There are many, many code points, but computers work with bytes, and a byte has
room for only 256 values.  Unicode has many more characters than that,
so you need a method to make these accessible.

Unicode text can be stored in several competing encodings, of which UTF-8 is
the most widely used. In a Unicode encoding, multiple consecutive bytes can be
used to store a single code point, or simply: character.

=head3 UTF-8

B<UTF-8> is a Unicode encoding. Many people think that Unicode and UTF-8 are
the same thing, but they're not. There are more Unicode encodings, but much of
the world has standardized on UTF-8. 

UTF-8 treats the first 128 code points, 0..127, the same as ASCII. They take
only one byte per character. All other characters are encoded as two to
four bytes using a complex scheme. Fortunately, Perl handles this for
us, so we don't have to worry about it.
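
As a rough illustration of that scheme, the sketch below encodes one sample
character for each possible length and counts the resulting bytes. The sample
characters are our own picks, and C<encode> comes from the L<Encode> module
that is introduced later in this tutorial.

    use strict;
    use warnings;
    use Encode qw(encode);

    # One sample character per UTF-8 length (our own picks):
    # U+0041 "A", U+00E9 e-acute, U+20AC euro sign, U+1F600 an emoji.
    my @samples = ("\x{41}", "\x{E9}", "\x{20AC}", "\x{1F600}");

    # Encode each character on its own and count the bytes it needs.
    my @byte_counts = map { length(encode('UTF-8', $_)) } @samples;

    for my $i (0 .. $#samples) {
        printf "U+%04X takes %d byte(s)\n",
            ord($samples[$i]), $byte_counts[$i];
    }

This should report 1, 2, 3 and 4 bytes respectively.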

=head3 Text strings (character strings)

B<Text strings>, or B<character strings>, are made of characters. Bytes are
irrelevant here, and so are encodings. Each character is just that: the
character.

On a text string, you would do things like:

    $text =~ s/foo/bar/;
    if ($text =~ /^\d+$/) { ... }
    $text = ucfirst $text;
    my $character_count = length $text;

The value of a character (C<ord>, C<chr>) is the corresponding Unicode code
point.
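
A small sketch of that relationship, using a sample string of our own
choosing: C<chr> turns a code point into a character, C<ord> goes the other
way, and C<length> counts characters.

    use strict;
    use warnings;

    my $euro = chr 0x20AC;             # the character at code point U+20AC
    my $text = "caf\x{E9} " . $euro;   # c, a, f, e-acute, space, euro sign

    printf "U+%04X\n", ord($euro);     # prints U+20AC
    print length($text), "\n";         # prints 6: characters, not bytes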

=head3 Binary strings (byte strings)

B<Binary strings>, or B<byte strings>, are made of bytes. Here, you don't have
characters, just bytes. All communication with the outside world (anything
outside of your current Perl process) is done in binary.

On a binary string, you would do things like:

    my (@length_content) = unpack "(V/a)*", $binary;
    $binary =~ s/\x00\x0F/\xFF\xF0/;  # for the brave :)
    print {$fh} $binary;
    my $byte_count = length $binary;

=head3 Encoding

B<Encoding> (as a verb) is the conversion from I<text> to I<binary>. To encode,
you have to supply the target encoding, for example C<iso-8859-1> or C<UTF-8>.
Some encodings, like the C<iso-8859> ("latin") range, do not support the full
Unicode standard; characters that can't be represented are lost in the
conversion. (By default, L<Encode> puts a substitution character such as C<?>
in their place.)
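
The following sketch shows what "lost" can look like in practice. The sample
string is made up for illustration. It relies on the default behaviour of
L<Encode>, which substitutes a C<?> when encoding to C<iso-8859-1>, and on the
C<FB_CROAK> and C<LEAVE_SRC> constants for the stricter variant.

    use strict;
    use warnings;
    use Encode qw(encode);

    my $text = "Price: \x{20AC}5";    # 9 characters, one is a euro sign

    my $utf8   = encode('UTF-8', $text);        # euro sign becomes 3 bytes
    my $latin1 = encode('iso-8859-1', $text);   # euro sign is substituted

    print length($utf8), "\n";    # prints 11
    print "$latin1\n";            # prints Price: ?5

    # Ask Encode to die instead of silently substituting.
    my $strict = eval {
        encode('iso-8859-1', $text,
            Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    print "cannot be represented in iso-8859-1\n" unless defined $strict;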

=head3 Decoding

B<Decoding> is the conversion from I<binary> to I<text>. To decode, you have to
know what encoding was used during the encoding phase. And most of all, it must
be something decodable. It doesn't make much sense to decode a PNG image into a
text string.
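
To see why the encoding has to be known, here is a sketch that decodes the
same bytes twice, and then tries to decode bytes that are not valid UTF-8 at
all. The sample data is ours; C<FB_CROAK> and C<LEAVE_SRC> are L<Encode>
constants that make C<decode> die on malformed input without modifying it.

    use strict;
    use warnings;
    use Encode qw(decode);

    my $bytes = "caf\xC3\xA9";    # 5 bytes: "caf" + e-acute in UTF-8

    my $right = decode('UTF-8', $bytes);         # 4 characters
    my $wrong = decode('iso-8859-1', $bytes);    # 5 characters, garbled

    # These bytes are "caf" + e-acute + "!" in iso-8859-1, not UTF-8.
    my $latin1_bytes = "caf\xE9!";
    my $text = eval {
        decode('UTF-8', $latin1_bytes,
            Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    print "input is not valid UTF-8\n" unless defined $text;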

=head3 Internal format

Perl has an B<internal format>, an encoding that it uses to encode text strings
so it can store them in memory. All text strings are in this internal format.
In fact, text strings are never in any other format!

You shouldn't worry about what this format is, because conversion is
automatically done when you decode or encode.

=head2 Your new toolkit

Add the following line to your standard heading:

    use Encode qw(encode decode);

Or, if you're lazy, just:

    use Encode;

=head2 I/O flow (the actual 5 minute tutorial)

The typical input/output flow of a program is:

    1. Receive and decode
    2. Process
    3. Encode and output

If your input is binary, and is supposed to remain binary, you shouldn't decode
it to a text string, of course. But in all other cases, you should decode it.

Decoding can't happen reliably if you don't know how the data was encoded. If
you get to choose, it's a good idea to standardize on UTF-8.

    my $foo   = decode('UTF-8', get 'http://example.com/');
    my $bar   = decode('ISO-8859-1', readline STDIN);
    my $xyzzy = decode('Windows-1251', $cgi->param('foo'));

Processing works just as it did before. The only difference is that you're now
using characters instead of bytes. That's very useful if you use things like
C<substr> or C<length>.
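
A brief sketch of the difference, with a made-up sample word: on the decoded
text string, C<substr> returns whole characters, while on the raw byte string
it can cut a character in half.

    use strict;
    use warnings;
    use Encode qw(decode);

    my $bytes = "na\xC3\xAFve";           # 6 bytes of UTF-8
    my $text  = decode('UTF-8', $bytes);  # 5 characters

    my $from_text  = substr $text,  0, 3;   # "na" + i-diaeresis, intact
    my $from_bytes = substr $bytes, 0, 3;   # "na\xC3": half a character

    printf "U+%04X\n", ord(substr $from_text, 2, 1);   # prints U+00EF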

It's important to realize that there are no bytes in a text string. Of course,
Perl has its internal encoding to store the string in memory, but ignore that.
If you have to do anything with the number of bytes, it's probably best to move
that part to step 3, just after you've encoded the string. Then you know
exactly how many bytes it will be in the destination string.

The syntax for encoding text strings to binary strings is as simple as decoding:

    $body = encode('UTF-8', $body);

If you needed to know the length of the string in bytes, now's the perfect time
for that. Because C<$body> is now a byte string, C<length> will report the
number of bytes, instead of the number of characters. The number of
characters is no longer known, because characters only exist in text strings.

    my $byte_count = length $body;

And if the protocol you're using supports a way of letting the recipient know
which character encoding you used, please help the receiving end by using that
feature! For example, E-mail and HTTP support MIME headers, so you can use the
C<Content-Type> header. They can also have C<Content-Length> to indicate the
number of I<bytes>, which is always a good idea to supply if the number is
known.

    "Content-Type: text/plain; charset=UTF-8",
    "Content-Length: $byte_count"
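
Putting the three steps together, a minimal sketch could look like this. The
C<build_response> function and its reply text are hypothetical, made up for
this example, and the input is assumed to be UTF-8.

    use strict;
    use warnings;
    use Encode qw(encode decode);

    # Hypothetical helper: takes request bytes, returns response bytes.
    sub build_response {
        my ($request_bytes) = @_;

        # 1. Receive and decode (assuming the sender used UTF-8).
        my $text = decode('UTF-8', $request_bytes);

        # 2. Process: everything here works on characters.
        my $reply = "You sent " . length($text) . " characters: $text";

        # 3. Encode and output; count bytes only after encoding.
        my $body       = encode('UTF-8', $reply);
        my $byte_count = length $body;

        return join "\r\n",
            "Content-Type: text/plain; charset=UTF-8",
            "Content-Length: $byte_count",
            "",
            $body;
    }

    my $response = build_response("\xE2\x82\xAC");   # a euro sign in UTF-8

Note that the byte count is taken from C<$body>, the encoded string, and not
from C<$reply>.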

=head1 SUMMARY

Decode everything you receive, encode everything you send out. (If it's text
data.)

=head1 Q and A (or FAQ)

After reading this document, you ought to read L<perlunifaq> too, then
L<perluniintro>.

=head1 ACKNOWLEDGEMENTS

Thanks to Johan Vromans from Squirrel Consultancy. His UTF-8 rants during the
Amsterdam Perl Mongers meetings got me interested and determined to find out
how to use character encodings in Perl in ways that don't break easily.

Thanks to Gerard Goossen from TTY. His presentation "UTF-8 in the wild" (Dutch
Perl Workshop 2006) inspired me to publish my thoughts and write this tutorial.

Thanks to the people who asked about this kind of stuff in several Perl IRC
channels, and have constantly reminded me that a simpler explanation was
needed.

Thanks to the people who reviewed this document for me, before it went public.
They are: Benjamin Smith, Jan-Pieter Cornet, Johan Vromans, Lukas Mai, Nathan
Gray.

=head1 AUTHOR

Juerd Waalboer <#####@juerd.nl>

=head1 SEE ALSO

L<perlunifaq>, L<perlunicode>, L<perluniintro>, L<Encode>

