Perl And Unicode

Tim Bray: On the Goodness of Unicode http://www.tbray.org/ongoing/When/200x/2003/04/06/Unicode

Tim Bray: Characters vs. Bytes http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF

Sun Developers article on Unicode support in Perl 5.6.1 http://developers.sun.com/dev/gadc/unicode/perl/perl561.html

Simon Cozens paper on Perl and Unicode http://www.netthink.co.uk/downloads/unicode.pdf

Perl, Unicode and i18N FAQ http://rf.net/~james/perli18n.html

Unicode organisation http://www.unicode.org/unicode/faq/

Unicode in XML and other Markup Languages http://www.w3.org/TR/2003/NOTE-unicode-xml-20030613/

MySQL Unicode Support http://www.mysql.com/doc/en/Charset-Unicode.html

Frank Tang's Iñtërnâtiônàlizætiøn Secrets http://people.netscape.com/ftang/i18n.html#detect

Character entity references in HTML 4 http://www.w3.org/TR/REC-html40/sgml/entities.html

Official names for character sets http://www.iana.org/assignments/character-sets

UTF-16/UCS-2 http://en.wikipedia.org/wiki/UTF-16

UTF-8 and Unicode FAQ for Unix/Linux http://www.cl.cam.ac.uk/~mgk25/unicode.html

Related perldocs: perlunicode, perllocale, utf8, perlre


Browser Support:
(from http://rf.net/~james/perli18n.html)

Q26. How can I i18N my web pages and CGI programs?

A26. To make your HTML render correctly, avoid font face tags in CJKV pages that name fonts users may not have installed.

When generating static pages, I would emit the correct META tag to help the browsers of foreign readers default to the correct character encoding. The META tag is helpful even when the server could send the correct header, because some browser versions do not understand charset names other than ISO-8859-1 in the header.
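As a minimal sketch of the advice above (in Python for brevity; the function name render_page and the sample text are made up for illustration), the page can carry a META tag that announces the same charset as the Content-Type header and the encoding actually used for the bytes:

```python
from html import escape
from typing import Tuple


def render_page(title: str, text: str, charset: str = "utf-8") -> Tuple[str, bytes]:
    """Return (Content-Type header value, page bytes) for one declared charset."""
    content_type = f"text/html; charset={charset}"
    page = (
        "<html><head>\n"
        f'<meta http-equiv="Content-Type" content="{content_type}">\n'
        f"<title>{escape(title)}</title>\n"
        "</head><body>\n"
        f"<p>{escape(text)}</p>\n"
        "</body></html>\n"
    )
    # Encode with the same charset that the header and META tag announce.
    return content_type, page.encode(charset)


header, body = render_page("Greeting", "Iñtërnâtiônàlizætiøn")
print("Content-Type:", header)
```

If the text cannot be represented in the chosen charset, the encode call raises UnicodeEncodeError, which is probably the behaviour you want: a loud failure rather than a page whose bytes disagree with its declaration.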

Forms may be constructed with hidden fields for the charset and the locale (language and country). IE5 and later will fill in a hidden field named _charset_ with the encoding used for the submission.
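On the receiving side, a hidden charset field could be used roughly as follows. This is only a sketch in Python: it assumes the field is named _charset_ (the name IE fills in), that the body is ordinary URL-encoded form data, and the helper name decode_form is invented for the example.

```python
import codecs
from urllib.parse import parse_qs, quote


def decode_form(body: str, charset_field: str = "_charset_", fallback: str = "utf-8") -> dict:
    """Decode URL-encoded form data using the charset the form itself declares."""
    # Pass 1: latin-1 maps every byte to exactly one character, so it never fails.
    # It is used only to read the (ASCII) charset field.
    raw = parse_qs(body, keep_blank_values=True, encoding="latin-1")
    declared = raw.get(charset_field, [fallback])[0] or fallback
    try:
        codecs.lookup(declared)
    except LookupError:
        declared = fallback  # unknown charset name: fall back rather than crash
    # Pass 2: decode every field with the declared charset.
    fields = parse_qs(body, keep_blank_values=True, encoding=declared, errors="replace")
    return {name: values[0] for name, values in fields.items()}


# Simulate a browser submitting a Shift-JIS form.
sample = "_charset_=shift_jis&name=" + quote("日本語".encode("shift_jis"))
print(decode_form(sample)["name"])
```

Only the first value of each field is kept here to keep the sketch short; a real handler would likely preserve repeated fields.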

The recommended character set encoding for Japanese web pages is Shift-JIS.

Q28. Can web servers automatically detect the language of the browser and display the correct localized page?

A28. Yes. HTTP/1.1 defines the details of how content negotiation works, including negotiation by language. Web browsers send an Accept-Language request header specifying which languages are preferred for responses. This technique works fairly well, although some versions of Netscape Navigator send an improperly formatted request parameter. Also, switching language preferences in either Navigator or IE 4 doesn't always "take" without first deleting a language hint.
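A rough illustration of server-side language selection from Accept-Language, sketched in Python. The function names, the set of available languages, and the choice to drop entries with a malformed weight are all assumptions of this example, not part of the FAQ; a production server would more likely rely on its own content-negotiation support.

```python
def parse_accept_language(header: str) -> list:
    """Return [(language tag, q)] sorted by preference, dropping q=0 entries."""
    prefs = []
    for position, item in enumerate(header.split(",")):
        parts = [p.strip() for p in item.split(";")]
        tag = parts[0].lower()
        if not tag:
            continue
        quality = 1.0  # a missing q parameter means full preference
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # malformed weight: treat as not acceptable
        prefs.append((tag, quality, position))
    # Highest q first; ties keep the order in which the browser listed them.
    prefs.sort(key=lambda p: (-p[1], p[2]))
    return [(tag, q) for tag, q, _ in prefs if q > 0]


def choose_language(header: str, available: set, default: str = "en") -> str:
    """Pick the best available language, falling back from 'en-gb' to 'en'."""
    for tag, _ in parse_accept_language(header):
        if tag == "*":
            return default
        if tag in available:
            return tag
        primary = tag.split("-")[0]
        if primary in available:
            return primary
    return default


print(choose_language("ja;q=0.9, fr, en-gb;q=0.5", {"en", "ja"}))
```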

(from http://rf.net/~james/br_detect.txt)

Here are some general guidelines you might like to consider:

o Have the UI layer pass a tag identifying the character encoding, unless the UI layer maps the data to one of the Unicode representations (UTF-8, UTF-16) before passing the data on.

o Have the UI layer pass a tag identifying the locale (language + region). You'll need this if your back end does any locale-sensitive operations such as sorting; it is independent of the encoding issue.

o Have every generated page include a META charset tag in the HTML header. This will ensure that browsers submit form POST data in the same encoding as the original HTML page. A missing tag may be the source of your original problem.
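To illustrate why the encoding tag matters, here is a small Python sketch. The TaggedInput class and its field names are hypothetical; the point is only that the same bytes decode to different text depending on the tag, so the back end cannot guess, and that the locale travels as a separate piece of information.

```python
from dataclasses import dataclass


@dataclass
class TaggedInput:
    data: bytes      # raw bytes as received from the UI layer
    encoding: str    # character encoding tag, e.g. "utf-8" or "iso-8859-1"
    locale: str      # language + region tag, e.g. "fr-FR"; independent of encoding


def to_unicode(item: TaggedInput) -> str:
    """Decode the payload using the encoding the UI layer declared."""
    return item.data.decode(item.encoding)


same_bytes = b"\xc3\xa9"
as_utf8 = to_unicode(TaggedInput(same_bytes, "utf-8", "fr-FR"))
as_latin1 = to_unicode(TaggedInput(same_bytes, "iso-8859-1", "fr-FR"))
print(repr(as_utf8), repr(as_latin1))
```

The locale field is deliberately not used during decoding; it would only come into play later, for operations such as sorting.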

Jim

> You could also examine the Accept-Language HTTP header, the highest priority value in which you can take and map to a likely charset by relying on your environment's Locale resource bundles.

> Some applications just outright put a select box in the form and rely on the user to pick the language they're using.

> I think the only way to do it right is to come up with some fixed strings in various language scripts that you can pass as hidden parameters, examine the bytes that come through, and look them up in a custom mapping table that will deduce the charset based on the byte sequences received.
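The fixed-string idea in the last quote might be sketched like this in Python. The sentinel text, the candidate list, and the function names are choices made for this example; the technique only works for charsets that can represent the sentinel and that encode it to distinct byte sequences.

```python
from urllib.parse import quote, unquote_to_bytes

SENTINEL = "日本語"  # fixed string placed in a hidden form field (example choice)
CANDIDATES = ["utf-8", "shift_jis", "euc_jp", "iso2022_jp"]  # example candidates


def build_table(sentinel: str, candidates: list) -> dict:
    """Map each distinct byte sequence of the sentinel to the charset producing it."""
    table = {}
    for name in candidates:
        try:
            encoded = sentinel.encode(name)
        except UnicodeEncodeError:
            continue  # this charset cannot represent the sentinel at all
        table.setdefault(encoded, name)  # first candidate wins on identical bytes
    return table


TABLE = build_table(SENTINEL, CANDIDATES)


def detect_charset(received: bytes):
    """Return the charset whose encoding of the sentinel matches, or None."""
    return TABLE.get(received)


# Simulate the percent-encoded hidden field arriving from a Shift-JIS page.
wire_value = quote(SENTINEL.encode("shift_jis"))
print(detect_charset(unquote_to_bytes(wire_value)))
```

A None result means the browser sent bytes that match no candidate, in which case falling back to a select box or to Accept-Language, as suggested in the other quotes, seems reasonable.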

