Thursday, June 17, 2010

Character Encoding and Commerce

Character Encoding 101:

The "charset" parameter identifies a character encoding, which is a method of converting a sequence of bytes into a sequence of characters. This conversion fits naturally with the scheme of Web activity: servers send HTML documents to user agents as a stream of bytes; user agents interpret them as a sequence of characters. The conversion method can range from simple one-to-one correspondence to complex switching schemes or algorithms...

Reference: Section 5.2, "Character encodings", in the "HTML Document Representation" chapter of the W3C HTML 4.01 Recommendation

The character encoding tells the browser and the validator which encoding to use when converting the bytes of a document into characters.
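
To make the byte-to-character conversion concrete, here is a minimal Java sketch. The class and method names are made up for illustration. It decodes the same two bytes once as UTF-8 and once as ISO-8859-1, and the result differs because the encodings map bytes to characters differently.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any product API.
class DecodeDemo {
    // Interpret the bytes as UTF-8.
    static String asUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Interpret the same bytes as ISO-8859-1 (one byte per character).
    static String asLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // 0xC3 0xA9 is the UTF-8 byte sequence for U+00E9 (e with acute accent).
        byte[] bytes = {(byte) 0xC3, (byte) 0xA9};

        System.out.println(asUtf8(bytes).length());   // 1 character: U+00E9
        System.out.println(asLatin1(bytes).length()); // 2 characters: U+00C3 U+00A9
    }
}
```

This is the usual cause of garbled accented characters on a page: the bytes were written in one encoding and read in another.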

Types: there are several character encoding standards.

Traditional ASCII is a 7-bit encoding, so it can represent only 128 characters. Other encodings that you will commonly see declared:
encoding="ISO-8859-1"
encoding="windows-1252"
encoding="utf-8"

Even though UTF-8 is one of the most popular encodings, some older browsers might not support it, which is why we used ISO-8859-1 in some pages.

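windows-1252 :
Windows-1252 is a character encoding of the Latin alphabet, used by default in the legacy components of Microsoft Windows in English and some other Western languages.
ISO-8859-1 :
ISO-8859-1 (Latin-1) is a single-byte encoding of the Latin alphabet that covers most Western European languages. Windows-1252 matches it except in the byte range 0x80-0x9F, where ISO-8859-1 has control characters and Windows-1252 assigns printable characters, such as the euro sign, to most positions.
UTF-8, UTF-16, UTF-32 :
Unicode is an international character set standard which supports all of the major scripts of the world, as well as common technical symbols.
UTF stands for Unicode Transformation Format. Every Unicode code point fits in 21 bits (the maximum is U+10FFFF), and each UTF type stores it as one or more code units of 8, 16 or 32 bits.
e.g. UTF-8 stores a character in 1 to 4 bytes, UTF-16 in 2 or 4 bytes, and UTF-32 always in 4 bytes.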
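
The variable width of the UTF encodings can be checked with a short Java sketch. The class name and the sample characters are my own choice for illustration. It prints how many bytes each sample needs in UTF-8 and in UTF-16.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any product API.
class UtfLengthDemo {
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static int utf16Length(String s) {
        // UTF_16BE is used so that no byte order mark is added to the output.
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "A",            // U+0041, plain ASCII
            "\u00E9",       // U+00E9, e with acute accent
            "\u20AC",       // U+20AC, euro sign
            "\uD83D\uDE00"  // U+1F600, outside the Basic Multilingual Plane
        };
        for (String s : samples) {
            System.out.printf("U+%04X: UTF-8 = %d bytes, UTF-16 = %d bytes%n",
                    s.codePointAt(0), utf8Length(s), utf16Length(s));
        }
    }
}
```

The four samples take 1, 2, 3 and 4 bytes in UTF-8, and 2, 2, 2 and 4 bytes in UTF-16.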


Developers:

Java:
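In the Java programming language, char values represent Unicode characters as 16-bit UTF-16 code units. Unicode itself is not a 16-bit encoding: it supports the world's major languages with code points up to U+10FFFF, and a character above U+FFFF takes two char values (a surrogate pair) in Java.
Internal string encoding: UTF-16 (Java).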
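
A small sketch of what UTF-16 means for Java strings, using made-up class and method names: String.length() counts 16-bit char values, which is not always the number of characters a user sees.

```java
// Hypothetical demo class, not part of any product API.
class CharDemo {
    // Number of 16-bit UTF-16 code units (what String.length() reports).
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points (actual characters).
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String accented = "\u00E9";            // U+00E9, fits in one char
        String supplementary = "\uD83D\uDE00"; // U+1F600, needs a surrogate pair

        System.out.println(codeUnits(accented) + " / " + codePoints(accented));           // 1 / 1
        System.out.println(codeUnits(supplementary) + " / " + codePoints(supplementary)); // 2 / 1
        System.out.println(Character.isHighSurrogate(supplementary.charAt(0)));           // true
    }
}
```

Code that truncates or validates user input by length should keep this difference in mind.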

In WCS:

The LANGUAGE table has an ENCODING column that specifies the encoding for each language,
e.g.

select encoding from language where language_id=-1;

Front end in JSP:
Character encoding

We set the character encoding to UTF-8 as this is what is present in the DB for language encoding.

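<%
// Declare the MIME type and the encoding before any output is written;
// calls made after the response is committed have no effect.
response.setContentType("text/xml");
response.setCharacterEncoding("UTF-8");
// Make caches treat the response as immediately stale.
response.setHeader("Cache-Control","max-age=0");
%>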


Form Submit: In Commerce, the encoding specified in the front end should match the value in the ENCODING column of the language table.

select encoding from language where language_id = <language ID passed in the command context>;
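
Why the two sides have to agree can be shown with a plain Java sketch that needs no server. The class and the roundTrip helper are hypothetical. It URL-encodes a form value the way a browser would for a page in one encoding, then decodes it with the encoding the server assumes.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical demo class, not part of WebSphere Commerce.
class FormEncodingDemo {
    // Encode a value as a form submission in pageEncoding,
    // then decode it as a server that assumes serverEncoding.
    static String roundTrip(String value, String pageEncoding, String serverEncoding) {
        try {
            String submitted = URLEncoder.encode(value, pageEncoding);
            return URLDecoder.decode(submitted, serverEncoding);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown encoding name", e);
        }
    }

    public static void main(String[] args) {
        String value = "\u00E9"; // e with acute accent

        // Matching encodings: the value survives unchanged.
        System.out.println(roundTrip(value, "UTF-8", "UTF-8").equals(value)); // true

        // Mismatch: the two UTF-8 bytes are read as two ISO-8859-1 characters.
        System.out.println(roundTrip(value, "UTF-8", "ISO-8859-1").length()); // 2
    }
}
```

In a servlet container, the decoding side corresponds to the request character encoding, which can be set with request.setCharacterEncoding(...) before any parameter is read.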

Database:
e.g. If you need the client character encoding to be Windows-1252, you can set NLS_LANG in the command-line session before running the SQL client and executing SQL statements.

set NLS_LANG=AMERICAN_AMERICA.WE8MSWIN1252

Database Client:

For the database client I use today to show the correct character set, we modified the following registry value.

HKEY_LOCAL_MACHINE\SOFTWARE\ORACLE\Key_OraClient...
Change “NLS_LANG” to AMERICAN_AMERICA.UTF8




Good reference URLs:
http://java.sun.com/javase/technologies/core/basic/intl/faq.jsp
http://java.sun.com/docs/books/tutorial/i18n/text/convertintro.html
http://tlt.its.psu.edu/suggestions/international/web/codehtml.html
