Chinese Text - This Is Geeky But Really Really Important

Do you display Chinese on your website? If so, you need to read and understand this.

Are Your Users Able To Read Chinese? Fonts?

So you're carrying content in Chinese, and users should be able to view it. For some of your users, Chinese may show up as '??' or ˹, or some other gibberish. If so, this is because their computers cannot correctly display the font (or the encoding) you use. The main tip here: don't specify Chinese fonts, as the user's browser probably doesn't have many available. If you specify one that they don't have, their browser may very well not be able to display anything other than "˹˹˹˹˹˹˹˹". Maybe SimSun looks cool, but how many of your readers have taken the time to download many different Chinese fonts?

Many sites in China, focussed on English speakers, may have their content in English but provide handy addresses/namecards in Chinese. In many cases, viewers of such websites may have a low Chinese ability, a low IT ability, and not even a single Chinese font installed. Their computers are incapable of displaying any Chinese text at all! If contact info is intended for someone who may not be interested in installing Chinese fonts on their computer (and for English-language 'Informational'/'Community Sites' these people may be numerous), stick all of the Chinese into a picture, so it no longer depends on fonts and is readable on any computer. This is important.

Encoding Options

Do you know what UTF-8, gb2312 or BIG-5 are? In short: all text has an encoding; English typically uses ASCII, which represents the English alphabet, digits and a few other things like punctuation marks and funky squiggles. ASCII cannot represent Chinese. Note that this refers to the encoding of a webpage's output; some software can perform internal transformations, changing an input encoding to a different output encoding.

  • BIG-5 is an encoding system for Traditional Chinese characters as used in Taiwan and Hong Kong. It actually encodes characters in a partial form but that is a subject in itself. Check the bottom of this post for links explaining this.
  • gb2312 is an encoding system for Simplified Chinese characters, used in mainland China and perhaps Singapore. It has actually been superseded but remains in common use. There are other GB encodings (such as GBK and GB18030), but they retain this as a subset, and the GB encodings retain ASCII as a subset too.
  • UTF-8 stands for Unicode Transformation Format (8-bit) and is a way for characters of all languages to be transmitted/encoded in the same format (if a browser only supports ASCII, it is unable to display anything other than English, but with UTF-8 it can display almost all languages).
  • UTF-16 exists, but I will not cover it other than to say it is more compact than UTF-8 for Chinese content (typically two bytes per character rather than three) but less compact for English-language content. It is rare to find a website in UTF-16 anywhere in the world.
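
To make the size trade-off between these encodings concrete, here is a minimal Python sketch. It assumes Python 3 and its built-in codecs; the helper name encoded_sizes is made up for illustration, and "utf-16-le" is used so that no byte-order mark is counted.

```python
def encoded_sizes(text, codecs=("gb2312", "big5", "utf-8", "utf-16-le")):
    """Return {codec name: number of bytes needed to encode text}.

    encoded_sizes is a hypothetical helper for illustration only.
    "utf-16-le" is chosen so the byte-order mark is not included.
    """
    return {codec: len(text.encode(codec)) for codec in codecs}


chinese = encoded_sizes("中文")   # two Chinese characters
english = encoded_sizes("China")  # five ASCII letters

# Print only the numbers, so this runs even on an ASCII-only console.
print(chinese)
print(english)
```

For the two-character sample, the GB, Big-5 and UTF-16 forms take 4 bytes and UTF-8 takes 6; for the English sample, UTF-16 takes 10 bytes and the others take 5.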

There are a load of other encoding options too. So many. But these are the main ones when talking about China.

Today there's no hard-and-fast rule tying one encoding to traditional vs. simplified display; it's just how the system developed in the absence of a unifying standard. Converting from traditional to simplified, or particularly from simplified to traditional, can be tricky, as a single simplified character may map to multiple traditional characters.
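
The one-to-many problem can be sketched in a few lines of Python. The table below is a tiny hand-picked sample, not a real conversion table, and the names S2T_CANDIDATES, traditional_candidates and ambiguous_chars are invented for this illustration. A real converter needs a full dictionary plus word-level context.

```python
# Hand-picked sample only: each simplified character here has more than
# one possible traditional form, depending on the word it appears in.
S2T_CANDIDATES = {
    "发": ["發", "髮"],  # "to send out" vs. "hair"
    "后": ["後", "后"],  # "after/behind" vs. "queen"
    "面": ["面", "麵"],  # "face/surface" vs. "noodles"
}


def traditional_candidates(ch):
    """Return the possible traditional forms for one simplified character.

    Characters missing from the sample table are returned unchanged.
    """
    return S2T_CANDIDATES.get(ch, [ch])


def ambiguous_chars(text):
    """List the characters in text that cannot be converted blindly."""
    return [ch for ch in text if len(traditional_candidates(ch)) > 1]
```

Anything returned by ambiguous_chars needs a human or a context-aware tool to pick the right traditional form.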

Practicality

So... with all of these encoding options, which to choose?

The simple answer would be Unicode, specifically UTF-8 (but MAKE SURE your users have the fonts installed, or otherwise use pictures to display Chinese words).

The more complex answer is... more complex. This site aggregates things on Chinalyst.net. Chinalyst seems not to support Chinese through its feeds. In a case like this, there really is nothing one can do; Chinese gets output as '??'. If another site quotes you and they're ASCII-based, they cannot display your Chinese content, no matter what the encoding; if a site with one encoding directly quotes a site with another encoding, things get confusing. Or consider a mobile phone user in China: does their mobile phone/GPRS/mapping/whatever support UTF-8? It may do, but if it doesn't, you are equally screwed. If you're developing an application for the Chinese domestic market, there seems little harm in developing it in GB form, precisely because of these mobile phone/PDA issues. Weighed against this, GB has the disadvantage of not being able to display other languages that UTF-8 supports (Arabic, for example); a multilingual site would be administratively much simpler if using UTF-8.
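
Both failure modes described here can be reproduced in a few lines. This is only a sketch in Python 3; the function names are made up, and a real aggregator or phone may mangle text differently.

```python
def as_seen_by_ascii_only(text):
    """Simulate a system that can only carry ASCII.

    Every character outside ASCII is replaced by '?'.
    """
    return text.encode("ascii", errors="replace").decode("ascii")


def as_seen_with_wrong_label(text, actual="gb2312", assumed="utf-8"):
    """Simulate bytes written in one encoding and read as another.

    Undecodable bytes become U+FFFD, the Unicode replacement character.
    """
    return text.encode(actual).decode(assumed, errors="replace")


question_marks = as_seen_by_ascii_only("中文")   # two '?' characters
garbled = as_seen_with_wrong_label("中文")       # replacement characters
untouched = as_seen_with_wrong_label("Beijing")  # plain ASCII survives
```

Plain English survives both trips because ASCII is a subset of both GB2312 and UTF-8, which is why this kind of problem often goes unnoticed until the first Chinese character shows up.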

Ideally, these problems would have been thought of ahead of time. They were not. It seems best to use UTF-8 for a general purpose global site that features Chinese, but if you're creating a website that focuses on the mainland China domestic market with content in Chinese, then GB may be the best choice.
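
One rough way to check whether GB would even be an option for a given site is to test whether the content fits in it. The sketch below assumes Python 3; can_encode and pick_encoding are hypothetical helpers, and the rule "GB if everything fits, otherwise UTF-8" is just this post's rule of thumb written as code.

```python
def can_encode(text, codec):
    """Return True if every character in text exists in the given codec."""
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


def pick_encoding(samples, preferred="gb2312", fallback="utf-8"):
    """Return preferred if all sample strings fit in it, else fallback.

    Hypothetical helper: it only looks at the text, not at what the
    audience's devices actually support.
    """
    if all(can_encode(sample, preferred) for sample in samples):
        return preferred
    return fallback


domestic = pick_encoding(["你好", "Hello"])      # Chinese plus English
multilingual = pick_encoding(["你好", "مرحبا"])  # Chinese plus Arabic
```

The first call returns "gb2312" and the second returns "utf-8", because GB2312 has no Arabic letters.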

'Issues'

Find it tragically funny when things go wrong? Check out John's upgrade post. The principle is... make sure any upgrade being performed actually does what one expects. In this case MySQL was somewhat less than truthful about what it was doing (4.0xx was not fully UTF-8 enabled), and the connection between his WordPress and MySQL installations was, ironically, 'lost in translation'. The principle is sound: if you want to implement any standard, make sure it's implemented not only on the output page, but at every point in the process.

Links For The Geeky
Big-5
GB 2312
Unicode and UTF-8

MySql

Hmm, not quite the whole story.

In John's case the problem was that his application was putting the data in without specifying the correct encoding.
If you put data in a different character set encoding into MySQL, it would accept it.

The problem was that the default encoding in 4.xx and 3.xx was latin1 (latin1_swedish_ci).
4.1 got a little bit stricter about these things, and *badly* coded stuff got caught.

It also affected Arabic and other charsets which didn't fit into latin code sets.

The moral of that story wasn't MySQL was being less than truthful - the moral was do it right in the first place....

For what it's worth, you could fix it quite easily by creating a table with the correct encoding type and reimporting the data to that table (or changing the character set from latin1 to gb2312), OR using SET NAMES gb2312...
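
The idea behind that fix can be shown outside the database. This is a Python 3 sketch of the concept only, not MySQL commands: it assumes the bytes really were GB2312 and were merely labelled as Latin-1, and it uses Python's "latin-1" codec as a stand-in for MySQL's latin1 (which differs for a few byte values). The function name repair_mislabelled is made up.

```python
def repair_mislabelled(garbled, stored_as="latin-1", really="gb2312"):
    """Undo a wrong label: recover the raw bytes, then decode them properly.

    Only works if the bytes were stored untouched and just read back
    with the wrong character set.
    """
    raw_bytes = garbled.encode(stored_as)
    return raw_bytes.decode(really)


# What the application sent (GB2312 bytes) as read back through Latin-1.
garbled = "中文".encode("gb2312").decode("latin-1")

restored = repair_mislabelled(garbled)
```

Here garbled comes out as the four Latin letters "ÖÐÎÄ", and restored is the original two Chinese characters.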

More info here - http://doc.51windows.net/mysql/?url=/MySQL/ch10s10.html

MySQL 4 -> 5 also has a few little gotchas like that.

Some good reading here for those going to UTF8 http://nicknettleton.com/zine/php/php-utf-8-cheatsheet

My recommendation - stick with UTF-8. It's an easier migration path for the future, and it displays correctly on pretty much everything.
But do make sure that the text *is* UTF-8, and not GB2312 or BIG5 masquerading as...
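
A quick sanity check for this, sketched in Python 3 (the function name is_valid_utf8 is made up): try a strict UTF-8 decode and see whether it fails. Chinese text in GB2312 or Big-5 will usually fail. Pure ASCII always passes, since ASCII is valid UTF-8, so the check only tells you something once real Chinese is in the data.

```python
def is_valid_utf8(data):
    """Return True if the bytes decode cleanly as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


real_utf8 = is_valid_utf8("中文".encode("utf-8"))
gb_in_disguise = is_valid_utf8("中文".encode("gb2312"))
big5_in_disguise = is_valid_utf8("中文".encode("big5"))
```

Only the first result is True for this sample. Passing the check does not prove the text was meant to be UTF-8, but failing it is a clear sign that something else is masquerading.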

Also, make sure you know what you are doing, or know someone who does...

(This is my second attempt at a reply. Drupal ate the first one. God I hate Drupal... Never developing anything in it ever again now I discovered MODx...)

Lawrence / http://www.computersolutions.cn / http://www.shanghaiguide.com

Re: MySQL

Thanks for the clarification on John's situation. He always makes interesting reading, from tech to fast food. I understood that MySQL 4.0xx initially advertised itself as being UTF-8 compatible, but it wasn't fully so. Damn those Swedes. It can be tough learning the ins-and-outs of running websites when one 'falls into' it. That's what this website is about - to fill in the blanks about those specialist nitty-gritty topics. While one may be an expert on networking, SEO may be an alien world, likewise for someone that knows about SEO but nothing about hosting options. I cannot say I'm an expert in any area, but I hope to document and learn.

Sorry for the email problem - what happened? The site froze, or just no email arrived? Drupal really is a love-it-or-hate-it CMS, but all should be transparent for the user. I love it, though not as much as I love Zope/Plone; finding an ISP that will host that is possible but less dirt-cheap (plus some other shared hosting issues). I'm using it on this as a result of developing another site on it (and running away from Joomla as fast as I can).

Drupal ...

Take a look at www.modxcms.com

It's way better than Drupal (and I've developed a fair number of websites in Drupal). It's a branch of Etomite.
We pretty much use it now for most of the sites we develop for clients.

I always found myself fighting with Drupal. With MODx there is a learning curve, but it's got an elegant and well-thought-out design.

If you need Plone / Zope hosting, let us know, we can hook a box up with the relevant software. I've played around with it a little bit, but no more than a 2-3 hour look.

We currently have a tomcat / jsp box, a php, python, perl and ruby box, as well as a couple of others.
Next up will be setting up Xen (although that means a kernel upgrade, so I have to go to the datacentre and play for a little bit).

As to why I lost the post - The caching is the issue on Drupal. A page back / forth loses the text.
My laptop has separate keys for page up/down just under the shift key, and it's easy to hit them, although annoying when you lose what you've spent 10 minutes writing...

Cheers,

http://www.computersolutions.cn / http://www.shanghaiguide.com
