The search engine now supports the latest Unicode 5.2

Jan 29, 2010 09:55 GMT  ·  By

Character encoding on the web is probably not the most exciting topic out there but, whether you care about it or not, it makes a big difference in the way most people use the web. Historically, web pages have used a variety of character encodings, and many still do. That works well for English and most Latin-alphabet languages, but most of the world doesn't use the Latin alphabet. This is where Unicode comes in: the encoding standard aims to support every character in every script out there and has done a pretty good job of it so far. Unicode is now used by almost 50 percent of web pages, according to Google, which is also showcasing what it's doing to extend its support for the standard.

Based on data gathered from all the pages it indexes, Google says that Unicode is used on about 46 percent of all web pages, far more than any other encoding standard. This is a big jump from just a year and a half ago, when Unicode had just become the most used standard after passing a 25 percent share.

Having web pages written in any language in the world is great, but it's not that useful if a search engine can't 'read' them. Google is working on this front, though, and has recently updated its search engine with support for the latest version, Unicode 5.2. The latest revision adds over 6,600 new characters, bringing the total to a little over 107,000, Egyptian hieroglyphs among them. How cool is that?! If you're a geek, anyway.

On the more practical side, Google explains what it's doing to extend its support for the standard. "[T]he characters 'fi' can either be represented as two characters ('f' and 'i'), or a special display form 'ﬁ'. A Google search for [financials] or [office] used to not see these as equivalent; to the software they would just look like *nancials and of*ce. There are thousands of characters like this," Mark Davis, senior international software architect at Google, writes. "[A]fter extensive testing, we just recently turned on support for these and thousands of other characters; your searches will now also find these documents."
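To make the "fi" example concrete, here is a minimal sketch of how a single-character ligature can be folded into its two-letter equivalent before comparison. It uses Unicode compatibility normalization (NFKC) from Python's standard library. Google has not said how its own pipeline works, so the choice of NFKC and the helper name `fold_for_search` are assumptions made purely for illustration.

```python
import unicodedata


def fold_for_search(text: str) -> str:
    """Hypothetical helper (not Google's code): map compatibility
    characters such as ligatures to their plain equivalents, then
    case-fold so comparisons ignore letter case."""
    return unicodedata.normalize("NFKC", text).casefold()


# A word as it might appear in a generated document: it starts with
# U+FB01 LATIN SMALL LIGATURE FI instead of the letters "f" and "i".
doc_word = "\ufb01nancials"
query = "financials"

# Compared code point by code point, the two strings differ.
print(doc_word == query)

# After folding, both sides are the same sequence of plain letters.
print(fold_for_search(doc_word) == fold_for_search(query))
```

Without the folding step, the ligature is just one unfamiliar code point to the software, which is why the words would look like "*nancials" and "of*ce" to a naive matcher.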
