Non Unicode ISCII Text Can be Converted to Unicode Now!

Posted by Subhashish Panigrahi at Dec 19, 2012 06:50 AM
Odia Wikipedian Manoj Sahukar has designed a new tool that converts non-Unicode ISCII text to Odia Unicode text. Most of the digitized text and web content of newspapers and books is in non-Unicode text, which can now be reused on Wikipedia and other Odia Wiki projects. This opens a new avenue for digitized, freely licensed books in the Odia language.

Screenshot of OR-TTsarala2 UNICODE CONVERTER, Non Unicode to Unicode font converter for Odia language

Akruti Sarala is a well-known name in Odisha. Nearly every DTP operator who can type uses this font. Thousands of books have been typeset in it, and even today many mainstream newspapers and magazines use it. Sadly, few realized that the content they were creating would be useless for sharing and reuse, especially on the internet, because the internet relies on a universal standard called "Unicode" for all languages. When a book is meant only for print, a non-Unicode font causes no problem. But when one copies text from an e-book created with a non-Unicode font (e.g. Sarala) and pastes it elsewhere, strange characters are displayed instead of Odia characters. The same problem affects all other non-Latin languages.

Why does this happen?

Non-Unicode fonts use a peculiar trick. The English/Roman glyphs are removed from the font and glyphs of an Indian (or any other non-Latin) language are put in their place. So when you press a key on your keyboard, the corresponding Indian language character is displayed instead of the English character. The trick works as long as that particular non-Unicode font is installed. But imagine that the font is not on your computer: the text falls back to a default font and shows the underlying English characters.
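The substitution trick can be sketched in a few lines of Python. This is only an illustration: the table LEGACY_FONT_GLYPHS and the function render are hypothetical, and the key-to-glyph pairs are invented, not the real Akruti Sarala layout.

```python
# Hypothetical legacy-font table: the Odia glyph the font draws for each Latin key.
# These pairs are invented for illustration; they are NOT the real Akruti Sarala layout.
LEGACY_FONT_GLYPHS = {
    "K": "\u0b15",  # drawn as Odia KA
    "M": "\u0b2e",  # drawn as Odia MA
    "L": "\u0b33",  # drawn as Odia LLA
}


def render(stored_text, font_installed):
    """Return what a reader would see on screen for the stored characters."""
    if font_installed:
        # The legacy font paints an Odia glyph over each Latin code point.
        return "".join(LEGACY_FONT_GLYPHS.get(ch, ch) for ch in stored_text)
    # Without the font, a default font shows the Latin characters as they are stored.
    return stored_text


stored = "KML"                               # what the file actually contains
print(render(stored, font_installed=True))   # କମଳ
print(render(stored, font_installed=False))  # KML
print([hex(ord(ch)) for ch in stored])       # ['0x4b', '0x4d', '0x4c']
```

The last line shows the root of the problem: the stored code points are plain Latin letters, so a search engine or a copy-paste operation never sees any Odia text.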

How do Unicode fonts work?

Unicode fonts contain Indian language characters alongside the English characters, so no character or glyph is displaced. Unicode is a global standard maintained by The Unicode Consortium for all languages. Text typed in an Indian language displays the same characters on Ubuntu, Windows or Mac operating systems. As most operating systems ship with Unicode fonts, there is usually no need to install them separately.
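To see what "no displacement" means in practice, the short Python sketch below lists the code points of an Odia word using the standard unicodedata module. The helper name describe is hypothetical. Note that the official character names use the older spelling "ORIYA".

```python
import unicodedata

# The word "Odia" written in Odia script, given as explicit code points.
WORD = "\u0b13\u0b21\u0b3c\u0b3f\u0b06"


def describe(text):
    """Return (code point, official Unicode name) for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


for code_point, name in describe(WORD):
    print(code_point, name)
# The first line printed is: U+0B13 ORIYA LETTER O
```

Every character falls in the block U+0B00 to U+0B7F reserved for Odia, separate from the Latin letters, which is why the text stays readable on any system with a Unicode font.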

How did it get started?

Manoj Sahukar, a third-year mechanical engineering student who is very enthusiastic about Odia science articles, found Odia Wikipedia and realized that its volunteers were working hard to retype content that already existed in digital form, work that could be simplified. Many science articles he wanted to read on Wikipedia were not there. He then recognized the gap left by the lack of Unicode content. This is also one more reason Google doesn't have a button for Odia, unlike some of the other Indian languages. "Nothing could be such open and great platform like Odia Wikipedia if one is searching content in Odia language. My tool is dedicated to the Odia Wikipedians who have been working hard for my language", expressed Manoj in the release note. He shared his interest and ideas with Odia Wikipedian Jnanaranjan Sahu and started building a tool that could convert the available science articles of Bigyana Diganta (a science magazine published in Odia by Orissa Bigyana Academy) into Unicode. Many of these articles could be used for reference, and some of the free content could go to WikiSource or WikiBooks. He finally released his tool on the internet on 12.12.12, the last repeating date of this century. It is still in beta, and Manoj is working on making it more user friendly. He is also keen on organizing technical events that will bring more individuals to create such open source tools. "My next target is developing OCR (Optical character recognition) software in Odia", says an excited Manoj.

What does this tool do?

This tool can be used to convert text typed in non-Unicode ISCII fonts to Odia Unicode text. The detailed procedure for using the tool and for Unicode conversion is explained in a tutorial on Odia Wikipedia. The tool is released under the GFDL license and is available for download on SourceForge.
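A converter of this kind can be sketched roughly as follows. This is not the tool's actual code: the table LEGACY_TO_UNICODE, the function convert and every mapping pair are hypothetical. The sketch assumes the legacy font stores a pre-base vowel sign before its consonant (visual order), as many legacy Indic fonts do, whereas Unicode stores it after the consonant.

```python
# Hypothetical legacy-to-Unicode table (invented pairs, not the tool's real mapping).
LEGACY_TO_UNICODE = {
    "K": "\u0b15",   # ORIYA LETTER KA
    "M": "\u0b2e",   # ORIYA LETTER MA
    "L": "\u0b33",   # ORIYA LETTER LLA
    "Kh": "\u0b16",  # ORIYA LETTER KHA, a two-character legacy sequence
    "e": "\u0b47",   # ORIYA VOWEL SIGN E, assumed to be typed before its consonant
}
PRE_BASE_SIGNS = {"\u0b47"}


def is_consonant(ch):
    """True for the main Odia consonant range, KA (U+0B15) to HA (U+0B39)."""
    return "\u0b15" <= ch <= "\u0b39"


def convert(legacy_text):
    """Convert legacy-font text to Unicode using longest-match replacement."""
    keys = sorted(LEGACY_TO_UNICODE, key=len, reverse=True)  # longest keys first
    out = []
    i = 0
    while i < len(legacy_text):
        for key in keys:
            if legacy_text.startswith(key, i):
                out.append(LEGACY_TO_UNICODE[key])
                i += len(key)
                break
        else:
            out.append(legacy_text[i])  # unmapped characters pass through unchanged
            i += 1
    # Move each pre-base vowel sign after the consonant that follows it.
    j = 0
    while j < len(out) - 1:
        if out[j] in PRE_BASE_SIGNS and is_consonant(out[j + 1]):
            out[j], out[j + 1] = out[j + 1], out[j]
            j += 2
        else:
            j += 1
    return "".join(out)


print(convert("KML"))  # କମଳ
print(convert("eK"))   # କେ
```

Matching the longest key first keeps multi-character sequences such as "Kh" from being split into two separate letters. A real converter would need a complete table for the specific font, plus handling for conjuncts and other reordering cases.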

Non Unicode text being copied from a PDF

Unicode font after conversion


Quick links:

Manoj Sahukar talks about his ideas on the usability of the Odia Unicode Converter tool
