How the Odia Wikimedia community is enriching Wikipedia with character encoding technology
A user typing Odia text on his computer during an Odia Wikipedia workshop in Bhubaneswar, Odisha (by Subhashish Panigrahi, CC BY 4.0)
This post was originally published on the Wikimedia Blog.
A character encoding maps each character in a collection to a code, and is used in computation, data storage, and transmission of textual data. Before the advent of Unicode, fonts for different scripts used several different, mutually incompatible encoding systems.
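To make the idea concrete, here is a minimal Python sketch showing how Unicode assigns one code point to each Odia character, and how those code points become bytes under two common encoding forms. The sample word is chosen only for illustration and the variable names are ours, not taken from any of the converters discussed in this post.

```python
# The word "Odia" in Odia script, spelled out as Unicode escapes so the
# code points are explicit: O, DDA, NUKTA, VOWEL SIGN I, AA.
text = "\u0b13\u0b21\u0b3c\u0b3f\u0b06"

# Every character has exactly one Unicode code point.
code_points = [f"U+{ord(c):04X}" for c in text]

# The same code points can be serialized to bytes in different ways.
utf8 = text.encode("utf-8")       # 3 bytes per character in this range
utf16 = text.encode("utf-16-le")  # 2 bytes per character in this range

print(code_points)
print(utf8.hex(" "))
print(len(utf8), len(utf16))

# Decoding with the matching encoding recovers the original text.
assert utf8.decode("utf-8") == text
```

Because every Unicode-aware program agrees on these code points, the text stays readable and searchable whichever Unicode font is installed.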
However, most media outlets, as well as the state government, are still using old encoding systems for Odia. Reading such documents requires installing a particular font that uses the same encoding system. Unicode makes this much easier, as most modern computers come with Unicode fonts preinstalled.
A character encoding converter is generally used to convert from one encoding system to another. Massive amounts of content, not archived on a regular basis, could now be converted to Unicode and, in turn, provide Wikipedia editors with easily accessible sources to create new articles and enhance existing ones.
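As a rough illustration of what such a converter does, the sketch below maps legacy-font characters to Unicode through a lookup table and moves a left-side vowel sign, which legacy text typically stores before its consonant, to after it as Unicode expects. The table is hypothetical and does not reproduce any real Odia font's layout, and the function name legacy_to_unicode is our own; the actual converters described in this post may work differently.

```python
# Hypothetical mapping for illustration only; NOT a real font's layout.
LEGACY_TO_UNICODE = {
    "K": "\u0b15",  # ODIA LETTER KA
    "m": "\u0b2e",  # ODIA LETTER MA
    "A": "\u0b3e",  # ODIA VOWEL SIGN AA (written after the consonant)
    "`": "\u0b47",  # ODIA VOWEL SIGN E (drawn to the left of the consonant)
}

# Signs that legacy text stores before the consonant they belong to.
PREBASE_SIGNS = {"\u0b47"}


def legacy_to_unicode(text, table=LEGACY_TO_UNICODE):
    """Convert legacy-encoded text to Unicode using a lookup table.

    A pre-base vowel sign is held back and emitted after the next
    character, since Unicode stores it after the consonant.
    Characters missing from the table are passed through unchanged.
    """
    out = []
    pending = None  # pre-base sign waiting for the following character
    for ch in text:
        mapped = table.get(ch, ch)
        if mapped in PREBASE_SIGNS:
            if pending is not None:
                out.append(pending)  # flush an earlier unattached sign
            pending = mapped
            continue
        out.append(mapped)
        if pending is not None:
            out.append(pending)
            pending = None
    if pending is not None:
        out.append(pending)  # sign at end of input with nothing to follow
    return "".join(out)


print(legacy_to_unicode("KA"))  # KA followed by VOWEL SIGN AA
print(legacy_to_unicode("`K"))  # KA followed by VOWEL SIGN E (reordered)
```

A production converter would need a complete table for each legacy font, plus handling for conjuncts and other multi-character sequences, which this sketch leaves out.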
The Odia language is spoken by over 40 million people in eastern India, across various Indian cities, and by expatriates abroad. It is one of the oldest languages in South Asia, and is recognized as a “classical language” by the Indian government. The Odia Wikipedia celebrates its thirteenth anniversary today, June 3.
However, the “classical language” status has not yet boosted knowledge production or use of the language on the Internet. Almost all online newspapers and state publications, such as Odia-language journals, public announcements, and information portals, host their content in various legacy character encodings that do not allow users to easily access and share information. This has, unsurprisingly, proven to be a major hurdle for the small Odia Wikimedia community, who hope to enrich their project with Odia-language citations.
To help solve these problems, the community tried using two encoding converters. These were previously developed by friends from Srujanika, a non-profit based in Bhubaneswar, Odisha, that works on promoting science education in school curricula in the Odia language, as well as the digitization of early Odia literature. These converters became the building blocks on which Wikimedian Manoj Sahukar built new converters after substantially rewriting their code. I was also part of the rebuilding process, from the initial development of the converters to the design of their interface, and I helped to design handouts teaching new users how to use them.
The community played a major role in promoting the converters on social media. An op-ed in the Odia newspaper Samaja helped to reach more people who were unaware of the uses of Unicode. Many Internet users did not realize that they had been sharing knowledge on their blogs or social media using various legacy encodings. Text in these encodings does not appear in search engine results and cannot be shared in an accessible way.
The converters were improved over time by using news and articles from newspapers and magazines as test cases. Citing Odia-language sources wasn’t so easy before: making use of any content from a local newspaper could take hours.
From September 2014 to March 2015, a small project ran to convert text from several newspapers and magazines so that it could be used for citations in Odia Wikipedia articles. This is important because, when these sources are not available in Unicode, search engines and Wikipedia users can have difficulty finding them.
Because the converters were hosted separately on Google Drive, users could not find them all in one central place. Odia Wikipedians wanted their Wikipedia to host a single converter, where a user could select the appropriate input encoding. Wikimedian and developer Jnanaranjan Sahu came up with a responsive, wiki-based converter that went live on May 12 and is now available for use. The converter lets the user choose the source encoding from a drop-down menu and converts the input into Unicode. Issues with the conversion process can be reported via a Google spreadsheet.
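One way to picture a single converter with a selectable source encoding is a registry that maps each encoding name to its own lookup table, as in the Python sketch below. This is only an assumption about the general design, not the wiki-based converter's actual code; the encoding names "legacy-a" and "legacy-b", their tiny tables, and the function to_unicode are all hypothetical.

```python
# Hypothetical registry: one lookup table per legacy source encoding.
# The names and mappings are placeholders, not real Odia font layouts.
CONVERTERS = {
    "legacy-a": {"K": "\u0b15", "A": "\u0b3e"},
    "legacy-b": {"k": "\u0b15", "a": "\u0b3e"},
}


def to_unicode(text, source_encoding):
    """Convert text from the chosen legacy encoding to Unicode."""
    try:
        table = CONVERTERS[source_encoding]
    except KeyError:
        raise ValueError(
            f"unknown source encoding: {source_encoding!r}"
        ) from None
    # Characters without a mapping are passed through unchanged.
    return "".join(table.get(ch, ch) for ch in text)


# The registry keys play the role of the drop-down menu options.
print(sorted(CONVERTERS))

# The same Unicode text results from either legacy representation.
assert to_unicode("KA", "legacy-a") == to_unicode("ka", "legacy-b")
```

Keeping every table behind one entry point means supporting another legacy font only requires adding a new table to the registry.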
Combining five different converters into one, Jnanaranjan says, was a necessary next step in development: “When I found that there are different URLs for different converters, and that the URLs lead to a bunch of different sites, it seemed quite messed up. It would have been difficult for users to locate each of the converters. I thought it would be easier for users if they could find all the encoding converters for Odia on one page on their home wiki. So, I tried to tweak the source code and design this converter.”
He also explains that several newspapers whose news is encoded in older systems are now rich information sources. "Converting them and using the information to add more citations to Wikipedia could help to achieve the dream of every single person being able to contribute more information to the Odia Wikipedia," he says, "so all human knowledge may be available in our language."