Extracting English-to-Japanese Translations from Wiktionary

Wiktionary is a wiki-based collaborative dictionary that is freely available under the CC BY-SA 3.0 license. As of this writing, the English edition contains more than 5 million entries in over 3,800 languages, and the number is steadily increasing. Wiktionary is also becoming more popular in natural language processing research: a Google Scholar search for “Wiktionary” returns more than 13,500 results.

In this post, I’d like to demonstrate how to extract translation data from Wiktionary. Specifically, we’d like to extract English-to-Japanese translations, where English words are explained in Japanese. Unlike the language directions covered by CC-CEDICT or JEDICT, this direction has no good public-domain resources.

Below is a step-by-step guide to how I extracted the translation data:

Formatting Wiktionary Dump

First, we need to obtain the Wiktionary dump. All the recent dumps of the Wikimedia projects (including Wikipedia and Wiktionary) can be obtained from this Wikimedia Downloads page. Find jawiktionary (the Japanese edition, where words are explained in Japanese) there, and download the dump file (“All pages, current versions only”).

Wikimedia dumps (including the ones for Wiktionary) are stored in an XML format. While you could write your own parser to extract the article body and format it into plain text (which I’ve done several times in the past myself), parsing XML and MediaWiki notation and correctly formatting it into text can be a lot of work. For example, here’s an excerpt from the Wiktionary entry for the English word “hello”:

=={{en}}==

=== 発音 ===
*{{ipa|həˈləʊ, hɛˈləʊ}}

=== 間投詞 ===
# [[こんにちは]]
# [[もしもし]]

=== 異綴 ===
*[[hallo]]
*[[halloa]]
*[[hullo]]
*[[hulloa]]

==== 関連語 ====
*[[hi]]
=== {{trans}} ===

There is a lot of formatting and fluff here, including related words and spelling variations, but all you want is probably something like this:

hello, intj. こんにちは
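To give a feel for the hand-rolled approach, here is a minimal sketch that pulls the definition lines (the ones starting with "#") out of wikitext like the excerpt above and strips the [[...]] link brackets. The function name and regular expressions are my own illustrative assumptions, not part of any tool, and real entries contain nested templates and other markup that this does not handle.

```python
import re

# Matches [[target]] and [[target|label]]; keeps the label when present.
LINK_RE = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")
# Matches wikitext headings such as "=== 間投詞 ===" or "=={{en}}==".
HEADING_RE = re.compile(r"^(=+)\s*(.*?)\s*=+\s*$")


def extract_definitions(wikitext):
    """Return {heading: [definition, ...]} for lines starting with '#'."""
    result = {}
    heading = None
    for line in wikitext.splitlines():
        line = line.strip()
        m = HEADING_RE.match(line)
        if m:
            heading = m.group(2)
            continue
        if line.startswith("#") and heading is not None:
            text = LINK_RE.sub(r"\1", line.lstrip("#").strip())
            result.setdefault(heading, []).append(text)
    return result


sample = """=={{en}}==

=== 発音 ===
*{{ipa|həˈləʊ, hɛˈləʊ}}

=== 間投詞 ===
# [[こんにちは]]
# [[もしもし]]

=== 異綴 ===
*[[hallo]]
"""

print(extract_definitions(sample))
```

Even this toy version needs special cases for headings, list markers, and links, which is why an existing tool is usually the better choice.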

Fortunately, there are a lot of tools and libraries that are specifically designed to deal with Wikimedia dumps. In this post, we use WikiExtractor, which seems to be one of the most popular options out there. I found a small issue with the script when it expands templates, but after fixing it, the script works well. I invoked it with the --sections and --html options so that it preserves as much structural information as possible for further processing.

python WikiExtractor.py path/to/wiktionary-dump.xml.bz2 --sections --html --template 

Extracting English-to-Japanese Translations

At this point, the Wiktionary dump has been converted to a format that is relatively easy to process and extract translations from. I wrote a simple Python script that scans the HTML-like files generated by WikiExtractor line by line and extracts the parts of the article body that most likely contain Japanese definitions for English words. In article bodies, <h2> tags correspond to languages (since one article may contain definitions for multiple languages), and <h3> tags correspond to categories like “nouns,” “spelling variations,” “pronunciation,” and so on.
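The scanning logic described above can be sketched roughly as follows. This is not the original script, just a minimal illustration under several assumptions: each article is wrapped in <doc ... title="..."> ... </doc>, each heading sits on its own line as <h2>…</h2> or <h3>…</h3>, the English section is headed “英語”, and the set of categories worth keeping (WANTED_CATEGORIES) is a hypothetical list you would tune against the real dump.

```python
import json
import re

DOC_OPEN_RE = re.compile(r'<doc\b[^>]*\btitle="([^"]*)"[^>]*>')
HEADING_RE = re.compile(r"<h([2-6])>(.*?)</h\1>")
TAG_RE = re.compile(r"<[^>]+>")

# Assumptions: heading text of the English section and the categories to keep.
ENGLISH_HEADINGS = {"英語", "English"}
WANTED_CATEGORIES = {"名詞", "動詞", "形容詞", "副詞", "間投詞"}


def parse_docs(lines):
    """Yield {"title": ..., "english": {category: text}} per article."""
    title = None
    sections = {}
    in_english = False
    category = None
    for raw in lines:
        line = raw.strip()
        m = DOC_OPEN_RE.match(line)
        if m:  # start of a new article
            title, sections, in_english, category = m.group(1), {}, False, None
            continue
        if line == "</doc>":  # end of article: emit it if anything was kept
            if title is not None and sections:
                yield {"title": title,
                       "english": {k: "\n".join(v) for k, v in sections.items()}}
            title = None
            continue
        if title is None:
            continue
        m = HEADING_RE.fullmatch(line)
        if m:
            level = int(m.group(1))
            # Drop inner tags and a trailing period, if the tool added one.
            text = TAG_RE.sub("", m.group(2)).strip().rstrip(".")
            if level == 2:  # <h2>: language
                in_english, category = text in ENGLISH_HEADINGS, None
            elif level == 3:  # <h3>: category (part of speech, etc.)
                category = text if text in WANTED_CATEGORIES else None
            continue  # deeper headings stay under the current category
        text = TAG_RE.sub("", line).strip()
        if in_english and category is not None and text:
            sections.setdefault(category, []).append(text)


sample = """<doc id="1" url="https://example.org/1" title="hello">
hello
<h2>英語</h2>
<h3>発音</h3>
<li>IPA: /həˈləʊ/</li>
<h3>間投詞</h3>
<li>こんにちは</li>
<li>もしもし</li>
</doc>
""".splitlines()

for entry in parse_docs(sample):
    print(json.dumps(entry, ensure_ascii=False))
```

Because the sketch keeps everything under a wanted <h3> until the next <h2> or <h3>, text from deeper subsections ends up in the same value, so some noise in the output is expected.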

The final result of this script is a file where each line is one JSON string for an entry. The entries look like this (wrapped here for readability):

{
  "title": "zoo",
  "english": {"名詞": "zoological garden を短略した語。\n英語.\n
                      発音.\nIPA: /zuː/\n\n
                      名詞.\n\n\n動物園。\n\n
                      関連語\n\nzoo-\nzoologist\nzoology\nzoological garden\n\n
                      訳語\n\nフィンランド語: eläintarha\n
                      スウェーデン語: Zoo, djurpark"}}
{
  "title": "hello",
  "english": {"間投詞": "こんにちは\nもしもし"}}
...

Not perfect, but very close to what I originally wanted!
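As a small usage sketch, a file in this one-JSON-object-per-line format can be read back and flattened into (word, category, gloss) rows, which is close to the “hello, intj. こんにちは” form I was after. The function name is hypothetical, and the demo reads from an in-memory string rather than a real output file.

```python
import io
import json


def iter_glosses(fileobj):
    """Yield (title, category, gloss) for every non-empty gloss line."""
    for line in fileobj:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        for category, body in entry.get("english", {}).items():
            for gloss in body.split("\n"):
                gloss = gloss.strip()
                if gloss:
                    yield entry["title"], category, gloss


# In-memory stand-in for the output file (one JSON object per line).
demo = io.StringIO(
    '{"title": "hello", "english": {"間投詞": "こんにちは\\nもしもし"}}\n'
)
rows = list(iter_glosses(demo))
for title, category, gloss in rows:
    print(f"{title}\t{category}\t{gloss}")
```

For a real file, the same function can be given the result of open(path, encoding="utf-8").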


© Masato Hagiwara. All rights reserved.
