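English-to-Japanese translation with transformers

Introduction

This post tries out English-to-Japanese translation using pretrained models.

Models

I picked out models from the HuggingFace Hub that look capable of English-to-Japanese translation without any fine-tuning.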

marian: Helsinki-NLP/opus-mt-en-jap
mbart: facebook/mbart-large-50-one-to-many-mmt
- for translating from English into each of the other languages
mbart: facebook/mbart-large-50-many-to-many-mmt
- for translating from any supported language into any other

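The output of opus-mt-en-jap is poor, so in addition to it I also take FuguMT (https://staka.jp/wordpress/?p=413), a marian-based model published by staka, convert it into a form that transformers can load, and include it in the comparison.

The comparison covers the following three models.

Model name	Architecture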
Helsinki-NLP/opus-mt-en-jap	marian
FuguMT	marian
facebook/mbart-large-50-one-to-many-mmt	mbart

Model internals	layers	hidden_size	attn heads	vocab size	params
opus-mt-en-jap	6	512	8	46,276	67,831,808
FuguMT	6	512	8	32,001	60,523,008
mbart-large-50-one-to-many-mmt	12	1024	16	250,054	610,879,488
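
The params column can be cross-checked by hand. The sketch below is a rough arithmetic reconstruction, not the method used to produce the table. It assumes a standard Transformer encoder-decoder with biased linear layers, an embedding matrix shared by encoder, decoder and output layer and counted once, and a feed-forward width of 4 x hidden_size (2048 for the marian models, 4096 for mbart). For mbart only, it additionally assumes learned position tables of 1026 rows and four extra layer norms. All function names are hypothetical. Under those assumptions the estimate reproduces the three figures in the table.

```python
def linear(n_in, n_out):
    """Weights plus bias of a fully connected layer."""
    return n_in * n_out + n_out


def layer_norm(hidden):
    """Scale and shift vectors of a layer norm."""
    return 2 * hidden


def attention(hidden):
    """Query, key, value and output projections."""
    return 4 * linear(hidden, hidden)


def encoder_layer(hidden, ffn):
    # self-attention + feed-forward block, each followed by a layer norm
    return attention(hidden) + linear(hidden, ffn) + linear(ffn, hidden) + 2 * layer_norm(hidden)


def decoder_layer(hidden, ffn):
    # self-attention + cross-attention + feed-forward block, three layer norms
    return 2 * attention(hidden) + linear(hidden, ffn) + linear(ffn, hidden) + 3 * layer_norm(hidden)


def estimate_params(layers, hidden, vocab, position_rows=0, extra_layer_norms=0):
    ffn = 4 * hidden                        # assumed feed-forward width
    embeddings = vocab * hidden             # shared embedding, counted once
    positions = 2 * position_rows * hidden  # learned tables (encoder + decoder), if any
    stack = layers * (encoder_layer(hidden, ffn) + decoder_layer(hidden, ffn))
    return embeddings + positions + stack + extra_layer_norms * layer_norm(hidden)


if __name__ == "__main__":
    print("opus-mt-en-jap", estimate_params(6, 512, 46276))
    print("FuguMT", estimate_params(6, 512, 32001))
    print("mbart-large-50", estimate_params(12, 1024, 250054,
                                            position_rows=1026, extra_layer_norms=4))
```

The two marian rows differ only in the embedding term (vocab size x 512), which accounts for the whole gap between 67,831,808 and 60,523,008.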

Environment

python: 3.8
CUDA: 11.1
pytorch: 1.8.1
transformers: 4.9.2

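Converting FuguMT to PyTorch format

As preparation, FuguMT first has to be converted to PyTorch format.

Downloading the model

The model linked from the blog could not be downloaded, so I fetch https://fugumt.com/FuguMT_ver.202011.1.zip instead, which is linked from https://github.com/s-taka/fugumt.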

$ mkdir /work/fugumt
$ cd /work/fugumt
$ wget https://fugumt.com/FuguMT_ver.202011.1.zip
$ unzip FuguMT_ver.202011.1.zip

Modifying the conversion script

I tried to use convert_marian_to_pytorch.py, the conversion script bundled with transformers 4.9.2, but it does not work as-is for this model, so I made the following changes.

--- convert_marian_to_pytorch.py    2021-08-08 16:30:26.452665090 +0900
+++ convert_marian_to_pytorch.py    2021-08-09 19:07:31.006758177 +0900
@@ -380,13 +380,14 @@
     return list(model_dir.glob("*vocab.yml"))[0]


-def add_special_tokens_to_vocab(model_dir: Path) -> None:
-    vocab = load_yaml(find_vocab_file(model_dir))
-    vocab = {k: int(v) for k, v in vocab.items()}
+def add_special_tokens_to_vocab(model_dir: Path, spm_path) -> None:
+    #vocab = load_yaml(find_vocab_file(model_dir))
+    #vocab = {k: int(v) for k, v in vocab.items()}
+    vocab = load_spm(spm_path)
     num_added = add_to_vocab_(vocab, ["<pad>"])
     print(f"added {num_added} tokens to vocab")
     save_json(vocab, model_dir / "vocab.json")
-    save_tokenizer_config(model_dir)
+    #save_tokenizer_config(model_dir)


 def check_equal(marian_cfg, k1, k2):
@@ -397,7 +398,7 @@
 def check_marian_cfg_assumptions(marian_cfg):
     assumed_settings = {
         "tied-embeddings-all": True,
-        "layer-normalization": False,
+        "layer-normalization": True,
         "right-left": False,
         "transformer-ffn-depth": 2,
         "transformer-aan-depth": 2,
@@ -453,7 +454,10 @@

 class OpusState:
     def __init__(self, source_dir):
-        npz_path = find_model_file(source_dir)
+        decoder_yml = cast_marian_config(load_yaml(source_dir / "decoder.yml"))
+        #npz_path = find_model_file(source_dir)
+        npz_path = decoder_yml["models"][0]
+        self.vocab_file = decoder_yml["vocabs"]
         self.state_dict = np.load(npz_path)
         cfg = load_config_from_state_dict(self.state_dict)
         assert cfg["dim-vocabs"][0] == cfg["dim-vocabs"][1]
@@ -474,7 +478,7 @@
         ), f"Hidden size {hidden_size} and configured size {cfg['dim_emb']} mismatched or not 512"

         # Process decoder.yml
-        decoder_yml = cast_marian_config(load_yaml(source_dir / "decoder.yml"))
+        #decoder_yml = cast_marian_config(load_yaml(source_dir / "decoder.yml"))
         check_marian_cfg_assumptions(cfg)
         self.hf_config = MarianConfig(
             vocab_size=cfg["vocab_size"],
@@ -583,11 +587,14 @@
     dest_dir = Path(dest_dir)
     dest_dir.mkdir(exist_ok=True)

-    add_special_tokens_to_vocab(source_dir)
-    tokenizer = MarianTokenizer.from_pretrained(str(source_dir))
-    tokenizer.save_pretrained(dest_dir)
-
     opus_state = OpusState(source_dir)
+
+    add_special_tokens_to_vocab(source_dir, opus_state.vocab_file[1])
+    #tokenizer = MarianTokenizer.from_pretrained(str(source_dir), source_spm=opus_state.vocab_file[0], target_spm=opus_state.vocab_file[1])
+    tokenizer = MarianTokenizer(source_dir / "vocab.json", source_spm=opus_state.vocab_file[0], target_spm=opus_state.vocab_file[1])
+    tokenizer.save_pretrained(dest_dir)
+    save_tokenizer_config(dest_dir)
+
     assert opus_state.cfg["vocab_size"] == len(
         tokenizer.encoder
     ), f"Original vocab size {opus_state.cfg['vocab_size']} and new vocab size {len(tokenizer.encoder)} mismatched"
@@ -606,6 +613,12 @@
     with open(path) as f:
         return yaml.load(f, Loader=yaml.BaseLoader)

+def load_spm(path):
+    import sentencepiece as spm
+    sp = spm.SentencePieceProcessor(model_file=path)
+
+    vocab = {sp.IdToPiece(i): i for i in range(sp.GetPieceSize())}
+    return vocab

 def save_json(content: Union[Dict, List], path: str) -> None:
     with open(path, "w") as f:

The modified code is posted as a gist.
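
The vocabulary part of the change comes down to two steps: map every sentencepiece piece to its id, then append "<pad>" with the next free id. Below is a minimal sketch of that logic, using a small stand-in list of pieces instead of a real sentencepiece model so that it runs without any model file. The helper names and the toy pieces are hypothetical, and add_special_tokens only mirrors what add_to_vocab_ in the conversion script appears to do.

```python
def vocab_from_pieces(pieces):
    """Map each piece to its index, like {sp.IdToPiece(i): i} in load_spm."""
    return {piece: i for i, piece in enumerate(pieces)}


def add_special_tokens(vocab, special_tokens):
    """Append tokens that are not yet in the vocab, using the next free ids.

    Returns the number of tokens actually added.
    """
    next_id = max(vocab.values()) + 1
    added = 0
    for token in special_tokens:
        if token in vocab:
            continue  # already present, keep its existing id
        vocab[token] = next_id + added
        added += 1
    return added


# Toy stand-in for the pieces of vocab.enja.spm (not the real vocabulary).
pieces = ["<unk>", "<s>", "</s>", "▁This", "▁is", "▁an", "▁apple", "."]
vocab = vocab_from_pieces(pieces)
num_added = add_special_tokens(vocab, ["<pad>"])
print(f"added {num_added} tokens, vocab size is now {len(vocab)}")
```

If the real sentencepiece model holds 32,000 pieces (an assumption, not checked here), the same step would yield 32,001 entries, which is consistent with the vocab size listed for FuguMT in the table above.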

Running the conversion

$ cd /work/fugumt
$ ln -s /work/fugumt/model/model.npz.decoder.yml /work/fugumt/model/decoder.yml
$ python convert_marian_to_pytorch.py --src model --dest en-jap
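
Before loading the result, it can help to confirm that the output directory is complete. The sketch below lists which files are missing. The expected file names are an assumption based on what MarianTokenizer.save_pretrained and the model's save_pretrained usually write in transformers 4.9.x; they were not taken from the original post, and the helper name is hypothetical.

```python
from pathlib import Path

# Assumed output of the conversion; adjust if your transformers version differs.
EXPECTED_FILES = (
    "config.json",
    "pytorch_model.bin",
    "vocab.json",
    "source.spm",
    "target.spm",
    "tokenizer_config.json",
)


def missing_files(dest_dir, expected=EXPECTED_FILES):
    """Return the expected file names that do not exist in dest_dir."""
    dest = Path(dest_dir)
    return [name for name in expected if not (dest / name).is_file()]


if __name__ == "__main__":
    # Directory produced by the --dest argument above.
    print(missing_files("/work/fugumt/en-jap"))
```

An empty list means every assumed file is present; anything else names the files to investigate.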

Running translation

Translation can then be tried as follows.

from transformers import pipeline

opus_translator = pipeline('translation', model='Helsinki-NLP/opus-mt-en-jap')
fugu_translator = pipeline('translation', model='/work/fugumt/en-jap/')
mbart_translator = pipeline('translation',
                            model='facebook/mbart-large-50-one-to-many-mmt',
                            src_lang='en_XX', tgt_lang='ja_XX')

opus_translator('This is an apple.')
# => [{"translation_text": "これ は ひとみ で あ る ."}]

fugu_translator('This is an apple.')
# => [{"translation_text": "これはリンゴです。"}]

mbart_translator('This is an apple.')
# => [{"translation_text": "これはリンゴです"}]

With marian(opus), the output contains spaces between tokens. The samples below show its results with those spaces removed.
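
The space removal can be sketched as a small post-processing step on the pipeline output. The function names are hypothetical, and the sketch removes every ASCII space, so it would also glue together any Latin-script words in the output; that is acceptable for the all-Japanese opus results shown here but is a simplification.

```python
def remove_spaces(text):
    """Drop the ASCII spaces that marian(opus) inserts between tokens."""
    return text.replace(" ", "")


def clean_results(results):
    """Apply remove_spaces to a pipeline result list of {"translation_text": ...}."""
    return [remove_spaces(item["translation_text"]) for item in results]


# Output of opus_translator('This is an apple.') as shown above.
sample = [{"translation_text": "これ は ひとみ で あ る ."}]
print(clean_results(sample))
# => ['これはひとみである.']
```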

Sample translation results

Sample 1

(From the transformers description of the marian model)

Since Marian models are smaller than many other translation models
available in the library, they can be useful for fine-tuning
experiments and integration tests.

Translation results

# marian(opus)
後者は,わたしたちにとって信頼しているのではない.物を警戒したり,警戒す
ることができないので,それは豊かであって,恐怖と試錬とで日を過ごすことが
できる.

# marian(fugu)
Marianモデルはライブラリで利用可能な他の多くの翻訳モデルよりも小さいた
め、微調整実験や統合テストに有用である。

# mbart
Marianモデルは、ライブラリで利用可能な他の多くの翻訳モデルよりも小さい
ため、微調整実験や統合テストに役立ちます。

marian(fugu) and mbart both translate this reasonably well.

Sample 2

(From the transformers description of tokenizers)

A tokenizer is in charge of preparing the inputs for a model. The
library contains tokenizers for all the models. Most of the tokenizers
are available in two flavors: a full python implementation and a
"Fast" implementation based on the Rust library tokenizers.

Translation results

# marian(opus)
あかしをする者の証拠は,そこに起るべきものである.その証拠はこれをしるし
ている.また,告訴の者たちのために,その二つのしるしを尋ねている.これは起っ
た者の証明と,心の荒い者たちであって,あかしをするものである.

# marian(fugu)
トークンライザは、モデルの入力を準備する。ライブラリには、すべてのモデ
ルのトークンライザが含まれている。ほとんどのトークンライザは、完全な
python実装と、Rustライブラリのトークンライザに基づいた"Fast"実装の2つ
のフレーバーで利用できる。

# mbart
ライブラリにはすべてのモデル用のトークナイザーが含まれています。ほとん
どのトークナイザーは2つのフレーバーで利用可能です:完全なパイソン実装と
Rustライブラリのトークナイザーに基づいた"Fast"実装。

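marian(fugu) translates this reasonably well. mbart is also passable, but it drops the first sentence entirely.

Translation results (for reference)

I also ran the same text through Google Translate and DeepL.

# Google Translate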
トークナイザーは、モデルの入力の準備を担当します。ライブラリには、すべ
てのモデルのトークナイザーが含まれています。ほとんどのトークナイザーは、
完全なpython実装とRustライブラリトークナイザーに基づく「高速」実装の2
つのフレーバーで利用できます。

# DeepL
トークナイザーは、モデルの入力を準備する役割を果たします。ライブラリに
は ライブラリーには、すべてのモデルのトークナイザーが含まれています。
トークナイザーのほとんどは は2種類あります：完全なpython実装と Rustラ
イブラリのトークナイザーをベースにした "Fast "実装です。

Conclusion

English-to-Japanese translation with marian(fugu) and mbart falls short of Google Translate and DeepL, but my impression is that both hold up reasonably well.