Natural Language Processing with Python (『入門自然言語処理』), Chapter 1: Language Processing and Python

Authors: Steven Bird, Ewan Klein, Edward Loper, 萩原正人, 中山敬広, 水野貴明
Publisher: O'Reilly Japan
Release date: 2010/11/11
Format: Large-format book

I got interested in natural language processing, so I'm giving it a quick try.

1.1 Computing with Language: Texts and Words

Install NumPy, PyYAML, and NLTK.

import nltk
nltk.download()

A dialog appears; download the "book" collection.

from nltk.book import *

concordance()

text1.concordance("monstrous")

Extracts the passages containing "monstrous" from the novel Moby Dick, which is stored in text1.
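To get a feel for what a concordance does, here is a minimal sketch in plain Python. It is not NLTK's implementation; the function name, the `width` parameter, and the toy token list are all assumptions made for illustration.

```python
def concordance(tokens, word, width=3):
    """Return each occurrence of `word` with up to `width` tokens of context on both sides."""
    target = word.lower()
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

# Toy token list (an assumption for illustration, not the Moby Dick text).
tokens = "the monstrous whale rose and a most monstrous wave broke over us".split()
for left, tok, right in concordance(tokens, "monstrous"):
    # Right-align the left context so the keyword lines up in a column.
    print(f"{left:>20} {tok} {right}")
```

Aligning the keyword in one column is what makes the surrounding contexts easy to compare by eye.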

similar(), common_contexts()

text1.similar("monstrous")

Finds words that appear in similar contexts.

text2.common_contexts(["monstrous", "very"])

Finds the contexts shared by the given words. In this example they include is _ pretty (is very pretty, is monstrous pretty).
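The idea of a "shared context" can be sketched roughly as follows, treating a context as the pair (previous word, next word). This is a simplified assumption and not NLTK's actual code; the helper names and the toy sentence are made up for the example.

```python
from collections import defaultdict

def contexts_by_word(tokens):
    """Map each lowercased word to the set of (previous, next) word pairs around it."""
    contexts = defaultdict(set)
    for prev, word, nxt in zip(tokens, tokens[1:], tokens[2:]):
        contexts[word.lower()].add((prev.lower(), nxt.lower()))
    return contexts

def common_contexts(tokens, words):
    """Return the contexts shared by every word in `words`."""
    contexts = contexts_by_word(tokens)
    shared = set.intersection(*(contexts[w.lower()] for w in words))
    return sorted(shared)

# Toy sentence (an assumption for illustration).
tokens = "she is very pretty and he is monstrous pretty but a very lucky man".split()
print(common_contexts(tokens, ["monstrous", "very"]))
```

Words that share many such contexts are the ones a method like similar() would be expected to surface.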

dispersion_plot()

text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])

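Visualizes where the words occur across the text. text4 is the collection of US presidential inaugural addresses, spanning 220 years. The plot shows that "freedom" and "America" have come into heavy use only in more recent addresses.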
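The data behind a dispersion plot is just the token position of each occurrence. Here is a small sketch that collects those positions and draws a text-only strip instead of a real plot. The function name and the toy sentence are assumptions for illustration.

```python
def dispersion_offsets(tokens, words):
    """Map each target word to the list of token positions where it occurs."""
    targets = set(words)
    offsets = {w: [] for w in words}
    for position, tok in enumerate(tokens):
        if tok in targets:
            offsets[tok].append(position)
    return offsets

# Toy sentence (an assumption for illustration, not the inaugural corpus).
tokens = "we the citizens value freedom and duties while freedom guides America".split()
offsets = dispersion_offsets(tokens, ["citizens", "freedom", "America"])
for word, positions in offsets.items():
    # One-line text strip: '|' where the word occurs, '.' elsewhere.
    strip = "".join("|" if i in positions else "." for i in range(len(tokens)))
    print(f"{word:>10} {strip}")
```

Because text4 is ordered chronologically, a position near the end of the text corresponds to a more recent address.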

generate()

text4.generate()

Generates random text from the statistical properties of the data. It apparently uses an n-gram model.

Building ngram index...
Fellow - citizens , much to do good . We are torn by division ,
wanting unity . We have no lawful right to equal rights and an orderly
society . We can strike at war . We must face a condition of our
country ; to carry them into those of high trust , in which the new
materials of social intercourse , either Republicans or Democrats ,
but we ought to accomplish all the means by which to rely on your
counsel , and which , while the thought of promoting any alteration in
it , if any
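As a rough illustration of n-gram generation, here is a bigram random walk in plain Python. Whether generate() works exactly this way is not something I have checked, so treat the approach, the function names, and the toy sentence as assumptions.

```python
import random
from collections import defaultdict

def build_bigram_model(tokens):
    """Map each word to the list of words observed immediately after it."""
    followers = defaultdict(list)
    for current, nxt in zip(tokens, tokens[1:]):
        followers[current].append(nxt)
    return followers

def generate(tokens, start, length=10, seed=0):
    """Random walk over the bigram model, stopping early at a dead end."""
    rng = random.Random(seed)
    followers = build_bigram_model(tokens)
    words = [start]
    while len(words) < length and followers.get(words[-1]):
        # Frequent followers appear more often in the list, so they are picked more often.
        words.append(rng.choice(followers[words[-1]]))
    return words

# Toy sentence (an assumption for illustration).
tokens = "we must face the future and we must keep the peace".split()
print(" ".join(generate(tokens, "we")))
```

Every adjacent pair in the output was seen in the source text, which is why the result looks locally plausible but drifts globally.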

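Counting Vocabulary

Converting a text with set() removes duplicate words. Lexical richness can be compared using len(<text>)/len(set(<text>)), the average number of times each distinct word is used.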

100 * text5.count("lol") / len(text5)
# 1.5640968673628082

In text5 (a chat log), "lol" accounts for 1.56% of all tokens.
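Both measures are easy to wrap in small helper functions. The sketch below uses a toy token list rather than an NLTK text, and the function names are my own choice for illustration.

```python
def lexical_diversity(tokens):
    """Average number of times each distinct word is used (tokens per type)."""
    return len(tokens) / len(set(tokens))

def percentage(count, total):
    """Share of `count` in `total`, as a percentage."""
    return 100 * count / total

# Toy chat line (an assumption for illustration): 8 tokens, 6 distinct words.
tokens = "lol that was so funny lol lol ok".split()
print(lexical_diversity(tokens))
print(percentage(tokens.count("lol"), len(tokens)))
```

Note that `/` is true division in Python 3; under Python 2 the same expressions would need `from __future__ import division` to avoid integer truncation.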

1.2 A Closer Look at Python: Texts as Lists of Words

Python basics (mainly list operations). Skimmed.

1.3 Computing with Language: Simple Statistics

Frequency Distributions: FreqDist

fdist1 = FreqDist(text1)
fdist1
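# <FreqDist with 19317 samples and 260819 outcomes> # 260819 outcomes is the total number of words
# In NLTK 2, keys() is sorted by decreasing frequency. In NLTK 3 it is unordered and not sliceable; fdist1.most_common(50) gives the top 50 with counts.
vocabulary1 = fdist1.keys() # list of distinct words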
vocabulary1[:50]
# [',', 'the', '.', 'of', 'and', 'a', 'to', ';', 'in', 'that', "'", '-', 'his', 'it', 'I', 's', 'is', 'he', 'with', 'was', 'as', '"', 'all', 'for', 'this', '!', 'at', 'by', 'but', 'not', '--', 'him', 'from', 'be', 'on', 'so', 'whale', 'one', 'you', 'had', 'have', 'there', 'But', 'or', 'were', 'now', 'which', '?', 'me', 'like']
fdist1['whale']
# 906

For text2 the result looks like this:

fdist2 = FreqDist(text2)
vocabulary2 = fdist2.keys()
vocabulary2[:50]
# [',', 'to', '.', 'the', 'of', 'and', 'her', 'a', 'I', 'in', 'was', 'it', '"', ';', 'she', 'be', 'that', 'for', 'not', 'as', 'you', 'with', 'had', 'his', 'he', "'", 'have', 'at', 'by', 'is', '."', 's', 'Elinor', 'on', 'all', 'him', 'so', 'but', 'which', 'could', 'Marianne', 'my', 'Mrs', 'from', 'would', 'very', 'no', 'their', 'them', '--']

fdist1.plot(50, cumulative=True) visualizes what proportion of the total word count is accounted for by the 50 most frequent words.
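The number that the cumulative plot ends on can also be computed directly. Here is a sketch using `collections.Counter` from the standard library in place of FreqDist; the function name and toy sentence are assumptions for illustration.

```python
from collections import Counter

def cumulative_coverage(tokens, n):
    """Fraction of all tokens accounted for by the n most frequent words."""
    counts = Counter(tokens)
    top_total = sum(count for _, count in counts.most_common(n))
    return top_total / len(tokens)

# Toy sentence (an assumption for illustration): 8 tokens, "the" x3, "and" x2.
tokens = "the whale and the sea and the ship".split()
print(Counter(tokens).most_common(2))
print(cumulative_coverage(tokens, 2))
```

On real text the curve rises steeply at first, because a handful of function words and punctuation marks make up a large share of all tokens.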

Other Ways to Select Words

V = set(text1)
long_words = [w for w in V if len(w) > 15]
sorted(long_words)
# ['CIRCUMNAVIGATION', 'Physiognomically', 'apprehensiveness', 'cannibalistically', 'characteristically', 'circumnavigating', 'circumnavigation', 'circumnavigations', 'comprehensiveness', 'hermaphroditical', 'indiscriminately', 'indispensableness', 'irresistibleness', 'physiognomically', 'preternaturalness', 'responsibilities', 'simultaneousness', 'subterraneousness', 'supernaturalness', 'superstitiousness', 'uncomfortableness', 'uncompromisedness', 'undiscriminating', 'uninterpenetratingly']

Extracts only the long words. Incidentally, doing the same with text5 gives amusing results:

['!!!!!!!!!!!!!!!!', ... , 'BAAAAALLLLLLLLIIIIIIINNNNNNNNNNN', 'Bloooooooooooood', 'HHEEYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY', ... , 'woooooooooaaaahhhhhhhhhhhh',... , 'yuuuuuuuuuuuummmmmmmmmmmm']

Restricting this to words longer than 7 characters that occur more than 7 times:

fdist5 = FreqDist(text5)
sorted([w for w in fdist5 if len(w) > 7 and fdist5[w] > 7])
# ['#14-19teens', '#talkcity_adults', '((((((((((', '........', 'Question', 'actually', 'anything', 'computer', 'cute.-ass', 'everyone', 'football', 'innocent', 'listening', 'remember', 'seriously', 'something', 'together', 'tomorrow', 'watching']

Collocations

Bigram
- a pair of adjacent words
Collocation
- a bigram that occurs unusually often

bigrams(['more', 'is', 'said', 'than', 'done'])
# [('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
text4.collocations()
# Building collocations list
# United States; fellow citizens; four years; years ago; Federal Government; General Government; American people; Vice President; Old World; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice; God bless; every citizen; Indian tribes; public debt; one another; foreign nations; political parties
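Bigram extraction is just pairing each word with its successor, which can be sketched with `zip`. The frequency-based filter below is a deliberately naive stand-in: NLTK's collocations() applies additional scoring and filtering that this sketch does not reproduce. Function names and the toy text are assumptions.

```python
from collections import Counter

def bigrams(tokens):
    """Return the list of adjacent word pairs."""
    return list(zip(tokens, tokens[1:]))

def frequent_bigrams(tokens, min_count=2):
    """Bigrams seen at least `min_count` times, most frequent first."""
    counts = Counter(bigrams(tokens))
    return [pair for pair, count in counts.most_common() if count >= min_count]

print(bigrams(["more", "is", "said", "than", "done"]))

# Toy text (an assumption for illustration).
tokens = "fellow citizens of the United States , fellow citizens and the United States".split()
print(frequent_bigrams(tokens))
```

Raw counts alone tend to favor pairs of very common words such as ("the", "United"), which is one reason real collocation finders weigh pair frequency against the frequency of the individual words.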

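Word Length Distribution

Feed FreqDist a list of lengths such as [len(w) for w in text1]. FreqDist.items() then shows each length paired with its count. Many other functions for working with frequency distributions are defined as well.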
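The same counting can be tried without NLTK. This sketch uses `collections.Counter` on a toy token list (an assumption for illustration), counting word lengths instead of words.

```python
from collections import Counter

# Toy token list (an assumption for illustration).
tokens = "Call me Ishmael . Some years ago".split()

# Count how many tokens have each length.
length_counts = Counter(len(w) for w in tokens)
for length, count in sorted(length_counts.items()):
    print(length, count)

# The most frequent word length and how often it occurs.
most_common_length, occurrences = length_counts.most_common(1)[0]
print(most_common_length, occurrences)
```

The only change from a word frequency count is the expression fed to the counter, so the same pattern works for any derived property of a token.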

1.4 Back to Python: Making Decisions and Taking Control

Conditionals and loops. Skipped.

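1.5 Automatic Natural Language Understanding

Word sense disambiguation
- determine which meaning of a polysemous word is intended
Pronoun resolution
- determine the antecedent a pronoun refers to (anaphora resolution)
- determine how the pronoun relates to the verb (semantic role labeling)
Language generation
- question answering and machine translation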

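Machine Translation

babelize_shell() lets you experiment with machine translation, although it seems the character encoding has to be configured first.
- The name is amusing
Text alignment
- from documents with the same content written in multiple languages, build pairs of corresponding sentences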

Textual Entailment

Recognizing Textual Entailment (RTE)
- the task of deciding whether a given text provides enough evidence to accept a given hypothesis

