NLTK

NLTK (Natural Language Toolkit) is a Python platform for working with human language data.

A complete list of NLTK modules can be found here: https://www.nltk.org/py-modindex.html

Open Multilingual Wordnet (OMW)

While a dictionary defines words, a thesaurus groups words with similar meanings. WordNet is a lexical database of English that combines dictionary and thesaurus functionality.

The goal of Open Multilingual Wordnet is to make it easy to use wordnets in multiple languages.

To use Open Multilingual WordNet you need to download it (Corpora -> 'omw') interactively via the GUI of the NLTK Downloader, started from a Python session:

In [1]:
import nltk
nltk.download()
showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml
Out[1]:
True

Note: if you only download wordnet instead of omw, you will not be able to work with languages other than English.

For the following examples you also need to download and import two further corpora (a non-interactive way to fetch everything is sketched after this list):

  • webtext (containing files in txt format)
  • stopwords (to filter out stopwords)
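
If you prefer to skip the GUI, nltk.download() also accepts package identifiers directly. A minimal sketch fetching everything used below (note: recent NLTK releases ship Open Multilingual Wordnet as 'omw-1.4' rather than 'omw'):

import nltk

# Non-interactive downloads by package identifier
nltk.download('wordnet')    # English WordNet
nltk.download('omw')        # Open Multilingual Wordnet ('omw-1.4' on recent NLTK)
nltk.download('webtext')    # web text corpus
nltk.download('stopwords')  # stopword lists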
In [2]:
from nltk.corpus import wordnet
from nltk.corpus import webtext
from nltk.corpus import stopwords

Corpora

Each corpus consists of several files containing text. To get the list of files of a corpus, e.g. the wordnet corpus imported above, run:

In [3]:
print(wordnet.fileids())
('cntlist.rev', 'lexnames', 'index.sense', 'index.adj', 'index.adv', 'index.noun', 'index.verb', 'data.adj', 'data.adv', 'data.noun', 'data.verb', 'adj.exc', 'adv.exc', 'noun.exc', 'verb.exc')

To get the list of words in a corpus, we use the .words() method:

In [4]:
print(webtext.words())
['Cookie', 'Manager', ':', '"', 'Don', "'", 't', ...]
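
Besides .words(), plaintext corpora such as webtext also offer .raw() (the full text as a single string) and .sents() (word-tokenized sentences); all three accept an optional file id to restrict the result to one file. A short sketch ('firefox.txt' is assumed to be one of the webtext file ids, as on a standard install):

from nltk.corpus import webtext

print(webtext.fileids())                 # files in the corpus
print(webtext.raw('firefox.txt')[:60])   # first 60 characters of one file
print(webtext.sents('firefox.txt')[0])   # first tokenized sentence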

WordNet (OMW): Synset basics

Synsets are WordNet's groupings of synonymous words that express the same concept.

In [5]:
syn = wordnet.synsets('fantasma', lang='ita')
print(syn)
[Synset('ghost.n.01'), Synset('figment.n.01'), Synset('apparition.n.01')]
In [6]:
print("NAME: ",syn[0].name())
print("DEFINITION: ",syn[0].definition())
print("EXAMPLES: ",syn[0].examples())
NAME:  ghost.n.01
DEFINITION:  a mental representation of some haunting experience
EXAMPLES:  ['he looked like he had seen a ghost', 'it aroused specters from his past']
In [7]:
print(syn[0].lemmas(lang='ita'))
print(syn[0].hypernyms())
[Lemma('ghost.n.01.fantasma'), Lemma('ghost.n.01.ombra'), Lemma('ghost.n.01.spettro'), Lemma('ghost.n.01.spirito')]
[Synset('apparition.n.03')]
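
Synsets offer more to explore: .lemma_names() lists the member words of a synset for a given language, and .hypernym_paths() returns every chain of increasingly general synsets linking the root of the hierarchy to the synset. A minimal sketch building on the synset above:

from nltk.corpus import wordnet

ghost = wordnet.synset('ghost.n.01')
print(ghost.lemma_names())        # English lemmas of the synset
print(ghost.lemma_names('ita'))   # the same synset's Italian lemmas

# Every path from the root concept down to this synset
for path in ghost.hypernym_paths():
    print(' -> '.join(s.name() for s in path))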

nltk.collocations

Collocations are words that co-occur within a sentence more often than chance alone would predict, reflecting established usage patterns of the language.

A bigram is a sequence of two adjacent elements. The following code ranks the bigrams of a given corpus (here webtext) by their collocation score, so that, after stopword filtering, the bigrams whose words co-occur most strongly appear nearest the top.

In [8]:
from nltk.collocations import BigramCollocationFinder 
from nltk.metrics import BigramAssocMeasures


stopset = set(stopwords.words('english'))
filter_stops = lambda w: len(w) < 3 or w in stopset

# Load all words of webtext corpus
words = [w.lower() for w in webtext.words()] 

# Creating a BigramCollocationFinder object for later use
# from the previously created corpus words list
bigramColloc = BigramCollocationFinder.from_words(words)

# Filtering stopwords out from the BigramCollocationFinder object
bigramColloc.apply_word_filter(filter_stops)

# Get the top 20 bigrams of the (already filtered) corpus, ranked by
# the likelihood-ratio association measure, i.e. how strongly the two
# words are associated with each other
bigramColloc.nbest(BigramAssocMeasures.likelihood_ratio, 20)
Out[8]:
[('jack', 'sparrow'),
 ('teen', 'girl'),
 ('new', 'york'),
 ('teen', 'boy'),
 ('download', 'manager'),
 ('elizabeth', 'swann'),
 ('http', '://'),
 ('top', '***'),
 ('new', 'tab'),
 ('context', 'menu'),
 ('address', 'bar'),
 ('print', 'preview'),
 ('davy', 'jones'),
 ('little', 'boy'),
 ('mozilla', 'firebird'),
 ('bookmarks', 'toolbar'),
 ('little', 'girl'),
 ('location', 'bar'),
 ('flying', 'dutchman'),
 ('old', 'man')]
In [9]:
# Here, instead, bigrams are ordered according to their frequency
# in the corpus
bigramColloc.nbest(BigramAssocMeasures.raw_freq, 20)
Out[9]:
[('teen', 'girl'),
 ('jack', 'sparrow'),
 ('teen', 'boy'),
 ('new', 'tab'),
 ('download', 'manager'),
 ('new', 'york'),
 ('little', 'girl'),
 ('top', '***'),
 ('little', 'boy'),
 ('address', 'bar'),
 ('context', 'menu'),
 ('bookmarks', 'toolbar'),
 ('old', 'man'),
 ('elizabeth', 'swann'),
 ('new', 'window'),
 ('http', '://'),
 ('mozilla', 'firebird'),
 ('drunk', 'guy'),
 ('location', 'bar'),
 ('old', 'lady')]
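
likelihood_ratio and raw_freq are only two of the association measures in BigramAssocMeasures; pointwise mutual information (pmi) is another common choice, though it over-rewards rare pairs, so a frequency floor via apply_freq_filter() is usually applied first. A minimal, self-contained sketch (the cutoff of 3 is an arbitrary assumption):

from nltk.corpus import webtext, stopwords
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

stopset = set(stopwords.words('english'))
filter_stops = lambda w: len(w) < 3 or w in stopset

words = [w.lower() for w in webtext.words()]
finder = BigramCollocationFinder.from_words(words)
finder.apply_word_filter(filter_stops)
finder.apply_freq_filter(3)  # drop bigrams occurring fewer than 3 times
print(finder.nbest(BigramAssocMeasures.pmi, 20))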