Extracting key terms (concepts) from a textbook

Arati Santhanakrishnan
May 1, 2019

Note: The code for the complete analysis below can be found on my github page here.

This is the second post in a series describing my experiments in knowledge extraction from textbooks. I was excited to see how many out-of-the-box tools already exist for processing text, and how many online resources there are for learning to use them. A very useful resource for starting out was Natural Language Processing with Python, by Steven Bird, Ewan Klein and Edward Loper.

My first raw text sources were from electronic textbooks I had purchased and converted to text format. I had been meaning to get to studying them but was either daunted by the steep learning curve or bored by the repetition of concepts I already understood in the first few chapters. To get a sense of structure, I decided to extract the key terms and sort them by some metrics.

Preliminary observations

After loading the book Cognitive Load Theory by Sweller, Ayres and Kalyuga in text format as a string, and removing page numbers, contents and index sections, I looked at the frequency distributions of words in the text.

# Calculate frequency distribution of words in text
import nltk
from nltk import Text
from nltk import word_tokenize
text1 = Text(word_tokenize(rawtxt))
fdist = nltk.FreqDist(text1)
fdist.most_common(10)
--------------------------------------------------------------------
[('the', 4912),
(',', 4815),
('.', 4654),
('of', 3330),
('to', 3020),
('and', 2347),
('a', 2242),
('in', 1956),
('that', 1583),
('be', 1289)]

This is not very useful. Most of these terms are too common in English to reveal anything about the content of the book. NLTK stores a list of such ‘stopwords’ as a corpus. After removing words such as ‘a’, ‘and’ and ‘to’, along with non-alphanumeric tokens like commas and periods, we get the following:

# remove stop words and non-alphanumeric tokens; build
# frequency distribution
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
text1 = [word for word in text1 if word.lower() not in stop_words and word.isalnum()]
fdist = nltk.FreqDist(text1)
fdist.most_common(10)
--------------------------------------------------------------------
[('information', 1004),
('cognitive', 1002),
('load', 995),
('memory', 673),
('learning', 644),
('knowledge', 598),
('effect', 591),
('learners', 529),
('working', 457),
('may', 435)]

These words give us much more information. Now we know the book has something to do with information, cognitive load, memory and learning.

Next, I looked at the collocations in the text. Collocations are sequences of words that occur together more often than would be expected from the frequency of each word on its own. The NLTK collocations() function returns bigrams by default.

# Display 20 most common collocations
from nltk import Text
from nltk import word_tokenize
text1 = Text(word_tokenize(rawtxt))
text1.collocations(20) # displays the top 20 bigrams where the words occur together relatively frequently.
--------------------------------------------------------------------
cognitive load; working memory; element interactivity; worked
examples; long-term memory; intrinsic cognitive; extraneous cognitive;
interacting elements; load theory; worked example; van Merriënboer;
problem solving; expertise reversal; human cognition; modality effect;
biologically primary; environmental organising; reversal effect;
primary knowledge; linking principle

Already, we see some terms that represent key concepts in the text. But we also see incomplete terms such as ‘biologically primary’ and ‘environmental organising’. While promising, this method misses longer n-gram concepts and lower-frequency terms. To extract all concepts, whatever their frequency, I started exploring noun-phrase chunking. This required a few pre-processing steps: sentence splitting, word tokenization and part-of-speech tagging.

Sentence splitting

The first step was to split the book into sentences. One heuristic would be to split on periods, but periods also appear in abbreviations and ellipses (‘…’). NLTK has a built-in pretrained sentence tokenizer; however, it split sentences at abbreviations such as ‘Dr.’ and ‘et al.’ occurring mid-sentence. A snippet from NLP for hackers helped. NLTK’s Punkt module includes an unsupervised trainer that identifies common abbreviations by processing text from a similar domain, and then ignores the full stops that follow them. It also uses other heuristics, such as whether the word after a period is a likely sentence starter and whether the word before it is an initial. Here are the results of processing the textbook Cognitive Load Theory.

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer
trainer = PunktTrainer()
trainer.INCLUDE_ALL_COLLOCS = True  # include collocations that contain a period
trainer.train(rawtxt)  # rawtxt is the string read from the text file
tokenizer = PunktSentenceTokenizer(trainer.get_params())
print(tokenizer._params.abbrev_types)
--------------------------------------------------------------------
{'fe', 'e.g', 'al', 'g01', 'fig', 'p', 'mrs', 'i.e', 'f', 'etc', 'h2o'}

Several abbreviations have been captured. The third item in the set above, ‘al’, comes from ‘et al.’, a very common abbreviation in scientific literature, short for the Latin phrase et alia, which means “and others.”

If any abbreviations were missed, we can add them explicitly:

tokenizer._params.abbrev_types.add('dr')

To test the tokenizer out, we can feed it a sample blurb I made up.

blurb = tokenizer.tokenize('This method is derived from the one described by Linvingstone et al., but a significant difference is the attention to subjectivity. The second method is however followed exactly from the one described by David et al. Both methods were successful.')
print(blurb)
--------------------------------------------------------------------
['This method is derived from the one described by Linvingstone et al., but a significant difference is the attention to subjectivity.', 'The second method is however followed exactly from the one described by David et al.', 'Both methods were successful.']

The above blurb was split correctly, whether the abbreviation occurred at the end of a sentence or in the middle.
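The trained tokenizer can then split the whole book into a list of sentences, which is presumably where the sents variable used in the later chunking examples comes from. A minimal sketch:

# Split the full book text into sentences with the trained tokenizer
sents = tokenizer.tokenize(rawtxt)
print(len(sents), 'sentences')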

Word tokenization / part-of-speech tagging

Once the raw text was split into sentences, they were further split into tokens. Tokens are the building blocks of text data, and can represent a word, number or any atomic unit found in text.

The built-in NLTK part-of-speech tagger was used to annotate each token with a part-of-speech tag from the Penn Treebank tag set.

# pos tagger
print('Sentence:',sentence)
from nltk import word_tokenize, pos_tag
tokens = word_tokenize(sentence)
print(tokens)
tags = pos_tag(tokens)
print(tags)
--------------------------------------------------------------------
Sentence: Instruction needs to consider the limitations of working memory so that information can be stored effectively in long-term memory.
['Instruction', 'needs', 'to', 'consider', 'the', 'limitations', 'of', 'working', 'memory', 'so', 'that', 'information', 'can', 'be', 'stored', 'effectively', 'in', 'long-term', 'memory', '.']
[('Instruction', 'NN'), ('needs', 'VBZ'), ('to', 'TO'), ('consider', 'VB'), ('the', 'DT'), ('limitations', 'NNS'), ('of', 'IN'), ('working', 'VBG'), ('memory', 'NN'), ('so', 'RB'), ('that', 'IN'), ('information', 'NN'), ('can', 'MD'), ('be', 'VB'), ('stored', 'VBN'), ('effectively', 'RB'), ('in', 'IN'), ('long-term', 'JJ'), ('memory', 'NN'), ('.', '.')]

Part-of-speech tagging makes it possible to divide a sentence into subjects, objects and the links between them. The subjects and objects are found among the nouns or noun phrases (sequences of words built around a noun), while the other parts of speech indicate the relationships between them.

Noun phrase extraction

I looked at TextBlob, a Python library for common NLP tasks. TextBlob objects can be manipulated much like strings in Python.

from textblob import TextBlob
blob = TextBlob(sentence)
print('Sentence: ',sentence)
print(blob.noun_phrases)
--------------------------------------------------------------------
Sentence: Instruction needs to consider the limitations of working memory so that information can be stored effectively in long-term memory.
['instruction', 'long-term memory']

The terms ‘instruction’ and ‘long-term memory’ were pulled out, but what about ‘working memory’ and ‘information’? While searching for more granular control over the extracted phrases, I came across the regular-expression noun phrase chunker.

Chunking

Chunks are multi-token sequences that divide a sentence syntactically. The regular-expression chunker uses a grammar rule written much like a Python string regular expression, but matching part-of-speech tags instead of characters. The chunker splits a tagged sentence into a tree according to those patterns. Below is an example:

from nltk.chunk.util import *
from nltk.chunk.regexp import *
# create grammar rule dictionary: single word noun phrases
chunkrules = {}
chunkrules['NP1'] = r"""
NP1: {<NN.?>}
"""
# create regular expression parser
cp = RegexpParser(chunkrules['NP1'])
# create parsed tree of sentence
J = cp.parse(tags)
print(sentence)
print(J)
--------------------------------------------------------------------
'Instruction needs to consider the limitations of working memory so that information can be stored effectively in long-term memory.'
(S
(NP1 Instruction/NN)
needs/VBZ
to/TO
consider/VB
the/DT
(NP1 limitations/NNS)
of/IN
working/VBG
(NP1 memory/NN)
so/RB
that/IN
(NP1 information/NN)
can/MD
be/VB
stored/VBN
effectively/RB
in/IN
long-term/JJ
(NP1 memory/NN)
./.)

The parser above created a tree with a branch for each phrase of type NP1 (single-word noun phrases), alongside the other tagged tokens. Below is a portion of the tree; the NP1-labelled chunks are grouped together one level below the root.

Tree generated from regular expression chunker
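The NP1 chunks can also be listed directly from the parsed tree using NLTK’s Tree API; a quick sketch:

# Print each NP1 subtree found in the parse tree J
for subtree in J.subtrees(filter=lambda t: t.label() == 'NP1'):
    print(subtree)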

A simple function that parses a tagged sentence against a grammar rule and collects the matching phrases is below. The tree has only two levels of branching, so the solution needn’t be recursive.

sentence = 'Instruction needs to consider the limitations of working memory so that information can be stored effectively in long-term memory.'
# Part of speech tagging
tags = pos_tag(word_tokenize(sentence))

# Take a grammar rule key and a tagged sentence, and return the phrases
# matching that rule as a list
def chunk_this(grammar_rule_key, sentence_tags):
    phraselist = []
    cp = nltk.RegexpParser(chunkrules[grammar_rule_key])  # parser
    J = cp.parse(sentence_tags)                           # parsed tree
    for i in range(len(J)):
        if not isinstance(J[i], tuple):  # found a subtree
            if J[i].label() == grammar_rule_key:
                phraselist.append(' '.join([J[i][j][0] for j in range(len(J[i]))]))
    phraselist = [phrase.lower() for phrase in phraselist]
    return list(set(phraselist))  # return unique phrases

chunk_this('NP1', tags)
--------------------------------------------------------------------
['limitations', 'memory', 'instruction', 'information']

This returned all the single-word concepts from a sentence in a convenient list.

What if we wanted to pull out more complex terms? I played around with more rules based on part-of-speech tags, and I admit I got a little carried away here!

# Noun phrases with up to 4 noun words
chunkrules['NP2'] = r"""
NP2: {<NN.*>{1,4}}
"""
sentence = sents[100]
print('Sentence: ', sents[100])
# Part of speech tagging
tags = pos_tag(word_tokenize(sentence))
# Extract phrases matching the NP2 rule
chunk_this('NP2', tags)
--------------------------------------------------------------------
Sentence: It is general because, in contrast to the structures that process primary information, the secondary processing engine is capable of processing a wide range of information categories.
['contrast',
'information',
'processing engine',
'range',
'structures',
'information categories']

This gives us more specific concept terms, but maybe clubbing adverbs, adjectives and multiple noun words would complete the noun phrase?

chunkrules['JJNP'] = r"""    
JJNP: {<RB.*>?<J.*>?<NN.*>{1,}}
"""
chunk_this('JJNP',tags)
--------------------------------------------------------------------
['secondary processing engine',
'contrast',
'structures',
'primary information',
'wide range',
'information categories']

Much better! I looked at other, more complex groupings.

‘The queen of England’ grouping: when two nouns are separated by a preposition or subordinating conjunction

chunkrules['NPINNP'] = r"""    
NPINNP: {<DT>?<J.*>?<NN.*>{1,}<TO>?<IN><DT>?<PRP.*>?<J.*>?<NN.*>{1,}}
"""
--------------------------------------------------------------------
Example 1:
Sentence: It is general because, in contrast to the structures that process primary information, the secondary processing engine is capable of processing a wide range of information categories.
['a wide range of information categories']
--------------------------------------------------------------------
Example 2:
Sentence: The cat in the hat sat on the mat.
['the cat in the hat']
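These example phrases can be reproduced with the same chunk_this helper as before, for instance:

sentence = 'The cat in the hat sat on the mat.'
chunk_this('NPINNP', pos_tag(word_tokenize(sentence)))
# -> ['the cat in the hat']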

‘Happy and yellow sunflowers’ grouping: when two adjectives describing a noun are separated by a conjunction

chunkrules['JJCCJJNP'] = r"""    
JJCCJJNP: {<J.*>{1,}<CC><J.*>?<NN.*>*}
"""
--------------------------------------------------------------------
Sentence: There are no hard and fast rules to follow in this science.
['hard and fast rules']

‘Number with units’ grouping:

Example: ‘thirty people’

chunkrules['NUMWU'] = r"""    
NUMWU: {<CD>{1,}<JJ>?<NN.*>*}
"""
--------------------------------------------------------------------
Sentence: The five natural information processing system principles described in Chapters 2–4 will be shown to apply equally to biological evolution and human cognition.
['2–4', 'five natural information processing system principles']

There was a trade-off between concept-phrase precision and recall, and in the end I decided to stick with grouping adverbs, adjectives and nouns (the ‘JJNP’ rule described above). Satisfied with my rule-based noun-phrase extractor, I processed some sample sentences and found that terms like ‘memory’ and ‘memories’ were catalogued as different concepts. To avoid this, I next looked at stemming and lemmatization.

Stemming and Lemmatization

Stemming is the process of removing affixes to get a base form of a word. NLTK has Porter and Lancaster stemmers built in, both following their own sets of rules for quickly stripping affixes. I fed some sample words in to compare their performance.

porter = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()

Both the Porter and Lancaster stemmers can remove simple plural suffixes. The Porter stemmer keeps some distinction between ‘dance’ and ‘dancer’, but the Lancaster stemmer reduces them all to the root ‘dant’.
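A minimal sketch of that comparison (the printed stems depend on each stemmer’s rules):

# Compare the two stemmers on a few related words
for word in ['dances', 'dancer', 'dancing']:
    print(word, '->', porter.stem(word), '/', lancaster.stem(word))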

What about other types of plurals?
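Irregular plurals are a useful stress test, since neither stemmer has a dictionary to map them back to their singular forms. A sketch with a few illustrative words (my choices, not necessarily the ones in the original comparison):

# Irregular plurals: pure suffix-stripping has no dictionary to fall back on
for word in ['geese', 'radii', 'children']:
    print(word, '->', porter.stem(word), '/', lancaster.stem(word))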

It looks like some unusual cases are not encoded. What about different meanings for the same word depending on the context?
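To see this, we can run ‘leaves’, ‘leaf’ and ‘leaven’ through both stemmers (a sketch):

# 'leaves' is ambiguous (noun or verb) and collides with the unrelated 'leaven'
for word in ['leaves', 'leaf', 'leaven']:
    print(word, '->', porter.stem(word), '/', lancaster.stem(word))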

In the above example, ‘leaves’ can be a plural noun or a verb, but ‘leaf’ and ‘leav’ end up as different roots. The Lancaster stemmer also reduces the unrelated word ‘leaven’ to the same root as ‘leaves’, which is no good either. This is when I started looking into lemmatization using the WordNet dictionary. The WordNet lemmatizer reduces a word to a different base lemma depending on its part-of-speech tag, but is somewhat slower than the stemmers.

wnl = nltk.WordNetLemmatizer()
print(wnl.lemmatize('leaves','n'))
print(wnl.lemmatize('leaves','v'))
print(wnl.lemmatize('radii','n'))
print(wnl.lemmatize('radiuses','n'))
--------------------------------------------------------------------
leaf
leave
radius
radius

To prevent duplication of concepts, lemmatization is done after part-of-speech tagging and before chunking.
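Concretely, that step can be sketched as below. The helper names lemmatize_tags and to_wordnet_pos are illustrative, not from the original code; the mapping from Penn Treebank tags to WordNet POS labels is the standard one.

from nltk import pos_tag, word_tokenize
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()

def to_wordnet_pos(treebank_tag):
    # Map a Penn Treebank tag to the POS label the WordNet lemmatizer expects
    if treebank_tag.startswith('J'):
        return 'a'   # adjective
    if treebank_tag.startswith('V'):
        return 'v'   # verb
    if treebank_tag.startswith('R'):
        return 'r'   # adverb
    return 'n'       # default to noun

def lemmatize_tags(sentence):
    # Tag first, then lemmatize each word using its tag, keeping the tag for chunking
    tags = pos_tag(word_tokenize(sentence))
    return [(wnl.lemmatize(word, to_wordnet_pos(tag)), tag) for word, tag in tags]

# e.g. chunk_this('JJNP', lemmatize_tags('Schemas are held as long-term memories.'))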

Results

Armed with the results of the above experiments, I validated the approach by using a combination of chunking and lemmatization to extract the main concepts from the Cognitive Load Theory textbook. Below are the 20 most frequently occurring of the 6,451 concepts extracted.
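The tallying step amounts to running the chunker over every sentence and counting the phrases, roughly as in the sketch below (sents, chunk_this and lemmatize_tags come from the earlier snippets; counting per-sentence occurrences is one of several reasonable choices, not necessarily the exact one used for the figure).

from collections import Counter

concept_counts = Counter()
for sent in sents:
    # chunk_this returns the unique phrases in a sentence, so this counts
    # the number of sentences each concept appears in
    concept_counts.update(chunk_this('JJNP', lemmatize_tags(sent)))

print(concept_counts.most_common(20))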

This is good progress. In my next post, I will talk about my experiments with constructing a rudimentary concept map that links related concepts, and calculate a number of metrics to decide on the importance of a concept to someone studying from a textbook. Thanks for reading!
