
Problem 2:

In this particular project, we are going to work on the inaugural corpus from NLTK in
Python. We will be looking at the following inaugural speeches of Presidents of the United
States of America:
1. President Franklin D. Roosevelt in 1941
2. President John F. Kennedy in 1961
3. President Richard Nixon in 1973
• Find the number of characters, words and sentences for the mentioned documents.
• Remove all the stopwords from all three speeches.
• Which word occurs the most number of times in each president's inaugural address? Mention the top three words. (after removing the stopwords)
• Plot the word cloud of each of the speeches. (after removing the stopwords)

Code Snippet to extract the three speeches:


"
import nltk
nltk.download('inaugural')
from nltk.corpus import inaugural
inaugural.fileids()
inaugural.raw('1941-Roosevelt.txt')
inaugural.raw('1961-Kennedy.txt')
inaugural.raw('1973-Nixon.txt')
"
Introduction:
NLTK provides everything from splitting paragraphs into sentences, splitting sentences into
words, and identifying parts of speech, to highlighting themes and even helping your machine
understand what the text is about.
Q1. Find the number of characters, words and sentences for the mentioned documents.
Answer: We import the nltk library and use inaugural.fileids() to access the speeches.

After loading the text files, we first count the total number of characters in each file
separately. Below is the code to count the characters in each file, with the output shown in
the screenshot.
# Number of characters in each text file:
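As a minimal sketch of this step (assuming the corpus was downloaded as in the snippet above), len() on the raw speech string gives the character count:

from nltk.corpus import inaugural

# character count: length of the raw speech text
for f in ['1941-Roosevelt.txt', '1961-Kennedy.txt', '1973-Nixon.txt']:
    print(f, 'characters:', len(inaugural.raw(f)))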

# Number of words in each text file:


Below we count the total number of words in each file separately.
We use split() to break the text into individual words on the spaces between them, and
len() to count the resulting words.
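A minimal sketch of this word count, again assuming the speeches are read from the inaugural corpus:

from nltk.corpus import inaugural

# word count: split the raw text on whitespace and count the pieces
for f in ['1941-Roosevelt.txt', '1961-Kennedy.txt', '1973-Nixon.txt']:
    words = inaugural.raw(f).split()
    print(f, 'words:', len(words))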
Output :

# Number of sentences in each text file:
Below we count the total number of sentences in each text file using a lambda function.
We load the data into a pd.DataFrame as a dictionary and then, with the lambda function,
check each token that ends with "." using the endswith() function. The code and output are
shown below.
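A minimal sketch of this sentence count, following the approach described above (the use of pandas and the file list are assumptions; counting tokens that end with "." is only a rough sentence count, since abbreviations also end with a period):

import pandas as pd
from nltk.corpus import inaugural

files = ['1941-Roosevelt.txt', '1961-Kennedy.txt', '1973-Nixon.txt']
# load the speeches into a DataFrame via a dictionary
df = pd.DataFrame({'speech': {f: inaugural.raw(f) for f in files}})
# count every token that ends with "." as one sentence
df['sentences'] = df['speech'].apply(
    lambda t: sum(1 for w in t.split() if w.endswith('.')))
print(df['sentences'])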

Q2. Remove all the stopwords from all the three speeches.

Answer: We use the imports from nltk.corpus import stopwords and
from nltk.tokenize import word_tokenize.
The stopword list gives us the predefined English stopwords to remove from each text file
separately, and word_tokenize splits each text into individual words so that every stopword
can be filtered out of the text.
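A minimal sketch of the stopword removal (lower-casing the text and keeping only alphabetic tokens are assumptions on my part, not stated in the report):

import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import inaugural, stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
filtered = {}
for f in ['1941-Roosevelt.txt', '1961-Kennedy.txt', '1973-Nixon.txt']:
    tokens = word_tokenize(inaugural.raw(f).lower())
    # keep only alphabetic tokens that are not English stopwords
    filtered[f] = [w for w in tokens if w.isalpha() and w not in stop_words]
    print(f, 'tokens after stopword removal:', len(filtered[f]))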

Q3. Which word occurs the most number of times in his inaugural address for each
president? Mention the top three words. (after removing the stopwords)

Answer: We have already removed the stopwords in the previous code using the stopwords list.
Now we loop over the remaining words and count the total number of occurrences of each one.
From the Roosevelt file we see that the words below are the ones used most often in the
president's speech.
Top 3 words: Nation, Know, Spirit.
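As a sketch of this counting step, collections.Counter (a substitute for the explicit loop described above) can tally the stopword-free tokens from the previous sketch:

from collections import Counter

# 'filtered' is the dict of stopword-free token lists from the previous sketch
for f, tokens in filtered.items():
    print(f, 'top 3 words:', Counter(tokens).most_common(3))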

In Kennedy's speech we see the top 3 words in the output above.

In Nixon's speech we see the top 3 words in the output below.

Q4. Plot the word cloud of each of the speeches. (after removing the stopwords)
Answer: Word Cloud is a data visualization technique used for representing text data in
which the size of each word indicates its frequency or importance. Significant textual data
points can be highlighted using a word cloud. Word clouds are widely used for analysing data
from social network websites.
Here we create the word cloud for the Roosevelt speech after importing the wordcloud library.
We also make sure to remove any unwanted substrings appearing in the filtered data.
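A minimal sketch of this word cloud step (the wordcloud and matplotlib libraries are assumed to be installed; 'filtered' is the dict of stopword-free tokens from the earlier sketch):

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# join the filtered tokens back into one string and build the cloud
text = ' '.join(filtered['1941-Roosevelt.txt'])
wc = WordCloud(width=800, height=400, background_color='white').generate(text)
plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()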

Output: the image is cut off because it could not be captured in a single screenshot; the full
word cloud renders correctly in the Python output.

Word cloud for Kennedy speech

Output for Kennedy speech:

Word cloud for Nixon speech

