Skip to content

ginnyqg/natural-language-processing

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

69 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Spring 2018

Project 1: Spooky Data Analysis 🎃 👻

Project Description

This is the first and only individual (as opposed to team) this semester.

  • Projec title: Who wrote these spooky texts?

  • This project is conducted by Ginny Gao, Columbia UNI: qg2158

  • Project summary: This project studies texts from 3 popular horror authors (1800s to early 1900s): Edgar Allan Poe (EAP), HP Lovecraft (HPL), and Mary Wollstonecraft Shelley (MWS). It explores authors' writing styles in amount of words, use of punctuation - specifically question marks, compares and contrasts positive and negative emotions in authors' writing content via sentiment analysis.

1. Who write how many texts in the dataset?

Pie chart shows composition of author's texts in the spooky dataset.

image

It appears that most texts belong to EAP (40.3%), followed by MWS (30.9%) and HPL (28.8%). MWS and HPL have about same number of texts in this dataset.

2.1 Do some authors use more questions in the texts than others?

  • Waffle chart presents the comparison of use of questions in texts among 3 authors.
  • Questions are identified with "?" mark in the sentences.

image

Not suprisingly, EAP uses most number of questions in his texts, as EAP has more texts than others in the dataset. Notice MWS and HPL have approximately same number of texts in the dataset from study above, but MWS uses more than twice as many questions in her texts than HPL does. This can be a key identifier to differentiate MWS and HPL's texts.

2.2 Are there any differences in authors' use of questions in sentences, compare to their total volume of texts in the dataset?

Bubble chart illustrates the similarities and differences in these quantities for each author.

image

EAP writes most texts in the dataset, and he also has largest volume of words. Interestingly for MWS, though she doesn't have many texts or nearly as many words as EAP does in the dataset, she uses many questions in her texts. Based on the slopes of where the 3 bubbles are in the chart, MWS uses most questions in her texts compare to the other 2. HPL writes the least in the group, and his use of questions is about the same as EAP.

You may interact with the chart here. Might need to download it as .html file first: right click Download button, click Save Link As..., and change the file type to .html.

2.3 How about bigrams in the text? What two words, less stop words appear together most often in the spooky dataset?

The network digram shows what are the top bigrams, and the relationship of these top bigrams with each other.

image

3.1 How do sentiments compare in these authors' writings?

Pyramid chart displays positive and negative emotional content in different authors' texts.

image

There are 3 widely used lexicons. bing lexicon classifies words into positive or negative categories. Since lexicons stay more concurrent with modern language usage, I am interested in exploring the emotional content of these 3 authors (1800s to early 1900s) in general. Thus, bing lexicon is chosen for this sentiment analysis.

From the pyramid chart, we can see that all 3 authors use words with negative emotions in their texts more than positive ones. After all, they are horror authors! While MWS leads the total number of emotional words in her texts compare to the other 2 authors, her negative to positive words ratio is about the same as EAP's. HPL, on the other hand, uses twice as many negative words than positive words in his writings.

3.2 What words that carry emotional content do the authors use most in their texts?

Word comparison cloud shows top 100 positive and negative words used by the authors.

image

  • From the comparison cloud, it is also represented that authors use more positive than negative words in the top 100 word list.

  • 'like', 'great', 'well', 'good', 'love' lead the top positive words, while 'death', 'strange', 'fear', 'dark' are the top negative words. As the size of the words shows, these top positive words appear more often in the authors' texts than the negative words do.

Then, I wonder...

3.3 Is there any pattern in use of sentiments and use of questions in the sentences from these authors?

I plotted the bubble chart on these 3 quantities to find out.

image

The bubble chart confirms with the pyramid chart that MWS's texts have the most positive and negative words, and HPL has the least amount of words carry sentiments, though HPL is much more likely to choose a negative word than a positive one when he uses words that carry emotional content. MWS and EAP utilize questions in their texts about 2.5 to 3 times as much as HPL does.

Appendix

About

spring2018-project1-ginnyqg created by GitHub Classroom

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • R 100.0%