Spring 2018

Project 1: Spooky Data Analysis 🎃 👻

Project Description

This is the first and only individual (as opposed to team) project this semester.

Project title: Who wrote these spooky texts?
This project is conducted by Ginny Gao, Columbia UNI: qg2158
Project summary: This project studies texts from 3 popular horror authors (1800s to early 1900s): Edgar Allan Poe (EAP), HP Lovecraft (HPL), and Mary Wollstonecraft Shelley (MWS). It explores the authors' writing styles in terms of the amount of words and the use of punctuation (specifically question marks), and it compares and contrasts positive and negative emotions in the authors' writing via sentiment analysis.

1. Who wrote how many texts in the dataset?

The pie chart shows the composition of the authors' texts in the spooky dataset.

Most texts belong to EAP (40.3%), followed by MWS (30.9%) and HPL (28.8%). MWS and HPL have about the same number of texts in this dataset.
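The shares above come down to counting author labels and dividing by the total. The analysis itself is written in R (see the Appendix); what follows is only a minimal Python sketch of that arithmetic. The function name `author_shares` and the toy labels are assumptions for illustration and are not taken from the project's code or data.

```python
from collections import Counter

def author_shares(labels):
    """Return each author's share of the texts as a percentage."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {author: 100.0 * n / total for author, n in counts.items()}

# Made-up labels for illustration only, not the real spooky dataset.
labels = ["EAP", "EAP", "HPL", "MWS"]
shares = author_shares(labels)

for author, share in sorted(shares.items()):
    print(f"{author}: {share:.1f}%")
```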

2.1 Do some authors use more questions in the texts than others?

The waffle chart compares the use of questions in texts among the 3 authors.
Questions are identified by the "?" mark in the sentences.

Not surprisingly, EAP uses the largest number of questions in his texts, as EAP has more texts than the others in the dataset. Notice that MWS and HPL have approximately the same number of texts in the dataset (see the study above), but MWS uses more than twice as many questions in her texts as HPL does. This can be a key identifier to differentiate MWS's and HPL's texts.
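As a rough illustration of the "?" rule described above, here is a minimal Python sketch. It assumes each row is an (author, sentence) pair and counts a sentence as a question if it contains at least one "?". The actual analysis is in R and may count differently (for example, per question mark), and the sample sentences are made up.

```python
from collections import Counter

def count_questions(rows):
    """Count, per author, the sentences that contain at least one '?'."""
    counts = Counter()
    for author, text in rows:
        if "?" in text:
            counts[author] += 1
    return counts

# Made-up sentences for illustration only, not quotes from the dataset.
rows = [
    ("MWS", "Why did I not die?"),
    ("MWS", "The night was calm."),
    ("HPL", "The thing could not be described."),
    ("EAP", "Who shall say where the one ends?"),
]
question_counts = count_questions(rows)
print(question_counts)
```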

2.2 Are there any differences in the authors' use of questions in sentences, compared to their total volume of texts in the dataset?

The bubble chart illustrates the similarities and differences in these quantities for each author.

EAP writes the most texts in the dataset, and he also has the largest volume of words. Interestingly, though MWS has neither as many texts nor nearly as many words as EAP in the dataset, she uses many questions in her texts. Based on the slopes implied by where the 3 bubbles sit in the chart, MWS uses the most questions in her texts compared to the other 2. HPL writes the least in the group, and his use of questions is about the same as EAP's.

You may interact with the chart here. You might need to download it as an .html file first: right-click the Download button, click Save Link As..., and change the file type to .html.
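One way to read the slope comparison is as a rate: questions per word written. The sketch below is a hedged Python illustration of that normalization, not the project's R code. The function name `question_rate`, the whitespace-based word count, and the sample rows are all assumptions.

```python
def question_rate(rows):
    """Return question marks per word for each author.

    Words are counted by splitting on whitespace, which is a simplification.
    """
    totals = {}
    for author, text in rows:
        words, questions = totals.get(author, (0, 0))
        totals[author] = (words + len(text.split()), questions + text.count("?"))
    return {author: q / w for author, (w, q) in totals.items() if w > 0}

# Made-up rows for illustration only.
rows = [
    ("A", "Who is there?"),
    ("A", "No one answered me"),
    ("B", "It was dark"),
]
rates = question_rate(rows)
print(rates)
```

Comparing rates rather than raw counts keeps an author with many texts from looking question-heavy simply because of volume.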

2.3 How about bigrams in the text? Which two words, excluding stop words, appear together most often in the spooky dataset?

The network diagram shows the top bigrams and how these top bigrams relate to each other.
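To make the bigram step concrete, here is a minimal Python sketch that pairs adjacent words and drops any pair containing a stop word. It is an illustration under stated assumptions, not the project's R code: the stop-word list is a tiny hypothetical one, the tokenizer is a simple regular expression, and the sample texts are made up.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the list used in the analysis).
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "was", "it", "i"}

def top_bigrams(texts, n=10):
    """Return the n most common adjacent word pairs that contain no stop word."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        for first, second in zip(words, words[1:]):
            if first not in STOP_WORDS and second not in STOP_WORDS:
                counts[(first, second)] += 1
    return counts.most_common(n)

# Made-up texts for illustration only.
texts = ["The old house was dark", "An old house of stone"]
best = top_bigrams(texts, n=1)
print(best)
```

Pairs are formed before filtering, so two words separated by a stop word are not treated as a bigram.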

3.1 How do sentiments compare in these authors' writings?

The pyramid chart displays the positive and negative emotional content in the different authors' texts.

There are 3 widely used lexicons. The bing lexicon classifies words into positive or negative categories. Since lexicons tend to reflect modern language usage, and I am interested in exploring the emotional content of these 3 authors (1800s to early 1900s) only in general terms, the bing lexicon is chosen for this sentiment analysis.

From the pyramid chart, we can see that all 3 authors use words with negative emotions in their texts more than positive ones. After all, they are horror authors! While MWS leads in the total number of emotional words in her texts compared to the other 2 authors, her ratio of negative to positive words is about the same as EAP's. HPL, on the other hand, uses twice as many negative words as positive words in his writings.
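The lexicon lookup and the negative-to-positive ratio discussed above can be sketched as follows. This is a hedged Python illustration, not the project's R code: `TOY_LEXICON` is a six-word stand-in built from words mentioned in this README (the real bing lexicon is far larger), and the function names and sample row are assumptions.

```python
import re
from collections import Counter

# Toy stand-in for a positive/negative lexicon; for illustration only.
TOY_LEXICON = {
    "love": "positive", "good": "positive", "great": "positive",
    "death": "negative", "fear": "negative", "dark": "negative",
}

def sentiment_counts(rows, lexicon=TOY_LEXICON):
    """Tally positive and negative lexicon words per author."""
    counts = {}
    for author, text in rows:
        tally = counts.setdefault(author, Counter())
        for word in re.findall(r"[a-z']+", text.lower()):
            label = lexicon.get(word)
            if label is not None:
                tally[label] += 1
    return counts

def negative_to_positive_ratio(tally):
    """Return negative / positive word counts, or infinity if no positive words."""
    if tally["positive"] == 0:
        return float("inf")
    return tally["negative"] / tally["positive"]

# Made-up row for illustration only.
rows = [("X", "Fear and death in the dark, yet love endures")]
tallies = sentiment_counts(rows)
ratio = negative_to_positive_ratio(tallies["X"])
print(tallies["X"], ratio)
```

A ratio near 2 would correspond to the "twice as many negative words as positive words" pattern described for HPL.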

3.2 What words that carry emotional content do the authors use most in their texts?

The word comparison cloud shows the top 100 positive and negative words used by the authors.

The comparison cloud also shows that, within the top 100 word list, the authors use more positive than negative words.
'like', 'great', 'well', 'good', 'love' lead the top positive words, while 'death', 'strange', 'fear', 'dark' are the top negative words. As the size of the words shows, these top positive words appear more often in the authors' texts than the negative words do.

Then, I wonder...

3.3 Is there any pattern in the use of sentiments and the use of questions in the sentences from these authors?

I plotted a bubble chart of these 3 quantities to find out.

The bubble chart agrees with the pyramid chart: MWS's texts have the most positive and negative words, and HPL has the fewest words that carry sentiment, though HPL is much more likely to choose a negative word than a positive one when he uses words that carry emotional content. MWS and EAP use questions in their texts about 2.5 to 3 times as much as HPL does.

Appendix

For the code used in this analysis, please refer to Spooky_Data_Analysis.Rmd.
Reference: Text Mining with R: A Tidy Approach, by Julia Silge and David Robinson
