Conversation Analyzer

Analyzer and statistics generator for text-based conversations. If you are interested, also check out my related article.

Includes scraper and parser for Facebook conversations.

The scraper retrieves all messages of a specific Facebook conversation and saves them in a JSON file. At this step messages include all fields and attributes as defined by the Facebook response format.

The parser takes the scraper's output as input, extracts only the textual messages and their related attributes, and saves them to a text file. This file is then used as input for the conversation analyzer.

Conversation format example (based on the current parser):

2012.06.17 15:27:42 SENDER_1 Message text from sender1
2012.06.18 17:27:42 SENDER_2 Message text from sender2
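
As a rough illustration of the format above, the following sketch splits one line into timestamp, sender and text. It assumes that fields are separated by single spaces and that the sender alias contains no spaces; `Message` and `parse_line` are hypothetical names, not part of this project's code.

```python
from datetime import datetime
from typing import NamedTuple


class Message(NamedTuple):
    timestamp: datetime
    sender: str
    text: str


def parse_line(line: str) -> Message:
    """Split 'YYYY.MM.DD HH:MM:SS SENDER text...' into its three parts."""
    # The first two space-separated fields form the timestamp, the third is
    # the sender, and everything after that is the message text.
    date_part, time_part, sender, text = line.rstrip("\n").split(" ", 3)
    timestamp = datetime.strptime(f"{date_part} {time_part}", "%Y.%m.%d %H:%M:%S")
    return Message(timestamp, sender, text)


msg = parse_line("2012.06.17 15:27:42 SENDER_1 Message text from sender1")
print(msg.sender, msg.timestamp.isoformat(), msg.text)
```

Limiting the split to three separators keeps any spaces inside the message text intact.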

Usage

Each of the three components (scraper, parser and analyzer) can be run separately. The analyzer is run via the main.py script in the src folder; it logs a set of basic stats for the overall conversation. The scraper (conversationScraper.py) and parser (conversationParser.py) are instead inside the util folder.

For each component, see the help menu (-h or --help) for a detailed usage description.

Basic configuration for all modules is managed via the config.ini file. The location of the config file to use should be passed via the --config argument.
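
To make the --config mechanism more concrete, here is a minimal sketch of how a script might accept a config path and read values from an INI file. The section name `Credentials` and the layout of the sample file are assumptions made for illustration; check the config.ini shipped with the project for the actual sections and keys.

```python
import argparse
import configparser

# Hypothetical config content; the real config.ini may use other sections/keys.
SAMPLE_CONFIG = """
[Credentials]
cookie = <your cookie>
fb_dtsg = <your fb_dtsg>
"""


def load_config(text: str) -> configparser.ConfigParser:
    # interpolation=None avoids errors on values that contain '%' characters.
    config = configparser.ConfigParser(interpolation=None)
    config.read_string(text)
    return config


parser = argparse.ArgumentParser()
parser.add_argument("--config", default="config.ini", help="path to the config file")
# An explicit argument list is passed here so the sketch runs as-is.
args = parser.parse_args(["--config", "config.ini"])

config = load_config(SAMPLE_CONFIG)
print(args.config, config["Credentials"]["fb_dtsg"])
```

In a real script the file at `args.config` would be read with `config.read(args.config)` instead of the inline sample string.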

Requirements

The only additional requirement for the parser and scraper is the requests package. For the analyzer I used Conda; requirements.txt is the exported environment file. See here for a guide on how to manage Conda environments.

Scraper

In order to access Facebook conversations, the following parameters are required: cookie, fb_dtsg and conversation ID.

Such data can be found via the following procedure:

1. Open the desired conversation in a browser.
2. Check the network traffic via the preferred developer tool or equivalent.
3. Scroll up in the conversation until a POST request is issued to thread_info.php.
4. Locate and copy the required parameters: cookie, fb_dtsg and conversation ID.
   4.1 You can find the conversation ID in a line with this format: messages[user_ids][<conversation_ID>]... or messages[thread_fbids][<conversation_ID>]... for group conversations.
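
If you copy the request's form data as text, the conversation ID can also be picked out with a regular expression. The sketch below assumes the form data is already URL-decoded and that IDs are purely numeric; `find_conversation_id` is a hypothetical helper, not part of the scraper.

```python
import re
from typing import Optional, Tuple

# Matches either messages[user_ids][<id>] or messages[thread_fbids][<id>].
# Assumption: the ID consists only of digits.
_ID_PATTERN = re.compile(r"messages\[(user_ids|thread_fbids)\]\[(\d+)\]")


def find_conversation_id(form_data: str) -> Optional[Tuple[str, bool]]:
    """Return (conversation_id, is_group), or None if no ID is found."""
    match = _ID_PATTERN.search(form_data)
    if match is None:
        return None
    kind, conversation_id = match.groups()
    # thread_fbids marks a group conversation, which needs the -g flag.
    return conversation_id, kind == "thread_fbids"


# Made-up example values, not real IDs or real request bodies.
print(find_conversation_id("messages[user_ids][11234][offset]=0"))
print(find_conversation_id("messages[thread_fbids][112345][limit]=20"))
```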

Once the values of cookie and fb_dtsg have been copied into the config.ini file, the scraper can be run by passing the conversation ID as an argument (--id). To scrape a group conversation, use the -g flag; the scraper will not work otherwise. With the -m flag, new messages are merged with the previously scraped part of the same conversation, if present.

Parser

To run the parser, provide as arguments the path of the scraped-conversation file and the desired path for the parsed output. Additionally, via --authors, you can pass a dictionary-like structure that maps profile IDs to preferred aliases. This produces a more readable output. Example usage:

--authors "{"11234":"SENDER_1", "112345":"SENDER_2"}"
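
The effect of such a mapping can be sketched as follows. How the parser actually decodes the --authors argument is not shown here, so this sketch simply assumes a JSON object string; `load_aliases` and `alias_for` are hypothetical names, and falling back to the raw profile ID for unknown authors is also an assumption.

```python
import json
from typing import Dict


def load_aliases(raw: str) -> Dict[str, str]:
    """Decode a JSON object that maps profile IDs to aliases."""
    aliases = json.loads(raw)
    # Normalize keys and values to strings so lookups are consistent.
    return {str(profile_id): str(alias) for profile_id, alias in aliases.items()}


def alias_for(profile_id: str, aliases: Dict[str, str]) -> str:
    # Assumption: unknown profile IDs are kept unchanged.
    return aliases.get(profile_id, profile_id)


aliases = load_aliases('{"11234": "SENDER_1", "112345": "SENDER_2"}')
print(alias_for("11234", aliases))
print(alias_for("99999", aliases))
```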

Analyzer

I have added two Jupyter notebooks for easier exploration of the various statistics and analytical results. Just check out Basic Stats or Words Stats. I have left previous outputs as examples, but I encourage you to explore your own data and tweak the stats and plots based on your preferences.

If you are not familiar with Jupyter, or would rather not use the notebooks, you can still use the older main.py to run the stats automatically. It takes as a parameter the file path of the conversation to be analyzed, then generates and logs a set of basic stats for the overall conversation.

Conversation Stats List

* Interval Stats (start/end date, duration, days without messages)
* Basic Length Stats (number of messages, total length of messages, average message length)
* Lexical Stats (token count and vocabulary, lexical diversity). The token count includes duplicate words, while the vocabulary (also called types) is the count of unique words. Lexical richness/diversity is the ratio of vocabulary to token count.
* Word Count/Frequency (top N words, word counts, word trends, words used by only one sender, relevant words by sender, Zipf's law). This generalizes to all other n-grams.
* Emoticons Stats (number of emoticons used, emoticon ratio, emoticon count)
* Reply Delay (reply delay by sender, reply delay by message length, number of sequential messages by sender)
* Aggregation: most of the previous groups include the option of aggregating by sender, or by a combination of datetime features (e.g. hour, day, month, year).
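
To illustrate the lexical and word-frequency definitions above, here is a small self-contained sketch. The tokenizer is a deliberately simple assumption (lowercased runs of letters, digits and apostrophes) and is not necessarily what the analyzer uses; `tokenize` and `lexical_stats` are hypothetical names.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple


def tokenize(text: str) -> List[str]:
    # Simplistic tokenizer: lowercase, keep runs of letters, digits, apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


def lexical_stats(messages: Iterable[str]) -> Tuple[int, int, float, Counter]:
    """Return (token count, vocabulary size, lexical diversity, word counts)."""
    tokens = [token for message in messages for token in tokenize(message)]
    counts = Counter(tokens)
    token_count = len(tokens)   # duplicates included
    vocabulary = len(counts)    # unique words (types)
    diversity = vocabulary / token_count if token_count else 0.0
    return token_count, vocabulary, diversity, counts


messages = ["the cat sat on the mat", "the dog sat"]
token_count, vocabulary, diversity, counts = lexical_stats(messages)
print(token_count, vocabulary, round(diversity, 3))  # 9 6 0.667
print(counts.most_common(2))  # [('the', 3), ('sat', 2)]
```

The same counter can be built over tuples of consecutive tokens to get n-gram frequencies.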

NEW

Sentiment Analysis (joy, anger, disgust, fear values via IBM Watson Tone Analyzer Service)

TODO

* different-conversations comparison
* option of using a normal Facebook login, then collecting the needed info (try mechanize)

License

Released under version 2.0 of the Apache License.
