Analyzer and statistics generator for text-based conversations. If interested, also check out my related article.
Includes scraper and parser for Facebook conversations.
The scraper retrieves all messages of a specific Facebook conversation and saves them in a JSON file. At this step messages include all fields and attributes as defined by the Facebook response format.
The parser takes the scraper's output as input, extracts only the textual messages and their related attributes, and saves them to a text file. This file is then used as input for the conversation analyzer.
Conversation format example (based on the current parser):
2012.06.17 15:27:42 SENDER_1 Message text from sender1
2012.06.18 17:27:42 SENDER_2 Message text from sender2
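As an illustration of the format above, here is a minimal sketch of how one line could be split back into its fields. It is not the project's own parsing code; `Message` and `parse_line` are hypothetical names, and the sketch assumes the sender alias contains no spaces.

```python
from datetime import datetime
from typing import NamedTuple


class Message(NamedTuple):
    """One parsed conversation line (hypothetical helper type)."""
    timestamp: datetime
    sender: str
    text: str


def parse_line(line: str) -> Message:
    """Split a line into timestamp, sender and message text."""
    # The first three space-separated fields are date, time and sender.
    # Everything after them is the message text, which may contain spaces.
    date_part, time_part, sender, text = line.rstrip("\n").split(" ", 3)
    timestamp = datetime.strptime(
        f"{date_part} {time_part}", "%Y.%m.%d %H:%M:%S"
    )
    return Message(timestamp, sender, text)


msg = parse_line("2012.06.17 15:27:42 SENDER_1 Message text from sender1")
print(msg.timestamp.isoformat(), msg.sender, msg.text)
```

Limiting the split to three separators keeps the message text intact even when it contains spaces.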
Each of the three components (i.e. scraper, parser, analyzer) can be run separately. The analyzer can be accessed via the main.py script contained in the src folder. It will log a set of basic stats for the overall conversation. The scraper (conversationScraper.py) and parser (conversationParser.py) are instead inside the util folder.
For each component, see the help menu (-h or --help) for a detailed usage description.
Basic configurations for all modules can be managed via the config.ini file. The location of the config file to be used should be passed via --config argument.
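For readers who want to reuse the same pattern in their own scripts, here is a minimal sketch of reading an INI file whose path arrives via a --config argument. It uses only the standard library and is not the project's actual loading code; `load_config` and `build_arg_parser` are hypothetical names.

```python
import argparse
import configparser


def load_config(path: str) -> configparser.ConfigParser:
    """Read an INI file and fail loudly if it cannot be found."""
    # interpolation=None keeps '%' characters literal, which matters for
    # values such as cookies that often contain percent-encoded text.
    config = configparser.ConfigParser(interpolation=None)
    # read() skips missing files silently and returns the list of files
    # it managed to parse, so an empty list means nothing was loaded.
    if not config.read(path):
        raise FileNotFoundError(f"Could not read config file: {path}")
    return config


def build_arg_parser() -> argparse.ArgumentParser:
    """Create a parser that accepts the --config argument."""
    parser = argparse.ArgumentParser(description="Example config loader")
    parser.add_argument("--config", required=True, help="path to config.ini")
    return parser
```

A script would then call `load_config(build_arg_parser().parse_args().config)` and look values up by section and key. Which sections and keys exist depends on the config.ini shipped with the project.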
The only additional requirement for parser and scraper is the requests package. For the analyzer I used Conda; requirements.txt is the exported environment file. See here for a guide on how to manage Conda environments.
To access Facebook conversations, the following parameters are required: cookie, fb_dtsg and conversation ID.
These values can be found via the following procedure:
- Open the desired conversation in a browser
- Check the network traffic via the preferred developer tool or equivalent
- Scroll up in the conversation until a POST request is issued to thread_info.php
- Locate and copy the required parameters: cookie, fb_dtsg and conversation ID
  - You can find the conversation ID in a line with this format: messages[user_ids][<conversation_ID>].. or messages[thread_fbids][<conversation_ID>] for group conversations.
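If you copy the whole request body rather than picking the value out by hand, the ID can also be extracted programmatically. The following is a minimal sketch, assuming the body contains one of the two patterns described above. `find_conversation_id` is a hypothetical helper, not part of the scraper.

```python
import re
from typing import Optional, Tuple
from urllib.parse import unquote

# Matches messages[user_ids][<ID>] or messages[thread_fbids][<ID>].
_ID_PATTERN = re.compile(r"messages\[(user_ids|thread_fbids)\]\[([^\]]+)\]")


def find_conversation_id(request_body: str) -> Optional[Tuple[str, bool]]:
    """Return (conversation_id, is_group), or None if no ID is found."""
    # Form bodies are often percent-encoded, so decode the brackets first.
    match = _ID_PATTERN.search(unquote(request_body))
    if match is None:
        return None
    kind, conversation_id = match.groups()
    # thread_fbids marks a group conversation, user_ids a one-to-one chat.
    return conversation_id, kind == "thread_fbids"


print(find_conversation_id("messages[user_ids][11234]"))
print(find_conversation_id("messages%5Bthread_fbids%5D%5B112345%5D"))
```

The boolean in the result indicates whether the conversation is a group one, which is the case where the scraper needs the -g flag.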
Once the values of cookie and fb_dtsg have been copied into the config.ini file, the scraper can be run by passing the conversation ID as an argument (--id). If you want to scrape a group conversation, use the -g flag; the scraper will not work otherwise. Via the -m flag, new messages can be merged with the previously scraped part of the same conversation, if present.
To run the parser, just provide as arguments the path of the scraped-conversation file and the desired path for the parsed output. Additionally, via --authors, you can pass a dictionary-like structure that maps profile IDs to preferred aliases. This will produce a more readable output. Example usage:
--authors "{"11234":"SENDER_1", "112345":"SENDER_2"}"
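To illustrate the idea behind the alias mapping, here is a minimal sketch of turning such a structure into a lookup table. It assumes the value reaches the script as valid JSON text, which may differ from how conversationParser.py actually reads it. `parse_authors` and `alias_for` are hypothetical names.

```python
import json
from typing import Dict


def parse_authors(raw: str) -> Dict[str, str]:
    """Parse a JSON object mapping profile IDs to aliases (assumed format)."""
    authors = json.loads(raw)
    if not isinstance(authors, dict):
        raise ValueError("expected a JSON object mapping IDs to aliases")
    # Normalise keys and values to strings so lookups are consistent.
    return {str(profile_id): str(alias) for profile_id, alias in authors.items()}


def alias_for(profile_id: str, authors: Dict[str, str]) -> str:
    """Return the alias for a profile ID, or the ID itself if none is set."""
    return authors.get(str(profile_id), str(profile_id))


authors = parse_authors('{"11234": "SENDER_1", "112345": "SENDER_2"}')
print(alias_for("11234", authors))  # SENDER_1
print(alias_for("99999", authors))  # 99999
```

Falling back to the raw ID keeps messages from unmapped profiles in the output instead of dropping them.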
I have added two Jupyter notebooks for easier exploration of the various statistics and analytical results. Just check out Basic Stats or Words Stats. I have left previous outputs as examples, but I encourage you to explore your own data and tweak the stats and plots based on your preferences.
If you are not familiar with notebooks, or would rather not use them, you can still use the old main.py to run the stats automatically. It takes the file path of the conversation to analyze as a parameter, then logs and generates a set of basic stats for the overall conversation.
- Interval Stats (start/end date, duration, days without messages)
- Basic Length Stats (number of messages, total length of messages, average message length)
- Lexical Stats (token count and vocabulary, lexical diversity). The token count includes duplicate words, while the vocabulary (also called types) is the count of unique words. Lexical richness/diversity is the ratio between vocabulary and token count.
- Word Count/Frequency (top N words, words count, words trend, words used just by, relevant words by sender, Zipf's law). This can then be generalized to all other N-grams.
- Emoticons Stats (number of emoticons used, emoticon ratio, emoticon count)
- Reply Delay (reply delay by sender, reply delay by message length, number of sequential messages by sender)
- Aggregation - most of the previous groups include the option of aggregation by sender, or by a combination of datetime features (e.g. hour, day, month, year).
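To make the Lexical Stats definitions concrete, here is a minimal sketch that computes token count, vocabulary and lexical diversity for a list of messages. It is not the analyzer's code. The tokenizer is a deliberately simple assumption (lowercased runs of letters, digits and apostrophes), and `tokenize` and `lexical_stats` are hypothetical names.

```python
import re
from collections import Counter
from typing import Dict, List, Union


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def lexical_stats(messages: List[str]) -> Dict[str, Union[int, float]]:
    """Compute token count, vocabulary size and lexical diversity."""
    tokens = [token for message in messages for token in tokenize(message)]
    counts = Counter(tokens)
    tokens_count = len(tokens)  # every occurrence, duplicates included
    vocabulary = len(counts)    # unique words only (types)
    # Diversity is vocabulary / tokens; guard against empty input.
    diversity = vocabulary / tokens_count if tokens_count else 0.0
    return {
        "tokens_count": tokens_count,
        "vocabulary": vocabulary,
        "lexical_diversity": diversity,
    }


stats = lexical_stats(["The cat sat", "the cat ran"])
print(stats)
```

For the two sample messages there are 6 tokens and 4 unique words, so the diversity is 4/6. The same `Counter` can also provide the top N words via `most_common(n)`.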
NEW
- Sentiment Analysis (joy, anger, disgust, fear values via IBM Watson Tone Analyzer Service)
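As a rough illustration of the Reply Delay group listed above, here is a minimal sketch that measures the time between consecutive messages whenever the sender changes. This definition of "reply" is an assumption made for the example and may differ from the one the analyzer uses. `reply_delays` is a hypothetical name.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict, List, Tuple


def reply_delays(
    messages: List[Tuple[datetime, str]]
) -> Dict[str, List[timedelta]]:
    """Collect, per sender, the delays before each of their replies.

    Messages must be (timestamp, sender) pairs sorted by timestamp.
    """
    delays: Dict[str, List[timedelta]] = defaultdict(list)
    for (prev_time, prev_sender), (time, sender) in zip(messages, messages[1:]):
        # Only a change of sender counts as a reply; consecutive messages
        # from the same sender are skipped.
        if sender != prev_sender:
            delays[sender].append(time - prev_time)
    return dict(delays)


conversation = [
    (datetime(2012, 6, 17, 15, 0, 0), "SENDER_1"),
    (datetime(2012, 6, 17, 15, 1, 0), "SENDER_1"),
    (datetime(2012, 6, 17, 15, 5, 0), "SENDER_2"),
    (datetime(2012, 6, 17, 15, 6, 30), "SENDER_1"),
]
for sender, sender_delays in reply_delays(conversation).items():
    print(sender, [str(delay) for delay in sender_delays])
```

Each delay is attributed to the sender who replied, so averaging a sender's list gives their typical response time.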
* different-conversations comparison
* option of using the normal Facebook login, then collecting the needed info (try mechanize)
Released under version 2.0 of the Apache License.