Christian Fernandez, [email protected]
Radhika Mattoo, [email protected]
Programmers do not inherently have all the answers when it comes to software engineering. Popular websites such as Stack Overflow and GitHub therefore provide platforms for engineers to initiate discussions, raise questions, and crowdsource answers. This project analyzes GitHub commit messages via a traditional word count to create a dated distribution of popular commit topics. We then use this distribution to apply a weighted search over related Stack Overflow questions. Finally, the dated topics and their corresponding questions are collated into a simple UI to help programmers accelerate their own development process.
This code was run on a Hadoop cluster running Hadoop version 2.6.0-cdh5.11.1. In order to compile this project you will need the following:
- Hadoop version: 2.6.0-cdh5.11.1
- Maven version: 3.3.3
- Jar dependencies for Hive; more information is in the step-1-hive-import directory
- Java version: 1.8.0_152 (Java(TM) SE Runtime Environment, build 1.8.0_152-b16; Java HotSpot(TM) 64-Bit Server VM, build 25.152-b16, mixed mode)
- step-1-hive-import: This directory contains a README file with the steps taken to import the GitHub data into HDFS using Hive.
- step-2-group-data: This directory contains the code used to process the output of step 1.
- step-3-word-count: This directory contains the code used to count all words for each GitHub commit message that was grouped in step 2.
- step-4-process-topics: This last directory contains the final code to process the GitHub data and find the associations with the Stack Overflow data.
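As an illustration of the step-3 processing, the word count over grouped commit messages can be sketched outside Hadoop as plain Java. This is only a sketch of the map (tokenize) and reduce (sum) logic, using hypothetical sample messages, not the actual MapReduce job:

```java
import java.util.HashMap;
import java.util.Map;

public class CommitWordCount {
    // Count word frequencies across a group of commit messages,
    // mirroring the map (tokenize) and reduce (sum) phases of step 3.
    static Map<String, Integer> countWords(String[] messages) {
        Map<String, Integer> counts = new HashMap<>();
        for (String message : messages) {
            for (String token : message.toLowerCase().split("\\W+")) {
                if (token.isEmpty()) continue;     // skip leading punctuation splits
                counts.merge(token, 1, Integer::sum); // reduce: sum per word
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical commit messages for one date group
        String[] messages = {
            "Fix null pointer exception in parser",
            "Fix failing parser tests"
        };
        Map<String, Integer> counts = countWords(messages);
        System.out.println(counts.get("fix"));    // 2
        System.out.println(counts.get("parser")); // 2
    }
}
```

In the real job, the tokenizing loop runs in the mapper and the summing is done by the reducer, but the per-group result is the same word-to-count map.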
All step directories, with the exception of step 1, have an individual run.sh shell script that includes all commands needed to build and execute that step.
step-1-hive-import:
- Input: From Google, stored in /user/cf86/bigQuery
- Output: From Hive, stored in /user/cf86/hive-cleaned
step-2-group-data:
- Input: From Hive, /user/cf86/hive-cleaned
- Output: From MapReduce, /user/rm3485/groupedData
step-3-word-count:
- Input: From step 2, /user/rm3485/groupedData
- Output: From MapReduce, /user/cf86/output/topic_data
step-4-process-topics:
- Input: Stack Overflow Posts (Questions), /user/rm3485/posts-clean
- Output: From MapReduce, /user/rm3485/finalTest
- Cached File: From Word Count (Step 3), /user/cf86/output/topic_data
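The step-4 association can be sketched as a weighted match: each Stack Overflow question is scored by summing the step-3 counts of the topic words it contains. The following stand-alone sketch uses hypothetical topic counts and question titles, not the actual cached-file MapReduce job:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TopicMatcher {
    // Score a question title by summing the word-count weights of the
    // topic words it contains (the "weighted search" of step 4).
    static int score(String title, Map<String, Integer> topicCounts) {
        int total = 0;
        for (String token : title.toLowerCase().split("\\W+")) {
            total += topicCounts.getOrDefault(token, 0);
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical topic distribution from the step-3 word count
        Map<String, Integer> topicCounts = new LinkedHashMap<>();
        topicCounts.put("merge", 40);
        topicCounts.put("conflict", 25);
        topicCounts.put("rebase", 10);

        // Hypothetical Stack Overflow question titles
        System.out.println(score("How to resolve a merge conflict", topicCounts)); // 65
        System.out.println(score("Undo a rebase", topicCounts));                   // 10
    }
}
```

In the real job, the topic counts come from the cached step-3 output and the questions from /user/rm3485/posts-clean; higher-scoring questions rank higher for a given dated topic.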
The final output file can also be found in this repository as output-data.txt.