Question Recommendation On The Stack Overflow Network
Jacob Perricone
Institute of Computational and
Mathematical Engineering
Stanford University
Email: [email protected]
Abstract—The use of Question-Answering Communities in software engineering has skyrocketed in recent years, with the number of posts on Stack Overflow alone nearing 37 million. Stack Overflow allows users to post questions to a large developer community, nearly 8 million users, and receive answers or advice about the issue the user is encountering. As the number of answered questions grows, Stack Overflow is useful not only to the user posting a question but to any developer interested in gathering information. With an ever-increasing network of information, it is important to devise algorithms able to recommend relevant questions (or even answers) to a user. Typically, recommendation systems employ some form of collaborative filtering coupled with latent factor models and content analysis. This project utilizes the network structure of Stack Overflow to recommend a set of related questions for a given input question. In particular, this paper employs a modified guided Personalized Page Rank algorithm to generate candidate recommendations and compares the results to those recommended by Stack Overflow. Semantic similarity and tag overlap are used to assess candidate recommendations.

Index Terms—Recommendation Systems, Stack Overflow, Personalized Page Rank

I. INTRODUCTION

Stack Overflow represents a vast wealth of curated answers to commonly encountered software engineering questions. Indeed, combing through the Stack Overflow archives in search of answers or relevant information is commonplace when encountering a bug or exploring a new library. The emergence of Stack Overflow testifies to the push for crowd-sourced developer environments. Often, when writing a piece of software, it is likely that components of the software have already been written. As the amount of software grows, it will become increasingly inefficient to spend time rewriting code that has already been developed. Creating a collaborative development environment where code is recommended for a user given their history of projects and the task at hand is paramount in increasing developer efficiency.

Recommendation systems are present in most of the services we use, ranging from movie recommendation on Netflix and job recommendation on LinkedIn to product recommendation on Amazon and user recommendation on social media. In fact, Stack Overflow even offers a list of related posts for a given question. Despite the prevalence of recommendation systems in everyday life, code recommendation systems are not common, likely due to the difficulty of assessing intention and relevance. Nevertheless, the first step in creating a recommendation system that is integrated into both the user's IDE and the developer community at scale is to analyze the recommendation system on the largest open-source software network available.

This project utilizes the network structure of Stack Overflow to recommend a set of related questions for a given question. One difficulty that must be addressed is how to determine the quality of the recommendation, since there is no labeled dataset. Real-world systems often approximate recommendation quality by examining user interaction with the recommendation. In the offline setting, however, assessing recommendation quality requires an ad hoc metric of "relevance", which may not accurately capture quality.

Lastly, this project focuses on the utility of network algorithms in the recommendation task, which places constraints on the input to the recommendation algorithm (i.e., it must be a node within the network). Recommending questions given only a natural language query is an additional task that is not addressed in this paper.

A. Literature Review

Recommending relevant questions given an input question and user touches on two areas of research: natural language processing and network recommendation algorithms. If the input question is taken to be a node in the graph, the task can be modeled as a network recommendation problem; whereas, if the input query is natural language, inference techniques from natural language processing must be used.

Recommendation problems on networks can be analyzed using variations of the canonical Page Rank algorithm. The Page Rank algorithm assigns an importance score to each node in the network. However, Page Rank is restrictive in that it initially assigns equal weighting to all nodes and edges in the network. In effect, it is useful as a global measure of relevance but not helpful for finding nodes pertaining to a given source node. Personalized Page Rank (PPR) attempts to rectify this shortcoming by ranking nodes not just by global popularity but by relevance to a given set of source nodes. A common interpretation of PPR is to take a random walk around the graph that randomly restarts at the source nodes, and rank nodes by proximity [1] [2]. Variations on Personalized Page Rank are leveraged for recommendation in large-scale systems such as the Pinterest graph [3]. The system used at Pinterest to recommend pins for
a board given a set of input pins in part utilizes a real-time random walk service, Pixie [4]. Pixie conducts many random walks on a bipartite graph of pins and boards and aggregates pin counts at each step, which effectively computes Personalized PageRank on the graph with a given query pin. Pixie has been shown to be quite effective at recommendation since it is able to extract highly connected pins while also retaining good coverage of rare pins. A combination of Pixie candidates, heuristic candidates and session co-occurrence candidates is then used to augment the recommendation system.

In Q&A communities, there has been a great deal of research on how to recommend to expert users questions they would like to answer. Mathur et al. [5] analyze a restricted set of the Stack Exchange math community, using network features learned through node2vec [6] as inputs for a machine learning algorithm to rank users for a given question. They achieve high-quality results, with the correct user being in the top 2 suggestions over 75% of the time. Utilizing network embeddings is a novel and insightful aspect of their work. Gideon et al. [7] propose a recommendation system for Yahoo Answers that combines features extracted from the question's text and user-user interactions to create a score for each user. Impressively, their model is able to capture the underlying state of the question through time and assess the propensity of a user to answer a question in a given state.

Within the Stack Overflow network, additional research has been conducted on Tag Recommendation for a given input question and user. The most relevant algorithm in the tag-recommendation sphere is NetTagCombine, which expands upon previous work tagging software information sites by incorporating network structure into the tag prediction algorithm [8] [9]. Short et al. analyze three graphs within the Stack Overflow network: a network based on semantic post-similarity, created by adding an edge between two questions if the cosine similarity of their tf-idf vectors is greater than a set threshold; a network based on user-to-user interactions (u → v if u answers v's questions); and a bipartite graph linking users to tags, with edge weights assigned based on whether the user asked, posted, or commented on the question. Short et al. then implement the TagCombine algorithm, which consists of a multi-label learning algorithm, a similarity-based ranking component (which finds similar posts, accumulates the tags within them, and assigns probabilities to the tags through empirical likelihood), and a co-occurrence weighting between the words in the question and the words associated with a tag. They then improve the algorithm by leveraging the bipartite graph between users and tags to assign tag i a score proportional to the weighted sum of the edges between the users who interacted with the post and the tag i. In addition, they use the BIGCLAM algorithm to detect communities in the question-to-question network and allow the top 50 most similar posts in each community to contribute their tag representatives to the likelihoods of assigning tags to the new question. Their research demonstrates a significant improvement in prediction when utilizing network structure. Utilizing the bipartite graph between users and tags and forming communities of questions is resourceful.

II. DATA

A. Data Collection & Attribution

The data used for this project is primarily provided by the StackExchange data-dump [10], though additional data was gathered by scraping the Stack Overflow website. From the StackExchange data-dump, this project uses only data provided for Stack Overflow, in particular the tables Posts.xml, Users.xml, PostLinks.xml, and Comments.xml. Additional data on Stack Overflow's "related questions" for a given post were mined from the Stack Overflow website itself.

B. Data Processing

Six SQL tables were created from the raw data-dump files: Questions, Answers, Users, PostLinks, RelatedLinks, Tags. Since the ultimate goal is to improve the granularity of recommendation, this project restricts the Stack Overflow posts to only those relevant to questions pertaining to Python. This involved a number of processing steps detailed below.

The Tag table contains columns TagId, TagName, Count and was parsed directly from the XML file. The post data provided by the data-dump is a 56-gigabyte XML file of all the posts, including both questions and answers¹, on the site. The Question table contains the columns: Id, AcceptedAnswerId, CreationDate, Text, Code, Score, ViewCount, Title, Tags, AnswerCount, FavoriteCount, OwnerUserId, CommentCount. A post was added to the Question table if a) the post was a question and b) any version of the string python could be found in Tags, Body or Title. Next, the data-dump attribute Body, containing the raw HTML of the post, was parsed into Text and Code columns. The total number of questions accumulated in the first step was ∼800,000. Now, to take into account that some posts pertaining to Python are labeled only with a subcategory², the graph was further augmented by fetching all linked questions from the PostLinks.xml file³. Finally, to ensure the validity of the processing and that all questions pertaining to Python are incorporated into the graph, for each question q in the Question database, the list of recommended questions was scraped from the Stack Overflow web page of q and stored in a relatedlinks table⁴. The related links serve as our baseline for comparison in the ranking algorithms discussed below. Since the scale of this scraping procedure was massive, on the order of 800,000 questions, the procedure employed multi-threaded calls to AWS Lambda functions to extract the relevant information in parallel.

¹ For the purposes of this project, Posts denote Questions or Answers.
² For example, questions pertaining to Django, a Python server-side framework, often omit the parent tag.
³ The PostLinks table has two types of links, duplicates and linked. An edge from p_i to p_j is of type link if p_j is explicitly referred to by a user in p_i.
⁴ The relatedlinks table has columns origin PostId RelatedPostId.
Fig. 1. Degree Distribution of TT network

After taking the union of the ids listed in the relatedlinks table, the total number of questions was ∼1,100,000. Using the ids of the Question table, an Answer table was created that contains all of the answers associated with the questions in the Question table. Finally, a User table that stores information on all users within both the Question and Answer tables was created. The total number of answers is 2,226,000; the total number of unique users in the Question and Answer tables is ∼620,000; the total number of distinct tags for Python-related questions is 23,160.
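The first processing step of Section II-B (keep question posts that mention python, then split Body into Text and Code) can be illustrated with a minimal sketch. It is not the project's actual pipeline. It assumes the dump's usual layout of one `<row .../>` element per post with `PostTypeId="1"` marking a question, reads "any version of the string python" as a case-insensitive substring match, and treats everything inside `<code>` tags as the Code column. The helper names (`BodySplitter`, `split_body`, `is_python_question`, `iter_python_questions`) and the three-row sample are hypothetical.

```python
import io
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class BodySplitter(HTMLParser):
    """Separate a post's HTML body into prose and <code> contents."""

    def __init__(self):
        super().__init__()
        self.text, self.code = [], []
        self._depth = 0  # how many <code> tags are currently open

    def handle_starttag(self, tag, attrs):
        if tag == "code":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "code" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        (self.code if self._depth else self.text).append(data)


def split_body(body_html):
    """Return (Text, Code) for one raw HTML body."""
    parser = BodySplitter()
    parser.feed(body_html)
    parser.close()
    text = " ".join(" ".join(parser.text).split())  # normalize whitespace
    return text, "".join(parser.code)


def is_python_question(attrib):
    """Assumed filter: a question whose Tags, Title or Body mentions python."""
    if attrib.get("PostTypeId") != "1":
        return False
    fields = (attrib.get("Tags", ""), attrib.get("Title", ""), attrib.get("Body", ""))
    return any("python" in field.lower() for field in fields)


def iter_python_questions(source):
    """Stream <row> elements so a very large file never sits in memory."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "row":
            if is_python_question(elem.attrib):
                text, code = split_body(elem.attrib.get("Body", ""))
                yield {"Id": int(elem.attrib["Id"]),
                       "Title": elem.attrib.get("Title", ""),
                       "Tags": elem.attrib.get("Tags", ""),
                       "Text": text, "Code": code}
            root.clear()  # drop rows that have already been processed


# Hypothetical three-row sample standing in for Posts.xml.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" Title="Sort a list" Tags="&lt;python&gt;&lt;list&gt;" Body="&lt;p&gt;How do I sort?&lt;/p&gt;&lt;pre&gt;&lt;code&gt;sorted(x)&lt;/code&gt;&lt;/pre&gt;" />
  <row Id="2" PostTypeId="2" ParentId="1" Body="&lt;p&gt;Use sorted in python.&lt;/p&gt;" />
  <row Id="3" PostTypeId="1" Title="Java generics" Tags="&lt;java&gt;" Body="&lt;p&gt;Why erasure?&lt;/p&gt;" />
</posts>"""

questions = list(iter_python_questions(io.BytesIO(SAMPLE)))
```

On the sample, only the first row survives: the second is an answer and the third never mentions python. Streaming with `iterparse` rather than loading the whole tree is a natural choice for a file of the size reported above, though the original implementation may have differed.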
Fig. 3. Degree Distribution of Tags within TQ network
Fig. 5. Degree Distribution of UT network
15 I actually explored this avenue, creating node embeddings, but did not
have time to implement the prediction step