Question Recommendation On The Stack Overflow Network: Jacob Perricone

This document discusses developing a recommendation system for Stack Overflow that recommends related questions given an input question. It analyzes using a modified personalized PageRank algorithm on the Stack Overflow network to generate recommendations. It also discusses evaluating recommended questions by comparing them to Stack Overflow's built-in related questions and using semantic similarity and tag overlap metrics. The data used is from the Stack Exchange data dump and additional scraping of Stack Overflow for related question data. The goal is to improve the granularity of recommendations for questions related to Python.

Question Recommendation On the Stack Overflow Network
Jacob Perricone
Institute of Computational and
Mathematical Engineering
Stanford University
Email: [email protected]

Abstract—The use of Question-Answering Communities in software engineering has skyrocketed in recent years, with the number of posts on Stack Overflow alone nearing 37 million. Stack Overflow allows users to post questions to a large developer community, nearly 8 million users, and receive answers or advice about the issue the user is encountering. As the number of answered questions grows, Stack Overflow is not only useful to the user posting a question but to any developer interested in gathering information. With an ever-increasing network of information, it is important to devise algorithms able to recommend relevant questions (or even answers) to a user. Typically recommendation systems employ some form of collaborative filtering coupled with latent factor models and content analysis. This project utilizes the network structure of Stack Overflow to recommend a set of related questions for a given input question. In particular, this paper employs a modified guided Personalized Page Rank algorithm to generate candidate recommendations and compares the results to those recommended by Stack Overflow. Semantic similarity and tag-overlap are used to assess candidate recommendations.

Index Terms—Recommendation Systems, Stack Overflow, Personalized Page Rank

I. INTRODUCTION

Stack Overflow represents a vast wealth of curated answers to commonly encountered software engineering questions. Indeed, combing through the Stack Overflow archives in search of answers or relevant information is commonplace when encountering a bug or exploring a new library. The emergence of Stack Overflow testifies to the push for crowd-sourced developer environments. Often, when writing a piece of software, it is likely that components of the software have already been written. As the amount of software grows, it will become increasingly inefficient to spend time rewriting code that has already been developed. Creating a collaborative development environment where code is recommended for a user given their history of projects and the task at hand is paramount in increasing developer efficiency.

Recommendation systems are present in most of the services we use, ranging from movie recommendation on Netflix, job recommendation on LinkedIn, and product recommendation on Amazon to user recommendation on social media. In fact, Stack Overflow even offers a list of related posts for a given question. Despite the prevalence of recommendation systems in everyday life, code recommendation systems are not common, likely due to the difficulty of assessing intention and relevance. Nevertheless, the first step in creating a recommendation system, which is integrated into both the user’s IDE and the developer community at scale, is to analyze the recommendation system on the largest open-source software network available.

This project utilizes the network structure of Stack Overflow to recommend for a given question a set of related questions. One difficulty that must be addressed is how to determine the quality of the recommendation, since there is no labeled dataset. Real world systems often approximate recommendation quality by examining user interaction with the recommendation. In the off-line setting, however, assessing recommendation quality requires ad-hoc metrics of “relevance” which may not accurately capture quality.

Lastly, this project focuses on the utility of network algorithms in the recommendation tasks, which places constraints on the input to the recommendation algorithm (i.e. a node within the network). Recommending questions given only a natural language query is an additional task that is not addressed in this paper.

A. Literature Review

Recommending relevant questions given an input question and user touches on two areas of research: natural language processing and network recommendation algorithms. If the input question is taken to be a node in the graph, the task can be modeled as a network recommendation problem; whereas, if the input query is natural language, inference techniques from natural language processing must be used.

Recommendation problems on networks can be analyzed using variations of the canonical Page Rank algorithm. The Page Rank Algorithm assigns an importance score to each node in the network. However, Page Rank is restrictive in that it initially assigns equal weighting to all nodes and edges in the network. In effect, it is useful as a global measure of relevance but not helpful for finding nodes pertaining to a given source node. Personalized Page Rank (PPR) attempts to rectify this shortcoming by ranking nodes not just by global popularity but by relevance to a given set of source nodes. A common interpretation of PPR is to take a random walk around the graph, restarting randomly, and rank nodes by proximity [1] [2]. Variations on Personalized Page Rank are leveraged for recommendation in large-scale systems such as the Pinterest graph [3]. The system used at Pinterest to recommend pins for
a board given a set of input pins in part utilizes a real-time random walk service, Pixie [4]. Pixie conducts many random walks on a bipartite graph of pins and boards and aggregates pin counts at each step, which effectively computes Personalized PageRank on the graph for a given query pin. Pixie has been shown to be quite effective at recommendation since it is able to extract highly connected pins while retaining good coverage of rare pins. A combination of Pixie candidates, heuristic candidates and session co-occurrence candidates is then used to augment the recommendation system.

In Q&A communities, there has been a great deal of research on how to recommend to expert users questions they would like to answer. Mathur et al. [5] analyze a restricted set of the Stack Exchange math community, using network features learned through node2vec [6] as inputs for a machine learning algorithm to rank users for a given question. They achieve high quality results, with the correct user being in the top 2 suggestions over 75% of the time. Utilizing network embeddings is a novel and insightful aspect of their work. Gideon et al. [7] propose a recommendation system for Yahoo Answers that combines features extracted from the question’s text and user-user interactions to create a score for each user. Impressively, their model is able to capture the underlying state of the question through time and assess the propensity of a user to answer a question in a given state.

Within the Stack Overflow network, additional research has been conducted on Tag Recommendation for a given input question and user. The most relevant algorithm in the tag-recommendation sphere is NetTagCombine, which expands upon previous work on tagging software information sites by incorporating network structure into the tag prediction algorithm [8] [9]. Short et al. analyze three graphs within the Stack Overflow network: a network based on semantic post-similarity, created by adding edges between questions if the cosine similarity of the two questions’ tf-idf vectors is greater than a set threshold; a network based on user to user interactions (u → v if u answers v’s questions); and a bipartite graph linking users to tags, assigning edge weights based on whether the user asked, posted, or commented on the question. Short et al. then implement the TagCombine algorithm, which consists of a multi-label learning algorithm, a similarity based ranking component (finds similar posts, accumulates the tags within them, assigns probabilities to the tags through empirical likelihood), and a co-occurrence weighting between the words in the question and the words associated with a tag. They then improve the algorithm by leveraging the bipartite graph between users and tags to assign tag i a score proportional to the weighted sum of the edges between the users who interacted with the post and the tag i. In addition they use the BIGCLAM algorithm to detect communities in the question to question network and allow the top 50 most similar posts in each community to contribute their tag representatives to the likelihoods of assigning tags to the new question. Their research demonstrates a significant improvement in prediction when utilizing network structure. Utilizing the bipartite graph between users and tags and forming communities of questions is resourceful.

II. DATA

A. Data Collection & Attribution

The data used for this project is primarily provided by the StackExchange data-dump [10], though additional data was gathered by scraping the Stack Overflow website. From the StackExchange data-dump, this project uses only data provided for Stack Overflow, in particular the tables Posts.xml, Users.xml, PostLinks.xml, Comments.xml. Additional data on Stack Overflow’s “related questions” for a given post were mined from the Stack Overflow website itself.

B. Data Processing

Six SQL tables were created from the raw data-dump file: Questions, Answers, Users, PostLinks, RelatedLinks, Tags. Since the ultimate goal is to improve the granularity of recommendation, this project restricts the Stack Overflow Posts to only those posts relevant to questions pertaining to python. This involved a number of processing steps detailed below.

The Tag table contains columns TagId, TagName, Count and was parsed directly from the XML file. The post data provided by the data-dump is a 56-gigabyte xml file of all the posts, including both questions and answers¹ on the site. The Question table contains the columns: Id, AcceptedAnswerId, CreationDate, Text, Code, Score, ViewCount, Title, Tags, AnswerCount, FavoriteCount, OwnerUserId, CommentCount. A post was added to the Question table if a) the post was a question and b) any version of the string python could be found in Tags, Body or Title. Next the data-dump attribute Body, containing the raw HTML of the post, was parsed into Text and Code columns. The total number of questions accumulated in the first step was ∼800,000. Now, to take into account that some posts pertaining to python are mislabeled with only a subcategory², the graph was further augmented by fetching all linked questions from the PostLinks.xml file³. Finally, to ensure the validity of the processing and that all questions pertaining to python posts are incorporated into the graph, for each question q in the Question database, the list of recommended questions was scraped from the Stack Overflow web-page of q and stored in a relatedlinks table⁴. The related links serve as our baseline for comparison in the ranking algorithms discussed below. Since the scale of this scraping procedure was massive, on the order of 800,000 questions, the procedure employed multi-threaded calls to AWS lambda functions to extract the relevant information in parallel.

¹ For the purposes of this project, Posts denote Questions or Answers.
² For example, questions pertaining to Django, a python server-side framework, often forgot the parent tag.
³ The PostLinks table has two types of links, duplicates and linked. An edge from p_i to p_j is of type link if p_j is explicitly referred to by a user in p_i.
⁴ The relatedlinks table has columns origin PostId RelatedPostId.
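The filtering and Body-parsing steps described above can be illustrated with a minimal sketch. It assumes that everything inside `<code>` elements goes to the Code column and all remaining character data goes to the Text column; the paper does not specify the exact rule, and the names `BodySplitter`, `split_body` and `mentions_python` are hypothetical helpers introduced only for this illustration.

```python
from html.parser import HTMLParser


class BodySplitter(HTMLParser):
    """Split a post's raw HTML body into prose text and code snippets.

    Assumption (not from the paper): anything inside <code> is "code",
    every other piece of character data is "text".
    """

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.code_parts = []
        self._code_depth = 0  # > 0 while inside a <code> element

    def handle_starttag(self, tag, attrs):
        if tag == "code":
            self._code_depth += 1

    def handle_endtag(self, tag):
        if tag == "code" and self._code_depth > 0:
            self._code_depth -= 1

    def handle_data(self, data):
        target = self.code_parts if self._code_depth else self.text_parts
        target.append(data)


def split_body(body_html):
    """Return (text, code) extracted from the raw HTML of a post."""
    parser = BodySplitter()
    parser.feed(body_html)
    parser.close()
    # Collapse runs of whitespace in the prose; keep code line breaks.
    text = " ".join(" ".join(parser.text_parts).split())
    code = "\n".join(p.strip("\n") for p in parser.code_parts if p.strip())
    return text, code


def mentions_python(title, tags, body_html):
    """Condition (b): 'python' appears, in any casing, in Title, Tags or Body."""
    return any("python" in field.lower() for field in (title, tags, body_html))


body = ("<p>How do I sort a <code>dict</code>?</p>"
        "<pre><code>d = {'b': 1}\nsorted(d)</code></pre>")
text, code = split_body(body)
print(text)
print(code)
```

A case-insensitive substring test is one plausible reading of "any version of the string python"; a stricter rule (for example, matching only whole tags) would shrink the resulting question set.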
After taking the union of the ids listed in the relatedlinks table, the total number of questions was ∼1,100,000. Using the ids of the Question table, an Answer table was created that contains all of the answers associated with the questions in the Question table. Finally, a User table that stores information on all users within both the Question and Answer tables was created. The total number of answers is 2,226,000; the total number of unique users in the question and answer table is ∼620,000; the total number of distinct tags for python-related questions is 23,160.

III. NETWORK ANALYSIS

A variety of graphs were created and explored from the Stack Overflow data. Not all of them proved useful in recommendation, but nonetheless the exploration provides insight into the Stack Overflow network. In particular, this paper explores the Tag → Tag graph, TT, the bipartite Question → Tag graph, QT, the User → Tag graph, UT, and the four-way User → Tag → Question graph, UTQ.

A. Tag-Tag Network

A common way to query software sites is using tags. The initial approach to recommending questions relied upon utilizing the implicit hierarchy in tags to restrict the search space. Tags on Stack Overflow are not intrinsically hierarchical, though they retain an implicit structure; that is, if a question pertains to the pandas package in python, it is most frequently given the tags <python><pandas>. Therefore, there is a loose ordering of tags such that those that come first represent broader categories than those succeeding them. The TT network is constructed as follows: let t_i denote the ith tag. The TT network contains an edge from t_i → t_j if t_j succeeds t_i in ordering for some question q_k. There are 23116 nodes and 290,408 edges. The initial impetus for exploiting the tag-tag network was to create a tag-tree to effectively group questions by their position in the tree. However, since the ordering is not strict, creating a tag-hierarchy proved uninformative. After examining the properties of the tag-tag network, it was clear that the TT network was a multi-edge digraph with many cycles. The plot below shows the in-degree, out-degree, and total-degree distributions of the tag-tag network on a log-log scale.

Fig. 1. Degree Distribution of TT network

As seen above, the in- and out-degree distributions of the Tag-Tag network appear to follow a power-law. Moreover, one can notice that the out-degree distribution has much fatter tails than the in-degree. Nevertheless, there are still many nodes that have in-degree over 100, demonstrating the difficulty of unraveling the network into a tree. In hopes of finding a way to partition the question-tag space, the communities within the tag-tag network⁵ were examined to determine whether the space could be effectively partitioned. The plot below shows the result of running the Louvain Method for community detection on the TT network.

⁵ Casting the TT graph as a weighted undirected network.

Fig. 2. Communities within the TT network

In total 60 communities were found, but only 5 of them contain more than 1% of the network. In fact, the top community
contains around 27% of the nodes, and the top 5 communities contain nearly the entire graph. The modularity of the partition is only around .30. Therefore, we are only able to cluster the tag space into a few meaningful groups and definitely not able to partition the graph into a well structured tree.

B. Question-Tag Network

Since the Question-Tag network is the first graph upon which the random walk algorithm will be tested, it is pertinent to examine its network characteristics. The Question-Tag network, QT, is an undirected bipartite graph from questions to tags, where q_i is linked to t_j if q_i contains t_j. The QT network is quite large, with 800,000 nodes and 3 million edges. It is evident by construction that there will be tags that are linked to nearly all the questions of the graph, e.g. the tag <python>. Furthermore, the degree distribution of questions is expected to be very contained, i.e. have small support, since questions are rarely linked to a massive number of tags. Indeed, examining the plots below, we can see that there exist tags to which hundreds of thousands of questions are linked.

Fig. 3. Degree Distribution of Tags within TQ network

Fig. 4. Degree Distribution of Questions within TQ network

The network characteristics above immediately suggest that the vanilla personalized page-rank algorithm on a bipartite graph of questions and tags may have trouble achieving granularity in results. The reasoning is that there exist tags to which nearly all of the questions are linked. Therefore taking a step to one of these high-degree tags from a question node will broaden the search too widely. Indeed, the semi-hierarchical structure of the tags does not lend itself well to the vanilla PPR algorithm. Therefore, in order to recommend related questions, it is important to weight the edges of the graph and modify the page rank algorithm. The methodology used will be discussed in the next section.

C. User-Tag Network

The User-Tag network, UT, is a bipartite network between users and tags where an edge from u_i to tag t_j is drawn if u_i answers some question q_k where t_j ∈ Tags[q_k]. This graph has |V| = 450,521 and |E| = 3,690,086, which is much larger than the TT network. The impetus behind examining this graph is that if we can locate a set of users that are authorities for a group of tags, we can prioritize questions that are answered by those users in a random walk. Furthermore, it is of value to determine whether users typically answer questions in a relatively small subset of the tag-space.

Fig. 5. Degree Distribution of UT network

Examining the figure above, notice that the degree distribution of users decays much more quickly than the tag distribution. The graph shows that the majority of users answer questions within a relatively small subset of the tag-space, with a spike in density at around 6. Those users that answer questions over a large tag-domain, greater than 300, on average answer more than 3 times as many questions as those users confined to a small subset of the tag-space.

In order to create some structure for the massive UT network, Louvain community detection was applied to the graph. A total of 59 communities were found, with 12 communities containing over 1% of the total nodes in the network. The top 5 communities contain around 70% of all of the nodes
and the partition achieved a modularity score of 42%. The results indicate that we are able to create structure within the user-tag space. This is of value since one can utilize the network structure to restrict the search space. If user u_i is within a community that is commonly associated with t_j, we can weight more heavily questions that contain t_j. For example, community c_0 has mainly tags associated with lower level systems engineering, e.g. questions with tags such as linux, binary, declarative-pipeline. Given a question with tags in c_0, one could recommend the questions answered by users of c_0, since they are strongly associated in the user-tag space.

IV. EVALUATION

The goal is to recommend related questions given an input question. In order to quantitatively assess the quality of the recommendations, it is first necessary to devise metrics of similarity. The three primary metrics used in this paper are Title Similarity, Text Similarity, and Tag Overlap. Recall that given an input question q̃, the algorithm returns a set of recommended questions Q_r.

A. Word Embeddings

Given q̃ and Q_r, there are a variety of methods to compute semantic similarity between the text in q̃ and the text in the questions within Q_r. A baseline method uses a frequency based approach and assigns each document q_j within Q_r a score based on the cosine similarity of the TF-IDF vector for q_j and the TF-IDF vector of q̃, where the TF-IDF matrix is computed on q̃ ∪ Q_r.⁶ However, this approach does not take into account the larger corpus of questions within Stack Overflow and yields quite restrictive representations of words.

Word embeddings provide a mapping from natural language to a rich real-valued vector space and are commonly used in natural language processing techniques. The word2vec algorithm maps a word to a fixed-length, dense and real-valued vector by training a neural network to predict a target word given the context words⁷, learning the word representations in the process. Google has pre-trained word2vec embeddings available for download; however, many of the hyper-specific programming words are not contained within the embedding space. Therefore, using the data collected from Stack Overflow, three sets of word embeddings were trained using gensim’s word2vec implementation [11]. One set of embeddings was trained on only the titles, another on only the text, and lastly one where each document was composed of the concatenation of the title and body. The embeddings using the concatenation of title and body were used to score the results.

⁶ Term-Frequency Inverse Document Frequency.
⁷ This is the CBOW implementation of Word2vec.

These embeddings give a mechanism to cluster questions by semantic meaning. For example, the plot below shows the result of a PCA projection of the word embeddings for all tags within the dataset. The colors represent clusters found using K-Means on the word embedding space with k = 3.

Fig. 6. Word Embedding Clustering

For a given question q_i, the average of the word embeddings within q_i was used to represent the document. The semantic similarity between q_i and q_j was approximated by the cosine similarity of the vector representations of q_i and q_j. Formally, let t_i, b_i denote the vector representations of the ith question’s title and body, respectively. For both the title and the text body

$$S_{\text{title}} = \frac{\langle t_i, t_j \rangle}{\|t_i\| \, \|t_j\|}$$

$$S_{\text{body}} = \frac{\langle b_i, b_j \rangle}{\|b_i\| \, \|b_j\|}$$

The tag overlap between q̃ and the questions in Q_r was also calculated. Let T(q) be the set of tags of a question q. For a given recommended question q_r, the tag overlap is defined as

$$O_T = \frac{|T(q) \cap T(q_r)|}{|T(q) \cup T(q_r)|}$$

The metrics above are averaged across the q_r in the recommendation set. Let Ō_T, S̄_title, S̄_body denote the average scores of the recommendation sets Q_r^i for a set of experiments i = 1, ..., N. These metrics are then compared to the average scores of Stack Overflow’s recommended questions.⁸

⁸ These recommended questions are in the relatedlinks table mined from the Stack Overflow site.

V. EXPERIMENTS

All algorithms were coded using the NetworkX python library. Computationally expensive tasks or obviously parallel tasks were performed on an AWS server with 32 cores. The experiments were run for 10,000 starting nodes within the QT network of size |V| ∼ 800,000, |E| ∼ 3 million. For a given recommendation set, the average text, title, and tag-overlap scores were calculated. These scores were then averaged across all trials of the experiment to yield a final score. Unless otherwise stated, the restart parameter of Personalized Page
rank was set to α = .3. All variations of the standard PPR run for 10000 iterations or until the 10th recommended question has a count greater than 30.

A. Vanilla Random Walk with Restarts

The first algorithm implements the standard Personalized Page Rank algorithm on the QT graph. It follows closely the Pinterest algorithm introduced in Jure Leskovec’s work [1]. The pseudocode for the algorithm is shown below:

    import random
    from collections import Counter

    def random_walk_with_restart(query_question_set, G, num_steps, alpha):
        """Count question visits; query_question_set is a list of seed questions."""
        count = Counter()
        query_question = random.choice(query_question_set)
        for _ in range(num_steps):
            tag_node = random.choice(list(G[query_question]))   # getRandomTag
            query_question = random.choice(list(G[tag_node]))   # getRandomQuestion
            count[query_question] += 1
            if random.random() <= alpha:                        # restart the walk
                query_question = random.choice(query_question_set)
        return count

Starting at an input question, it takes a random walk to a tag-node, and then randomly walks to the following question. Recalling Figure 2, one can see where this algorithm will fail. If from a given question each tag-node is equally likely, often the algorithm will choose a tag that is far too broad. For example, if the input question contains the tags <python> <django> <django-models> <sql>, with probability 1/4 the algorithm will jump to <python>, which links to nearly a million questions! Indeed, the algorithm should have some way to restrict its search. Otherwise, if the algorithm does not take into account the hierarchical ordering of tags, it will with high probability jump to an incredibly general tag, thereby expanding the random walk space and reducing the specificity of results. It is as if there existed a Pinterest board that encompassed all of the pins. Another issue also stemming from the implicit hierarchical behavior is that the ordering of the tags matters. Say, for example, the algorithm jumps to the tag node <sql>. <sql> is linked to a huge body of questions that do not at all pertain to <django>. Similarly <django> is linked to a huge number of questions that do not contain <sql>. In theory, the overlapping links for questions with both <django> <sql> should cause the pertinent questions to rank higher, but there is a high probability that this will not happen.⁹ The issue stems from the out-degree of the tag-nodes. If there is no way to prioritize question nodes linked to both sql and django, it is difficult to restrict the search space. The results detail the average Tag-Overlap score, title semantic-similarity score, and text semantic-similarity score of the naive PPR method run on 10000 random starting nodes. The average semantic similarity for both the text and the title were calculated using the word embeddings described in the previous section. I also report the average standard deviation of the scores across each recommended set.¹⁰

⁹ Due to the number of questions linked to both tag groups.
¹⁰ Starting nodes were processed in parallel. Due to the size of the graphs the hyperparameter α was not optimized for

TABLE I
RANDOM WALK W/ RESTART ALGORITHM, α = .3

Relevance Metric    PPR       Stack Overflow Baseline
Ō_T                 0.5361    0.6509
S̄_title             0.5205    0.7382
S̄_text              0.7397    0.8184
σ_S̄title            0.0665    0.0449
σ_S̄text             0.0273    0.0182

As shown above, the vanilla PPR algorithm under-performs as compared to the Stack Overflow “related” questions across all categories. This is likely attributable to the fact that some tags have very high degree, thereby expanding the search space of the algorithm too widely. Examining this further, the experiment was rerun restricting the QT to those questions with more than 2 tags (i.e. questions tagged more specifically). The idea here is that there is a greater likelihood of the algorithm jumping to tags with higher specificity. The results are shown below.

TABLE II
RANDOM WALK W/ RESTART ALGORITHM ON RESTRICTED QT GRAPH, α = .3

Relevance Metric    PPR         Stack Overflow Baseline
Ō_T                 0.5425      0.6109
S̄_title             0.674971    0.7199
S̄_text              0.79680     0.8232
σ_S̄title            0.0856      0.0624
σ_S̄text             0.0246      0.0209

The results returned by PPR, though improved, are still too general. This is likely attributable to the fact that the PPR algorithm jumps to all tags with equal probability, thereby ignoring the loose ordering of tag specificity.

B. Guided Random Walk with Restart

In order to restrict the search space, a modified weighted Personalized Page Rank algorithm was adopted. In particular, given a question q_i, the probability of transitioning to a tag-node t_j is inversely proportional to the out-degree of t_j. This makes the event of transitioning to a tag linked to a massive number of questions more unlikely. Furthermore, I weighted the edges based on their ordering within the question. The ultimate tag had the highest weighting, followed by the penultimate tag, and so forth. To prevent the algorithm from overly weighting tags with the fewest linked questions, a maximum probability of a transition was set to be equal to .8. Formally,

$$P[q_i \to t_j] \propto \min\left( c \, \frac{r_{t_{ij}}}{d_{t_j}}, \; .8 \right)$$

where r_{t_ij} is the rank of tag j in question i, d_{t_j} is the degree of tag j and c is a normalizing constant.

The next modification made assigns a higher probability of traveling to a question-node from a tag-node if the question node weights the tag similarly to the source question-node. Formally, given that the algorithm jumps from q_i → t_j with probability p_ij, the algorithm assigns a higher probability to
the transition t_j → q_k for nodes q_k that have p_kj close to p_ij. Formally,

$$P[t_j \to q_k] \propto c \, \big| P[q_k \to t_j] - P[q_i \to t_j] \big| + F_k + V_k$$

where c is a normalizing constant and F_k, V_k are the favorite count and view count respectively. Intuitively, this guides the algorithm to questions that have a more similar tag structure to the source question.

The results of running the guided PPR algorithm on 10000 random nodes are shown below:

TABLE III
GUIDED RANDOM WALK W/ RESTART, α = .3

Relevance Metric    Guided PPR    Stack Overflow Baseline
Ō_T                 0.7432        0.5962
S̄_title             0.7187        0.7215
S̄_text              0.9057        0.8628
σ_S̄title            0.1125        0.0741
σ_S̄text             0.0314        0.0311

The results above suggest a significant boost in performance as compared to the unweighted Random Walk algorithm across all categories. Indeed, the weighted PPR algorithm achieves better performance than Stack Overflow’s related set for both Tag-Overlap and Text-Similarity.¹¹ These results make sense since the algorithm essentially guides the walk to tag-nodes with more specificity. This is ideal if the goal is to recommend questions within a general category, but the ultimate goal is to recommend questions that may have the answer. Furthermore, it is important to note that the weighted PPR is slower than the vanilla algorithm due to the weight updates at each tag node. In addition, since this algorithm is jumping to questions through tags, it is unlikely that it will give high weight to questions with a very different tag set. Sometimes a question is erroneously tagged. Since this algorithm explores via a guided walk through the tag-space, it is unlikely to find the answer if the question is too generally tagged or tagged erroneously. To improve results one could rank the recommendations of the PPR algorithm by their semantic similarity to the input question. Indeed, when taking the top 100 results from PPR and ranking each question by semantic similarity to the input question, the algorithm achieves better performance scores. However, this process isn’t fair since it uses the evaluation metric to guide the procedure and therefore is not included in the results.

C. Attempted Variations

1) Incorporating User Information in User-Tag-Question Graph: I attempted to incorporate user information using a
as in the UT graph. User u_j is linked to tag t_k with probability proportional to the number of times the user asks a question tagged with t_k. The links between q_i and t_k were calculated as in the guided random walk. At each iteration i, the algorithm can choose to jump from a question to a tag or from a question to a user and then to a tag. The results, on preliminary analysis, were more diffuse, in that they recommended questions from a large tag-subset. This algorithm, however, did not scale well, since at each step the algorithm must filter the edges based on their grouping within the graph. This is computationally expensive for such a large graph, and it was therefore excluded from the analysis.

D. Incorporating Community Structure

The last variation of the PPR algorithm, which remains partially implemented¹², utilizes the communities of the UT¹³ to restrict the tag-space of the QT to those induced by the community to which the input question belongs. Formally, for a given q_i, denote the community to which the owner of q_i belongs as c_i. For a given input question q_i, find the subgraph induced by t_j ∈ c_i within the QT. In particular, drop all t_k ∉ c_i and all q_k for which no tag in T(q_k) lies in c_i, where T(q_k) is the tag set of question k.

VI. EXTENSIONS & FUTURE WORK

This paper utilizes variations on the Personalized Page Rank algorithm for question recommendation on a subset of the Stack Overflow Network with around 1 million nodes and 3 million edges. This paper implements variations of the PPR on the Question → Tag graph to recommend “related” questions. This paper finds that guided PPR works well, achieving better results on average than Stack Overflow in terms of semantic similarity and tag overlap. Nevertheless, there appears to be a limit on the specificity of results able to be achieved using random walk algorithms. Using natural language processing on the text within a candidate set after PPR recommendation is an interesting avenue of exploration to increase the specificity of results.

As addressed in the introduction, an obvious limitation of the experiments is that it is not clear whether the metrics used (semantic similarity of text/title and tag overlap) to assess the quality of a recommendation set are actually useful in practice. It may be the case that PPR performs moderately well using the metrics defined in this paper but fails miserably at another metric such as user overlap. Furthermore, if one is optimizing for text similarity, it may be more accurate and faster to use elastic search over the raw database¹⁴.

Another limitation is that the algorithms described above are restricted to nodes on the network. They do not generalize to natural language. Therefore, in order to transfer from input
four-way graph between questions users and tags. Question qi question node to input query while at the same time utilizing
links to uj if uj asked question qi . Question qi is linked to tk network information requires additional processing.
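The re-ranking idea discussed above (taking the PPR candidate set and ordering it by text similarity to the input question) can be sketched as follows. This is only an illustrative sketch, not the code used in the experiments: the bag-of-words cosine similarity is a simple stand-in for the semantic similarity model of the paper, and the function names and the `(question_id, ppr_score, text)` candidate layout are assumptions made for this example.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-case the text and count word tokens (crude stand-in for a semantic model)."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rerank_candidates(query_text, candidates, top_k=10):
    """Re-order PPR candidates by text similarity to the query.

    candidates: list of (question_id, ppr_score, text) tuples (hypothetical layout).
    Ties in similarity are broken by the original PPR score.
    """
    query_vec = bag_of_words(query_text)
    scored = [
        (cosine_similarity(query_vec, bag_of_words(text)), ppr_score, qid)
        for qid, ppr_score, text in candidates
    ]
    scored.sort(key=lambda item: (item[0], item[1]), reverse=True)
    return [qid for _, _, qid in scored[:top_k]]
```

As noted in the discussion of the results, a re-ranking step of this kind should be judged with a metric other than the similarity it optimizes; otherwise the evaluation is circular.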

11 The PPR and guided PPR did not receive the same input set, which may
raise concerns about drawing comparisons between them. However, since both
sets of random trials are compared to the baseline and the number of random
starting nodes is large, I believe this is unimportant. The goal is to evaluate
each algorithm against the baseline, not to compare the two algorithms with each other.
12 It proved too computationally expensive to run large simulations on the
graph.
13 It is in fact a slight variation of the graph discussed: there is a link
u_i → t_j if user i asks a question with tag j.
14 This is true, and it is faster and more intuitive.
One potential extension is to recast the problem in a
semi-supervised setting. One could use node2vec [6] to learn
feature representations of the nodes within the UT and TT
graphs.15 One could then, for a given input user, use the word
embeddings of the input query, the node embedding of the
user, and the tag embeddings of the user’s previous questions
to predict a set of tags or even a set of questions. In the case
of tags, labels are given, whereas for questions one could use
Stack Overflow’s PostLinks table. Indeed, utilizing learned
node embeddings to predict the recommendation set using
semi-supervised learning methods, though beyond the scope
of this paper, would be a great avenue of exploration.
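As a rough illustration of the input to such a predictor, the three sources of information named above could be combined into one feature vector. The sketch below is an assumption, not part of the implemented project: it averages the query's word embeddings and the user's previous tag embeddings and concatenates them with the user's node embedding. The pooling choice and the function names are hypothetical, and the prediction step itself is deliberately left out.

```python
def mean_vector(vectors):
    """Element-wise mean of equal-length vectors (an empty input yields an empty vector)."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def build_features(query_word_vectors, user_vector, previous_tag_vectors):
    """Concatenate pooled query, user, and tag embeddings into one feature vector.

    query_word_vectors: word embeddings of the input query, one per token.
    user_vector: node embedding of the user (for example, learned with node2vec).
    previous_tag_vectors: embeddings of the tags on the user's earlier questions.
    """
    query_part = mean_vector(query_word_vectors)
    tag_part = mean_vector(previous_tag_vectors)
    return query_part + list(user_vector) + tag_part
```

The resulting vector could be passed to any multi-label classifier for tags, or to a ranking model trained on linked question pairs.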
APPENDIX A
CODE
Code for the project can be found at
https://github.com/jacobperricone/224w.
ACKNOWLEDGMENT
I would like to thank Poorvi Bhargava for her help
throughout the project and Professor Leskovec for an enjoyable
and informative class.
REFERENCES
[1] Jure Leskovec. Link analysis: PageRank and HITS, Nov 2017.
[2] David Easley and Jon Kleinberg. Networks, Crowds, and Markets: Reasoning
about a Highly Connected World. 2010.
[3] David C. Liu, Stephanie Rogers, Raymond Shiau, Dmitry Kislyuk, Kevin C. Ma,
Zhigang Zhong, Jenny Liu, and Yushi Jing. Related pins at Pinterest:
The evolution of a real-world recommender system. 2017.
[4] C. Eksombatchai, P. Jindal, J. Z. Liu, Y. Liu, R. Sharma, C. Sugnet, M. Ulrich,
and J. Leskovec. Pixie: A system for recommending 1+ billion items to
150+ million Pinterest users in real-time. 2017.
[5] Priyank Mathur, Siamak Shakeri, and Dibyajyoti Ghosh. User recommendation
for Stack Exchange. 2016.
[6] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for
networks. 2016.
[7] Gideon Dror, Yehuda Koren, Yoelle Maarek, and Idan Szpektor. I want
to answer; who has a question?: Yahoo! answers recommender system.
In Proceedings of the 17th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, KDD ’11, pages 1109–1117,
New York, NY, USA, 2011. ACM.
[8] Shaowei Wang, David Lo, Bogdan Vasilescu, and Alexander Serebrenik.
EnTagRec: An enhanced tag recommendation system for software information
sites. 2014.
[9] Logan Short, Christopher Wong, and David Zeng. Tag recommendations
in StackOverflow. CS224W Project, 2014.
[10] StackExchange. https://archive.org/details/stackexchange.
[11] Radim Řehůřek and Petr Sojka. Software Framework for Topic Modelling
with Large Corpora. In Proceedings of the LREC 2010 Workshop
on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta,
May 2010. ELRA.

15 I actually explored this avenue, creating node embeddings, but did not
have time to implement the prediction step