NLMF: Nonlinear Matrix Factorization Methods For Top-N Recommender Systems
Abstract—Many existing state-of-the-art top-N recommendation methods model users and items in the same latent space and compute the recommendation scores via the dot product between those vectors. These methods assume that the user preference is consistent across all the items that he/she has rated. This assumption is not necessarily true, since many users have multiple personas/interests and their preferences can vary with each such interest. To address this, a recently proposed method models users with multiple interests. In this paper, we build on this approach and model users using a much richer representation. We propose a method which models the user preference as a combination of a global preference and interest-specific preferences. The proposed method uses a nonlinear model to predict the recommendation score, which is used to perform the top-N recommendation task. The recommendation score is computed as a sum of the scores from the components representing the global preference and the interest-specific preferences. A comprehensive set of experiments on multiple datasets shows that the proposed model outperforms other state-of-the-art methods on the top-N recommendation task.

Keywords—Database Applications, Data Mining, Personalization, Mining Methods and Algorithms

I. INTRODUCTION

Recommender systems are prevalent and widely used in many applications. Specifically, top-N recommender systems are widely used in e-commerce applications to recommend a ranked list of items to each user, identifying the items that best fit the user's personal tastes as inferred from his/her feedback. Over the years many algorithms and methods have been developed to address the top-N recommendation problem (1; 2). These algorithms make use of user feedback data available in the form of purchases, ratings, or reviews. The existing methods can be broadly classified into two groups: collaborative filtering (CF) based methods and content-based methods. Collaborative filtering methods build their models from user/item co-ratings; typically they represent the user ratings on items in a user-item rating matrix and operate on it. One class of state-of-the-art methods for the top-N recommendation problem is based on learning latent factors for users and items. In these methods, users and items are represented as vectors in a common latent space and the recommendation score for a given user-item pair is computed as the dot product of the corresponding user and item latent vectors. The most notable methods rely on matrix factorization (MF) (2) or singular value decomposition (SVD) (3) to learn the user and item latent factors. Several extensions and variations of SVD have also been proposed (e.g., SVD++ (4)). In content-based methods, user/item features are used to build the models (5; 6). In this work, we limit our focus to CF-based methods.

One of the recently developed methods, MaxMF (7), extends traditional matrix factorization (MF) based approaches by representing each user with multiple latent vectors, each corresponding to a different "taste" associated with that user. These different tastes in the user representation are termed interests. The assumption behind this approach is that letting users have multiple interests helps capture user preferences better, especially when a user's itemset or interests are diverse. The authors then propose a max-function based nonlinear model, which takes the maximum-scoring interest as the final recommendation score for a given user-item pair. MaxMF was shown to achieve better recommendation performance than other state-of-the-art methods. However, one limitation of this method is that it models users with only the interest-specific component. This can potentially dilute the learnt latent factors for users who have not provided enough preferences or who do not have enough diversity in their itemsets, due to the lack of support (in terms of number of rated items) for each of the interests.

In this paper, we propose a new method called NLMF (Nonlinear Matrix Factorization), which models each user as a combination of global-preference and interest-specific latent factors. This representation allows NLMF to effectively capture both the global preference and multiple interest-specific preferences. The approach implicitly allows the model to strike a balance between the global and interest-specific components. Our experimental evaluation on multiple datasets shows that NLMF performs better than MaxMF and other state-of-the-art methods.

The key contributions of the work presented in this paper are the following:
(i) it proposes a new nonlinear method, which models the users' multiple interests as a combination of global and interest-specific preferences;
(ii) it proposes two different approaches based on shared and independent item factors between the global preference and the interest-specific preferences; and
(iii) it compares the performance of the proposed method with other state-of-the-art methods on the top-N recommendation task, and investigates the impact of various parameters such as the number of latent factors and the number of interests.
For users who have not provided sufficient ratings, learning only interest-specific preferences will result in lesser support (in terms of number of items) for each interest. This can potentially affect the learning process and result in learning less meaningful (latent) factors for all the item partitions corresponding to that user.

To overcome this problem, our proposed approach NLMF learns the user preferences as a combination of global-preference and interest-specific preference components. The global preference is learned using all the ratings provided by the user; thus, it helps to better estimate the user's preferences when the available data is limited. With regularization, this method allows the model to be flexible, i.e., it implicitly allows the learning process to strike a balance between the global-preference and interest-specific preference components. Hence this model is expected to perform better than MaxMF.

V. NLMF - NONLINEAR METHODS FOR CF

In NLMF, given a user u and an item i, the estimated rating $\hat{r}_{ui}$ is given by the sum of the estimates from the global-preference and interest-specific preference components. That is,

$$\hat{r}_{ui} = \mathbf{p}_u \mathbf{q}_i^T + \max_{t=1,\ldots,T} f(u, i, t), \qquad (3)$$

where $\mathbf{p}_u$ is the latent vector associated with user u and $\mathbf{q}_i$ is the latent vector associated with item i. Thus, $\mathbf{p}_u \mathbf{q}_i^T$ gives the prediction score from the global-preference component of the model and $f(u, i, t)$ is the prediction score from the interest-specific preference component. The final prediction score is the sum of the predictions from the global-preference and interest-specific preference components. Figure 1 illustrates the overview of the NLMF method.

The selection of the best interest $t^*$ is done by choosing the interest which results in the maximum score from the multiple-interests model. The max function computes the maximum recommendation score for the item among all the interests of the user. The intuition behind this idea is that, for an item to be ranked higher in the top-N list of a user, at least one of the interests of the user must provide a high score for that item.

We use the squared error loss function to compute and minimize the loss. That is,

$$\mathcal{L}(\cdot) = \sum_{i \in D} \sum_{u \in C} (r_{ui} - \hat{r}_{ui})^2, \qquad (4)$$

where $r_{ui}$ is the ground-truth value and $\hat{r}_{ui}$ is the estimated value.

We propose two different methods to represent the interest-specific preference component $f(u, i, t)$. The first has item factors in $f(u, i, t)$ that are independent of those in the global-preference component, whereas the second shares the item factors of $f(u, i, t)$ with the global-preference component. These two methods are described in the next two sections.

A. NLMFi - Independent Item Factors

In NLMFi, the interest-specific preference component $f(u, i, t)$ is given by

$$f(u, i, t) = \mathbf{w}_{ut} \mathbf{y}_i^T, \qquad (5)$$

where $\mathbf{w}_{ut}$ is the user latent vector for u in the interest-specific preference component corresponding to interest t, and $\mathbf{y}_i$ is the item latent vector in the interest-specific preference component. We can see that, for a given item i, NLMFi has two independent item factors ($\mathbf{q}_i$ and $\mathbf{y}_i$), one for the global-preference component and one for the interest-specific preference component.

The recommendation score $\hat{r}_{ui}$ for a given user u and item i is computed as

$$\hat{r}_{ui} = \mathbf{p}_u \mathbf{q}_i^T + \max_{t=1,\ldots,T} \mathbf{w}_{ut} \mathbf{y}_i^T, \qquad (6)$$

where $\mathbf{p}_u$ and $\mathbf{q}_i$ are the user and item latent vectors in the global-preference component, respectively. Thus, NLMFi is an additive model which independently learns two non-overlapping models corresponding to the global-preference and interest-specific components and computes their sum as the final prediction score.

Note that the number of latent factors for the global-preference component (i.e., $\mathbf{p}_u \mathbf{q}_i^T$) and the interest-specific component (i.e., $\mathbf{w}_{ut} \mathbf{y}_i^T$) need not be the same; the model has the flexibility of using a different number of latent factors for the two components. We use k to denote the number of latent factors for the global-preference component (i.e., $\mathbf{p}_u, \mathbf{q}_i \in \mathbb{R}^{1 \times k}$) and l to denote the number of latent factors for the interest-specific component (i.e., $\mathbf{w}_{ut}, \mathbf{y}_i \in \mathbb{R}^{1 \times l}$).

In NLMFi, the matrices P, Q, W and Y are learned by minimizing the following regularized optimization problem:

$$\underset{P,Q,W,Y}{\text{minimize}} \;\; \frac{1}{2} \sum_{u,i \in R} \| r_{ui} - \hat{r}_{ui} \|_F^2 + \frac{\lambda}{2} \left( \|P\|_F^2 + \|Q\|_F^2 + \|W\|_F^2 + \|Y\|_F^2 \right), \qquad (7)$$

where $\lambda$ is the $\ell_2$-regularization constant for the latent factor matrices. $\ell_2$ regularization is used to prevent overfitting.

The optimization problem in Equation 7 is solved using a Stochastic Gradient Descent (SGD) algorithm (10). Algorithm 1 provides the detailed procedure and the gradient update rules for the learning algorithm. Initially the matrices P, Q and Y, and the tensor W, are initialized with small random values as the initial estimate. Then, in each iteration the parameter values are updated based on the gradients computed w.r.t. the parameter being updated. This process is repeated until the error on a validation set stops decreasing or the number of iterations reaches a predefined threshold.

Note that the gradient updates for the model parameters are computed for both rated and non-rated entries of R. This is in accordance with common practice for the top-N recommendation problem (3; 11; 12), and in contrast with the rating prediction problem, where only the rated items are typically used for computing gradient updates. In order to reduce the computational complexity of the learning process, the zero entries corresponding to non-rated items are sampled and used along with all the non-zero entries (corresponding to rated items) of R. Given a sampling constant $\rho$ and nnz(R), the number of non-zeros in R, $\rho \cdot$ nnz(R) zeros are sampled and used for optimization in each iteration of the learning algorithm. Our experimental results indicate that a small value of $\rho$ (in the range 3-5) is sufficient to produce the best model. This sampling strategy makes the NLMF methods computationally efficient and scalable.
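For concreteness, the following is a minimal NumPy sketch (not the implementation used in this paper) of the NLMFi score in Equation (6) and of the zero-sampling step described above. The helper names nlmfi_score and sample_zeros, the use of a dense 0/1 rating matrix, and the array shapes are illustrative assumptions.

```python
import numpy as np

def nlmfi_score(P, Q, W, Y, u, i):
    """NLMFi score (Eq. 6): global term p_u q_i^T plus the best
    interest-specific term max_t w_ut y_i^T.
    P: (n_users, k), Q: (n_items, k)      -- global-preference factors
    W: (n_users, T, l), Y: (n_items, l)   -- interest-specific factors
    """
    global_term = P[u] @ Q[i]                # p_u q_i^T
    interest_terms = W[u] @ Y[i]             # one score per interest t
    t_star = int(np.argmax(interest_terms))  # winning interest t*
    return global_term + interest_terms[t_star], t_star

def sample_zeros(R, rho, rng):
    """Sample rho * nnz(R) zero (non-rated) entries of the dense 0/1
    rating matrix R; they are used together with all non-zero entries."""
    zero_u, zero_i = np.nonzero(R == 0)
    n_samples = int(rho * np.count_nonzero(R))
    idx = rng.choice(len(zero_u), size=min(n_samples, len(zero_u)), replace=False)
    return list(zip(zero_u[idx], zero_i[idx]))
```

Each training epoch would then iterate, in random order, over all rated (u, i) pairs plus the sampled zeros and apply the SGD updates given in Algorithm 1.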
Fig. 1: Overview of the NLMF method. The learned user preferences are modeled as a global preference together with interest-specific components (Interest-1, Interest-2, Interest-3).
Algorithm 1 NLMFi:Learn
1: procedure NLMFiLearn
2:   η ← learning rate
3:   λ ← ℓ2 regularization weight
4:   ρ ← sample factor
5:   iter ← 0
6:   Init P, Q, W and Y with random values in (-0.001, 0.001)
7:
8:   while iter < maxIter or error on validation set decreases do
9:     R′ ← R ∪ SampleZeros(R, ρ)
10:    R′ ← RandomShuffle(R′)
11:
12:    for all r_ui ∈ R′ do
13:      r̂_ui ← p_u q_i^T + max_{t=1,...,T} w_ut y_i^T
14:
15:      t* ← interest corresponding to the max score
16:      e_ui ← r_ui − r̂_ui
17:      p_u ← p_u + η · (e_ui · q_i − λ · p_u)
18:      q_i ← q_i + η · (e_ui · p_u − λ · q_i)
19:      w_ut* ← w_ut* + η · (e_ui · y_i − λ · w_ut*)
20:      y_i ← y_i + η · (e_ui · w_ut* − λ · y_i)
21:    end for
22:
23:    iter ← iter + 1
24:  end while
25:
26:  return P, Q, W, Y
27: end procedure

B. NLMFs - Shared Item Factors

In NLMFs, the interest-specific component $f(u, i, t)$ is given by

$$f(u, i, t) = \mathbf{w}_{ut} \mathbf{q}_i^T, \qquad (8)$$

where $\mathbf{w}_{ut}$ is the user latent vector for u in the interest-specific component corresponding to interest t, and $\mathbf{q}_i$ is the item latent vector shared between the global-preference and interest-specific components. By using shared item latent vectors, this model has the ability to transfer learning between the global-preference and interest-specific components. Contrast this with the NLMFi model, which has independent item factors ($\mathbf{q}_i$ and $\mathbf{y}_i$) for the global-preference and interest-specific components.

The recommendation score $\hat{r}_{ui}$ for a given user u and item i is computed as

$$\hat{r}_{ui} = \mathbf{p}_u \mathbf{q}_i^T + \max_{t=1,\ldots,T} \mathbf{w}_{ut} \mathbf{q}_i^T, \qquad (9)$$

where $\mathbf{p}_u$ is the user latent vector for u in the global-preference component.

In NLMFs, the matrices P, Q and W are learned by minimizing the following regularized optimization problem:

$$\underset{P,Q,W}{\text{minimize}} \;\; \frac{1}{2} \sum_{u,i \in R} \| r_{ui} - \hat{r}_{ui} \|_F^2 + \frac{\lambda}{2} \left( \|P\|_F^2 + \|Q\|_F^2 + \|W\|_F^2 \right), \qquad (10)$$

where the common terms mean the same as in Equation 7.

Similar to NLMFi, the optimization problem in Equation 10 is solved using an SGD-based algorithm. The detailed procedure is presented in Algorithm 2. The learning algorithm and its details are similar to Algorithm 1, except for the gradient update rules.
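The listing of Algorithm 2 is not included in this excerpt. As an illustration of how the updates change when the item factors are shared, the following is a minimal NumPy sketch of one NLMFs SGD step derived from Equations (9) and (10); it is a sketch under assumed conventions, not the authors' implementation. The function name nlmfs_sgd_step, the dense array layout, and the shape conventions are assumptions.

```python
import numpy as np

def nlmfs_sgd_step(P, Q, W, u, i, r_ui, eta, lam):
    """One SGD update for NLMFs (Eqs. 9-10).

    P: (n_users, k), Q: (n_items, k), W: (n_users, T, k).
    With shared item factors, q_i receives gradient contributions from
    both the global term p_u q_i^T and the winning interest term w_ut* q_i^T.
    """
    interest_scores = W[u] @ Q[i]              # one score per interest
    t_star = int(np.argmax(interest_scores))   # winning interest t*
    r_hat = P[u] @ Q[i] + interest_scores[t_star]
    e_ui = r_ui - r_hat                        # prediction error

    p_u, q_i, w_ut = P[u].copy(), Q[i].copy(), W[u, t_star].copy()
    P[u]         = p_u  + eta * (e_ui * q_i - lam * p_u)
    Q[i]         = q_i  + eta * (e_ui * (p_u + w_ut) - lam * q_i)  # shared factor
    W[u, t_star] = w_ut + eta * (e_ui * q_i - lam * w_ut)
    return e_ui
```

Compared with Algorithm 1, only the q_i update changes: it sums the error signals from the global and interest-specific components, which is how the shared model transfers learning between them.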
VII. RESULTS

The experimental evaluation consists of three parts. First, we assess the effect of the various model parameters of NLMF. Second, we present a comparison with the MaxMF method, which is also a nonlinear method based on modeling the user with multiple interests. Due to lack of space, we present these studies only for the Netflix dataset; however, the same trends in results and conclusions carry over to the Flixster dataset as well. In the [...]

(Figures: HR for K = 64, 96, and 128; captions not recoverable.)
[...] the two-interests one. In contrast, for the NLMF methods, the two-interests model performs better than the one-interest model for all values of k. This is possibly because MaxMF learns only the interest-specific user preference, which can potentially lead to a decrease in the support for each interest in terms of the number of items. On the other hand, the NLMF methods balance the interest-specific preference with the global preference by learning a combined model with both the global-preference and interest-specific components.

Fig. 5: Comparison with MaxMF (HR vs. number of factors for MaxMF, NLMFs, and NLMFi with T = 1 and T = 2).

D. Comparison with Other Approaches

Table III shows the overall recommendation performance of the NLMF methods in terms of HR and ARHR in comparison to other state-of-the-art methods (Section VI-C). For all the results presented, the number of top-N items chosen is 10 (i.e., N = 10). The following parameter space was explored for each of the methods, and the best-performing model in that parameter space in terms of HR is reported. For UserKNN, PureSVD and BPRMF, the parameter k was selected from the range 2 to 800. The learning rate for BPRMF was selected from the range 10^-5 to 1.0, with a multiplicative increment of 10. For SLIM, the regularization constants were selected from the range 10^-5 to 20. For the MaxMF and NLMF methods, the regularization constants were selected from the range 10^-5 to 5 and the learning rate was selected from the range 10^-5 to 1.0.

TABLE III: Comparison with other approaches on the Netflix and Flixster datasets.

                      Netflix                              Flixster
Method     Params                HR      ARHR    Params                HR      ARHR
UserKNN    100 - - -             0.1412  0.0515  100 - - -             0.1013  0.0295
PureSVD    50 - - -              0.1821  0.0807  100 - - -             0.1273  0.0494
BPRMF      400 0.01 - -          0.1890  0.0813  200 0.01 - -          0.1165  0.0437
SLIM       0.001 0.1 - -         0.1888  0.0872  0.01 1.0 - -          0.1303  0.0502
MaxMF      192 2 0.0005 0.0005   0.1743  0.0704  160 2 0.0001 0.0005   0.1345  0.0493
NLMFs      192 2 0.01 0.0005     0.1975  0.0870  256 2 0.01 0.005      0.1401  0.0532
NLMFi      256/160 2 0.008 0.001 0.1999  0.0835  288/192 2 0.01 0.001  0.1441  0.0546

The "Params" columns list the model parameters for the corresponding method. For UserKNN, the parameter is the number of neighbors. For PureSVD, the parameter is the number of latent factors. For BPRMF, the parameters are the number of latent factors and the learning rate. For SLIM, the parameters correspond to the ℓ2 and ℓ1 regularization constants. For the MaxMF and NLMFs methods, the parameters correspond to the number of latent factors, the number of interests, the regularization constant and the learning rate. For the NLMFi method, the parameters correspond to the number of latent factors for the global-preference and interest-specific components, the number of interests, the regularization constant and the learning rate. The HR and ARHR columns report the hit rate and average reciprocal hit rank metrics. Underlined numbers represent the best-performing model measured in terms of HR for each dataset.

The results in Table III show that the NLMF methods perform better than the rest of the competing methods for all the datasets. The performance gains of the NLMF methods over the next-best-performing baseline method are of the order of 6% and 10% for Netflix and Flixster, respectively. Note that, contrary to the results presented in (7), the MaxMF model does not outperform PureSVD for the datasets considered in this study. Of the two proposed NLMF methods, the independent-item-factors model (NLMFi) achieved better performance than the shared-item-factors model (NLMFs). The reason for this could be that NLMFi learns the global-preference and interest-specific components independently, as the item factors do not overlap, thereby learning a better representation of users and items. This allows the model to strike a better balance between the two components compared to NLMFs, which shares the item factors during the learning process.

VIII. CONCLUSION

In this paper we presented a nonlinear matrix factorization based method (NLMF) for the top-N recommendation task. NLMF models the user's preferences using a richer representation and a nonlinear model for predicting the recommendation score used to perform the top-N recommendation task. The recommendation score is computed as a sum of the scores from the components representing the global preference and the interest-specific user preferences. For modeling the interest-specific component, we presented two different approaches: the first learns the item factors independently in the global-preference and interest-specific components, whereas the second shares the item factors between the global-preference and interest-specific components. The results showed that the proposed method outperforms the rest of the state-of-the-art methods in terms of top-N recommendation performance. As future work, we plan to evaluate this method on multiple datasets at different sparsity levels to measure how the NLMF methods perform relative to other methods when the training data gets sparser. We also plan to extend this work to the rating prediction task.

ACKNOWLEDGEMENTS

This work was supported in part by NSF (IIS-0905220, OCI-1048018, CNS-1162405, IIS-1247632, IIP-1414153, IIS-1447788), Army Research Office (W911NF-14-1-0316), Intel Software and Services Group, and the Digital Technology Center at the University of Minnesota. Access to research and computing facilities was provided by the Digital Technology Center and the Minnesota Supercomputing Institute.

REFERENCES

[1] G. Adomavicius and A. Tuzhilin, "Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 6, pp. 734-749, 2005.
[2] F. Ricci, L. Rokach, B. Shapira, and P. Kantor, Recommender Systems Handbook. Springer Science+Business Media, LLC, 2011.
[3] P. Cremonesi, Y. Koren, and R. Turrin, "Performance of recommender algorithms on top-n recommendation tasks," in Proceedings of the Fourth ACM Conference on Recommender Systems, 2010, pp. 39-46.
[4] Y. Koren, "Factorization meets the neighborhood: a multifaceted collaborative filtering model," in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2008, pp. 426-434.
[5] R. J. Mooney and L. Roy, "Content-based book recommending using learning for text categorization," in Proceedings of the Fifth ACM Conference on Digital Libraries. ACM, 2000, pp. 195-204.
[6] M. Pazzani and D. Billsus, "Content-based recommendation systems," The Adaptive Web, pp. 325-341, 2007.
[7] J. Weston, R. J. Weiss, and H. Yee, "Nonlinear latent factorization by embedding multiple user interests," in Proceedings of the 7th ACM Conference on Recommender Systems. ACM, 2013, pp. 65-68.
[8] J. A. Konstan, B. N. Miller, D. Maltz, J. L. Herlocker, L. R. Gordon, and J. Riedl, "GroupLens: applying collaborative filtering to Usenet news," Communications of the ACM, vol. 40, no. 3, pp. 77-87, 1997.
[9] U. Shardanand and P. Maes, "Social information filtering: algorithms for automating word of mouth," in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM Press/Addison-Wesley Publishing Co., 1995, pp. 210-217.
[10] L. Bottou, "Online algorithms and stochastic approximations," in Online Learning and Neural Networks, D. Saad, Ed. Cambridge, UK: Cambridge University Press, 1998, revised Oct. 2012. [Online]. Available: http://leon.bottou.org/papers/bottou-98x
[11] X. Ning and G. Karypis, "SLIM: Sparse linear methods for top-n recommender systems," in Data Mining (ICDM), 2011 IEEE 11th International Conference on. IEEE, 2011, pp. 497-506.
[12] S. Kabbur, X. Ning, and G. Karypis, "FISM: factored item similarity models for top-n recommender systems," in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2013, pp. 659-667.
[13] R. Gemulla, E. Nijkamp, P. J. Haas, and Y. Sismanis, "Large-scale matrix factorization with distributed stochastic gradient descent," in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2011, pp. 69-77.
[14] M. Deshpande and G. Karypis, "Item-based top-n recommendation algorithms," ACM Transactions on Information Systems (TOIS), vol. 22, no. 1, pp. 143-177, 2004.
[15] S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme, "BPR: Bayesian personalized ranking from implicit feedback," in Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI), 2009.