4543-File Feedback On Revisions From Reviewers - 15071-1-4-20221226
Decree of the Director General of Higher Education, Research, and Technology, No. 158/E/KPT/2021
Validity period from Volume 5 Number 2 of 2021 to Volume 10 Number 1 of 2026
Abstract
Collaborative Filtering is a method used in recommendation systems. It works by analyzing rating-data patterns and is used to predict user interest. The process begins with collecting and analyzing large amounts of information about users' behavior, activities, and tendencies; the results of the analysis are then used to predict what a user will like based on similarity to other users. Collaborative Filtering can also produce higher-quality recommendations than content-based and demographic-based recommendation systems. However, it still faces scalability and sparsity problems: the data keeps growing into big data, and much of it is incomplete, with many missing values. Therefore, this study proposed a clustering- and ranking-based approach. K-Means was used as the clustering algorithm, while the WP-Rank method was used for ranking. The experimental results showed that clustering made the running time faster, with an average execution time of 0.15 seconds. It also improved recommendation quality, as indicated by an increase in the NDCG value at k=22, where the average NDCG was 0.82, so the resulting recommendations were of higher quality and better matched user interests.
Keywords: collaborative filtering, scalability, sparsity, K-means, WP-Rank
recommendations. This is done with large-scale data conditions and requires a lot of resources and reliable computing.

Several studies have been carried out to overcome this problem, such as that of Das, J. et al., who proposed a clustering-based collaborative filtering approach that partitions the data using CURE (Clustering Using Representatives). The resulting clusters are then processed with a collaborative filtering algorithm to produce recommendations for target users. This process is carried out per cluster, so it does not process the entire user-item database, and the time required becomes shorter. Besides addressing the scalability problem, the clustering approach can mitigate sparsity by reducing the dimension of the rating matrix and reducing noisy data. It also significantly reduces running time while maintaining recommendation quality [2].

Wang, L. et al. proposed a diversified and scalable recommendation method (DR_LT) to overcome problems in neighborhood-based collaborative filtering (CF). These problems include the growing volume of users' item-rating data, which makes the resulting recommendations less efficient, because the recommendation system analyzes all rating data when searching for similar users or similar items. In addition, neighborhood-based CF pays more attention to recommendation accuracy, while key indicators such as recommendation diversity (RD) are often ignored, which affects the recommendation results and reduces user satisfaction. DR_LT utilizes locality-sensitive hashing and cover trees to optimize the recommendation list so that performance becomes effective, while producing item recommendations that are accurate, diverse, and able to solve scalability problems [3].

Zhao, Z. et al. proposed a recurrent neural network (RNN) to overcome the scalability problem, because it saves memory. It is also able to overcome the cold-start problem and to provide new users with the same quality of service as existing users. The first stage allocates an item table and uses a pair of embedding vectors to represent each item; in this way, several vectors can represent many items, reducing the memory used. The second stage applies a similarity-based initialization method to the item table so that item representations improve, followed by placing the appropriate item in the item table using the loss function and an adjustment method. This improves the performance of the large-scale recommendation system by speeding up training procedures [4].

Several studies have also been conducted to overcome the sparsity problem, including that of Ifada, N. et al., who combined ratings and film-genre similarity to address sparsity; to handle scalability, Fuzzy C-Means is used to cluster the movies. This approach produces dense rating data so that it scales to high-dimensional data, and the resulting recommendations are of higher quality [5]. Furthermore, Andra, D. and Baizal, Z. proposed Principal Component Analysis (PCA) and K-Means clustering to overcome the sparsity problem. PCA is used to reduce the data dimensions and improve the performance of K-Means clustering, while the K-Means algorithm is used to form data clusters and reduce the amount of data processed. Using PCA and K-Means results in a lower RMSE value compared to other models [6]. Similarly, Ardimansyah, M. I. et al. proposed Matrix Factorization to fill in empty rating values and overcome the problem of sparse rating data [7].

In addition, sparsity was also addressed by Lestari, S. et al., who proposed a ranking-based approach, namely the NRF (Normalized Rating Frequency) method [8]. Lestari, S. et al. also proposed WP-Rank, which maximizes the use of ranking data to generate product weights; the experimental results show that the WP-Rank method is superior to the Borda method [9]. They followed this by proposing the PoratRank method, which generates product rankings by optimizing rating data so that the aggregation results are product rankings recommended to users according to their interests, producing higher-quality recommendations [10].

Meanwhile, our next study combines a clustering approach with a ranking-based approach to overcome the problems of scalability and sparsity. The K-Means clustering algorithm is used to address scalability, while WP-Rank addresses sparsity by performing an aggregation process, so as to produce higher-quality recommendations that are in accordance with user preferences.

2. Research Methods

This study solved the problems of scalability and sparsity in Collaborative Filtering using clustering and ranking-based approaches. The clustering algorithm used was K-Means, while the WP-Rank method was used for ranking. The stages of the study are seen in Figure 1.
DOI: https://doi.org/10.29207/resti.v6iX.xxx
Creative Commons Attribution 4.0 International License (CC BY 4.0)
Author1, Author2
Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi) Vol. 6 No. x (2022)
distance (Euclidean Distance Space) by knowing the shortest distance between two points. The stages of the K-Means algorithm were:

2. Determine the centroids (initial initiation can be done by selecting data randomly).
3. Calculate the distance of each data point to the centroids. This study used the Euclidean distance, Equation (1).

$D(x_i, c_j) = \sqrt{\sum_{i=1}^{n} (x_i - c_j)^2}$   (1)

4. Group the data based on proximity to the centroid. The smaller the distance value, the closer the data point is to the cluster centroid.
5. Determine the new centroid by finding the average value of the data that are members of the cluster, Equation (2).

$c_j = \frac{1}{|S_j|} \sum_{x_i \in S_j} x_i$   (2)

For the WP-Rank method, the data are defined as U = {u_1, u_2, ..., u_g, ..., u_{l-1}, u_l} (U is the set of users) and P = {p_1, p_2, ..., p_h, ..., p_{m-1}, p_m} (P is the set of products).

2) Determine product points, Equations (6) and (7).

$P(u_g, p_h) = 1 + \sum_{k=1}^{m} PR(u_g, p_h, k)$   (6)

$PR(u_g, p_h, k) = \begin{cases} 1, & \text{if } R(u_g, p_h) > R(u_g, k) \\ 1, & \text{if } R(u_g, p_h) = R(u_g, k),\ S(u_g, p_h) > S(u_g, k) \\ 1, & \text{if } R(u_g, p_h) = R(u_g, k),\ S(u_g, p_h) = S(u_g, k),\ u_g < k \\ 0, & \text{otherwise} \end{cases}$   (7)

The ranking results from WP-Rank are then taken by Top-K to be recommended to users.
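The two stages above can be illustrated with a short sketch. This is not the authors' implementation: the toy rating matrix, the deterministic centroid initialization, and the helper names are assumptions for the example, and the tie-breaking term S of Equation (7), which is not defined in this excerpt, is replaced by a simple index comparison.

```python
# Sketch of the two stages: K-Means clustering of user rating vectors,
# then WP-Rank product points. Toy data; helper names are assumptions.
import math

def euclidean(a, b):
    # Equation (1): distance between a data point and a centroid
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(vectors, k, iters=10):
    # Step 2: initial centroids (first k points here; random in the paper)
    centroids = [list(v) for v in vectors[:k]]
    for _ in range(iters):
        # Steps 3-4: assign every vector to its nearest centroid
        clusters = [[] for _ in range(k)]
        for v in vectors:
            j = min(range(k), key=lambda c: euclidean(v, centroids[c]))
            clusters[j].append(v)
        # Step 5 / Equation (2): new centroid = mean of cluster members
        for j, members in enumerate(clusters):
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return clusters

def product_points(ratings):
    # Equations (6)-(7): P(u, p) = 1 + number of products k that p beats.
    # The tie-break term S of Equation (7) is not defined in this excerpt,
    # so equal ratings are resolved by product index alone here.
    m = len(ratings[0])
    return [[1 + sum(1 for k in range(m)
                     if row[p] > row[k] or (row[p] == row[k] and p < k))
             for p in range(m)]
            for row in ratings]

# Toy user-product rating matrix (4 users x 3 products)
R = [[5, 3, 1], [4, 3, 1], [1, 2, 5], [1, 3, 5]]
clusters = kmeans(R, k=2)
pts = product_points(R)
print(clusters)
print(pts)
```

For the first user, the points are [3, 2, 1]: a product collects one point for every product it outranks in that user's ratings, plus the constant 1 of Equation (6).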
$DCG_p = \sum_{i=1}^{p} \frac{2^{rel_i} - 1}{\log_2(i+1)}$   (8)

$NDCG_p = \frac{DCG_p}{IDCG_p}$   (9)

Davies Bouldin Index by number of clusters:

Clusters   DBI      Clusters   DBI
5          3.936    17         2.379
6          3.291    18         2.825
7          3.831    19         2.728
8          3.133    20         2.829
9          3.621    21         2.921
10         3.795    22         2.216
11         3.392    23         2.262
12         3.193    24         2.754
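Equations (8) and (9) can be checked with a small computation. The graded relevance scores below are invented for illustration; the helper names are not from the study.

```python
import math

def dcg(rels):
    # Equation (8): DCG_p = sum over positions i of (2^rel_i - 1) / log2(i + 1)
    return sum((2 ** rel - 1) / math.log2(i + 1)
               for i, rel in enumerate(rels, start=1))

def ndcg(rels):
    # Equation (9): normalize by IDCG, the DCG of the ideal (descending) order
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal else 0.0

# Hypothetical graded relevance of a recommended list, top position first
scores = [3, 2, 3, 0, 1]
print(round(ndcg(scores), 4))
```

Because the denominator is the ideal ordering of the same scores, NDCG is 1.0 exactly when the list is already sorted by relevance, and smaller otherwise.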
Figure 2 was a model for the clustering process using the K-Means algorithm and evaluation using the Davies Bouldin Index. The value of "k" was varied from k=2 to k=25 to determine the effect of the number of clusters on the Davies Bouldin Index. The evaluation found a significant change in value: for 2, 3, and 4 clusters, the average Davies Bouldin Index was above 3, while for 5-25 clusters it was above 2. The most optimal number of clusters in this experiment was 22, with the smallest value of 2.216.

Furthermore, this study evaluated the quality of the ranking generated by the WP-Rank method on the clusters formed, using NDCG. The results of the evaluation are seen in Figure 3.

Figure 3 showed the results of the NDCG evaluation of the ranking quality generated by the WP-Rank method, based on the number of clusters formed by the K-Means algorithm. In this experiment, "k" was set to 2-25 clusters, and samples were taken at k=2, k=5, k=10, k=15, k=20, k=22, and k=25. The experimental results showed an increase in the NDCG value from NDCG 1-10 at k=2 up to k=22; the average NDCG values were 0.63, 0.68, 0.78, 0.76, 0.81, and 0.82, respectively. Meanwhile, at k=25 there was a decrease in the average NDCG value, to 0.79. There was a significant difference at k=2 and k=5 compared to k=22, namely 0.19 and 0.14, while for k=10, k=15, and k=20 the differences were 0.04, 0.06, and 0.01. The decrease in the average NDCG value at k=25 was not significant, namely 0.02. The highest average NDCG value was at k=22, because k=22 was the most optimal number of clusters according to the evaluation using the Davies Bouldin Index. This showed that the optimal number of clusters affects the quality of the recommendations, as indicated by a better NDCG value.
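The Davies Bouldin Index used above to select the number of clusters can be computed as in the following sketch, which implements the standard definition (the mean over clusters of the worst-case ratio of within-cluster scatter to centroid separation). The toy points are illustrative, not the study's data.

```python
import math

def dbi(clusters):
    # Davies Bouldin Index: lower = more compact, better-separated clusters
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    cents = [[sum(col) / len(c) for col in zip(*c)] for c in clusters]
    # S_i: mean distance from each cluster's members to its centroid
    scatter = [sum(dist(p, cents[i]) for p in c) / len(c)
               for i, c in enumerate(clusters)]
    k = len(clusters)
    # For each cluster, take the worst ratio against every other cluster
    return sum(max((scatter[i] + scatter[j]) / dist(cents[i], cents[j])
                   for j in range(k) if j != i)
               for i in range(k)) / k

# Two tight, well-separated toy clusters give a small index;
# overlapping clusters give a much larger one
tight = [[[0, 0], [0, 1]], [[10, 10], [10, 11]]]
loose = [[[0, 0], [4, 4]], [[3, 3], [7, 7]]]
print(dbi(tight), dbi(loose))
```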
The next evaluation was running time, calculated as the real time the process takes to execute the input data and produce a ranking. The results of the running time evaluation are seen in Figure 4.
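Real execution time of this kind can be measured with a wall-clock timer, for example as below; the ranking function is a hypothetical stand-in, not the study's implementation.

```python
import time

def rank_products(row):
    # Hypothetical stand-in for the per-cluster ranking step being timed:
    # sort product indices by descending rating
    return sorted(range(len(row)), key=lambda p: -row[p])

start = time.perf_counter()
order = rank_products(list(range(100_000)))
elapsed = time.perf_counter() - start
print(f"ranking took {elapsed:.3f} s")
```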
Figure 4 showed the results of the running time evaluation for k=2 to k=25. The average times required for execution were 1.923, 0.507, 0.239, 0.165, 0.145,

Figure 3. The Evaluation of NDCG on WP-Rank Implementation