-
Notifications
You must be signed in to change notification settings - Fork 153
Item based Collaborative Filtering
This document describe how to do Item-based Collaborative Filtering using Hivemall.
Caution: naive similarity computation is O(n^2)
to compute all item-item pair similarity. MinHash is an efficient scheme for computing jaccard similarity. Section 6 show how to use MinHash in Hivemall.
Prepare following transaction table. We are generating feature_vector
for each item_id
based on cooccurrence of purchased items, a sort of bucket analysis.
userid | itemid | purchase_at timestamp
|
---|---|---|
1 | 31231 | 2015-04-9 00:29:02 |
1 | 13212 | 2016-05-24 16:29:02 |
2 | 312 | 2016-06-03 23:29:02 |
3 | 2313 | 2016-06-04 19:29:02 |
.. | .. | .. |
What we want for creating a feature vector for each item is the following cooccurrence
relation.
itemid | other | cnt |
---|---|---|
583266 | 621056 | 9999 |
583266 | 583266 | 18 |
31231 | 13212 | 129 |
31231 | 31231 | 3 |
31231 | 9833 | 953 |
... | ... | ... |
Feature vectors of each item will be as follows:
itemid | feature_vector array<string>
|
---|---|
583266 | 621056:9999, 583266:18 |
31231 | 13212:129, 31231:3, 9833:953 |
... | ... |
Note that value of feature vector should be scaled for k-NN similarity computation e.g., as follows:
itemid | feature_vector array<string>
|
---|---|
583266 | 621056:ln(9999+1) , 583266:ln(18+1)
|
31231 | 13212:ln(129+1) , 31231:ln(3+1) , 9833:ln(953+1)
|
... | ... |
The following queries results in creating the above table.
The following query creates a table that contains userid, itemid, and purchased_at. The table represents the last user-item contact (purchase) while the transaction
table holds all contacts.
CREATE TABLE user_purchased as
-- INSERT OVERWRITE TABLE user_purchased
select
userid,
itemid,
max(purchased_at) as purchased_at,
count(1) as purchase_count
from
transaction
-- where purchased_at < xxx -- divide training/testing data by time
group by
userid, itemid
;
Note: Better to avoid too old transactions because those information would be outdated though an enough number of transactions is required for recommendation.
Caution: Item-Item cooccurrence matrix is a symmetric matrix that has the number of total occurrence for each diagonal element . If the size of items are k
, then the size of expected matrix is k * (k - 1) / 2
, usually a very large one.
Better to use 2.2.2. instead of 2.2.1. for creating a cooccurrence
table where dataset is large.
create table cooccurrence as
-- INSERT OVERWRITE TABLE cooccurrence
select
u1.itemid,
u2.itemid as other,
count(1) as cnt
from
user_purchased u1
JOIN user_purchased u2 ON (u1.userid = u2.userid)
where
u1.itemid != u2.itemid
-- AND u2.purchased_at >= u1.purchased_at -- the other item should be purchased with/after the base item
group by
u1.itemid, u2.itemid
-- having -- optional but recommended to have this condition where dataset is large
-- cnt >= 2 -- count(1) >= 2
;
Caution: Note that specifying having cnt >= 2
has a drawback that item cooccurrence is not calculated where cnt
is less than 2. It could result no recommended items for certain items. Please ignore having cnt >= 2
if the following computations finish in an acceptable/reasonable time.
Caution: We ignore a purchase order in the following example. It means that the occurrence counts of ItemA -> ItemB
and ItemB -> ItemA
are assumed to be same. It is sometimes not a good idea e.g., for Camera -> SD card
and SD card -> Camera
.
Better to create Upper Triangular Matrix that has itemid > other
if resulting table is very large. No need to create Upper Triangular Matrix if your Hadoop cluster can handle the following instructions without considering it.
create table cooccurrence_upper_triangular as
-- INSERT OVERWRITE TABLE cooccurrence_upper_triangular
select
u1.itemid,
u2.itemid as other,
count(1) as cnt
from
user_purchased u1
JOIN user_purchased u2 ON (u1.userid = u2.userid)
where
u1.itemid > u2.itemid
group by
u1.itemid, u2.itemid
;
create table cooccurrence as
-- INSERT OVERWRITE TABLE cooccurrence
select * from (
select itemid, other, cnt from cooccurrence_upper_triangular
UNION ALL
select other as itemid, itemid as other, cnt from cooccurrence_upper_triangular
) t;
Note: UNION ALL
required to be embedded in Hive.
create table cooccurrence_upper_triangular as
WITH t1 as (
select
u1.itemid,
u2.itemid as other,
count(1) as cnt
from
user_purchased u1
JOIN user_purchased u2 ON (u1.userid = u2.userid)
where
u1.itemid > u2.itemid
group by
u1.itemid, u2.itemid
),
t2 as (
select
each_top_k( -- top 1000
1000, itemid, cnt,
itemid, other, cnt
) as (rank, cmpkey, itemid, other, cnt)
from (
select * from t1
CLUSTER BY itemid
) t;
)
-- INSERT OVERWRITE TABLE cooccurrence_upper_triangular
select itemid, other, cnt
from t2;
create table cooccurrence as
WITh t1 as (
select itemid, other, cnt from cooccurrence_upper_triangular
UNION ALL
select other as itemid, itemid as other, cnt from cooccurrence_upper_triangular
),
t2 as (
select
each_top_k(
1000, itemid, cnt,
itemid, other, cnt
) as (rank, cmpkey, itemid, other, cnt)
from (
select * from t1
CLUSTER BY itemid
) t
)
-- INSERT OVERWRITE TABLE cooccurrence
select itemid, other, cnt
from t2;
You can optionally compute cooccurrence ratio as follows:
WITH stats as (
select
itemid,
sum(cnt) as totalcnt
from
cooccurrence
group by
itemid
)
INSERT OVERWRITE TABLE cooccurrence_ratio
SELECT
l.itemid,
l.other,
(l.cnt / r.totalcnt) as ratio
FROM
cooccurrence l
JOIN stats r ON (l.itemid = r.itemid)
group by
l.itemid, l.other
;
l.cnt / r.totalcnt
represents a cooccurrence ratio of range [0,1]
.
INSERT OVERWRITE TABLE item_features
SELECT
itemid,
-- scaling `ln(cnt+1)` to avoid large value in the feature vector
-- rounding to xxx.yyyyy to reduce size of feature_vector in array<string>
collect_list(feature(other, round(ln(cnt+1),5))) as feature_vector
FROM
cooccurrence
GROUP BY
itemid
;
Item-Item similarity computation is known to be computation complexity O(n^2)
where n
is the number of items.
Depending on your cluster size and your dataset, the optimal solution differs.
Note: Better to use 3.1.1. scheme where dataset is large.
This version involves 3-way joins w/ large data shuffle; However, this version works in parallel where a cluster has enough task slots.
WITH similarity as (
select
o.itemid,
o.other,
cosine_similarity(t1.feature_vector, t2.feature_vector) as similarity
from
cooccurrence o
JOIN item_features t1 ON (o.itemid = t1.itemid)
JOIN item_features t2 ON (o.other = t2.itemid)
),
topk as (
select
each_top_k( -- get top-10 items based on similarity score
10, itemid, similarity,
itemid, other -- output items
) as (rank, similarity, itemid, other)
from (
select * from similarity
where similarity > 0 -- similarity > 0.01
CLUSTER BY itemid
) t
)
INSERT OVERWRITE TABLE item_similarity
select
itemid, other, similarity
from
topk;
Note item_similarity
is a similarity matrix. So, you can compute it from an upper triangular matrix as follows.
WITH cooccurrence_top100 as (
select
each_top_k(
100, itemid, cnt,
itemid, other
) as (rank, cmpkey, itemid, other)
from (
select * from cooccurrence_upper_triangular
CLUSTER BY itemid
) t
),
similarity as (
select
o.itemid,
o.other,
cosine_similarity(t1.feature_vector, t2.feature_vector) as similarity
from
cooccurrence_top100 o
-- cooccurrence_upper_triangular o
JOIN item_features t1 ON (o.itemid = t1.itemid)
JOIN item_features t2 ON (o.other = t2.itemid)
),
topk as (
select
each_top_k( -- get top-10 items based on similarity score
10, itemid, similarity,
itemid, other -- output items
) as (rank, similarity, itemid, other)
from (
select * from similarity
where similarity > 0 -- similarity > 0.01
CLUSTER BY itemid
) t
)
INSERT OVERWRITE TABLE item_similarity_upper_triangler
select
itemid, other, similarity
from
topk;
INSERT OVERWRITE TABLE item_similarity
select * from (
select itemid, other, similarity from item_similarity_upper_triangler
UNION ALL
select other as itemid, itemid as other, similarity from item_similarity_upper_triangler
) t;
Alternatively, you can compute cosine similarity as follows. This version involves cross join and thus runs sequentially in a single task. However, it involves less shuffle when compared to 3.1.
WITH similarity as (
select
t1.itemid,
t2.itemid as other,
cosine_similarity(t1.feature_vector, t2.feature_vector) as similarity
from
item_features t1
CROSS JOIN item_features t2
WHERE
t1.itemid != t2.itemid
),
topk as (
select
each_top_k( -- get top-10 items based on similarity score
10, itemid, similarity,
itemid, other -- output items
) as (rank, similarity, itemid, other)
from (
select * from similarity
where similarity > 0 -- similarity > 0.01
CLUSTER BY itemid
) t
)
INSERT OVERWRITE TABLE item_similarity
select
itemid, other, similarity
from
topk
;
item | other | similarity |
---|---|---|
583266 | 621056 | 0.33 |
583266 | 583266 | 0.18 |
31231 | 13212 | 1.29 |
31231 | 31231 | 0.3 |
31231 | 9833 | 0.953 |
... | ... | ... |
This section introduces item-based recommendation based on recently purchased items by each user.
Caution: It would better to ignore recommending some of items that user already purchased (only 1 time) while items that are purchased twice or more would be okey to be included in the recommendation list (e.g., repeatedly purchased daily necessities). So, you would need an item property table showing that each item is repeatedly purchased items or not.
First, prepare recently_purchased_items
table as follows:
INSERT OVERWRITE TABLE recently_purchased_items
select
each_top_k( -- get top-5 recently purchased items for each user
5, userid, purchased_at,
userid, itemid
) as (rank, purchased_at, userid, itemid)
from (
select
purchased_at, userid, itemid
from
user_purchased
-- where [optional filtering]
-- purchased_at >= xxx -- divide training/test data by time
CLUSTER BY
user_id -- Note CLUSTER BY is mandatory when using each_top_k
) t;
WITH topk as (
select
each_top_k(
5, userid, cnt,
userid, other
) as (rank, cnt, userid, rec_item)
from (
select
t1.userid, t2.other, max(t2.cnt) as cnt
from
recently_purchased_items t1
JOIN cooccurrence t2 ON (t1.itemid = t2.itemid)
where
t1.itemid != t2.other -- do not include items that user already purchased
AND NOT EXISTS (
SELECT a.itemid FROM user_purchased a
WHERE a.userid = t1.userid AND a.itemid = t2.other
-- AND a.purchased_count <= 1 -- optional
)
group by
t1.userid, t2.other
CLUSTER BY
userid -- top-k grouping by userid
) t1
)
INSERT OVERWRITE TABLE item_recommendation
select
userid,
map_values(to_ordered_map(rank, rec_item)) as rec_items
from
topk
group by
userid
;
4.3. Recommend top-k items based on the (cooccurrence) similarity for each user's recently purchased item
WITH topk as (
select
each_top_k(
5, userid, similarity,
userid, other
) as (rank, similarity, userid, rec_item)
from (
select
t1.userid, t2.other, max(t2.similarity) as similarity
from
recently_purchased_items t1
JOIN item_similarity t2 ON (t1.itemid = t2.itemid)
where
t1.itemid != t2.other -- do not include items that user already purchased
AND NOT EXISTS (
SELECT a.itemid FROM user_purchased a
WHERE a.userid = t1.userid AND a.itemid = t2.other
-- AND a.purchased_count <= 1 -- optional
)
group by
t1.userid, t2.other
CLUSTER BY
userid -- top-k grouping by userid
) t1
)
INSERT OVERWRITE TABLE item_recommendation
select
userid,
map_values(to_ordered_map(rank, rec_item)) as rec_items
from
topk
group by
userid
;
Refer this article to get details about MinHash and Jarccard similarity. This blog article also explains about Hivemall's minhash.
INSERT OVERWRITE TABLE minhash -- results in 30x records of item_features
select
-- assign 30 minhash values for each item
minhash(itemid, feature_vector, "-n 30") as (clusterid, itemid) -- '-n' would be 10~100
from
item_features
;
WITH t1 as (
select
l.itemid,
r.itemid as other,
count(1) / 30 as similarity -- Pseudo jaccard similarity '-n 30'
from
minhash l
JOIN minhash r
ON (l.clusterid = r.clusterid)
where
l.itemid != r.itemid
group by
l.itemid, r.itemid
having
count(1) >= 3 -- [optional] filtering equals to (count(1)/30) >= 0.1
),
top100 as (
select
each_top_k(100, itemid, similarity, itemid, other)
as (rank, similarity, itemid, other)
from (
select * from t1
-- where similarity >= 0.1 -- Optional filtering. Can be ignored.
CLUSTER BY itemid
) t2
)
INSERT OVERWRITE TABLE jaccard_similarity
select
itemid, other, similarity
from
top100
;
Caution: Note that there might be no similar item for certain items.
You can compute top-k
similar items based on cosine similarity, following rough top-N
similar items listing using minhash, where k << N
(e.g., k=10 and N=100).
WITH similarity as (
select
o.itemid,
o.other,
cosine_similarity(t1.feature_vector, t2.feature_vector) as similarity
from
jaccard_similarity o
JOIN item_features t1 ON (o.itemid = t1.itemid)
JOIN item_features t2 ON (o.other = t2.itemid)
),
topk as (
select
each_top_k( -- get top-10 items based on similarity score
10, itemid, similarity,
itemid, other -- output items
) as (rank, similarity, itemid, other)
from (
select * from similarity
where similarity > 0 -- similarity > 0.01
CLUSTER BY itemid
) t
)
INSERT OVERWRITE TABLE cosine_similarity
select
itemid, other, similarity
from
topk;