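Chapter 1: Simple Recommendation Algorithms

Finding Similar Users


Manhattan Distance

This is the simplest way to compute a distance.

In a two-dimensional model, each object can be represented as a point (x, y). Using subscripts to tell objects apart, with (x1, y1) for A and (x2, y2) for B, the Manhattan distance between them is: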

$$d(A, B) = |x_1 - x_2| + |y_1 - y_2|$$
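
The Manhattan distance is the sum of the absolute coordinate differences. A minimal sketch of that idea for ratings stored as dictionaries follows; the function name `manhattan` and the sample points are illustrative assumptions, not part of the original text.

```python
def manhattan(rating1, rating2):
    """Sum of absolute differences over the keys both dictionaries share."""
    distance = 0
    for key in rating1:
        if key in rating2:
            distance += abs(rating1[key] - rating2[key])
    return distance

# Illustrative two-dimensional points: A = (x1, y1), B = (x2, y2)
a = {"x": 2.0, "y": 3.5}
b = {"x": 4.0, "y": 2.0}
print(manhattan(a, b))  # |2.0 - 4.0| + |3.5 - 2.0| = 3.5
```

Only shared keys are compared, so two users with no items in common get a distance of 0, which a caller may want to treat as "unknown" rather than "identical".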

Euclidean Distance

Another way to measure distance is the straight-line distance between two points, computed with the Pythagorean theorem:

$$d(A, B) = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$$
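
The straight-line distance can be sketched the same way: square each coordinate difference, add them up, and take the square root. The name `euclidean` and the sample points below are assumptions chosen for illustration.

```python
from math import sqrt

def euclidean(rating1, rating2):
    """Straight-line distance over the keys both dictionaries share."""
    total = 0
    for key in rating1:
        if key in rating2:
            total += (rating1[key] - rating2[key]) ** 2
    return sqrt(total)

# Illustrative points forming a 3-4-5 right triangle
a = {"x": 1.0, "y": 2.0}
b = {"x": 4.0, "y": 6.0}
print(euclidean(a, b))  # sqrt(3^2 + 4^2) = 5.0
```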

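Generalization: Minkowski Distance

Manhattan distance and Euclidean distance can be folded into a single formula, known as the Minkowski distance:

$$d(A, B) = \left( \sum_{k=1}^{n} |a_k - b_k|^r \right)^{1/r}$$

where a_k and b_k are the k-th coordinates of A and B. With r = 1 this is the Manhattan distance, and with r = 2 it is the Euclidean distance.

Other values:

r = ∞ gives the supremum distance. The larger r is, the more a large difference in a single dimension dominates the overall distance.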

def minkowski(rating1, rating2, r):
    distance = 0
    for key in rating1:
        if key in rating2:
            distance += pow(abs(rating1[key] - rating2[key]), r)
    return pow(distance, 1.0 / r)
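
To see how a larger r lets one dimension dominate, the sketch below compares several values of r against the largest single difference. It restates the function above in compact form so that it runs on its own; `supremum` and the sample data are hypothetical names and values added for illustration.

```python
def minkowski(rating1, rating2, r):
    """Minkowski distance over shared keys (same logic as the function above)."""
    common = [k for k in rating1 if k in rating2]
    return sum(abs(rating1[k] - rating2[k]) ** r for k in common) ** (1.0 / r)

def supremum(rating1, rating2):
    """Largest single difference; assumes at least one shared key."""
    return max(abs(rating1[k] - rating2[k]) for k in rating1 if k in rating2)

a = {"x": 1.0, "y": 1.0, "z": 1.0}
b = {"x": 2.0, "y": 2.0, "z": 5.0}  # differences: 1, 1, 4

for r in (1, 2, 10):
    print(r, minkowski(a, b, r))
print("supremum", supremum(a, b))
```

At r = 1 all three differences count equally and the distance is 6.0; by r = 10 the result is already very close to 4.0, the largest single difference.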

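Pearson Correlation Coefficient

The Pearson correlation coefficient measures how strongly two variables are correlated. Its value lies between -1 and 1, where 1 means perfect agreement and -1 means perfect disagreement.

"Grade inflation": scores come out higher than they should because the rater applies lenient standards.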

from math import sqrt

def pearson(rating1, rating2):
    sum_xy = 0
    sum_x = 0
    sum_y = 0
    sum_x2 = 0
    sum_y2 = 0
    n = 0

    for key in rating1:
        if key in rating2:
            n += 1
            x = rating1[key]
            y = rating2[key]
            sum_xy += x * y
            sum_x += x
            sum_y += y
            sum_x2 += pow(x, 2)
            sum_y2 += pow(y, 2)

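    # No items in common: the correlation is undefined, so return 0
    if n == 0:
        return 0
    # Compute the denominator
    denominator = sqrt(sum_x2 - pow(sum_x, 2) / n) * sqrt(sum_y2 - pow(sum_y, 2) / n)
    if denominator == 0:
        return 0
    else:
        return (sum_xy - (sum_x * sum_y) / n) / denominator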
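
The function above uses a single-pass computational formula. As a cross-check, here is a sketch based on the mean-centered textbook definition, applied to two hypothetical users to show why Pearson copes with grade inflation. The name `pearson_definition` and the sample ratings are assumptions made for illustration.

```python
from math import sqrt

def pearson_definition(rating1, rating2):
    """Pearson r from the mean-centered definition, over shared items only."""
    keys = [k for k in rating1 if k in rating2]
    n = len(keys)
    if n == 0:
        return 0
    mean_x = sum(rating1[k] for k in keys) / n
    mean_y = sum(rating2[k] for k in keys) / n
    cov = sum((rating1[k] - mean_x) * (rating2[k] - mean_y) for k in keys)
    var_x = sum((rating1[k] - mean_x) ** 2 for k in keys)
    var_y = sum((rating2[k] - mean_y) ** 2 for k in keys)
    denominator = sqrt(var_x) * sqrt(var_y)
    return cov / denominator if denominator else 0

# Hypothetical users: one rates everything between 4 and 5, the other
# uses the whole 1-5 scale, but both rank the items in the same order.
generous = {"A": 4.0, "B": 4.5, "C": 5.0}
strict = {"A": 1.0, "B": 3.0, "C": 5.0}
print(pearson_definition(generous, strict))  # very close to 1
```

A distance measure would place these two users far apart, while the correlation treats them as agreeing, because it looks at how ratings move together rather than at their absolute values.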

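Cosine Similarity

Cosine similarity ranges from 1 to -1, where 1 means a perfect match and -1 means perfect opposition.

Summary:

1. If the data suffers from "grade inflation", use the Pearson correlation coefficient.

2. If the data is dense, meaning users have rated mostly the same items, and the size of the differences matters, use Euclidean or Manhattan distance.

3. If the data is sparse, use cosine similarity.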
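
The text gives no code for this measure, so here is a minimal sketch. It assumes the usual definition (the dot product divided by the product of the vector lengths) and treats items a user has not rated as 0. The function name and sample data are illustrative assumptions.

```python
from math import sqrt

def cosine_similarity(rating1, rating2):
    """Cosine of the angle between two rating vectors; missing items count as 0."""
    dot = sum(v * rating2[k] for k, v in rating1.items() if k in rating2)
    norm1 = sqrt(sum(v * v for v in rating1.values()))
    norm2 = sqrt(sum(v * v for v in rating2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0
    return dot / (norm1 * norm2)

a = {"x": 3.0, "y": 4.0}
b = {"x": 6.0, "y": 8.0}  # points in the same direction as a
c = {"z": 5.0}            # shares no items with a
print(cosine_similarity(a, b))  # very close to 1
print(cosine_similarity(a, c))  # 0.0
```

Because unrated items contribute nothing to the dot product, the many shared zeros in sparse data do not inflate the similarity. With ratings that are never negative, the result stays between 0 and 1.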


# -*- coding:utf-8 -*-

"""
A simple recommender that brings together Manhattan distance, the Pearson coefficient, and cosine similarity
"""

import codecs 
from math import sqrt

users = {
    "Angelica": {
        "Blues Traveler": 3.5, 
        "Broken Bells": 2.0, 
        "Norah Jones": 4.5, 
        "Phoenix": 5.0, 
        "Slightly Stoopid": 1.5, 
        "The Strokes": 2.5, 
        "Vampire Weekend": 2.0
        },
    "Bill":{
        "Blues Traveler": 2.0, 
        "Broken Bells": 3.5, 
        "Deadmau5": 4.0, 
        "Phoenix": 2.0, 
        "Slightly Stoopid": 3.5, 
        "Vampire Weekend": 3.0
        },
    "Chan": {
        "Blues Traveler": 5.0, 
        "Broken Bells": 1.0, 
        "Deadmau5": 1.0, 
        "Norah Jones": 3.0, 
        "Phoenix": 5, 
        "Slightly Stoopid": 1.0
        },
    "Dan": {
        "Blues Traveler": 3.0, 
        "Broken Bells": 4.0, 
        "Deadmau5": 4.5, 
        "Phoenix": 3.0, 
        "Slightly Stoopid": 4.5, 
        "The Strokes": 4.0, 
        "Vampire Weekend": 2.0
        },
    "Hailey": {
        "Broken Bells": 4.0, 
        "Deadmau5": 1.0, 
        "Norah Jones": 4.0, 
        "The Strokes": 4.0, 
        "Vampire Weekend": 1.0
        },
    "Jordyn":  {
        "Broken Bells": 4.5, 
        "Deadmau5": 4.0, 
        "Norah Jones": 5.0, 
        "Phoenix": 5.0, 
        "Slightly Stoopid": 4.5, 
        "The Strokes": 4.0, 
        "Vampire Weekend": 4.0
        },
    "Sam": {
        "Blues Traveler": 5.0, 
        "Broken Bells": 2.0, 
        "Norah Jones": 3.0, 
        "Phoenix": 5.0, 
        "Slightly Stoopid": 4.0, 
        "The Strokes": 5.0
        },
    "Veronica": {
        "Blues Traveler": 3.0, 
        "Norah Jones": 5.0, 
        "Phoenix": 4.0, 
        "Slightly Stoopid": 2.5, 
        "The Strokes": 3.0
        }
    }

class recommender:
    def __init__(self, data, k = 1, metric = 'pearson', n =5):
        """ Initialize the recommender
        data    training data
        k       the k value for the k-nearest-neighbor algorithm
        metric  which distance formula to use
        n       number of recommendations to return
        """
        self.k = k 
        self.n = n 
        self.username2id = {}
        self.userid2name = {}
        self.productid2name = {}
        # Remember which distance metric is in use
        self.metric = metric
        if self.metric == 'pearson':
            self.fn = self.pearson
        
        # If data is a dictionary, store it; otherwise ignore it
        if isinstance(data, dict):
            self.data = data
            
    def convertProductID2name(self, id):
        # Look up the product name by its ID
        if id in self.productid2name:
            return self.productid2name[id]
        else:
            return id        
            
    def userRatings(self, id, n):
        # Print the n items this user rated highest
        print("Ratings for " + self.userid2name[id])
        ratings = self.data[id]
        print(len(ratings))
        ratings = list(ratings.items())
        ratings = [(self.convertProductID2name(k), v)
                   for (k, v) in ratings]
        # Sort by rating, highest first, and keep the top n
        ratings.sort(key=lambda artisTuple: artisTuple[1], reverse=True)
        ratings = ratings[:n]
        for rating in ratings:
            print("%s\t%i" % (rating[0], rating[1]))
            
    def loadBookDB(self, path = ''):
        # Load the BX book dataset; path is the directory holding the data files
        self.data = {}
        i = 0
        # Load the book ratings into self.data
        f = codecs.open(path + "/BX-Book-Ratings.csv", 'r', 'utf8')
        for line in f:
            i += 1
            #separate line into fields
            fields = line.split(';')
            user = fields[0].strip('"')
            book = fields[1].strip('"')
            rating = int(fields[2].strip().strip('"'))
            if user in self.data:
                currentRatings = self.data[user]
            else:
                currentRatings = {}
            currentRatings[book] = rating
            self.data[user] = currentRatings
        f.close()
        # Load the book information into self.productid2name
        # (ISBN, title, author, and so on)
        f = codecs.open(path + "/BX-Books.csv", 'r', 'utf8')
        for line in f:
            i += 1
            #separate line into fields
            fields = line.split(';')
            isbn = fields[0].strip('"')
            title = fields[1].strip('"')
            author = fields[2].strip().strip('"')
            title = title + ' by ' + author
            self.productid2name[isbn] = title
        f.close()
        # Load the user information into self.userid2name and self.username2id
        f = codecs.open(path + "/BX-Users.csv", 'r', 'utf8')
        for line in f:
            i += 1
            # print(line)
            #separate line into fields
            fields = line.split(';')
            userid = fields[0].strip('"')
            location = fields[1].strip('"')
            if len(fields) > 3:
                age = fields[2].strip().strip('"')
            else:
                age = 'NULL'
            if age != 'NULL':
                value = location + '  (age: ' + age + ')'
            else:
                value = location
            self.userid2name[userid] = value
            self.username2id[location] = userid
        f.close()
        print(i)
        
    def pearson(self, rating1, rating2):
        sum_xy = 0
        sum_x = 0
        sum_y = 0
        sum_x2 = 0
        sum_y2 = 0
        n = 0
        for key in rating1:
            if key in rating2:
                n += 1
                x = rating1[key]
                y = rating2[key]
                sum_xy += x * y
                sum_x += x
                sum_y += y
                sum_x2 += pow(x, 2)
                sum_y2 += pow(y, 2)
        if n == 0:
            return 0
        # now compute denominator
        denominator = (sqrt(sum_x2 - pow(sum_x, 2) / n)
                       * sqrt(sum_y2 - pow(sum_y, 2) / n))
        if denominator == 0:
            return 0
        else:
            return (sum_xy - (sum_x * sum_y) / n) / denominator
            
    def computeNearestNeighbor(self, username):
        """Return the other users, ordered by similarity to username"""
        distances = []
        for instance in self.data:
            if instance != username:
                distance = self.fn(self.data[username], self.data[instance])
                distances.append((instance, distance))
        # Sort by Pearson score, most similar (highest) first
        distances.sort(key=lambda artistTuple: artistTuple[1], reverse=True)
        return distances
    
    def recommend(self, user):
        """Return a list of recommendations"""
        recommendations = {}
        # First get the other users, ordered by nearness
        nearest = self.computeNearestNeighbor(user)
        # Get the ratings this user has already given
        userRatings = self.data[user]
        # Sum the similarity scores of the k nearest neighbors
        totalDistance = 0.0
        for i in range(self.k):
            totalDistance += nearest[i][1]
        # Accumulate the ratings of the k nearest neighbors
        for i in range(self.k):
            # This neighbor's slice of the pie (its share of the total)
            weight = nearest[i][1] / totalDistance
            # Get the neighbor's name
            name = nearest[i][0]
            # Get the neighbor's ratings
            neighborRatings = self.data[name]
            # Find items the neighbor rated that the user has not
            for artist in neighborRatings:
                if artist not in userRatings:
                    if artist not in recommendations:
                        recommendations[artist] = (neighborRatings[artist] * weight)
                    else:
                        recommendations[artist] = (recommendations[artist] + neighborRatings[artist] * weight)
        # Build the recommendation list once all k neighbors are processed
        recommendations = list(recommendations.items())
        recommendations = [(self.convertProductID2name(k), v)
                           for (k, v) in recommendations]
        # Sort, highest score first
        recommendations.sort(key=lambda artistTuple: artistTuple[1], reverse=True)
        # Return the first n results
        return recommendations[:self.n]
                  
                  
R = recommender(users)
R.loadBookDB("...")
print(R.recommend('276747'))
print(R.recommend('276822'))
print(R.recommend('276813'))
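
The `recommend` method weights each neighbor's ratings by that neighbor's share of the total similarity. The sketch below isolates that step with small made-up numbers so the arithmetic is easy to follow. The function name `weighted_scores`, the user names, and all the values are hypothetical, and the total similarity is assumed to be nonzero.

```python
def weighted_scores(neighbors, ratings, already_rated):
    """Combine neighbor ratings, weighting each neighbor by its share of total similarity.

    neighbors: list of (name, similarity) pairs, for example the k nearest users
    ratings: dict mapping name -> {item: rating}
    already_rated: items the target user has rated (left out of the result)
    """
    total = sum(similarity for _, similarity in neighbors)
    scores = {}
    for name, similarity in neighbors:
        weight = similarity / total
        for item, rating in ratings[name].items():
            if item not in already_rated:
                scores[item] = scores.get(item, 0.0) + rating * weight
    # Highest projected score first
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

# Hypothetical neighbors with similarities 0.5, 0.25 and 0.25 (weights 1/2, 1/4, 1/4)
neighbors = [("u1", 0.5), ("u2", 0.25), ("u3", 0.25)]
ratings = {
    "u1": {"Deadmau5": 4.0, "Norah Jones": 5.0, "Phoenix": 3.0},
    "u2": {"Deadmau5": 2.0, "Phoenix": 5.0},
    "u3": {"Deadmau5": 4.0},
}
print(weighted_scores(neighbors, ratings, already_rated={"Phoenix"}))
# Deadmau5: 4.0 * 0.5 + 2.0 * 0.25 + 4.0 * 0.25 = 3.5
# Norah Jones: 5.0 * 0.5 = 2.5
```

An item rated by only some of the neighbors still receives just those neighbors' weighted contributions, so it tends to score lower; this matches the behavior of the class above.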

Original author: Ron Zacharski (CC BY-NC 3.0) https://github.com/egrcc/guidetodatamining

Original text: http://guidetodatamining.com/

Translation reference: @egrcc https://github.com/egrcc/guidetodatamining
