Reading Notes on A Programmer's Guide to Data Mining, Part 2

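Before We Begin

The source code and data used in the book can all be found at: http://guidetodatamining.com/
The theory in this book is fairly simple, it contains few errors, and it offers plenty of hands-on practice. If you write out every piece of code yourself, you will learn a lot. In short: well suited to beginners.
Reposting is welcome; please credit the source. Corrections are welcome too.
Collection of all notes: https://www.zybuluo.com/hainingwyx/note/559139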

Item-Based Collaborative Filtering

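Explicit ratings: the user states a rating directly, for example with YouTube's like and dislike buttons.
Implicit ratings: inferred from behavior, for example a user's click trail on a website.
User-based (neighbor-based) recommenders have to perform a huge number of computations per recommendation, so they suffer from latency. They also suffer from sparsity. This approach is also called memory-based collaborative filtering, because all the ratings must be stored in order to make recommendations.
Item-based filtering: find the most similar items ahead of time and combine them with the user's ratings to generate recommendations. This approach is also called model-based collaborative filtering, because it does not need to store all the ratings; instead it builds a model that represents the similarity between items.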
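To counter grade inflation (users who rate everything higher than others do), use the adjusted cosine similarity:

$$ s(i,j) = \frac{\sum_{u \in U} (R_{u,i} - \bar{R}_u)(R_{u,j} - \bar{R}_u)}{\sqrt{\sum_{u \in U} (R_{u,i} - \bar{R}_u)^2}\,\sqrt{\sum_{u \in U} (R_{u,j} - \bar{R}_u)^2}} $$

Here $U$ is the set of all users who rated both $i$ and $j$, $R_{u,i}$ is user $u$'s rating of item $i$, and $\bar{R}_u$ is the average of all ratings given by user $u$. Computing $s(i,j)$ for every pair of items yields the similarity matrix.
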
users3 = {"David": {"Imagine Dragons": 3, "Daft Punk": 5,
                    "Lorde": 4, "Fall Out Boy": 1},
          "Matt":  {"Imagine Dragons": 3, "Daft Punk": 4,
                    "Lorde": 4, "Fall Out Boy": 1},
          "Ben":   {"Kacey Musgraves": 4, "Imagine Dragons": 3,
                    "Lorde": 3, "Fall Out Boy": 1},
          "Chris": {"Kacey Musgraves": 4, "Imagine Dragons": 4,
                    "Daft Punk": 4, "Lorde": 3, "Fall Out Boy": 1},
          "Tori":  {"Kacey Musgraves": 5, "Imagine Dragons": 4,
                    "Daft Punk": 5, "Fall Out Boy": 3}}

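from math import sqrt

def computeSimilarity(band1, band2, userRatings):
    # average rating of each user over everything that user rated
    averages = {}
    for (key, ratings) in userRatings.items():
        averages[key] = float(sum(ratings.values())) / len(ratings.values())

    num = 0   # numerator
    dem1 = 0  # first half of denominator
    dem2 = 0  # second half of denominator
    for (user, ratings) in userRatings.items():
        # only users who rated both bands contribute
        if band1 in ratings and band2 in ratings:
            avg = averages[user]
            num += (ratings[band1] - avg) * (ratings[band2] - avg)
            dem1 += (ratings[band1] - avg)**2
            dem2 += (ratings[band2] - avg)**2
    # note: raises ZeroDivisionError when either sum of squares is zero,
    # e.g. when no user rated both bands
    return num / (sqrt(dem1) * sqrt(dem2))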
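
The notes say that the adjusted cosine similarity can be used to obtain a similarity matrix. A minimal sketch of that step follows, under a few assumptions: the names `adjusted_cosine` and `similarity_matrix` are made up for this illustration, the matrix is stored as a nested dict, and an undefined similarity (zero denominator) is recorded as `None`.

```python
from itertools import combinations
from math import sqrt

# Same ratings as users3 in the notes.
users3 = {"David": {"Imagine Dragons": 3, "Daft Punk": 5,
                    "Lorde": 4, "Fall Out Boy": 1},
          "Matt":  {"Imagine Dragons": 3, "Daft Punk": 4,
                    "Lorde": 4, "Fall Out Boy": 1},
          "Ben":   {"Kacey Musgraves": 4, "Imagine Dragons": 3,
                    "Lorde": 3, "Fall Out Boy": 1},
          "Chris": {"Kacey Musgraves": 4, "Imagine Dragons": 4,
                    "Daft Punk": 4, "Lorde": 3, "Fall Out Boy": 1},
          "Tori":  {"Kacey Musgraves": 5, "Imagine Dragons": 4,
                    "Daft Punk": 5, "Fall Out Boy": 3}}


def adjusted_cosine(item1, item2, user_ratings):
    """Adjusted cosine similarity of two items, or None if undefined."""
    num = dem1 = dem2 = 0.0
    for items in user_ratings.values():
        if item1 in items and item2 in items:
            avg = sum(items.values()) / len(items)  # this user's mean rating
            d1 = items[item1] - avg
            d2 = items[item2] - avg
            num += d1 * d2
            dem1 += d1 ** 2
            dem2 += d2 ** 2
    if dem1 == 0 or dem2 == 0:
        return None  # no common raters, or no variation around the means
    return num / (sqrt(dem1) * sqrt(dem2))


def similarity_matrix(user_ratings):
    """Nested dict: matrix[a][b] is the similarity of items a and b."""
    items = sorted({item for r in user_ratings.values() for item in r})
    matrix = {item: {} for item in items}
    for a, b in combinations(items, 2):
        sim = adjusted_cosine(a, b, user_ratings)
        matrix[a][b] = sim  # the measure is symmetric,
        matrix[b][a] = sim  # so fill both halves at once
    return matrix


matrix = similarity_matrix(users3)
print(round(matrix["Kacey Musgraves"]["Imagine Dragons"], 4))
```

Because the matrix depends only on the stored ratings, it can be computed offline and reused for every prediction, which is the main appeal of the item-based approach.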

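Prediction from the similarity matrix:

$$ p(u,i) = \frac{\sum_{N \in similarTo(i)} (S_{i,N} \times R_{u,N})}{\sum_{N \in similarTo(i)} |S_{i,N}|} $$

$p(u,i)$ is the predicted rating of item $i$ by user $u$.

$N$ ranges over the items rated by user $u$ that are similar to $i$.

$S_{i,N}$ is the similarity between $i$ and $N$.

$R_{u,N}$ is the rating $u$ gave to $N$. It should take values in $[-1, 1]$, so the raw ratings may need a linear transformation first.

For raw ratings in $[Min_R, Max_R]$, the new (normalized) rating is

$$ NR_{u,N} = \frac{2(R_{u,N} - Min_R) - (Max_R - Min_R)}{Max_R - Min_R} $$

A prediction made on the normalized scale can be mapped back to the original scale with the inverse transformation $R = \frac{1}{2}(NR + 1)(Max_R - Min_R) + Min_R$.
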
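
The prediction step can be sketched as follows. This is only an illustration under stated assumptions: the function names, the 1 to 5 rating scale used as the default, and the similarity values in `sims` are all hypothetical and not taken from the book.

```python
def normalize(rating, min_r=1, max_r=5):
    """Map a rating from [min_r, max_r] linearly onto [-1, 1]."""
    return (2 * (rating - min_r) - (max_r - min_r)) / (max_r - min_r)


def denormalize(nr, min_r=1, max_r=5):
    """Inverse of normalize: map [-1, 1] back onto [min_r, max_r]."""
    return 0.5 * (nr + 1) * (max_r - min_r) + min_r


def predict(user_items, target, sims, min_r=1, max_r=5):
    """Predict the user's rating of `target` from the items they rated.

    sims[target][item] is the similarity between target and item.
    Returns None when no rated item has a known similarity to target.
    """
    num = den = 0.0
    for item, rating in user_items.items():
        sim = sims.get(target, {}).get(item)
        if item == target or sim is None:
            continue
        num += sim * normalize(rating, min_r, max_r)
        den += abs(sim)
    if den == 0:
        return None
    return denormalize(num / den, min_r, max_r)


# Hypothetical similarities between an unrated item "X" and two rated items.
sims = {"X": {"A": 0.5, "B": -0.5}}
prediction = predict({"A": 5, "B": 1}, "X", sims)
print(prediction)
```

In this toy case the user loved an item that is positively similar to X and disliked one that is negatively similar, so both pieces of evidence push the prediction toward the top of the scale.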
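The Slope One Algorithm

  • Computing deviations

    The average deviation of item i with respect to item j is

    $$ dev_{i,j} = \sum_{u \in S_{i,j}(X)} \frac{u_i - u_j}{card(S_{i,j}(X))} $$

    card(S) is the number of elements in the set S, and X is the entire set of ratings. $S_{i,j}(X)$ is the set of all users who rated both i and j, and $u_i$ is user u's rating of item i.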

def computeDeviations(self):
    # for each person in the data:
    #    get their ratings
    for ratings in self.data.values():        # data: users2, ratings: {song: value, ...}
        # for each item & rating in that set of ratings:
        for (item, rating) in ratings.items():
            self.frequencies.setdefault(item, {})   # key is song
            self.deviations.setdefault(item, {})
            # for each item2 & rating2 in that set of ratings:
            for (item2, rating2) in ratings.items():
                if item != item2:
                    # add the difference between the ratings to our
                    # computation
                    self.frequencies[item].setdefault(item2, 0)
                    self.deviations[item].setdefault(item2, 0.0)
                    # frequencies is card(S_ij(X))
                    self.frequencies[item][item2] += 1
                    # deviations accumulates the rating differences
                    # over all users who rated both items
                    self.deviations[item][item2] += rating - rating2

    # once every user has been processed, turn the sums into averages
    for (item, ratings) in self.deviations.items():
        for item2 in ratings:
            ratings[item2] /= self.frequencies[item][item2]
# test code for ComputeDeviations(self)
#r = recommender(users2)
#r.computeDeviations()
#r.deviations
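
To check the deviation formula by hand, here is a small standalone sketch that does not depend on the recommender class. The function name `compute_deviations` and the three users in `toy` are invented for this illustration.

```python
def compute_deviations(data):
    """Return (deviations, frequencies) as nested dicts.

    deviations[i][j] is the average of (rating of i - rating of j) over
    all users who rated both; frequencies[i][j] counts those users.
    """
    sums, freqs = {}, {}
    for ratings in data.values():
        for item, rating in ratings.items():
            for item2, rating2 in ratings.items():
                if item != item2:
                    sums.setdefault(item, {}).setdefault(item2, 0.0)
                    freqs.setdefault(item, {}).setdefault(item2, 0)
                    sums[item][item2] += rating - rating2
                    freqs[item][item2] += 1
    # divide only after all users have been accumulated
    devs = {i: {j: total / freqs[i][j] for j, total in row.items()}
            for i, row in sums.items()}
    return devs, freqs


# Hypothetical toy data: three users, three items.
toy = {"Amy": {"A": 4, "B": 3},
       "Bob": {"A": 5, "B": 2, "C": 4},
       "Cat": {"B": 3, "C": 5}}

devs, freqs = compute_deviations(toy)
print(devs["A"]["B"], freqs["A"]["B"])
```

Amy and Bob both rated A and B, with differences 1 and 3, so the deviation of A with respect to B averages to 2 over 2 users. Note that `devs[i][j]` is always the negative of `devs[j][i]`.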


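  • Weighted Slope One prediction

    $$ P^{wS1}(u)_j = \frac{\sum_{i \in S(u) - \{j\}} (dev_{j,i} + u_i)\, c_{j,i}}{\sum_{i \in S(u) - \{j\}} c_{j,i}} $$

    $P^{wS1}(u)_j$ is the Weighted Slope One prediction of user u's rating of item j, S(u) is the set of items rated by u, and $c_{j,i} = card(S_{j,i}(X))$.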

def slopeOneRecommendations(self, userRatings):
    recommendations = {}
    frequencies = {}
    # for every item and rating in the user's ratings
    for (userItem, userRating) in userRatings.items():        # userItem: i
        # for every item in our dataset that the user didn't rate
        for (diffItem, diffRatings) in self.deviations.items():    # diffItem: j
            if diffItem not in userRatings and \
               userItem in self.deviations[diffItem]:
                freq = self.frequencies[diffItem][userItem]   # freq: c_ji
                # setdefault adds the key with the default value
                # if the key is not yet in the dictionary
                recommendations.setdefault(diffItem, 0.0)
                frequencies.setdefault(diffItem, 0)
                # add to the running sum representing the numerator
                # of the formula
                recommendations[diffItem] += (diffRatings[userItem] +
                                              userRating) * freq
                # keep a running sum of the frequency of diffItem
                frequencies[diffItem] += freq
    # after both loops finish: divide each numerator by its
    # denominator to get the list of predictions p(u)_j
    recommendations = [(self.convertProductID2name(k),
                        v / frequencies[k])
                       for (k, v) in recommendations.items()]
    # finally sort and return
    recommendations.sort(key=lambda artistTuple: artistTuple[1],
                         reverse=True)
    # I am only going to return the first 50 recommendations
    return recommendations[:50]
           
# test code for SlopeOneRecommendations
#r = recommender(users2)
#r.computeDeviations()
#g = users2['Ben']
#r.slopeOneRecommendations(g)
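
A worked instance of the weighted Slope One formula, as a standalone sketch. The function name and the `deviations` and `frequencies` tables are hypothetical; the tables are what the deviation step would produce for three made-up users (Amy: A=4, B=3; Bob: A=5, B=2, C=4; Cat: B=3, C=5).

```python
def weighted_slope_one(user_ratings, deviations, frequencies):
    """Predict ratings for every item the user has not rated yet.

    deviations[j][i] is dev_{j,i}; frequencies[j][i] is c_{j,i}.
    Returns (item, prediction) pairs, best prediction first.
    """
    numerators, denominators = {}, {}
    for target, target_devs in deviations.items():      # target is j
        if target in user_ratings:
            continue
        for item, rating in user_ratings.items():       # item is i
            if item in target_devs:
                freq = frequencies[target][item]
                numerators[target] = (numerators.get(target, 0.0)
                                      + (target_devs[item] + rating) * freq)
                denominators[target] = denominators.get(target, 0) + freq
    predictions = [(t, numerators[t] / denominators[t]) for t in numerators]
    predictions.sort(key=lambda pair: pair[1], reverse=True)
    return predictions


# Hypothetical precomputed tables for items A, B and C.
deviations = {"A": {"B": 2.0, "C": 1.0},
              "B": {"A": -2.0, "C": -2.0},
              "C": {"A": -1.0, "B": 2.0}}
frequencies = {"A": {"B": 2, "C": 1},
               "B": {"A": 2, "C": 2},
               "C": {"A": 1, "B": 2}}

# Amy rated A=4 and B=3; predict her rating of C.
predictions = weighted_slope_one({"A": 4, "B": 3}, deviations, frequencies)
print(predictions)
```

For Amy the numerator is (-1 + 4) * 1 + (2 + 3) * 2 = 13 and the denominator is 1 + 2 = 3, so the predicted rating of C is 13/3, about 4.33. The pair (C, B) gets twice the weight of (C, A) because two users rated both C and B.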
import codecs

def loadMovieLens(self, path=''):
    self.data = {}
    i = 0   # total number of lines read across the three files
    #
    # First load movie ratings into self.data
    # (fields in u.data are tab-separated: user, movie, rating, ...)
    #
    f = codecs.open(path + "u.data", 'r', 'ascii')
    for line in f:
        i += 1
        # separate line into fields
        fields = line.split('\t')
        user = fields[0]
        movie = fields[1]
        rating = int(fields[2].strip().strip('"'))
        if user in self.data:
            currentRatings = self.data[user]
        else:
            currentRatings = {}
        currentRatings[movie] = rating
        self.data[user] = currentRatings
    f.close()
    #
    # Now load movies into self.productid2name
    # the file u.item contains movie id, title, release date among
    # other fields
    #
    f = codecs.open(path + "u.item", 'r', 'iso8859-1', 'ignore')
    for line in f:
        i += 1
        # separate line into fields
        fields = line.split('|')
        mid = fields[0].strip()
        title = fields[1].strip()
        self.productid2name[mid] = title
    f.close()
    #
    # Now load user info into both self.userid2name
    # and self.username2id (the whole line serves as the "name")
    #
    f = open(path + "u.user")
    for line in f:
        i += 1
        fields = line.split('|')
        userid = fields[0].strip('"')
        self.userid2name[userid] = line
        self.username2id[line] = userid
    f.close()
    print(i)
# test code
#r = recommender(0)
#r.loadMovieLens('ml-100k/')
#r.computeDeviations()
#r.slopeOneRecommendations(r.data['1'])
#r.slopeOneRecommendations(r.data['25'])
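
The parsing logic of the loader can be tried without downloading the dataset. The sketch below is an assumption-laden illustration: `parse_ratings` is a made-up helper, and the three sample rows are invented, merely shaped like the tab-separated user, movie, rating, timestamp lines that the loader expects in u.data.

```python
import io


def parse_ratings(lines):
    """Build {user: {movie: rating}} from tab-separated rating lines."""
    data = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        user, movie, rating = fields[0], fields[1], int(fields[2])
        data.setdefault(user, {})[movie] = rating
    return data


# Invented sample rows; the real file would be opened from disk instead.
sample = io.StringIO("1\t10\t3\t100\n1\t20\t5\t101\n2\t10\t4\t102\n")
data = parse_ratings(sample)
print(data)
```

User and movie ids stay strings here, matching the loader above, which is why the test code looks up users with keys such as '1' and '25'.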