Data Types: Exercises

19.

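21. Prove that the set difference measure defined below satisfies the metric axioms.

d(A, B) = size(A − B) + size(B − A)

where A and B are sets and A − B is the set difference.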
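The key step for the triangle inequality is that any element of A − C is either outside B (so it belongs to A − B) or inside B (so it belongs to B − C); the same holds with A and C swapped. As a sanity check rather than a proof, the following minimal Python sketch tests the axioms by brute force on a few example sets. The function names and the example sets are illustrative assumptions, not part of the exercise.

```python
from itertools import product

def set_distance(a: set, b: set) -> int:
    """d(A, B) = size(A - B) + size(B - A), the size of the symmetric difference."""
    return len(a - b) + len(b - a)

def satisfies_metric_axioms(sets) -> bool:
    """Brute-force check of the metric axioms over every triple of the given sets."""
    for a, b, c in product(sets, repeat=3):
        d_ab = set_distance(a, b)
        if d_ab < 0:                       # non-negativity
            return False
        if (d_ab == 0) != (a == b):        # d = 0 exactly when the sets are equal
            return False
        if d_ab != set_distance(b, a):     # symmetry
            return False
        if set_distance(a, c) > d_ab + set_distance(b, c):  # triangle inequality
            return False
    return True

# Hypothetical example sets chosen only for illustration.
example_sets = [set(), {1}, {1, 2}, {2, 3}, {1, 2, 3, 4}]
print(satisfies_metric_axioms(example_sets))
```

A check like this can only falsify the claim on the sets it is given; the element-wise argument above is what establishes it in general.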

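22. Discuss how you might map correlation values from the interval [−1, 1] to the interval [0, 1]. Note that the type of transformation you use might depend on the application you have in mind. Thus, consider two applications: clustering time series, and predicting the behavior of one time series given another.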

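For time series clustering, time series with relatively strong positive correlation should be placed together. For predicting one time series from another, both strong positive and strong negative correlation must be considered. In that case the transformation sim = |corr| is appropriate; it captures only the strength of the relationship, not its direction.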
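The two use cases can be sketched in Python as follows. Only sim = |corr| comes from the answer above; clipping negative correlations to 0 for clustering and the linear map (corr + 1) / 2 are assumptions added here as plausible alternatives, and all function names are hypothetical.

```python
def sim_for_prediction(corr: float) -> float:
    """sim = |corr|: strong negative correlation is as useful as strong positive."""
    return abs(corr)

def sim_for_clustering(corr: float) -> float:
    """Assumed variant: keep positive correlation, treat negative as no similarity."""
    return max(corr, 0.0)

def sim_linear(corr: float) -> float:
    """Assumed variant: order-preserving linear map of [-1, 1] onto [0, 1]."""
    return (corr + 1.0) / 2.0

for corr in (-1.0, -0.5, 0.0, 0.5, 1.0):
    print(corr, sim_for_prediction(corr), sim_for_clustering(corr), sim_linear(corr))
```

The linear map keeps the ordering of correlation values, whereas the other two deliberately discard information about negative correlation in different ways.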

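23. Given a similarity measure with values in the interval [0, 1], describe two ways to transform this similarity value into a dissimilarity value in the interval [0, ∞].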

d=(1-s)/s

d=-log s
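Both transformations map s = 1 to d = 0 and grow without bound as s approaches 0. A minimal Python sketch follows; the function names and the choice to return infinity at s = 0 are assumptions made for illustration.

```python
import math

def dissim_ratio(s: float) -> float:
    """d = (1 - s) / s for a similarity s in [0, 1]."""
    if not 0.0 <= s <= 1.0:
        raise ValueError("similarity must lie in [0, 1]")
    if s == 0.0:
        return math.inf  # assumed convention for the limit as s -> 0
    return (1.0 - s) / s

def dissim_neg_log(s: float) -> float:
    """d = -log(s) for a similarity s in [0, 1]."""
    if not 0.0 <= s <= 1.0:
        raise ValueError("similarity must lie in [0, 1]")
    if s == 0.0:
        return math.inf  # assumed convention for the limit as s -> 0
    return -math.log(s)

for s in (1.0, 0.5, 0.1):
    print(s, dissim_ratio(s), dissim_neg_log(s))
```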

24. Proximity is typically defined between a pair of objects.

(a) Describe two ways in which you might define the proximity among a group of objects.

Two examples are the following: (i) based on pairwise proximity, i.e., minimum pairwise similarity or maximum pairwise dissimilarity, or (ii) for points in Euclidean space, compute a centroid (the mean of all the points; see Section 8.2) and then compute the sum or average of the distances of the points to the centroid.

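(b). How might you define the distance between two sets of points in Euclidean space?

One approach is to compute the distance between the centroids of the two sets of points.

(c). How might you define the proximity between two sets of data objects? (Make no assumption about the data objects, except that a proximity measure is defined between any pair of objects.)

One approach is to compute the average pairwise proximity of objects in one group of objects with those objects in the other group. Other approaches are to take the minimum or maximum proximity.

Note that the cohesion of a cluster is related to the notion of the proximity of a group of objects among themselves, and that the separation of clusters is related to the notion of the proximity of two groups of objects. Furthermore, the proximity of two clusters is an important concept in agglomerative hierarchical clustering.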
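The approaches from (b) and (c) can be sketched in a few lines of Python. This is an illustrative sketch only: the function names, the use of Euclidean distance as the pairwise proximity, and the example points are assumptions. It relies on `math.dist`, available from Python 3.8.

```python
import math
from itertools import product
from statistics import mean

def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    return tuple(mean(coords) for coords in zip(*points))

def centroid_distance(group_a, group_b):
    """(b): Euclidean distance between the centroids of two point sets."""
    return math.dist(centroid(group_a), centroid(group_b))

def group_proximity(group_a, group_b, proximity, combine=mean):
    """(c): combine all cross-group pairwise proximities with mean, min, or max."""
    return combine(proximity(a, b) for a, b in product(group_a, group_b))

# Hypothetical example groups.
group_a = [(0, 0), (0, 2)]
group_b = [(3, 0), (3, 2)]
print(centroid_distance(group_a, group_b))
print(group_proximity(group_a, group_b, math.dist, min))
print(group_proximity(group_a, group_b, math.dist, max))
```

Passing `min` or `max` as `combine` corresponds to the minimum or maximum pairwise proximity; the default `mean` gives the average pairwise proximity. Only `group_proximity` works for arbitrary objects, since it needs nothing beyond a pairwise proximity function.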

25.

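Unfortunately, there is a typo and a lack of clarity in the hint. The hint should be phrased as follows. Hint: if z is an arbitrary point of S, then the triangle inequality d(x, y) ≤ d(x, z) + d(y, z) can be rewritten as d(y, z) ≥ d(x, y) − d(x, z).

Another application of the triangle inequality, starting with d(x, z) ≤ d(x, y) + d(y, z), shows that d(y, z) ≥ d(x, z) − d(x, y). If the lower bound on d(y, z) obtained from either of these inequalities is greater than δ, then d(y, z) does not need to be calculated. Likewise, if the upper bound on d(y, z) obtained from the inequality d(y, z) ≤ d(y, x) + d(x, z) is less than or equal to δ, then d(y, z) does not need to be calculated.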
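The pruning rule can be sketched as follows, assuming the distances from each point z to x are already known and that δ is the search radius around y. The function name, parameters, and example points are hypothetical. The two lower bounds are combined as |d(x, y) − d(x, z)|.

```python
import math

def points_within(y, points, dist_to_x, d_xy, delta, dist=math.dist):
    """Find all z with d(y, z) <= delta, skipping distance computations
    whenever the triangle-inequality bounds already decide the answer.
    Returns (neighbors, number_of_distances_actually_computed)."""
    neighbors = []
    computed = 0
    for z, d_xz in zip(points, dist_to_x):
        lower = abs(d_xy - d_xz)   # d(y, z) >= |d(x, y) - d(x, z)|
        upper = d_xy + d_xz        # d(y, z) <= d(y, x) + d(x, z)
        if lower > delta:          # certainly too far: skip without computing
            continue
        if upper <= delta:         # certainly close enough: accept without computing
            neighbors.append(z)
            continue
        computed += 1              # bounds are inconclusive: compute d(y, z)
        if dist(y, z) <= delta:
            neighbors.append(z)
    return neighbors, computed

# Hypothetical example data.
x = (0.0, 0.0)
y = (1.0, 0.0)
points = [(0.1, 0.0), (10.0, 0.0), (1.5, 0.5)]
dist_to_x = [math.dist(x, z) for z in points]
neighbors, computed = points_within(y, points, dist_to_x, math.dist(x, y), delta=1.2)
print(neighbors, computed)
```

In this example only the third point requires an actual distance computation; the first is accepted by the upper bound and the second is rejected by the lower bound.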

(b).If x = y then no calculations are necessary. As x becomes farther away, typically more distance calculations are needed.

(c). Let x and y be the two points, and let x* and y* be the points in S that are closest to x and y, respectively. Let ε be the maximum distance from any point to its closest point in S. If d(x*, y*) + 2ε ≤ β, then we can safely conclude d(x, y) ≤ β. Likewise, if d(x*, y*) − 2ε ≥ β, then we can safely conclude d(x, y) ≥ β. These formulas are derived by considering the cases where x and y are as far from x* and y* as possible and as far from or as close to each other as possible.
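The decision rule in (c) can be sketched as below, assuming every point lies within a distance `eps` of its closest representative point, so that the correction term in the two bounds is 2 × `eps`. The function name, parameter names, and return labels are hypothetical.

```python
def classify_pair(d_reps: float, eps: float, beta: float) -> str:
    """Decide what can be said about d(x, y) from d(x*, y*) alone.

    d_reps is the precomputed distance between the representatives x* and y*;
    eps is the assumed maximum distance from a point to its representative.
    """
    if d_reps + 2 * eps <= beta:
        return "at_most_beta"    # d(x, y) <= beta is guaranteed
    if d_reps - 2 * eps >= beta:
        return "at_least_beta"   # d(x, y) >= beta is guaranteed
    return "compute"             # bounds are inconclusive; compute d(x, y)

print(classify_pair(1.0, 0.5, 2.0))
print(classify_pair(5.0, 0.5, 2.0))
print(classify_pair(2.5, 0.5, 2.0))
```

Only pairs that fall in the "compute" case need an explicit distance calculation.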

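26. Prove that 1 minus the Jaccard similarity is a distance measure between two data objects x and y that satisfies the metric axioms. Specifically, d(x, y) = 1 − J(x, y).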

1(a). Because J(x, y) ≤ 1, d(x, y) ≥ 0.

1(b). Because J(x, x)=1, d(x, x)=0

2. Because J(x, y) = J(y, x), d(x, y) = d(y, x)

3. (Proof due to Jeffrey Ullman)

minhash(x) is the index of the first nonzero entry of x. prob(minhash(x) = k) is the probability that minhash(x) = k when x is randomly permuted.

Note that prob(minhash(x) = minhash(y)) = J(x, y) (minhash lemma)

Therefore, d(x, y) = 1 − prob(minhash(x) = minhash(y)) = prob(minhash(x) ≠ minhash(y)). We have to show that prob(minhash(x) ≠ minhash(z)) ≤ prob(minhash(x) ≠ minhash(y)) + prob(minhash(y) ≠ minhash(z)).

However, note that whenever minhash(x) ≠ minhash(z), at least one of minhash(x) ≠ minhash(y) and minhash(y) ≠ minhash(z) must be true.
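As a numerical cross-check of the triangle inequality (not a replacement for the proof), the following Python sketch evaluates d(x, y) = 1 − J(x, y) exactly with fractions and tests every triple of subsets of a small universe. The function names, the universe size, and the convention that two empty sets have distance 0 are assumptions.

```python
from fractions import Fraction
from itertools import combinations, product

def jaccard_distance(x: frozenset, y: frozenset) -> Fraction:
    """d(x, y) = 1 - |x & y| / |x | y|, computed exactly."""
    union = x | y
    if not union:
        return Fraction(0)  # assumed convention: two empty sets are identical
    return 1 - Fraction(len(x & y), len(union))

def all_subsets(universe):
    """Every subset of the universe, as frozensets."""
    items = list(universe)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def triangle_inequality_holds(sets) -> bool:
    """Check d(x, z) <= d(x, y) + d(y, z) for every triple."""
    return all(
        jaccard_distance(x, z) <= jaccard_distance(x, y) + jaccard_distance(y, z)
        for x, y, z in product(sets, repeat=3)
    )

print(triangle_inequality_holds(all_subsets(range(4))))
```

Exact fractions avoid false alarms from floating-point rounding when the two sides of the inequality are equal.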

27. Prove that the distance measure defined as the angle between two data vectors x and y satisfies the metric axioms. Specifically, d(x, y) = arccos(cos(x, y)).

Note that angles are in the range 0° to 180°.

1(a). Because −1 ≤ cos(x, y) ≤ 1, arccos(cos(x, y)) is defined and lies between 0° and 180°, so d(x, y) ≥ 0.

1(b). Because cos(x, x)=1, d(x, x) = arccos(1) = 0

2. Because cos(x, y) = cos(y, x), d(x, y) = d(y, x)

3. If the three vectors lie in a plane, then it is obvious that the angle between x and z must be less than or equal to the sum of the angles between x and y and between y and z. If y′ is the projection of y into the plane defined by x and z, then note that the angles between x and y and between y and z are greater than those between x and y′ and between y′ and z.
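A minimal Python sketch of the angular distance, in degrees, is given below. The function name and example vectors are assumptions; the cosine is clamped to [−1, 1] to guard against floating-point rounding before calling `math.acos`.

```python
import math

def angular_distance(x, y) -> float:
    """Angle between nonzero vectors x and y in degrees: arccos(cos(x, y))."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    cosine = dot / (norm_x * norm_y)
    cosine = max(-1.0, min(1.0, cosine))  # clamp rounding error
    return math.degrees(math.acos(cosine))

# Hypothetical example vectors that do not lie in a common plane.
x, y, z = (1.0, 0.0, 0.0), (1.0, 1.0, 1.0), (0.0, 0.0, 1.0)
d_xz = angular_distance(x, z)
d_xy = angular_distance(x, y)
d_yz = angular_distance(y, z)
print(d_xz, d_xy + d_yz, d_xz <= d_xy + d_yz)
```

Note that d(x, y) = 0 for any two vectors pointing in the same direction, so the measure distinguishes directions rather than individual vectors.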

28. Explain why computing the proximity between two attributes is often simpler than computing the similarity between two objects.

In general, an object can be a record whose fields (attributes) are of different types. To compute the overall similarity of two objects in this case, we need to decide how to compute the similarity for each attribute and then combine these similarities. This can be done straightforwardly by using Equations 2.15 or 2.16, but is still somewhat ad hoc, at least compared to proximity measures such as the Euclidean distance or correlation, which are mathematically well founded. In contrast, the values of an attribute are all of the same type, and thus, if another attribute is of the same type, then the computation of similarity is conceptually and computationally straightforward.
