The previous post gave a brief overview of word2vec; this one walks through a basic hands-on implementation.
It is largely translated from: https://adventuresinmachinelearning.com/word2vec-tutorial-tensorflow/
with some of my own understanding added.
The basic steps are:
1. Build the vocabulary. The vocabulary is the set of all the words you need to process; each word is given a unique identifier, and the number of times each word occurs is counted. A simple running index works, as does a hash or similar scheme. In effect, this builds a look-up table for all the words.
```
import collections

def build_dataset(words, n_words):
    """Process raw inputs into a dataset."""
    # Keep the (n_words - 1) most common words; everything else maps to 'UNK'.
    count = [['UNK', -1]]
    count.extend(collections.Counter(words).most_common(n_words - 1))
    # Assign each word a unique integer index (the look-up table).
    dictionary = dict()
    for word, _ in count:
        dictionary[word] = len(dictionary)
    data = list()
    unk_count = 0
    for word in words:
        if word in dictionary:
            index = dictionary[word]
        else:
            index = 0  # dictionary['UNK']
            unk_count += 1
        data.append(index)
    count[0][1] = unk_count
    # Reverse mapping: index -> word.
    reversed_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return data, count, dictionary, reversed_dictionary
```
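A quick way to see what this returns (a minimal sketch; the toy corpus and the value of `n_words` here are only illustrative):
```
words = "the cat sat on the mat and the cat slept".split()
data, count, dictionary, reversed_dictionary = build_dataset(words, n_words=10000)
print(dictionary)   # word -> index look-up table, e.g. {'UNK': 0, 'the': 1, 'cat': 2, ...}
print(count[:3])    # [['UNK', 0], ('the', 3), ('cat', 2)]
print(data)         # the corpus rewritten as indices, e.g. [1, 2, 3, ...]
```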
2. Build the batches that are fed into the neural network. A batch contains the input words and their labels, where each label is drawn at random from the context of that gram. For example, take the sentence "the cat sat on the mat": with a gram size of 3, the grams are "the cat sat", "cat sat on", and so on; with a gram size of 5, the first gram is "the cat sat on the", the input word is "sat", and the context word to predict is chosen at random from the remaining words ['the', 'cat', 'on', 'the']. The context window simply determines how many words around the input word are considered.
Regarding input words and labels: if in your setting each input word already comes with a concrete label, you can build the batches directly from those pairs. For example, in the DSSM and YouTube DNN models used for recommendation, the input "words" are the sequence of videos a user has watched, and the label is the video the user watches next.
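To make the (input, context) pairing concrete, here is a tiny sketch in plain Python (not part of the tutorial code) that lists all skip-gram pairs for a window of one word on each side:
```
sentence = "the cat sat on the mat".split()
skip_window = 1  # how many words to the left and right of the input word

pairs = []
for i, input_word in enumerate(sentence):
    # every word inside the window, except the input word itself, can serve as the label
    for j in range(max(0, i - skip_window), min(len(sentence), i + skip_window + 1)):
        if j != i:
            pairs.append((input_word, sentence[j]))

print(pairs)
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat'), ('sat', 'on'), ...]
```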
```
import collections
import random
import numpy as np

data_index = 0

# generate batch data
def generate_batch(data, batch_size, num_skips, skip_window):
    global data_index
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_window
    batch = np.ndarray(shape=(batch_size), dtype=np.int32)
    context = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    span = 2 * skip_window + 1  # [ skip_window input_word skip_window ]
    buffer = collections.deque(maxlen=span)
    # Fill the sliding window with the first `span` words.
    for _ in range(span):
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    for i in range(batch_size // num_skips):
        target = skip_window  # input word at the center of the buffer
        targets_to_avoid = [skip_window]
        for j in range(num_skips):
            # Pick a random context word from the window that has not been used yet.
            while target in targets_to_avoid:
                target = random.randint(0, span - 1)
            targets_to_avoid.append(target)
            batch[i * num_skips + j] = buffer[skip_window]  # this is the input word
            context[i * num_skips + j, 0] = buffer[target]  # these are the context words
        # Slide the window one word forward.
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    # Backtrack a little bit to avoid skipping words in the end of a batch
    data_index = (data_index + len(data) - span) % len(data)
    return batch, context
```
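Assuming `build_dataset` has already produced `data` and `reversed_dictionary`, the batches can be inspected like this (a small sketch to check the pairing, not part of the tutorial code):
```
batch, context = generate_batch(data, batch_size=8, num_skips=2, skip_window=1)
for i in range(8):
    # each row pairs an input word with one randomly chosen word from its window
    print(reversed_dictionary[batch[i]], '->', reversed_dictionary[context[i, 0]])
```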
3. With the inputs built, use TensorFlow to train word2vec.
Note: because the hidden layer in the middle is a fully connected layer, the embeddings matrix and the output weights matrix end up with the same dimensions.
```
import math
import numpy as np
import tensorflow as tf

batch_size = 128
embedding_size = 128  # Dimension of the embedding vector.
skip_window = 1       # How many words to consider left and right.
num_skips = 2         # How many times to reuse an input to generate a context.
vocabulary_size = 10000  # Size of the vocabulary (the one-hot input dimension), e.g. 10000.

# A few frequent word ids, used later to eyeball nearest neighbours.
valid_size = 16
valid_window = 100
valid_examples = np.random.choice(valid_window, valid_size, replace=False)

train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
train_context = tf.placeholder(tf.int32, shape=[batch_size, 1])
valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

# Look up embeddings for inputs.
# embedding_size is the dimensionality each word is embedded into.
embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
# tf.nn.embedding_lookup picks out the embedding rows for the given word indices.
embed = tf.nn.embedding_lookup(embeddings, train_inputs)

# Construct the variables for the softmax (output layer).
weights = tf.Variable(
    tf.truncated_normal([vocabulary_size, embedding_size],
                        stddev=1.0 / math.sqrt(embedding_size)))
biases = tf.Variable(tf.zeros([vocabulary_size]))
hidden_out = tf.matmul(embed, tf.transpose(weights)) + biases

# Finally, train the model with cross-entropy; the labels must first be
# converted to a one-hot format.
train_one_hot = tf.one_hot(tf.reshape(train_context, [-1]), vocabulary_size)
cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(logits=hidden_out,
                                            labels=train_one_hot))
# Construct the SGD optimizer using a learning rate of 1.0.
optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(cross_entropy)
```
Once the model is trained it should of course be validated; cosine similarity between word vectors is a simple way to check how good the learned embeddings are.
```
# Compute the cosine similarity between minibatch examples and all embeddings.
norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
normalized_embeddings = embeddings / norm
```
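The original tutorial then compares a handful of validation words against the whole vocabulary; a sketch of that step, reusing the `valid_dataset` constant defined above, looks like this:
```
# Rows are the validation words, columns are all words; entries are cosine similarities.
valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
similarity = tf.matmul(valid_embeddings, normalized_embeddings, transpose_b=True)
```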
4. However, training this way is slow, because the final softmax has to compute a probability for every possible word in the vocabulary (all 10,000 of them in this example). Noise-contrastive estimation (NCE) avoids this: for each training example it randomly samples only a handful of candidate words (roughly 2-20) to evaluate.
So in practice you can simply call TensorFlow's built-in function:
```
# num_sampled is the number of negative (noise) words to sample per example
# (the tutorial code uses 64).
num_sampled = 64

# Construct the variables for the NCE loss
nce_weights = tf.Variable(
    tf.truncated_normal([vocabulary_size, embedding_size],
                        stddev=1.0 / math.sqrt(embedding_size)))
nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

nce_loss = tf.reduce_mean(
    tf.nn.nce_loss(weights=nce_weights,
                   biases=nce_biases,
                   labels=train_context,
                   inputs=embed,
                   num_sampled=num_sampled,
                   num_classes=vocabulary_size))

optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(nce_loss)
```
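Putting everything together, a minimal training loop might look like the sketch below (the number of steps and the logging interval are only illustrative, and the usual nearest-neighbour printing is omitted):
```
num_steps = 100000  # illustrative; real training on a large corpus uses many more steps

with tf.Session() as session:
    session.run(tf.global_variables_initializer())
    average_loss = 0.0
    for step in range(num_steps):
        batch_inputs, batch_context = generate_batch(
            data, batch_size, num_skips, skip_window)
        feed_dict = {train_inputs: batch_inputs, train_context: batch_context}
        # One SGD step on the NCE loss.
        _, loss_val = session.run([optimizer, nce_loss], feed_dict=feed_dict)
        average_loss += loss_val
        if step > 0 and step % 2000 == 0:
            print('Average loss at step', step, ':', average_loss / 2000)
            average_loss = 0.0
    # The normalized embeddings are what you keep for downstream use.
    final_embeddings = normalized_embeddings.eval()
```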