CS224N Lecture 1

NLP with Deep Learning - Lecture 1 - Introduction and Word Vectors

Lecture Plan

The course
Human language and word meaning
Word2vec introduction
Word2vec objective function gradients
Optimization basics
Looking at word vectors

How do we represent the meaning of a word?

Definition: meaning

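The idea that is represented by a word, phrase, etc.
The idea that a person wants to express by using words, signs, etc.
The idea that is expressed in a work of writing, art, etc.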

The most common linguistic way of thinking about meaning: a mapping between a linguistic symbol and the idea or thing that the symbol stands for

How do we have usable meaning in a computer?

Common NLP solution: WordNet, a thesaurus containing lists of synonym sets and hypernyms ("is a" relationships, linking more specific terms to more abstract ones)

Problems with resources like WordNet

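Great as a resource, but it misses nuance (e.g., "proficient" is listed as a synonym of "good", which is correct only in some contexts)
Misses new meanings of words (hard to keep up to date, e.g., wicked, badass, nifty, wizard, genius, ninja, bombest)
Subjective
Requires human labor to create and adapt
Cannot be used to compute word similarity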

Representing words as discrete symbols

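In traditional NLP, we regard words as discrete symbols: hotel, conference, motel. This is a localist representation. Words are represented as one-hot vectors, where the vector dimension equals the number of words in the vocabulary (e.g., 500,000).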
\[\text{motel} = [0\;0\;0\;0\;0\;0\;0\;0\;0\;0\;1\;0\;0\;0\;0]\]
\[\text{hotel} = [0\;0\;0\;0\;0\;0\;0\;1\;0\;0\;0\;0\;0\;0\;0]\]

Problem with words as discrete symbols

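All of these vectors are orthogonal. One-hot vectors carry no notion of similarity, and their dimensionality is very large. (Example: if a user searches for "Seattle motel", we would like to match content containing "Seattle hotel", but the one-hot vectors for "motel" and "hotel" share nothing that would signal they are related.)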
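The orthogonality problem can be checked directly. The following is a minimal sketch with a hypothetical five-word vocabulary (the words and the helper names `one_hot` and `dot` are chosen for illustration only): the dot product of any two distinct one-hot vectors is 0, so "motel" is no closer to "hotel" than to any other word.

```python
# Toy vocabulary for illustration; a real one may hold hundreds of thousands of words.
vocab = ["seattle", "hotel", "motel", "conference", "banking"]
index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Return the one-hot vector for `word` as a list of 0s and 1s."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


motel, hotel, banking = one_hot("motel"), one_hot("hotel"), one_hot("banking")

print(dot(motel, motel))    # 1: a word only matches itself
print(dot(motel, hotel))    # 0: related words look unrelated
print(dot(motel, banking))  # 0: exactly as unrelated as any other pair
```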

Solution:

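Could rely on lists from WordNet-like tools to obtain similarity, but this fails because those lists are incomplete
Instead: learn to encode similarity in the vectors themselves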

Representing words by their context

Distributional semantics: a word's meaning is given by the words that frequently appear close to it
"You shall know a word by the company it keeps" (J. R. Firth 1957: 11)
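One of the most successful ideas of modern statistical NLP
When a word \(w\) appears in a text, its context is the set of words that appear nearby (within a fixed-size window).
Use the many contexts of \(w\) to build up a representation of \(w\)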
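As a rough illustration of building a representation of a word from its contexts, the sketch below counts which words occur within a fixed-size window around every occurrence of a target word. The three-sentence corpus, the window size of 2, and the function name `context_counts` are assumptions made for this example; it shows a simple count-based view of distributional semantics, not the word2vec method itself.

```python
from collections import Counter

# Toy corpus (illustrative sentences only).
corpus = [
    "government debt problems turning into banking crises as happened in 2009",
    "saying that europe needs unified banking regulation to replace the hodgepodge",
    "india has just given its banking system a shot in the arm",
]


def context_counts(target, sentences, window=2):
    """Count the words that appear within `window` positions of `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for t, word in enumerate(tokens):
            if word != target:
                continue
            lo = max(0, t - window)
            hi = min(len(tokens), t + window + 1)
            # Every token in the window except the target position itself.
            counts.update(tokens[lo:t] + tokens[t + 1:hi])
    return counts


banking_contexts = context_counts("banking", corpus)
print(banking_contexts)
```

The resulting counter is a crude, sparse description of "banking" in terms of its neighbors; the more occurrences the corpus contains, the more informative it becomes.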

Word vectors

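We build a dense vector for each word, chosen so that it is similar to the vectors of words that appear in similar contexts. Similarity is measured with the vector dot product.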
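To make the dot-product similarity concrete, here is a small sketch with hand-picked 4-dimensional vectors. The numbers are invented for illustration (real word vectors are learned from data and usually have many more dimensions), and the names `vectors` and `dot` are hypothetical.

```python
# Hand-picked toy vectors for illustration; NOT learned from data.
vectors = {
    "hotel":   [0.9, 0.8, -0.1, 0.2],
    "motel":   [0.8, 0.9, -0.2, 0.1],
    "banking": [-0.3, 0.1, 0.9, 0.7],
}


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


sim_hotel_motel = dot(vectors["hotel"], vectors["motel"])
sim_hotel_banking = dot(vectors["hotel"], vectors["banking"])

# Unlike one-hot vectors, dense vectors can place related words close together.
print(sim_hotel_motel > sim_hotel_banking)
```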

Word vectors are sometimes called word embeddings or word representations. They are a distributed representation.

Word meaning as a neural word vector – visualization

Word2vec: Overview

Word2vec (Mikolov et al. 2013) is a framework for learning word vectors.
Idea:

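We have a large corpus of text ("corpus" means "body" in Latin)
Every word in a fixed vocabulary is represented by a vector
Go through each position \(t\) in the text, which has a center word \(c\) and context ("outside") words \(o\)
Use the similarity of the word vectors for \(c\) and \(o\) to calculate the probability of \(o\) given \(c\) (or vice versa)
Keep adjusting the word vectors to maximize this probability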

Example: computing \(P(w_{t+j} \mid w_t)\) with window size \(m=2\) (so the offset \(j\) ranges over \(-2, -1, 1, 2\)), with "into" and then "banking" as the center word
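The enumeration of center and outside words can be sketched as follows. The example sentence fragment and the helper name `center_context_pairs` are assumptions for illustration; the window size is 2, so each center word is paired with up to two words on either side.

```python
def center_context_pairs(tokens, m=2):
    """Yield (center, outside) pairs for every position t and offset j,
    where -m <= j <= m, j != 0, and t + j stays inside the text."""
    for t, center in enumerate(tokens):
        for j in range(-m, m + 1):
            if j == 0 or not 0 <= t + j < len(tokens):
                continue
            yield center, tokens[t + j]


tokens = "problems turning into banking crises as".split()
pairs = list(center_context_pairs(tokens, m=2))

for center in ("into", "banking"):
    outside = [o for c, o in pairs if c == center]
    print(center, "->", outside)
# into -> ['problems', 'turning', 'banking', 'crises']
# banking -> ['turning', 'into', 'crises', 'as']
```

Each pair corresponds to one probability of the form \(P(w_{t+j} \mid w_t)\) that the model needs to evaluate.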

Word2vec: objective function

For each position \(t=1,\dots,T\), predict the context words within a fixed window of size \(m\), given the center word \(w_t\)
\[\text{Likelihood}=L(\theta)=\prod_{t=1}^{T}\;\prod_{\substack{-m \le j \le m \\ j \ne 0}}P(w_{t+j} \mid w_t;\theta)\]
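The likelihood can be evaluated numerically on a toy example. The text so far does not specify the form of \(P(o \mid c)\); the sketch below assumes the softmax over dot products used by word2vec, with a separate center vector and outside vector for each word. The vectors are random and untrained, and the token sequence, the dimension, and the function names `prob` and `likelihood` are hypothetical.

```python
import math
import random

random.seed(0)  # fixed seed so the run is repeatable

tokens = "problems turning into banking crises as".split()
vocab = sorted(set(tokens))
dim = 3

# theta: one center vector v[w] and one outside vector u[w] per word (random, untrained).
v = {w: [random.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}
u = {w: [random.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def prob(outside, center):
    """Assumed softmax form: P(o | c) = exp(u_o . v_c) / sum_w exp(u_w . v_c)."""
    scores = {w: math.exp(dot(u[w], v[center])) for w in vocab}
    return scores[outside] / sum(scores.values())


def likelihood(tokens, m=2):
    """L(theta): product over positions t and offsets j (-m <= j <= m, j != 0).
    Offsets that fall outside the text are skipped."""
    total = 1.0
    for t, center in enumerate(tokens):
        for j in range(-m, m + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                total *= prob(tokens[t + j], center)
    return total


L = likelihood(tokens, m=2)
print(L)
```

Because the likelihood is a product of many probabilities below 1, it shrinks quickly as the text grows, which is one practical reason to work with its logarithm instead.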