Introduction
References
Aim
The aim is to "learn" the probability measure \(\mathbb{P}\) of the probability space over word sequences:
\[\Big(\{\textrm{Sequences of Words}\}, \mathcal{F}, \mathbb{P} \Big)\]
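One standard way to make "learn" concrete is maximum likelihood: fix a parametric family \(\{\mathbb{P}_\theta\}\) and choose the parameters \(\theta\) to maximize the log-likelihood of an observed corpus of sequences (the family, \(\theta\), and the corpus here are placeholders for this sketch rather than anything fixed above):
\[\hat{\theta} = \underset{\theta}{\arg\max} \sum_{s \,\in\, \textrm{corpus}} \log \mathbb{P}_{\theta}(s)\]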
High Level Approach
We can achieve this objective as follows:
- Learn a distribution over the first word of a sequence.
\[\big(\mathcal{V}, \mathcal{F}, \mathbb{P} \big)\]
- Learn the conditional distribution of the next word given the preceding context (see the factorization after this list):
\[\textrm{Model} :: \textrm{Params} \to \{\textrm{context}\} \to \mathbb{P}_{\mid \textrm{context}}\]
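These two pieces suffice because, by the chain rule, the distribution over whole sequences factorizes into the first-word distribution and the conditionals:
\[\mathbb{P}(w_1, \dots, w_T) = \mathbb{P}(w_1) \prod_{t=2}^{T} \mathbb{P}(w_t \mid w_1, \dots, w_{t-1})\]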
To Do
This should be made more exact
A Neural Probabilistic Language Model
Essence
- The core idea is to transfer probability mass from training sequences to "similar" unseen sequences.
- "In the proposed model, it will so generalize because “similar” words are expected to have a similar feature vector, and because the probability function is a smooth function of these feature values, a small change in the features will induce a small change in the probability."
We can construct the conditional distribution as follows (a concrete sketch is given after the list below).
- First we introduce an embedding function \(h\). Given that our vocabulary is finite, we can represent this function as a matrix, i.e. \(h\) is isomorphic to \(\theta \in \mathbb{R}^{| \mathcal{V} | \times m}\).
\[h :: \mathcal{V} \to \mathbb{R}^m\]
- We introduce a function \(g\) which maps a sequence of these embeddings (the context) to a conditional distribution over the next word.
- Given this level of detail we could augment the signature of our model as follows:
\[\textrm{Model} :: \textrm{Embedding Functions} \to \textrm{Forward Functions} \to \{\textrm{context}\} \to \mathbb{P}_{\mid \textrm{context}}\]
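A minimal numpy sketch of this augmented signature, assuming a two-word context and one particular choice of forward function \(g\) (concatenate the context embeddings, apply a tanh hidden layer, normalize with a softmax); the sizes and the names `theta`, `g_params`, and `model` are illustrative, not fixed by the notes above.

```python
import numpy as np

rng = np.random.default_rng(0)

V = 10        # vocabulary size |V|
m = 4         # embedding dimension m
n_ctx = 2     # number of context words
h_dim = 8     # hidden-layer width

# h :: V -> R^m, stored as the |V| x m matrix theta; applying h is a row lookup.
theta = rng.normal(scale=0.1, size=(V, m))

# Parameters of the forward function g: one tanh hidden layer plus a softmax.
g_params = {
    "W_h": rng.normal(scale=0.1, size=(n_ctx * m, h_dim)),
    "b_h": np.zeros(h_dim),
    "W_o": rng.normal(scale=0.1, size=(h_dim, V)),
    "b_o": np.zeros(V),
}

def model(theta, g_params, context):
    """Embedding matrix -> forward params -> context -> P(next word | context)."""
    x = theta[context].reshape(-1)                           # embed and concatenate
    hidden = np.tanh(x @ g_params["W_h"] + g_params["b_h"])  # smooth in the features
    logits = hidden @ g_params["W_o"] + g_params["b_o"]
    exp = np.exp(logits - logits.max())                      # softmax -> probabilities
    return exp / exp.sum()

p = model(theta, g_params, [3, 7])   # conditional distribution given context (3, 7)
assert np.isclose(p.sum(), 1.0)
```

Because the output is a smooth function of the embedding rows, contexts made of words with nearby embeddings receive similar conditional distributions, which is exactly the generalization mechanism described in the Essence section.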
To Do
What should the name of this forward function be?
Key Insights
- "In high dimensions, it is crucial to distribute probability mass where it matters rather than uniformly in all directions around each training point."1