<h1 id="the-neural-turing-machine">The Neural Turing Machine</h1>
<p>This article serves to briefly outline the design of the Neural Turing Machine (NTM), an architecture trainable end-to-end by backpropagation that can (among many possibilities) learn to dynamically execute programs.</p>
<p><a href="https://arxiv.org/abs/1410.5401">The original paper.</a></p>
<p>I’ve added some specifications about the NTM’s architecture that the paper excludes for the sake of generality. These will be discussed upon presentation.</p>
<p>The Neural Turing Machine was proposed by Graves et al. as a Turing-complete network capable of learning (rather complex) programs. It is inspired by the sequential nature of the brain and the large, addressable memory of the computer.</p>
<h5 id="the-ntm-is-composed-of-five-modules">The NTM is composed of five modules:</h5>
<ul>
<li>The controller</li>
<li>The addressing module</li>
<li>The read module</li>
<li>The write module</li>
<li>The memory</li>
</ul>
<p><img src="https://blog.aidangomez.ca/assets/ntm-structure.jpg" alt="The NTM's Structure" /></p>
<h1 id="the-controller">The Controller</h1>
<p>The controller acts as the interface between input data and memory. It learns to manage its own memory through addressing.</p>
<p>The paper maintains that the controller can be of any form; it simply needs to read in data and produce the outputs required by the sub-modules. The authors choose an LSTM for their implementation.</p>
<h5 id="the-parameters-depended-upon-by-the-sub-modules-are">The parameters depended upon by the sub-modules are:</h5>
<ul>
<li><script type="math/tex">\mathbf{k_{t}}</script> the key vector; Compared against the memory when addressing by content similarity.</li>
<li><script type="math/tex">\beta_{t}</script> the key strength; A weighting affecting the precision of the content addressing.</li>
<li><script type="math/tex">g_{t} \in (0,1)</script> the blending factor; A weight to blend between content addressing and previous time-step addressing.</li>
<li><script type="math/tex">\mathbf{s_{t}}</script> the shift weighting; A normalized distribution over the allowed shift amounts.</li>
<li><script type="math/tex">\gamma_{t}</script> the sharpening exponent; Serves to sharpen the final address.</li>
<li><script type="math/tex">\mathbf{e_{t}} \in (0,1)^{N}</script> erase vector; Similar to an LSTM, decides what memory from the previous time-step to erase.</li>
<li><script type="math/tex">\mathbf{a_{t}}</script> add vector; The data to be added to memory.</li>
</ul>
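<p>To make these concrete, here’s a minimal sketch of how a controller’s raw linear outputs might be squashed into the required ranges. The dict layout and the particular squashing functions are my own choices; the paper leaves the controller’s output format open.</p>

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def interface(raw):
    # Squash unconstrained linear-head outputs into the ranges listed above.
    # This layout is hypothetical; the paper does not specify it.
    return {
        'k':     raw['k'],                              # key vector (unconstrained)
        'beta':  np.log1p(np.exp(raw['beta'])),         # softplus: strength > 0
        'g':     1.0 / (1.0 + np.exp(-raw['g'])),       # sigmoid: blend in (0, 1)
        's':     softmax(raw['s']),                     # distribution over shifts
        'gamma': 1.0 + np.log1p(np.exp(raw['gamma'])),  # sharpening >= 1
        'e':     1.0 / (1.0 + np.exp(-raw['e'])),       # erase in (0, 1)^N
        'a':     raw['a'],                              # add vector (unconstrained)
    }
```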
<h1 id="the-addressing-module">The Addressing Module</h1>
<p><img src="https://blog.aidangomez.ca/assets/ntm-addressing.jpg" alt="The NTM's Addressing" />
This module generates a window over the memory for the read and write heads.</p>
<h2 id="accessing-by-content-similarity">Accessing by content similarity</h2>
<p>This module effectively sweeps over the memory, comparing each block to the key vector <script type="math/tex">\mathbf{k_{t}}</script>, and creates a softmax-normalized weighting over the memory based on the similarity.</p>
<p><script type="math/tex">w^{c}_{t}(i) = softmax(\beta_{t}\delta(\mathbf{k}_{t}, M_{t}(i)))</script> where <script type="math/tex">\delta(a, b) = \dfrac{a \cdot b}{\vert a \vert \cdot \vert b \vert}</script> (cosine similarity) is an example similarity function.</p>
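<p>In code, content addressing is just cosine similarity followed by a temperature-scaled softmax; a NumPy sketch (the small epsilon guard against zero norms is my addition):</p>

```python
import numpy as np

def content_address(memory, key, beta):
    # Cosine similarity between the key and each of the N memory rows
    sims = memory @ key / (np.linalg.norm(memory, axis=1)
                           * np.linalg.norm(key) + 1e-8)
    # The key strength beta acts as an inverse temperature in the softmax
    scores = beta * sims
    scores = scores - scores.max()  # numerical stability
    w = np.exp(scores)
    return w / w.sum()
```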
<h2 id="accessing-by-location">Accessing by location</h2>
<p>This module has 3 steps:</p>
<ol>
<li>The controller decides how much of the previous time-step’s weighting <script type="math/tex">\mathbf{w}^{final}_{t-1}</script> should be preserved (using <script type="math/tex">g_{t}</script>).
<ul>
<li>
<script type="math/tex; mode=display">\mathbf{w}^{g}_{t} = g_{t} \cdot \mathbf{w}^{c}_{t} + (1 - g_{t}) \cdot \mathbf{w}^{final}_{t-1}</script>
</li>
</ul>
</li>
<li>It then performs a shift of the weighting (using <script type="math/tex">\mathbf{s}_{t}</script>).
<ul>
<li>
<script type="math/tex; mode=display">w^{s}_{t}(i) = \sum_{j} w^{g}_{t}(j) \cdot s_{t}(i - j)</script>
</li>
</ul>
</li>
<li>Finally it sharpens (using <script type="math/tex">\gamma_{t}</script>) and normalizes the weighting.
<ul>
<li>
<script type="math/tex; mode=display">w^{final}_{t}(i) = \dfrac{w^{s}_{t}(i)^{\gamma_{t}}}{\sum_{j}w^{s}_{t}(j)^{\gamma_{t}}}</script>
</li>
</ul>
</li>
</ol>
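<p>The three steps above can be sketched in NumPy as follows. I’m assuming <script type="math/tex">\mathbf{s}_{t}</script> is a full-length distribution over circular shifts, rather than the small shift window the paper also allows:</p>

```python
import numpy as np

def location_address(w_content, w_prev, g, s, gamma):
    # 1. Interpolate with the previous time-step's final weighting
    w_g = g * w_content + (1 - g) * w_prev
    # 2. Circular convolution with the shift distribution s
    n = len(w_g)
    w_s = np.array([sum(w_g[j] * s[(i - j) % n] for j in range(n))
                    for i in range(n)])
    # 3. Sharpen with exponent gamma and renormalize
    w = w_s ** gamma
    return w / w.sum()
```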
<h1 id="read--write">Read & Write</h1>
<p><img src="https://blog.aidangomez.ca/assets/ntm-read-write.jpg" alt="The NTM's Reading & Writing" /></p>
<h2 id="reading">Reading</h2>
<p>Pretty straight-forward.</p>
<script type="math/tex; mode=display">\mathbf{r}_{t} = \sum_{i} w^{final}_{t}(i) \cdot \mathbf{M}_{t}(i)</script>
<h2 id="writing">Writing</h2>
<p>Writing is performed in two steps, similar to how an LSTM updates its state.</p>
<ol>
<li>The erase vector <script type="math/tex">\mathbf{e_{t}}</script> removes memory that is no longer relevant.
<ul>
<li>
<script type="math/tex; mode=display">\mathbf{M}_{t}(i) = \mathbf{M}_{t-1}(i) [1 - w^{final}_{t}(i) \cdot \mathbf{e}_{t}]</script>
</li>
</ul>
</li>
<li>The new data is placed in memory (using <script type="math/tex">\mathbf{a_{t}}</script>).
<ul>
<li>
<script type="math/tex; mode=display">\mathbf{M}_{t}(i) = \mathbf{M}_{t}(i) + w^{final}_{t}(i) \cdot \mathbf{a}_{t}</script>
</li>
</ul>
</li>
</ol>
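<p>Both heads are only a couple of lines in NumPy; a sketch assuming the memory is an <script type="math/tex">N \times M</script> array and <script type="math/tex">w</script> is the final weighting:</p>

```python
import numpy as np

def read(memory, w):
    # Weighted sum of memory rows under the final address weighting
    return w @ memory

def write(memory, w, erase, add):
    # Step 1: scale each row down where the weighting and erase vector agree
    memory = memory * (1 - np.outer(w, erase))
    # Step 2: add the new data in proportion to the weighting
    return memory + np.outer(w, add)
```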
<h1 id="significance">Significance</h1>
<p>This architecture has already had an impact on a multitude of research projects (notably, the Neural GPU) and I have great faith it will continue to do so. There’s been much discussion over the past couple of years about computers programming themselves, and I believe this is the greatest stride yet towards that end-goal.</p>
<p>The Neural Turing Machine is a reactive computer, changing behaviour based on its “environment”. It will certainly play a major role in setting a precedent for the way neural networks are applied for the purpose of AI.</p>
<p>I look forward to reading more of the work done by the Google DeepMind and OpenAI teams.</p>
<p>If there are any errors in my description please do not hesitate to reach out to me at <a href="mailto:hello@aidangomez.ca">hello@aidangomez.ca</a>.</p>

<h1 id="backpropagating-an-lstm-a-numerical-example">Backpropagating an LSTM: A Numerical Example</h1>
<style>
li li {
list-style-type: none;
}
</style>
<p>Let’s do this…</p>
<p>We all know LSTMs are super powerful, so we should know how they work and how to use them.</p>
<p><img src="https://blog.aidangomez.ca/assets/lstm.png" alt="An LSTM" /></p>
<h2 id="syntactic-notes">Syntactic notes</h2>
<ul>
<li>Above <script type="math/tex">\odot</script> is the element-wise product or Hadamard product.</li>
<li>Inner products will be represented as <script type="math/tex">\cdot</script></li>
<li>Outer products will be represented as <script type="math/tex">\otimes</script></li>
<li><script type="math/tex">\sigma</script> represents the sigmoid function: <script type="math/tex">\sigma(x) = \dfrac{1}{1 + e^{-x}}</script></li>
</ul>
<h2 id="the-forward-components">The forward components</h2>
<p>The gates are defined as:</p>
<ul>
<li>Input activation:
<ul>
<li>
<script type="math/tex; mode=display">a_{t} = \tanh(W_{a} \cdot x_{t} + U_{a} \cdot out_{t-1} + b_{a})</script>
</li>
</ul>
</li>
<li>Input gate:
<ul>
<li>
<script type="math/tex; mode=display">i_{t} = \sigma(W_{i} \cdot x_{t} + U_{i} \cdot out_{t-1} + b_{i})</script>
</li>
</ul>
</li>
<li>Forget gate:
<ul>
<li>
<script type="math/tex; mode=display">f_{t} = \sigma(W_{f} \cdot x_{t} + U_{f} \cdot out_{t-1} + b_{f})</script>
</li>
</ul>
</li>
<li>Output gate:
<ul>
<li>
<script type="math/tex; mode=display">o_{t} = \sigma(W_{o} \cdot x_{t} + U_{o} \cdot out_{t-1} + b_{o})</script>
</li>
</ul>
</li>
</ul>
<p><strong>Note</strong> for simplicity we define:</p>
<script type="math/tex; mode=display">gates_{t} = \begin{bmatrix} a_{t}\\ i_{t}\\ f_{t}\\ o_{t} \end{bmatrix},\
W = \begin{bmatrix} W_{a}\\ W_{i}\\ W_{f}\\ W_{o} \end{bmatrix},\
U = \begin{bmatrix} U_{a}\\ U_{i}\\ U_{f}\\ U_{o} \end{bmatrix},\
b = \begin{bmatrix} b_{a}\\ b_{i}\\ b_{f}\\ b_{o} \end{bmatrix}</script>
<p>Which leads to:</p>
<ul>
<li>Internal state:
<ul>
<li>
<script type="math/tex; mode=display">state_{t} = a_{t} \odot i_{t} + f_{t} \odot state_{t-1}</script>
</li>
</ul>
</li>
<li>Output:
<ul>
<li>
<script type="math/tex; mode=display">out_{t} = \tanh(state_{t}) \odot o_{t}</script>
</li>
</ul>
</li>
</ul>
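<p>Putting the forward components together, one time-step of the cell can be sketched in NumPy as follows. The dict-of-gates layout for the weights is just a convenience choice of mine, mirroring the stacked definitions above:</p>

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, out_prev, state_prev, W, U, b):
    # W, U, b are dicts keyed by gate name ('a', 'i', 'f', 'o')
    a = np.tanh(W['a'] @ x + U['a'] @ out_prev + b['a'])
    i = sigmoid(W['i'] @ x + U['i'] @ out_prev + b['i'])
    f = sigmoid(W['f'] @ x + U['f'] @ out_prev + b['f'])
    o = sigmoid(W['o'] @ x + U['o'] @ out_prev + b['o'])
    state = a * i + f * state_prev   # internal state update
    out = np.tanh(state) * o         # gated output
    return out, state
```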
<h2 id="the-backward-components">The backward components</h2>
<p>Given:</p>
<ul>
<li><script type="math/tex">\Delta_{t}</script> the output difference as computed by any subsequent layers (i.e. the rest of your network), and;</li>
<li><script type="math/tex">\Delta out_{t}</script> the output difference as computed by the next time-step LSTM (the equation for t-1 is below).</li>
</ul>
<p>Find:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\delta out_{t} &= \Delta_{t} + \Delta out_{t}\\
\delta state_{t} &= \delta out_{t} \odot o_{t} \odot (1 - \tanh^{2}(state_{t})) + \delta state_{t+1} \odot f_{t+1}\\
\delta a_{t} &= \delta state_{t} \odot i_{t} \odot (1 - a_{t}^{2})\\
\delta i_{t} &= \delta state_{t} \odot a_{t} \odot i_{t} \odot (1 - i_{t})\\
\delta f_{t} &= \delta state_{t} \odot state_{t-1} \odot f_{t} \odot (1 - f_{t})\\
\delta o_{t} &= \delta out_{t} \odot \tanh(state_{t}) \odot o_{t} \odot (1 - o_{t})\\
\delta x_{t} &= W^{T} \cdot \delta gates_{t}\\
\Delta out_{t-1} &= U^{T} \cdot \delta gates_{t}
\end{aligned} %]]></script>
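<p>These backward equations translate almost line-for-line into NumPy; a sketch with the gate deltas stacked in <script type="math/tex">(a, i, f, o)</script> order:</p>

```python
import numpy as np

def lstm_backward_step(delta, delta_out, a, i, f, o, state, state_prev,
                       delta_state_next, f_next, W, U):
    # W is the stacked (4, n_in) matrix and U the stacked (4, n_out) matrix
    d_out = delta + delta_out
    d_state = (d_out * o * (1 - np.tanh(state) ** 2)
               + delta_state_next * f_next)
    d_gates = np.array([
        d_state * i * (1 - a ** 2),            # delta a
        d_state * a * i * (1 - i),             # delta i
        d_state * state_prev * f * (1 - f),    # delta f
        d_out * np.tanh(state) * o * (1 - o),  # delta o
    ])
    d_x = W.T @ d_gates             # difference w.r.t. the input
    delta_out_prev = U.T @ d_gates  # passed back to time-step t-1
    return d_gates, d_x, delta_out_prev, d_state
```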
<p>The final updates to the internal parameters are computed as:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\delta W &= \sum\limits^{T}_{t=0} \delta gates_{t} \otimes x_{t}\\
\delta U &= \sum\limits^{T-1}_{t=0} \delta gates_{t+1} \otimes out_{t}\\
\delta b &= \sum\limits^{T}_{t=0} \delta gates_{t}
\end{aligned} %]]></script>
<p>Putting this all together we can begin…</p>
<h1 id="the-example">The Example</h1>
<p>Let us begin by defining our internal weights:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
W_{a} &= \begin{bmatrix} 0.45\\ 0.25 \end{bmatrix}, U_{a} = \begin{bmatrix} 0.15 \end{bmatrix}, b_{a} = \begin{bmatrix} 0.2 \end{bmatrix}\\
W_{i} &= \begin{bmatrix} 0.95\\ 0.8 \end{bmatrix}, U_{i} = \begin{bmatrix} 0.8 \end{bmatrix}, b_{i} = \begin{bmatrix} 0.65 \end{bmatrix}\\
W_{f} &= \begin{bmatrix} 0.7\\ 0.45 \end{bmatrix}, U_{f} = \begin{bmatrix} 0.1 \end{bmatrix}, b_{f} = \begin{bmatrix} 0.15 \end{bmatrix}\\
W_{o} &= \begin{bmatrix} 0.6\\ 0.4 \end{bmatrix}, U_{o} = \begin{bmatrix} 0.25 \end{bmatrix}, b_{o} = \begin{bmatrix} 0.1 \end{bmatrix}
\end{aligned} %]]></script>
<p>And now input data:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
x_{0} &= \begin{bmatrix} 1\\ 2 \end{bmatrix} \text{ with label: } 0.5\\
x_{1} &= \begin{bmatrix} 0.5\\ 3 \end{bmatrix} \text{ with label: } 1.25\\
\end{aligned} %]]></script>
<p><em>I’m using a sequence length of two here to demonstrate the unrolling over time of RNNs.</em></p>
<h2 id="forward--t0">Forward @ <script type="math/tex">t=0</script></h2>
<p><img src="https://blog.aidangomez.ca/assets/lstm-forward-0.png" alt="Forward pass @ t=0" /></p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
&a_{0} = \tanh(W_{a} \cdot x_{0} + U_{a} \cdot out_{-1} + b_{a}) = \tanh(\begin{bmatrix} 0.45\ 0.25 \end{bmatrix} \begin{bmatrix} 1\\ 2 \end{bmatrix} + \begin{bmatrix} 0.15 \end{bmatrix} \begin{bmatrix} 0 \end{bmatrix} + \begin{bmatrix} 0.2 \end{bmatrix}) = 0.81775\\
&i_{0} = \sigma(W_{i} \cdot x_{0} + U_{i} \cdot out_{-1} + b_{i}) = \sigma(\begin{bmatrix} 0.95\ 0.8 \end{bmatrix} \begin{bmatrix} 1\\ 2 \end{bmatrix} + \begin{bmatrix} 0.8 \end{bmatrix} \begin{bmatrix} 0 \end{bmatrix} + \begin{bmatrix} 0.65 \end{bmatrix}) = 0.96083\\
&f_{0} = \sigma(W_{f} \cdot x_{0} + U_{f} \cdot out_{-1} + b_{f}) = \sigma(\begin{bmatrix} 0.7\ 0.45 \end{bmatrix} \begin{bmatrix} 1\\ 2 \end{bmatrix} + \begin{bmatrix} 0.1 \end{bmatrix} \begin{bmatrix} 0 \end{bmatrix} + \begin{bmatrix} 0.15 \end{bmatrix}) = 0.85195\\
&o_{0} = \sigma(W_{o} \cdot x_{0} + U_{o} \cdot out_{-1} + b_{o}) = \sigma(\begin{bmatrix} 0.6\ 0.4 \end{bmatrix} \begin{bmatrix} 1\\ 2 \end{bmatrix} + \begin{bmatrix} 0.25 \end{bmatrix} \begin{bmatrix} 0 \end{bmatrix} + \begin{bmatrix} 0.1 \end{bmatrix}) = 0.81757\\
\\
&state_{0} = a_{0} \odot i_{0} + f_{0} \odot state_{-1} = 0.81775 \times 0.96083 + 0.85195 \times 0 = 0.78572 \\
&out_{0} = \tanh(state_{0}) \odot o_{0} = \tanh(0.78572) \times 0.81757 = 0.53631
\end{aligned} %]]></script>
<p>From here, we can pass forward our state and output and begin the next time-step.</p>
<h2 id="forward--t1">Forward @ <script type="math/tex">t=1</script></h2>
<p><img src="https://blog.aidangomez.ca/assets/lstm-forward-1.png" alt="Forward pass @ t=1" /></p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
&a_{1} = \tanh(W_{a} \cdot x_{1} + U_{a} \cdot out_{0} + b_{a}) = \tanh(\begin{bmatrix} 0.45\ 0.25 \end{bmatrix} \begin{bmatrix} 0.5\\ 3 \end{bmatrix} + \begin{bmatrix} 0.15 \end{bmatrix} \begin{bmatrix} 0.53631 \end{bmatrix} + \begin{bmatrix} 0.2 \end{bmatrix}) = 0.84980\\
&i_{1} = \sigma(W_{i} \cdot x_{1} + U_{i} \cdot out_{0} + b_{i}) = \sigma(\begin{bmatrix} 0.95\ 0.8 \end{bmatrix} \begin{bmatrix} 0.5\\ 3 \end{bmatrix} + \begin{bmatrix} 0.8 \end{bmatrix} \begin{bmatrix} 0.53631 \end{bmatrix} + \begin{bmatrix} 0.65 \end{bmatrix}) = 0.98118\\
&f_{1} = \sigma(W_{f} \cdot x_{1} + U_{f} \cdot out_{0} + b_{f}) = \sigma(\begin{bmatrix} 0.7\ 0.45 \end{bmatrix} \begin{bmatrix} 0.5\\ 3 \end{bmatrix} + \begin{bmatrix} 0.1 \end{bmatrix} \begin{bmatrix} 0.53631 \end{bmatrix} + \begin{bmatrix} 0.15 \end{bmatrix}) = 0.87030\\
&o_{1} = \sigma(W_{o} \cdot x_{1} + U_{o} \cdot out_{0} + b_{o}) = \sigma(\begin{bmatrix} 0.6\ 0.4 \end{bmatrix} \begin{bmatrix} 0.5\\ 3 \end{bmatrix} + \begin{bmatrix} 0.25 \end{bmatrix} \begin{bmatrix} 0.53631 \end{bmatrix} + \begin{bmatrix} 0.1 \end{bmatrix}) = 0.84993\\
\\
&state_{1} = a_{1} \odot i_{1} + f_{1} \odot state_{0} = 0.84980 \times 0.98118 + 0.87030 \times 0.78572 = 1.5176 \\
&out_{1} = \tanh(state_{1}) \odot o_{1} = \tanh(1.5176) \times 0.84993 = 0.77197
\end{aligned} %]]></script>
<p>And since we’re done with our sequence, we have everything we need to begin backpropagating.</p>
<h2 id="backward--t1">Backward @ <script type="math/tex">t=1</script></h2>
<p><img src="https://blog.aidangomez.ca/assets/lstm-backward-1.png" alt="Backward pass @ t=1" /></p>
<p>First we’ll need to compute the difference in output from the expected (label).</p>
<p><strong>Note</strong> for this we’ll be using L2 loss: <script type="math/tex">E(x, \hat x) = \dfrac{(x - \hat x)^{2}}{2}</script>. The derivative w.r.t. <script type="math/tex">x</script> is <script type="math/tex">\partial_{x}E(x, \hat x) = x - \hat x</script>.</p>
<script type="math/tex; mode=display">\begin{aligned}
\Delta_{1} = \partial_{x}E = 0.77197 - 1.25 = -0.47803
\end{aligned}</script>
<p><script type="math/tex">\Delta out_{1} = 0</script> because there are no future time-steps.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\delta out_{1} &= \Delta_{1} + \Delta out_{1} = -0.47803 + 0 = -0.47803\\
\delta state_{1} &= \delta out_{1} \odot o_{1} \odot (1 - \tanh^{2}(state_{1})) + \delta state_{2} \odot f_{2} = -0.47803 \times 0.84993 \times (1 - \tanh^{2}(1.5176)) + 0 \times 0 = -0.07111\\
\delta a_{1} &= \delta state_{1} \odot i_{1} \odot (1 - a_{1}^{2}) = -0.07111 \times 0.98118 \times (1 - 0.84980^{2}) = -0.01938\\
\delta i_{1} &= \delta state_{1} \odot a_{1} \odot i_{1} \odot (1 - i_{1}) = -0.07111 \times 0.84980 \times 0.98118 \times (1 - 0.98118) = -0.00112\\
\delta f_{1} &= \delta state_{1} \odot state_{0} \odot f_{1} \odot (1 - f_{1}) = -0.07111 \times 0.78572 \times 0.87030 \times (1 - 0.87030) = -0.00631\\
\delta o_{1} &= \delta out_{1} \odot \tanh(state_{1}) \odot o_{1} \odot (1 - o_{1}) = -0.47803 \times \tanh(1.5176) \times 0.84993 \times (1 - 0.84993) = -0.05538\\
\\
\delta x_{1} &= W^{T} \cdot \delta gates_{1}\\
&= \begin{bmatrix} 0.45 \ 0.95 \ 0.70 \ 0.60 \\ 0.25 \ 0.80 \ 0.45 \ 0.40\end{bmatrix} \begin{bmatrix} -0.01938 \\ -0.00112 \\ -0.00631 \\ -0.05538\end{bmatrix} = \begin{bmatrix} -0.04743 \\ -0.03073 \end{bmatrix}\\
\Delta out_{0} &= U^{T} \cdot \delta gates_{1}\\
&= \begin{bmatrix} 0.15 \ 0.80 \ 0.10 \ 0.25 \end{bmatrix} \begin{bmatrix} -0.01938 \\ -0.00112 \\ -0.00631 \\ -0.05538\end{bmatrix} = -0.01828\\
\end{aligned} %]]></script>
<p>Now we can pass back our <script type="math/tex">\Delta out_{0}</script> and continue on computing…</p>
<h2 id="backward--t0-">Backward @ <script type="math/tex">t=0</script></h2>
<p><img src="https://blog.aidangomez.ca/assets/lstm-backward-0.png" alt="Backward pass @ t=0" /></p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\Delta_{0} &= \partial_{x}E = 0.53631 - 0.5 = 0.03631\\
\Delta out_{0} &= -0.01828, \text{ passed back from } t=1\\
\\
\delta out_{0} &= \Delta_{0} + \Delta out_{0} = 0.03631 + -0.01828 = 0.01803\\
\delta state_{0} &= \delta out_{0} \odot o_{0} \odot (1 - \tanh^{2}(state_{0})) + \delta state_{1} \odot f_{1} = 0.01803 \times 0.81757 \times (1 - \tanh^{2}(0.78572)) + -0.07111 \times 0.87030 = -0.05349\\
\delta a_{0} &= \delta state_{0} \odot i_{0} \odot (1 - a_{0}^{2}) = -0.05349 \times 0.96083 \times (1 - 0.81775^{2}) = -0.01703\\
\delta i_{0} &= \delta state_{0} \odot a_{0} \odot i_{0} \odot (1 - i_{0}) = -0.05349 \times 0.81775 \times 0.96083 \times (1 - 0.96083) = -0.00165\\
\delta f_{0} &= \delta state_{0} \odot state_{-1} \odot f_{0} \odot (1 - f_{0}) = -0.05349 \times 0 \times 0.85195 \times (1 - 0.85195) = 0\\
\delta o_{0} &= \delta out_{0} \odot \tanh(state_{0}) \odot o_{0} \odot (1 - o_{0}) = 0.01803 \times \tanh(0.78572) \times 0.81757 \times (1 - 0.81757) = 0.00176\\
\\
\delta x_{0} &= W^{T} \cdot \delta gates_{0}\\
&= \begin{bmatrix} 0.45 \ 0.95 \ 0.70 \ 0.60 \\ 0.25 \ 0.80 \ 0.45 \ 0.40\end{bmatrix} \begin{bmatrix} -0.01703 \\ -0.00165 \\ 0 \\ 0.00176 \end{bmatrix} = \begin{bmatrix} -0.00817 \\ -0.00487 \end{bmatrix}\\
\Delta out_{-1} &= U^{T} \cdot \delta gates_{0}\\
&= \begin{bmatrix} 0.15 \ 0.80 \ 0.10 \ 0.25 \end{bmatrix} \begin{bmatrix} -0.01703 \\ -0.00165 \\ 0 \\ 0.00176 \end{bmatrix} = -0.00343\\
\end{aligned} %]]></script>
<p>And we’re done with the backward pass!</p>
<p>Now we’ll need to update our internal parameters according to whichever optimization algorithm you’ve chosen. I’m going to use a simple Stochastic Gradient Descent (SGD) update with learning rate: <script type="math/tex">\lambda = 0.1</script>.</p>
<p>We’ll need to compute how much our weights are going to change by:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\delta W &= \sum\limits^{T}_{t=0} \delta gates_{t} \otimes x_{t}\\
&= \begin{bmatrix} -0.01703 \\ -0.00165 \\ 0 \\ 0.00176 \end{bmatrix} \begin{bmatrix} 1.0 \ 2.0 \end{bmatrix} + \begin{bmatrix} -0.01938 \\ -0.00112 \\ -0.00631 \\ -0.05538\end{bmatrix} \begin{bmatrix} 0.5 \ 3.0 \end{bmatrix} = \begin{bmatrix} -0.02672 \ -0.0922 \\ -0.00221 \ -0.00666 \\ -0.00316 \ -0.01893 \\ -0.02593 \ -0.16262 \end{bmatrix}\\
\delta U &= \sum\limits^{T-1}_{t=0} \delta gates_{t+1} \otimes out_{t}\\
&= \begin{bmatrix} -0.01938 \\ -0.00112 \\ -0.00631 \\ -0.05538\end{bmatrix} \begin{bmatrix} 0.53631 \end{bmatrix} = \begin{bmatrix} -0.01039 \\ -0.00060 \\ -0.00338 \\ -0.02970 \end{bmatrix}\\
\delta b &= \sum\limits^{T}_{t=0} \delta gates_{t}\\
&= \begin{bmatrix} -0.01703 \\ -0.00165 \\ 0 \\ 0.00176 \end{bmatrix} + \begin{bmatrix} -0.01938 \\ -0.00112 \\ -0.00631 \\ -0.05538\end{bmatrix} = \begin{bmatrix} -0.03641 \\ -0.00277 \\ -0.00631 \\ -0.05362 \end{bmatrix}
\end{aligned} %]]></script>
<p>And updating our parameters based on the SGD update rule <script type="math/tex">W^{new} = W^{old} - \lambda \cdot \delta W^{old}</script> we get our new weight set:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
W_{a} &= \begin{bmatrix} 0.45267\\ 0.25922 \end{bmatrix}, U_{a} = \begin{bmatrix} 0.15104 \end{bmatrix}, b_{a} = \begin{bmatrix} 0.20364 \end{bmatrix}\\
W_{i} &= \begin{bmatrix} 0.95022\\ 0.80067 \end{bmatrix}, U_{i} = \begin{bmatrix} 0.80006 \end{bmatrix}, b_{i} = \begin{bmatrix} 0.65028 \end{bmatrix}\\
W_{f} &= \begin{bmatrix} 0.70031\\ 0.45189 \end{bmatrix}, U_{f} = \begin{bmatrix} 0.10034 \end{bmatrix}, b_{f} = \begin{bmatrix} 0.15063 \end{bmatrix}\\
W_{o} &= \begin{bmatrix} 0.60259\\ 0.41626 \end{bmatrix}, U_{o} = \begin{bmatrix} 0.25297 \end{bmatrix}, b_{o} = \begin{bmatrix} 0.10536 \end{bmatrix}
\end{aligned} %]]></script>
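<p>As a sanity check, the gradient accumulation and SGD update can be reproduced in a few lines of NumPy using the gate deltas computed above:</p>

```python
import numpy as np

# Gate deltas from the two backward steps, stacked in (a, i, f, o) order
d_gates_0 = np.array([-0.01703, -0.00165, 0.0, 0.00176])
d_gates_1 = np.array([-0.01938, -0.00112, -0.00631, -0.05538])
x_0, x_1 = np.array([1.0, 2.0]), np.array([0.5, 3.0])
out_0 = 0.53631

# Accumulate the parameter gradients over the sequence
dW = np.outer(d_gates_0, x_0) + np.outer(d_gates_1, x_1)
dU = np.outer(d_gates_1, [out_0])
db = d_gates_0 + d_gates_1

# Apply the SGD update with learning rate 0.1 (shown here for W only)
W = np.array([[0.45, 0.25], [0.95, 0.80], [0.70, 0.45], [0.60, 0.40]])
W_new = W - 0.1 * dW
```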
<p>And that completes one iteration of solving an LSTM cell!</p>
<p>Of course, this whole process is sequential in nature and a small error will render all subsequent calculations useless, so if you catch <strong>ANYTHING</strong> email me at <a href="mailto:hello@aidangomez.ca">hello@aidangomez.ca</a>.</p>

<h1 id="on-creating-artificial-intelligence">On Creating Artificial Intelligence</h1>
<p><img src="https://cdn-images-1.medium.com/max/2000/1*mGDXMZlSnxDF1ofbM3E6dw.jpeg" alt="“Cortex” the stunning work of Greg Dunn" /></p>
<p>See: <a href="http://arxiv.org/abs/1511.08130">Facebook’s: “A Roadmap towards Machine Intelligence”</a></p>
<p>Recently, Tomas Mikolov, Armand Joulin and Marco Baroni of Facebook’s AI Research department released a paper titled “A Roadmap towards Machine Intelligence”, one that outlines a highly abstracted and theoretical guide to the development of artificial intelligence.</p>
<p>The paper proposed some novel and interesting methodologies in a notably accessible format and it’s well worth the short read.</p>
<p>I wish to take a look at some of the most thought-provoking points raised, as well as offer some more commentary on how we might make progress towards the lofty goal of AI.</p>
<h1 id="communication">Communication</h1>
<p>The capacity for communication in intelligent machines is a necessity — and far from a unique suggestion — however, the way by which communication is generated and utilized is crucial to the methodology put forward. What is novel is the suggestion of communication and natural language development providing a guarantee — of sorts — about the machine’s ability to learn.</p>
<p>The authors propose (to some degree) that the sole prerequisite of a machine capable of artificial intelligence, is the ability to learn. The machine begins training entirely naïve to its environment, purpose, and capabilities. It should be mentioned; I’m ignoring their propositions of pattern interpretation and internalization — as well as other suggestions, such as the elasticity of internal models — which I feel are redundant to discuss at present, as they are necessary prerequisites to a successful implementation of the proposed communication-based methodology.</p>
<p>From this base of “a machine that can learn” the authors propose that training must begin by first learning natural communication in order to interact with its teacher and the environment.</p>
<h2 id="a-natural-learning-approach">A natural learning approach</h2>
<p>Humans are master replicators, and weak innovators; as such, most of our technological success has drawn from our ability to recognize patterns and reverse-engineer our environment. It seems only logical to conduct the training of our intelligent technology in the same manner by which we train ourselves.</p>
<p>After birth, a child is little more than a blob of overwhelmed carbon. It is hard coded only with the most dire necessities for life: cry and flail when hungry, cry and flail when in pain, cry and flail when alone, etc. As infants, we begin attempted interpretation of some extremely complex input — our senses. To begin with, we are pretty much useless at it. Luckily, our brains are designed to recognize and react to patterns, so even with very few tools in our belt, we are able to improve our cognition rapidly.</p>
<p>We aren’t born with language, we learn it from our input; the reaction of our teachers (parents) provides positive feedback, and the connection between language and the physical world provides a tool to express our desires (more on desire, later.)</p>
<p>This same idea applies to the suggested training methodology; give our machine the most distilled set of tools necessary for it to function and have it prove itself by learning and orienting itself towards the environment. Our machine should be able to find patterns in its input and react with an output that maximizes positive environmental response — communication!</p>
<p>The most successful algorithms capable of learning complex patterns — to a high degree of abstraction — are without question neural networks. As the name suggests, these algorithms and models are based on our own brain.</p>
<h2 id="intelligence">Intelligence</h2>
<p>Like many in and around the field of computer science, I spend countless hours in the shower pondering the definition of intelligence. What is it that makes humanity unique to all other life? One trait that seems to be most promising is our innate ability to recognize high order patterns, within abstract data. While nearly all lifeforms are capable of some pattern recognition and, in some cases, complex pattern recognition (see: <a href="https://www.nytimes.com/2008/08/26/science/26crow.html?_r=0">Human Facial Recognition in Crows</a>); humans seem to have a unique propensity for recognizing patterns in unnatural data — beyond the realm of our senses and into the realm of mental abstractions. I would conjecture that it is our ability to recognize patterns that spurred the development of our complex communication tool, language.</p>
<p>For most animals, expressing observations about patterns present in their physical surroundings suffices for nearly all communication needs. In humans, our ability to recognize abstract patterns required us to be able to communicate these purely mental constructs (independent of our senses) to others in the group. Thus, allowing for distributed brainpower in decision making, as well as faster, more complete transfer of complex ideas.</p>
<p>It’s difficult to imagine how a great ape would express a thought pattern such as, “the yield of these fruit-bearing trees seems to be increased if we allow more sunlight to hit its leaves” to its kin. So, regardless of a great ape’s ability to innovate, the passage of complex knowledge ends with each individual.</p>
<h2 id="training">Training</h2>
<h3 id="assume">Assume:</h3>
<ul>
<li>we begin with a machine capable of receiving input and giving output;</li>
<li>it is capable of learning by reinforcement (+/– feedback) from its teacher;</li>
<li>it is void of any information about how to interact with its environment.</li>
</ul>
<h3 id="method">Method:</h3>
<ul>
<li>The machine will receive input in the form of natural language instructions (i.e. “turn left”) from its teacher</li>
<li>The machine will query and instruct upon the environment</li>
<li>The machine will respond to the teacher (in simple cases, it will relay the actions it took within the environment)</li>
</ul>
<p>We can see that, to begin with, the machine will simply spew random characters as a response to the teacher’s instructions – resulting in negative feedback from the teacher. Eventually, the machine will happen upon a correct random output, which will spur the education of its communication-based interaction with the teacher and environment.</p>
<p>So, the training method proposed by the authors begins by first teaching our machine to communicate with its teacher and environment; from here, incremental steps in complexity are taken, introducing the machine to further nuances and ambiguity in language.</p>
<p>The authors stress the importance of small-batch training: the ability to learn from very little exposure to phenomena. Most training regimes typically rely on large amounts of data — with uniform representation frequency for each output category. Considering the massive amounts of research and subsequent innovation, I have no doubt that new solving strategies, ones that more effectively deal with small-batch inference, will be developed. Another point I’ll raise shortly is that of outliers, which I believe to be significant in any training regimen.</p>
<h2 id="elasticity">Elasticity</h2>
<p>Another point that the authors have raised is the idea that any model capable of using their training methodology must be able to increase its own internal complexity, dynamically, to scale as problems become more complex. This is something that I am in complete agreement with. I have always found that NNs’ reliance on a programmer to decide upon the complexity of their parameter space seems wholly unnecessary. A massive step for machine learning will be creating models that scale their parameters and complexity elastically; expanding to incorporate recurring outliers that may represent an unconsidered feature or category, contracting to eliminate redundancies.</p>
<p>Methods for network reduction already exist (see: <a href="http://axon.cs.byu.edu/papers/Menke.OracleJournal.pdf">neural network reduction</a>), and I have no doubt that in the next couple of years we will see training regimes that stack the use of multiple networks (as in the cited paper) to manage complexity.</p>
<h2 id="modularity">Modularity</h2>
<p>This leads into how we presently apply neural networks; we use neural networks disjointly to interpret particular types of data. Only very recently in the field of deep learning (for instance, <a href="http://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Szegedy_Going_Deeper_With_2015_CVPR_paper.pdf">Google’s Inception network</a>) have we been joining sub-networks together into a larger net to solve highly complex problems. I think this is an incredibly intuitive ‘next step’: we have a neuron which can solve extremely simple problems — inform when input is above a threshold — then, we combine these into simple networks to solve more complex problems. The next logical step is to combine nets into a ‘super-network’ that uses sub-networks for sub-problems and then the overarching network to extract the super-conclusion from the sub-networks’ sub-conclusions.</p>
<p><img src="https://cdn-images-1.medium.com/max/800/1*fSdhPRjTiAXO13tooELWaw.png" alt="A single neuron of a neural network" /></p>
<p>Humans use a similar decision making system: we leverage information from all of our inputs (senses) to draw conclusions about our environment and how we should react to it. Neural networks can now interpret images, sounds and more abstract data (see: <a href="http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf">language modelling</a>), so the logical next step is combining these modular components to form a larger system.</p>
<p>I am convinced that the combination of modularity and elastic scaling will provide us with the best chance of creating effective decision making machines. It’s only to be expected that a network — when presented with enough outlier data of a similar form — will allocate a new sub-network to try and interpret this new phenomenon.</p>
<h2 id="memory">Memory</h2>
<p>Another key tool to human learning is our memories. We store a vast amount of information in highly distilled forms from a lifetime of experiences. Recent information is held in a highly detailed and uncompressed format, while older information (that has been thoroughly processed) can be stored in a more obscure, reduced format. Oddities or outliers are ‘embossed’ in our minds (see: Bayesian Surprise Attracts Human Attention); they are easy to find and often when our minds wander, we wander to these events. It’s as if our minds dedicate idle processing power to drawing conclusions about events that we have trouble categorizing.</p>
<p>When we experience new information, we tend to look back into our past experiences to see if we can draw any more insight from them; this way, we can draw out potential value from this new information, via connections with experiences we are familiar with. Neural networks, in their present state, are largely naïve to their past experience. They extract consistent and generalized features, while ignoring nuances and discarding unique instances. Perhaps as networks become more complex and varied, we will develop a method of storing and grouping previous input that is particularly unique. The net may then have a training step that looks at these unique instances and tries to determine whether a new category (elasticity) or perhaps a new network (modularity) could be used to extract value from these curious points of data.</p>
<h2 id="curiosity">Curiosity</h2>
<p>This is something that I’ve never considered when thinking about what properties should make up AI, and it is an enormous oversight on my part. Curiosity seems so intimately linked to human development, that to have it anywhere but at the forefront of our minds when creating intelligent machines could severely limit how quickly we solve the problem.</p>
<p>I would define curiosity as the pursuit of (better: desire for) previously unexperienced information. This plays directly into the presented notions of a machine that effectively scales to include outliers and new data. The authors suggest that the machine be given ‘free time’ to explore its environment and apply the skills it has been developing in training, as well as learn new ones. It’s this idea of curiosity that will spark machine intelligence; its purpose should be to learn as much about the world as possible, and it should achieve this through experience.</p>
<h1 id="conclusion">Conclusion</h1>
<p>I think that the ideas presented in Facebook’s paper are both interesting and easily consumed; allowing for collective brainstorming across areas of expertise.</p>
<p>As we learn more about our own neurological learning processes (see: <a href="http://news.sciencemag.org/brain-behavior/2015/10/mysterious-holes-neuron-net-may-help-store-long-term-memories">Science’s ‘Holes’ in neuron net store long-term memories</a>), the more effective our computational models will become.</p>
<p>Our success on this topic is entirely dependent on the collaboration of two fairly distant fields: Biology and Computer Science (Man and Machine?).</p>
<p>A good question for humanity to start asking itself: if we are capable of creating a curious, virtuous, and sustainable form of life via intelligent machines; is the <a href="https://observer.com/2015/08/stephen-hawking-elon-musk-and-bill-gates-warn-about-artificial-intelligence/">fear of this machine becoming our successor</a> well founded or necessary?</p>
<p>If you enjoyed the article please recommend, and feel free to follow my publications!</p>