Teaching a Machine to Love XOR


The XOR function outputs true if exactly one of its two inputs is true

The exclusive or function, also known as XOR (but never going by both names simultaneously), has a special relationship to artificial intelligence in general, and neural networks in particular. This is thanks to a prominent 1969 book by Marvin Minsky and Seymour Papert entitled “Perceptrons: An Introduction to Computational Geometry.” Depending on who you ask, this text was single-handedly responsible for the AI winter, thanks to its critique of the state-of-the-art neural networks of the time. In an alternative view, few people ever actually read the book but everyone heard about it, and the tendency was to generalize a special-case limitation of local and single-layer perceptrons to the point where interest in, and funding for, neural networks evaporated. In any case, thanks to backpropagation, neural networks are now in widespread use, and we can easily train a three-layer neural network to replicate the XOR function.

In words, the XOR function is true for two inputs if one of them, but not both, is true. When you plot XOR as a graph, it becomes obvious why the early perceptron would have trouble getting it right more than half the time.


There’s no way to draw a straight line on the 2D graph that separates the true and false outputs of XOR, red and green in the sketch above. Go ahead and try. The same holds for a plane in the 3D version, and so on up through higher dimensions.
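You don’t have to take the geometric argument on faith. Here’s a quick sketch of my own (plain NumPy, not from the original post) that runs the classic perceptron learning rule on the four XOR points; since no separating line exists, the weights can never settle on an answer that gets all four right.

```python
import numpy as np

# The four XOR points and their labels
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

w = np.zeros(2)  # weights defining the candidate separating line
b = 0.0          # bias

# Perceptron learning rule: nudge the line toward each misclassified point
for epoch in range(1000):
    for xi, yi in zip(X, y):
        pred = 1 if np.dot(w, xi) + b > 0 else 0
        w += (yi - pred) * xi
        b += (yi - pred)

preds = [1 if np.dot(w, xi) + b > 0 else 0 for xi in X]
correct = sum(int(p == t) for p, t in zip(preds, y))
print(correct)  # at most 3 of 4 -- no line separates the XOR points
```

The weights simply cycle forever: whichever three points the line gets right, the fourth update drags it away again.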


That’s a problem, because a single-layer perceptron can only classify points linearly. But if we allow ourselves a curved boundary, we can separate the true and false outputs easily, and a curved boundary is exactly what we get by adding a hidden layer to a neural network.
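To see concretely why a hidden layer helps, here’s a hand-wired sketch (my own illustration, not part of the gist below): two hidden threshold units compute OR and AND of the inputs, and in that new coordinate system XOR becomes the linearly separable rule “OR and not AND.”

```python
import numpy as np

def step(x):
    # Threshold activation: 1 if the input is positive, else 0
    return (x > 0).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hidden layer: first unit fires for OR(a, b), second for AND(a, b)
W_hidden = np.array([[1, 1],
                     [1, 1]])
b_hidden = np.array([-0.5, -1.5])
h = step(X @ W_hidden + b_hidden)

# Output layer: a single LINEAR threshold on (OR, AND) now suffices
w_out = np.array([1, -2])
b_out = -0.5
out = step(h @ w_out + b_out)

print(out)  # [0 1 1 0] -- the XOR truth table
```

The hidden layer remaps the four inputs so that the two “true” cases land on one side of a straight line and the two “false” cases on the other; training with backpropagation finds weights like these automatically.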


The truth table for XOR is as follows:

Input | Output
  00  |   0
  01  |   1
  10  |   1
  11  |   0

If we want to train a neural network to replicate the table above, we use backpropagation to flow the output errors backward through the network, based on the neuron activations at each node. Using the gradients of these activations and the error in the layer immediately above, the network’s error can be minimized by something like gradient descent. As a result, our network can now be taught to represent a non-linear function. For a network with two inputs, three hidden units, and one output, the training might go something like this:
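Backpropagation only works if the analytic gradients are right, so a habit worth keeping is a finite-difference check. Here’s a small sketch of mine (mirroring the sigmoid and sigmoidGradient functions used in the gist below): perturb the input slightly and compare the numerical slope to the formula.

```python
import numpy as np

def sigmoid(x):
    # 1 / (1 + e^-x): ~0 at -infinity, 0.5 at 0, ~1 at +infinity
    return 1 / (1 + np.exp(-x))

def sigmoidGradient(x):
    # Analytic derivative of the sigmoid: sigma(x) * (1 - sigma(x))
    return sigmoid(x) * (1 - sigmoid(x))

# Central finite difference approximates the derivative numerically
eps = 1e-5
x = np.linspace(-3, 3, 7)
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
analytic = sigmoidGradient(x)

print(np.max(np.abs(numeric - analytic)))  # should be tiny
```

If the two disagree by more than a few orders of magnitude above machine precision, the backward pass has a bug, and no amount of extra epochs will fix it.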


Update (2017/03/02) Here’s the gist for making the gif above:

# Training a neural XOR circuit
# In Marvin Minsky and Seymour Papert's in/famous critique of perceptrons, published in 1969,
# they argued that neural networks had extremely limited utility, proving that the perceptrons
# of the time could not even learn the exclusive OR function. This played some role in the
# AI winter that followed.
# Now we can easily teach a neural network an XOR function by incorporating more layers.
# Truth table:
# Input | Output
#    00 | 0
#    01 | 1
#    10 | 1
#    11 | 0
# I used craffel's draw_neural_net.py at https://gist.github.com/craffel/2d727968c3aaebd10359
# Date 2017/01/22
# http://www.thescinder.com
# Blog post https://thescinder.com/2017/01/24/teaching-a-machine-to-love-xor/

# Imports
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

# Input vector based on the truth table above
a0 = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
# Target output
y = np.array([[0], [1], [1], [0]])

def draw_neural_net(ax, left, right, bottom, top, layer_sizes, Theta0, Theta1):
    '''
    Public Gist from craffel
    Draw a neural network cartoon using matplotlib.

    :usage:
        >>> fig = plt.figure(figsize=(12, 12))
        >>> draw_neural_net(fig.gca(), .1, .9, .1, .9, [4, 7, 2])

    :parameters:
        - ax : matplotlib.axes.AxesSubplot
            The axes on which to plot the cartoon (get e.g. by plt.gca())
        - left : float
            The center of the leftmost node(s) will be placed here
        - right : float
            The center of the rightmost node(s) will be placed here
        - bottom : float
            The center of the bottommost node(s) will be placed here
        - top : float
            The center of the topmost node(s) will be placed here
        - layer_sizes : list of int
            List of layer sizes, including input and output dimensionality
    '''
    n_layers = len(layer_sizes)
    v_spacing = (top - bottom)/float(max(layer_sizes))
    h_spacing = (right - left)/float(len(layer_sizes) - 1)
    # Nodes
    for n, layer_size in enumerate(layer_sizes):
        layer_top = v_spacing*(layer_size - 1)/2. + (top + bottom)/2.
        for m in range(layer_size):
            circle = plt.Circle((n*h_spacing + left, layer_top - m*v_spacing), v_spacing/4.,
                                color='#999999', ec='k', zorder=4)
            ax.add_artist(circle)
    # Edges (line width tracks the magnitude of the connection weight)
    for n, (layer_size_a, layer_size_b) in enumerate(zip(layer_sizes[:-1], layer_sizes[1:])):
        layer_top_a = v_spacing*(layer_size_a - 1)/2. + (top + bottom)/2.
        layer_top_b = v_spacing*(layer_size_b - 1)/2. + (top + bottom)/2.
        for m in range(layer_size_a):
            for o in range(layer_size_b):
                if (n == 0):
                    line = plt.Line2D([n*h_spacing + left, (n + 1)*h_spacing + left],
                                      [layer_top_a - m*v_spacing, layer_top_b - o*v_spacing],
                                      c='#8888dd', lw=np.abs(Theta0[m, o]))
                elif (n == 1):
                    line = plt.Line2D([n*h_spacing + left, (n + 1)*h_spacing + left],
                                      [layer_top_a - m*v_spacing, layer_top_b - o*v_spacing],
                                      c='#8888cc', lw=np.abs(Theta1[m, o]))
                else:
                    line = plt.Line2D([n*h_spacing + left, (n + 1)*h_spacing + left],
                                      [layer_top_a - m*v_spacing, layer_top_b - o*v_spacing],
                                      c='r')
                ax.add_artist(line)

# Neuron functions
def sigmoid(x):
    # The sigmoid function is 0.5 at 0, ~1 at infinity and ~0 at -infinity
    # This is a good activation function for neural networks
    mySig = 1/(1 + np.exp(-x))
    return mySig

def sigmoidGradient(x):
    # Used for calculating the gradient at NN nodes to back-propagate during NN learning
    myDer = sigmoid(x)*(1 - sigmoid(x))
    return myDer

# Initialize neural network connections with random values. This breaks symmetry so the network can learn.
Theta0 = 2*np.random.random((2, 3)) - 1
Theta1 = 2*np.random.random((3, 1)) - 1

# Train the network
myEpochs = 25000
m = np.shape(a0)[0]
# J is a vector we'll use to keep track of the error function as we learn
J = []
# Set the learning rate
lR = 1e-1
# This is a weight penalty that keeps the weights from growing too large (disabled here)
myLambda = 0  # 3e-2
fig, ax = plt.subplots(1, 1, figsize=(12, 12))
fig2, ax2 = plt.subplots(1, 2, figsize=(12, 4))
for j in range(myEpochs):
    # Forward propagation
    z1 = np.dot(a0, Theta0)
    a1 = sigmoid(z1)
    z2 = np.dot(a1, Theta1)
    a2 = sigmoid(z2)
    # The error
    E = (y - a2)
    J.append(np.mean(np.abs(E)))
    # Back propagation
    d2 = E.T
    d1 = np.dot(Theta1, d2) * sigmoidGradient(z1.T)
    Delta1 = 0*Theta1
    Delta0 = 0*Theta0
    for c in range(m):
        Delta1 = Delta1 + np.dot(np.array([a1[c, :]]).T, np.array([d2[:, c]]))
        Delta0 = Delta0 + np.dot(np.array([a0[c, :]]).T, np.array([d1[:, c]]))
    w8Loss1 = myLambda * Theta1
    w8Loss0 = myLambda * Theta0
    Theta1Grad = Delta1/m + w8Loss1
    Theta0Grad = Delta0/m + w8Loss0
    Theta1 = Theta1 + Theta1Grad * lR
    Theta0 = Theta0 + Theta0Grad * lR
    if (j % 250 == 0):
        # Save frames from the learning session
        matplotlib.rcParams.update({'figure.titlesize': 42})
        matplotlib.rcParams.update({'axes.titlesize': 24})
        draw_neural_net(fig.gca(), .1, .9, .1, .9,
                        [np.shape(a0)[1], np.shape(a1)[1], np.shape(a2)[1]], Theta0, Theta1)
        fig.suptitle('Neural Network Iteration ' + str(j))
        ax2[0].set_title('Mean Error')
        ax2[0].plot(J)

