Backpropagation networks

Earlier in these notes I used the Rescorla-Wagner model of associative learning as an example of how to implement computational models of cognition in R. In this and later sections, I’ll expand the discussion to cover a variety of other models in the field. I’ll start with the backpropagation rule for learning in connectionist networks.

0.1 Scripts and data set

The iris_recode.csv file contains the classic iris data, slightly reorganised as purely numeric data (here is the script that generated it).
The first version of the modelling code implements a simple two-layer backpropagation network for the iris data: iris_twolayer.R
The second version of the code implements the same model, but expressing the learning rules as matrix operations in order to speed up the calculations: iris_twolayer2.R

At the moment the scripts don’t do anything other than learn a classification rule. The goal for the full exercise will (eventually) be to examine what the model is learning across the series of “epochs”, and consider the relationship between this connectionist network and a probabilistic logistic regression model, but for now it’s a bit simpler than that!

In this tutorial we’ll only cover a very simple version of a backpropagation network, the two-layer “perceptron” model. There are two versions of the code posted above. The code in the iris_twolayer.R script is probably the more intuitive version, as it updates the association weights one at a time, but R code runs much faster when you express the learning rule using matrix operations, which is what the iris_twolayer2.R version does. Let’s start with a walkthrough of the more intuitive version…

0.2 Input and output patterns

First, let’s take a look at the training data. I’m going to use the classic “iris” data set that comes bundled with R, but I’ve reorganised the data in a form that is a little bit more useful for thinking about the learning problem involved, and expressed it as a numeric matrix.

library(tidyverse)  # provides read_csv() and the %>% pipe
irises <- read_csv("./data/iris_recode.csv") %>% as.matrix()

## Parsed with column specification:
## cols(
##   sepal_length = col_double(),
##   sepal_width = col_double(),
##   petal_length = col_double(),
##   petal_width = col_double(),
##   context = col_integer(),
##   species_setosa = col_integer(),
##   species_versicolor = col_integer(),
##   species_virginica = col_integer()
## )

Each row of this data set describes one flower. The first five columns are the input features: two measurements of the sepal, two of the petal, and a context feature that is 1 for every flower (it plays the role of an intercept, or bias, term). The remaining three columns are binary output features corresponding to the species of each flower, dummy coded so that the correct species has value 1 and the incorrect species have value 0. Here are the names:

input_names <- c("sepal_length", "sepal_width", "petal_length", "petal_width", "context")
output_names <- c("species_setosa", "species_versicolor", "species_virginica")

So for the first flower, the network would be given this pattern as input:

input <- irises[1, input_names]
input

## sepal_length  sepal_width petal_length  petal_width      context
##          5.1          3.5          1.4          0.2          1.0

and we need to train it to produce this target pattern as the output:

target <- irises[1, output_names]
target

##     species_setosa species_versicolor  species_virginica
##                  1                  0                  0
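
The recoding described above can be sketched directly from the iris data frame that ships with R. This is not the script that generated iris_recode.csv, just a minimal reconstruction under the assumption that the columns are named and ordered as in the column specification printed earlier; irises_sketch is a name introduced here for illustration.

```r
# Sketch only: rebuild a numeric matrix shaped like iris_recode.csv from the
# built-in iris data frame. The original generating script may differ.
irises_sketch <- cbind(
  sepal_length = iris$Sepal.Length,
  sepal_width = iris$Sepal.Width,
  petal_length = iris$Petal.Length,
  petal_width = iris$Petal.Width,
  context = 1,  # constant feature, recycled to one value per flower
  species_setosa = as.numeric(iris$Species == "setosa"),
  species_versicolor = as.numeric(iris$Species == "versicolor"),
  species_virginica = as.numeric(iris$Species == "virginica")
)

# one row per flower, eight numeric columns
dim(irises_sketch)

# the first flower: four measurements, the context feature, then the dummy codes
irises_sketch[1, ]
```

Because exactly one of the three species columns is 1 in every row, the target pattern for each flower is a one-hot vector.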

0.3 Connection weights between input and output

In its simplest form we can describe the knowledge possessed by our network as a set of associative strengths between every input feature and every output feature. In that sense we can think of it as a generalisation of how the Rescorla-Wagner model represents knowledge:

n_input <- length(input_names)
n_output <- length(output_names)
n_weights <- n_input * n_output

So what we’ll do is create a weight matrix that sets the initial associative strength to zero, with a tiny bit of random noise added to each of these associative weights:

weight <- matrix(
  data = rnorm(n_weights) * .01,
  nrow = n_input,
  ncol = n_output,
  dimnames = list(input_names, output_names)
)
weight

##              species_setosa species_versicolor species_virginica
## sepal_length    0.016901179       -0.023666536      -0.012072306
## sepal_width     0.009016444        0.006361849       0.014630776
## petal_length   -0.008415695        0.004238456       0.002181564
## petal_width    -0.012105927        0.003337767      -0.003406105
## context         0.007639070       -0.014859420      -0.003135950

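Here’s the network we want to code: five input nodes (the four measurements plus the context node), each connected directly to all three output nodes, which gives the 15 weights in the matrix above.

[Figure: network diagram]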

While we’re at it, store a copy for later:

old_weight <- weight

0.4 Making predictions

In the Rescorla-Wagner model, when the learner is shown a compound stimulus with elements A and B with individual associative strengths \(v_A\) and \(v_B\), the association strength for the compound AB is assumed to be additive: \(v_{AB} = v_{A} + v_{B}\). We could do this for our backpropagation network too, but it is much more common to assume a logistic activation function. So we’ll need to define this activation function:

logit <- function(x){
  y <- 1/(1 + exp(-x))
  return(y)
}
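
A small aside on naming: the function defined above is what is usually called the logistic (or sigmoid) function, which is the inverse of the logit. Base R already provides it as plogis(), so a quick sanity check, assuming nothing beyond base R, is to compare the two. The name logit is kept here only for consistency with the rest of the code.

```r
# Same definition as in the text
logit <- function(x) {
  1 / (1 + exp(-x))
}

# a few test values chosen for illustration
x <- c(-5, -1, 0, 1, 5)

# plogis() is base R's logistic function, so the two should agree
all.equal(logit(x), plogis(x))

# a net input of zero maps to 0.5, and these outputs all lie between 0 and 1
logit(0)
range(logit(x))
```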

So what we do is first take the weighted sum of the inputs and then pass it through our new logit function. Let’s say we want to compute the output for the first species:

output_1 <- sum(input * weight[,1]) %>% logit()
output_1

## [1] 0.5277688

More generally though we can loop over the three species:

# initialise the output nodes at zero
output <- rep(0, n_output)
names(output) <- output_names

# feed forward to every output node by taking a weighted sum of
# the inputs and passing it through a logit function
for(o in 1:n_output) {
  output[o] <- sum(input * weight[,o]) %>% logit()
}

# print the result
output

##     species_setosa species_versicolor  species_virginica
##          0.5277688          0.4733525          0.4972190

As you can see, initially the model has no knowledge at all! It’s predicting a value of about 0.5 for every category!
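
The loop above can also be written as a single matrix product, which is the idea behind the faster version of the script. The sketch below is not taken from iris_twolayer2.R; it re-enters the input and the (rounded) weights printed earlier so that it runs on its own, and it should reproduce the outputs shown above.

```r
logit <- function(x) 1 / (1 + exp(-x))

input_names <- c("sepal_length", "sepal_width", "petal_length", "petal_width", "context")
output_names <- c("species_setosa", "species_versicolor", "species_virginica")

# first flower, as printed earlier
input <- c(5.1, 3.5, 1.4, 0.2, 1.0)
names(input) <- input_names

# initial weights, copied (rounded) from the printed matrix, one column per species
weight <- matrix(
  data = c(
     0.016901179,  0.009016444, -0.008415695, -0.012105927,  0.007639070,
    -0.023666536,  0.006361849,  0.004238456,  0.003337767, -0.014859420,
    -0.012072306,  0.014630776,  0.002181564, -0.003406105, -0.003135950
  ),
  nrow = 5,
  ncol = 3,
  dimnames = list(input_names, output_names)
)

# a length-5 vector times the 5 x 3 weight matrix gives all three weighted
# sums at once; drop() turns the 1 x 3 result back into a plain vector
output <- drop(logit(input %*% weight))
round(output, 4)
```

The matrix product does exactly the same arithmetic as the loop, but it hands the work to R’s compiled linear algebra routines instead of iterating in R code.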

0.5 Learning from error

The prediction error is defined just as it is in the Rescorla-Wagner model, as the difference between the target and the output:

prediction_error <- target - output
prediction_error

##     species_setosa species_versicolor  species_virginica
##          0.4722312         -0.4733525         -0.4972190

Here is the code implementing the learning rule. What we’re doing is looping over every weight in the network, and then adjusting the strength proportional to the prediction error:

learning_rate <- .1

# for each of the weights connecting to an output node...
for(o in 1:n_output) {
  for(i in 1:n_input) {
    
    # associative learning for this weight scales in a manner that depends on
    # both the input value and output value. this is similar to the way that
    # Rescorla-Wagner has CS scaling (alpha) and US scaling (beta) parameters
    # but the specifics are slightly different (Equation 5 & 6 in the paper)
    scale_io <- input[i] * output[o] * (1-output[o])
    
    # adjust the weights proportional to the error and the scaling (Equation 8)
    weight[i,o] <- weight[i,o] + (prediction_error[o] * scale_io * learning_rate)
    
  }
}

(Let’s not worry too much about the scale_io factor for now.) So let’s look at the input, output, target, and prediction error:

input

## sepal_length  sepal_width petal_length  petal_width      context
##          5.1          3.5          1.4          0.2          1.0

output

##     species_setosa species_versicolor  species_virginica
##          0.5277688          0.4733525          0.4972190

target

##     species_setosa species_versicolor  species_virginica
##                  1                  0                  0

prediction_error

##     species_setosa species_versicolor  species_virginica
##          0.4722312         -0.4733525         -0.4972190

Now let’s look at how the weights changed:

weight - old_weight

##              species_setosa species_versicolor species_virginica
## sepal_length    0.060023771        -0.06018103      -0.063393464
## sepal_width     0.041192784        -0.04130071      -0.043505319
## petal_length    0.016477114        -0.01652028      -0.017402127
## petal_width     0.002353873        -0.00236004      -0.002486018
## context         0.011769367        -0.01180020      -0.012430091

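Not surprisingly, every weight to setosa has gone up and the weights to the other two species have gone down. But notice the scale! The size of each change is proportional to the input value, so the sepal_length weights (input 5.1) have moved about 25 times as far as the petal_width weights (input 0.2).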
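
The scale_io factor has a simple origin if we assume the network is minimising squared error, \(E = \frac{1}{2}\sum_o (t_o - y_o)^2\), where \(t_o\) is the target, \(y_o = \sigma(\sum_i x_i w_{io})\) is the output, \(x_i\) is the input, and \(\sigma\) is the logistic function (this notation is introduced here, not taken from the scripts). Since \(\sigma'(z) = \sigma(z)(1 - \sigma(z))\), gradient descent with learning rate \(\eta\) gives \(\Delta w_{io} = \eta \, (t_o - y_o) \, y_o (1 - y_o) \, x_i\), which is the update in the loop above. The same update can be written without loops using outer(). This is a sketch rather than the code in iris_twolayer2.R; it re-enters the rounded values printed earlier, and delta and weight_change are names made up for illustration.

```r
logit <- function(x) 1 / (1 + exp(-x))

# first flower and its target, as printed earlier
input <- c(sepal_length = 5.1, sepal_width = 3.5, petal_length = 1.4,
           petal_width = 0.2, context = 1.0)
target <- c(species_setosa = 1, species_versicolor = 0, species_virginica = 0)

# initial weights, copied (rounded) from the printed matrix, one column per species
weight <- matrix(
  data = c(
     0.016901179,  0.009016444, -0.008415695, -0.012105927,  0.007639070,
    -0.023666536,  0.006361849,  0.004238456,  0.003337767, -0.014859420,
    -0.012072306,  0.014630776,  0.002181564, -0.003406105, -0.003135950
  ),
  nrow = 5,
  ncol = 3,
  dimnames = list(names(input), names(target))
)

learning_rate <- .1

# forward pass and prediction error
output <- drop(logit(input %*% weight))
prediction_error <- target - output

# error signal at each output node: the error times the logistic derivative
delta <- prediction_error * output * (1 - output)

# outer(input, delta)[i, o] equals input[i] * delta[o], so this covers the
# whole double loop in one step
weight_change <- learning_rate * outer(input, delta)
round(weight_change, 4)
```

Up to rounding, weight_change should match the weight - old_weight table shown above.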

0.6 Visualising the learning

For the actual simulation I’ll set the learning rate to .01, run it for 5000 epochs, and average across 100 independent runs just to smooth out any artifacts of randomisation.¹

[Figure: how the weights change over epochs]
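
As a rough sketch of what a single run involves, the code below trains the same kind of network on the built-in iris data. Everything here is an assumption made for illustration rather than the code used for the simulation: the data are rebuilt from iris instead of being read from iris_recode.csv, items are presented in a fresh random order on every epoch, only 20 epochs are run to keep it quick, and sse() is a hypothetical helper.

```r
set.seed(1)

logit <- function(x) 1 / (1 + exp(-x))

# inputs (with a constant context column) and dummy-coded targets
inputs <- cbind(as.matrix(iris[, 1:4]), context = 1)
targets <- cbind(
  species_setosa = as.numeric(iris$Species == "setosa"),
  species_versicolor = as.numeric(iris$Species == "versicolor"),
  species_virginica = as.numeric(iris$Species == "virginica")
)

n_input <- ncol(inputs)
n_output <- ncol(targets)
n_items <- nrow(inputs)

# small random starting weights, as in the text
weight <- matrix(rnorm(n_input * n_output) * .01, n_input, n_output)

# hypothetical helper: sum squared error over the whole training set
sse <- function(w) sum((targets - logit(inputs %*% w))^2)

learning_rate <- .01
n_epochs <- 20  # far fewer than the 5000 used in the text
sse_before <- sse(weight)

for (epoch in 1:n_epochs) {
  # present every flower once per epoch, in random order
  for (item in sample(n_items)) {
    input <- inputs[item, ]
    output <- drop(logit(input %*% weight))
    delta <- (targets[item, ] - output) * output * (1 - output)
    weight <- weight + learning_rate * outer(input, delta)
  }
}

sse_after <- sse(weight)
c(before = sse_before, after = sse_after)
```

Averaging over independent runs, as described above, would amount to wrapping this in an outer loop that starts each run from new random weights and stores the weights after every epoch.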

0.7 Resources

The Rumelhart et al (1986) paper cached for teaching purposes here
A very good, but somewhat technical summary of backpropagation by Michael Nielsen
Really nice resources in Python by Olivia Guest.

¹ The script to run it is here, a csv with the connection weights is here, and a zipped csv (11Mb) with sum squared error to each item on each presentation is here.