Bobby Anguelov's Blog

A day in the life of a wannabe game developer

Basic Neural Network Tutorial : C++ Implementation and Source Code

Introduction

So I’ve now finished the first version of my second neural network tutorial covering the implementation and training of a neural network. I noticed mistakes and better ways of phrasing things in the first tutorial (thanks for the comments guys) and rewrote large sections. This will probably occur with this tutorial in the coming week so please bear with me. I’m pretty overloaded with work and assignments so I haven’t been able to dedicate as much time as I would have liked to this tutorial, even so I feel its rather complete and any gaps will be filled in by my source code.

Implementation

Okay so how do we implement our neural network? I’m not going to cover every aspect in great detail since you can just look at my source code. I’m just going to go over the very basic architecture. So what critical data storage do we need for our neural network?

  • Our neuron values
  • Our weights
  • Our weight changes
  • Our error gradients

Now I’ve seen various implementations and wait for it… here comes an OO rant: I don’t understand why people feel the need to encapsulate everything in classes. The most common implementation of a neural network I’ve come across is the one where each neuron is modeled as an object and the neural network stores all these objects. Now explain to me that advantages of such an approach? All you’re doing is wasting memory and code space. There is no sane reason to do this apart from trying to be super OO.

The other common approach is to model each layer as an object? Again what the hell are people smoking? I guess i can blame this on idiot professors that know the theory but can’t code their way out of a paper bag. Like for example my father is a math’s professor and is absolutely brilliant in regards to developing algorithms and just seeing elegant solutions to problems but he cannot write code at all. I’ve had to convert his original discrete pulse transform Matlab code into c++, and in the process of converting/optimizing it, I ended up developing a whole new approach and technique that wasnt possible to achieve directly from the maths – the so called fast discrete pulse transform.

I also tend to be a bit of a perfectionist and am firm believer in occam’s razor (well a modified version) – “the simplest solution is usually the best solution”. Okay enough ranting – the point is dont use classes when you dont need to! Useless encapsulation is a terrible terrible mistake, one that most college students make since its drilled into them during their years of study!

So below is how I structured my neural network and afaik it’s as efficient as possible. If anyone can further optimize my implementation please do!  I’d love to see your techniques (I love learning how to make code faster).

A neural network is a very simple thing and so must be implemented very simply. All the above values are just numbers so why can’t I just use multi-dimensional arrays. The short answer is that I can and I do. Storing each of those sets of number in multi-dimensional arrays is the most efficient and simplest way of doing it with the added benefit that if you use the index variables i/j/k for input/hidden/output loops respectively, then your code will almost exactly resemble the formulas in the theory tutorial and in this one.

So now that the major architecture design decision has been mentioned, here’s what the end result looks like:

Initialization of Neural Network

Okay so how do we initialize our neural network? Well again its super simple, we set all the values of the inputs, deltas and error gradients to 0. What about the weights?

The weights between the layers be have to randomly initialized. They are usually set to small values often between in the range of [-0.5, 0.5], but there are many other initialization strategies available for example, we can normally distribute the weights over the number of inputs but changing the range to: [ -2.4/n, 2.4/n] where n is the number of input neurons.

Technically you can set the weights to a fixed value since the weight updates will correct the weights eventually but this will be terribly inefficient. The whole point of setting the initialization range is to reduce the number of epochs required for training. Using weights closer to the required ones will obviously need less changes than weights that differ greatly.

If for example you have an implementation where the data for training coming in is very similar, for example an image recognition system that takes in various silhouettes and their classification and trains using that data. After a few training sessions with different data you can observe the range in which the final weights are found and initialize the weights randomly within that range, this technique will further reduce the number of epochs required to train the network.

Training data and training problem

So we’ve created our neural network but what good does that do without data to train it on? I’ve selected a letter classification problem as an example. I downloaded a letter recognition dataset from the UCI machine learning repository. ( http://archive.ics.uci.edu/ml/datasets/Letter+Recognition ).

I’ll be using this test data to explain the basics of training your neural network. The first thing is to format the dataset, in our case we had a list of 16 attributes and the letter those attribute represent. To make life simple the first problem I selected was to distinguish the letter ‘a’ from the rest. So I simply replaced the letter with a 1 for an “a” and a 0 for everything else, this effectively reduces the output to a single Boolean value.

Another problem i’ve included is a vowel regonition problem, where the output neuron is 1 for (a,e,i,o,u) and 0 for all other letters. This is a more complex problem and as such the accuracy achieved will be much lower.

So you can already see that we have 16 inputs (the attributes) and a single output, so most of our NN architecture is done, how many hidden neurons do we need? Well we don’t know, we have to play around with different numbers of hidden neurons till we get a good result.

I’ve also noticed that there seems to be an accuracy ceiling for every NN’s architecture. A good method to find out how many hidden neurons you need would be to: train the network several times (10+) and see what the average accuracy is at the end and then use this accuracy as a measure to compare different architectures.

Something that was asked in the comments section was what is the range for the input values, well the sigmoid functions is continuous over the range (-inf, inf) so any values will work but it a good idea to keep the values within the active region (region of greatest change / steepest gradient) of your activation function.

Okay moving along, so we’ve “formatted” our data the next step is to split it up so that we can train our neural network with it.

The training data sets

So in our case we have this massive data set of 20000 patterns (entries), we can’t just stick all of the data into our network since the network will learn that data and we have no way of checking how well the network will do with unseen data. This problem is referred to as over-fitting, basically the network starts remembering the input data and will fail to correctly handle unseen data. I don’t want to go into too much detail here as this is something of an advanced research topic. Once you are familiar with neural networks you can read up on over-fitting on your own.

So we don’t want the network to memorize the input data, so obviously we’ll have to separate our training set. Classically we split the data set into three parts: the training data, the generalization data and the validation data. It is also often recommended to shuffle the initial dataset before splitting it to ensure that your data sets are different each time.

  • The training data is what we used to train the network and update the weights with so it must be the largest chunk.
  • Generalization data is data that we’ll run through the NN at the end of each epoch to see how well the network manages to handle unseen data.
  • The validation data will be run though the neural network once training has completed (i.e. The stopping conditions have been met), this gives us our final validation error.

The classic split of the dataset is 60%, 20%, 20% for the training, generalization and validation data sets respectively. Let’s ignore the validation set for now; the training set is used to train the network, so if you can remember from the last tutorial for every piece of data (pattern) in the training set the following occurs:

  • Feed pattern through NN
  • Check errors and calculate error gradients
  • Update weights according to error gradients (back propagation)

Once we have processed all the patterns in the training data set, the process begins again from the start. Each run through of all the patterns in the training set is called an epoch. This brings us to the question how do we decide how many epochs to run?

Stopping Conditions

There are several measures used to decide when to stop training. I’m going to list the various measures, what they mean and how to calculate them.

  • Maximum Epochs Reached – The first measure is really easy, all it means is that the NN will stop once a set number of epochs have elapsed.
  • Training Set Accuracy – This is the number of correctly classified patterns over the total number of patterns in the training set for an epoch. Obviously the higher the accuracy the better. But remember that this is the accuracy on previously seen patterns so you can use this alone to stop your training.
  • Generalization Set Accuracy – This is the number of correctly classified patterns over the total number of patterns in the generalization set for an epoch. This gets calculated at the end of an epoch once all the weight changes have been completed. This represents the accuracy of the network in dealing with unseen patterns. Again you can’t use this measure alone since this could have a much higher accuracy that the training set error.
  • Training Set Mean Squared Error (MSE) – this is the average of the sum of the squared errors (desired – actual) for each pattern in the training set. This gives a more detailed measure of the current networks accuracy, the smaller the MSE the better the network performs.
  • Generalization Set Mean Squared Error (MSE) – this is the average of the sum of the squared errors (desired – actual) for each pattern in the generalization set.

So now we have these measures so how do we stop the network, well what I use is either the training and generalization accuracies or MSEs in addition to a maximum epoch. So I’ll stop once both my training and generalization set accuracies are above some value or the MSE’s are below some value. This way you cover all your bases.

Just remember that while your accuracies might be the same over several epochs the MSE may have changed, the MSE is a measure with a much higher resolution and often is a better choice to use for stopping conditions.

Now we come to an important point that people often overlook, every time you train the network the data sets are different (if you shuffled the initial dataset as recommended earlier) and the weights are random, so obviously your results will differ each time you train the network. Usually the only difference is the number of epochs required to reach the NN’s accuracy ceiling.

Advanced Data Partitioning Methods

There are several advanced concept I want to deal with here. Firstly when dealing with large datasets, pushing through a massive training dataset through the network may not be the best solution. Since the NN will have to process all the patterns in the training set over and over, and there could be tens of thousands of patterns, you can imagine how slow this will be.
So there are several techniques developed to partition the training set to provide better performance in regards to both time taken to train and accuracy. I’m only going to cover two basic ones that I’ve implemented into my data set reader found below.

Growing Dataset:

This approach involves training with a growing subset of the training set; this subset will increase with a fixed percentage each time training has completed (stopping conditions for the neural network have been met). This carries on until the training data set contains all the training patterns.

Windowed Data set:

This approach creates a window of fixed size and moves it across the dataset. This move occurs once training has completed for that window. You’ll specify the window size and the step size. This method stops once the window has covered the entire training data set.

Momentum

Momentum is a technique used to speed up the training of a BPN. Remember that the weight update is a move along the gradient of the activation function in the direction specified by the output error. Momentum is just that, it basically keeps you moving in the direction of the previous step. This also has the added bonus that when you change direction you don’t immediately jump in the opposite direction but your initial step is a small one. Basically if you overshoot you’ve missed your ideal point and you don’t wish to overshoot it again and so momentum helps to prevent that.

The only change to the implementation is in regards to the weight updates, remember from part one that the weights updates were as follows:

Momentum is usually set to a high value between 0 and 1.  You’ll notice that with momentum set to 0 the weight updates are identical to original ones. The effect of this momentum term is shown below for our training problem with the learning rate set to 0.01 and the momentum set to 0.8.

As you can see from the graphs the BPN will usually converge a lot faster with momentum added than without it and it also has the added benefit of allowing the back-propagation to avoid local minima’s in the search space and to traverse areas where the error space doesn’t change. This is evident from all the fluctuations seen in the graph.

The momentum formula’s shown above are also known as the generalized delta rule.

Batch / Online learning

The last thing I want to go over is the difference between batch learning and stochastic or on-line learning.
Stochastic learning occurs when the neuron weights are updated after each individual piece of data is passed through the system. The FFNN therefore changes with every piece of data and is in a constant state of change during training. This is the way we’ve been doing it up to now.

Batch Learning on the other hand stores each neuron weight change when it occurs, and only at the end of each epoch does it update the weights with the net change over the training set. This means the neural network will only update once at the end of each epoch. Implementation wise this change is minor as all you need to do is just store the weight changes (the delta w values) and just update the weights at the end of the epoch.

The effects of this I’ll leave up to you guys to discover for yourselves. I can’t spoon feed you everything.

Source code and Implementation

In the zip file below you’ll find the complete visual studio 2k8 project for the following:

  • My neural network class (has CSV logging, and has supports for momentum and batch/stochastic learning)
  • My CSV data reader class (loads CSV files, has several data partitioning approaches built in)
  • The test data files for the above problem
  • A test implementation of the above training problem

My code is highly commented and very easy to read so any questions I haven’t answered should hopefully be answered by the code. I hope this has helped you guys understand neural networks.

Neural Network Project and C++ Source code : nnImplementation.zip
My First Neural Network Tutorial : Theory of a Neural Network

UPDATE:

I’ve updated the source code to use a better archictural design, its cleaner, simpler and easier to understand: nnImplementationV2.zip

23 April 2008 Posted by Bobby | Artificial Intelligence, Neural Networks, Programming | , , , , , , , , , , | 49 Comments

Basic Neural Network Tutorial – Theory

Introduction

Well this tutorial has been a long time coming. Neural Networks (NNs) are something that i’m interested in and also a technique that gets mentioned a lot in movies and by pseudo-geeks when referring to AI in general. They are made out to be these really intense and complicated systems when in fact they are nothing more than a simple input output machine (well at least for the standard Feed Forward Neural Networks (FFNN) ).  As with any field the more you delve into it the more technical it gets and NNs are the same, the more research you do into them the more complicated architectures, training techniques, activation functions become. For now this is just a simple primer into NNs.

There are many different types of neural networks and techniques for training them but I’m just going to focus on the most basic one of them all – the classic back propagation neural network (BPN).  The back propagation refers to the fact that any mistakes made by the network during training get sent backwards through it in an attempt to correct it and so teach the network whats right and wrong.

This BPN uses the gradient descent learning method. Trying to describe this simply at this point is going to be difficult so I’ll leave it for a bit later, all you need to know is that it’s so called because it follows the steepest gradient down a surface which represents the error function as it tries to find the minimum of the error function and by doing so decrease the error.

I wanted to skip over the basics of neural networks, and all that this is your brain and this is a neuron but I guess it’s unavoidable. I’m not going to go into great detail as there is plenty of information already available online. Here are the wiki entries on FFNNs and back propagation:

http://en.wikipedia.org/wiki/Back-propagation , http://en.wikipedia.org/wiki/Feedforward_neural_network .

This is the second version of the tutorial since half way through the first one I realized that I needed to actually go over some of the theory properly before I could go over the implementation and so i’ve decided to split the tutorial into two parts: part 1(this) will go over the basic theory needed and part  2 will discuss some more advanced topics and the implementation.

Now I havent even told you what a Neural Network is and what is it used for. Silly me! NNs have a variety of uses especially in classification or function-fitting problems, they can also be use to create emerging behaviour in agents reacting to environment sensors. They are one of the most important artificial intelligence tools available today. Just for the record I am by no means an expert on neural networks, I just have a bit of experience implementing and successfully using BPN’s in various image classification problems before.

The Neuron

Okay enough blabbering from me; let’s get into the thick of it. The basic building block of a NN is the neuron. The basic neuron is consists of a black box with weighted inputs and an output.

Note: perceptron – neuron that classifies its inputs into one of two categories, basically the ouput of a neuron is clamped to 1 or 0.

perceptron.jpg

The black box section of the neuron consists of an activation function F(X), in our case its F(wSum – T) where wSum is the weighted sum of the inputs and T is a threshold or bias value. We’ll come back to the threshold value just now. The weights are initialized to some small random values and during training get updated. The weighted sum (wSum) is given below.

wsum.jpg

Simple, huh? Now for the function F, there are various functions that can be used for F; the most common ones include the step function and the sigmoid function. We will be using the sigmoid function in our BPN as its again the classical activation function. The sigmoid function and its derivative are defined as:

sigmoid.jpg

The sigmoid function has the following graph :

sigmoidfunction.png

Note: Now its very important to realize that the sigmoid function can never return a 0 or a 1 due to its asymptopic nature. So often its a good idea to treat values over 0.9 as 1 and under 0.1 as 0.

Now we need to cover an important point regarding the input data and the desired output. Lets use the binary OR operator as an example to explain the function of the weights and threshold. With OR we want a binary output telling us whether its true or not, so a single perceptron with two inputs is created. Now the search space for the neural network can be drawn as follows:

lsds.jpg

The dark blue dots represents values of true and the light blue dot represents a value of false, you can clearly see how the two classes are seperable. We can draw a line seperating them as in the above example. This seperating line is called a hyperplane. A single neuron can create a single hyperplane and the above function can be solved by a single neuron.

Another important point is that the hyperplane above is a straight line, this means we used a linear activation function (i.e. a step function) for our neuron. If we used a sigmoid function or similar the hyperplane would resemble a sigmoid shape as seen below. (not the best image so please excuse my poor paint skills). The hyperplane generated by the image depends on the activation function used.

sigmoidhp.jpg

Remember that Threshold (Bias) value we had earlier? What does that do? Simply put it shifts the hyperplane left and right while the weights orientate the hyperplane. In graphical terms the Threshold translates the hyperplane while the weights rotate it. This threshold also need to be updated during the learning process.

I’m not going to go into the details of how a neuron learns in detail or provide examples; there are many excellent books and online guides to do this. The basic procedure is as follows:

  • run an input pattern through the function
  • calculate the error (desired value – actual value)
  • update the weights according to learning rate and error
  • move onto next pattern

The learning rate term is a term that hasnt been mentioned before and is very important, it greatly affects the performance and accuracy of your network. I’ll go over this in more detail once we get to the weight updates.

The Multilayer Neural Network

As I mentioned before for linearly separable problems a single neuron is sufficient but what about problems that have more than one class or ones where data isn’t so well separated like in the example below:

nlsds.jpg

Here we need at least two hyper-planes to solve this problem so we need 2 neurons. This requires us to link up these neurons together, to link them up we’ll need shared inputs and outputs – in other words a multilayer neural network. The standard architecture of a NN consists of 3 layers: an input layer, a hidden layer and an output layer. There are several proofs available that you will almost never need more than 3 layers (I’ll try get links to the papers soon) and also more importantly we want to keep things simple.

NOTE: you almost never know what your search space looks like, thats why you’re using a neural network, often you’ll have to experiment with the neural network architecture in regards to how many hidden neurals you need to get a good result.

A basic multilayer NN:

bpn.jpg

Above is a basic multi layer neural network, the inputs are shard and so are the ouputs, note that each of these links have seperate weights. Now what are those square blocks in the neural network? They are our thresholds (bias) values, instead of having to store and update separate thresholds for each neuron (remember each neuron’s activation function took a weighted sum minus a threshold as input) , we simply create 2 extra neurons with a constant value of -1. These neurons are then hooked up to the rest of the network and have their own weights (these are technically the threshold values).

This results in the weighted sum + the weight of the threshold multiplied by -1, obviously you can see its the same as we had earlier. Now when we update the weights for the network during backpropagation we automatically update the thresholds as well, saving us a few calculations and headaches.

Okay so far everything has (hopefully) been pretty simple especially if you have a bit of a background in NNs or have read through an  introductory chapter in an AI textbook. There are only 3 things left to discuss – calculating the errors at the output, updating the weights (the back propagation) and the stopping conditions.

The only control over this architecture you have is over the number of hidden neurons since your inputs and desired outputs are already known, so deciding on how many hidden neurons you need is often a tricky matter, too many is never good, and neither is too little, some careful experimentation will often be required to find out an optimal amount of hidden neurons.

I’m not going to go over feeding the input forward as its really simple: all you do is calculate the output ( the value of the activation function for the weighted sum of inputs ) at a neuron and use it as the input for the next layer.

The Error gradients

Okay so obviously we need to update the weights in our neural network to give the correct output at the output layer. This forms the basis of training the neural network. We will make use of back-propagation for these weight updates. This just means input is fed in, the errors calculated and filtered back though the network making changes to the weights to try reduce the error.

The weight changes are calculated by using the gradient descent method. This means we follow the steepest path on the error function to try and minimize it. I’m not going to go into the math behind gradient descent, the error function and so on since its not really needed, simply put all we’re doing is just taking the error at the output neurons (Desired value – actual value) and multiplying it by the gradient of the sigmoid function.  If the difference is positive we need to move up the gradient of the activation function and if its negative we need to move down the gradient of the activation function.

errorgradientsexplanation.png

This is the formula to calculate the basic error gradient for each output neuron k:

egoutput.jpg

There is a difference between the error gradients at the output and hidden layers. The hidden layer’s error gradient is based on the output layer’s error gradient (back propagation) so for the hidden layer the error gradient for each hidden neuron is the gradient of the activation function multiplied by the weighted sum of the errors at the output layer originating from that neuron (wow, getting a bit crazy here eh?):

eghidden.jpg

The Weight Update

The final step in the algorithm is to update the weights, this occurs as follows:

The alpha value you see above is the learning rate, this is usually a value between 0 and 1. It affects how large the weight adjustmets are and so also affects the learning speed of the network. This value need to be careful selected to provide the best results, too low and it will take ages to learn, too high and the adjustments might be too large and the accuracy will suffer as the network will constantly jump over a better solution and generally get stuck at some sub-optimal accuracy.

The Learning algorithm

The BPN learns during a training epoch, you will probably go through several epochs before the network has sufficiently learnt to handle all the data you’ve provided it and the end result is satisfactory. A training epoch is described below:

For each input entry in the training data set:

  • feed input data in (feed forward)
  • check output against desired value and feed back error (back-propagate)

Where back-propagation consists of :

  • calculate error gradients
  • update weights

Stopping Conditions

These are some commonly used stopping conditions used for neural networks: desired accuracy , desired mean square error and elapsed epochs. I wont go over these in too much detail now as I will be covering them in the next tutorial with some training examples. The main reason i’m not going into detail here is that i havent described the training of the network in detail, i need to go over the creating of training data sets, what generalization and validation errors are and so on. All this will be covered in greater detail in the next tutorial.

Conclusion

So this is it for my initial tutorial on the basics of neural networks; stay tuned for my next tutorial where I’m going to go over a few more things regarding neural networks including stopping conditions, training techniques, discussion of stochastic and batch learning, some examples of me training the network and I’ll also include my implementation of a classic back-propagation neural network class in c++ that features momentum and batch learning.

Continuation

Tutorial Continues in Part 2 : implementation and c++ source code – NN Tutorial Part 2

3 April 2008 Posted by Bobby | Artificial Intelligence, General, Neural Networks, Programming | , , , , , , , , | 40 Comments

Radon Transform C++ Implementation Update

Thanks to haDoan for noticing that i had left out the definition for the pixel struct in the source code. When i put up the code, i ripped it out of my surveillance project and didnt noticed the definition was located in a different file.

I’ve updated the source code to include the struct definition : http://www.cs.up.ac.za/cs/banguelov/blog/radonTransformer.h

2 April 2008 Posted by Bobby | Computer Vision, Image Processing, Programming | | 6 Comments