Bobby Anguelov's Blog

A day in the life of a wannabe game developer

Basic Neural Network Tutorial – Theory

Introduction

Well, this tutorial has been a long time coming. Neural Networks (NNs) are something that I’m interested in and also a technique that gets mentioned a lot in movies and by pseudo-geeks when referring to AI in general. They are made out to be these really intense and complicated systems when in fact they are nothing more than a simple input-output machine (well, at least the standard Feed Forward Neural Networks (FFNNs) are). As with any field, the more you delve into it the more technical it gets, and NNs are no different: the more research you do, the more complicated the architectures, training techniques and activation functions become. For now this is just a simple primer on NNs.

There are many different types of neural networks and techniques for training them, but I’m just going to focus on the most basic one of them all: the classic back propagation neural network (BPN). Back propagation refers to the fact that any mistakes made by the network during training get sent backwards through it in an attempt to correct them, and so teach the network what’s right and wrong.

This BPN uses the gradient descent learning method. Describing it simply at this point is going to be difficult, so I’ll leave it for a bit later. All you need to know for now is that it’s so called because it follows the steepest gradient down a surface that represents the error function, trying to find the minimum of that function and so decrease the error.

I wanted to skip over the basics of neural networks, and all that “this is your brain and this is a neuron” stuff, but I guess it’s unavoidable. I’m not going to go into great detail as there is plenty of information already available online. Here are the wiki entries on FFNNs and back propagation:

http://en.wikipedia.org/wiki/Back-propagation , http://en.wikipedia.org/wiki/Feedforward_neural_network .

This is the second version of the tutorial. Halfway through the first one I realized that I needed to go over some of the theory properly before I could go over the implementation, so I’ve decided to split the tutorial into two parts: part 1 (this one) will go over the basic theory needed, and part 2 will discuss some more advanced topics and the implementation.

Now I haven’t even told you what a Neural Network is and what it is used for. Silly me! NNs have a variety of uses, especially in classification or function-fitting problems; they can also be used to create emergent behaviour in agents reacting to environment sensors. They are one of the most important artificial intelligence tools available today. Just for the record, I am by no means an expert on neural networks; I just have a bit of experience implementing and successfully using BPNs in various image classification problems.

The Neuron

Okay, enough blabbering from me; let’s get into the thick of it. The basic building block of a NN is the neuron. The basic neuron consists of a black box with weighted inputs and an output.

Note: perceptron – a neuron that classifies its inputs into one of two categories; basically the output of the neuron is clamped to 1 or 0.

perceptron.jpg

The black box section of the neuron consists of an activation function F(X); in our case it’s F(wSum - T), where wSum is the weighted sum of the inputs and T is a threshold or bias value. We’ll come back to the threshold value in a moment. The weights are initialized to some small random values and get updated during training. The weighted sum (wSum) is given below.

wsum.jpg
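In code, the weighted sum is just a loop over the inputs. Here is a minimal sketch, assuming the inputs and weights live in two equal-length vectors; the function name weightedSum and the use of std::vector are made up for illustration and are not taken from the Part 2 source.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Weighted sum of a neuron's inputs: wSum = x[0]*w[0] + x[1]*w[1] + ...
// Illustrative helper; the name and container choice are assumptions.
double weightedSum(const std::vector<double>& inputs,
                   const std::vector<double>& weights) {
    assert(inputs.size() == weights.size());  // one weight per input
    double sum = 0.0;
    for (std::size_t i = 0; i < inputs.size(); ++i) {
        sum += inputs[i] * weights[i];
    }
    return sum;
}
```

The neuron's output is then the activation function applied to weightedSum(inputs, weights) minus the threshold T.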

Simple, huh? Now for the function F: there are various functions that can be used for F, the most common ones being the step function and the sigmoid function. We will be using the sigmoid function in our BPN as it’s again the classical activation function. The sigmoid function and its derivative are defined as:

sigmoid.jpg

The sigmoid function has the following graph:

sigmoidfunction.png

Note: It’s very important to realize that the sigmoid function can never return exactly 0 or 1 due to its asymptotic nature. So it’s often a good idea to treat values over 0.9 as 1 and values under 0.1 as 0.
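As a rough sketch of the above, here is the standard sigmoid, its derivative written in terms of the sigmoid's own output, and the 0.9 / 0.1 rounding from the note. The function names, and the choice to return -1 for an undecided output, are assumptions made for this example.

```cpp
#include <cmath>

// Sigmoid activation: squashes any real input into the open interval (0, 1).
double sigmoid(double x) {
    return 1.0 / (1.0 + std::exp(-x));
}

// Derivative of the sigmoid, expressed through its output y = sigmoid(x):
// F'(x) = y * (1 - y).
double sigmoidDerivativeFromOutput(double y) {
    return y * (1.0 - y);
}

// Rounds a sigmoid output to a class using the 0.9 / 0.1 cut-offs.
// Returns -1 (a made-up convention) when the output is in between.
int clampOutput(double y) {
    if (y > 0.9) return 1;
    if (y < 0.1) return 0;
    return -1;
}
```

Writing the derivative in terms of the output is convenient later on, because the output of every neuron is already known after the feed forward step.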

Now we need to cover an important point regarding the input data and the desired output. Let’s use the binary OR operator as an example to explain the function of the weights and threshold. With OR we want a binary output telling us whether it’s true or not, so a single perceptron with two inputs is created. The search space for the neural network can be drawn as follows:

lsds.jpg

The dark blue dots represent values of true and the light blue dot represents a value of false; you can clearly see that the two classes are separable. We can draw a line separating them as in the above example. This separating line is called a hyperplane. A single neuron can create a single hyperplane, so the above function can be solved by a single neuron.

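Another important point is that the separating hyperplane above is a straight line. For a single neuron this is always the case, whichever activation function is used: the boundary sits where the weighted sum equals the threshold, and that is a linear equation in the inputs. Swapping a step function for a sigmoid does not bend the line; it only makes the output change smoothly from 0 to 1 as you cross it instead of jumping. A curved boundary like the one sketched below (not the best image, so please excuse my poor paint skills) cannot come from one neuron; it needs several neurons working together.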

sigmoidhp.jpg

Remember that Threshold (Bias) value we had earlier? What does that do? Simply put, it shifts the hyperplane left and right while the weights orient it. In graphical terms the threshold translates the hyperplane while the weights rotate it. This threshold also needs to be updated during the learning process.
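To make the role of the threshold concrete, here is a small sketch of a two-input neuron with a step activation. The weights (1, 1) and both threshold values are hand-picked for illustration rather than learned, and AND is brought in only to show what shifting the line does.

```cpp
// Single neuron with a step activation: outputs 1 when wSum - T is positive.
// All weights and thresholds below are hand-picked, not trained.
int stepNeuron(double x1, double x2, double w1, double w2, double threshold) {
    const double wSum = x1 * w1 + x2 * w2;
    return (wSum - threshold) > 0.0 ? 1 : 0;
}

// Threshold 0.5: the separating line x1 + x2 = 0.5 lies between (0,0)
// and the other three points, which gives OR.
int orGate(double x1, double x2) {
    return stepNeuron(x1, x2, 1.0, 1.0, 0.5);
}

// Same weights, threshold 1.5: the line is shifted to x1 + x2 = 1.5,
// so only (1,1) ends up on the "true" side, which gives AND.
int andGate(double x1, double x2) {
    return stepNeuron(x1, x2, 1.0, 1.0, 1.5);
}
```

The weights are identical in both cases, so the line keeps the same orientation; only the threshold moves it.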

I’m not going to go into the details of how a neuron learns or provide examples; there are many excellent books and online guides for that. The basic procedure is as follows:

  • run an input pattern through the function
  • calculate the error (desired value – actual value)
  • update the weights according to learning rate and error
  • move onto next pattern

The learning rate is a term that hasn’t been mentioned before and is very important: it greatly affects the performance and accuracy of your network. I’ll go over this in more detail once we get to the weight updates.
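The procedure above can be sketched for a single neuron learning OR. This is only an illustration under a few assumptions: it uses a step activation (a perceptron), starts from zero weights instead of small random ones so the run is repeatable, and all the names are invented for the example.

```cpp
#include <array>

struct Pattern {
    double x1, x2;  // inputs
    int desired;    // desired output (0 or 1)
};

struct Neuron {
    double w1 = 0.0, w2 = 0.0;  // input weights (zero start is an assumption)
    double threshold = 0.0;     // T in F(wSum - T)

    // Step activation applied to the weighted sum minus the threshold.
    int fire(double x1, double x2) const {
        return (x1 * w1 + x2 * w2 - threshold) > 0.0 ? 1 : 0;
    }
};

// Repeats the four steps for every pattern until a whole pass produces no
// errors or maxEpochs is reached. Returns the number of passes that were run.
int train(Neuron& n, const std::array<Pattern, 4>& data,
          double learningRate, int maxEpochs) {
    for (int epoch = 1; epoch <= maxEpochs; ++epoch) {
        int errors = 0;
        for (const Pattern& p : data) {
            // run the pattern through and calculate the error
            const int error = p.desired - n.fire(p.x1, p.x2);
            if (error != 0) ++errors;
            // update the weights according to learning rate and error
            n.w1 += learningRate * error * p.x1;
            n.w2 += learningRate * error * p.x2;
            // the threshold behaves like a weight on a constant -1 input
            n.threshold += learningRate * error * -1.0;
        }
        if (errors == 0) return epoch;
    }
    return maxEpochs;
}
```

Calling train with the four OR patterns, a learning rate of 0.1 and a cap of 100 passes leaves the neuron answering all four patterns correctly. OR is linearly separable, which is why a single neuron is enough here.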

The Multilayer Neural Network

As I mentioned before, a single neuron is sufficient for linearly separable problems, but what about problems that have more than one class, or ones where the data isn’t so well separated, like in the example below:

nlsds.jpg

Here we need at least two hyperplanes to solve this problem, so we need 2 neurons. This requires us to link these neurons together, and to link them up we’ll need shared inputs and outputs; in other words, a multilayer neural network. The standard architecture of a NN consists of 3 layers: an input layer, a hidden layer and an output layer. There are theoretical results (the universal approximation theorem) showing that a network with a single hidden layer can, given enough hidden neurons, approximate a very wide range of functions, so you will almost never need more than 3 layers (I’ll try get links to the papers soon), and more importantly we want to keep things simple.

NOTE: you almost never know what your search space looks like; that’s why you’re using a neural network. Often you’ll have to experiment with the neural network architecture, in particular with how many hidden neurons you need to get a good result.

A basic multilayer NN:

bpn.jpg

Above is a basic multilayer neural network. The inputs are shared and so are the outputs; note that each of these links has its own separate weight. Now what are those square blocks in the neural network? They are our threshold (bias) values. Instead of having to store and update a separate threshold for each neuron (remember each neuron’s activation function took a weighted sum minus a threshold as input), we simply create 2 extra neurons with a constant value of -1. These neurons are then hooked up to the rest of the network and have their own weights (these are technically the threshold values).

This results in the weighted sum plus the weight of the threshold multiplied by -1, which you can see is the same as what we had earlier. Now when we update the weights for the network during backpropagation we automatically update the thresholds as well, saving us a few calculations and headaches.

Okay, so far everything has (hopefully) been pretty simple, especially if you have a bit of a background in NNs or have read through an introductory chapter in an AI textbook. There are only 3 things left to discuss: calculating the errors at the output, updating the weights (the back propagation) and the stopping conditions.

The only control you have over this architecture is the number of hidden neurons, since your inputs and desired outputs are already known. Deciding how many hidden neurons you need is often a tricky matter: too many is never good, and neither is too few. Some careful experimentation will often be required to find the optimal number of hidden neurons.

I’m not going to go over feeding the input forward as it’s really simple: all you do is calculate the output (the value of the activation function for the weighted sum of inputs) at a neuron and use it as the input for the next layer.
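For completeness, here is a minimal sketch of feeding one layer forward, with the bias neuron handled as an extra row of weights attached to a constant -1. The weights[i][j] layout and the function name are assumptions for this example and may not match the Part 2 source.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

double sigmoid(double x) {
    return 1.0 / (1.0 + std::exp(-x));
}

// Computes the outputs of one layer from the previous layer's outputs.
// weights[i][j] is the weight on the link from previous-layer neuron i to
// neuron j of this layer. The last row belongs to the bias neuron, whose
// output is the constant -1, so that row holds the thresholds.
std::vector<double> feedForwardLayer(
        const std::vector<double>& previous,
        const std::vector<std::vector<double>>& weights) {
    assert(weights.size() == previous.size() + 1);  // +1 row for the bias
    const std::size_t biasRow = previous.size();
    std::vector<double> outputs(weights[0].size());
    for (std::size_t j = 0; j < outputs.size(); ++j) {
        double sum = -1.0 * weights[biasRow][j];  // bias neuron contribution
        for (std::size_t i = 0; i < previous.size(); ++i) {
            sum += previous[i] * weights[i][j];
        }
        outputs[j] = sigmoid(sum);
    }
    return outputs;
}
```

Calling it once with the inputs and the input-to-hidden weights gives the hidden outputs; calling it again with those and the hidden-to-output weights gives the network's output.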

The Error gradients

Okay, so obviously we need to update the weights in our neural network to give the correct output at the output layer. This forms the basis of training the neural network. We will make use of back-propagation for these weight updates. This just means input is fed in, and the errors are calculated and filtered back through the network, making changes to the weights to try to reduce the error.

The weight changes are calculated by using the gradient descent method. This means we follow the steepest path on the error function to try and minimize it. I’m not going to go into the math behind gradient descent, the error function and so on, since it’s not really needed. Simply put, all we’re doing is taking the error at the output neurons (desired value - actual value) and multiplying it by the gradient of the sigmoid function. If the difference is positive we need to move up the gradient of the activation function and if it’s negative we need to move down the gradient of the activation function.

errorgradientsexplanation.png

This is the formula to calculate the basic error gradient for each output neuron k:

egoutput.jpg

There is a difference between the error gradients at the output and hidden layers. The hidden layer’s error gradient is based on the output layer’s error gradient (back propagation) so for the hidden layer the error gradient for each hidden neuron is the gradient of the activation function multiplied by the weighted sum of the errors at the output layer originating from that neuron (wow, getting a bit crazy here eh?):

eghidden.jpg
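The two error gradients described above can be sketched as follows, assuming sigmoid neurons so that the gradient of the activation function is output * (1 - output). The function and parameter names are invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Error gradient for output neuron k:
// gradient of the sigmoid at the neuron's output, times (desired - actual).
double outputErrorGradient(double desired, double actual) {
    return actual * (1.0 - actual) * (desired - actual);
}

// Error gradient for hidden neuron j:
// gradient of the sigmoid at the hidden neuron's output, times the weighted
// sum of the output-layer error gradients reachable from that neuron.
// weightsToOutputs[k] is the weight on the link from hidden neuron j to
// output neuron k.
double hiddenErrorGradient(double hiddenOutput,
                           const std::vector<double>& weightsToOutputs,
                           const std::vector<double>& outputGradients) {
    assert(weightsToOutputs.size() == outputGradients.size());
    double weightedErrorSum = 0.0;
    for (std::size_t k = 0; k < outputGradients.size(); ++k) {
        weightedErrorSum += weightsToOutputs[k] * outputGradients[k];
    }
    return hiddenOutput * (1.0 - hiddenOutput) * weightedErrorSum;
}
```

For example, an output neuron that produced 0.5 when 1 was desired gets a gradient of 0.5 * 0.5 * 0.5 = 0.125.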

The Weight Update

The final step in the algorithm is to update the weights. In the standard form of the rule, the change for the weight on the link from neuron i to neuron j is:

Δw_ij = α · output_i · δ_j,    w_ij = w_ij + Δw_ij

where output_i is the value neuron i sends along the link and δ_j is the error gradient of neuron j.

The alpha value you see above is the learning rate; this is usually a value between 0 and 1. It affects how large the weight adjustments are and so also affects the learning speed of the network. This value needs to be carefully selected to provide the best results: too low and the network will take ages to learn, too high and the adjustments might be too large, so the accuracy will suffer as the network constantly jumps over a better solution and generally gets stuck at some sub-optimal accuracy.
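As a minimal sketch of a single weight update, assuming the standard back-propagation rule in which the change is the learning rate times the output of the neuron feeding the link times the error gradient of the neuron receiving it (the function name is made up):

```cpp
// Returns the new value of one weight after a gradient descent step.
// sourceOutput is the value sent along the link; for a bias neuron it is
// the constant -1. targetErrorGradient is the error gradient of the neuron
// the link feeds into.
double updatedWeight(double weight, double learningRate,
                     double sourceOutput, double targetErrorGradient) {
    const double deltaWeight = learningRate * sourceOutput * targetErrorGradient;
    return weight + deltaWeight;
}
```

With a weight of 0.5, a learning rate of 0.5, a source output of 1 and an error gradient of 0.125, the weight moves to 0.5625. Halving the learning rate halves the step.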

The Learning algorithm

The BPN learns during a training epoch. You will probably go through several epochs before the network has learnt enough to handle all the data you’ve provided and the end result is satisfactory. A training epoch is described below:

For each input entry in the training data set:

  • feed input data in (feed forward)
  • check output against desired value and feed back error (back-propagate)

Where back-propagation consists of:

  • calculate error gradients
  • update weights
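Putting the pieces together, one training epoch might look like the sketch below. It is only an illustration: the 2-2-1 layout, the hard-coded starting weights (standing in for small random values) and every name in it are assumptions, and it is not the implementation from Part 2.

```cpp
#include <cmath>
#include <cstddef>

double sigmoid(double x) { return 1.0 / (1.0 + std::exp(-x)); }

// Hypothetical 2-2-1 network. Index 2 in each weight array belongs to the
// bias neuron, which always outputs -1.
struct TinyNet {
    double wInputHidden[3][2] = {{0.5, -0.4}, {0.3, 0.8}, {0.1, -0.2}};
    double wHiddenOutput[3] = {0.6, -0.7, 0.2};
    double hidden[2] = {0.0, 0.0};
    double output = 0.0;

    // Feed forward: inputs -> hidden layer -> output neuron.
    double feedForward(const double in[2]) {
        for (int j = 0; j < 2; ++j) {
            const double sum = in[0] * wInputHidden[0][j]
                             + in[1] * wInputHidden[1][j]
                             - wInputHidden[2][j];
            hidden[j] = sigmoid(sum);
        }
        const double sum = hidden[0] * wHiddenOutput[0]
                         + hidden[1] * wHiddenOutput[1]
                         - wHiddenOutput[2];
        output = sigmoid(sum);
        return output;
    }

    // Back-propagate: calculate the error gradients, then update the weights.
    // Must be called right after feedForward on the same input.
    void backPropagate(const double in[2], double desired, double learningRate) {
        const double outGrad = output * (1.0 - output) * (desired - output);
        double hiddenGrad[2];
        for (int j = 0; j < 2; ++j) {
            // uses the hidden-to-output weight from before it is updated
            hiddenGrad[j] = hidden[j] * (1.0 - hidden[j])
                          * wHiddenOutput[j] * outGrad;
        }
        const double hiddenWithBias[3] = {hidden[0], hidden[1], -1.0};
        for (int j = 0; j < 3; ++j) {
            wHiddenOutput[j] += learningRate * hiddenWithBias[j] * outGrad;
        }
        const double inWithBias[3] = {in[0], in[1], -1.0};
        for (int i = 0; i < 3; ++i) {
            for (int j = 0; j < 2; ++j) {
                wInputHidden[i][j] += learningRate * inWithBias[i] * hiddenGrad[j];
            }
        }
    }
};

struct Pattern {
    double in[2];
    double desired;
};

// One training epoch: feed every pattern forward, then back-propagate.
// Returns the mean squared error measured during the pass.
double trainEpoch(TinyNet& net, const Pattern* data, std::size_t count,
                  double learningRate) {
    double squaredErrorSum = 0.0;
    for (std::size_t p = 0; p < count; ++p) {
        const double actual = net.feedForward(data[p].in);
        const double error = data[p].desired - actual;
        squaredErrorSum += error * error;
        net.backPropagate(data[p].in, data[p].desired, learningRate);
    }
    return squaredErrorSum / static_cast<double>(count);
}
```

You would call trainEpoch repeatedly on your training set and watch the returned error. Here the weights are updated after every pattern; updating once per epoch instead is the batch approach.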

Stopping Conditions

These are some commonly used stopping conditions for neural networks: desired accuracy, desired mean square error and elapsed epochs. I won’t go over these in too much detail now as I will be covering them in the next tutorial with some training examples. The main reason I’m not going into detail here is that I haven’t described the training of the network in detail: I still need to go over the creation of training data sets, what generalization and validation errors are, and so on. All this will be covered in greater detail in the next tutorial.

Conclusion

So this is it for my initial tutorial on the basics of neural networks. Stay tuned for my next tutorial, where I’m going to go over a few more things regarding neural networks, including stopping conditions, training techniques, a discussion of stochastic and batch learning, and some examples of me training the network. I’ll also include my implementation of a classic back-propagation neural network class in C++ that features momentum and batch learning.

Continuation

Tutorial continues in Part 2: implementation and C++ source code – NN Tutorial Part 2

3 April 2008 - Posted by Bobby | Artificial Intelligence, General, Neural Networks, Programming | 40 Comments

40 Comments »

  1. kewl tutorial. learning algorithms such as NN have always interested me. Your explanations for the most part, are pretty simple to follow, but at crucial times you go off the rails leaving a disconnect. i guess you gap-out the things you already know and take for granted, but would be cool if you remembered them more explicitly. ;)

    Comment by Capricorn | 7 April 2008

  2. thanks for the response, can you point out where i blank out or even anything you think I’m missing or want to know? Maybe i am leaving things out…

    Part 2 will have more info and source code so that should help clear things up, i hope :P

    Comment by Bobby | 7 April 2008

  3. rite. now when is part 2 due?

    Comment by Capricorn | 10 April 2008

  4. i’ve edited part 1 a bit, added more info and rewrote some sections. i should be able to put up part 2 tomorrow, hopefully…

    depends on how rough work is…

    Comment by Bobby | 10 April 2008

  5. great work, how did you do the graphics?

    Comment by - | 16 April 2008

  6. i used microsoft visio to do the pictures. The formulas were done in microsoft word 2k7 (it has an excellent equation editor)

    Comment by Bobby | 16 April 2008

  7. thanks! and again: good job.

    Comment by - | 16 April 2008

  8. I love this tutorial.
    Newbie in Neural Networks and understood everything.
    thanks !!

    Comment by contremaitre | 18 April 2008

  9. thanks! It makes me happy to know I’ve managed to help someone!!

    I’m busy with the second part at the moment, and i realize I’m actually covering a lot of things you won’t find in textbooks. Well, the textbooks seem to take it for granted that you’ll magically work these things out on your own, not to mention that their explanations often leave a lot to be desired.

    Hopefully these tutorials will save you guys some time… :P

    Comment by Bobby | 18 April 2008

  10. Can you please explain the sigmoid function to me? What does “e” mean, and -x?

    Could someone please post a step-by-step calculation of the activation function with an example of a neuron?

    Comment by Alex | 23 April 2008

  11. e is simply a standard constant (euler’s number) with the value 2.71828…

    http://en.wikipedia.org/wiki/E_(mathematical_constant)

    e^x is the exponential function (represented by the exp function in c/c++), so in our case the sigmoid is the reciprocal of 1 + the exponential function of -x.

    x is the input parameter you put into the sigmoid function to get a result (the y value if you want to graph it). So for an input of let’s say 3 the output of the sigmoid function will be 1/(1+e^-3), or in c++ code

    #include <cmath>   // for exp

    double input = 3;
    double output = 1 / ( 1 + exp(-input) );   // roughly 0.9526

    I hope this helps. It’s a weird question to ask though, especially if you’re interested in neural networks. I’d expect you to have at least a basic calculus education.

    Comment by Bobby | 23 April 2008

  12. Hi, I found this website interesting. I am a beginner in neural networks and need some basic info on it.
    I’ll be pleased if you provide me with the answers.
    1. What is the role of a neuron in a NN?
    2. How to set the target value for the network?
    3. What is the purpose of an activation function?
    4. How to calculate the subsequent weighted components in a network?
    5. Given n input pairs, how many neurons are needed and how many layers are needed?
    6. In what terms do we get the actual output?

    Comment by Nisha | 23 April 2008

  13. hey nisha,

    I’m not sure if you read the tutorial properly since questions 1,3,4,5 and 6 are answered in the above tutorial.

    and question 2 is answered in the second tutorial. Like i said in the tutorial I’m not going over the basics of what a neural network is, what a neuron is, why it works etc. this tutorial is a simple explanation of the more complicated topics for a basic BPN.

    Comment by Bobby | 23 April 2008

  14. Thanks Bobby, i thought “e” was some part of the neuron and that’s why i was confused… yes it helps me
    I’m programming in Visual Basic and i think this is the solution:
    1 / (1 + (Exp(-3))) = 0.952574126822433
    Can you check it for me pls? Is the result the same in C++?
    I just want to check if it’s okay because in vb it is a little different than in c++

    Comment by Alex | 23 April 2008

  15. that looks perfect! best of luck with your neural network, if you want download my source code in the second tutorial and look through the comments, they will probably be super useful, as you can see how it was implemented.

    Comment by Bobby | 24 April 2008

  16. If i make a nn with perceptrons they will adjust to the pattern that i’m giving, so i guess i must stop the weights from changing after they are trained? I mean… this can be the only solution for a nn… correct me please if i’m wrong

    Comment by Alex | 25 April 2008

  17. yes, remember tho the neurons in the neural network aren’t perceptrons since their output isn’t just 0 or 1. they are just neurons.

    once you’ve trained the network all you do is feed data forward and check the result. Don’t train again unless you have new data that the neural network can’t handle, and when you train again you have to train on both the old and the new data.

    Comment by Bobby | 25 April 2008

  18. Man, c++ is dead. Learn java, or c# if you love microsoft so much.

    Comment by boris | 7 May 2008

  19. hahaha, i know both and another 10 or so languages… C++, C# and php are my favorite tho. Java IMHO for idiots and c# although great cant come close to the performance or control of c++…

    that comment is so stupid, that I’m not gonna bother deleting or arguing :P

    Comment by Bobby | 7 May 2008

  20. poh

    Comment by jgh | 8 May 2008

  21. what functions must i use in the first (input, weight) layer while the network is training? I can use the function with the desired result here, but between the hidden layer and the output what function will i use? The same one?

    Comment by Sonyx | 13 June 2008

  22. i don’t really understand what you’re asking, there are no functions between the layers just the weights for the links.

    Comment by Bobby | 14 June 2008

  23. Thanks Bobby… I made it, i understand it now and made my first multilayer nn and it works great, thx again, keep in touch, write me on my email so i can have your address, bye bye

    Comment by Sonyx | 29 June 2008

  24. is there a way for the network to get trained with fewer iterations? An example: for learning xor i need something like 7 000 iterations. isn’t that a little too much? i’m just asking, is there a way for the network to learn faster?

    Comment by Sonyx | 22 September 2008

  25. the performance of the learning is dependent on the learning parameters: the learning rate and momentum. The architecture of the NN is extremely important here too, so for XOR you’d need an architecture of around 2 hidden neurons and then you must play around with the learning rate.

    Comment by Bobby | 22 September 2008

  26. - with the learning rate 1 it’s ready after 5440 iterations
    - with a learning rate higher than 1 it’s less accurate
    - and with no learning rate it’s ready after 5520 iterations
    So, why bother using a learning rate? It’s pretty much the same
    Correct me please if i’m wrong on this one

    Comment by Sonyx | 22 September 2008

  27. you shouldn’t be using a learning rate greater than 1, try something like 0.001 or similar. If the learning rate is too large, the weight changes are too great and it jumps past the correct values.

    Comment by Bobby | 22 September 2008

  28. i get a reasonable error when i use 1 as the learning rate or when i’m not using it at all. with the LR less than 1 it takes much longer for the nn to train, and with values above 1 it doesn’t train properly
    i didn’t even mention or think of 0.001 because i want it to train faster, and this really is slowing it down
    it’s ok with 5500 iterations for solving xor, i just asked if there is a way to do it faster…
    and i’m doing this with some random weights, not some weights that help the network do it faster
    i’m curious if the network is able to solve it quicker, or if this is like an average training, with 5000 iterations for getting a good output with a low error

    Comment by Sonyx | 22 September 2008

  29. okay listen to me, your architecture should be as follows: 2 inputs, 2 hidden, 1 output

    i just tested with my nn, with a learning rate of 0.4 and momentum at 0.9 i get 100% accuracy in around 20-30 iterations on a data set of 100 patterns, increasing the number of hidden neurons may help decrease the iteration time, with a learning rate of 1 and 4 hidden neurons, i can get it down to between 10 and 20 iterations.

    it’s all trial and error to get the optimal parameters, people spend years finding out good techniques to automatically find the optimal parameters, and also this form of training the neural network is the most basic.

    Comment by Bobby | 22 September 2008

  30. Hi man, I’m working at a company that asked me if it’s possible to detect some product on an industrial assembly line (beer production for example) with a camera filming the line.

    Do you know how I can use a NN to detect those image patterns?
    Ps: I can process the image with DSP to turn the beer into something grayscale and reduced to a very simple form

    Comment by Leonardo | 22 September 2008

  31. Hi,

    It’s a nice explanation of the BP algorithm. I have implemented this BP algorithm in C for character recognition, and I am facing an issue in the training part. I am looking to train the neural network for all 36 characters (26 letters + 10 digits) but it takes a lot of time.

    Can anyone suggest an efficient method for training the BP neural network?

    Comment by Rahul T | 14 October 2008

  32. BP is the training method, you can try speeding it up with momentum as i explained in the second part. other than that you can look at perhaps using a genetic algorithm to train your neural network.

    Comment by Bobby | 14 October 2008

  33. Nice page – although I’m not sure about the accuracy of the statement that the output hyperplane takes on the shape of the output function of your network. Do you have any references for this – I’d be interested to know if it is true.

    A sigmoid neuron with 2 inputs will produce a straight line division in the input space. The main advantage of the sigmoid over a digital MCP unit is that sigmoid functions have a gradient over which BP learning can be applied.

    Comment by Pete | 11 November 2008

  34. hey pete, you may be correct about the hyperplane, that was something i remembered from a lecture a long time ago, so i may be incorrect, I’ll try see if i can dig up some info regarding it. :/

    Comment by Bobby | 14 November 2008

  35. Hi,
    First, what I’m trying to achieve: a neural network that recognizes letters and outputs the letter’s Braille representation.

    So I assume the input pattern has (5×7) inputs and the output pattern has 6 outputs.
    The activation function is the sigmoid function. The difference is in computing RO; my implementation is based on http://galaxy.agh.edu.pl/~vlsi/AI/backp_t_en/backprop.html

    Of course I encountered problems and didn’t know why the net was not learning properly, so I checked xor, which works, and one-of-N outputs (one output responsible for recognizing one letter, 0 otherwise), where there are some problems.

    Above all I want to know whether, in your opinion, using the equations given in this tutorial is crucial to get positive results.

    Comment by jones | 6 December 2008

  36. This tutorial is WRONG. When you say “Another important point is that the hyperplane above is a straight line, this means we used a linear activation function (i.e. a step function) for our neuron. If we used a sigmoid function or similar the hyperplane would resemble a sigmoid shape as seen below. (not the best image so please excuse my poor paint skills). The hyperplane generated by the image depends on the activation function used” It is clear that the form of the activation function can change the shape of the hyper-surface that divides the input space. But a sigmoid does not look like a sigmoid when plotted in the input space. The whole idea of activation function is to apply it to a linear combination of the inputs, and so in your example the sigmoidal structure is seen not in the plane but in the third dimension. In the plane you are watching from above and you would see not a simple hyperplane but a soft surface coming from the plane (that has zero output) towards another plane with the highest (1) activity. If I’m wrong, you could calculate the XOR function with a single neuron with a sigmoidal activation function and I bet my money you cannot do that.
    Moreover a step function is non-linear, for linearity is defined as f(x) is linear iff f(ax+by)=af(x)+bf(y), which is clearly not the case for a step function.

    Comment by Juan C. Valle LIsboa | 27 March 2009

  37. Hi Juan,

    You are completely correct, I never really proofread the tutorial that well, which is obvious from my “linear” step function reference.

    I will correct that. Also you are right about the hyperplane issue as well; furthermore I’ve made another mistake by referring to decision boundaries as hyperplanes. Even so, the shape is not affected to a great extent by the activation function.

    I did some quick research, and correct me if I’m wrong, but the hyperplane actually gets generated by the linear combination of the weights and inputs? And the decision boundary is created on this hyperplane / hyper surface by the activation function?

    Also this was meant as a beginning tutorial, so I have no idea why I even mentioned the whole topic. Thanks for the comment and I’ll correct the text first chance i get!

    Comment by Bobby | 27 March 2009

  38. [...] There is a very good intro to neural networks theory written by some other WordPress blogger. It can be found here. [...]

    Pingback by A simple non-linear neuron model « dare2be | eb2erad | 12 October 2009

  39. Sir, Thank you for your code.. This made us understand how complicated an NN code is…….

    Comment by KUSUM | 20 November 2009

  40. Sir I have just gone through the code.. And found that NN is really complicated…

    Comment by Gopus | 20 November 2009

