Wow, it’s been a rough couple of weeks: I had to hand in my graphics project, study for a statistics test, fight off my allergies (I hate spring) and then study for my finals. At least I have my degree now.
Anyways, I promised I’d write about the radon transformation I used to convert from the extracted images to a numerical format suitable for input into our neural network. This technique is extremely effective and is already used in industry for just such purposes. We tested it on demo day with very minimal data and it worked remarkably well.
Before I get knee deep in the technical aspects of the system, I need to mention this: due to the preprocessing done on the detected motion, there is no need for a complicated AI system; the radon transformation and the recursive feature extractor together remove a lot of the noise and problems that may otherwise have been present. The radon transform especially helps as it has built-in scaling, so this does not have to be taken into account later on. Also, objects similar in shape have extremely similar radon transformations, so the training time of the neural network was reduced, as was the number of hidden neurons necessary.
In the final demo we used a neural network with 408 inputs and only 4 hidden neurons. Scary, isn’t it? 😛
Now back to the nitty gritty: the radon transform. If you Google the “radon transform” you’ll probably get the Wikipedia page with a scary looking equation. I also got a fright the first time I saw this but after some research it’s really simple.
The basic idea of the radon transform (or my modified version thereof) is simple: if you look at your 2D image in the XY plane, you flatten the image onto the X axis (figure 1), then divide the X axis into several beams and work out the number of pixels within each beam. Your output will be the pixel contributions of the object to each beam. Then you rotate the object and flatten it once again. Doing this for multiple angles will give you a very good representation of the object’s shape.
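To make the flattening idea concrete, here is a minimal sketch of a single unrotated projection. It is not the code from my project: the function name `flattenOntoX`, the binary image layout (1 = object pixel) and the whole-pixel beam assignment are all assumptions made for illustration, and it ignores the sub-beam refinement entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: flattens a rectangular binary image (1 = object pixel)
// onto the X axis and counts the object pixels that land in each beam.
// Simplification: a pixel column is assigned wholly to one beam.
std::vector<double> flattenOntoX(const std::vector<std::vector<int>>& image,
                                 std::size_t numBeams)
{
    assert(numBeams > 0 && !image.empty());
    const std::size_t width = image.front().size();
    std::vector<double> beams(numBeams, 0.0);

    for (const auto& row : image)
        for (std::size_t x = 0; x < width; ++x)
            if (row[x] != 0)
                beams[x * numBeams / width] += 1.0;  // beam index for column x

    return beams;
}
```

With 4 beams, a solid 4x4 square flattens to 4, 4, 4, 4, while a 4x4 square with its middle 2x2 removed flattens to 4, 2, 2, 4, so the two shapes are already distinguishable from one projection.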
The most basic (and unfortunately most commonly used) technique for image classification is to simply get the centroid of the object and then trace the outline of the object, giving you a silhouette. Now this doesn’t sound so bad, does it? Well, it is. Firstly, it doesn’t handle broken up images well (not without major preprocessing or modification), and it also loses a lot of detail and can provide false matches. In figure 1 below we have the radon transform of a solid circle and a hollow circle. A standard outline trace would provide exactly the same result for these obviously different shapes, while as you can see the radon transform (in one projection) provides completely different results. Again, this preprocessing will take the strain off the neural network (or whatever other AI technique we use for classification).
Okay, now for the technical details: as you remember, we flatten the image according to some projection. Figure 2 shows some of these projections. Now if you look at figure 2 you might notice that the flattened image’s top border can be seen as the graph of some function, so the number of pixels in a beam is approximately the area under the graph between the left and right end points of the beam.
Now that picture is misleading, as you might think that it is a square object that we’ve rotated and flattened, but it is in fact a single pixel. The algorithm works on a per pixel basis. Instead of actually flattening the object, we simply work out the equation of the graph for a single rotated pixel and then use that to run through all the pixels in the object, work out the left most and right most extent of each one and then add them to their respective beams.
Now some of you are screaming that if we just rotate the pixels it will be wrong as we aren’t rotating the entire object but that is taken into account later on.
Now how do we calculate the area under the graph for each pixel and how do we figure out what beam to add it to since a beam will have lots and lots of pixels in it? Also the beam widths will differ per object.
What we do is simply divide the beams into lots of sub-beams, so that multiple sub-beams pass through each pixel. Then for each pixel we work out the left most sub-beam and the right most sub-beam that pass through the pixel. This then becomes the domain for the equation of the graph we had earlier, and we loop through each sub-beam, calculate the pixel’s contribution to it (the area under the graph) and then add it to the sub-beam total. This is shown in figure 3. What you’ll also notice from figure 3 is that there is a small degree of approximation to reduce the calculations required for the area, but remember that we’re talking about fractions of a pixel here, so the total error in approximation can easily be ignored.
Now for each projection we run through each pixel and add it to the appropriate sub-beams. Once this is complete we sum the sub-beams up into the original number of beams and then we divide each beam by the scaling factor. The scaling factor is simply the total pixels over the beam width; this reduces the total area for the beams to 1. So every object gets reduced to an n-beam representation where the sum of all beams is equal to 1.
Okay, my explanation is very basic and I’m sure mathematicians would point out various mistakes and so on, but I’m trying to make this easy to understand and follow. It is not meant as a 100% mathematically accurate explanation, and obviously if you wish to implement something like this, you wouldn’t use only my guide as a reference. I’ve also left out some details, but they should become apparent from the explanations below.
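As a rough illustration of the sub-beam loop, here is a sketch for the simplest case only: an unrotated pixel, whose graph is a flat line of height 1 across its width. The function name `addFlatPixel` and its parameters are hypothetical, and unlike the approximation mentioned above it computes the exact overlap between the pixel and each sub-beam.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper: adds one unrotated pixel spanning [left, right] on the
// projection axis to every sub-beam it overlaps. Sub-beam i covers
// [i * subBeamWidth, (i + 1) * subBeamWidth).
void addFlatPixel(std::vector<double>& subBeams,
                  double left, double right, double subBeamWidth)
{
    assert(subBeamWidth > 0.0 && left <= right && !subBeams.empty());
    const int lastIndex = static_cast<int>(subBeams.size()) - 1;

    // Left most and right most sub-beams passing through the pixel,
    // clamped to the range we actually store.
    const int first = std::max(0, static_cast<int>(std::floor(left / subBeamWidth)));
    const int last = std::min(lastIndex,
                              static_cast<int>(std::ceil(right / subBeamWidth)) - 1);

    for (int i = first; i <= last; ++i) {
        const double lo = std::max(left, i * subBeamWidth);
        const double hi = std::min(right, (i + 1) * subBeamWidth);
        subBeams[i] += std::max(0.0, hi - lo);  // height is 1, so area = overlap width
    }
}
```

A pixel spanning 0.5 to 1.5 with sub-beams 0.25 wide touches sub-beams 2 to 5 and adds 0.25 to each, so its full area of 1 is shared out across four sub-beams.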
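The collapse and scaling step might look something like the sketch below. Note the assumption: rather than reproducing my exact scaling factor (whose units depend on how the sub-beam totals are stored), this hypothetical `collapseAndNormalise` simply divides by the total accumulated area, which gives the same end result of beams that sum to 1.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: sums groups of sub-beams into beams, then rescales
// the beams so that they add up to 1 (assumed normalisation, see lead-in).
std::vector<double> collapseAndNormalise(const std::vector<double>& subBeams,
                                         std::size_t subBeamsPerBeam)
{
    assert(subBeamsPerBeam > 0 && subBeams.size() % subBeamsPerBeam == 0);
    std::vector<double> beams(subBeams.size() / subBeamsPerBeam, 0.0);

    for (std::size_t i = 0; i < subBeams.size(); ++i)
        beams[i / subBeamsPerBeam] += subBeams[i];

    const double total = std::accumulate(beams.begin(), beams.end(), 0.0);
    if (total > 0.0)                      // leave an empty object as all zeros
        for (double& b : beams) b /= total;

    return beams;
}
```

Because every object ends up with the same total, a large and a small copy of the same shape produce roughly the same beam values, which is the scaling behaviour described above.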
I’m struggling to find a good way to structure this guide so I’m just going to run through the algorithm simply just to finish off.
The first steps we need to take before we can process the object are to get the total number of pixels and work out the centroid and the approximate radius of the object. Using the radius we work out the number of beams and sub-beams we need for the transformation. Remember that we want several sub-beams to pass through each pixel.
Now we run each of our projection functions to calculate the sub-beam totals. I’ll run through the basic procedure for a projection:
1. Work out the center of the pixel on the new axis (this is where the rotation of the object comes into play).
2. Work out the left most and right most sub-beams that pass through the pixel.
3. For each sub-beam, add the pixel’s contribution to it.
Note: For some equations the graph has an incline, so this needs to be calculated too and processed separately, i.e. work out the left most and right most sub-beams for the increasing incline and for the decreasing incline, and then use those to work out each section separately.
Now once all the sub-beams have been calculated, we work out the scaling factor, which is beamWidth / numPixels. We then sum all the sub-beams into beams (per projection) and multiply each one by the scaling factor. And that’s it: we have our complete numerical representation of our image.
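A sketch of these first measurements is below. The names `Pixel`, `ShapeStats`, `measureShape` and `beamWidthFor` are hypothetical, and two things are assumptions rather than what my code necessarily does: the radius is taken as the distance from the centroid to the furthest pixel center, and the beam width is derived from a fixed beam count with one pixel of padding on each side.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

struct Pixel { int x; int y; };                            // hypothetical pixel type
struct ShapeStats { double cx; double cy; double radius; };

// Centroid of the object's pixels, and the distance from the centroid
// to the furthest pixel center (assumed meaning of "approximate radius").
ShapeStats measureShape(const std::vector<Pixel>& pixels)
{
    assert(!pixels.empty());
    double sumX = 0.0, sumY = 0.0;
    for (const Pixel& p : pixels) { sumX += p.x; sumY += p.y; }

    ShapeStats s{ sumX / pixels.size(), sumY / pixels.size(), 0.0 };
    for (const Pixel& p : pixels)
        s.radius = std::max(s.radius, std::hypot(p.x - s.cx, p.y - s.cy));
    return s;
}

// Assumed rule: spread numBeams evenly over the object's diameter, padded by
// one pixel per side so a rotated edge pixel still falls inside a beam.
double beamWidthFor(const ShapeStats& s, int numBeams)
{
    assert(numBeams > 0);
    return 2.0 * (s.radius + 1.0) / numBeams;
}
```

Tying the beam width to the radius like this is what makes the beam widths differ per object while the number of outputs stays the same.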
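The first two steps could be sketched as follows. All names here (`projectCenter`, `shadowWidth`, `subBeamsUnderPixel`, `axisOrigin`) are hypothetical; the only facts relied on are that rotating a point onto an axis at angle theta is a dot product with (cos theta, sin theta), and that a unit square casts a shadow of width |cos theta| + |sin theta| on that axis.

```cpp
#include <cassert>
#include <cmath>

// Step 1: position of a pixel center along the projection axis at angle
// theta (radians), measured relative to the object's centroid (cx, cy).
double projectCenter(double x, double y, double cx, double cy, double theta)
{
    return (x - cx) * std::cos(theta) + (y - cy) * std::sin(theta);
}

// Width of the shadow a unit pixel casts on an axis at angle theta.
double shadowWidth(double theta)
{
    return std::fabs(std::cos(theta)) + std::fabs(std::sin(theta));
}

struct SubBeamRange { int first; int last; };

// Step 2: left most and right most sub-beams under the pixel's shadow.
// axisOrigin is the axis coordinate of the left edge of sub-beam 0.
SubBeamRange subBeamsUnderPixel(double center, double theta,
                                double axisOrigin, double subBeamWidth)
{
    assert(subBeamWidth > 0.0);
    const double half = 0.5 * shadowWidth(theta);
    const int first = static_cast<int>(std::floor((center - half - axisOrigin) / subBeamWidth));
    const int last = static_cast<int>(std::ceil((center + half - axisOrigin) / subBeamWidth)) - 1;
    return { first, last };
}
```

Because only the pixel center is rotated, nothing in the image is ever resampled; the rotation of the whole object falls out of projecting every pixel center with the same theta.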
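For the curious, the inclined case can also be handled in closed form. The sketch below is an alternative to the section-by-section approach in the note, not a copy of it: it treats the shadow of a rotated unit pixel as a trapezoid (rising incline, flat top, falling incline) with total area 1, and integrates that exactly. The function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Area of a rotated unit pixel's shadow lying to the left of offset t,
// where t is measured from the pixel center along the projection axis.
double cumulativePixelArea(double t, double theta)
{
    if (t > 0.0)                                   // the shadow is symmetric
        return 1.0 - cumulativePixelArea(-t, theta);

    const double c = std::fabs(std::cos(theta));
    const double s = std::fabs(std::sin(theta));
    const double halfBase = 0.5 * (c + s);           // half the full shadow width
    const double halfTop = 0.5 * std::fabs(c - s);   // half the flat top width
    const double height = 1.0 / std::max(c, s);      // height of the flat top

    if (t <= -halfBase) return 0.0;                  // entirely left of the shadow
    const double ramp = halfBase - halfTop;          // width of one incline
    if (t < -halfTop) {                              // on the increasing incline
        const double d = t + halfBase;
        return height * d * d / (2.0 * ramp);
    }
    return 0.5 * height * ramp + height * (t + halfTop);  // on the flat top
}

// Contribution of the pixel to a sub-beam covering offsets [a, b].
double pixelAreaBetween(double a, double b, double theta)
{
    assert(a <= b);
    return cumulativePixelArea(b, theta) - cumulativePixelArea(a, theta);
}
```

At theta = 0 the inclines vanish and this reduces to the flat case; at 45 degrees the flat top vanishes and the shadow is a triangle. In both cases the total area across the whole shadow stays at one pixel.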
Note: I used only 8 projections as I had very limited CPU time left at this stage of the project and had to limit the amount of processing that needed to be done. Obviously more projections would be better, but then again too many would be worse. A fine balance needs to be found; I personally think that 8 projections are more than sufficient for my needs. Again, GPGPU programming would be so useful here!
C++ Source Code: https://github.com/BobbyAnguelov/RadonTransformer