My last article was on the topic of alphabet recognition using artificial intelligence – https://exact-byte.com/mobile-application-for-sign-language-recognition/
In the meantime, I worked on another application, this time for a local company. It runs on a device similar to a Raspberry Pi (more information here – https://www.raspberrypi.com/), recognizes whether a car has occupied a parking space, and reports which parking spaces are free.
The Raspberry Pi is not a very powerful device and it struggles with demanding operations such as real-time object recognition, but the solution I’ve come up with is pretty solid.
Although the device requires active cooling, it can recognize about 20 vehicles in about 5 seconds. It can work with multiple parking lots, but only one after another – in other words, two parking lots take about 10 seconds.
The image shows a Raspberry Pi:
The original idea behind the project was to use something very similar to this here – https://school.geekwall.in/p/SJRJn01Nr
Suppose we have one parking lot, as in the picture below:
We define the parking spaces ourselves by drawing them in the application and marking them on the image, as in the following example.
The application then detects every vehicle in the parking lot, as in the image:
If a detected car is in a marked place, that parking space is occupied.
We want to get the result via the API – the result is entered into the database, and at any time we can retrieve the data via the REST API.
What would an example of a REST API call look like? Something like this:
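The original example is not reproduced here, so the following is only a rough sketch of what the body of such a response might contain. The endpoint path in the comment, the function name `build_status_response`, and every field name are assumptions made for illustration, not the application's actual API.

```python
import json

# Hypothetical response body for a call such as
#   GET /api/parking-lots/1/spaces
# The path and all field names are illustrative assumptions.
def build_status_response(lot_id, occupied_by_space):
    """Build a JSON body listing each space and whether it is occupied."""
    spaces = [
        {"space_id": space_id, "occupied": occupied}
        for space_id, occupied in sorted(occupied_by_space.items())
    ]
    free_spaces = [s["space_id"] for s in spaces if not s["occupied"]]
    return json.dumps(
        {"parking_lot": lot_id, "spaces": spaces, "free_spaces": free_spaces}
    )

# Occupancy as it might be read from the database: space id -> occupied?
body = build_status_response(1, {1: True, 2: False, 3: True})
print(body)
```

In a Flask application, a body like this would typically be returned from a route that first reads the latest detection results from the database.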
Python 3.7, Flask, SQLAlchemy, PyTorch, and other auxiliary libraries were used to create the program.
What to pick for the architecture?
The question is which architecture to choose for a project like this.
The initial architecture was based on two services: one was a regular Python Flask web application (backend and frontend) with a database, and the other was an ML service to which queries are sent.
This second ML service is essentially wrapped with TorchServe (https://pytorch.org/serve/) and serves the model out of the box.
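As a minimal sketch of how the web application might query the ML service, the snippet below builds a request for TorchServe's inference endpoint, `POST /predictions/<model_name>`. The address `http://localhost:8080` is TorchServe's default inference address and is assumed here, and the model name `vehicle_detector` is hypothetical.

```python
import urllib.request

TORCHSERVE_URL = "http://localhost:8080"  # assumption: default inference address
MODEL_NAME = "vehicle_detector"           # hypothetical model name

def build_prediction_request(image_bytes, base_url=TORCHSERVE_URL,
                             model_name=MODEL_NAME):
    """Build a POST request for TorchServe's /predictions/<model_name> endpoint."""
    return urllib.request.Request(
        url=f"{base_url}/predictions/{model_name}",
        data=image_bytes,
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )

# Placeholder bytes; in practice this would be the encoded camera frame.
request = build_prediction_request(b"<jpeg bytes here>")
print(request.get_method(), request.full_url)

# Actually sending it needs a running TorchServe instance, for example:
# with urllib.request.urlopen(request, timeout=60) as response:
#     detections = response.read()
```

Keeping the model behind its own HTTP service means the Flask application never has to load PyTorch itself.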
The initial example used Mask R-CNN, an object detection architecture chosen for its precision, but it is difficult to run something like this on a small device like the RPi.
This is because such a demanding architecture takes approximately 30-40 seconds to process a single image! That’s right, 30-40 seconds for one image!
I don’t have to explain how much that small device heats up when it tries to do something like that, especially if it has to work with several parking lots that each contain many parking spaces.
It definitely can’t work all the time because after 20 minutes it overheats and starts to shut down. I even measured the temperature while processing 3 different parking lots with a pause of 5 seconds in between to see how it behaves. The conclusion is as follows:
The temperature creeps closer and closer to 70, at which point the device starts to shut down cores. That increases the load on the remaining cores, the device heats up further, and after a while it shuts down.
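A measurement like this can be scripted. The sketch below assumes a Linux board that exposes the CPU temperature in millidegrees Celsius at `/sys/class/thermal/thermal_zone0/temp`, as Raspberry Pi OS does. The 65-degree limit and the function names are hypothetical choices, not values from the project.

```python
THERMAL_FILE = "/sys/class/thermal/thermal_zone0/temp"  # assumption: Linux sysfs path
MAX_TEMP_C = 65.0  # hypothetical safety margin below the ~70 mentioned above

def parse_millidegrees(raw):
    """Convert a sysfs reading such as '51540\\n' to degrees Celsius."""
    return int(raw.strip()) / 1000.0

def read_cpu_temp(path=THERMAL_FILE):
    """Read the current CPU temperature from sysfs (only works on the device)."""
    with open(path) as f:
        return parse_millidegrees(f.read())

def should_pause(temp_c, limit=MAX_TEMP_C):
    """True when the next detection run should be skipped to let the CPU cool."""
    return temp_c >= limit

print(parse_millidegrees("51540\n"))  # 51.54
print(should_pause(68.2))             # True
```

A detection loop could call `read_cpu_temp()` before each parking lot and sleep longer whenever `should_pause` returns true.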
Is it necessary to use an object detection architecture at all? Can we do something with “ordinary” image recognition?
Image classification is a much less demanding problem than object detection. The program takes an individual image and returns what it “sees” in the image.
The algorithm doesn’t need to return the regions where it found the objects since there is an assumption that there is one object in the image, so it is necessary to classify that one object.
But the problem arises when we have two objects in the image, and we ask the program to give us one answer about what is in the image.
What answer are we going to get? We cannot say that our algorithm is wrong if it answers “cat” or “dog”. But what we are interested in is not only what is in the image, but also whether there are more objects in the image. Why? Because we can imagine a situation like this.
Do we want to detect that the tree is present in the image here, or do we want to (also?) detect that the car is in the image? These are two separate objects, but it would be a mistake for someone to mark a parking space like this, as much as it would be a mistake if the answer they received back was “tree”, because that would make the number of parking spaces smaller than it is.
Anyway, the problem with image classification is that we don’t get all the data back. It is hard to imagine how the program would work if the user marked the image and we just asked whether there is a car in it or not, because with that information alone we can’t answer someone who asks “why couldn’t I park here even though it says that there is a free parking space here?”. The answer “our neural network did not detect a free space because it thought there was a tree in that place” is not the answer someone wants to hear.
Of course, the example is exaggerated for a reason. But let’s look at a slightly more realistic example, which is in the image below.
There is no guarantee that under certain conditions our algorithm – which is “dumb” in many respects and can hardly be compared with human common sense – won’t sometimes return “bushes” or “meadows” as answers to the question of what is in the image.
There is no guarantee that if a person crosses a parking space at that moment – even though a car is parked there – our algorithm won’t decide to return a “person” in response to the question of what it sees in that space.
To cut a long story short, the classification of images, in this case, is not robust enough – it is not sufficiently error-resistant. Therefore, this is not a good solution for detecting parking spaces.
Of course, we may end up having similar problems even when object detection is used.
How can we be sure that a certain parking space is occupied, even when we detect all the vehicles in the picture?
If we take an example, again from the image, with this configuration, we can easily assess whether the car is in the parking space. We see that the red square that marks the parking space is inside the green square, hence that space is occupied.
But what if a truck comes to that place? Would that be valid? No. How about if a slightly bigger vehicle comes to that place? It is also possible that it would not work. Should we then blame stupid people who do not understand that this is “just a program” and that the boundaries of the parking space should be clearly defined? Or is there a better way?
The answer given by the author of the article (https://school.geekwall.in/p/SJRJn01Nr) seems quite reasonable. We simply calculate how large the intersection of these two squares is. This measure, called “Intersection over Union”, is the area where the parking space square and the detected vehicle overlap, divided by the total area the two cover together. In our case, that’s great. It works.
But what value should that measure have? How large must the overlap be? 50%? 60%? 70%?
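To make the measure concrete, here is a minimal sketch of Intersection over Union for two axis-aligned boxes. The `(x1, y1, x2, y2)` pixel-coordinate format and the sample boxes are assumptions for illustration; this is not code from the referenced article.

```python
def intersection_over_union(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlapping region (0 if the boxes do not meet).
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

space = (0, 0, 100, 200)      # marked parking space
vehicle = (10, 20, 110, 220)  # detected vehicle, slightly offset
print(round(intersection_over_union(space, vehicle), 2))  # 0.68
```

Even a vehicle that is only slightly offset from its marked space scores well below 1.0, which is why the choice of threshold matters so much.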
The problem arises in defining the exact amount. At 100% we reach the original problem – the squares must match perfectly. That’s hard.
What would happen if we put this value at 50%? Or less? Because, quite frankly, what’s the worst that can happen?
Well, this can happen.
Now we can see that if the amount of this overlap is somewhere around 0.7, then the algorithm will report that the car is in that position. Why? Because about 30% of the car from the adjacent parking space “enters” our square.
But let us assume that this is manageable if we set the threshold to around 50%. What would happen? We can imagine someone parking in the adjacent parking space. If it is a larger vehicle or it is parked closer to our parking space, we can easily imagine that our square is more than 50% full. This is solvable if we look at the overlap of just one vehicle, but again we can imagine a situation where a car that is parked correctly occupies less than 50%.
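Both failure modes can be illustrated with made-up numbers. The sketch below uses a simplified measure, the fraction of the parking space square that a vehicle box covers, rather than full Intersection over Union. The coordinates and the 0.5 threshold are hypothetical.

```python
def covered_fraction(space, vehicle):
    """Fraction of the parking-space box (x1, y1, x2, y2) covered by the vehicle box."""
    sx1, sy1, sx2, sy2 = space
    vx1, vy1, vx2, vy2 = vehicle
    w = max(0, min(sx2, vx2) - max(sx1, vx1))
    h = max(0, min(sy2, vy2) - max(sy1, vy1))
    return (w * h) / ((sx2 - sx1) * (sy2 - sy1))

THRESHOLD = 0.5  # hypothetical cut-off

space = (100, 0, 200, 200)        # our parking space, 100 x 200 px
neighbour_van = (0, 0, 155, 200)  # parked next door, reaching into our space
small_car = (130, 60, 190, 170)   # parked correctly inside our space

print(covered_fraction(space, neighbour_van))  # 0.55
print(covered_fraction(space, small_car))      # 0.33
```

With this threshold the neighbouring van marks our empty space as occupied, while the small car that really is parked there is missed.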
To cut a long story short, there are a lot of good reasons in practice why something like this is too rigid when used in a less-than-ideal parking lot.
What is the final solution then? The final solution is quite simple and quite brilliant, because it does not seem so obvious at first. Also, it is much faster to calculate.
It is true that a vehicle comes into a parking space, but the very way we framed that user requirement leads us to assume that the larger square must be the parking space and the smaller square the vehicle.
Are we really bound or can we do something about it?
So therein lies the solution to the whole conundrum. We define small squares that we put in the middle of the parking space so that the vehicle must occupy the center of the parking space. We can see small green squares (to the right of this selected one) in the image.
No extra calculation or thinking is required. The check is quite simple: is the square of the parking space within the detected vehicle?
Yes – the parking space is occupied; no – it is free.
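The check described above can be sketched in a few lines. Boxes are assumed to be `(x1, y1, x2, y2)` tuples in pixel coordinates, and the space ids, marker positions, and function names are hypothetical.

```python
def contains(outer, inner):
    """True if the inner box lies entirely within the outer box."""
    ox1, oy1, ox2, oy2 = outer
    ix1, iy1, ix2, iy2 = inner
    return ox1 <= ix1 and oy1 <= iy1 and ix2 <= ox2 and iy2 <= oy2

def occupancy(markers, vehicle_boxes):
    """Map each space id to True if any detected vehicle covers its marker."""
    return {
        space_id: any(contains(vehicle, marker) for vehicle in vehicle_boxes)
        for space_id, marker in markers.items()
    }

# Small squares drawn in the middle of two neighbouring parking spaces.
markers = {"A1": (40, 90, 60, 110), "A2": (140, 90, 160, 110)}
# One detected vehicle that spills over towards space A2 without covering its marker.
vehicles = [(0, 0, 155, 200)]

print(occupancy(markers, vehicles))  # {'A1': True, 'A2': False}
```

The check needs only four comparisons per pair of boxes and no threshold, so a vehicle that merely leans into a neighbouring space does not mark that space as occupied.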
Such solutions are very beautiful because they are not intuitive but they perfectly match the requirement.
There are limitations, of course. At the user’s request, I looked at how successful the neural network is at detecting vehicles from a “bird’s eye view”.
The conclusion is that the application is not very successful in detecting vehicles from the air, and has very low accuracy. It detects approximately 5 out of 50 vehicles when “looking” from a “bird’s eye view”. The assumption is that this is because only a small number of the images used to train this neural network were taken from a “bird’s eye view”.
It would be ideal if we could get images of such parking lots and train the neural network to recognize vehicles from such a perspective. To do that, we would have to mark (label) all the vehicles on a large number of images, so that the network can properly learn to recognize them.
To summarize, it is necessary to use object detection with a faster neural network architecture.
Also, it would be easiest to use a smaller square or a simple dot to mark parking spaces, so that object detection can then detect a vehicle in a parking space with the most robust result possible.
The application works very precisely when the cameras have a usual perspective, but when we place them to give a “bird’s eye view”, the program cannot be used unless it is trained on additional examples from such a perspective.