POWER READ
We have all seen it in the movies: a camera tens of metres away that can zoom in almost infinitely, down to the minutest details, allowing operators to locate and track thousands of faces. They can recognise a particular person among thousands of others, and once they do, they can analyse his or her face to know their gender, age, ethnicity, emotions, micro-expressions, personality, whether they’re lying, where they’re looking, or even what they’re about to do next.
But is this how it really works? What is technologically possible today and what is not? What’s Hollywood fiction and what’s reality? And what should we expect in the near future? This is what Facial Analysis, an umbrella term that comes under the general research field of Computer Vision (CV), is all about.
CV is an interdisciplinary scientific and engineering field that focuses on developing techniques to help computers “see” and understand the content of images and videos. CV aims to mimic functions of human vision (not necessarily by copying it) and solve many visual subtasks that all of us effortlessly perform in our everyday lives, such as locating objects, reading text, navigating a room, recognising and reading faces, and so on.
CV requires an image-capturing device, like a camera, and a processing unit, like a PC’s processor, which analyses the image data using complicated algorithms and extracts useful information from it. While cameras make up the majority of inputs in CV, any sensor that can produce images can be part of a CV system. Examples include radiographic sensors capturing X-rays, sensors inspecting production lines, LIDAR, or Time-of-Flight sensors. Whatever image-capturing device you use, the pipeline is usually the same.
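To make that pipeline concrete, here is a minimal sketch in Python using the OpenCV library (an assumption; any image-processing library would do). The file name and the edge-detection step are purely illustrative stand-ins for whatever sensor and algorithm a real application would use.

```python
# A minimal sketch of the capture -> process -> extract pipeline,
# assuming the OpenCV library (installed as `opencv-python`).
import cv2

# "Capture": here we simply read a stored image; a webcam or any other
# image-producing sensor could feed the same pipeline.
image = cv2.imread("street_scene.jpg")          # hypothetical file name
if image is None:
    raise FileNotFoundError("Could not load the example image")

# "Process": convert to grayscale and run a basic edge detector as a
# stand-in for whatever algorithm the application needs.
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, threshold1=100, threshold2=200)

# "Extract information": report a crude statistic derived from the pixels.
edge_ratio = (edges > 0).mean()
print(f"{edge_ratio:.1%} of pixels lie on detected edges")
```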
CV is also considered a part (or even a subset) of AI. In fact, CV has been one of the major drivers of AI advancements in the last few years, especially with the widespread adoption of Deep Learning (DL). DL is a subset of Machine Learning and AI which has revolutionised many aspects of CV, such as object recognition. The breakthrough came in 2012, when a DL model called “AlexNet” won the annual ImageNet object recognition competition by a significant margin compared to the best models of the previous years. AlexNet was able to recognise thousands of object categories, by analysing millions of images, with an error rate of 15.3% (the previous best was ~26.1%). In sports terms, shrinking the record by the same proportion would be like cutting Usain Bolt’s 100-metre world record from 9.58sec down to a superhuman 5.62sec! Such was the scientific impact that, ever since, DL has dominated many aspects of CV, Facial Analysis among them.
Facial Analysis (FA) is a series of computational tasks that extract useful information from images or videos of faces, and is part of the CV field. Most people use the term “Face Recognition” to describe any technology related to FA. In actuality, Face Recognition is just one of the tasks that FA encompasses.
In order to analyse a face, first you need to locate it. This is called Face Detection. The objective here is to locate any areas of pixels in a photo or video that correspond to faces. Face Detection is not about the identity of a person (i.e. whether this face belongs to Bob, Susan or a wanted terrorist) but just whether this area of pixels is a face or not. The output of Face Detection is usually a “bounding box”, a square or rectangle that marks the pixel area in which a face exists.
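As an illustration, below is a rough face-detection sketch using OpenCV’s bundled Haar-cascade model. The photo name is hypothetical, and a production system would more likely use a modern deep-learning detector, but the output, a list of bounding boxes, is the same idea.

```python
# A minimal face-detection sketch with OpenCV's bundled Haar-cascade model.
import cv2

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

image = cv2.imread("shop_window.jpg")                     # hypothetical file
if image is None:
    raise FileNotFoundError("Could not load the example image")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Each detection is a bounding box (x, y, width, height) in pixel coordinates.
faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
print(f"Detected {len(faces)} face(s)")

# Draw the bounding boxes and save an annotated copy of the image.
for (x, y, w, h) in faces:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("shop_window_annotated.jpg", image)
```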
By itself, Face Detection can give a lot of insightful information, based on the application you’re targeting. For example, if you want to measure how many people are visiting your store every hour, or count how many people are standing outside your shop window, face detection can help you achieve intelligent estimates. Face Detection is a mature technology if operating specifications are met (see later section on this).
Face Tracking applies only to videos: it allows you to follow a face, detected in an earlier frame, across successive video frames. You can track multiple detected faces at the same time. This doesn’t mean you know the identity of the person, just that it is the same face that was detected several moments before. Face Tracking is useful when there are multiple faces in a space and you don’t want to re-detect (and perhaps re-count) the same face twice. If you’re counting people in attendance, face tracking is particularly useful.
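One simple way to picture face tracking is “tracking by detection”: the boxes found in each frame are linked to the boxes from the previous frame by how much they overlap. The sketch below assumes per-frame detections are already available (for instance from the detector above) and uses hypothetical box coordinates; real trackers are considerably more robust.

```python
# A sketch of simple "tracking by detection": boxes detected in consecutive
# frames are linked by overlap (IoU), so the same face keeps the same ID and
# is not counted twice. Real trackers (e.g. Kalman-filter based) handle
# occlusions and missed detections far better than this toy version.

def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    iw = max(0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union else 0.0

def update_tracks(tracks, detections, next_id, threshold=0.3):
    """Assign each new detection to the best-overlapping track, or start a new one."""
    new_tracks = {}
    for box in detections:
        best_id, best_iou = None, threshold
        for track_id, old_box in tracks.items():
            overlap = iou(box, old_box)
            if overlap > best_iou:
                best_id, best_iou = track_id, overlap
        if best_id is None:            # no good match: treat as a new face
            best_id, next_id = next_id, next_id + 1
        new_tracks[best_id] = box
    return new_tracks, next_id

# Usage with hypothetical per-frame detections: the same face moving slightly
# keeps its ID, so two frames with one shared face count as two people, not three.
tracks, next_id = {}, 0
frame1 = [(100, 80, 60, 60)]
frame2 = [(104, 82, 60, 60), (300, 90, 55, 55)]
for detections in (frame1, frame2):
    tracks, next_id = update_tracks(tracks, detections, next_id)
print(f"Unique faces seen so far: {next_id}")   # -> 2
```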
Once you have localised a face using face detection, you may need to know the identity of the person. Is this Bob, Susan or a wanted terrorist? For face recognition to work, you need a database of known faces against which the newly detected, unknown face can be matched. Face recognition is commonly used for automated entrance control at office premises: you have a database of the faces of all employees, and once a new face is detected at the entrance, it is matched against the faces of the authorised employees.
Face Verification attempts to answer the question: “given two faces, do they belong to the same person?”. No database of known faces is involved. That is, while you may conclude that the two faces belong to the same person, you still don’t know who this person is. Have you ever unlocked your mobile phone with your face? That’s face verification in action.
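Under the hood, both recognition and verification are commonly built on face “embeddings”: numeric vectors produced by a deep model so that faces of the same person end up close together. The sketch below assumes such a model sits behind the placeholder embed() function (it is deliberately left unimplemented), and the 0.6 similarity threshold is an illustrative value, not a recommendation.

```python
# A sketch of face verification (1:1) and face recognition (1:N) on top of
# face embeddings. embed() is a placeholder; a real system would call a
# trained face-embedding network here.
import numpy as np

def embed(face_image) -> np.ndarray:
    """Placeholder for a deep model mapping a face crop to a feature vector."""
    raise NotImplementedError("plug in a real face-embedding model")

def similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embeddings (1.0 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(face_a, face_b, threshold=0.6) -> bool:
    """Verification: do these two faces belong to the same (unknown) person?"""
    return similarity(embed(face_a), embed(face_b)) >= threshold

def identify(probe_face, database, threshold=0.6):
    """Recognition: which person in a database of known embeddings is this face?"""
    probe = embed(probe_face)
    best_name, best_score = None, threshold
    for name, known_embedding in database.items():
        score = similarity(probe, known_embedding)
        if score > best_score:
            best_name, best_score = name, score
    return best_name        # None means "not in the database"
```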
Once you’ve located a face in an image, dedicated algorithms, usually grouped under the term face demographics, can estimate the age, gender and race of the person, but not their identity. A typical output of such algorithms would look like this: “23-year-old Caucasian female”, or “elderly Asian male”. Face demographics is usually used in digital signage or retail store analysis to generate aggregated statistics about the average demographic profile of the people who viewed the signage or entered the store.
Specialised families of CV algorithms, called Facial Emotion or Facial Expression Analysis, analyse the contortions of a face to estimate the emotions it portrays.
There are three major approaches to facial emotion analysis. The first attempts to detect a fixed number of predefined emotions on the face, such as the seven basic universal prototypical emotions: happy, surprised, afraid, angry, disgusted, sad, or neutral. This is the simplest form of facial emotion analysis, and the most widespread. The second approach estimates the Valence (positive or negative, and to what extent) and Arousal (energetic or passive, and to what extent) of a facial expression. The last approach aims to understand exactly which muscle groups of the face (called facial action units) are activated within the current facial expression.
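To give a feel for how these three kinds of output differ, here is a small sketch; all the scores and labels in it are made-up illustrative values, assuming some emotion model has already analysed a face crop.

```python
# A sketch of what the three kinds of facial-emotion output typically look like,
# assuming a model has already produced raw scores for a face crop.
import numpy as np

EMOTIONS = ["happy", "surprised", "afraid", "angry", "disgusted", "sad", "neutral"]

# 1) Categorical: seven scores, one per prototypical emotion (hypothetical values).
scores = np.array([2.1, 0.3, -1.0, -0.5, -2.0, 0.1, 1.4])
probabilities = np.exp(scores) / np.exp(scores).sum()      # softmax
print("Predicted emotion:", EMOTIONS[int(probabilities.argmax())])

# 2) Dimensional: a single (valence, arousal) pair, each in [-1, 1].
valence, arousal = 0.7, 0.2          # e.g. "mildly positive, fairly calm"
print(f"Valence {valence:+.1f}, arousal {arousal:+.1f}")

# 3) Action units: which facial muscle groups are active (a few common AUs).
action_units = {"AU6 cheek raiser": True, "AU12 lip corner puller": True,
                "AU4 brow lowerer": False}
print("Active AUs:", [name for name, active in action_units.items() if active])
```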
Head pose and gaze estimation is the type of facial analysis that estimates the orientation of the face and the eyes relative to the camera: specifically, where exactly the person is looking and which way they are turning. This can be used in advertising, to estimate whether a person is actually looking at an advertisement and count impressions, or to check whether a student is paying attention to the content presented by a teacher.
Other types of face analysis include facial attractiveness (predicting how attractive a face is), facial skin analysis (estimating the condition of the facial skin), heart rate estimation (analysing minute colour fluctuations of the facial skin in an ordinary face video to estimate blood flow), drowsiness detection (estimating whether someone is sleepy), personality prediction (from facial features), sexual orientation prediction (also from facial features), and face synthesis (generating synthetic faces that may look indistinguishable from real ones, or fictional faces of non-existent people).
Some of these techniques are rather controversial, such as sexual orientation and personality prediction. However, research has already been published in these areas, and as long as there is abundant data to work with, there is a high chance that someone will eventually commercialise these models. Face synthesis has also attracted a lot of negative attention lately, with high publicity around “deep fakes”, where one person’s face is transferred onto the body of another, blurring the line between reality and animation. This technique has the potential to disrupt the news, since it will become exceedingly difficult to distinguish between a synthetic and a real face in a video.
Hollywood and the news have created a distorted reality regarding CV and FA. While the news tends to over-emphasise success stories and under-report failures of technology, movies tend to show applications that are simply impossible. Let’s try to bust some of the most typical misconceptions.
The democratisation of AI knowledge, tools and data has made it easier than ever for almost anyone with basic programming knowledge to put together a basic FA system. This is attested by the numerous startups in this domain, each promising some form of FA. Although generally positive, this also gives rise to a series of very important questions and concerns.
While all the FA techniques outlined in Chapter One may work well within a range of specifications, they don’t work all the time. In fact, there are many cases where they may completely fail. It goes back to the old computer science rule: “garbage in, garbage out”. If the quality of the image data you are inputting is very bad, then there is little to work with and the result will also be bad. Several factors like image resolution, illumination strength and type, quality of the camera, or camera position can dramatically affect CV and FA.
Think about it as a spectrum between two extremes. At one extreme is the “Skype” scenario, where you are sitting in front of your laptop (or your mobile phone), within centimetres of a good-quality camera, in a frontal position, usually indoors with ideal uniform illumination. At the complete opposite is the “in the wild” scenario, where a person may be tens or hundreds of metres away from a surveillance camera, usually outdoors, under uncontrollable directional illumination (which causes shadows), in a non-frontal profile position (where part of the face is not visible), wearing sunglasses or a hat (which may cause occlusions) and walking (which may cause motion blur).
Most FA algorithms will work well in the Skype scenario. However, the moment you start to deviate from these “ideal” conditions, many of them start to break down. The more you start to approach the “in the wild” scenario, the more FA starts to become useless.
However, different FA algorithms have different tolerance to non-ideal conditions. For example, modern face detection and tracking techniques are quite successful in non-ideal conditions, giving good detection rates even for faces that are far away from the camera. On the other hand, face recognition (knowing the identity of the person), facial expression analysis (knowing the emotions of the person), and other niche FA techniques are greatly affected by “in the wild” conditions. Head pose estimation and demographics may fare better than face recognition, but their performance can still be compromised greatly by non-ideal conditions.
Do note that certain conditions have more detrimental effects on FA than others. The position of the face relative to the camera is one of them: the more you deviate from a frontal image towards a side view of the face, the less accurate FA algorithms become. Distance from the camera is another, as it affects the image resolution and quality.
If you’re thinking of using new FA technologies in your business, familiarise yourself with the limitations of the technology. Ask yourself what your ideal use-case is, and then find out if what you have in mind is within the working specifications of the technology.
Can governments use face recognition and surveillance cameras to automatically find suspects easily?
Not easily. Searching for a particular face in an “ocean” of videos constantly streaming from thousands of cameras all over a city or a country is like searching for a needle in a haystack. Remember that outdoor surveillance cameras mean “in the wild” scenarios (low resolution, motion blur, long distance from the camera, non-frontal images, shadows, headgear and so on). Current face recognition technology is not good enough for such conditions, and there is a good chance it never will be. Unless you are ready to deal with thousands of false alarms, and can afford an army of people to evaluate these false detections, you’re better off not relying on these systems.
This is exactly what happened when authorities tried to use face recognition during the Boston Marathon bombings in 2013. During the manhunt for the two terrorists, officials decided to use a commercial face recognition system, feeding it footage from the city’s CCTV cameras. The result was numerous false alarms, and after a while the authorities chose to pull the plug and rely on more traditional techniques. The reason for this fiasco? The operating conditions of the face recognition system were not met. The required resolution for analysing each face was 90 pixels between the two eyes; the CCTV images offered about 12. Add in motion blur, shadows, occlusions and non-frontal head pose, and you can understand why the technology failed.
You may argue that FA technology must have matured since 2013. During the 2017 UEFA Champions League Final in Cardiff, the UK police tested a new face recognition system that analysed the faces of spectators and flagged potential matches against faces stored in a custody database. The results were equally alarming: 92% of the detections were false. Again, the problem was that these types of algorithms cannot operate “in the wild”.
A better application of FA technology would involve controlled operating conditions and the cooperation of people. For example, analysing the faces of passengers at an immigration checkpoint or the entrance of a building, under a frontal pose, good illumination, and a good camera.
The question on everyone’s mind seems to be: “can a computer know what I’m feeling?”. By and large, the answer is no, especially if you don’t want it to know.
Detecting whether someone is smiling is relatively easy, even at mid-range distances and in non-cooperative scenarios. Estimating more elaborate emotions, however, is almost impossible in non-cooperative environments. Add the layer of interpretation on top and the conclusions become even murkier: an algorithm may correctly detect a smile, but the meaning of that smile is totally out of reach for today’s algorithms. Is the smile due to politeness, attraction or nervousness? No real-world system can tell. The main shortcoming is the lack of any way to capture and analyse social context.
At the moment, FA systems cannot really handle social masking, that is, changing your displayed emotions for social purposes. While there is research on how to distinguish a real smile from a fake one, the level of visual detail required to make such a judgment is not available in most real-life, non-cooperative scenarios. No FA system has been shown to detect social masking outside a fully controlled lab environment.
So if you are sitting in a café, drinking your coffee and talking to your friend, don’t be afraid of the CCTV camera up in the corner. It is there only for security, and it is highly unlikely that anyone can use that footage to infer your emotions. If the use-case involves simply detecting and counting smiles, within close range of the camera and with a relatively controlled head pose, then it is feasible with today’s technology.
Bias can creep in for different reasons. Typically, it arises from data imbalance: not having an equally balanced dataset across the categories you are training your FA system on. If, for example, the training set comprises 80% Caucasian faces because these images are easier to find on the internet, the resulting FA system will perform differently across races. It will be very good at telling the facial attributes of Caucasian people, but when tested on other races in real-life conditions, it may give wrong results. If you use such a system without knowing this limitation, then to a certain extent you might be indirectly creating a racially biased system.
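A first, very basic defence is simply to measure the imbalance before training. The sketch below uses a made-up label distribution to show the idea: count the faces per category and derive inverse-frequency class weights, one common (though on its own insufficient) remedy.

```python
# A sketch of the most basic check against data imbalance: count how many
# training faces fall into each category, and derive per-class weights so
# under-represented groups are not drowned out. The label list is hypothetical;
# in practice it would come from your dataset's annotations.
from collections import Counter

training_labels = (["caucasian"] * 8000 + ["east_asian"] * 1200 +
                   ["south_asian"] * 500 + ["african"] * 300)

counts = Counter(training_labels)
total = sum(counts.values())
for label, count in counts.items():
    print(f"{label:12s} {count:5d} faces ({count / total:.0%} of the data)")

# One common remedy: weight each class inversely to its frequency during training.
class_weights = {label: total / (len(counts) * count) for label, count in counts.items()}
print("Suggested class weights:", class_weights)
```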
Data imbalance is very important, but it is not the worst case. The most dangerous type of bias is the one that is not obvious, arising from hidden patterns in the data which may not reflect the real world. For example, a university developed an FA system to estimate facial demographics (age, gender, race), boasting very high accuracy in this task. When another team of researchers built an explanatory model to understand what exactly the FA system had learnt to recognise, they were shocked. In the training data, all the young people happened to be smiling, whereas the elderly ones had straight faces. As a result, the moment someone smiled, the FA system lowered their estimated age! Obviously, this is not how the real world works.
There are many academic “horror” stories where researchers only later discovered that their system had learnt something totally different from what they originally intended. There are discussions about using FA algorithms to estimate credit risk, university admissions, or even employment. If such unexpected hidden patterns slip into the training data, terrible discrimination can be introduced into these systems. And the worst thing is that we may not even be aware of it.
Can you be sure that an FA system was trained legitimately, without compromising peoples’ privacy?
No, you really cannot tell. This is actually one of the greyest areas in modern AI. Engineers need large amounts of data, so they are forced to source images from anywhere possible. They will usually “crawl” the internet for face images using specific keywords (e.g. happy face, angry face, 28 years old and so on). Any website with face images may be scanned, and the images copied into a huge training dataset.
Suppose you have created a personal photo album on your website, or have uploaded your personal photos to a photo-sharing website like Flickr. You may even have attached a licence that disallows others from using your photos for other purposes. There is still no guarantee that your images will not be used for training some kind of FA system; in fact, there is a great chance that they are already part of some company’s training image set.
The problem is that, once you finish training an FA model, all you end up with is a series of numbers, which gives no direct indication of the origin of the training dataset. It is easy for someone to use photos they have no permission to use and go undetected. Unfortunately, if someone wants to cheat, there is nothing that can be done to stop them at the moment. As a crude rule of thumb, if an FA system comes from a country where rules and regulations are usually overlooked (or are plainly non-existent) in the name of economic development, there is a high chance that it has been trained on non-legitimate data.
And what happens to the data that is captured? FA systems nowadays typically stream the captured content to the cloud, where powerful computers can process the images faster and report back the results. This creates privacy issues, since images of faces are captured and transmitted elsewhere without people’s express consent.
In many jurisdictions, such as the European Union and Singapore, captured facial images may not be transmitted to another location without the person’s consent. This means that FA systems should process the facial data locally, and only transmit anonymised, aggregated statistics. Of course, this type of regulation may not be practised in countries with poor privacy track records.
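A rough sketch of that “process locally, transmit only aggregates” pattern is shown below. The detect_faces() function is a placeholder for any on-device detector, and the summary fields and endpoint are illustrative; the point is that only a small anonymised summary, never the raw frames, would leave the device.

```python
# A sketch of local processing with aggregate-only reporting: frames are
# analysed on the device and only anonymised counts are prepared for sending.
import time

def detect_faces(frame):
    """Placeholder: return a list of bounding boxes found in this frame."""
    raise NotImplementedError("plug in a local face detector")

def summarise_period(frames):
    """Process every frame on-device and keep only aggregate statistics."""
    counts = [len(detect_faces(frame)) for frame in frames]
    return {
        "timestamp": int(time.time()),
        "frames_processed": len(counts),
        "max_faces_in_frame": max(counts, default=0),
        "avg_faces_in_frame": sum(counts) / len(counts) if counts else 0.0,
    }

# Only this small summary dictionary would ever leave the device; the raw
# frames are discarded immediately after processing. The transmission step
# (e.g. an HTTP POST to a hypothetical analytics endpoint) is omitted here.
```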
As with other fast-paced technological fields, it is very difficult to predict what the future holds. As AI becomes increasingly democratised, there will be lots of incremental improvements in FA techniques. There is, however, a limit to how much things can improve. You can safely assume that the movie scenario, where a camera captures and analyses minuscule facial information from tens of metres away in an uncontrolled environment, is not going to happen. This has less to do with the sophistication of the algorithms and more to do with basic limitations of physics. “In the wild” conditions will continue to challenge FA systems.
As for the hidden biases in FA and AI systems, they will probably be reduced. A lot of effort is currently being put into developing techniques that explain what AI models learn, in order to uncover possible hidden biases; the field of “Explainable AI” (XAI) is the result of these efforts. New regulations and committees are also being formed to standardise and safeguard how AI and FA systems are trained and deployed.
Face synthesis and deep fakes will probably become indistinguishable from real images to the naked eye. But new techniques to distinguish fake faces from real ones will start to emerge. We may enter a cycle similar to the hacking paradigm, where hackers are always one step ahead and security researchers try to protect computing systems: “fakers” may develop ever more sophisticated faking techniques to bypass the software that tells real faces from fake ones. The impact of this on the world of news and journalism may be huge, and it is difficult to predict how it will turn out.
Face analysis technology is nowhere near as sophisticated as the movies, or even the news, would make it seem. It simply isn’t possible to tell a person’s identity, let alone their emotions, from tens of metres away, due to limitations in physics.
By now you know that face detection is different from face tracking, recognition, and verification. Each of these algorithms is different, as are the use-cases for each technology. If you’re hoping to use FA in your business, know exactly what you want to achieve and identify which of these delivers the best results.
To keep up to date, you could follow reviews from the major scientific CV conferences (such as CVPR, ICCV, ECCV), especially if you’d like to get a more technical overview. If you don’t want to dive too deep into technical details, you could read articles in scientific news aggregators, like Medium or Flipboard.