Facial Recognition in the Wild
Published Aug. 7, 2019, 10:52 a.m. by Moderator
The main objective of this paper is to understand the development, types, and use cases of facial recognition systems. The paper traces the evolution of facial recognition systems in order to inform the design of better models. The outcome of the research is an open-source database named ‘In the Wild’, containing sets of images representing different emotions, to be used to train and test image recognition models.
A facial expression can be described as any motion or change involving a part of the face such as the eyes, nose, mouth, or eyebrows. Facial expressions play a significant part in human-to-human interaction and communication. Individuals routinely use facial cues to socially evaluate their peers and determine which course of action they will ultimately take (Hehman et al., 2015). Facial expressions account for between 50 and 70% of human communication (Cherry, 2019). They convey a variety of human emotions that are then expressed on the face. Some of these expressions are not easily observable, but as humans get older they learn to recognise even subtle facial expressions (Frith, 2009). Facial expressions can indicate a multitude of emotions such as anxiety, anger, fear, happiness, sadness, jealousy, surprise, relief, disgust, shame, and greed, to mention just a few (Changing Minds, 2019).
In 1862, a French scientist named Duchenne conducted a study of facial muscles to understand their relationship with facial expressions. His main objective was to perform an anatomical analysis of human emotions. To test his hypothesis he applied electro-stimulation to specific facial muscles and observed the reactions elicited, controlling the electrical power to minimise damage to the skin. Through this experiment he was able to isolate and identify different facial expressions and the muscles involved in each (Duchenne and de Boulogne, 1990).
Thesis: This paper analyses the role of facial expression analysis in emotion recognition.
Facial expression research dates back to the 19th century, with notable works including Charles Darwin’s book ‘The Expression of the Emotions in Man and Animals’ and ‘The Mechanism of Human Facial Expression’ by Duchenne de Boulogne. In his book, Darwin claimed that one cannot understand human emotions without understanding animal emotions, arguing that human behaviour is a product of evolution. He identified similarities between human and animal emotion, noting that particular emotions cut across species. To test his theories Darwin performed experiments on animals, infants, mentally ill patients, and adults from divergent cultures (Ekman, 2014).
After Darwin’s publication, facial expression studies took a back seat until their revival in the 1960s. The main reasons cited include the lack of evidence to back the claims and the dispute over the anthropomorphic equation of human and animal emotions; opponents believed that animals were incapable of experiencing equivalent emotions. Recent studies indicate that non-human primates are in fact capable of experiencing and recognising emotions. Darwin’s work also concentrated on intrinsic human emotions rather than the outward expression of those emotions (Ekman, 2014).
In the 1960s, Woodrow Bledsoe developed a technique which he referred to as ‘man-machine facial recognition’. This technique involved manual classification of facial photographs using a RAND tablet (Lydick, 2019). The RAND tablet was a stylus-based input device that allowed users to enter handwritten text and drawings into a computer (Davis and Ellis, 2019). In Bledsoe’s technique, the operator mapped out the coordinates of prominent facial features such as the hairline, eyes, and nose, and the name associated with the subject was then recorded to the database with the RAND tablet. Using these parameters, the recognition system would measure the coordinates of any new photograph and retrieve the photograph from the database that most closely matched it (Lydick, 2019).
Modern facial recognition has grown in leaps and bounds due to technological advancements such as powerful computers with high processing power, superior algorithms, and the emergence of internet-based technologies. One of the earliest facial recognition systems was developed by Kohonen, who demonstrated that a simple neural network could recognise aligned and normalised face images (Choudhury, 2019). The main challenge early developers faced was that facial recognition systems could only handle structured data, such as aligned images in small datasets; they were unable to perform when the parameters were unknown or the datasets were large.
In 1989, Kirby and Sirovich introduced an algorithm that made it practical to calculate eigenfaces, showing that fewer than a hundred were required to perform facial recognition on aligned and normalised photographs. In 1991, Turk and Pentland extended the algorithm to detect faces embedded in natural imagery, along with the location and scale of each face. It was during this period that facial recognition gained wide interest (Choudhury, 2019).
Another example of facial recognition was ‘FaceIt’, which borrowed much of its design from Bledsoe’s model but added the capability to build a 3D model of the face. This system used specialised cameras to generate 3D images that were stored with a unique identifier in the database. The system could also store 2D images, provided they were taken at an angle of less than 55 degrees. The algorithm compares both 2D and 3D images to assess similarities between stored images and new images and then retrieves the image(s) with the greatest similarity (Lydick, 2019).
The system can also perform ‘surface texture analysis’, which examines the texture of an individual’s skin to create a 3D representation that is stored as a ‘skin print’. Both the face print and the skin print are used to increase accuracy; the developers claimed that using both prints improved accuracy by 25% (Lydick, 2019). Effective facial recognition systems should be able to automatically recognise different human emotional states if they are to function in real-life human settings. However, the majority of collected datasets contain images captured under very controlled conditions, which limits the performance and accuracy of facial recognition algorithms in uncontrolled environments such as a busy city street (Kollias et al., 2019).
High-performance deep learning networks achieve much higher accuracy because the model is trained on large datasets with many features. However, training deep learning networks takes a long time and demands massive computational power, making it very resource intensive. Video facial recognition differs from image facial recognition in that it follows a facial expression from beginning to end. In video, a facial expression progresses over three phases: the onset (beginning), the apex (peak), and the offset (end). There is no agreed standard for determining the apex of a facial expression; this is left to the subjective judgement of the researcher or machine learning engineer.
In contrast, image facial recognition considers only the apex, from which features are extracted. Video facial recognition provides more features, making it better suited to expression recognition, but classification still depends on the subjective view of the researcher. Facial recognition systems are subject to bias and therefore cannot handle every aspect of facial expression when the model lacks sufficient data to make accurate predictions.
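Since the source notes there is no fixed standard for picking the apex, one common heuristic is to treat the first frame as the neutral onset and select the frame whose facial landmarks deviate most from it. The sketch below illustrates this idea on synthetic landmark data; the array shapes and the `find_apex_frame` helper are illustrative assumptions, not a method from the paper.

```python
import numpy as np

def find_apex_frame(landmarks_per_frame):
    """Pick the apex as the frame whose facial landmarks deviate most
    from the first (assumed neutral/onset) frame.

    landmarks_per_frame: array of shape (n_frames, n_points, 2)
    """
    frames = np.asarray(landmarks_per_frame, dtype=float)
    neutral = frames[0]  # onset frame used as the neutral reference
    # mean Euclidean displacement of every landmark from the onset frame
    displacement = np.linalg.norm(frames - neutral, axis=2).mean(axis=1)
    return int(np.argmax(displacement))  # index of the peak expression

# toy sequence: 5 frames, 3 landmarks; frame 2 moves the most
seq = np.zeros((5, 3, 2))
seq[1] += 1.0
seq[2] += 3.0   # simulated apex
seq[3] += 1.5
print(find_apex_frame(seq))  # -> 2
```

Any monotone measure of deviation from the neutral frame would serve here; the subjectivity the text mentions lies in choosing that measure.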
Recognizing emotions in the wild
Facial expressions can indicate a multitude of emotions; Paul Ekman divided emotions into six main categories: happiness, sadness, anger, disgust, surprise, and fear (Ekman and Friesen, 1971). The expression ‘in the wild’ is a term of art denoting an uncontrolled environment, a place where humans have no control over the forces that shape it. Facial expressions are human responses to the challenges or benefits that an individual encounters within their environment.
Facial expression and its analysis
One of the foremost studies in facial expression analysis was conducted by Ekman and Friesen in 1971, in which they experimented on illiterate inhabitants of a native tribe of Papua New Guinea. The researchers wanted to compare facial expression recognition across literate and preliterate cultures. They believed that facial muscle movements and emotions were universal to all cultures, and thus that specific facial expressions would be common and easily recognisable by all. At the same time, individuals in distinct cultures are socialised through prolonged learning to produce particular facial expressions in specific social settings, hence the existence of culture-specific facial expressions (Ekman and Friesen, 1971).
This human experiment was an attempt by the researchers to show that facial expressions are universal and vary across cultures only after socialisation. Their chosen research method was observational: the researchers observed and recorded the reactions of the inhabitants as someone read them emotional stories. At the end of each narration, the subjects were shown photographs depicting different emotions and were asked to identify the photograph showing the emotion appropriate to the story. The experiment showed that the preliterate subjects and college-educated subjects identified the same emotions from the photographs. The researchers identified six basic emotions: happiness, sadness, anger, disgust, surprise, and fear (Ekman and Friesen, 1971).
The researchers then divided these emotions into action units (AUs), each representing a minor change in facial musculature. The experiment demonstrated that a facial expression is made up of multiple action units. Action units alone do not provide unambiguous results due to their inherent nature; to address this shortcoming, geometric features that identify the coordinates of facial landmarks, together with appearance features, are used as inputs to encoders that verify specific action units (Ekman and Friesen, 1971).
Facial recognition analysis involves a three-stage process: preprocessing, feature extraction, and classification.
- Preprocessing: This is the first stage of facial recognition. The facial image of the subject is captured by a camera, or scanned images of existing photographs are used. Preprocessing converts the image into a normalised form from which features can then be extracted; other processing at this stage includes cropping and resizing (Hemalatha and Sumathi, 2014).
- Feature Extraction: This stage analyses the processed image to retrieve its features, both geometric and appearance-based. The geometric part records the size and structure of the eyes, mouth, nose, and eyebrows as shape vectors, while appearance features capture the texture of the skin. Geometric features are important because they record the shape and location of facial features relative to each other, and extracting a fixed set of landmark points from the face helps ensure that features are measured at a consistent scale (Yao et al., 2015). Once extracted, the features are normalised so that the inputs conform to common units, ensuring the algorithm recognises and utilises all features and maintains its accuracy. These extracted facial parameters are used as vectors to reproduce a 3D image of the face. The main feature extraction techniques include the Discrete Cosine Transform (DCT), Gabor filters, Principal Component Analysis (PCA), Independent Component Analysis, and Linear Discriminant Analysis (Hemalatha and Sumathi, 2014).
- Classification: Classification uses machine learning models to classify and identify the extracted features; models include Logistic Regression, AdaBoost, Support Vector Machines, Decision Forests, and K-Nearest Neighbours. This final stage of facial recognition is divided into two categories. The first is the frame-based expression recognition method, which does not use temporal information: each input image, with or without reference frames, is treated as a static picture and processed independently. The second is the sequence-based recognition method, which uses the temporal statistics of a sequence, recording the expression across one or more frames simultaneously in order to exploit temporal information (Liu et al., 2014). One such system uses a rule-based classifier to recognise action units of the eyes and brows spontaneously, analysing all action units across several frames in an image collection to perform recognition (Hemalatha and Sumathi, 2014).
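The three stages above can be sketched end to end. The example below is a minimal illustration, not the paper’s system: it stands in for preprocessed face images with random vectors and chains normalisation, PCA feature extraction (the eigenface approach mentioned earlier), and an SVM classifier using scikit-learn. The data, shapes, and parameter choices are all assumptions for demonstration.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Stand-in for preprocessed images: 120 flattened 16x16 face crops,
# assumed already cropped, resized, and normalised (the preprocessing stage).
X = rng.normal(size=(120, 16 * 16))
y = rng.integers(0, 6, size=120)   # labels for the six basic emotions
X[y == 3] += 2.0                   # shift one class so it is separable

pipeline = Pipeline([
    ("normalise", StandardScaler()),    # per-feature normalisation
    ("extract", PCA(n_components=20)),  # PCA ("eigenface") feature extraction
    ("classify", SVC(kernel="rbf")),    # SVM classification stage
])
pipeline.fit(X[:100], y[:100])          # train on the first 100 samples
print(pipeline.score(X[100:], y[100:])) # accuracy on the held-out 20
```

Any of the other techniques the text lists (DCT or Gabor features, AdaBoost or k-NN classifiers) could be swapped into the corresponding pipeline slot.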
Facial Action Coding System (FACS)
Paul Ekman and Wallace V. Friesen developed a taxonomic classification of facial movements denoted as action units. The system was originally devised by Swedish anatomist Carl-Herman Hjortsjö in 1970 before Ekman and Friesen developed it further. In its original format the coding system listed only 23 action units (AUs); Ekman and Friesen enlarged it to 46. Each action unit represents a specific facial movement, and the action units cover head and eye movements as well. The main reason for this taxonomic expansion was the nature of facial expressions: some expressions, for example frowning or squinting, involve more than one muscle (Ekman and Friesen, 1978).
The Facial Action Coding System (FACS) was first published in 1978 and republished in 2002. It is used by psychologists and behaviourists to determine the emotion of a subject based on their facial expression(s), and has successfully been used to assist patients suffering from mental illnesses. Application of FACS is still largely done manually, which has been cited as one of the key impediments to comprehensive research on facial expressions as manifestations of human emotion (Ekman and Friesen, 1978).
Due to its subjectivity and lengthy turnaround times, FACS has since been implemented as an autonomous recognition system that detects faces in video, extracts the geometric parameters of the faces, and produces temporal profiles of each facial movement or action unit. FACS defines 44 action units, 30 of which are anatomically related to the contraction of a specific set of facial muscles; the anatomical basis of the remaining 14 is unspecified, and these are denoted as miscellaneous movements. Many action units can be coded symmetrically or asymmetrically, and for action units that vary in intensity, a 5-point ordinal scale is used to measure the degree of muscle contraction (Ekman and Friesen, 1971).
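A FACS annotation of the kind described above can be modelled as a small data structure: each action unit carries the 5-point ordinal intensity (conventionally written A through E, trace to maximum) and a side marker for asymmetric coding. The `ActionUnitCode` class below is a hypothetical illustration, not an interface from any FACS software.

```python
from dataclasses import dataclass

# FACS intensity scale: A (trace) through E (maximum)
INTENSITY_SCALE = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}

@dataclass
class ActionUnitCode:
    au: int             # action unit number, e.g. AU 12 = lip corner puller
    intensity: str      # "A" (trace) .. "E" (maximum)
    side: str = "both"  # "left", "right", or "both" for asymmetric coding

    def intensity_level(self) -> int:
        """Map the ordinal letter to its 1-5 numeric level."""
        return INTENSITY_SCALE[self.intensity]

# A smile is conventionally coded as AU 6 (cheek raiser) + AU 12
# (lip corner puller); intensities here are arbitrary examples.
smile = [ActionUnitCode(6, "C"), ActionUnitCode(12, "D")]
print([code.intensity_level() for code in smile])  # -> [3, 4]
```

Automated FACS systems effectively try to produce lists like `smile` (per frame) directly from pixel data.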
Review of an Existing Database
A facial expression database is a stored collection of pictures or video clips showing facial expressions of different emotions as performed by different individuals. There are many open-source databases, a majority of which are based on the Facial Action Coding System (FACS) by Paul Ekman and Wallace V. Friesen. Open-source face databases provide learners, researchers, and software engineers with ready-made datasets they can use to test their machine learning models. Many databases offer different varieties of images depending on the use case or challenge a developer is trying to solve. Examples include the Cohn-Kanade database, the Yale Face Database, the Iranian Face Database, the Hong Kong Polytechnic University Hyperspectral database, and the Indian Face Database (Koestinger et al., 2011).
In terms of emotions, the Cohn-Kanade database is the best choice because its action units are coded, though its drawback is low image quality. The six essential classes included are anger, disgust, fear, happiness, sadness, and surprise. The database comprises about 500 image sequences sourced from roughly one hundred subjects aged between 18 and 30 years; 65% were female and 35% male. The racial composition was 15% African-American and 3% Latino, with the remainder Caucasian. Each subject’s face was captured by two cameras, one positioned frontally and one thirty degrees to the subject’s right, though only the frontal images are provided in the database. The subjects were asked to perform 23 facial movements, which were then captured by the cameras.
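When using such a database to train and test models, it is common to split the sequences so that every emotion class appears in both partitions. The sketch below shows a per-label (stratified) split over a hypothetical annotation list; the sequence IDs and the file layout they imply are invented for illustration and do not reflect the actual Cohn-Kanade distribution format.

```python
import random
from collections import defaultdict

# Hypothetical annotation index: (sequence_id, emotion_label) pairs,
# the kind of listing an expression database might ship alongside images.
annotations = [
    ("S001", "happiness"), ("S002", "anger"), ("S003", "happiness"),
    ("S004", "surprise"), ("S005", "anger"), ("S006", "surprise"),
]

def stratified_split(pairs, test_fraction=0.5, seed=42):
    """Split sequence IDs into train/test sets while keeping every
    emotion label represented in both sets."""
    by_label = defaultdict(list)
    for seq_id, label in pairs:
        by_label[label].append(seq_id)
    rng = random.Random(seed)
    train, test = [], []
    for ids in by_label.values():
        rng.shuffle(ids)
        cut = max(1, int(len(ids) * test_fraction))  # at least 1 per label
        test.extend(ids[:cut])
        train.extend(ids[cut:])
    return train, test

train_ids, test_ids = stratified_split(annotations)
print(len(train_ids), len(test_ids))  # -> 3 3
```

A stratified split matters here because emotion classes in expression databases are rarely balanced; a naive random split can leave a rare class entirely out of the test set.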
Facial recognition has developed rapidly since its infancy, when Charles Darwin and Duchenne published their works, ‘The Expression of the Emotions in Man and Animals’ and ‘The Mechanism of Human Facial Expression’ respectively, in the 19th century. Facial recognition has undergone a transformation since the 1990s, fueled by technological innovations such as the internet and powerful computers that can perform millions of simultaneous calculations in a fraction of the time taken by earlier machines. The accuracy of facial recognition systems has also improved with newer algorithms that make it easier to train and test models. Open-source databases have made machine learning accessible to researchers around the globe, who no longer have to collect their own data to train or test their models.
Clarke, R.J. (2005). Research models and methodologies. In HDR Seminar Series, Faculty of Commerce.
Changing Minds. (2019). Face body language. [online] Available at: http://changingminds.org/techniques/body/parts_body_language/face_body_language.htm [Accessed 1 Jul. 2019].
Choudhury, T. (2019). History of Face Recognition. [online] Vismod.media.mit.edu. Available at: https://vismod.media.mit.edu/tech-reports/TR-516/node7.html [Accessed 1 Jul. 2019].
Davis, M. and Ellis, T. (2019). The RAND Tablet. [online] Rand.org. Available at: https://www.rand.org/pubs/research_memoranda/RM4122.html [Accessed 1 Jul. 2019].
Dhall, A., Goecke, R., Joshi, J., Wagner, M. and Gedeon, T. (2013). Emotion recognition in the wild challenge 2013. In Proceedings of the 15th ACM on International conference on multimodal interaction (pp. 509-516). ACM.
Kollias, D., Tzirakis, P., Nicolaou, M., Papaioannou, A., Zhao, G., Schuller, B., Kotsia, I. and Zafeiriou, S. (2019). Deep Affect Prediction in-the-Wild: Aff-Wild Database and Challenge, Deep Architectures, and Beyond. International Journal of Computer Vision.
Duchenne, G.B. and de Boulogne, G.B.D. (1990). The mechanism of human facial expression. Cambridge university press.
Ekman, P. (2014). Darwin and Facial Expression. [online] Pdfs.semanticscholar.org. Available at: https://pdfs.semanticscholar.org/3887/b43b5aa2cafdc7a61802cee511145dfba70a.pdf [Accessed 1 Jul. 2019].
Ekman, P. and Friesen, W.V. (1971). Constants across cultures in the face and emotion. [online] Available at: https://pdfs.semanticscholar.org/a368/8ebcce02bbafa87e3f644841d7d78172fc08.pdf [Accessed 1 Jul. 2019].
Frith, C. (2009). Role of facial expressions in social interactions. [online] Available at: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2781887/ [Accessed 1 Jul. 2019].
Hehman, E., Flake, J.K. and Freeman, J.B. (2015). Static and dynamic facial cues differentially affect the consistency of social evaluations. Personality and Social Psychology Bulletin, 41(8), pp.1123-1134.
Hemalatha, G. and Sumathi, C.P. (2014). A study of techniques for facial detection and expression classification. International Journal of Computer Science and Engineering Survey, 5(2), p.27.
Koestinger, M., Wohlhart, P., Roth, P.M. and Bischof, H. (2011). Annotated facial landmarks in the wild: A large-scale, real-world database for facial landmark localization. In 2011 IEEE international conference on computer vision workshops (ICCV workshops) (pp. 2144-2151). IEEE.
Liu, M., Shan, S., Wang, R. and Chen, X. (2014). Learning expressionlets on spatio-temporal manifold for dynamic facial expression recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1749-1756).
Lydick, N. (2019). A Brief Overview of Facial Recognition. [online] Eecs.umich.edu. Available at: http://www.eecs.umich.edu/courses/eecs487/w07/sa/pdf/nlydick-facial-recognition.pdf [Accessed 1 Jul. 2019].
New York University. (2019). What is Research Design?. [online] Available at: https://www.nyu.edu/classes/bkg/methods/005847ch1.pdf [Accessed 1 Jul. 2019].
Yao, A., Shao, J., Ma, N. and Chen, Y. (2015). Capturing au-aware facial features and their latent relations for emotion recognition in the wild. In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction (pp. 451-458). ACM.