Knowledge of how humans interact with stimuli plays a key role in the design of AR, VR and metaverse systems, and researchers and industry invest considerable effort in investigating how individuals react to specific stimuli. In this study we exposed 12 individuals to two experimental videos, each containing multiple sequences of images and experimental sounds, and asked them to record their facial expressions with their web cameras while watching. We then analyzed the recorded videos with the DeepFace algorithm to extract estimates of the seven universal emotions. Finally, we searched for moments where a stimulus (image, color, sound, or a combination of them) might have produced sharp emotional changes in the subject, by comparing the graphs of each emotion's values against the experimental videos.
A lot of research has been conducted on the Quality of User Experience (QUX) and on individuals' emotional behavior in the presence of specific stimuli. Many studies explore the emotions induced in individuals when they are exposed to colors, images, sounds, music, videos, heat, cold, etc. Multiple methods of emotion detection and recognition have been employed to estimate how an individual feels when the stimuli are present: electroencephalography, facial expression analysis, electrical skin conductance and eye gaze tracking are some of the techniques researchers use to estimate an individual's emotional state and reaction during such experiments. Seven emotions have been universally standardized and linked to seven facial expressions: happiness, sadness, anger, disgust, neutral, fear and surprise. Modern machine learning algorithms such as DeepFace, which is based on Keras and TensorFlow, analyze the facial expressions of participants to derive their emotions. The first step is to detect the face and remove the background and non-face areas (face alignment phase). The output of the face detection stage is a bounding box for the face (a 4-element vector) and a 10-element vector for facial landmark localization, i.e. the positions of five facial landmarks: two for the eyes, two for the mouth corners and one for the nose. The final step is to classify the given face into one of the basic emotion categories; a CNN model with three convolutional layers and fully connected layers is employed as a feature extraction tool.
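As an illustration of this pipeline, the sketch below shows how the open-source DeepFace Python package can be queried for the emotion scores of a single image; the file name is a placeholder, and the exact return format varies between library versions.

```python
# Minimal sketch of emotion estimation with the DeepFace library.
# "frame.jpg" is a placeholder image; the call follows the public
# DeepFace.analyze API, whose return format differs across versions.
from deepface import DeepFace

result = DeepFace.analyze(img_path="frame.jpg", actions=["emotion"])

# Recent DeepFace versions return a list with one entry per detected face.
face = result[0] if isinstance(result, list) else result
print(face["dominant_emotion"])  # e.g. "neutral"
print(face["emotion"])           # scores for angry, disgust, fear, happy, sad, surprise, neutral
```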
Experimental Procedure: We designed two experimental videos, one of eight minutes and one of five minutes. Both were montages of multiple short clips containing scenes of the sea, scenes of a port, abrupt alternations between dark frames and brighter frames, and other visual elements. Both videos contained experimental sound montages with sounds from ports and ships as well as human voices. The videos were uploaded to YouTube and the links were sent to the participants, who were asked to watch the videos while recording their faces with their web cameras. The recorded videos containing the facial expressions were analyzed with the DeepFace algorithm, and estimates of the expressed emotion were extracted per video frame. These datasets were then examined to draw conclusions about possible moments where the stimuli could have triggered significant facial expressions.
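A minimal sketch of such per-frame analysis is given below; it assumes OpenCV for decoding the recorded webcam video (the file name is a placeholder) and calls DeepFace's emotion action on each frame to build an emotion-over-time dataset.

```python
# Sketch of per-frame emotion extraction from a recorded webcam video,
# assuming OpenCV for frame decoding; "participant01.mp4" is a placeholder.
import cv2
from deepface import DeepFace

cap = cv2.VideoCapture("participant01.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)
emotions_over_time = []  # one dict of emotion scores per analyzed frame

frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    try:
        # enforce_detection=False keeps the loop running on frames
        # where no face is found (e.g. the participant looks away).
        result = DeepFace.analyze(frame, actions=["emotion"],
                                  enforce_detection=False)
        face = result[0] if isinstance(result, list) else result
        emotions_over_time.append({"time_s": frame_idx / fps,
                                   **face["emotion"]})
    except ValueError:
        pass  # skip frames where the analysis fails
    frame_idx += 1

cap.release()
```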
Discussion: A first conclusion is that facial expression analysis is a very complex task. From the first steps of the analysis, there is evidence that each of the examined individuals expressed a dominant emotion; for example, one individual's face remained mostly neutral throughout the video, with only a few brief spikes of other emotions. On the other hand, the experimental video proved to be a very complex stimulus and did not trigger clear emotions in the participants, such as joy, anger or sadness. The data are still being analyzed to extract further conclusions, which will be presented in the final version.
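As one illustrative way to operationalize a "dominant emotion" and "sharp alterations" from per-frame scores such as those collected above (not necessarily the exact procedure used in this study), a simple post-processing sketch could look like the following; the threshold value is an arbitrary placeholder.

```python
# Illustrative sketch (not necessarily this study's exact procedure):
# derive a participant's dominant emotion and flag sharp alterations
# from the per-frame emotion scores collected earlier.
import numpy as np

EMOTIONS = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]

def dominant_emotion(records):
    """Emotion with the highest mean score over all analyzed frames."""
    means = {e: np.mean([r[e] for r in records]) for e in EMOTIONS}
    return max(means, key=means.get)

def sharp_changes(records, emotion, threshold=30.0):
    """Timestamps where an emotion's score jumps by more than `threshold`
    (percentage points) between consecutive frames."""
    scores = np.array([r[emotion] for r in records])
    times = [r["time_s"] for r in records]
    jumps = np.abs(np.diff(scores))
    return [times[i + 1] for i in np.where(jumps > threshold)[0]]
```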