30 APRIL 2025, WEDNESDAY
13:00-14:00 ORAL PRESENTATION SESSION - 13
A Novel Approach to Depression Detection Using POV Glasses and Machine Learning for Multimodal Analysis
Hakan Kayış1, Murat Çelik2, Vildan Çakır Kardeş3, Hatice Aysima Karabulut3, Ezgi Özkan3, Çınar Gedizlioğlu4, Burcu Özbaran5, Nuray Atasoy2
1. Department of Child and Adolescent Psychiatry, Zonguldak Bülent Ecevit University, Zonguldak, Türkiye
2. Freelance Researcher, Ankara, Türkiye
3. Department of Psychiatry, Zonguldak Bülent Ecevit University, Zonguldak, Türkiye
4. Department of Computer Engineering, Izmir University of Economics, Izmir, Türkiye
5. Department of Child and Adolescent Psychiatry, Ege University, Izmir, Türkiye
DOI: 10.5080/kes27.abs150 Page 6-7
BACKGROUND AND AIM: In the diagnosis of depression, traditional methods, primarily relying on self-reported symptoms and clinician interviews, often suffer from biases and inaccuracies, which can lead to misdiagnosis or underdiagnosis of depressive disorders. In this study, audio and visual data captured during patients’ semi-structured interviews and natural interactions are analyzed to facilitate an objective diagnosis of depression.
METHODS: The study included 44 patients with depression and 41 healthy controls, aged 18-55 years. Diagnosis was based on DSM-5 criteria, and depressive symptoms were assessed using the Beck Depression Inventory. Exclusion criteria included neurological deficits, severe visual impairments, coexisting psychiatric disorders, psychiatric medication use, and botulinum toxin (Botox) treatment within the last 6 months. Semi-structured interviews were recorded using POV glasses with audio and video capabilities, capturing footage at 30 frames per second and 1920×1080 resolution. All interviews took place in the same room, with consistent lighting (400-600 lux) and a 100 cm distance between the researcher and participant. In these standardized interviews, the researcher asked the participant four questions, with at least 30 seconds of recording for each. The questions were as follows: ‘Can you describe how you have been feeling over the last few days?’, ‘Can you describe a typical day for you, starting from the morning?’, ‘Could you share a positive memory of yours with me?’, and ‘Could you share a negative memory of yours with me?’. The audio and video data were analyzed using computer software incorporating open-source tools: OpenCV and MediaPipe (Google n.d.) for facial landmark detection and head movement analysis, Py-Feat (Cheong et al. 2023) for facial expression analysis, and Whisper AI for speech recognition and transcription. Using the software, the following parameters were measured: total eye gaze duration, gaze duration to the right and left, duration of smiles, concurrent eye gaze and smile, duration of neutral and happy faces expressed by the participant for all questions, number of blinks, blink duration, eye openness, total head movements, number of rapid head movements, response time, silence ratio, and the number of words spoken by the participant in response to the researcher for the second question.
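The abstract does not specify how the speech-timing parameters were computed; a minimal sketch is given below, assuming (hypothetically) that the audio has already been reduced to per-frame voice-activity flags at the recording's 30 fps. The function and variable names are illustrative, not the authors' implementation.

```python
# Hypothetical sketch: response time and silence ratio derived from
# per-frame voice-activity flags (True = participant speaking),
# sampled at the POV-glasses frame rate (30 fps).

def response_time(voice_activity, fps=30):
    """Seconds from the end of the question until the participant
    first speaks; None if they never speak."""
    for i, speaking in enumerate(voice_activity):
        if speaking:
            return i / fps
    return None

def silence_ratio(voice_activity):
    """Fraction of frames in which the participant is not speaking."""
    if not voice_activity:
        return 0.0
    silent = sum(1 for speaking in voice_activity if not speaking)
    return silent / len(voice_activity)

# Example: 60 frames (2 s) of silence, then 120 frames of speech.
flags = [False] * 60 + [True] * 120
print(response_time(flags))   # 2.0
print(silence_ratio(flags))   # 0.333...
```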
The emotions expressed by the participants and their durations, as well as the duration of smiles, were measured using the Facial Action Coding System (FACS). With FACS, Ekman and Friesen (1978) classified facial expressions of universal emotions based on facial muscle movements and their various combinations. In our study, the Py-Feat software was used for the analysis of FACS data (Cheong et al. 2023). Eye gaze was estimated from the images using the deep convolutional neural network (Deep CNN) method developed by Abdelrahman et al. (2023). Smiles were measured using Action Unit AU12 from the FACS, and the software also measured the duration of instances in which eye gaze and AU12 occurred simultaneously. Eye openness and the number and duration of blinks were measured using the Eye Aspect Ratio (EAR) (Dewi et al. 2022). EAR is a simple geometric measure that indicates whether an eye is ‘open’ or ‘closed’ and also quantifies the degree of eye openness. It uses six landmarks around each eye (p0–p5), extracted with MediaPipe, and is defined as EAR = (d(p1,p5) + d(p2,p4)) / (2 · d(p0,p3)), where d(⋅) is the Euclidean distance between two points. A numerical cutoff commonly used in the literature (blink_threshold = 0.2) is then applied: if EAR < blink_threshold, the eye is considered ‘closed’ for that frame. Head movements were quantified by counting rapid head movements and measuring the total amount of head movement. A rapid head movement occurs when the change in angle (Δθ) on any axis (pitch, yaw, or roll) exceeds 5° between two consecutive frames; total head movement is calculated as the sum of the absolute angle changes across all axes. In our study, we statistically compared all measured parameters between the groups. Subsequently, we applied machine learning techniques to assess the ability of the model to make diagnostic predictions.
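The EAR, blink, and head-movement definitions above can be sketched directly from the formulas and thresholds given in the text. The landmark ordering (p0/p3 as the horizontal eye corners, p1/p5 and p2/p4 as the vertical pairs) follows the formula; everything else here is an illustrative assumption, not the authors' code.

```python
import math

def ear(p):
    """Eye Aspect Ratio from six (x, y) eye landmarks p[0]..p[5]:
    p0/p3 are the horizontal corners, p1/p5 and p2/p4 the vertical pairs."""
    d = math.dist  # Euclidean distance between two points
    return (d(p[1], p[5]) + d(p[2], p[4])) / (2.0 * d(p[0], p[3]))

def count_blinks(ear_series, blink_threshold=0.2):
    """A blink is a run of consecutive frames with EAR < threshold."""
    blinks, closed = 0, False
    for e in ear_series:
        if e < blink_threshold and not closed:
            blinks += 1
            closed = True
        elif e >= blink_threshold:
            closed = False
    return blinks

def head_movement(angles, rapid_deg=5.0):
    """angles: per-frame (pitch, yaw, roll) in degrees.
    Returns (total absolute angle change, number of rapid movements),
    where a rapid movement is any frame-to-frame change > rapid_deg
    on at least one axis."""
    total, rapid = 0.0, 0
    for prev, cur in zip(angles, angles[1:]):
        deltas = [abs(c - p) for p, c in zip(prev, cur)]
        total += sum(deltas)
        if max(deltas) > rapid_deg:
            rapid += 1
    return total, rapid
```

In practice the six landmarks per eye would come from MediaPipe's face mesh, and head angles from a pose-estimation step; the geometry above is independent of that source.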
The ethical approval for this study was granted by the Ethics Committee of Zonguldak Bülent Ecevit University, with decision number 2024/21.
RESULTS: There were no significant differences between the groups in age (depression mean = 37.8, control mean = 37.1, p = 0.81) or gender (depression: 24 females, 20 males; control: 21 females, 20 males; p = 0.75). Eye gaze duration, number of blinks, blink duration, duration of smiles, concurrent eye gaze and smile, happy face, and neutral face had p values ≤ 0.05, whereas right and left gaze, rapid head movements, total head movements, eye openness, response time, silence ratio, and number of words had p values > 0.05 (Table 1). After applying the Bonferroni correction (p < 0.0033), statistically significant differences remained in the duration of eye gaze and the duration of happy faces between the two groups. After experimenting with various machine learning techniques, including other tree-based models, nearest-neighbor methods, and support vector machines, we employed the AdaBoost algorithm with decision trees. Recursive feature elimination was used to discard less relevant features, resulting in a more generalizable model. The learning rate and number of estimators for the AdaBoost algorithm were set to 0.5 and 100, respectively. Using leave-one-out cross-validation, the model achieved 87.06% accuracy, 86.67% precision, 88.64% sensitivity, 85.37% specificity, and an 87.65% F1-score.
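The classification step described above can be sketched with scikit-learn, whose AdaBoostClassifier uses depth-1 decision trees as its default base learner, combined with recursive feature elimination and leave-one-out cross-validation. This is a hedged sketch, not the authors' pipeline: the hyperparameters (learning_rate=0.5, n_estimators=100) come from the text, but the synthetic feature matrix, the number of features retained, and all other settings are placeholders.

```python
# Hedged sketch of the reported classification setup. The random data
# below stands in for the real per-participant measured parameters.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_selection import RFE
from sklearn.metrics import accuracy_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 15))    # placeholder: participants x parameters
y = rng.integers(0, 2, size=30)  # placeholder labels (1 = depression)

# Hyperparameters reported in the text; default base learner is a
# depth-1 decision tree ("decision stump").
ada = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=0)

# Recursive feature elimination feeding the retained features to AdaBoost
# (the number of features to keep is an arbitrary choice here).
model = make_pipeline(RFE(ada, n_features_to_select=8), ada)

# Leave-one-out cross-validation: one held-out participant per fold.
pred = cross_val_predict(model, X, y, cv=LeaveOneOut())
acc = accuracy_score(y, pred)
print(f"leave-one-out accuracy: {acc:.2%}")
```

With random placeholder labels the accuracy hovers near chance; the point of the sketch is the structure (feature elimination inside the cross-validated pipeline, so feature selection never sees the held-out participant), not the score.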
CONCLUSIONS: Challenges associated with healthcare access and cost persist as significant barriers to early disease detection, primarily due to limited availability of essential diagnostic tests and treatments. These constraints complicate the identification of diseases at their initial stages. In this study, we developed an innovative methodology to aid in the diagnosis of depression without interfering with the natural progression of psychiatric evaluations. Specifically, our approach facilitates an objective assessment of depression through the analysis of audio and visual data collected during patients’ semi-structured interviews and natural interactions, achieving an accuracy rate of 87.06%. To the best of our knowledge, this method represents a completely new approach that has not been previously applied in the literature.
REFERENCES
Abdelrahman AA, Hempel T, Khalifa A et al. (2023) “L2CS-Net: Fine-Grained Gaze Estimation in Unconstrained Environments” 2023 8th International Conference on Frontiers of Signal Processing (ICFSP), Corfu, Greece, pp. 98-102, doi: 10.1109/ICFSP59764.2023.10372944
Cheong JH, Jolly E, Xie T et al. (2023) “Py-Feat: Python Facial Expression Analysis Toolbox” Affective Science 4, 781-796. https://doi.org/10.1007/s42761-023-00191-4
Dewi C, Chen R, Jiang X et al. (2022) “Adjusting eye aspect ratio for strong eye blink detection based on facial landmarks” PeerJ Computer Science 8:e943
Ekman P & Friesen W (1978) Facial Action Coding System: A Technique for the Measurement of Facial Movement. Palo Alto: Consulting Psychologists Press
Google (n.d.) “MediaPipe” https://google.github.io/mediapipe/
Keywords: Depression, Eye gaze, Machine learning, Point-of-View Glasses