Deep Neural Network-Based Modeling of Multimodal Human–Computer Interaction in Aircraft Cockpits


1. Introduction

With the rapid development of aviation technology, human–computer interaction (HCI) technology is widely used in aircraft cockpits. In flight, the operator must process many kinds of flight information and complete human–computer interaction through a large number of manual operations. These manual operations reduce interaction efficiency and easily lead to operator errors. Gaze interaction [1,2,3] can effectively alleviate this problem: it interprets the position and trajectory of the operator’s gaze as interaction commands, replacing traditional peripheral input such as mouse clicks or keystrokes and reducing the operator’s manual workload. However, the “Midas touch” problem [4] limits the application of gaze interaction techniques: when the operator looks at a target, the interaction system cannot tell whether the operator is merely viewing information or intends to trigger an action.
Many researchers have added speech, gesture, or EEG modalities to gaze interaction to establish multimodal interaction approaches that solve the Midas touch problem. In contrast to gaze-only interaction, multimodal methods that combine gaze with other modalities can discern the intent behind the operator’s gaze behavior. In addition, multimodal interaction gives computers human-like perceptual functions and brings the human–computer interaction process closer to the exchange of information between humans [5]. Among gaze-based multimodal interactions, gaze+EEG has more application potential than gaze+speech and gaze+posture, because it is neither disturbed by other users’ speech nor requires additional physical movements. However, in existing studies on gaze+EEG multimodal human–computer interaction, the interaction objects are mostly larger, stationary targets, which does not suit the selection of the smaller, moving aircraft targets on the cockpit radar page. Therefore, this paper builds a multimodal human–computer interaction model based on a long short-term memory (LSTM) network and a transformer architecture for the operator’s target selection on the cockpit radar page. The model extracts deep features from eye movement and EEG data, effectively improving the accuracy of selecting small, moving radar targets.

An intelligent, comfortable, and personalized piloting experience is the future direction of intelligent cockpit development, and multimodal human–computer interaction technology is one of its core driving forces. This study explores the application of multimodal human–computer interaction in future intelligent cockpits. In gaze+EEG multimodal interaction, the gaze provides a clear visual focus, while the EEG reflects the cognitive state of the operator. This interaction method can understand the user’s intention more comprehensively and improve interaction accuracy. In current practical applications, however, issues such as integrating eye tracking with brain–computer interface devices, system latency, and user adaptability must still be addressed. Nonetheless, gaze+EEG multimodal interaction technology has great potential for future development and application, because it can raise the level of cockpit intelligence and improve the user experience.

2. Related Work

The modalities of human–computer interaction include traditional input tools and more novel input methods. The conventional input tools include the mouse, keyboard, joystick, buttons, etc. More novel input modalities include speech, eye tracking [6], hand gestures [7], head posture, and EEG. Each modality has unique interaction characteristics, such as the advantages and limitations of the device, the accuracy and stability of the interaction, and the muscle fatigue and cognitive load imposed on users. Human interaction combines multiple information channels, and this multimodal fusion is more conducive both to human perception of the computer environment and to the computer’s understanding of human intentions [8]. Since more than 80% of the information humans receive originates from the visual system, many researchers have studied combining gaze with other modalities to accomplish human–computer interaction.

2.1. Gaze+Speech Combination

Multimodal interaction based on gaze and speech is an early hotspot in research on new human–computer interaction technology. Castellina et al. [9] implemented a new interaction interface based on eye-tracking and speech systems: participants selected multiple areas on the screen through the gaze-tracking system, while the speech system issued the operating instructions. Carrino et al. [10] implemented multimodal human–machine interaction based on wearable visual indications and voice commands; however, the wearable visual approach restricts the freedom and flexibility of user interaction. Sengupta et al. [11] used a combination of speech and gaze modes for web browsing and evaluated the time to complete browsing tasks for five participants. The evaluation showed that multimodal interaction outperformed single-mode interaction. Rozado et al. [12] designed an open-source interactive software called VoxVisio 1.0, in which the user gazes at the object to be interacted with in the user interface and then utters voice commands to trigger actions.

Multimodal interaction based on gaze and speech is superior to unimodal interaction, but it still has many disadvantages. For example, because of the inherent ambiguity of human language, speech control systems often need to infer the meaning expressed by users from context, and when the semantic context is uncertain, the error rate is high. Voice input may also suffer interference among multiple users, and its privacy cannot be guaranteed.

2.2. Gaze+Posture Combination

Adding gestures or head poses as input commands alongside gaze turns gaze interaction into a multimodal human–computer interaction. Related studies are as follows.

Chatterjee et al. [13] proposed a gaze+gesture technique for desktop interaction that uses gaze as the cursor and gestures to confirm the interaction; its performance is higher than that of gaze or gesture alone. Pfeuffer et al. [14] had users look at a target in virtual reality and then move it with a “pinch” motion, so the whole interaction required no handheld control device. Feng et al. [15] introduced a typing interface that combines head posture with gaze input: the interface selects a word based on gaze path information, and text input operations are then triggered by nodding or shaking the head. Compared with eye typing, this method reduces the area needed for buttons and word candidate lists. Jalaliniya et al. [16] found that integrating head gestures with gaze interaction was faster than head-gesture interaction alone.

When confirmation is performed with hand gestures or head posture, users must keep their hands or head within the detection area at all times. In addition, making hand gestures and head movements for a long time easily fatigues the user.

2.3. Gaze+EEG Combination

EEG information can reflect human intentions effectively. The intention of the user to interact can be confirmed using EEG [17]. Some research has already verified the feasibility of combining gaze with EEG. For example, Huang et al. [18] controlled the cursor movement by combining gaze and EEG information and compared it with the gaze-only control method. The comparison results showed that the combination of gaze and EEG achieved higher accuracy in controlling the cursor. Wang et al. [19] designed a semi-autonomous object-grasping system based on eye movements and EEG for patients with physical disabilities. The system calculates the user’s attention value through EEG signals. When the attention value exceeds a set threshold, the system activates a robotic arm to grasp a static target the user is eyeing. Yi et al. [20] proposed a multimodal target recognition model to help patients with disorders of consciousness in clinical communication. The model uses a multi-scale convolutional network to extract features from the EEG and visual data and recognizes patient-selected text according to these features.

One key goal of research on aviation human–computer interaction systems is to improve the efficiency of interaction in aircraft cockpit intelligent assistive systems. Gaze+EEG multimodal interaction can reduce the operator’s manual operations and speed up the HCI. However, the interaction target’s size and motion state dramatically affect the accuracy of existing gaze+EEG multimodal interaction techniques. To address this problem, this paper uses deep learning networks to extract features from eye movement and EEG data, improving the performance of multimodal human–computer interaction.

3. Methods

3.1. Multimodal Interaction Processes for Target Selection

The purpose of multimodal interaction modeling is to complete the interaction task of target selection between the operator and the simulated interaction interface. The designed interaction interface is the screen of the radar system in the aircraft. The display screen shows multiple targets, and the operator “locks” the object with their gaze and then “confirms” the currently locked target with EEG.

The whole interaction flow is shown in Figure 1. First, EEG signals are collected using an EEG measurement system, and eye movement data are collected using an eye-tracking device. The collected eye movement data are then fed into the established target classification model to determine the target the operator is currently looking at, while the acquired EEG signals are fed into the intent classification model to determine the operator’s intent (select or do not select) for the currently gazed target. Combining the two outputs produces the operator’s target selection result in the interaction interface.
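As a minimal illustration of this flow (the function, model interfaces, and class encodings below are hypothetical; the paper does not publish code), one interaction cycle could be sketched as:

```python
import numpy as np

def interaction_cycle(gaze_window: np.ndarray, eeg_window: np.ndarray,
                      target_model, intent_model):
    """One decision cycle: gaze decides WHICH target is locked,
    EEG decides WHETHER the locked target is confirmed (illustrative only)."""
    target_id = target_model.predict(gaze_window)   # e.g., 0 = red, 1 = yellow, 2 = blue aircraft
    intent = intent_model.predict(eeg_window)       # e.g., 0 = viewing only, 1 = select
    return target_id, intent                        # concatenated into the selection result
```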

3.2. Multi-Modal Interaction Model Based on LSTM and Transformer

Figure 2 shows the multimodal interaction model constructed in this paper. The target classification model uses LSTM, and the intention classification model uses the transformer network. The input of the target classification model is the operator’s gaze trajectory. The LSTM is used to extract eye movement features, and the current target that the operator is gazing at is output through the dense layers. The input of the intention classification model is the operator’s EEG signals. The transformer network extracts the EEG features, and then the current operator’s intent is output through the dense layers. Finally, the output of the target classification model is concatenated with the output of the intention classification model to obtain the final interaction result for target selection.

3.3. Target Classification Model

LSTM is the core component of the target classification network. LSTM is an improvement of the recurrent neural network (RNN) [21]. RNN can make predictions based on current and historical information. However, as the complexity of neural networks increases, RNN often suffers from information overload and local optimization problems. LSTM is a variant of RNN. LSTM can use gated units to make the network’s information extraction more selective, improving information utilization and time series prediction accuracy. The LSTM module consists of three parts: the forget gate, the input gate, and the output gate. The forget gate determines whether to delete some information from the previous memory cell state. The input gate selects messages from the candidate storage unit to update the cell state. The output gate filters information from the storage unit and only outputs the messages from part of the cell state.
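For reference, the gating behavior described above corresponds to the standard LSTM update equations (with $\sigma$ the sigmoid function, $\odot$ element-wise multiplication, and $[h_{t-1}, x_t]$ the concatenation of the previous hidden state and the current input):

$$
\begin{aligned}
f_t &= \sigma\!\left(W_f [h_{t-1}, x_t] + b_f\right) && \text{(forget gate)}\\
i_t &= \sigma\!\left(W_i [h_{t-1}, x_t] + b_i\right) && \text{(input gate)}\\
\tilde{C}_t &= \tanh\!\left(W_C [h_{t-1}, x_t] + b_C\right) && \text{(candidate state)}\\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{(cell state update)}\\
o_t &= \sigma\!\left(W_o [h_{t-1}, x_t] + b_o\right) && \text{(output gate)}\\
h_t &= o_t \odot \tanh(C_t) && \text{(hidden state output)}
\end{aligned}
$$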
Figure 3 illustrates the specific structure of the target classification model. The model contains two LSTM modules followed by dense layers. The input is the eye-tracking data. Since there are three classes of radar targets, the model outputs a three-class classification result.
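A minimal PyTorch sketch of this structure is given below. The hidden sizes and the input dimension (two gaze coordinates per time step) are assumptions not stated in the paper; only the two LSTM layers, the dropout rate of 0.5 (Table 1), and the three-class output follow the text.

```python
import torch
import torch.nn as nn

class TargetClassifier(nn.Module):
    """Sketch of the LSTM-based target classifier (layer sizes are assumptions)."""
    def __init__(self, in_features: int = 2, hidden: int = 64, n_classes: int = 3):
        super().__init__()
        # Two stacked LSTM layers over the gaze-point sequence (x, y per time step).
        self.lstm = nn.LSTM(in_features, hidden, num_layers=2,
                            batch_first=True, dropout=0.5)
        self.head = nn.Sequential(
            nn.Linear(hidden, 32), nn.ReLU(),
            nn.Linear(32, n_classes))          # three radar-target classes

    def forward(self, x):                      # x: (batch, time, 2)
        _, (h_n, _) = self.lstm(x)             # h_n: (num_layers, batch, hidden)
        return self.head(h_n[-1])              # logits for the three-class output
```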

3.4. Intention Classification Model

The transformer is the core component of the intention classification network. The transformer is a model with an attention mechanism that has shown excellent performance in multiple fields in recent years [22], and in sequence modeling it has gradually been replacing CNN and RNN models. In comparison, CNNs have difficulty capturing global correlations with shallow stacking, and RNNs struggle to model long-range dependencies in long sequences. The transformer has superior long-range dependency representation ability and can use self-attention mechanisms to capture both local and global correlations of input sequences. The transformer contains an encoder and a decoder. We use only the transformer’s encoder and modify its structure: the modified encoder removes the original embedding and feed-forward parts but retains the feature extraction and amplification parts. The multi-head self-attention mechanism is an essential component of the modified transformer encoder.
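The multi-head self-attention retained in the modified encoder follows the standard scaled dot-product formulation:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\mathsf{T}}}{\sqrt{d_k}}\right)V,\qquad
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1,\ldots,\mathrm{head}_h)\,W^{O}
$$

where $\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V})$ and $d_k$ is the key dimension of each head.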
Figure 4 shows the intention classification model’s specific structure. The model contains one transformer encoder module and dense layers. The input is EEG sequence data. Since there are two types of EEG intentions, the model outputs a two-class classification result.
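A minimal PyTorch sketch of such an encoder-based intent classifier is shown below. It uses a standard encoder layer as a stand-in for the modified encoder described above, and all hyperparameters (number of heads, hidden sizes) are assumptions; only the 32-channel input and the two-class output follow the text.

```python
import torch
import torch.nn as nn

class IntentClassifier(nn.Module):
    """Sketch of the transformer-encoder intent classifier (hyperparameters assumed)."""
    def __init__(self, n_channels: int = 32, n_heads: int = 4, n_classes: int = 2):
        super().__init__()
        # One encoder layer applying multi-head self-attention over the EEG sequence;
        # each time step is a 32-channel sample.
        layer = nn.TransformerEncoderLayer(d_model=n_channels, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        self.head = nn.Sequential(
            nn.Linear(n_channels, 16), nn.ReLU(),
            nn.Linear(16, n_classes))          # select / do-not-select

    def forward(self, x):                      # x: (batch, time, 32)
        z = self.encoder(x)                    # (batch, time, 32)
        return self.head(z.mean(dim=1))        # pool over time, then classify
```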

4. Experimental Design and Analysis of Results

4.1. Experimental Design

4.1.1. Experimental Objects

To collect eye movement data and EEG for the experiment, we recruited 10 males aged 21 to 25. All subjects had normal vision and no history of psychiatric disorders. Before the experiment, each subject was informed of the details of the experimental procedure and signed an informed consent form. We have established strict confidentiality measures to ensure the privacy and data security of the subjects. In addition, our institution’s ethics committee has reviewed and approved this study.

4.1.2. Experimental Devices

The eye movement data were collected through a non-wearable eye-tracking device previously developed and designed by the team. The hardware of this eye tracker consists of four infrared light sources and a near-infrared camera. Figure 5 shows the measurement flow of the eye tracker. The eye image was extracted from the near-infrared camera image by face detection. The eye images were input into the pupil and Purkinje spot recognition model, which calculates the pupil center and the coordinates of the four Purkinje spots. All the coordinate data were input into the eye-tracking model to obtain the subject’s gaze point on the screen. The sampling frequency of the eye tracker was 90 Hz.
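The paper does not detail the internal eye-tracking model. As an illustration only, a common approach for pupil–corneal-reflection trackers of this kind is a polynomial regression from the pupil–glint vector to screen coordinates, fitted on calibration points; everything below is an assumption, not the authors’ method.

```python
import numpy as np

def fit_gaze_mapping(pupil_glint_vecs: np.ndarray, screen_points: np.ndarray):
    """Fit a second-order polynomial mapping from pupil-glint vectors (N, 2)
    to calibration points on the screen (N, 2). Illustrative only."""
    vx, vy = pupil_glint_vecs[:, 0], pupil_glint_vecs[:, 1]
    # Design matrix with quadratic terms: [1, vx, vy, vx*vy, vx^2, vy^2]
    A = np.column_stack([np.ones_like(vx), vx, vy, vx * vy, vx**2, vy**2])
    coeffs, *_ = np.linalg.lstsq(A, screen_points, rcond=None)
    return coeffs                              # (6, 2): one column per screen axis

def predict_gaze(coeffs: np.ndarray, vec: np.ndarray) -> np.ndarray:
    """Map one pupil-glint vector to an (x, y) gaze point on the screen."""
    vx, vy = vec
    feats = np.array([1.0, vx, vy, vx * vy, vx**2, vy**2])
    return feats @ coeffs
```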

We collected EEG data using an ErgoLAB EEG measurement system acquired by the team. The system was configured with an EEG cap using a 10-10 international standard lead system layout of electrodes. The conduction medium for each electrode was a wet sponge. The EEG instrument was capable of recording 32-lead EEG data. The sampling frequency of the EEG signals was 256 Hz.

4.1.3. Experimental Process

Figure 6 shows the experimental environment. To minimize electromagnetic interference, electronic equipment not related to the experiment was turned off during each session. Only one subject and one operator were allowed in the laboratory at a time.
Before the experiment, the subjects put on the EEG cap and sat on the chair with their hands resting naturally. The distance between the subject’s eyes and the display screen was about 60 cm. We ran the virtual radar interface program while debugging the eye movement acquisition and EEG measurement devices. At the beginning of the experiment, the radar interface on the screen did not display any aircraft targets and only showed the word relax; at this time, the subject was in a relaxed state. After 2 s, a randomly selected red, yellow, or blue airplane appeared on the screen, each with a different trajectory. The subject gazed at the red (or yellow or blue) airplane without imagining the motion of selecting a target. After 3 s, the screen displayed the word prepare, prompting the subject to prepare to imagine the motion. Then, 1 s later, a mouse image appeared on the screen. The subject continued to gaze at the red (or yellow or blue) airplane while imagining using the right hand to click the mouse to select the target. This motor imagery stage lasted 3 s, so one set of the experiment lasted 9 s in total. Figure 7 is the timing diagram of the above experimental process.

Prolonged experiments may cause subjects to lose concentration, leading to poor experimental results. Therefore, each subject performed 200 sets of experiments and was then replaced by the next subject.

4.1.4. Data Preprocessing

During the experiment, the eye tracker and the EEG system recorded data simultaneously. We preprocessed the EEG signals with two filtering stages: first, 50 Hz notch filtering to reduce power-line interference, and then NLMS adaptive filtering to remove noise that deviates severely from the mean. The data collection software recorded the data and timestamps. We sliced the original dataset by timestamp and labeled each slice. Figure 8 shows an example of data slicing.
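A sketch of this two-stage filtering, assuming SciPy’s notch filter and a simple prediction-error formulation of NLMS (filter order and step size are assumptions not given in the paper), might look like:

```python
import numpy as np
from scipy.signal import iirnotch, filtfilt

def preprocess_channel(eeg: np.ndarray, fs: float = 256.0) -> np.ndarray:
    """Sketch: 50 Hz notch filtering followed by NLMS adaptive filtering."""
    # Stage 1: notch filter to suppress 50 Hz mains interference.
    b, a = iirnotch(w0=50.0, Q=30.0, fs=fs)
    x = filtfilt(b, a, eeg)

    # Stage 2: NLMS adaptive filter predicting each sample from the previous ones;
    # the prediction error is kept as the de-noised signal.
    order, mu, eps = 8, 0.1, 1e-6
    w = np.zeros(order)
    out = np.zeros_like(x)
    for n in range(order, len(x)):
        u = x[n - order:n][::-1]              # most recent samples first
        y = w @ u                             # prediction of the current sample
        e = x[n] - y                          # prediction error
        w += (mu / (eps + u @ u)) * e * u     # normalized LMS weight update
        out[n] = e
    return out
```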

4.2. Analysis of Results

4.2.1. Model Training Parameters

The sample dataset of this experiment contains eye movement and EEG data samples: 14,400 sets of eye movement data and 14,400 sets of EEG data. The networks are trained for 600 epochs with a batch size of 64. The initial learning rate is $10^{-3}$ and is adjusted automatically during training by the Adam optimizer. The loss function is the cross-entropy function, whose mathematical expression is as follows.

$$\mathrm{Loss} = -\frac{1}{N}\sum_{j=1}^{N} P(X_j)\,\log\bigl(q(X_j)\bigr) \qquad (1)$$

where $P(X_j)$ denotes the true probability of sample $X_j$ and $q(X_j)$ represents its predicted probability. Training aims to bring the values of $P(X_j)$ and $q(X_j)$ as close together as possible.
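As a concrete reading of these settings, a minimal training loop is sketched below; the dummy tensors and the placeholder linear model stand in for the real samples and the LSTM network, while the optimizer, learning rate, loss, batch size, and epoch count follow the text.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Dummy gaze sequences stand in for the real samples (shapes are placeholders).
X = torch.randn(640, 270, 2)                  # (samples, time steps, x/y gaze coordinates)
y = torch.randint(0, 3, (640,))               # three radar-target classes
loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)   # batch size 64

model = nn.Sequential(nn.Flatten(), nn.Linear(270 * 2, 3))   # placeholder for the LSTM model
criterion = nn.CrossEntropyLoss()             # the cross-entropy loss of Equation (1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)    # initial learning rate 10^-3

for epoch in range(600):                      # 600 training epochs
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
```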

The sample dataset is grouped for training using 5-fold cross-validation. The process of 5-fold cross-validation is as follows. First, the sample dataset is divided randomly into five equal-sized subsets. Any one of the subsets is the testing set for testing the model performance, and the remaining four subsets are the training set for model training. We then repeat the above method of dividing the dataset five times. The training and testing sets are different each time. Finally, we average the validation results from all testing sets. This average value represents the final performance of the model.
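The split arithmetic can be illustrated with scikit-learn’s KFold (illustrative only; the paper does not state which tooling was used):

```python
import numpy as np
from sklearn.model_selection import KFold

indices = np.arange(14400)                    # one index per sample
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kf.split(indices), start=1):
    # Each fold: 11,520 training samples and 2880 testing samples.
    print(fold, len(train_idx), len(test_idx))
# The reported performance is the mean accuracy over the five folds.
```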

4.2.2. Results and Analysis of Target Classification

To evaluate the proposed multimodal human–computer interaction model’s performance comprehensively, we first assess the performance of the target and intention classification modules separately. We group all eye movement data samples for training using 5-fold cross-validation, randomly dividing the 14,400 sets of eye movement data into five equal-sized subsets.

One subset serves as the testing set (2880 samples), and the remaining four subsets form the training set (11,520 samples). A different subset is selected as the testing set in each fold. The evaluation criterion for the classification effect is classification accuracy, calculated as follows.

$$\mathrm{Accuracy} = \frac{TR}{TR + FL} \times 100\% \qquad (2)$$

where $TR$ denotes the number of samples whose classification results are identical to the class labels and $FL$ denotes the number of samples whose classification results differ from the class labels.

The final classification accuracy of the target classification model in this paper is the average of the classification accuracies obtained across the 5-fold cross-validation. To evaluate the model’s performance objectively, we compare it with other classifiers. Table 1 shows the results of target classification on the eye movement data using different methods; the classification accuracy values show the superiority of our method.

4.2.3. Results and Analysis of Intention Classification

We also train the 14,400 sets of EEG data samples in groups using 5-fold cross-validation. The model needs to be cross-validated five times. For each cross-validation, the sample size of the training set is 11,520, and the sample size of the testing set is 2880. The final classification accuracy of the intention classification model in this paper is the average of the classification accuracies obtained from 5-fold cross-validations.

The intention classification module in the human–machine interaction model determines whether the subject intends to select the aircraft target by extracting features from the EEG signals. Previous studies have used both single-channel and multi-channel EEG data for classification. We therefore compare the classification accuracy obtained with single-channel EEG data and with 32-channel EEG data. Figure 9 shows the accuracy of the model for different numbers of channels, where ‘All’ represents 32 channels. As Figure 9 shows, when using 32 channels of EEG data, the model reaches a classification accuracy of up to 98.5%. Therefore, we feed all 32 recorded EEG leads into the model for training.
To evaluate the performance of EEG intention classification models, we also apply DCNN [27], DRCNN [28], FCNN [29], and BERT [30] to EEG classification on this paper’s dataset. All models use 32-channel EEG data. Figure 10 shows the classification accuracy of the compared models on this dataset; our proposed method obtains the highest classification accuracy, outperforming all other models.

4.2.4. Results and Analysis of Multimodal Interaction

To evaluate the overall performance of the multimodal interaction model, we send all eye movement data samples and EEG data into the model at the same time. The training and testing sets are divided using 5-fold cross-validation. The target classification model outputs the target number. The intention classification model outputs the intention number. The output of the multimodal interaction model is the result of concatenating the target number with the intent number. From the output of the multimodal interaction model, it is possible to know which target the subject is gazing at and whether the subject wants to select this target. We use interaction accuracy as the evaluation measure of the multimodal model. Equation (3) calculates the interaction accuracy.

$$Acc = \frac{TT}{TT + TF + FT + FF} \times 100\% \qquad (3)$$

where $Acc$ is the interaction accuracy, $TT$ denotes the number of samples for which both the target and intention classification predictions match the class labels, $TF$ denotes the number of samples for which the target classification prediction matches the class label but the intention classification prediction does not, $FT$ denotes the number of samples for which the target classification prediction does not match the class label but the intention classification prediction does, and $FF$ denotes the number of samples for which neither prediction matches the class labels.
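A small helper that computes this measure from the two model outputs (function name and example values are illustrative) is shown below; a trial contributes to $TT$ only when both predictions match their labels.

```python
import numpy as np

def interaction_accuracy(target_pred, target_true, intent_pred, intent_true):
    """Interaction accuracy of Equation (3): a trial counts as correct (TT) only
    when both the target and the intention predictions match their labels."""
    target_pred, target_true = np.asarray(target_pred), np.asarray(target_true)
    intent_pred, intent_true = np.asarray(intent_pred), np.asarray(intent_true)
    both_correct = (target_pred == target_true) & (intent_pred == intent_true)
    return 100.0 * both_correct.mean()          # TT / (TT + TF + FT + FF) in percent

# Example: 3 of 4 trials have both outputs correct -> 75.0
print(interaction_accuracy([0, 1, 2, 0], [0, 1, 2, 1], [1, 0, 1, 1], [1, 0, 1, 1]))
```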

Table 2 shows the results of 5-fold cross-validation of the multimodal interaction model, where N is the number of cross-validations. The results show that the multimodal interaction model in this paper is highly accurate.

4.2.5. Results and Analysis of Flight Simulation Validation

To further verify the proposed model’s performance, we test gaze interaction and the gaze+EEG interaction on a simulated flight platform. Figure 11 shows the testing process. Subjects select an aircraft target every 10 s during the flight simulation. The number of selections for all aircraft targets is 200.
Figure 12 shows the interaction accuracy of the two methods. The multimodal interaction model we constructed has an accuracy of over 95% for each of the selected airplanes. The results show that the gaze+EEG interaction was superior to the gaze interaction.

5. Discussion

The model we designed achieves the expected results in the human–computer interaction task of target selection. In the designed multimodal interaction model, the target classification module utilizes the LSTM network framework, and the intention classification module uses a modified transformer framework. To evaluate the performance of the LSTM and transformer in the model, we compare them with multiple models, and the comparison results demonstrate the robustness of our model. In addition, the test results on the simulated flight platform also show that our gaze+EEG multimodal human–computer interaction model can improve interaction accuracy.

The model in this paper focuses on the situation in which the motion trajectories of the interacting targets are different. In other words, target classification works best when the motion trajectories of the interacting targets differ substantially. However, the motion trajectories of airplane targets on an actual radar page may be very similar, in which case the classification performance of the proposed model will deteriorate. In particular, when the airplane targets all move along the identical flight path, the target classification network in this paper has difficulty distinguishing them by eye movement trajectories. In addition, the scale and variety of the dataset used by the model need to be expanded further. The model in this paper is still in the exploratory stage for practical applications and cannot be applied directly to real cockpits, because it still faces many challenges, such as the complexity of system integration, system response time, and adaptation to user personalization.

6. Conclusions

The research in this paper aims to improve the efficiency and accuracy of human–computer interaction in aircraft cockpit scenarios as well as the naturalness of the interaction behavior. In this study, we propose a multimodal interaction model based on deep learning networks to implement target selection for radar interfaces. The model points to the task object by gaze and then determines by EEG whether or not to select the pointed task object. The comparison with other models and the validation on the simulated flight platform demonstrate the performance advantages of the proposed model. Our study can serve as a reference for the design of intelligent aircraft cockpits. In future studies, we will expand the subject population by increasing the number of subjects, adding female subjects, and taking other measures; large and diverse datasets can improve the model’s generalization ability. We will continue to optimize the network model to improve accuracy, considering that the trajectories of interacting targets may be very similar. In addition, future research should consider actual cockpit testing, adding other modalities, and user experience.

Author Contributions

Methodology, L.W.; writing—original draft, L.W.; writing—review and editing, L.W.; data analysis, H.Z.; funding acquisition, C.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research is supported by the National Natural Science Foundation of China (Number: 52072293).

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki, and the protocol was approved by the Ethics Committee of Xi’an Technological University (10702) in July 2022.

Informed Consent Statement

Informed consent for participation and publication were obtained from all subjects involved in the study.

Data Availability Statement

The datasets presented in this article are not readily available because the data are part of an ongoing study. Requests to access the datasets should be directed to the authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Plopski, A.; Hirzle, T.; Norouzi, N.; Qian, L.; Bruder, G.; Langlotz, T. The eye in extended reality: A survey on gaze interaction and eye tracking in head-worn extended reality. ACM Comput. Surv. 2023, 55, 53. [Google Scholar] [CrossRef]
  2. Land, M.F.; Tatler, B.W. Looking and Acting: Vision and Eye Movements in Natural Behaviour. In Looking and Acting: Vision and Eye Movements in Natural Behaviour; Oxford University Press: Oxford, UK, 2012. [Google Scholar]
  3. Dondi, P.; Porta, M.; Donvito, A.; Volpe, G. A gaze-based interactive system to explore artwork imagery. J. Multimodal User Interfaces 2022, 16, 55–67. [Google Scholar] [CrossRef]
  4. Niu, Y.; Li, X.; Yang, W.; Xue, C.; Peng, N.; Jin, T. Smooth Pursuit Study on An Eye-Control System for Continuous Variable Adjustment Tasks. Int. J. Hum. Comput. Interact. 2021, 1, 23–33. [Google Scholar] [CrossRef]
  5. Card, S.K.; Moran, T.P.; Newell, A. The Psychology of Human-Computer Interaction; University of Michigan: Ann Arbor, MI, USA, 2008. [Google Scholar]
  6. Mahanama, B.; Ashok, V.; Jayarathna, S. Multi-eyes: A framework for multi-user eye-tracking using webcams. In Proceedings of the 2024 IEEE International Conference on Information Reuse and Integration for Data Science (IRI), San Jose, CA, USA, 7–9 August 2024; pp. 308–313. [Google Scholar]
  7. Franslin, N.M.F.; Ng, G.W. Vision-based dynamic hand gesture recognition techniques and applications: A review. In Proceedings of the 8th International Conference on Computational Science and Technology, Labuan, Malaysia, 28–29 August 2021; Lecture Notes in Electrical Engineering. Springer: Singapore, 2022; Volume 835, pp. 125–138. [Google Scholar]
  8. Jacob, R.J.K. What you look at is what you get: Eye movement-based interaction techniques. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, Seattle, WA, USA, 1–5 April 1990; pp. 11–18. [Google Scholar]
  9. Castellina, E.; Corno, F.; Pellegrino, P. Integrated speech and gaze control for realistic desktop environments. In Proceedings of the 2008 Symposium on Eye Tracking Research & Applications, Savannah, GA, USA, 26–28 March 2008; pp. 79–82. [Google Scholar]
  10. Carrino, S.; Péclat, A.; Mugellini, E.; Khaled, O.A.; Lngold, R. Humans and smart environments: A novel multimodal interaction approach. In Proceedings of the 13th International Conference on Multimodal Interfaces, Alicante, Spain, 14–18 November 2011; pp. 105–112. [Google Scholar]
  11. Sengupta, K.; Ke, M.; Menges, R.; Kumar, C.; Staab, S. Hands-free web browsing: Enriching the user experience with gaze and voice modality. In Proceedings of the 2018 ACM Symposium on Eye Tracking Research & Applications, Warsaw, Poland, 14–17 June 2018; pp. 1–3. [Google Scholar]
  12. Rozado, D.; McNeill, A.; Mazur, D. Voxvisio—Combining Gaze and Speech for Accessible Hci. In Proceedings of the RESNA/NCART, Arlington, VA, USA, 10–14 July 2016. [Google Scholar]
  13. Chatterjee, I.; Xiao, R.; Harrison, C. Gaze+ gesture: Expressive, precise and targeted free-space interactions. In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, Seattle, WA, USA, 9–13 November 2015; pp. 131–138. [Google Scholar]
  14. Pfeuffer, K.; Mayer, B.; Mardanbegi, D.; Gellersen, H. Gaze+ pinch interaction in virtual reality. In Proceedings of the 5th Symposium on Spatial User Interaction, Brighton, UK, 16–17 October 2017; pp. 99–108. [Google Scholar]
  15. Feng, W.; Zou, J.; Kurauchi, A.; Morimoto, C.; Betke, M. Hgaze typing: Head-gesture assisted gaze typing. In Proceedings of the ACM Symposium on Eye Tracking Research and Applications, Online, 25–27 May 2021; pp. 1–11. [Google Scholar]
  16. Jalaliniya, S.; Mardanbegi, D.; Pederson, T. MAGIC pointing for eyewear computers. In Proceedings of the 2015 ACM International Symposium on Wearable Computers, Osaka, Japan, 7–11 September 2015; pp. 155–158. [Google Scholar]
  17. Wang, B.; Wang, J.; Wang, X.; Chen, L.; Zhang, H.; Jiao, C.; Wang, G.; Feng, K. An Identification Method for Road Hypnosis Based on Human EEG Data. Sensors 2024, 24, 4392. [Google Scholar] [CrossRef] [PubMed]
  18. Huang, C.; Xiao, Y.; Xu, G. Predicting human intention-behavior through EEG signal analysis using multi-scale CNN. IEEE/ACM Trans. Comput. Biol. Bioinform. 2020, 18, 1722–1729. [Google Scholar] [CrossRef] [PubMed]
  19. Wang, J.; Wang, Z.; Zou, Z.; Zhang, A.; Tian, R.; Gao, S. Semi-autonomous grasping system based on eye movement and EEG to assist the interaction between the disabled and environment. TechRxiv 2021. [Google Scholar] [CrossRef]
  20. Yi, Z.; Pan, J.; Chen, Z.; Lu, D.; Cai, H.; Li, J.; Xie, Q. A hybrid BCI integrating EEG and Eye-Tracking for assisting clinical communication in patients with disorders of consciousness. IEEE Trans. Neural Syst. Rehabil. Eng. 2024, 32, 2759–2771. [Google Scholar] [CrossRef] [PubMed]
  21. Huang, R.; Wei, C.; Wang, B.; Yang, J.; Xu, X.; Wu, S.; Huang, S. Well performance prediction based on Long Short-Term Memory (LSTM) neural network. J. Pet. Sci. Eng. 2022, 208, 109686. [Google Scholar] [CrossRef]
  22. Lin, T.; Wang, Y.; Liu, X.; Qiu, X. A survey of transformers. AI Open 2022, 3, 111–132. [Google Scholar] [CrossRef]
  23. Ye, Z.; Li, H. Based on Radial Basis Kernel function of Support Vector Machines for speaker recognition. In Proceedings of the 5th International Congress on Image and Signal Processing, Chongqing, China, 16–18 October 2012; pp. 1584–1587. [Google Scholar]
  24. Krizhevsky, A.; Sutskever, I.; Hinton, G. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
  25. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going Deeper with Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
  26. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  27. Zhang, K.; Robinson, N.; Lee, S.W.; Guan, C. Adaptive transfer learning for EEG motor imagery classification with deep convolutional neural network. Neural Netw. 2021, 136, 1–10. [Google Scholar] [CrossRef] [PubMed]
  28. Li, Y.; Yang, H.; Li, J.; Chen, D.; Du, M. EEG-based intention recognition with deep recurrent-convolution neural network: Performance and channel selection by Grad-CAM. Neurocomputing 2020, 415, 225–233. [Google Scholar] [CrossRef]
  29. Roots, K.; Muhammad, Y.; Muhammad, N. Fusion convolutional neural network for cross-subject EEG motor imagery classification. Computers 2020, 9, 72. [Google Scholar] [CrossRef]
  30. Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Volume 1 (Long and Short Papers). Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 4171–4186. [Google Scholar]
Figure 1. Multimodal interaction process.

Figure 2. Multimodal interaction model.

Figure 3. The specific structure of the target classification model.

Figure 4. The intention classification model’s specific structure.

Figure 5. The measurement flow of the eye tracker.

Figure 6. Experimental environment.

Figure 7. Timing diagram of the experimental flow.

Figure 8. An example of data slicing.

Figure 9. (a) Accuracy comparison of Fp1–C3 channels. (b) Accuracy comparison of Cz–All channels.

Figure 10. Accuracy comparison of different models.

Figure 11. Test experiment process.

Figure 12. Interaction verification results.

Table 1. Comparison of results of target classification.

Classifier | Parameters | Accuracy
SVM [23] | Kernel type: RBF | 70.2%
AlexNet [24] | num_convolution: 5; batch size: 64; initial learning rate: $10^{-3}$; loss function: cross-entropy | 78%
GoogLeNet [25] | num_layers: 22; batch size: 64; initial learning rate: $10^{-3}$; loss function: cross-entropy | 83.6%
ResNet [26] | Convolution kernel size: 3 × 3; stride: 2; padding: 1; batch size: 64; initial learning rate: $10^{-3}$ | 88.4%
Ours | num_lstm_layers: 2; dropout rate: 0.5; batch size: 64; initial learning rate: $10^{-3}$; loss function: cross-entropy | 98%

Table 2. The results of 5-fold cross-validation of the multimodal interaction model.

N | Acc
1 | 97.2%
2 | 96.8%
3 | 97.0%
4 | 97.9%
5 | 97.4%



