Deep Reinforcement Learning-Based Secrecy Rate Optimization for Simultaneously Transmitting and Reflecting Reconfigurable Intelligent Surface-Assisted Unmanned Aerial Vehicle-Integrated Sensing and Communication Systems

1. Introduction

The rapid development of sixth-generation (6G) communication technology has led to an increasing demand for intelligent machine-type communication (IMTC), and the integration of sensing technologies to facilitate intelligent communications has emerged as an inevitable trend. In recent years, integrated sensing and communication (ISAC) has garnered significant interest from both academia and industry as an essential research direction for 6G [1,2,3,4]. By integrating communication and sensing capabilities, ISAC can markedly improve spectral efficiency, reduce system complexity, and offer technical support for emerging application scenarios such as smart cities and intelligent transportation.

Compared to traditional terrestrial communication, the use of unmanned aerial vehicles (UAVs) as ISAC base stations (UAV-ISAC) offers significant advantages, particularly in providing line-of-sight links and optimizing the quality of communications. UAV-ISACs leverage their mobility to dynamically adjust flight altitudes and trajectories to avoid obstructions, enhancing the stability of communication and sensing connections while flexibly adapting to diverse environments. Furthermore, the three-dimensional mobility of UAVs allows them to extend their coverage to remote or obstructed areas that terrestrial base stations cannot easily reach, significantly improving overall system coverage. For instance, Meng et al. [5] have highlighted that UAVs can maintain uninterrupted communications while performing sensing tasks. Xu et al. [6] further enhanced this framework through the addition of computing capabilities for real-time data processing and decision-making. Xiao et al. [7] proposed improving the communication rate through dynamic resource allocation and beamforming to enhance obstruction resistance. These studies underscore how UAVs can strengthen the flexibility and expand the coverage of communications, driving the advancement of UAV-assisted ISAC systems.

While UAVs can adapt to different scenarios by adjusting their trajectories, factors such as obstacles, line-of-sight variations, and user distribution uncertainties still limit their communication performance in complex urban environments. Therefore, reconfigurable intelligent surfaces (RISs) have been introduced to enhance wireless systems. Pan et al. [8] have provided a comprehensive overview of recent advancements in RIS-aided wireless systems from a signal processing perspective, highlighting promising research directions that warrant further exploration. Accurate channel state information (CSI) estimation is crucial for practical RIS systems. In this regard, Zhou et al. [9] have proposed a novel cascaded channel estimation strategy for RIS-assisted multi-user MISO systems, addressing the high overhead caused by the passive nature of RISs and the large number of reflecting elements. Byun et al. [10] have introduced a two-stage channel estimation technique for UAV-RIS communication systems, in which the channels are estimated at different time scales. In this study, in order to simplify the system design, we assume perfect CSI.

RISs have been widely used to enhance the stability and reliability of communications in UAV-ISAC systems. Wu et al. [11] have proposed an RIS-assisted UAV-enabled ISAC system that optimizes the trade-off between communication and sensing through jointly designing the RIS phase shift, UAV trajectory, beamforming, and user scheduling, as well as an iterative optimization algorithm for improved performance. Conventional RIS configurations require the transmitter and receiver to be positioned on the same side, significantly restricting the flexibility of their deployment and limiting system coverage. To overcome these limitations, simultaneously transmitting and reflecting reconfigurable intelligent surfaces (STAR-RISs) enable the independent control of both transmission and reflection (T & R) channels, allowing signals to serve users on both sides of the surface. This dual functionality not only enhances system flexibility but also significantly expands communication and sensing coverage, making STAR-RISs particularly advantageous in complex and dynamic environments [12]. Chen et al. [13] have proposed a STAR-RIS-assisted DFRC-UAV-enabled ISAC system that maximizes the multicast capacity through jointly optimizing scheduling, transmission covariance, RIS coefficients, and the UAV trajectory, using an alternating optimization algorithm for sub-optimal solutions.

In ISAC scenarios, security is a significant concern because the multi-functional nature of the integration enlarges the attack surface, while the complex system architecture and dynamic environments increase the risk of vulnerabilities. Yu et al. [14] have studied secure transmission in ISAC via RIS-UAV systems, introducing artificial noise and jointly optimizing UAV deployment, beamforming, and noise power to enhance the signal quality while addressing the resulting non-convex problem. Salem et al. [15] have proposed the use of an active RIS to optimize the secrecy rate of MU-MISO ISAC systems through a jointly designed beamforming approach that enhances physical layer security, with the results demonstrating that active RISs outperform passive RISs. Although the research presented in [14,15] employed RIS-assisted ISAC, it was still based on a fixed-base-station ISAC architecture. Compared to the UAV-ISAC approach, the transmission distance of such an architecture is more constrained, making it more susceptible to signal attenuation and blockage, which may affect the system’s stability and coverage. Zhang et al. [16] have proposed the joint optimization of resource allocation, RIS phase shifts, and the UAV trajectory to enhance the secrecy rate and energy efficiency in ISAC scenarios. They employed an RIS-assisted UAV-ISAC scheme to maximize the secrecy rate; however, this scheme utilizes a conventional reflecting-only RIS. Unlike traditional RISs, STAR-RISs can simultaneously control the reflection and transmission of signals, providing greater flexibility and making them particularly suitable for complex environments. However, research on the security of STAR-RIS-assisted UAV-ISAC systems is still relatively limited, and enhancing the security of such systems remains a crucial area of investigation.

Traditional convex optimization methods are widely used in RIS-assisted ISAC systems; however, their limitations become apparent when dealing with non-convex problems and high-dimensional state–action spaces. Reinforcement learning (RL), with its strong adaptability, has emerged as an essential tool for solving such problems. In STAR-RIS-assisted UAV-ISAC systems, decision-making problems can be modeled as Markov decision processes (MDPs), which makes them amenable to RL. However, conventional RL algorithms suffer from low sample efficiency and inadequate policy exploration in high-dimensional continuous action spaces. Deep deterministic policy gradient (DDPG), a deep reinforcement learning (DRL) approach, was designed for continuous action spaces. Moon et al. [17] have used the DDPG algorithm to optimize beamforming at both the ISAC base station and the RIS mounted on a UAV, aiming to maximize the secrecy rate. The twin delayed deep deterministic policy gradient (TD3) approach additionally mitigates Q-value overestimation through the use of a twin-critic architecture, delayed updates, and target policy smoothing, which improves stability and sample efficiency and makes TD3 more reliable for continuous control tasks. In STAR-RIS-assisted UAV-ISAC systems, TD3 can optimize the UAV trajectory, beamforming, and STAR-RIS phase control, effectively handling dynamic environments and improving both the quality and security of communications. Compared to single-agent DRL, multi-agent deep reinforcement learning (MADRL) employs a distributed strategy, decomposing the system into multiple intelligent modules (e.g., STAR-RIS and UAV-ISAC), which reduces the computational burden while enhancing system scalability and flexibility. While maintaining efficient learning and stability, MADRL copes well with complex environments, improving the system’s adaptability and robustness.

In response to the communication performance requirements of UAV-ISAC systems, this study introduces STAR-RIS technology to overcome the limitations of conventional RISs in terms of deployment and coverage range. A STAR-RIS-assisted UAV-ISAC system framework is formulated, leveraging the wide coverage and flexible deployment of both STAR-RISs and UAVs to enhance the overall system performance. To address security issues, the average secrecy rate (ASR) is set as the optimization objective, with the aim of improving the security and anti-eavesdropping capabilities of the communication system. The multi-agent twin delayed deep deterministic policy gradient (MATD3) algorithm is adopted to jointly optimize the UAV trajectory, beamforming, and the amplitude and phase adjustment of the STAR-RIS reflection elements, maximizing the ASR while ensuring that the relevant system constraints are satisfied. The contributions of this study can be summarized as follows:

(1) This study introduces a framework for a STAR-RIS-assisted UAV-ISAC system, where the UAV operates in three-dimensional space to provide downlink communication services to legitimate ground users. The framework jointly optimizes the UAV-ISAC trajectory, beamforming, and the phase and amplitude of the STAR-RIS reflection elements with the objective of maximizing the ASR while ensuring that the system meets the SINR constraints for communication, sensing, and eavesdropping.

(2) The MATD3 algorithm is employed to optimize the system’s decision-making process, simultaneously optimizing the UAV-ISAC trajectory, beamforming, and STAR-RIS configuration. This optimization maximizes the ASR of the system, demonstrating the effectiveness of the proposed algorithm.

The remainder of this paper is organized as follows: Section 2 presents the system model of the STAR-RIS-assisted UAV-ISAC network and the problem of maximizing the ASR. Section 3 describes the proposed MATD3-based algorithm. Section 4 describes the simulation results, and Section 5 provides this paper’s conclusion.

3. The Proposed Algorithm

We formulate the ASR maximization problem as an MDP and propose a STAR-RIS-assisted UAV-ISAC scheme based on the MATD3 algorithm (STAR-RIS-MATD3), which jointly designs the UAV-ISAC trajectory, the transmit beamforming matrix, and the amplitude and phase of the STAR-RIS. The details of the MDP model and MATD3 are described in the following sections.

3.1. Markov Decision Process

We propose a STAR-RIS-assisted UAV-ISAC system in which the UAV-ISAC and the STAR-RIS operate as independent agents, each autonomously selecting its next action based on the current environmental state. Each agent follows an MDP, which is described by the triplet $(s, a, r)$. The state space s and action space a contain all possible states and actions of the UAV-ISAC and STAR-RIS, while the reward function gives the reward value r associated with taking a specific action in a given state.

Therefore, the state, action, and reward of each agent at time step t are described as follows:

(1) State s: Considering the time-varying nature of the channel, the state includes the relevant CSI. Therefore, the state of the UAV-ISAC at time t is defined as $s_{UAV, t} = (x_{UAV}^{(t)}, y_{UAV}^{(t)}, H_{UAV}^{(t)}, G^{(t)}, h_{d, k}^{(t)})$. The UAV-ISAC state includes the coordinates of the UAV-ISAC, $G^{(t)}$ (the channel state between the UAV-ISAC and the STAR-RIS), and $h_{d, k}^{(t)}$ (the channel state between the UAV-ISAC and the legitimate ground users). The state of the STAR-RIS at time t is defined as $s_{\text{STAR-RIS}, t} = (x_{UAV}^{(t)}, y_{UAV}^{(t)}, H_{UAV}^{(t)}, G^{(t)})$, where the STAR-RIS state includes the coordinates of the UAV-ISAC and the channel state between the UAV-ISAC and the STAR-RIS.

(2) Action a: First, the action of the UAV-ISAC is

$a_{UAV, t} = \{ \{w_{k}^{(t)}\}_{k \in K}, \{w_{r, e}^{(t)}\}_{e \in E}, Δ x_{UAV}^{(t)}, Δ y_{UAV}^{(t)}, Δ H_{UAV}^{(t)} \},$

(18)

where $w_{k}^{(t)} = β_{k}^{(t)} e^{j θ_{k}^{(t)}}$ and $w_{r, e}^{(t)} = β_{r, e}^{(t)} e^{j θ_{r, e}^{(t)}}$ are the beamforming weights for the communication and sensing components of the UAV-ISAC, with amplitudes $β_{k}^{(t)}, β_{r, e}^{(t)}$ and phase shifts $θ_{k}^{(t)}, θ_{r, e}^{(t)}$.

Furthermore, the STAR-RIS consists of reflection and transmission components, and its action is defined as

$a_{RIS, t} = \{ \{β_{T / R, m}^{(t)}\}_{m \in M}, \{θ_{T / R, m}^{(t)}\}_{m \in M} \} .$

(19)

(3) Reward r: The reward aligns with our goal of maximizing the ASR. Therefore, the reward function is defined as

$r_{t} = {ASR}^{(t)} = \frac{1}{E} \sum_{k = 1}^{K} \sum_{e = 1}^{E} (R_{k}^{(t)} - R_{e, k}^{(t)}) .$

(20)
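As a rough illustration of the action and reward definitions, the following Python sketch builds a complex beamforming weight from an amplitude and a phase and evaluates the reward of Equation (20). The function names and all numerical rates are hypothetical and chosen only for this example; the rates $R_{k}$ and $R_{e, k}$ would in practice come from the system model.

```python
import cmath


def beam_weight(beta, theta):
    """Complex weight w = beta * exp(j * theta) from amplitude and phase."""
    return beta * cmath.exp(1j * theta)


def asr_reward(user_rates, eve_rates):
    """Reward of Eq. (20): (1/E) * sum_k sum_e (R_k - R_{e,k}).

    user_rates[k]   : rate R_k of legitimate user k
    eve_rates[e][k] : rate R_{e,k} of eavesdropper e on user k's signal
    """
    num_eves = len(eve_rates)
    total = 0.0
    for k, r_k in enumerate(user_rates):
        for e in range(num_eves):
            total += r_k - eve_rates[e][k]
    return total / num_eves


# Hypothetical rates (bit/s/Hz) for K = 2 users and E = 2 eavesdroppers.
user_rates = [3.0, 2.5]
eve_rates = [[1.0, 0.5], [0.5, 1.0]]
reward = asr_reward(user_rates, eve_rates)

# Hypothetical amplitude/phase pair taken from an agent's action vector.
w = beam_weight(0.5, 1.2)
```

The magnitude of `w` equals the amplitude component of the action, so an agent that outputs amplitude and phase separately fully determines each weight.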

3.2. ASR Maximization Algorithm Based on MATD3

To address the optimization problem presented in this study, we propose an optimization scheme for maximizing the ASR of the STAR-RIS-assisted UAV-ISAC system based on the MATD3 algorithm, as shown in Figure 2. This algorithm adopts a distributed architecture built on TD3, which avoids the high complexity associated with centralized training and, compared to DDPG, mitigates Q-value overestimation. We use the MATD3 algorithm to jointly optimize the UAV-ISAC trajectory, the transmit beamforming matrix, and the amplitude and phase of the STAR-RIS reflection elements, with the goal of maximizing the ASR while satisfying the system constraints. In this approach, the UAV-ISAC and STAR-RIS are modeled as independent agents that interact with the environment through distributed collaboration, gradually learning optimal strategies and further improving the system’s flexibility and efficiency.

Specifically, at each time step t, agent i takes an action $a_{i, t}$ based on the current state $s_{i, t}$ using its actor policy network:

$a_{i, t} (s_{i, t}) = \mathrm{clip} (π_{ϕ_{i}} (s_{i, t}) + ε, a_{\min}, a_{\max}), ε \sim N (0, σ),$

(21)

where $ε$ denotes Gaussian noise, and $a_{\max}$ and $a_{\min}$ represent the maximum and minimum values of the action space that the agent can choose, respectively. After the agent takes the action $a_{i, t}$ based on the current state $s_{i, t}$, it receives the next state $s_{i, t + 1}$ and reward $r_{i, t}$ from the environment. The transition tuple $(s_{i, t}, a_{i, t}, s_{i, t + 1}^{'}, r_{i, t})$ is then stored in the replay buffer B.
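The exploration step of Equation (21) and the replay buffer can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the policy output is passed in as a plain list, $σ$ is assumed to be the standard deviation of the noise, and the names `explore_action` and `ReplayBuffer` are hypothetical.

```python
import random
from collections import deque


def explore_action(policy_output, sigma, a_min, a_max, rng=random):
    """Eq. (21): add Gaussian noise to each action component, then clip."""
    noisy = [a + rng.gauss(0.0, sigma) for a in policy_output]
    return [min(max(a, a_min), a_max) for a in noisy]


class ReplayBuffer:
    """Fixed-capacity store of (s, a, s_next, r) transitions."""

    def __init__(self, capacity):
        # The oldest transitions are dropped once capacity is reached.
        self.storage = deque(maxlen=capacity)

    def store(self, s, a, s_next, r):
        self.storage.append((s, a, s_next, r))

    def sample(self, batch_size, rng=random):
        """Draw a mini-batch uniformly without replacement."""
        return rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


# Hypothetical usage with made-up state and policy values.
rng = random.Random(0)
buffer = ReplayBuffer(capacity=1000)
state = [0.0, 0.0, 100.0]
action = explore_action([0.9, -0.2, 0.4], sigma=0.1,
                        a_min=-1.0, a_max=1.0, rng=rng)
buffer.store(state, action, [0.5, -0.1, 101.0], 1.0)
```

Clipping after the noise is added keeps every stored action inside the admissible range, so the critics are never trained on infeasible actions.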

Once enough transitions are stored in the buffer, the agent performs mini-batch training by sampling N transitions $(s_{i, t}, a_{i, t}, s_{i, t + 1}^{'}, r_{i, t})$ from the buffer. To compute the target value, the agent uses the actor-target policy to select the action $a_{i}^{'}$ in the next state $s_{i}^{'}$ and adds clipped noise to it, which smooths the value estimate so that the policy cannot exploit actions whose Q-value is overestimated, as shown in Equation (22). Then, Equation (23) is used to calculate the target value $y_{i}$: the agent evaluates the global value of the subsequent state and action with both twin critic-target networks and takes the smaller of the two estimates.

$a_{i}^{'} (s_{i}^{'}) = \mathrm{clip} (π_{ϕ_{i}^{'}} (s_{i}^{'}) + \mathrm{clip} (ε, - c, c), a_{\min}, a_{\max}), ε \sim N (0, σ),$

(22)

$y_{i} = r + γ \min_{j = 1, 2} Q_{θ_{i, j}^{'}} (s_{i}^{'}, a_{i}^{'}) .$

(23)
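A minimal sketch of the target computation in Equations (22) and (23) is given below. The target actor output and the two critic-target estimates are passed in as plain numbers, since the network architectures are not specified here; the function names and all numerical values are assumptions made for illustration.

```python
import random


def clip(x, lo, hi):
    """Limit x to the interval [lo, hi]."""
    return min(max(x, lo), hi)


def smoothed_target_action(target_policy_output, sigma, c,
                           a_min, a_max, rng=random):
    """Eq. (22): add noise clipped to [-c, c], then clip to the action range."""
    return [clip(a + clip(rng.gauss(0.0, sigma), -c, c), a_min, a_max)
            for a in target_policy_output]


def td3_target(reward, gamma, q1_next, q2_next):
    """Eq. (23): y = r + gamma * min(Q'_1, Q'_2)."""
    return reward + gamma * min(q1_next, q2_next)


# Hypothetical values: target actor output and twin critic-target estimates.
a_next = smoothed_target_action([0.5, -0.95], sigma=0.2, c=0.5,
                                a_min=-1.0, a_max=1.0,
                                rng=random.Random(1))
y = td3_target(reward=1.0, gamma=0.9, q1_next=5.0, q2_next=4.0)
```

Taking the minimum of the two estimates biases the target downward, which counteracts the upward bias a single critic tends to accumulate.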

Once the agent has its target value $y_{i}$, it updates each of its critic networks $θ_{i, j}$, $j = 1, 2$, by calculating the mean squared error over the mini-batch and applying a step of gradient descent, as shown in Equation (24).

$θ_{i, j} \leftarrow \arg \min_{θ_{i, j}} \frac{1}{N} \sum {(y_{i} - Q_{θ_{i, j}} (s_{i}, a_{i}))}^{2}, j = 1, 2 .$

(24)
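To make the critic update of Equation (24) concrete, the sketch below applies one gradient-descent step to a deliberately simple critic with a single parameter, $Q_{θ} = θ x$, where $x$ is a scalar feature of the state-action pair. This linear critic, the feature values, and the learning rate are assumptions for illustration only; the critics in the proposed scheme are neural networks.

```python
def critic_loss(theta, features, targets):
    """Mean squared error (1/N) * sum (y - Q_theta)^2 with Q_theta = theta * x."""
    n = len(targets)
    return sum((y - theta * x) ** 2 for x, y in zip(features, targets)) / n


def critic_gradient_step(theta, features, targets, lr):
    """One gradient-descent step on the mean squared error."""
    n = len(targets)
    grad = sum(2.0 * (theta * x - y) * x
               for x, y in zip(features, targets)) / n
    return theta - lr * grad


# Hypothetical mini-batch: scalar features and target values y_i.
features = [1.0, 2.0, 3.0]
targets = [2.0, 4.0, 6.0]

theta = 0.0
loss_before = critic_loss(theta, features, targets)
theta = critic_gradient_step(theta, features, targets, lr=0.05)
loss_after = critic_loss(theta, features, targets)
```

In MATD3 the same step is applied independently to both critics of each agent, using the shared target $y_{i}$.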

The agent updates its actor policy $ϕ_{i}$ only once every e time steps, where e is the update interval, using one of the critic networks to estimate the Q-value of the state and the action selected by the policy. A step of gradient ascent then pushes the network parameters $ϕ_{i}$ in the direction of the steepest increase of the Q-value, as shown in Equation (25):

$\nabla_{ϕ_{i}} J (ϕ_{i}) = \frac{1}{N} \sum \nabla_{a_{i}} Q_{θ_{i, 1}} (s_{i}, a_{i}) |_{a_{i} = π_{ϕ_{i}} (s_{i})} \nabla_{ϕ_{i}} π_{ϕ_{i}} (s_{i}) .$

(25)
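The chain rule in Equation (25) and the delayed update can be illustrated with a toy problem in which both gradients are available in closed form. The policy $π_{ϕ} (s) = ϕ s$ and the critic $Q (s, a) = - (a - 2 s)^{2}$ are assumptions chosen so that the best policy parameter is known to be $ϕ = 2$; they are not part of the proposed system.

```python
def actor_gradient(phi, states):
    """Eq. (25) for pi_phi(s) = phi * s and Q(s, a) = -(a - 2s)^2."""
    grad = 0.0
    for s in states:
        a = phi * s                    # a_i = pi_phi(s_i)
        dq_da = -2.0 * (a - 2.0 * s)   # gradient of Q w.r.t. the action
        dpi_dphi = s                   # gradient of the policy w.r.t. phi
        grad += dq_da * dpi_dphi
    return grad / len(states)


phi = 0.0
learning_rate = 0.05
update_interval = 2          # plays the role of the interval e
states = [1.0, 2.0, 3.0]     # hypothetical mini-batch of scalar states

for step in range(1, 41):
    # The critics would be updated here at every step.
    if step % update_interval == 0:
        # Delayed policy update: gradient ascent on the Q-value.
        phi += learning_rate * actor_gradient(phi, states)
```

After the 20 delayed updates performed by this loop, `phi` has moved from 0 to within a small distance of 2, the maximizer of the toy critic.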

Once the actor networks are updated, the target networks are updated using Equation (26):

$\begin{matrix} ϕ_{i}^{'} \leftarrow τ ϕ_{i} + (1 - τ) ϕ_{i}^{'} \\ θ_{i, j}^{'} \leftarrow τ θ_{i, j} + (1 - τ) θ_{i, j}^{'}, j = 1, 2 \end{matrix} .$

(26)
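The soft update of Equation (26) amounts to an element-wise weighted average of the online and target parameters. The sketch below treats the parameters as flat lists of floats, which is a simplification; the function name and the value of $τ$ are hypothetical.

```python
def soft_update(target_params, online_params, tau):
    """Eq. (26): target <- tau * online + (1 - tau) * target, element-wise."""
    return [tau * w + (1.0 - tau) * w_target
            for w, w_target in zip(online_params, target_params)]


# Hypothetical parameters of an online network and its target copy.
online = [1.0, 1.0]
target = [0.0, 1.0]
target = soft_update(target, online, tau=0.1)
```

With a small $τ$ the target networks change slowly, which keeps the target value $y_{i}$ stable between critic updates.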

Through continuous training cycles until the training period t reaches T, the UAV-ISAC and STAR-RIS gradually learn the optimal strategy and eventually make decisions that maximize the ASR while meeting the system’s objective requirements. The detailed algorithm is provided in Algorithm 1.

Algorithm 1: ASR Maximization Algorithm based on MATD3

The computational complexity of the proposed MATD3 structure is analyzed as follows. Let $N$, $L_{A}$, and $L_{B}$ represent the mini-batch size, the number of layers in the deep neural network (DNN), and the size of each layer, respectively. Additionally, the UAV-ISAC agent has a state space dimension of a and an action space dimension of b, while the STAR-RIS agent has a state space dimension of c and an action space dimension of d. We assume a total of $T_{ep}$ episodes, where each agent uses $T_{st}$ steps per episode for training. Therefore, the overall time complexity can be expressed as $O (T_{ep} \cdot T_{st} \cdot N ((a + b + c + d) L_{B} + L_{A} \cdot L_{B}^{2}))$.
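To get a feeling for how the terms of this expression scale, the following sketch evaluates the quantity inside the $O (\cdot)$ for one set of values. Every number here is hypothetical and serves only to show that, for wide layers, the $L_{A} \cdot L_{B}^{2}$ term dominates the input and output term.

```python
def matd3_cost(t_ep, t_st, n, a, b, c, d, l_a, l_b):
    """Evaluate T_ep * T_st * N * ((a + b + c + d) * L_B + L_A * L_B^2)."""
    io_term = (a + b + c + d) * l_b   # input/output layers of both agents
    hidden_term = l_a * l_b ** 2      # hidden layers
    return t_ep * t_st * n * (io_term + hidden_term)


# Hypothetical settings, not taken from the simulation section.
cost = matd3_cost(t_ep=1000, t_st=100, n=64,
                  a=20, b=10, c=12, d=8, l_a=3, l_b=128)
hidden_share = (3 * 128 ** 2) / ((20 + 10 + 12 + 8) * 128 + 3 * 128 ** 2)
```

For these values the hidden-layer term accounts for roughly 88% of the per-sample cost, so the layer width affects the training time far more than the state and action dimensions do.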

Source link

Jianwei Wang www.mdpi.com
