Unified Normalizing Flow for Unsupervised Multi-Class Anomaly Detection


Consider an observed data variable $x \in \mathcal{X}$ whose true distribution is denoted as $p_X(x)$. Let $p_Z$ be a simple prior distribution, also termed the target distribution, and let $f: \mathcal{X} \rightarrow \mathcal{Z}$ be a bijective transformation such that $z = f(x) \sim p_Z(z)$. By the change-of-variables formula [52], we can deduce the following:

$$p_X(x) = p_Z(f(x)) \left| \det \frac{\partial f(x)}{\partial x} \right|,$$

where $\frac{\partial f(x)}{\partial x}$ is the Jacobian of $f$ at $x$ and $\left|\det \frac{\partial f(x)}{\partial x}\right|$ is the absolute value of its determinant. Thus, the key to learning the distribution of the variable $X$ lies in finding a bijective transformation that satisfies the aforementioned mapping while its Jacobian determinant is tractable. Normalizing flow seeks such a transformation through the composition of a series of base transformations. These base transformations are invertible and have tractable Jacobian determinants, ensuring that the final composite transformation retains these desirable properties. Suppose $f_0, f_1, \ldots, f_{k-1}$ are $k$ base transformations and $f$ is defined as their composition; then, by the chain rule, we have:

$$\frac{\partial f(x)}{\partial x} = \frac{\partial f_0(x)}{\partial x} \times \frac{\partial f_1(f_0(x))}{\partial (f_0(x))} \times \cdots \times \frac{\partial f_{k-1}(f_{k-2} \circ f_{k-3} \circ \cdots \circ f_0(x))}{\partial (f_{k-2} \circ f_{k-3} \circ \cdots \circ f_0(x))},$$

$$\det \frac{\partial f(x)}{\partial x} = \det \frac{\partial f_0(x)}{\partial x} \times \det \frac{\partial f_1(f_0(x))}{\partial (f_0(x))} \times \cdots \times \det \frac{\partial f_{k-1}(f_{k-2} \circ f_{k-3} \circ \cdots \circ f_0(x))}{\partial (f_{k-2} \circ f_{k-3} \circ \cdots \circ f_0(x))},$$

where $\circ$ represents composition. Let $\theta$ be the parameters of $f$, which are learned to transform $X$ to $Z$, and let $p_\theta(x)$ denote the estimated distribution with $x = f^{-1}(z; \theta)$. We employ the forward KL divergence to measure the distance between the true distribution $p_X(x)$ and $p_\theta(x)$. With Formula (1), the optimization objective for $\theta$ can then be written as follows:

$$D_{KL}\left(p_X(x) \,\|\, p_\theta(x)\right) = -\mathbb{E}_{x \sim p_X(x)}\left[\log p_\theta(x)\right] + \mathbb{E}_{x \sim p_X(x)}\left[\log p_X(x)\right] = -\mathbb{E}_{x \sim p_X(x)}\left[\log\left(p_Z(f(x;\theta)) \left|\det \frac{\partial f(x;\theta)}{\partial x}\right|\right)\right] + \mathrm{const}.$$
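Because the second expectation does not depend on $\theta$, minimizing this divergence amounts to maximizing the expected log-likelihood, which is tractable precisely because the log-determinant of the composed transformation decomposes into a sum of per-layer log-determinants. The following minimal Python sketch (a toy illustration with scalar scaling layers, not this paper's model; the function names are ours) shows how that sum is accumulated when base transformations are composed:

```python
import torch

def scale_layer(x, s):
    # Elementwise scaling y = s * x: the Jacobian is diagonal with entries s,
    # so log|det| = D * log|s| for a D-dimensional input.
    y = s * x
    log_det = x.shape[-1] * torch.log(torch.as_tensor(abs(s)))
    return y, log_det

def compose(x, scales):
    # Apply base transformations in sequence, summing their log-determinants.
    total_log_det = torch.zeros(())
    for s in scales:
        x, log_det = scale_layer(x, s)
        total_log_det = total_log_det + log_det
    return x, total_log_det

x = torch.randn(4, 8)                      # a batch of 4 samples, 8 dims each
z, log_det = compose(x, [2.0, 0.5, 3.0])
# log_det == 8 * (log 2 + log 0.5 + log 3), i.e. the sum of per-layer terms.
```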

NICE [50] and RealNVP [51] design such base transformations with so-called coupling layers for image generation. The transformations derived from coupling layers yield triangular Jacobian matrices, whose determinants are the products of their diagonal elements. Next, we introduce two types of coupling layers used for anomaly detection tasks. Let $x \in \mathbb{R}^{C \times H \times W}$ be the extracted feature map, where $C$ denotes the number of channels and $H \times W$ indicates the spatial size. Given $d < C$, we split $x$ along the channel dimension into two parts, $x_{1:d}$ and $x_{d+1:C}$. The output $y$ of a coupling layer is divided along the channel dimension into two components, $y_{1:d}$ and $y_{d+1:C}$, in the same way as the input $x$. Coupling layers are classified into additive and affine coupling layers according to the distinct coupling laws applied to $x_{1:d}$ and $x_{d+1:C}$.
For additive coupling layers, the two parts, $y_{1:d}$ and $y_{d+1:C}$, of $y$ are obtained as follows:

$$y_{1:d} = x_{1:d}, \qquad y_{d+1:C} = x_{d+1:C} + t(x_{1:d}),$$

where the transformation, t, is learned via neural networks such as convolutional networks.
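As a concrete sketch, an additive coupling layer might look like the following minimal PyTorch illustration. It operates on flattened $C$-dimensional feature vectors rather than full $C \times H \times W$ maps, and the class name, MLP subnetwork, and layer sizes are our assumptions rather than the paper's implementation:

```python
import torch
import torch.nn as nn

class AdditiveCoupling(nn.Module):
    """Minimal additive coupling: y_{1:d} = x_{1:d},
    y_{d+1:C} = x_{d+1:C} + t(x_{1:d})."""

    def __init__(self, num_channels: int, d: int, hidden: int = 128):
        super().__init__()
        self.d = d
        # t(.) maps the first d channels to a translation for the remaining ones.
        self.t = nn.Sequential(
            nn.Linear(d, hidden), nn.ReLU(),
            nn.Linear(hidden, num_channels - d),
        )

    def forward(self, x):
        x1, x2 = x[:, :self.d], x[:, self.d:]
        y2 = x2 + self.t(x1)
        # The Jacobian is triangular with unit diagonal, so log|det| = 0.
        log_det = x.new_zeros(x.shape[0])
        return torch.cat([x1, y2], dim=1), log_det

    def inverse(self, y):
        y1, y2 = y[:, :self.d], y[:, self.d:]
        return torch.cat([y1, y2 - self.t(y1)], dim=1)
```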

For affine coupling layers, the two components, $y_{1:d}$ and $y_{d+1:C}$, of $y$ are derived as follows:

$$y_{1:d} = x_{1:d}, \qquad y_{d+1:C} = \exp\left(s(x_{1:d})\right) \odot x_{d+1:C} + t(x_{1:d}),$$

where $s$ and $t$ represent the scale and translation transformations, respectively, and $\odot$ denotes the element-wise product. As before, the transformations $s$ and $t$ are typically learned via convolutional networks.
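Under the same assumptions as the additive sketch above (flattened features, an MLP subnetwork in place of the convolutional $s$ and $t$, names of our choosing), an affine coupling layer could be sketched as:

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Minimal affine coupling: y_{1:d} = x_{1:d},
    y_{d+1:C} = exp(s(x_{1:d})) * x_{d+1:C} + t(x_{1:d})."""

    def __init__(self, num_channels: int, d: int, hidden: int = 128):
        super().__init__()
        self.d = d
        out = num_channels - d
        # One network predicts both the log-scale s(x_{1:d}) and the shift t(x_{1:d}).
        self.net = nn.Sequential(
            nn.Linear(d, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * out),
        )

    def forward(self, x):
        x1, x2 = x[:, :self.d], x[:, self.d:]
        s, t = self.net(x1).chunk(2, dim=1)
        y2 = torch.exp(s) * x2 + t
        # Triangular Jacobian: log|det| is the sum of the diagonal log-scales s.
        log_det = s.sum(dim=1)
        return torch.cat([x1, y2], dim=1), log_det

    def inverse(self, y):
        y1, y2 = y[:, :self.d], y[:, self.d:]
        s, t = self.net(y1).chunk(2, dim=1)
        return torch.cat([y1, (y2 - t) * torch.exp(-s)], dim=1)
```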

Normalizing flow models can be constructed by stacking a series of coupling layers. In anomaly detection tasks, the target distribution is generally set to a multivariate standard normal distribution, as is the case in this study. Suppose the batch size is $N$; then, with Formula (4), the loss function for normalizing flow models is as follows:

$$\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left[ \frac{\left\| z_i \right\|_2^2}{2} - \log \left| \det \frac{\partial f(x_i;\theta)}{\partial x_i} \right| \right],$$

where $z_i = f(x_i;\theta)$ is the transformed representation of the $i$-th sample in the batch.
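A minimal sketch of this loss, assuming a `flow` module (for instance, a stack of the coupling layers sketched above) that returns $z_i = f(x_i;\theta)$ together with the per-sample $\log\left|\det \partial f(x_i;\theta)/\partial x_i\right|$:

```python
import torch

def flow_nll_loss(flow, x):
    # z: (N, D) latent codes; log_det: (N,) per-sample log|det| of the Jacobian.
    z, log_det = flow(x)
    z = z.flatten(start_dim=1)
    # Per-sample negative log-likelihood under a standard normal target
    # (dropping the constant term), averaged over the batch.
    return (0.5 * z.pow(2).sum(dim=1) - log_det).mean()
```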


