Unified Normalizing Flow for Unsupervised Multi-Class Anomaly Detection


Consider an observed data variable $x \in \mathcal{X}$ whose true distribution is denoted as $p_X(x)$. Let $p_Z$ be a simple prior distribution, also termed the target distribution, and let $f: \mathcal{X} \rightarrow \mathcal{Z}$ be a bijective transformation such that $z = f(x) \sim p_Z(z)$. By the change-of-variables formula [52], we can deduce the following:

$$p_X(x) = p_Z(f(x)) \left| \det \frac{\partial f(x)}{\partial x} \right|,$$

where $\frac{\partial f(x)}{\partial x}$ is the Jacobian of $f$ at $x$ and $\left|\det \frac{\partial f(x)}{\partial x}\right|$ is the absolute value of its determinant. Thus, the key to learning the distribution of the variable $X$ lies in finding a bijective transformation that satisfies the aforementioned mapping while its Jacobian determinant is tractable. Normalizing flow seeks such a transformation through the composition of a series of base transformations. These base transformations are invertible and have tractable Jacobian determinants, ensuring that the final composite transformation retains these desirable properties. Suppose $f_0, f_1, \ldots, f_{k-1}$ are $k$ base transformations and $f$ is defined as their composition; then, by the chain rule, we have:

$$\frac{\partial f(x)}{\partial x} = \frac{\partial f_0(x)}{\partial x} \times \frac{\partial f_1(f_0(x))}{\partial (f_0(x))} \times \cdots \times \frac{\partial f_{k-1}(f_{k-2} \circ f_{k-3} \circ \cdots \circ f_0(x))}{\partial (f_{k-2} \circ f_{k-3} \circ \cdots \circ f_0(x))},$$

$$\det \frac{\partial f(x)}{\partial x} = \det \frac{\partial f_0(x)}{\partial x} \times \det \frac{\partial f_1(f_0(x))}{\partial (f_0(x))} \times \cdots \times \det \frac{\partial f_{k-1}(f_{k-2} \circ f_{k-3} \circ \cdots \circ f_0(x))}{\partial (f_{k-2} \circ f_{k-3} \circ \cdots \circ f_0(x))},$$

where $\circ$ represents composition. Let $\theta$ be the parameters of $f$, which are learned to transform $X$ to $Z$, and let $p_\theta(x)$ denote the estimated distribution with $x = f^{-1}(z; \theta)$. We employ the forward KL divergence to measure the distance between the true distribution $p_X(x)$ and $p_\theta(x)$. With Formula (1), the optimization objective for $\theta$ can then be written as follows:

$$D_{KL}\left(p_X(x) \,\|\, p_\theta(x)\right) = -\mathbb{E}_{x \sim p_X(x)}\left[\log p_\theta(x)\right] + \mathbb{E}_{x \sim p_X(x)}\left[\log p_X(x)\right] = -\mathbb{E}_{x \sim p_X(x)}\left[\log\left(p_Z(f(x;\theta)) \left|\det \frac{\partial f(x;\theta)}{\partial x}\right|\right)\right] + \mathrm{const}.$$
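Because the second expectation does not depend on $\theta$, minimizing this divergence amounts to maximizing the expected log-likelihood, which is tractable precisely because the log-determinant of the composed transformation decomposes into a sum of per-layer log-determinants. The following minimal Python sketch (a toy illustration with scalar scaling layers, not this paper's model; the function names are ours) shows how that sum is accumulated when base transformations are composed:

```python
import torch

def scale_layer(x, s):
    # Elementwise scaling y = s * x: the Jacobian is diagonal with entries s,
    # so log|det| = D * log|s| for a D-dimensional input.
    y = s * x
    log_det = x.shape[-1] * torch.log(torch.as_tensor(abs(s)))
    return y, log_det

def compose(x, scales):
    # Apply base transformations in sequence, summing their log-determinants.
    total_log_det = torch.zeros(())
    for s in scales:
        x, log_det = scale_layer(x, s)
        total_log_det = total_log_det + log_det
    return x, total_log_det

x = torch.randn(4, 8)                      # a batch of 4 samples, 8 dims each
z, log_det = compose(x, [2.0, 0.5, 3.0])
# log_det == 8 * (log 2 + log 0.5 + log 3), i.e. the sum of per-layer terms.
```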

NICE [50] and RealNVP [51] design such base transformations with so-called coupling layers for image generation. The transformations derived from coupling layers yield triangular Jacobian matrices, whose determinants are the products of their diagonal elements. Next, we introduce two types of coupling layers used for anomaly detection tasks. Let $x \in \mathbb{R}^{C \times H \times W}$ be the extracted feature map, where $C$ denotes the number of channels and $H \times W$ indicates the spatial size. Given $d < C$, we split $x$ along the channel dimension into two parts, $x_{1:d}$ and $x_{d+1:C}$. The output $y$ of a coupling layer is divided along the channel dimension into two components, $y_{1:d}$ and $y_{d+1:C}$, in the same way as the input $x$. Coupling layers are classified into additive and affine coupling layers according to the distinct coupling laws applied to $x_{1:d}$ and $x_{d+1:C}$.
For additive coupling layers, the two parts, $y_{1:d}$ and $y_{d+1:C}$, of $y$ are obtained as follows:

$$y_{1:d} = x_{1:d}, \qquad y_{d+1:C} = x_{d+1:C} + t(x_{1:d}),$$

where the transformation, t, is learned via neural networks such as convolutional networks.
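As a concrete sketch, an additive coupling layer might look like the following minimal PyTorch illustration. It operates on flattened $C$-dimensional feature vectors rather than full $C \times H \times W$ maps, and the class name, MLP subnetwork, and layer sizes are our assumptions rather than the paper's implementation:

```python
import torch
import torch.nn as nn

class AdditiveCoupling(nn.Module):
    """Minimal additive coupling: y_{1:d} = x_{1:d},
    y_{d+1:C} = x_{d+1:C} + t(x_{1:d})."""

    def __init__(self, num_channels: int, d: int, hidden: int = 128):
        super().__init__()
        self.d = d
        # t(.) maps the first d channels to a translation for the remaining ones.
        self.t = nn.Sequential(
            nn.Linear(d, hidden), nn.ReLU(),
            nn.Linear(hidden, num_channels - d),
        )

    def forward(self, x):
        x1, x2 = x[:, :self.d], x[:, self.d:]
        y2 = x2 + self.t(x1)
        # The Jacobian is triangular with unit diagonal, so log|det| = 0.
        log_det = x.new_zeros(x.shape[0])
        return torch.cat([x1, y2], dim=1), log_det

    def inverse(self, y):
        y1, y2 = y[:, :self.d], y[:, self.d:]
        return torch.cat([y1, y2 - self.t(y1)], dim=1)
```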

For affine coupling layers, the two components, $y_{1:d}$ and $y_{d+1:C}$, of $y$ are derived as follows:

$$y_{1:d} = x_{1:d}, \qquad y_{d+1:C} = \exp\left(s(x_{1:d})\right) \odot x_{d+1:C} + t(x_{1:d}),$$

where $s$ and $t$ represent the scale and translation transformations, respectively, and $\odot$ denotes the element-wise product. As before, the transformations $s$ and $t$ are typically learned via convolutional networks.
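Under the same assumptions as the additive sketch above (flattened features, an MLP subnetwork in place of the convolutional $s$ and $t$, names of our choosing), an affine coupling layer could be sketched as:

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Minimal affine coupling: y_{1:d} = x_{1:d},
    y_{d+1:C} = exp(s(x_{1:d})) * x_{d+1:C} + t(x_{1:d})."""

    def __init__(self, num_channels: int, d: int, hidden: int = 128):
        super().__init__()
        self.d = d
        out = num_channels - d
        # One network predicts both the log-scale s(x_{1:d}) and the shift t(x_{1:d}).
        self.net = nn.Sequential(
            nn.Linear(d, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * out),
        )

    def forward(self, x):
        x1, x2 = x[:, :self.d], x[:, self.d:]
        s, t = self.net(x1).chunk(2, dim=1)
        y2 = torch.exp(s) * x2 + t
        # Triangular Jacobian: log|det| is the sum of the diagonal log-scales s.
        log_det = s.sum(dim=1)
        return torch.cat([x1, y2], dim=1), log_det

    def inverse(self, y):
        y1, y2 = y[:, :self.d], y[:, self.d:]
        s, t = self.net(y1).chunk(2, dim=1)
        return torch.cat([y1, (y2 - t) * torch.exp(-s)], dim=1)
```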

Normalizing flow models can be constructed by stacking a series of coupling layers. In anomaly detection tasks, the target distribution is generally set to a multivariate standard normal distribution, as is the case in this study. Suppose the batch size is $N$; then, with Formula (4), the loss function for normalizing flow models is as follows:

$$\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left[ \frac{\left\| z_i \right\|_2^2}{2} - \log \left| \det \frac{\partial f(x_i;\theta)}{\partial x_i} \right| \right],$$

where $z_i = f(x_i;\theta)$ is the transformed representation of the $i$-th sample in the batch.
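A minimal sketch of this loss, assuming a `flow` module (for instance, a stack of the coupling layers sketched above) that returns $z_i = f(x_i;\theta)$ together with the per-sample $\log\left|\det \partial f(x_i;\theta)/\partial x_i\right|$:

```python
import torch

def flow_nll_loss(flow, x):
    # z: (N, D) latent codes; log_det: (N,) per-sample log|det| of the Jacobian.
    z, log_det = flow(x)
    z = z.flatten(start_dim=1)
    # Per-sample negative log-likelihood under a standard normal target
    # (dropping the constant term), averaged over the batch.
    return (0.5 * z.pow(2).sum(dim=1) - log_det).mean()
```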


