2.1.1. Data Acquisition and Exploration Process
Knowledge acquisition from measurements made with non-destructive testing (NDT) methods can be modeled using a framework analogous to the classic CRISP-DM (Cross Industry Standard Process for Data Mining) [67,68,69]. The adapted model is an iterative process with seven stages and feedback loops.
The initial step entails a comprehensive understanding of the problem and the physical phenomena associated with the employed NDT method. This phase encompasses a thorough literature review and simulation studies.
The next step is the preparation of samples and measurement system setups.
The subsequent step focuses on data comprehension. During this phase, measurements are carried out. Attributes used to represent waveform features are defined and used to extract association rules between sample changes and corresponding changes in the acquired measurement. The rules are extracted through visualization techniques, statistical analyses, and dedicated algorithms such as Apriori [70]. Additionally, at this stage, measurement results are compared with simulation outcomes. Preliminary analyses often reveal the necessity for adjustments to the laboratory setup and may require repeated measurements, frequently necessitating multiple iterations.
Data preparation is then undertaken to facilitate the subsequent identification using machine learning. This process may involve outlier removal, normalization, quantization, discretization, curve smoothing, etc. Furthermore, this stage may necessitate revisiting prior steps. For instance, outlier removal might uncover previously obscured relationships, while curve smoothing can either enhance or obscure them (therefore, it is essential to choose appropriate tools and their configurations).
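The preparation operations named above can be illustrated with a minimal Python sketch. The function names, the median-based outlier rule, and the default thresholds are assumptions chosen for illustration; they are not taken from the study.

```python
from statistics import median

def remove_outliers(values, k=3.0):
    """Drop values farther than k scaled MADs from the median (assumed rule)."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return list(values)
    # 1.4826 scales the MAD to be comparable with a standard deviation
    return [v for v in values if abs(v - med) / (1.4826 * mad) <= k]

def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def moving_average(values, window=3):
    """Centered moving average; the window shrinks at the edges."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed
```

Applied to a sharp peak such as `[0.0, 3.0, 0.0]`, the moving average lowers the maximum from 3.0 to 1.0, which shows how smoothing can obscure a feature as easily as it can suppress noise.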
The modeling stage involves developing the identification model. At this step, attributes are modified and selected. Techniques such as aggregation, filtering, wrappers, or more advanced methodologies like rough set theory [71,72] or Principal Component Analysis (PCA) are applied for these purposes. Next, various classifiers and identification algorithms are evaluated using cross-validation or an expert system.
At the next stage, the researcher possesses a fully operational identification model. This step aims to resolve any outstanding uncertainties and validate the appropriateness of specific solutions and configurations. A negative assessment of the model necessitates a comprehensive reevaluation of all preceding stages, commencing with understanding the method and the characteristics of the object under investigation (as various influencing phenomena may need to be adequately considered).
The final step is model deployment. In this stage, new samples with unknown parameters are identified. The efficacy of the identification process is already known at this point because it was examined in the preceding stage.
Figure 4 illustrates the sequential stages of knowledge acquisition and the most frequently encountered feedback loops inherent to this model, which has been adapted to the specifics of NDT.
2.1.3. Feature Extraction
The simultaneous identification of three different parameters based on the same data is a complex and challenging issue. Various changes in the structure can exert remarkably similar effects on the measurement waveform. For instance, a change in concrete cover thickness may influence the results similarly to a simultaneous change in the diameter and type of steel.
Extensive research on magnetic and eddy current methods [7,44] has shown that a critical factor for effective identification is the method of feature extraction used to determine the attributes that describe the waveform. During the extraction process, special attention should be paid to the following limitations and conditions:
The number of attributes utilized in the model should be minimized to mitigate the curse of dimensionality.
The attributes incorporated into the model must accurately reflect the shape and parameters of the measured waveforms.
The attributes should be independent of one another and exhibit the lowest possible correlation. Dependent or highly correlated attributes exacerbate the curse of dimensionality without providing substantial valuable knowledge [7,44].
Although determining attributes that describe the waveform is essential for effective identification, this process is often undervalued. The research conducted in this domain followed an iterative approach consistent with the CRISP-DM model. Consequently, six general methods for attribute determination were considered. While each successive method was developed through enhancements of its predecessors, each possesses specific advantages and may be optimal depending on the specifics and goals of the research problem. The evolution of these methods is illustrated in Figure 5.
The simplest method to extract signal or waveform features involves dividing the domain of the independent variable into equal intervals. Amplitude values (attributes) are read at successive, equally spaced positions of the independent variable. The method’s concept is similar to resampling. The procedure is straightforward to implement, but to accurately reflect the shape of the waveform, numerous attributes must be defined. Past research on both eddy current and magnetic methods [7,44] has shown that identification accuracy improves significantly when the amplitude is determined as a separate attribute. The waveform is then normalized, and the features are extracted from the normalized curve. This sequence of actions decouples shape attributes from amplitude attributes. However, even better results are obtained by dividing the domain of the amplitude into equal intervals, that is, by segmenting the ordinate axis rather than the abscissa axis. This modification automatically increases the density of attribute sampling in regions where the waveform amplitude changes faster (proportionally to the derivative of the waveform), thereby facilitating a more accurate representation of the waveform’s shape. A drawback of this solution is that more than one abscissa value may correspond to a given amplitude value.
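The three equal-partition variants can be sketched in Python as follows. This is an illustration under stated assumptions: the waveform is a pair of lists `xs` (ascending) and `ys`, the amplitude is taken as the peak-to-peak value, linear interpolation is used between samples, and the third variant keeps only the first crossing of each level as one possible way of handling repeated abscissa values. All function names are hypothetical.

```python
def interp(x, xs, ys):
    """Linearly interpolate the waveform value at position x."""
    for i in range(len(xs) - 1):
        if xs[i] <= x <= xs[i + 1]:
            t = (x - xs[i]) / (xs[i + 1] - xs[i])
            return ys[i] + t * (ys[i + 1] - ys[i])
    raise ValueError("x lies outside the sampled range")

def equal_x_attributes(xs, ys, n):
    """Variant 1: amplitudes read at n equally spaced abscissa positions."""
    step = (xs[-1] - xs[0]) / (n - 1)
    # min() guards against floating-point overshoot at the last position
    return [interp(min(xs[0] + k * step, xs[-1]), xs, ys) for k in range(n)]

def amplitude_and_shape(xs, ys, n):
    """Variant 2: amplitude as a separate attribute, shape from the normalized curve."""
    base = min(ys)
    amplitude = max(ys) - base  # assumed definition: peak-to-peak value
    normalized = [(y - base) / amplitude for y in ys]
    return amplitude, equal_x_attributes(xs, normalized, n)

def equal_y_attributes(xs, ys, n):
    """Variant 3: abscissa positions of the first crossing of n equally spaced levels."""
    lo, hi = min(ys), max(ys)
    levels = [lo + (k + 1) * (hi - lo) / (n + 1) for k in range(n)]
    positions = []
    for level in levels:
        for i in range(len(xs) - 1):
            y0, y1 = ys[i], ys[i + 1]
            if y0 != y1 and min(y0, y1) <= level <= max(y0, y1):
                t = (level - y0) / (y1 - y0)
                positions.append(xs[i] + t * (xs[i + 1] - xs[i]))
                break  # keep only the first crossing of this level
    return positions
```

In the third variant, steep segments of the curve are crossed by several levels within a short abscissa range, so the attributes concentrate where the waveform changes fastest.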
The foremost advantage of all three methods for determining attributes through equal partitioning is their capacity to derive attributes without understanding the association rules between waveforms and structural parameters. Therefore, prior knowledge of the studied process is not a prerequisite. Consequently, these methods can be broadly applied across various contexts and are well-suited for an intermediate step in developing final identification models.
Conversely, a significant limitation of these approaches is the requirement to generate a substantial number of features to accurately represent the waveform’s shape. Many attributes can be beneficial for exploring association rules (facilitating a more nuanced understanding of the relationships between waveform parameters and structural characteristics) but also cause problems with building identification models. As the number of attributes (the problem’s dimensionality) increases, the requisite size of the database (the number of records) grows exponentially. This phenomenon is commonly referred to as the curse of dimensionality. Furthermore, in NDT investigations, it is often observed that all shape attributes are highly correlated. For these two reasons, a significant reduction in the number of attributes is required. Usually, the number should be five at most. Techniques such as aggregation, filtering, wrappers, or more advanced tools like rough set theory [71,72] and Principal Component Analysis (PCA) may be employed to achieve this reduction. However, selecting relevant attributes is inherently time-consuming and highly subjective, requiring expertise and extensive empirical testing to identify appropriate tools effectively [7,44].
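As one simple example of the filtering techniques mentioned above, highly correlated attributes can be removed with a greedy correlation filter. The sketch below is an assumption-laden illustration, not the selection procedure used in the study: the attribute table is a hypothetical dictionary mapping attribute names to columns of values, and the threshold of 0.9 is arbitrary.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant column carries no correlation information
    return cov / sqrt(var_a * var_b)

def correlation_filter(columns, threshold=0.9):
    """Keep an attribute only if |r| with every already kept attribute is below threshold."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept
```

The result depends on the order in which attributes are listed, which is one reason why such filters still require expert judgment.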
The process of determining and selecting features of the waveform typically requires a compromise between reducing the number of attributes (reduction) and accurately capturing the shape of the waveform. The method of characteristic points can be used to minimize or entirely circumvent the need for reduction. Before determining the final attributes, it is usually helpful to investigate the association rules between the waveforms and the parameters of the studied structure. This can be achieved through mathematical analysis, statistical methods, graphical representations, and algorithms for discovering association rules, such as the Apriori algorithm (market basket analysis) [70]. Then, characteristic points are defined based on the results of the analyses. The tests can show which parameters of the waveform vary in response to changes in specific structure parameters. Equal partition methods combined with attribute reduction can also help uncover association rules and identify characteristic points.
Particular emphasis is often placed on extreme values (both local and global) and inflection points, along with their coordinates. The number of attributes obtained through the method of characteristic points is typically small, rendering further reduction either straightforward or entirely unnecessary. However, it is essential to note that significant information regarding the shape of the waveform may be lost during this process (resulting in an imprecise representation). Consequently, this method is predominantly effective in standard, predetermined cases. For instance, the method of characteristic points has been applied in studies using the multi-frequency eddy current method (MMFM) [7,44].
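Extrema and inflection points can be located on a sampled waveform with finite differences. The following Python sketch assumes uniformly spaced samples and noise-free data; the function names are hypothetical, and real measurements would normally be smoothed first.

```python
def local_extrema(xs, ys):
    """Return (x, y, kind) for strict interior local maxima and minima."""
    points = []
    for i in range(1, len(ys) - 1):
        if ys[i] > ys[i - 1] and ys[i] > ys[i + 1]:
            points.append((xs[i], ys[i], "max"))
        elif ys[i] < ys[i - 1] and ys[i] < ys[i + 1]:
            points.append((xs[i], ys[i], "min"))
    return points

def inflection_points(xs, ys):
    """Return abscissa estimates where the discrete second difference changes sign."""
    # second[j] is the second difference centered on sample j + 1
    second = [ys[i - 1] - 2 * ys[i] + ys[i + 1] for i in range(1, len(ys) - 1)]
    result = []
    for j in range(len(second) - 1):
        if second[j] * second[j + 1] < 0:
            # the sign change lies between samples j + 1 and j + 2; report the midpoint
            result.append((xs[j + 1] + xs[j + 2]) / 2)
    return result
```

The coordinates and values returned by both functions can be used directly as attributes, which keeps their number small.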
The approximation method can be considered an enhancement and extension of the characteristic points method. This approach sometimes allows a given curve (waveform or signal) to be described with high precision using only a few parameters. Attributes are generated by fitting an approximating function to the waveform and determining the parameters of that function (the parameter values are taken as the attribute values). This method proves particularly effective when a theoretical description of the process is known or a specific function is known to fit the observed curve. Using universal solutions, such as polynomial functions, often leads to inadequate shape representation or to too many attributes, which in turn requires further reduction.
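To illustrate how fitted parameters become attributes, the sketch below fits the model y = a·exp(b·x) by linear least squares on ln(y). The exponential model is purely an assumption chosen because it has a closed-form fit; the source does not state which functions were used, and the function name is hypothetical.

```python
from math import exp, log

def fit_exponential(xs, ys):
    """Fit y = a * exp(b * x) by least squares on ln(y); requires all y > 0."""
    n = len(xs)
    logs = [log(y) for y in ys]
    mean_x = sum(xs) / n
    mean_l = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (l - mean_l) for x, l in zip(xs, logs))
    b = sxy / sxx                   # slope of the log-linear regression
    a = exp(mean_l - b * mean_x)    # intercept transformed back to the original scale
    return a, b
```

The pair (a, b) then replaces the whole sampled curve as its attribute vector. Fitting in the logarithmic domain weights small values more heavily than a direct fit would, so it is a convenience rather than a general recommendation.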
ACO (amplitude–correlation–offset) decomposition is a method (presented in [7]) that requires neither prior knowledge of the studied process nor an extensive attribute processing and selection stage. The method uses ACO attributes created by comparing the parameters of the tested and the reference waveforms. The created predictors are entirely independent of each other. Defining amplitude (A) and offset (O) is usually straightforward. However, accurately reflecting the shape of the waveform presents a more complex challenge, as attributes of this nature can exhibit considerable variability and must be defined explicitly for each study. Building upon insights derived from studies on the approximation method, ACO decomposition characterizes the shape of the investigated waveform through a comparative analysis with a reference curve. This process yields a single universal shape parameter, the cross-correlation (C), thereby enhancing the versatility of the ACO method.
Attributes A and O are essential for effective identification. Parameter C facilitates the detection of various anomalies, including noise, outliers, and completely erroneous measurements (method error). The ACO decomposition retains nearly all the advantages of the approximation method and minimizes the number of features. Moreover, this method obviates the need to define a function that accurately reflects the waveform.
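The idea of describing a tested waveform relative to a reference can be sketched in Python. The exact definitions of A, C, and O are given in [7] and are not reproduced here, so the formulas below are assumptions made only for illustration: A is taken as the ratio of peak-to-peak amplitudes, O as the difference of mean levels, and C as the zero-lag normalized cross-correlation (Pearson coefficient). The function name is hypothetical.

```python
from math import sqrt

def aco_attributes(tested, reference):
    """Return (A, C, O) of a tested waveform relative to a reference (assumed definitions)."""
    if len(tested) != len(reference):
        raise ValueError("waveforms must have the same number of samples")
    n = len(tested)
    mean_t = sum(tested) / n
    mean_r = sum(reference) / n
    # A: ratio of peak-to-peak amplitudes (assumption)
    amplitude = (max(tested) - min(tested)) / (max(reference) - min(reference))
    # O: difference of mean levels (assumption)
    offset = mean_t - mean_r
    # C: zero-lag normalized cross-correlation, insensitive to scale and shift
    num = sum((t - mean_t) * (r - mean_r) for t, r in zip(tested, reference))
    den = sqrt(sum((t - mean_t) ** 2 for t in tested)
               * sum((r - mean_r) ** 2 for r in reference))
    correlation = num / den
    return amplitude, correlation, offset
```

Under these assumptions, a tested waveform that is only a scaled and shifted copy of the reference gives C close to 1, while noise or a gross measurement error lowers C, which matches the anomaly-detection role described for parameter C.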
A significant advantage of this method is its universality. Similarly to equal partition methods, it can be applied in nearly any situation (including as an intermediate step in constructing the target model). The only prerequisite is the availability of a reference measurement. Furthermore, ACO can serve both as an attribute extraction method in the classical identification process (where the studied waveform is matched to the most similar class from the training database) and as a pattern recognition method (in which extensive training databases are not required).
In the case of investigating reinforced concrete structures using electromagnetic NDT methods, obtaining a representative database can be challenging or even impossible. The parameters associated with reinforcement are standardized solely in terms of mechanical properties rather than electromagnetic characteristics. The precise composition of reinforcing steel is typically proprietary information maintained by the manufacturer. Consequently, similar reinforcement bars from different producers may yield divergent results in magnetic testing. Additionally, variations in additives incorporated into concrete can further influence these outcomes. Therefore, a database compiled for the specific examination may prove ineffective when applied to other constructions.
ACO decomposition is a referential method. Therefore, a comprehensive database is not required; a reference measurement suffices. This reference measurement can be obtained from a specially prepared sample or at a location where it has been confirmed that the object’s parameters and materials align with the established design specifications. The acceptance test can be conducted by assessing how significantly a given measurement diverges from the reference measurement.
2.1.4. Extraction of Association Rules
Investigating the relationship between the physical parameters of the structure and the parameters of the waveform obtained as a result of the structure testing is a process that is difficult to standardize. Typically, the investigation begins with measurement analysis using various kinds of charts. The next step often involves determining the coordinates and values of the extremes and inflection points, followed by their description using selected statistics. A boxplot can be used for this purpose (an example of a boxplot is presented in Figure 6).
One of the most valuable tools for uncovering the relationships between physical parameters and the measurement waveform is an algorithm designed explicitly for association rule mining, such as the Apriori algorithm. While a larger number of attributes may not be advantageous in the identification process, it facilitates a more comprehensive examination of the association rules and their statistical characterization. Ultimately, this deeper analysis contributes to an enhanced understanding of the underlying processes and aids in developing a more effective identification model [73,74,75,76,77].
To investigate association rules, it is necessary to create a database containing comparisons of records in which structures differ by only one parameter (with the smallest possible change). The detected dependencies can belong to three classes: increase (↑), decrease (↓), or no change (-). In general terms, a rule is expressed as follows (1):
BODY (A) refers to the change in a specific parameter of the structure, while HEAD (B) denotes the change in a specific parameter of the waveform. The quality of the rule is characterized by two percentage indicators: support and confidence.
Support is defined as the ratio of the number of records in which the rule occurs to the total number of records in the database (D), which is expressed by Equation (2):
The term confidence quantifies the conditional probability that the entire rule will be observed, contingent upon the presence of the BODY (A). This parameter can be mathematically represented as follows (3):
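In the standard notation of association rule mining, and consistent with the verbal definitions given here, Equations (1) to (3) can be written as below. This is a reconstruction under assumptions: the notation |·| for the number of records and the factor of 100% (suggested by the description of both indicators as percentages) are not taken from the source.

```latex
\begin{align}
  A &\Rightarrow B \tag{1} \\
  \mathrm{support}(A \Rightarrow B) &=
    \frac{\left|\{\, r \in D : A \text{ and } B \text{ both occur in } r \,\}\right|}{|D|}
    \cdot 100\% \tag{2} \\
  \mathrm{confidence}(A \Rightarrow B) &=
    \frac{\left|\{\, r \in D : A \text{ and } B \text{ both occur in } r \,\}\right|}
         {\left|\{\, r \in D : A \text{ occurs in } r \,\}\right|}
    \cdot 100\% \tag{3}
\end{align}
```

For example, if the BODY occurs in 20 of 100 records and the whole rule in 15 of them, the support is 15% and the confidence is 75%.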
The method employed for discovering association rules resembles the classical Apriori algorithm. It is predicated on the observation that interesting rules are derived exclusively from frequent item sets, defined as those for which the support exceeds the minimum support threshold established by the researcher. However, three significant modifications were introduced compared to the classical algorithm:
Set A encompassed only the physical parameters of the structure, while Set B consisted solely of attributes that described the examined waveform.
The length of the BODY was constrained to a single element from set A.
Depending on the architecture of the database, it was permissible to omit support as a limiting criterion.
The implementation of these modifications resulted in the exclusion of a substantial majority of generated rules, thereby retaining only those that elucidated the relationships between variations in specific structural parameters and the resultant changes they induced [73,74,75,76,77]. The algorithm utilized for deriving association rules is presented in Figure 7.
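The effect of the three modifications can be illustrated with a short Python sketch. It is not the algorithm of Figure 7: the record layout, the function name, and the default thresholds are assumptions, the change classes are encoded as "up", "down", and "none" (for ↑, ↓, and -), and candidate rules are simply enumerated rather than pruned level by level as in classical Apriori.

```python
from collections import Counter
from itertools import combinations

def mine_rules(records, min_confidence=0.8, min_support=None, max_head_len=1):
    """Mine rules with a single-item BODY from set A and a HEAD drawn from set B.

    records: list of (body_item, head_items) pairs, where body_item is a
    (structure_parameter, change) tuple and head_items is a set of
    (waveform_attribute, change) tuples.
    """
    total = len(records)
    body_counts = Counter()
    rule_counts = Counter()
    for body, heads in records:
        body_counts[body] += 1
        for size in range(1, max_head_len + 1):
            for head in combinations(sorted(heads), size):
                rule_counts[(body, head)] += 1
    rules = []
    for (body, head), count in rule_counts.items():
        support = count / total
        confidence = count / body_counts[body]
        # support is an optional criterion (third modification)
        if min_support is not None and support < min_support:
            continue
        if confidence >= min_confidence:
            rules.append((body, head, support, confidence))
    return rules
```

Because every record compares two structures that differ in a single parameter, each record contributes exactly one BODY item, so the rules that survive relate one structural change to the waveform changes it induces.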
The modifications substantially reduced the number of potential combinations, from $2^{X} - 1$ to $\#A \times 2^{\#B} - 1$, where $X = \#A + \#B$ represents the total number of parameters. This reduction enhances the algorithm’s efficiency by limiting the search space for association rules.