1. Introduction
2. Methods
This section provides comprehensive descriptions of the methodology workflow employed in this paper. It encompasses the design of individual modules and sub-modules in the process, along with the design of a data pipeline that facilitates communication between these modules.
2.1. Technical Roadmap and Knowledge Graph Construction Process
The construction of a knowledge graph is an iterative process that requires continuous collection, integration, cleansing, and updating of knowledge to improve and enrich the content and quality of the knowledge graph. Meanwhile, technologies from fields such as artificial intelligence, machine learning, and natural language processing are integrated to enhance the efficiency and accuracy of knowledge graph construction. The detailed steps involved in constructing a knowledge graph are as follows:
Step 1: Requirement analysis: Clearly define the goals and requirements for knowledge graph construction, determine the domain scope and knowledge coverage, and gain an understanding of the intended use and functionality of the knowledge graph.
Step 2: Data acquisition: Collect data related to domain knowledge from various sources, including structured data such as databases and tables, as well as unstructured data such as text documents, papers, and webpages.
Step 3: Data preprocessing: Cleanse and preprocess the collected data, which may involve noise removal, format standardization, deduplication, and redundancy elimination.
Step 4: Knowledge extraction: Extract knowledge and relevant information from the preprocessed data, utilizing techniques like natural language processing for tasks such as named entity recognition, relation extraction, and event identification.
Step 5: Knowledge representation: Convert the extracted knowledge and information into a structured form for storage and querying within the knowledge graph. Techniques such as ontology modeling using languages like OWL or RDF can be employed to define concepts, properties, and relationships (a minimal representation sketch follows this list of steps).
Step 6: Knowledge fusion: Integrate knowledge from different data sources and eliminate conflicts and duplicates, ensuring a consistent representation. Techniques like consistency checking and conflict resolution can be employed for handling inconsistent knowledge.
Step 7: Knowledge storage: Store the constructed knowledge graph in an appropriate knowledge graph storage repository to support subsequent queries, retrieval, and analysis. Commonly used technologies include graph databases and semantic repositories.
Step 8: Knowledge inference: Utilize inference engines to reason and infer knowledge within the knowledge graph, enabling comprehensive and in-depth knowledge discovery. Inference can be based on ontological rules, logical reasoning, and other approaches.
Step 9: Application development: Develop applications and tools based on the constructed knowledge graph to support functions such as knowledge querying, recommendation, and question answering. This may involve utilizing techniques such as natural language processing, machine learning, and data mining.
Step 10: Continuous updating: The knowledge graph requires regular updating and maintenance to accommodate new knowledge, updated data, and changing business requirements. This may include monitoring new literature publications and updates in statistical data, among other sources.
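To make Step 5 more concrete, the following is a minimal sketch, assuming Python and the rdflib library, of how an extracted fact could be represented as RDF triples; the namespace, class, and property names are illustrative assumptions rather than the ontology terms used in this paper.

```python
# A minimal sketch of Step 5 (knowledge representation) using rdflib.
# The namespace, class, and property names are illustrative assumptions.
from rdflib import Graph, Namespace, Literal, RDF, RDFS

EX = Namespace("http://example.org/geo#")

g = Graph()
g.bind("ex", EX)

# Concept (class) and an instance of that concept
g.add((EX.IronOreDeposit, RDF.type, RDFS.Class))
g.add((EX.GongchanglingIronMine, RDF.type, EX.IronOreDeposit))

# Property linking the instance to an attribute value (value is illustrative)
g.add((EX.GongchanglingIronMine, EX.tectonicLocation, Literal("North China Craton")))

print(g.serialize(format="turtle"))
```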
The “pipeline” is an approach that breaks down tasks into multiple independent steps and completes them in a specific order. Analogous to a physical production pipeline, each step has a specific function and role, enabling the entire process to operate efficiently. In the context of automating the construction of an Earth Science knowledge graph, the “pipeline” approach can divide the knowledge graph construction process into distinct stages. By assigning explicit tasks to each stage, this approach improves work efficiency, reduces repetitive labor, and provides a systematic and consistent framework for building an Earth Science knowledge graph.
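As a rough illustration of the pipeline idea, rather than the implementation used in this work, the sketch below chains hypothetical stage functions in a fixed order so that each stage's output becomes the next stage's input.

```python
# A rough sketch of the pipeline idea: independent stages executed in a fixed
# order, each consuming the previous stage's output. The stage functions are
# hypothetical placeholders, not the implementation described in the paper.
from typing import Any, Callable, List

def acquire(_: Any) -> list:          # cf. Step 2: data acquisition
    return ["raw document 1", "raw document 2"]

def preprocess(docs: list) -> list:   # cf. Step 3: cleansing and deduplication
    return sorted(set(d.strip() for d in docs))

def extract(docs: list) -> list:      # cf. Step 4: knowledge extraction
    return [("entity", "relation", "entity") for _ in docs]

def fuse(triples: list) -> list:      # cf. Step 6: knowledge fusion (deduplication)
    return list(dict.fromkeys(triples))

def run_pipeline(stages: List[Callable[[Any], Any]]) -> Any:
    data = None
    for stage in stages:
        data = stage(data)            # each stage feeds the next
    return data

if __name__ == "__main__":
    print(run_pipeline([acquire, preprocess, extract, fuse]))
```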
2.2. Module Design
2.2.1. Human-Assisted Module
The quality of the corpus profoundly impacts deep learning. Thus, it is essential to create a professional, high-quality corpus that aligns with domain-specific characteristics; this serves as a critical foundation for knowledge extraction and the construction of a knowledge graph. To address the domain-specific characteristics of various professional fields, it is necessary to establish a constraint framework that encompasses professional content. During knowledge graph construction, this constraint framework is usually realized through the development of a domain ontology. Therefore, in the design of the human-assisted module, the construction of the domain ontology holds the utmost importance. The human-assisted module is divided into two sub-modules: domain ontology construction and corpus construction. Depending on the specific circumstances, these sub-modules can be further divided into seven functional sub-modules: ontology entity modeling, ontology semantic relation modeling, ontology modeling, data collection, unstructured data extraction and cleansing, corpus annotation, and corpus format transformation. These functional sub-modules form the core of the human-assisted module.
The structure of the ontology is carefully defined, encompassing concepts, properties, and relationships to illustrate the associations and unique characteristics between ideas. Such definitions are commonly expressed using ontology description languages, enabling the description of inclusion relationships, property values, and related connections. The ontology is then subjected to validation and revision through communication with domain experts in geology. Experts typically follow these steps when validating a domain ontology:
Domain Expert Involvement: Domain experts play a vital role in the validation of domain ontologies. They possess extensive knowledge and familiarity with the domain, enabling them to identify potential issues within the ontology and provide feedback.
Exploration and Evaluation: Domain experts and the ontology construction team collaboratively explore and assess the accuracy and consistency of the ontology. This involves a thorough examination of the various levels and concepts within the ontology to ensure their alignment with actual domain knowledge.
Real-World Application Testing: To validate the applicability of the ontology in practical scenarios, it can be integrated into relevant applications or systems and undergo real-world testing. Domain experts and users can interact with the system to examine the utility and effectiveness of the ontology.
The ontology can be formalized as GOnto = {GCon, GProp, GRel, GRul, GIns}. Within this equation, GOnto represents the ontology specifically designed for the domain of Earth Science. GCon corresponds to the conceptual framework that encompasses Earth Science. GProp encapsulates the diverse properties associated with Earth Science, encompassing mineralization timing, tectonic location, geological structures, spatial distribution, and scale. GRel reflects the intricate relationships among entities within the Earth Science domain, including instance relationships between entities and instances, as well as associations between instances and properties. GRul encompasses the essential rules that establish constraints on the types and combinations of concepts and instances during the construction of the Earth Science ontology. GIns represents the mapping mechanism that links concepts to instances, providing concrete instantiations of entities derived from the conceptual framework.
Once the construction of the ontology is complete, standard Semantic Web technologies, including SPARQL queries and reasoning engines, can be employed to facilitate the application of the ontology. Furthermore, regular maintenance and updates of the domain ontology should be conducted over time to effectively capture the evolving domain knowledge. These maintenance and updates may encompass tasks such as incorporating new instances, refining the ontology structure, and expanding its coverage.
The process of maintaining and updating domain ontologies usually requires several iterations of improvement. However, due to the diversity and complexity of geoscience disciplines, the following challenges may be encountered throughout the iterative improvement process:
Domain complexity: The field of Earth Sciences is characterized by diversity and complexity, with intricate relationships between knowledge and concepts. In this context, accurately capturing domain knowledge and relationships is challenging.
Knowledge uncertainty: Knowledge in some disciplines of the Earth Sciences may be uncertain, e.g., vaguely defined or with incomplete relationships. Validating an ontology requires in-depth discussions and decision-making with domain experts to address this uncertainty.
Changing needs: Domain knowledge and concepts may change over time, so ontologies need to adapt to and reflect these changes. Therefore, the validation and improvement of ontologies require continuous interaction with, and feedback from, domain experts and users.
After the execution of the ontology construction sub-module, the ontology will be used to inform the main constraints and rules of the corpus creation module, thus participating in the corpus creation process.
Upon the completion of the aforementioned corpus module, adjustments and updates to the corpus can be made as new data and knowledge accumulate.
2.2.2. Automation Module
Once the corpus construction is complete, it is imported into the automation module and used to train a domain-sensitive extraction model. The knowledge extraction process is then applied to identify pertinent information. Following the extraction, similar entities are merged, and the resulting data are imported into a graph database to enable visualization. Within this module, knowledge extraction assumes a paramount role, as it determines the precision of entities, attributes, and their relationships in the knowledge graph. Consequently, the automation module is subdivided into sub-modules, including model training, knowledge extraction, knowledge fusion, and graph construction. These sub-modules can be further refined, particularly in the domains of knowledge extraction and fusion, to incorporate the latest advancements in natural language processing.
2.3. Design of Functional Modules
2.3.1. Model Training Module
Recurrent neural networks (RNNs): An RNN is a model suited to sequential data. It captures temporal information by passing a hidden state from one time step to the next, thereby encoding the context of the input sequence. However, traditional RNNs often face vanishing or exploding gradients when modeling long-term dependencies in the data.
Long short-term memory (LSTM): LSTM is a variant of the RNN that is widely used to handle long-term dependencies. It uses gating mechanisms (a forget gate, an input gate, and an output gate) to control the flow of information, which largely mitigates the vanishing and exploding gradient problems.
Gated recurrent unit (GRU): GRU is another variant of RNN, which is similar to LSTM. It simplifies the model’s structure by using fewer gate mechanisms. GRU performs equally well as LSTM in many tasks, but it has fewer parameters and a faster training speed.
Attention mechanism (AT): The attention mechanism is a technique that allows a model to weight different parts of the input sequence according to their relevance. It dynamically weights and aggregates the relevant parts of the sequence at each time step, thereby improving the model's ability to focus on important information.
Pretrained language models (PLMs) are language models trained on large-scale unlabeled text data. They learn rich language representations and can be used for various tasks, including knowledge extraction. Earlier pretrained word representations include Word2Vec and GloVe, while more recent and more powerful models include BERT and the GPT series.
Sequence labeling models are a class of models widely used in knowledge extraction tasks. Among them, the conditional random field (CRF) is commonly used for identifying and labeling specific information, such as named entities or entity relationships, in text. Recently, sequence labeling methods that incorporate pretrained language models have achieved good results.
In comparison, LSTM and GRU have similar performances, but LSTM handles long-term dependencies better. The attention mechanism can improve the model’s focus on the important parts of the input sequence. Pretrained language models provide richer language representations, improving performance in knowledge extraction tasks. Sequence labeling models, combined with CRF and pretrained language models, allow for more accurate entity recognition and relationship extraction.
In order to improve the performance of knowledge extraction, appropriate deep learning models and algorithms need to be selected based on the specific requirements of the task and characteristics of the dataset. Additionally, ensemble methods or transfer learning techniques can be utilized to enhance the effectiveness of knowledge extraction. The model’s architecture, activation functions, loss functions, and optimizers should also be carefully chosen to ensure optimal performance. Model parameters are initialized using either random initialization or pretrained weights. The input data flow through the network layers, undergoing nonlinear transformations and activation functions to generate the final output. Subsequently, the model’s output is compared against the target labels, and the loss function is computed. The computed loss value is then employed to calculate gradients using the backpropagation algorithm, enabling an assessment of each parameter’s impact on the loss function. Finally, the model’s parameters are updated using an optimization algorithm (e.g., stochastic gradient descent) based on the computed gradients. This optimization algorithm adjusts the parameter values considering the gradient direction and learning rate, intending to minimize the loss function. Following each training iteration, the adequacy of the training progress is evaluated, and if deemed satisfactory, the model parameters are output. Otherwise, the model undergoes another iteration.
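The training procedure just described corresponds to the standard deep learning loop of forward pass, loss computation, backpropagation, and parameter update; a minimal PyTorch sketch with a toy model and illustrative hyperparameters is shown below.

```python
# Minimal PyTorch sketch of the training iteration described above:
# forward pass -> loss -> backpropagation -> optimizer update.
# The toy model, data, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
criterion = nn.CrossEntropyLoss()                          # loss function
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # stochastic gradient descent

inputs = torch.randn(8, 16)                                # a toy batch
targets = torch.randint(0, 4, (8,))                        # toy target labels

for epoch in range(5):                                     # training iterations
    optimizer.zero_grad()                                  # reset gradients
    outputs = model(inputs)                                # forward pass through the layers
    loss = criterion(outputs, targets)                     # compare output with target labels
    loss.backward()                                        # backpropagation: compute gradients
    optimizer.step()                                       # update parameters along the gradients
    print(f"epoch {epoch}: loss = {loss.item():.4f}")
```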
Subsequently, an independent test dataset is employed to conduct a comprehensive evaluation of the model’s performance in real-world scenarios. The outcomes derived from the test data analysis enable an assessment of the model’s effectiveness and its adherence to the application requirements. Once the requirements are fulfilled, the model can be deployed in a production environment, where continuous monitoring and updates are carried out.
2.3.2. Knowledge Extraction Module
The design of the knowledge extraction module is contingent upon the selection of knowledge extraction methods, as they directly impact the precision of the extraction process. As a result, the knowledge extraction module is the most critical functional component of the automation module. Depending on task requirements, data characteristics, and available resources, knowledge extraction can be classified into two frameworks: the pipeline method and joint extraction.
The pipeline method involves breaking down the knowledge extraction task into multiple subtasks, which are processed sequentially. Each subtask is responsible for extracting specific knowledge, and its output serves as the input for the subsequent task. This approach offers advantages such as independent development and debugging of each subtask, as well as the flexibility to incorporate new subtasks as needed. For instance, a common pipeline method combines named entity recognition (NER) and relation extraction as two subtasks. NER identifies entity types in the text, while relation extraction determines relationships between the entities based on the output of NER. The pipeline method has notable benefits, including a clear structure, modularity, and ease of construction and maintenance.
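As a schematic of the pipeline framework only, and not the extraction model adopted later in this paper, the following sketch feeds the output of a toy NER step into a toy relation extraction step.

```python
# Schematic of the pipeline extraction framework: NER output feeds relation
# extraction. Both functions are hypothetical rule-based placeholders.
from typing import List, Tuple

def recognize_entities(text: str) -> List[Tuple[str, str]]:
    """Toy NER: label known surface forms with an entity type."""
    gazetteer = {"Gongchangling iron mine": "DEPOSIT", "Anshan area": "LOCATION"}
    return [(name, label) for name, label in gazetteer.items() if name in text]

def extract_relations(text: str, entities: List[Tuple[str, str]]):
    """Toy relation extraction: link a DEPOSIT to a LOCATION if a cue phrase appears."""
    deposits = [e for e, t in entities if t == "DEPOSIT"]
    locations = [e for e, t in entities if t == "LOCATION"]
    if "located in" in text:
        return [(d, "located_in", l) for d in deposits for l in locations]
    return []

text = "The Gongchangling iron mine is located in the Anshan area."
entities = recognize_entities(text)           # subtask 1: NER
triples = extract_relations(text, entities)   # subtask 2: relation extraction
print(entities, triples)
```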
Joint extraction involves collectively modeling multiple knowledge extraction tasks. This method considers the interdependencies between tasks and addresses conflicts and competition through joint optimization. Graph models, such as conditional random fields (CRFs) or Graph Neural Networks (GNNs), can be utilized to encode and infer relationships among different tasks. Compared to the pipeline method, joint extraction leverages contextual information across different tasks, resulting in improved accuracy and consistency of extraction. However, the challenge here lies in modeling and training complex joint models that encompass cross-task interactions and optimization. The accompanying figure illustrates the design of the corresponding module.
2.3.3. Knowledge Fusion Module
Upon completion of the knowledge extraction process, the extracted knowledge may contain substantial duplication and redundancy, resulting in confusing, repetitive information. Knowledge fusion enables the identification and elimination of duplicate and redundant knowledge, thereby reducing information redundancy and enhancing information utilization efficiency. This facilitates more effective knowledge management and utilization, avoiding wasted resources and repeated labor. There are five commonly employed fusion methods:
String matching: String matching methods compare the similarity between entity names or identifiers to facilitate matching and merging. For instance, algorithms such as edit distance and Jaccard similarity can be employed to calculate the similarity between entity names, and a predetermined threshold can be set to determine whether they refer to the same entity. String matching is suitable for situations that require exact or fuzzy matching, such as recognizing and linking named entities, terminology matching, and information extraction. Although string matching algorithms are intuitive and effective, they may face efficiency issues in complex, large-scale scenarios. Because they cannot exploit semantic and contextual information, performance can degrade when strings are long or only loosely similar.
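A minimal sketch of string matching with a threshold is given below; it uses a character-bigram Jaccard similarity together with Python's standard difflib ratio as a stand-in for edit-distance-style similarity, and the entity names and threshold value are illustrative assumptions.

```python
# String-matching sketch: Jaccard similarity over character bigrams plus a
# difflib ratio, with a threshold deciding whether two names co-refer.
from difflib import SequenceMatcher

def bigrams(s: str) -> set:
    return {s[i:i + 2] for i in range(len(s) - 1)}

def jaccard(a: str, b: str) -> float:
    x, y = bigrams(a), bigrams(b)
    return len(x & y) / len(x | y) if x | y else 0.0

def same_entity(a: str, b: str, threshold: float = 0.6) -> bool:
    score = max(jaccard(a, b), SequenceMatcher(None, a, b).ratio())
    return score >= threshold      # predetermined threshold decides the match

print(same_entity("Gongchangling iron mine", "Gongchangling Fe deposit"))
print(same_entity("Gongchangling iron mine", "Gongchangling iron ore deposit"))
```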
Feature vector matching: Feature vector matching methods represent entities as feature vectors and assess the similarity between them. These feature vectors can encompass entity attributes, relationships, contextual information, and other relevant features. Typically, approaches like the bag-of-words model, TF-IDF, and Word2Vec are utilized to generate feature vectors, and similarity measurement techniques such as cosine similarity are used to compare the vectors. By establishing a similarity threshold, it becomes possible to ascertain whether the entities belong to the same category. Feature vector matching is commonly used in knowledge fusion tasks based on feature and similarity measurement. This method is applicable to various fusion scenarios, such as entity alignment, relation extraction, and link prediction in knowledge graphs. This method has flexibility and scalability, and can adapt to different types of features and similarity measurements. However, it is sensitive to the selection of features and similarity calculation methods, and needs to be fine-tuned according to specific tasks.
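The following short sketch illustrates feature vector matching, assuming scikit-learn's TfidfVectorizer and cosine_similarity; the entity descriptions and the similarity threshold are illustrative.

```python
# Feature-vector matching sketch: TF-IDF vectors compared with cosine
# similarity against a threshold. Descriptions and threshold are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

descriptions = [
    "Archean banded iron formation hosted in the Anshan area",
    "Banded iron formation of Archean age near Anshan",
]
vectors = TfidfVectorizer().fit_transform(descriptions)   # bag-of-words TF-IDF features
similarity = cosine_similarity(vectors[0], vectors[1])[0, 0]
print(f"cosine similarity = {similarity:.2f}")
print("same entity" if similarity >= 0.5 else "different entities")
```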
Context matching: Context-matching methods take into account the surrounding contextual information of entities to establish their identity. More specifically, matching and merging can be accomplished by analyzing co-occurrence patterns, relative positions, syntactic dependency relationships, and other contextual factors in the text. Context matching is commonly used to improve the accuracy of string matching and knowledge fusion by exploiting surrounding contextual information; it can identify the semantic and contextual consistency among strings in specific language environments by considering the surrounding context and contextualized vocabulary. However, appropriate context windows, contextual information, and matching strategies must be designed according to the domain and task, and some complex context-matching methods may incur high computational complexity.
Graph matching: Graph-matching methods consolidate entities by comparing their relational connections. This method represents entities and their relationships using graph structures and employs graph-matching algorithms to identify similar graphs and subgraphs. Graph-matching algorithms can be based on principles such as subgraph isomorphism and graph isomorphism. Graph-matching methods can capture complex relationships between entities. These are commonly used for entity and relationship matching and alignment tasks within knowledge graphs. They can perform matching and fusion operations by establishing the structure within the knowledge graph and using graph algorithms, such as entity alignment, information propagation, and graph pruning. Graph matching is applied in scenarios including knowledge graph fusion, graph data analysis, and link prediction. Graph matching can comprehensively consider the topological relationships and semantic similarity between nodes. However, in large-scale and highly dynamic graph structures, graph-matching algorithms may face challenges with regard to computational efficiency and scalability.
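A brief sketch of graph matching is given below, assuming the networkx library; two small entity-relationship graphs are compared via subgraph isomorphism, and the node names and relation labels are illustrative.

```python
# Graph-matching sketch: check whether one entity-relationship graph contains
# a subgraph isomorphic to another, using networkx. Data are illustrative.
import networkx as nx
from networkx.algorithms import isomorphism

g1 = nx.DiGraph()
g1.add_edge("Gongchangling iron mine", "ore zone 1", relation="contains")
g1.add_edge("ore zone 1", "ore body A", relation="contains")

g2 = nx.DiGraph()
g2.add_edge("deposit", "ore zone", relation="contains")

matcher = isomorphism.DiGraphMatcher(
    g1, g2, edge_match=lambda e1, e2: e1["relation"] == e2["relation"]
)
print(matcher.subgraph_is_isomorphic())   # True if g1 contains a subgraph matching g2
```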
Machine learning methods: Machine learning methods approach entity merging as classification or clustering problems. By representing entity information as feature vectors and utilizing machine learning algorithms such as support vector machines, random forests, and clustering algorithms for classification or clustering, identical entities can be allocated to the same category. Machine learning methods can automatically learn shared features and patterns from data, but their performance relies on the quality and accuracy of the training data. Furthermore, machine learning methods can be combined with other approaches for improved accuracy and robustness in the merging process. They are widely used in knowledge fusion for feature learning, pattern recognition, and decision inference, and can automatically learn and infer the relationships and fusion rules between different knowledge sources through trained models. Application scenarios include knowledge graph construction, relation extraction, and knowledge integration. Machine learning methods have the ability to automatically learn from and adapt to different data, allowing them to handle complex relationships and patterns. However, they require a large amount of annotated data and training time, and in practice feature selection, model selection, and overfitting must all be considered.
The comprehensive process of a knowledge fusion module consists of the following steps (a minimal end-to-end sketch follows the list):
Step 1: Data collection and cleaning: Initially, collect entity information from diverse data sources or texts and proceed to clean the data. The cleaning process encompasses eliminating duplicate entities, rectifying spelling errors, addressing aliases and abbreviations, and so on. Data cleaning aims to ensure data consistency and accuracy, laying a reliable foundation for subsequent entity fusion.
Step 2: Entity matching and identification: Match and identify the collected entity information, thereby determining which entities represent the same real-world entity. This can be achieved through methods like similarity calculation and comparing entity attributes. For example, calculating the similarity between entity names allows us to consider them as the same entity when the similarity exceeds a pre-defined threshold.
Step 3: Feature extraction and representation: Extract pertinent features from different data sources or texts for the matched entities. These features can include entity attributes, relationships, contextual information, and more. The objective of feature extraction is to provide specific information and a basis for making judgments in the subsequent entity fusion process.
Step 4: Similarity calculation and threshold setting: Calculate the similarity between entities based on their features. Various measurement methods such as string similarity, vector similarity, and context matching can be employed for similarity calculation. By setting a similarity threshold according to the specific application scenarios and requirements, it becomes possible to determine whether entities belong to the same entity.
Step 5: Collision handling and decision-making: Address collisions between similar entities, i.e., develop strategies to handle entities with similarity values exceeding the threshold. Decisions can be made based on prioritizing information from a certain data source or employing manual review. The decision-making process can be modified and optimized based on the actual situation.
Step 6: Entity merging and integration: Merge and integrate the matched and decided entities to form the final fused entity. This merging process may involve merging entity attributes, relationships, and related operations to ensure the integrity and consistency of entity information.
Step 7: Post-processing and validation: Conduct post-processing and validation on the merged entities to ensure the accuracy and consistency of the fusion results. Post-processing activities may encompass removing redundancies, resolving conflicts, updating attributes, and more. The validation process may involve manual review or validation by domain experts.
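Tying these steps together, the hedged sketch below matches two hypothetical entity records by name similarity (Steps 2 and 4), prefers one source when attributes conflict (Step 5), and merges the records (Step 6); the records, threshold, and priority rule are all assumptions for illustration.

```python
# End-to-end fusion sketch for the steps above: match by name similarity,
# resolve attribute conflicts by source priority, then merge the records.
# Records, threshold, and the priority rule are hypothetical.
from difflib import SequenceMatcher

record_a = {"name": "Gongchangling iron mine", "source": "journal",
            "attrs": {"type": "BIF", "scale": "large"}}
record_b = {"name": "Gongchangling Fe deposit", "source": "report",
            "attrs": {"type": "banded iron formation"}}

def match(a: dict, b: dict, threshold: float = 0.6) -> bool:
    return SequenceMatcher(None, a["name"], b["name"]).ratio() >= threshold

def merge(a: dict, b: dict, preferred_source: str = "journal") -> dict:
    first, second = (a, b) if a["source"] == preferred_source else (b, a)
    merged_attrs = {**second["attrs"], **first["attrs"]}  # preferred source wins conflicts
    return {"name": first["name"], "attrs": merged_attrs}

if match(record_a, record_b):
    print(merge(record_a, record_b))
```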
The triplets that undergo entity fusion already possess all the essential elements required to form a knowledge graph and serve as the foundational data for constructing intelligent question-answering systems and knowledge recommendation systems based on the knowledge graph.
2.3.4. Knowledge Graph Construction Module
Upon completing knowledge fusion, it becomes imperative to visualize the data stored in the knowledge graph, thereby enhancing the comprehension and exploration of information and relationships within the graph. In the domain of graph visualization research, graph databases serve as database systems for efficient storage and processing of graph-structured data. Concepts, entities, and attributes undergo a transformation into nodes, while relationships between various entities and attributes are represented as edges, forming structured triplets. Unlike conventional relational or document-oriented databases, graph databases place emphasis on relationships (edges) and the topological structure among nodes (entities), with a focus on addressing intricate graph querying and analysis tasks. Prominent graph databases encompass Neo4j (recognized for its high performance, reliability, and robust graph query capabilities), Amazon Neptune (distinguished by its scalability, persistence, and exceptional availability), TigerGraph (providing support for parallel computing), ArangoDB (characterized by its versatility as a multi-model database), and Sparksee (delivering rapid graph query and analysis capabilities). Leveraging the unique attributes of graph databases, knowledge graphs can accomplish the following functionalities:
Efficient query and graph analysis: Graph databases employ query languages (such as SPARQL and Cypher) and graph-based algorithms to facilitate efficient query and analysis operations. This enables the utilization of graph databases as a platform for constructing question-answering systems based on knowledge graphs, facilitating fast graph traversal, relationship path queries, node similarity calculations, and other operations.
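As an illustration of such queries, the sketch below assumes the official neo4j Python driver and a locally running Neo4j instance with hypothetical labels and relationship types; it creates a deposit node, links it to an ore zone, and runs a relationship-path query in Cypher.

```python
# Sketch of graph construction and querying with the neo4j Python driver.
# Connection details, labels, and relationship types are assumptions.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Create a deposit node and a contained ore zone (MERGE avoids duplicates)
    session.run(
        "MERGE (d:Deposit {name: $deposit}) "
        "MERGE (z:OreZone {name: $zone}) "
        "MERGE (d)-[:CONTAINS]->(z)",
        deposit="Gongchangling iron mine", zone="Ore zone 1",
    )
    # Relationship-path query: which ore zones does the deposit contain?
    result = session.run(
        "MATCH (d:Deposit {name: $deposit})-[:CONTAINS]->(z:OreZone) RETURN z.name AS zone",
        deposit="Gongchangling iron mine",
    )
    print([record["zone"] for record in result])

driver.close()
```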
Large-scale data processing and horizontal scalability: Graph databases possess the capacity to handle large-scale datasets and support horizontal scalability. Through techniques such as partitioning, replication, and distributed computing, graph databases distribute data and computational workloads across multiple nodes, achieving high performance, availability, and scalability. Consequently, they provide the groundwork for expanding knowledge graphs and enriching knowledge systems, serving as a platform for large-scale knowledge graph sharing.
Visualization and exploratory analysis: Graph databases offer visualization tools and query interfaces to aid users in intuitively comprehending and exploring graph data. These tools can depict the topological relationships of nodes and edges, visualize the outcomes of graph algorithms, and assist users in identifying the hidden patterns and insights within the knowledge graph.
2.3.5. Data Pipeline Design
3. The Construction of a Knowledge Graph for Iron Ore Deposits
To verify the feasibility and efficiency of the proposed approach and process design, this section starts with data collection and ontology creation and then implements the construction and visualization of a knowledge graph for iron ore deposits according to the described method.
In the equation, C signifies concepts, e denotes instances that correspond to the concepts, and r represents the relationship between concepts and their corresponding instances. In this paper, C stands for a concept, such as a mineral deposit; e stands for an instance of an iron ore deposit, such as the Gongchangling iron mine; and r stands for the mapping relationship between the concept of a mineral deposit and the instance of the Gongchangling iron mine.
In the equation, C1 stands for the mineral deposit, C2 for the ore zone, C3 for the ore body, and C4 for the ore district, and R(Ci, Cj) denotes the inclusion relationship between concepts Ci and Cj. Specifically, the mineral deposit C1 encompasses the ore zone C2, the ore zone C2 encompasses the ore body C3, and the mineral deposit C1 encompasses the ore body C3. These semantic relationships apply to triples of mineral deposit entities associated with concepts at various levels.
In the equation, which can be read as an attribute triple of the form (e, property, value), e denotes an instance, property signifies an attribute, and value represents the corresponding attribute value.
Upon completion of the construction of the iron ore deposits ontology, a knowledge corpus specific to iron ore deposits is generated based on the constraints of the ontology. Esteemed Chinese geological journals, including “Acta Petrologica Sinica”, “Journal of Mineral Deposits”, “Mineral Deposits Geology”, and “Geological Review”, serve as data sources. Relevant articles pertaining to iron ore deposits are collected and unstructured data are extracted. Following data cleansing, the Doccano annotation platform is utilized to annotate entities, properties, and semantic relationships based on the iron ore deposits ontology. This process culminates in the creation of a labeled corpus intended for the extraction of iron ore deposits knowledge. The corpus is divided into three subsets: training data, validation data, and data earmarked for extraction, categorized according to their respective purposes.
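A hedged sketch of this corpus-preparation step is shown below; it assumes a doccano-style sequence-labeling export in JSONL in which each record carries a "text" field and a "label" list of [start, end, type] spans, and the file name, BIO conversion, and split ratio are illustrative assumptions.

```python
# Sketch: convert a doccano-style JSONL export (assumed fields: "text",
# "label" = [[start, end, type], ...]) to character-level BIO tags and split
# the corpus. Field names, file name, and split ratio are assumptions.
import json
import random

def to_bio(record: dict):
    text, spans = record["text"], record.get("label", [])
    tags = ["O"] * len(text)
    for start, end, ent_type in spans:
        tags[start] = f"B-{ent_type}"
        for i in range(start + 1, end):
            tags[i] = f"I-{ent_type}"
    return list(zip(text, tags))

with open("iron_ore_corpus.jsonl", encoding="utf-8") as f:
    samples = [to_bio(json.loads(line)) for line in f if line.strip()]

random.seed(42)
random.shuffle(samples)
split = int(0.8 * len(samples))
train, valid = samples[:split], samples[split:]   # remaining unlabeled text is reserved for extraction
print(len(train), len(valid))
```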
Following the human-assisted operations on the data, the developed corpus is integrated into the system to construct the PLMs (pretrained language models) + BiLSTM + CRF framework. The PLMs + BiLSTM + CRF modeling framework is a composite knowledge extraction framework that combines pretrained language models (PLMs), bidirectional long short-term memory (BiLSTM) networks, and conditional random fields (CRFs). It has distinctive features compared to other models. First, the framework combines multiple models and incorporates their individual strengths. Pretrained language models have strong contextual understanding. The LSTM, as a variant of the recurrent neural network, can process sequential data and capture long-term dependencies. The CRF learns the relationships between entities and the transition patterns of label sequences, optimizing the model outputs to improve the accuracy and consistency of entity boundaries. Since geoscientific texts exhibit strong syntactic dependencies and contextual relationships, the chosen modeling framework meets the data requirements. Second, the PLMs + BiLSTM + CRF modeling framework allows end-to-end training and inference: it learns feature representation, sequence modeling, and label prediction simultaneously, thus reducing the need for feature engineering and simplifying the processing steps. Last but not least, the flexibility of the chosen modeling framework, combined with the pipeline-based approach described in this paper, allows the model to be modified and extended according to the task requirements, highlighting the versatility of the pipeline-based approach investigated in this paper.
In the equation, the first term is the character vector, which incorporates the character information of the entity, together with its corresponding weight parameter. The second term is the word vector, obtained by fine-tuning a pretrained model such as BERT and encompassing entities from the corpus, together with its corresponding weight parameter. The third term is the feature vector within the context c, capturing the contextual information surrounding the current entity, together with its weight parameter. The final term is the feature vector of the feature word f within the context c, and a further symbol denotes the total number of words in the context.
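The composite PLMs + BiLSTM + CRF architecture described above can be sketched as follows, assuming PyTorch, the Hugging Face transformers library, and the third-party pytorch-crf package; the checkpoint name, hidden size, and other hyperparameters are placeholders rather than the exact configuration used in this study.

```python
# Sketch of a PLM + BiLSTM + CRF tagger: a pretrained encoder produces
# contextual embeddings, a BiLSTM models the sequence, and a CRF layer
# scores and decodes label sequences. Checkpoint and sizes are placeholders.
import torch
import torch.nn as nn
from transformers import AutoModel
from torchcrf import CRF   # from the third-party "pytorch-crf" package

class PlmBilstmCrf(nn.Module):
    def __init__(self, checkpoint: str, num_tags: int, hidden: int = 128):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(checkpoint)       # PLM contextual encoder
        self.bilstm = nn.LSTM(self.encoder.config.hidden_size, hidden,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_tags)          # per-token emission scores
        self.crf = CRF(num_tags, batch_first=True)                 # label-sequence layer

    def forward(self, input_ids, attention_mask, tags=None):
        hidden_states = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        emissions = self.classifier(self.bilstm(hidden_states)[0])
        mask = attention_mask.bool()
        if tags is not None:                                        # training: negative log-likelihood
            return -self.crf(emissions, tags, mask=mask, reduction="mean")
        return self.crf.decode(emissions, mask=mask)                # inference: best tag paths
```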
4. Discussion
Knowledge graphs are knowledge representation models that use semantic associations to organize and represent domain knowledge as structured data graphs. They describe the relationships and semantic connections between entities by organizing entities, properties, and relationships into a network graph. The primary objective of knowledge graphs is to capture and express real-world knowledge, aiding individuals in comprehending and effectively utilizing vast amounts of information. When knowledge graphs are applied in the field of geoscience, they are known as geoscience knowledge graphs. These graphs offer several advantages, including the integration of multiple data sources, knowledge discovery and association, analysis of the Earth system, intelligent search and recommendation capabilities, as well as knowledge sharing and collaboration functionalities. Consequently, geoscience knowledge graphs serve as a powerful tool for research and application within the geoscience field, opening new avenues for the development and application of Earth Science. However, research on geoscience knowledge graphs is still largely in the theoretical and experimental phase, lacking sufficient practical case applications. Moreover, given the multitude of disciplines, diverse data, and ambiguous data boundaries in the field of Earth Science, the construction of large-scale geoscience knowledge graph platforms analogous to Wikipedia, combining crowd-sourcing with expert knowledge systems, will play a crucial role in the future.
The present study presents a “pipeline”-based approach for the automated construction of geoscience knowledge graphs. This approach efficiently generates knowledge graphs of research objects within specific disciplines or sub-disciplines of Earth Science. This study also demonstrates a systematic process for data collection, construction, and visualization of geoscience knowledge graphs. Two sections, human-assisted and automated, are designed, each incorporating various functional modules. The human-assisted section primarily focuses on domain ontology construction and corpus development, with the domain ontology serving as a crucial constraint for corpus development. Construction of domain-specific ontologies requires incorporating expert opinions and considering the characteristics of the domain knowledge system. The knowledge system can be continuously expanded and enriched as research progresses. The generality of the method proposed in this paper is reflected in this section. When applying this method to construct knowledge graphs in other disciplines within the field of Earth Sciences, it is essential to develop discipline-specific ontologies and build a corpus using relevant literature and geological survey reports as data sources and then to proceed to the next section to carry out the subsequent processing. The automated section encompasses model training, knowledge extraction, knowledge fusion, and graph construction. The knowledge extraction model can be designed based on extraction algorithms from the field of natural language processing, thereby enhancing the precision and accuracy of knowledge extraction. To ensure the quality of the knowledge graph, post-extraction verification can be performed.
The approach used in this study for the rapid construction of geoscience knowledge graphs aims to reduce technical barriers and increase the number of domain-specific knowledge graphs in the future. It can provide foundational data support for the development of a comprehensive interdisciplinary knowledge-sharing platform and intelligent question-answering and decision-making systems within the field of Earth Science. It can also be combined with knowledge graph-based algorithms and artificial intelligence techniques to build more advanced AI-driven knowledge systems and serve applications such as resource management, risk assessment, and recommender systems. For example, in critical mineral resource management, it is important to construct a knowledge graph of the entire mineral industry chain, from mine supply to smelters and downstream industries. By fusing knowledge into a graph of the whole industry chain of critical minerals and combining graph representation learning, graph-clustering algorithms (such as spectral clustering and the Louvain algorithm), graph-matching algorithms (such as subgraph isomorphism, graph isomorphism, and graph edit distance), graph-based recommendation algorithms (such as path-based recommendation and graph convolutional networks), and knowledge graph completion algorithms (such as the tensor decomposition models RESCAL and DistMult), a critical mineral resource search engine and a critical mineral resource question-answering system can be built.
In addition to describing the design of each functional module and the data pipeline in the approach, this study constructs a knowledge graph of iron ore deposits as a case study. A Chinese pretrained model is selected based on the training corpora, and the LERT + BiLSTM + CRF knowledge extraction framework is determined through the model training outcomes. The expanding number of knowledge nodes in the iron ore deposits knowledge graph reveals significant potential for knowledge mining and discovery within the field of iron ore deposits. The graph integrates and organizes a substantial amount of information related to geological features and deposit types, forming a comprehensive knowledge network. Consequently, researchers can efficiently acquire and analyze knowledge in the field of iron ore deposits, uncovering patterns and correlations. Additionally, integrating the knowledge graph of iron ore deposits with those of other fields presents a promising avenue for interdisciplinary knowledge discovery. Combining knowledge graphs from geology, geophysics, geochemistry, and other fields facilitates the resolution of complex issues such as the genesis of iron ore deposits, mineral resource evaluation, and mineral exploration. This collaboration nurtures a deeper understanding of Earth evolution and deposit models.
The continuous expansion of the iron ore deposits knowledge graph and its integration with knowledge graphs from other disciplines will provide robust support for research, exploration, and development in the field of iron ore deposits. It not only accelerates the accumulation and dissemination of knowledge but also offers significant references for decision-making and technological innovation in related fields. As a result, it drives progress and advancement in the geological resources field.
5. Conclusions
The advancement of natural language processing (NLP) technology has enabled the resolution of various geoscience issues through NLP-based algorithms. Geoscience knowledge graphs, which are interdisciplinary in nature, merging computer science and Earth Science, have aroused immense interest among geologists and computer scientists. This study presents a “pipeline”-based approach to automating the construction of geoscience knowledge graphs, thereby reducing the technical complexities associated with their development. By using an iron ore deposits knowledge graph as an exemplar, the article illustrates the comprehensive process of constructing a geoscience knowledge graph based on the proposed methodology. From this study, the following conclusions can be drawn:
(1) Constructing large-scale geoscience knowledge graphs necessitates the integration of vast instantiated graphs, requiring a blend of crowdsourcing and expert decision-making to establish a data-sharing platform.
(2) Given the interdisciplinary nature of Earth Science, ontology construction should be tailored to the unique characteristics of each discipline, offering suitable constraints for research objects in specific domains.
(3) Geoscience knowledge graphs, as a specific type of knowledge graph, possess organizational and storage capabilities, intelligent search functionalities, automated recommendations, knowledge discovery and analysis capabilities, scalability, and maintainability. They can effectively facilitate the integration of multi-source data in Earth Science, knowledge discovery and correlation, Earth system analysis, intelligent search and recommendations, as well as knowledge sharing and collaboration, among other applications.
(4) The quality of knowledge extraction relies on both the corpus quality and the construction of the model framework. In different domains, it is essential to compare multiple models.
(5) The proposed approach constructs knowledge graphs efficiently, significantly simplifies the development of knowledge graph projects, and can be combined with algorithms based on knowledge graph applications to enable the efficient construction of geoscience Q&A systems.