Identification of Scientific Texts Generated by Large Language Models Using Machine Learning


1. Introduction

The rapid advance of technology has brought with it new tools, such as large language models (LLMs), which facilitate various tasks in daily life, but it has also generated multiple challenges that we must face. These models have completely changed the way we interact with information and have allowed us to improve many processes. However, even though their development requires a large amount of computational, natural and financial resources, the ease with which these technologies can be accessed has created significant difficulties, especially in academic and professional environments.

The advancement of LLMs can have a negative impact in several areas, especially in academia. By providing students with these tools, their learning process could be compromised, because they may opt for quick and easy solutions instead of gaining a deep understanding of the topics. This could result in basic skills such as writing, spelling, reading comprehension and research techniques being compromised. In addition, one of the main problems is the inappropriate use of texts created by LLMs for personal gain without acknowledging the original author, which increases the risk of plagiarism.

In the near future, more parametrized versions of LLMs are expected to generate texts that more accurately mimic the grammar and writing style of specific authors. As the differences between AI-produced texts and those written by humans become almost imperceptible, it will become increasingly difficult to identify LLM-generated texts. This development will complicate the detection of plagiarism and misuse of these tools, especially in the academic and professional sectors, where originality and authenticity are paramount.

In the long run, it is likely that all the information produced by LLMs will be published on the Internet. Thus, if these models exhibit a high degree of hallucination, it is possible that upcoming models trained with such incorrect information will suffer a reduction in performance. This phenomenon could cause a dilution of information, leading to a decrease in accuracy and depth of attention to detail, negatively impacting the quality of the final answers.

As a precedent, several pieces of research have been carried out on the identification of texts produced by artificial intelligence (AI). However, methods such as watermarking have not shown encouraging results in the face of challenges such as recursive paraphrasing or machine translation. Therefore, the application of deep-learning algorithms and Transformer-based architectures has become one of the most widely used tactics to address this problem.

In addition, since this is a relatively recent field of study, existing datasets from different sources were not created specifically for this problem. For example, sets such as PAN and HC3 have considerable restrictions: the structure of PAN is not appropriate for this purpose, and the HC3 set only uses a single GPT model to generate answers to questions, limiting the comparison to texts written by humans versus texts produced by GPT.

In this context, it is essential to develop effective solutions to

  • identify LLM-generated texts with high accuracy;

  • detect covert plagiarism practices using advanced techniques;

  • provide accessible tools for academic and professional institutions.

In this paper, we propose a detection model based on natural language processing (NLP) and machine learning. Our approach focuses on

  • the creation of a meticulously designed dataset validated through comprehensive experiments;

  • the implementation of models ranging from classical techniques to Transformer and LLM architectures.

The following section presents the theoretical framework. Section 3 reviews work related to the detection of text and plagiarism produced by LLMs. Section 4 explains the methodology. Section 5 covers the experiments and analysis of results. Section 6 presents the constraints encountered during the development of the research. Section 7 discusses the impact and applicability of the work, and Section 8 presents the conclusions and future work.

2. Theoretical Framework

2.1. Preprocessing Techniques

To fully understand the advances in our research, it is essential to master a variety of concepts related to large language models, natural language processing, and evaluation metrics, such as accuracy, recall, F1-score, and precision. Additionally, it is important to be familiar with tools such as confusion matrices, t-distributed stochastic neighbor embedding (t-SNE), receiver operating characteristic/area under the curve (ROC/AUC) curves and principal component analysis (PCA) plots.

Because they prepare the data for more effective analysis or modeling, preprocessing techniques are a crucial stage of text analysis. Several methods are used in this process to convert raw text into a more appropriate form using natural language processing techniques to improve the performance of our machine-learning models.

Tokenization [1], which divides the text into smaller units called tokens (words, phrases or sentences), is a very important technique within NLP. After tokenization, we remove all symbols that are not necessary for our analysis; this includes removing punctuation marks and unnecessary symbols and converting the text to lowercase. This reduces variability in the text and ensures a uniform treatment of all words, lowering the risk of giving importance to symbols that are unlikely to contribute anything to the context of the sentence.

Another set of important techniques includes stopword removal, which ignores words with little semantic value that may introduce noise into the analysis; lemmatization, which reduces a word to its lemma or lexical root; and stemming, which truncates a word to its base form, although this does not always guarantee the appropriate lexical root.

These techniques are essential to properly debug a textual dataset, thus minimizing errors when using machine-learning models. A diagram illustrating the key steps in performing text preprocessing is shown in Figure 1.
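To make these steps concrete, the following is a minimal Python sketch of such a preprocessing pipeline, assuming the NLTK library and its punkt, stopwords and wordnet resources are available; the exact libraries and ordering used in our own pipeline may differ.

    import re
    import nltk
    from nltk.tokenize import word_tokenize
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    # Download the required NLTK resources (only needed once).
    nltk.download("punkt")
    nltk.download("stopwords")
    nltk.download("wordnet")

    def preprocess(text):
        # Lowercasing and removal of punctuation and unnecessary symbols.
        text = re.sub(r"[^a-z\s]", " ", text.lower())
        # Tokenization: split the text into word-level tokens.
        tokens = word_tokenize(text)
        # Stopword removal: drop words with little semantic value.
        stop_words = set(stopwords.words("english"))
        tokens = [t for t in tokens if t not in stop_words]
        # Lemmatization: reduce each word to its lemma.
        lemmatizer = WordNetLemmatizer()
        return [lemmatizer.lemmatize(t) for t in tokens]

    print(preprocess("The models were evaluated on several scientific abstracts."))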

2.2. Overview of Text Vectorization Methods

Text vectorization is an important step within NLP, since it produces vector representations of our texts and in this way provides our machine-learning models with the semantic relationships and context they need to perform a specific task. Several methods have been developed to identify the connections between words; below we briefly describe some of the most popular ones, followed by a short illustrative sketch.

  • One-hot encoding consists of adding a binary vector to each word, where only one element is 1 (representing the word) and the others are 0. It does not identify the semantic relationships between words [2].
  • Bag of words represents a document as a list of words; it does not take into account the order of these words, and the resulting vector indicates the frequency at which each word appears in the text [3].
  • N-grams extend the bag of words by considering sequences of consecutive words. In this way, it is possible to capture information such as word order, which in turn carries context. These can be uni-grams, bi-grams or tri-grams, and they are not limited to words; they can also be built from groups of characters [4].
  • TF-IDF compares the frequency of a word in a document with its frequency across a collection of documents. Words that are rare in the collection receive more weight, while very common words receive less [5].
  • Word2Vec generates dense, low-dimensional vectors for each word according to its context, using models such as Skip-Gram or CBOW to train the embeddings. To obtain the vector representation of a sentence, the vectors of its words are summed and divided by the number of words, yielding an averaged vector that carries the semantic information and context of the sentence [6].
  • GloVe is an embedding technique that relies on word matches within a large text corpus. It captures semantic patterns using a global matrix of co-occurrences, unlike models such as Word2Vec that train words in close context [7].
  • BERT is a language model based on the Transformer architecture that differs in that it is bidirectional, meaning that it takes into account the preceding and following context of a word within a sentence. Compared to unidirectional models that only process text from left to right, being bidirectional allows it to generate much more accurate contextual insertions. Also, the word masking task, in which some words in the text are hidden and the model tries to predict them, helps BERT to learn deep semantic relationships [8].
  • RoBERTa is an improved version of BERT, created to overcome some limitations of the original model. It was trained on a larger amount of data and employs key adjustments to optimize its performance in various NLP tasks; it eliminates the next-sentence-prediction objective, as the researchers found that it did not provide significant improvements. The model keeps only the masked-language-modeling stage and focuses on training at a larger scale, which means more data, longer sequences and larger mini-batches [9].
  • The use of large language models, such as GPT or LLaMA, to create embeddings relies on their ability to understand the full context of a text stream. Their Transformer architecture allows these models to process both individual words and their relationship to the rest of the sentence or document. As a result, they produce highly contextualized embeddings, where the meaning of a word depends on the environment in which it is found. This allows the embeddings to capture complex semantic relationships, representing both the individual meaning of words and the overall context of entire sentences, making them ideal for advanced language processing tasks such as text classification or natural language generation [10].
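As a brief illustration of the difference between a sparse and a contextual representation, the sketch below builds TF-IDF vectors and mean-pooled BERT sentence embeddings; the model name and pooling strategy are illustrative assumptions, not necessarily the exact configuration used in our experiments.

    import torch
    from sklearn.feature_extraction.text import TfidfVectorizer
    from transformers import AutoTokenizer, AutoModel

    texts = ["This abstract was written by a human author.",
             "This abstract was paraphrased by a large language model."]

    # Sparse representation: one dimension per vocabulary term, no context.
    tfidf = TfidfVectorizer().fit_transform(texts)
    print("TF-IDF shape:", tfidf.shape)

    # Contextual representation: mean-pooled BERT token embeddings (768 dimensions).
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state        # (batch, tokens, 768)
    mask = batch["attention_mask"].unsqueeze(-1)
    embeddings = (hidden * mask).sum(1) / mask.sum(1)    # average over real tokens
    print("BERT embedding shape:", embeddings.shape)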

2.3. Classical Classification Algorithms

  • Logistic regression is a linear classification model mainly used in binary problems to predict the probability of belonging to a class; unlike linear regression, which predicts continuous values, it applies a sigmoid function that maps the output to values between 0 and 1. For decision-making, a typical threshold of 0.5 is used: if the probability is greater than or equal to this value, the model assigns the positive class, otherwise the negative one. The algorithm is efficient when the relationships are approximately linear but can be limited when they are more complex. The model can also be applied to multiclass classification through approaches such as "one vs. rest" or softmax regression, in which the model predicts the probability that an instance belongs to each of the available classes. Its simplicity and its ability to handle both binary and multiclass problems make it widely used in different tasks [11] (see the sketch after this list).
  • Random forest is a machine-learning algorithm based on the creation of multiple decision trees; each tree is trained with a random subset of the training data, which produces diversity among the trees. When classifying a new piece of data, each tree generates a prediction, and the final model decides by majority vote in classification problems or by averaging in regression problems. Although individual trees are prone to overfitting, combining many trees mitigates this, so the ensemble is not very susceptible to overfitting. The model is very effective when dealing with data with nonlinear interactions, as it can work with large datasets of many variables and is able to capture complex relationships between features [12].
  • SVMs are a class of powerful classification algorithms that focus mainly on finding an optimal hyperplane that separates classes in a high-dimensional feature space. The main idea is to maximize the distance between the hyperplane and the points closest to it, known as support vectors; a larger margin can lead to greater confidence in the classification. For nonlinear problems, SVMs use the kernel trick, which maps the data to a higher-dimensional feature space where the classes can be linearly separable; common kernels include the linear, polynomial and radial basis function (RBF) kernels. This algorithm performs well in high-dimensional spaces but can be computationally expensive, especially when working with large datasets [13].
  • The KNN algorithm is a supervised classification model based mainly on the similarity of instances; to classify new data, the model looks for the closest neighbors to that data in the feature space and assigns the most common class among them. To determine the distance between points, the Euclidean distance is mainly used, although other metrics can be chosen depending on the nature of the data. It is a very simple model that remains effective when the decision boundaries are complex and nonlinear. Its main disadvantage is its sensitivity to the scale of the features, so a normalization step is necessary before applying it. Its performance also degrades on large datasets; however, KNN is useful when a quick solution is required and no parametric model is available [14].
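As a rough illustration of how these four classifiers are used in practice, the sketch below trains them on a generic feature matrix with scikit-learn; the synthetic data and hyperparameters are placeholders, not the settings reported in our experiments.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.svm import SVC
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score

    # Synthetic stand-in for a matrix of text embeddings and their labels.
    X, y = make_classification(n_samples=1000, n_features=50, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    models = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "random_forest": RandomForestClassifier(n_estimators=100),
        "svm_rbf": SVC(kernel="rbf"),
        "knn": KNeighborsClassifier(n_neighbors=5),
    }
    for name, model in models.items():
        model.fit(X_train, y_train)
        print(name, accuracy_score(y_test, model.predict(X_test)))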

2.4. Deep-Learning Models

Some of the neural network architectures that were implemented in the development of the project are described below.

  • Fully connected neural networks are the most basic type of neural network; each neuron of a layer is connected to every neuron of the next layer, and information propagates in a single direction, from the input to the output, without any feedback. They can be used to solve classification or regression problems. One of their main limitations is that they do not capture the spatial or temporal relationships in the data, which hurts their performance on complex problems such as sequence analysis or large images. Despite their simple architecture, they can still be powerful for tasks where the inputs lack spatial or temporal structure [15].
  • RNNs are a type of neural network able to process data sequences such as text or time series; unlike fully connected networks, RNNs have cyclic connections, which allows them to maintain a memory of previous inputs. For this reason, they are useful for modeling temporal dependencies: at each time step, the RNN receives an input and updates its hidden state based on that input and the current hidden state. This type of network often suffers from vanishing or exploding gradients, which can make learning difficult on long data sequences, although RNNs remain useful for tasks such as sequence analysis or machine translation [16].
  • LSTM networks are a variant of RNNs designed primarily to mitigate the vanishing gradient problem on long sequences. They use a special memory architecture composed of cells that can remember and forget information over time, which allows them to capture long-term dependencies more effectively than traditional RNNs. This type of network is widely used in sequential tasks such as language modeling, text generation, sentiment analysis and time-series prediction; despite being computationally more expensive and more complex, LSTMs have proven significantly more effective in most sequential problems [16] (a minimal sketch of such a classifier follows this list).
  • Transformer architecture is an innovative solution presented in 2017 to overcome the limitations that RNNs and LSTMs have, especially in natural language processing and sequence processing tasks. Its main improvements are the attention mechanisms; these allow each part of the input to influence every other part, regardless of the position of the sequence. With that, the need to process data sequentially can be eliminated. This allows for much greater parallelism in training and data processing. This new architecture has proven to have far superior performance than previous architectures for machine translation, language modeling and text generation. Some models that have revolutionized the NLP field, such as BERT or GPT, have their operating principles in the Transformer architecture [17].
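As mentioned in the LSTM item above, the following is a minimal PyTorch sketch of an LSTM-based classifier operating on pre-computed word embeddings; the layer sizes and number of classes are illustrative assumptions rather than the configurations used in our experiments.

    import torch
    import torch.nn as nn

    class LSTMClassifier(nn.Module):
        def __init__(self, embedding_dim=300, hidden_dim=128, num_classes=5):
            super().__init__()
            # The LSTM reads the sequence of word vectors and keeps a memory cell.
            self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
            self.dropout = nn.Dropout(0.3)   # regularization against overfitting
            self.fc = nn.Linear(hidden_dim, num_classes)

        def forward(self, x):                # x: (batch, seq_len, embedding_dim)
            _, (h_n, _) = self.lstm(x)       # final hidden state summarizes the sequence
            return self.fc(self.dropout(h_n[-1]))

    model = LSTMClassifier()
    dummy = torch.randn(8, 40, 300)          # batch of 8 sequences of 40 word vectors
    print(model(dummy).shape)                # torch.Size([8, 5])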

2.5. Evaluation Metrics and Visualization Techniques

In this section, we describe some evaluation metrics that are important for verifying the adequate training of our models, along with some visualization techniques used in the analysis of classification models [18]; a short computational sketch follows the list.
  • Precision is the proportion of instances correctly classified as positive among all instances that were classified as positive. It is a useful metric when the cost of false positives is high. The formula to calculate it is:

    Precision = TP / (TP + FP),

    where TP are the true positives and FP are the false positives.

  • Recall measures the ability of the model to correctly identify positive instances among all true positive instances. It is particularly important when the cost of false negatives is high. The formula to calculate it is:

    Recall = TP / (TP + FN),

    where TP are the true positives and FN are the false negatives.

  • F1-Score is the harmonic mean between precision and recall and is useful when there is a balance between false positives and false negatives. The F1-Score provides a single metric that balances these two aspects:

    F1-Score = 2 × (Precision × Recall) / (Precision + Recall),

  • Accuracy measures the proportion of correct predictions among all predictions made. It is useful in balanced datasets but can be misleading in unbalanced datasets:

    Accuracy = (TP + TN) / (TP + TN + FP + FN),

    where TN are the true negatives.

  • The confusion matrix is a table that shows the predictions of the model against the original labels; these are broken down into true positives, true negatives, false positives and false negatives. This matrix helps to analyze the performance of our model with each of the classes and better understand the types of errors they are making [19].
  • PCA (Principal Component Analysis) is a dimensionality reduction technique that can transform the data into a new space with fewer dimensions while preserving as much variance as possible. It is mainly implemented to visualize high-dimensional data so that the main features are highlighted in a two-dimensional or three-dimensional plane [20].
  • t-SNE (t-distributed stochastic neighbor embedding) is another dimensionality reduction technique used for visualization, especially effective for high-dimensional data. It focuses primarily on preserving the local relationships between instances, which makes it particularly useful for visualizing data clusters or embeddings in low-dimensional spaces [21].
  • The ROC (receiver operating characteristic) curve shows the relationship between the true positive rate (TPR) and the false positive rate (FPR) for different decision thresholds. An area under the curve (AUC) of 1 indicates a perfect model, while an AUC of 0.5 indicates a random model [22].
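The following minimal sketch, using scikit-learn with dummy labels, shows how these metrics, the confusion matrix and the ROC/AUC are computed for a binary case; the values are illustrative only.

    from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                                 f1_score, confusion_matrix, roc_auc_score)

    # Dummy ground truth and model outputs for a binary task
    # (1 = LLM-generated, 0 = human-written); real values come from a trained model.
    y_true   = [1, 0, 1, 1, 0, 0, 1, 0]
    y_pred   = [1, 0, 1, 0, 0, 1, 1, 0]
    y_scores = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3]   # predicted probabilities

    print("Precision:", precision_score(y_true, y_pred))
    print("Recall:   ", recall_score(y_true, y_pred))
    print("F1-Score: ", f1_score(y_true, y_pred))
    print("Accuracy: ", accuracy_score(y_true, y_pred))
    print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
    print("ROC AUC:  ", roc_auc_score(y_true, y_scores))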

2.6. LLM Implementation Methods

There are several techniques for training and applying large language models. Four key methods are described below, along with their main advantages and disadvantages.

  • Prompt Engineering consists of designing prompts in a precise way to guide the language model to generate the most appropriate responses. When using pre-trained LLMs without the need to change their parameters, this technique is particularly useful. Advantages include the fact that it does not require additional training or large computational capacity and that it is fast and efficient for specific tasks. However, its customization for more complex or specific tasks may be limited, and its effectiveness depends on the capability of the LLM [23].
  • Fine tuning involves taking a previously trained model and retraining it with a specific dataset. It allows the weights of the model to be adjusted to improve its performance on particular problems. Advantages include the ability to create models that are highly tailored to specific tasks, improving accuracy and performance and being flexible for a wide range of applications. However, disadvantages include the fact that it requires a high quality dataset and significant computational resources, and it can be costly in terms of time and processing [24].
  • RAG (Retrieval-augmented generation) combines information retrieval techniques with text generation. First, relevant information is retrieved from a database or search engine, and then the LLM generates text based on that information. Advantages include increasing the accuracy of the LLM by relying on up-to-date and relevant information, improving answers to specific queries and reducing the dependency on model size. However, the disadvantages are that it requires additional systems for information retrieval, which complicates the architecture, and can increase latency in the generation process [25].
  • Creating an LLM from scratch involves the initial training of a language model on a large amount of unstructured data, without starting from a pre-trained model. The design of the model architecture, the selection of training data and the configuration of hyperparameters are all components of this process. Advantages include full control over model design and training, allowing for innovative or highly customized models for specific needs. However, disadvantages include being very expensive and requiring large amounts of computational resources, storage and time, as well as being complex and requiring a great deal of expertise in language modeling and optimization.

In order to train or fine-tune a large language model on a local computer, it is necessary to use techniques that reduce the size of the network weights and allow only a small part of the network to be modified; below we explain each of these techniques, followed by a small configuration sketch.

  • Quantization is a technique used to reduce the size of deep-learning models and accelerate their inference. Instead of representing the weights of a model with floating-point numbers, which consume more memory and require more computation time, quantization converts them into a lower-precision format, for example 8-bit data.

  • LoRA is a technique used to adapt models that are already pre-trained without adjusting all of their parameters; instead, LoRA introduces low-rank matrices that are trained while the original model weights are kept fixed, so that only a small number of parameters are trained instead of the whole model. This considerably reduces training time and the amount of computational resources needed, which is a great advantage when large language models need to be retrained.

  • QLoRA combines quantization and LoRA: the frozen base model weights are quantized (typically to 4 bits) while LoRA's low-rank matrices are trained on top of them. This further reduces the size and memory footprint of the models, making QLoRA very useful when limited computational resources are available.
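As a small configuration sketch of how these three techniques fit together, assuming the Hugging Face transformers, peft and bitsandbytes libraries (the model identifier and hyperparameters are illustrative, not our exact setup):

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model

    # Quantization: load the frozen base model with 4-bit weights (QLoRA style).
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )
    base_model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-hf",          # illustrative model identifier
        quantization_config=bnb_config,
        device_map="auto",
    )

    # LoRA: attach small trainable low-rank matrices; the base weights stay frozen.
    lora_config = LoraConfig(
        r=8, lora_alpha=16, lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(base_model, lora_config)
    model.print_trainable_parameters()       # only a small fraction is trainable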

3. Related Work

In the work developed by Sadasivaan et al. [26], a critical problem related to Type I and Type II errors is highlighted. Type I errors occur when LLM-generated texts are misclassified as human-written, while Type II errors occur when human-written texts are mislabeled as LLM-generated. The authors argue that improving detector robustness against Type I errors often leads to an increase in Type II errors, revealing an inverse relationship between the two types of error. They also warn that these misclassifications can have serious consequences, such as falsely accusing a human of plagiarism, which can damage their professional or academic reputation.

In addition, the paper demonstrates that current models are vulnerable to adversarial attacks such as recursive paraphrasing, despite the use of technologies such as watermarking, deep-learning and zero-shot methods. Although human studies show that this type of paraphrasing only slightly reduces text quality, these attacks can confuse detectors and increase Type I errors. The authors conclude that, to avoid misuse of these models, an ideal detector should be able to accurately identify AI-created texts. However, they caution that the high cost associated with misidentification makes the practical application of these detectors unreliable and may even render it infeasible.

Wu et al. [27] conducted a thorough investigation into the current state of LLM detectors, examining the drawbacks of existing detectors and proposing several research directions for future work. Initially, they mention that current LLM detectors face two major problems: The first is the model augmented degradation (MAD) phenomenon, which mainly involves the risk of models being trained with erroneous knowledge published online, leading to repeated use of texts and reduced quality in generated texts. The second problem is that the models may provide false information, as they determine only the probability of the subsequent word without understanding the correctness of the information.

Wu et al. point out that there are currently three very active research areas related to LLM detection: the implementation of watermarking techniques, deep-learning methods and the use of LLMs as detectors. Some of the future work they propose includes creating detectors trained with more robust datasets and developing detectors suitable for resource-limited environments.

Research by Kumar et al. [28] presents an innovative detector based on the DistilBERT Transformer architecture. DistilBERT is a smaller and more efficient version of the bidirectional encoder representations from Transformers (BERT) model, chosen due to limited computational resources. The authors note that LLMs demonstrate remarkable text generation abilities, producing grammatically correct information with a coherent writing style; however, they cannot ensure the accuracy of the information provided, a phenomenon known as hallucination.

The authors conducted experiments on two datasets: “DAIGT-V3”, which includes twenty thousand essays written by humans and twenty thousand created by large language models, and “LLM – Detect AI Generated Text”, which contains student essays and texts created by various LLM models. It is crucial to note that neither of these datasets is protected against adversarial attacks.

The binary classification model they used demonstrated 100 percent accuracy in detecting texts created by LLMs and 90 percent accuracy in detecting texts written by humans, with recall rates of 84 percent and 90 percent, respectively. While the model performs well overall, it tends to misclassify texts written by humans. Kumar et al. conclude that, while DistilBERT is highly capable of identifying texts created by LLMs, there is still room for improvement in how human-written texts are classified. According to the study, DistilBERT could be useful in ensuring the quality of datasets used in a variety of applications.

According to Capobianco et al. [29], large language model detectors are crucial for maintaining academic integrity and benefiting society. The paper examines various models, including BERT and RoBERTa, for the binary classification of LLM and human texts. Experiments were conducted on the HC3 corpus, which contains 24,322 questions and corresponding answers from both human and LLM sources.

The authors trained different models of the BERT architecture, separated into two sets: in the first set the parameters were frozen, while in the second they were left unfrozen. The reported accuracies ranged from 88% to 100%. The RoBERTa model had a slightly lower accuracy than the BERT model, although it performs better when a larger amount of data is available. The study concludes that LLMs have a positive impact on society; however, it is crucial to ensure that these tools are used responsibly for the benefit of society as a whole.

Table 1 compares several studies on the detection of text generated by large language models (LLMs), focusing on the datasets, models used and main approaches. It includes work by Sadasivaan et al., Wu et al., Kumar et al. and Capobianco et al., along with our new proposal. Highlights include datasets such as HC3 and new ones, the use of models such as Transformers (BERT and RoBERTa), classical algorithms and watermarking techniques. The table also describes unique approaches, such as Type I and Type II error analysis, classification of GPT-generated texts, and identification of texts from multiple LLMs, showing the originality of each study.

4. Methodology

The methodology we implemented for our research is divided into different stages, consisting mainly of the creation of our datasets, experimentation with a variety of models and, finally, the analysis of the results.

4.1. Formation and Preprocessing of Linguistic Corpus

The current datasets used to train models for detecting text created by LLMs are limited in diversity and focus mainly on computer science, which makes the models less effective for other disciplines. To solve this problem, we developed code that connects to the arXiv API in order to collect scientific articles from a variety of fields, such as physics, medicine, electronics, and communications, and extract their abstracts using the Nougat model [30], which parses the content of the PDF files. Dataset creation was carried out in two stages, with the aim of prioritizing the thematic and stylistic diversity of the texts.
Figure 2 presents an outline of the procedure for constructing the dataset. Papers from different fields, such as medicine, computer science, astronomy, physics and mathematics, were incorporated. The database ultimately consisted of 1550 abstracts equally distributed among the different paper categories. After processing the texts with LLMs, the final result consisted of 7750 texts, evenly distributed across the human-written class and the classes generated by each LLM. This approach was chosen because, when constructing a dataset, it is crucial to maintain a balance between classes to prevent overfitting to a given class and to achieve better generalization of the problem.
The first stage consists of extracting the abstracts from the PDF files; here, we focused mainly on obtaining a very diverse representation of texts, ensuring that the dataset covers a variety of topics and writing styles. The second stage consists mainly of eliminating all articles published after 2017, since the Transformer architecture was introduced that year and the first models of this class, such as BERT and RoBERTa, followed shortly after. In this way, we can guarantee that our detection models are not trained on texts that may have been generated by Transformer-based models, which gives us a more solid and objective basis. Table 2 shows the structure of our datasets.
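A rough sketch of the collection step is shown below, using the public arXiv API query endpoint; the query terms, result limit and category are placeholders, not our exact crawler.

    import urllib.request
    import xml.etree.ElementTree as ET

    # Query the public arXiv API for papers in a given category.
    url = ("http://export.arxiv.org/api/query?"
           "search_query=cat:physics.med-ph&start=0&max_results=5")
    with urllib.request.urlopen(url) as response:
        feed = response.read()

    # The response is an Atom feed; each <entry> holds one paper and its abstract.
    ns = {"atom": "http://www.w3.org/2005/Atom"}
    root = ET.fromstring(feed)
    for entry in root.findall("atom:entry", ns):
        title = entry.find("atom:title", ns).text.strip()
        abstract = entry.find("atom:summary", ns).text.strip()
        year = entry.find("atom:published", ns).text[:4]   # used for the 2017 filter
        print(year, "-", title[:60])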
Once all human-written texts were obtained, we downloaded and installed Ollama, software that enables the installation of various large language models. We installed Llama3 [31] and LLaMA2 [32], both with 7 billion parameters, as well as Gemini [33] and LLaVA [34], also with 7 billion parameters. With the models installed, we created a prompt that includes an instruction and a summary, fed it to the LLM, and generated a new paraphrased text labeled with the name of the model. The following is an example of the instruction given to the large language model for the generation of the new texts.

Instruction:

Summarize the article. Do not generate any additional text, just provide the summary.

Summary:

The article discusses how artificial intelligence (AI) is transforming the educational sector. It focuses on the use of AI-based tools to personalize learning, improve teaching through automation and provide more efficient access to educational content. Additionally, it addresses the ethical and social challenges that may arise from the integration of these technologies, such as the potential gap between students with access to AI and those without.

In this study, the LangChain library was used in conjunction with Ollama to create the instructions and apply the constraints. LangChain enabled the unification and coordination of several elements of the system, which favored the creation of complex workflows. Specifically, a template was defined within LangChain for the instructions and constraints, which allowed the requirements of each task to be specified in an organized and adaptable way. This method made it possible to customize the requests to Ollama, ensuring that the responses produced were accurate and respected the constraints set, such as avoiding the creation of additional text and providing only the required summary. A minimal sketch of this setup is shown below.
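This sketch assumes the LangChain community integration for Ollama; the exact template wording and chain wiring in our pipeline may differ.

    from langchain_core.prompts import PromptTemplate
    from langchain_community.llms import Ollama

    # Template mirroring the instruction/summary structure described above.
    template = PromptTemplate.from_template(
        "Summarize the article. Do not generate any additional text, "
        "just provide the summary.\n\nSummary:\n{abstract}"
    )

    llm = Ollama(model="llama3")             # local model served by Ollama
    chain = template | llm                   # LangChain expression language

    human_abstract = "..."                   # abstract extracted with Nougat
    generated_text = chain.invoke({"abstract": human_abstract})
    print(generated_text)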

Once we have all our new texts generated by the different LLMs, we perform the dataset cleaning stage, eliminating unnecessary symbols and applying one-hot encoding for the labels, as well as tokenization, stopword removal, lemmatization and spell checking of the texts. The process we implemented for our dataset is presented in Figure 3.
At the end of the LLM text generation, we add these texts to our dataset, so the dataset looks like that in Table 3.
For our first dataset, we have 7750 texts proportionally distributed to ensure a balanced representation of each class, i.e., 1550 are human texts and 1550 correspond to each of the LLMs we used for text generation. At this point we also generate our second dataset, which includes various attacks, such as recursive paraphrasing and translation from one language to another. The columns mentioned in Table 2 are also present in this dataset, but we add recursive paraphrasing and translation, so the dataset becomes much larger and therefore more time-consuming to process and to use for training the machine-learning models. Figure 4 presents the development process for the creation of the second dataset.

4.2. Embeddings Generation

With the new datasets we developed, we applied several vectorization and embedding generation techniques in order to train our machine-learning models; these range from the most basic, such as TF-IDF vectorization, to more modern embedding generation methods, such as LLM embeddings. Each embedding varies in length depending on the model: traditional techniques such as TF-IDF generate very large vectors but are limited in capturing relationships between words, whereas Word2Vec, GloVe, BERT [35], RoBERTa [36] and other LLM-based models produce more compact but contextually richer embeddings. The first step was to store all embeddings in a single dataset; however, computational limitations led to the decision to separate them by model in order to optimize the loading of the sets. Figure 5 presents the whole process carried out for the generation of our feature vectors, and Table 4 presents the embedding sizes for each of the LLMs used and for the Transformer-based models, such as BERT and RoBERTa.
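As an illustrative sketch of how per-model embedding sets can be generated and stored separately, assuming the LangChain OllamaEmbeddings wrapper and a hypothetical clean_texts.csv file (our actual extraction code may differ):

    import numpy as np
    import pandas as pd
    from langchain_community.embeddings import OllamaEmbeddings

    texts = pd.read_csv("clean_texts.csv")    # hypothetical cleaned dataset
    models = ["llama3", "llama2", "llava"]    # local models served by Ollama

    for name in models:
        embedder = OllamaEmbeddings(model=name)
        # One dense vector per text; its length depends on the model (cf. Table 4).
        vectors = np.array(embedder.embed_documents(texts["text"].tolist()))
        # Store each model's embeddings separately to keep loading manageable.
        np.save(f"embeddings_{name}.npy", vectors)
        print(name, vectors.shape)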

4.3. Implementation of Classification Algorithms

Once we completed the initial dataset, we trained different machine-learning classification models, starting with classical algorithms such as logistic regression, support vector machines, decision trees and k-nearest neighbors. For each experiment, we considered both a simple validation using different training and testing percentages (mainly 70/30, 80/20 and 90/10 train/test splits) and cross-validation with different numbers of folds (mainly 4, 6, 8 and 10), the main goal being to identify the configuration that best optimized our metrics. After training each model on our dataset, we generated the confusion matrix and PCA plots to visualize the distribution of the data and their classification. These experiments were repeated using different embeddings, for a total of 9 experiments per classification model.
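A condensed sketch of this experiment loop is given below, with the splits and fold counts described above; the classifier, embedding file names and scoring choice are placeholders.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score, train_test_split
    from sklearn.metrics import f1_score

    X = np.load("embeddings_bert.npy")        # one of the embedding sets
    y = np.load("labels.npy")                 # class labels (human / each LLM)
    clf = LogisticRegression(max_iter=1000)

    # Simple validation with different train/test percentages.
    for test_size in (0.3, 0.2, 0.1):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_size, random_state=0)
        clf.fit(X_tr, y_tr)
        print(f"test={test_size:.0%}  F1={f1_score(y_te, clf.predict(X_te), average='macro'):.3f}")

    # Cross-validation with different numbers of folds.
    for k in (4, 6, 8, 10):
        scores = cross_val_score(clf, X, y, cv=k, scoring="f1_macro")
        print(f"{k}-fold CV  F1={scores.mean():.3f}")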

Once we trained the basic models, we implemented deep-learning algorithms, starting with fully connected neural networks and LSTM networks. We experimented with different layer configurations, learning rates, epochs, activation functions and validation percentages, in addition to using dropout layers in each experiment to avoid overfitting to the training-set data. In the same way as the classical models, classification metrics were extracted, along with their respective confusion matrices and PCA and t-SNE diagrams. Figure 6 and Figure 7 present the training and validation process of basic machine-learning algorithms and deep-learning networks.

After training the basic and deep-learning models, we continued with the fine-tuning of the BERT and RoBERTa models, using reduced versions such as DistilBERT and DistilRoBERTa. Adjustments were made to the model parameters and the number of epochs in each experiment; since these are pre-trained models, a large number of epochs is not necessary to obtain good results. For fine-tuning and testing with these models, the clean text is given as input, and the model is responsible for creating the embeddings and performing the classification. To evaluate the results, we extracted the values of the penultimate layer to apply PCA and visualize the distribution of our data, and we also calculated the evaluation metrics, including the confusion matrix.
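A compressed sketch of this fine-tuning setup is shown below, assuming the Hugging Face datasets and transformers Trainer APIs; the hypothetical clean_texts.csv file, column names and hyperparameters are illustrative.

    from datasets import load_dataset
    from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                              TrainingArguments, Trainer)

    # Hypothetical CSV with a "text" column and an integer "label" column per class.
    dataset = load_dataset("csv", data_files="clean_texts.csv")["train"]
    dataset = dataset.train_test_split(test_size=0.2)

    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased", num_labels=5)   # human class plus four LLM classes

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

    tokenized = dataset.map(tokenize, batched=True)

    args = TrainingArguments(output_dir="detector", num_train_epochs=3,
                             per_device_train_batch_size=16)
    trainer = Trainer(model=model, args=args,
                      train_dataset=tokenized["train"], eval_dataset=tokenized["test"])
    trainer.train()
    print(trainer.evaluate())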

Once we completed the fine-tuning of DistilBERT and DistilRoBERTa, we proceeded to implement large language models (LLMs) to perform the classification. Initially, we used prompt engineering, a technique that consists of designing a carefully crafted prompt that allows the LLM to perform the classification. The retrieval-augmented generation (RAG) technique was also implemented, and the LLaMA2 and LLaMA3 models were fine-tuned using LoRA optimization, which reduces the number of trainable parameters so that the models can be adjusted on systems with limited resources. Each of the trained models was saved along with its metrics, evaluation graphs and confusion matrices for future testing. The steps followed for these implementations are shown in Figure 8.
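An illustrative sketch of the prompt-engineering variant, using the ollama Python client, is shown below; the prompt wording and label parsing are assumptions, not our exact classifier.

    import ollama

    PROMPT = ("You are a classifier. Read the following scientific abstract and answer "
              "with exactly one word: HUMAN if it was written by a person, or LLM if it "
              "was generated by a large language model.\n\nAbstract:\n{text}")

    def classify(text, model="llama3"):
        response = ollama.chat(
            model=model,
            messages=[{"role": "user", "content": PROMPT.format(text=text)}],
        )
        answer = response["message"]["content"].strip().upper()
        return "LLM" if "LLM" in answer else "HUMAN"

    print(classify("We propose a novel method for ..."))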

After completing the first stage of training, it became clear that even models with good metrics are affected by various attacks, such as paraphrasing with tools like QuillBot, and that their performance decreases. A larger dataset was therefore used that includes attacks such as recursive paraphrasing and language translation. For this new set of experiments, we used the classical evaluation metrics, the confusion matrix and ROC/AUC curves; if a model does not correctly recognize a class, its area under the curve remains at or very close to 0.5. In the following section, we present the most notable results and analyze them.

5. Experiments and Analysis of Results

The results obtained show considerable variability in the performance of the classification models as a function of the different vectorization and embedding techniques employed. Although the TF-IDF vectorization method did not produce results high enough to be considered a good basis for classification, its simplicity still makes it an option to consider, even though its performance was modest.

The LSTM network trained with Word2Vec embeddings for 1000 epochs with a learning rate of 0.0001 showed low performance. Although its results were the best among the models using the Word2Vec dataset, and despite its robustness and ease in handling text sequences, this model fails to correctly capture the complexity present in the dataset. From this result, we can conclude that the choice of both the embeddings and the model architecture was not optimal, since the embeddings failed to effectively capture the context and the relationships between the words in the texts.

The random forest model trained with GloVe embeddings, like the LSTM model with Word2Vec embeddings, also failed to perform optimally. These experiments show that the combination of these techniques is not the most effective, even though they provided some of the best results among the classification models evaluated with GloVe embeddings.

In contrast, models using more complex embeddings, such as BERT and RoBERTa, demonstrated much better performance. The results were exceptional when BERT embeddings were combined with logistic regression. This indicates that, for effective classification, it is essential to use embeddings that more accurately capture the semantic and syntactic relationships of the texts. Furthermore, the outstanding results obtained by combining RoBERTa with a support vector machine (SVM) and 9-fold cross-validation demonstrate the ability of Transformer-based architectures to capture semantic and syntactic relationships in texts, yielding high-quality classification.

In addition, the LLM-created embeddings were exceptional. Combining basic machine-learning classification algorithms with LLM-created embeddings without fine tuning gave positive results. When fine-tuned, the DistilBERT and DistilRoBERTa models also showed good results.

Figure 9, Figure 10 and Figure 11 show the confusion matrix, the PCA plot and the t-SNE plot of the model with the lowest classification performance. It is clear that the creation of adequate embeddings is essential for the classes to be linearly separable, indicating that when classes are more clearly differentiated, basic classification models become more useful. In contrast, Figure 12, Figure 13 and Figure 14 show the confusion matrix, the PCA plot and the t-SNE plot of the SVM model trained on embeddings from the LLM LLaVA, while Figure 15, Figure 16 and Figure 17 show the confusion matrix, PCA plot and t-SNE plot of the fine-tuned DistilRoBERTa model.

6. Limitations

Although this study focused primarily on identifying texts produced by large language models, there are several constraints that need to be identified. First, our dataset, although varied in terms of topics, writing styles, and fields of knowledge, was constrained by existing computing resources. These restrictions impacted the magnitude of the data processed and the complexity of the models we were able to build. Consequently, the results of this research may not fully reflect the wider range of possible text generation scenarios with different LLMs or the wider scenario of possible applications.

Another significant restriction is that this research focuses only on the identification of texts produced by LLMs in English, without taking into account texts in other languages. Language variety is a crucial element in text identification, and models trained only in English may not be as effective in identifying texts produced in other languages due to variations in grammar, syntax and style.

Given these constraints, future work will focus on creating a larger and more varied dataset that includes texts in other languages, considers adversarial attacks, and uses more sophisticated computational resources. This will facilitate the development of more accurate and scalable models for the identification of texts produced by LLMs in real contexts.

7. Impact and Applicability

The incorporation of text identification models created by LLMs into plagiarism detection systems can have a significant impact on education and academic research. Currently, there is an increase in the misuse of automatic text generation technologies, which raises serious ethical concerns. Conventional plagiarism detection systems, which are mainly based on precise text matching, fail to detect texts created by broad language models, enabling students or creators to present artificially created content as if it were their own. The application of models capable of recognizing these texts can address this gap, thus ensuring greater integrity in academia.

By integrating these sophisticated models into plagiarism detection systems, not only could plagiarized texts be detected but also those created by artificial intelligence, which would allow academic institutions to establish a more precise differentiation between human work and that produced by artificial intelligence. This could be particularly beneficial in virtual education platforms and in scientific studies, where the use of large language models is constantly expanding. In addition, optimized detection systems could be merged with current content review tools, simplifying the work of teachers and academics in ensuring that the work submitted is original and ethical.

The social impact of this implementation is significant. Not only does it help combat plagiarism but it also fosters a more ethical and transparent learning environment. As language models become more sophisticated, the ability to differentiate them from human-written texts becomes a crucial element in maintaining trust in education and research. Furthermore, the responsible use of these technologies can be a starting point for new academic policies that promote integrity in the use of AI-based tools, striking a balance between innovation and ethics in academia.

8. Conclusions and Future Work

A key conclusion of this study is that it is not essential to fine-tune complex and resource-intensive models to effectively detect texts generated by large language models (LLMs). The results show that, by choosing appropriate embeddings, simpler and computationally efficient classification models can be employed while maintaining a high level of detection accuracy.

This implies that, by leveraging advanced embeddings that accurately reflect the semantic and syntactic features of texts, the ability of simpler models to differentiate between machine-generated and human-written texts can be greatly enhanced. This approach allows for a significant decrease in the computational costs associated with fine-tuning large models, without sacrificing classification accuracy.

This approach not only facilitates the implementation of models under resource constraints but also broadens access to advanced detection tools in the field of natural language processing. As a result, it opens the possibility of adopting more sustainable and cost-effective solutions, promoting an optimal balance between performance and computational requirements. We believe that our work represents a significant contribution towards the efficient classification of texts generated by large language models.

In future research, we plan to implement a more robust dataset, performing training and embedding generation in a manner similar to that presented in this paper. Currently, we are working on this phase, but due to the larger size and complexity of the new dataset the training and validation time of the models has increased significantly, which also requires a greater use of computational resources. In the end, the models described in this study will be evaluated to determine which one offers the best performance, also considering its ability to handle adversarial attacks.


