Who Will Author the Synthetic Texts? Evoking Multiple Personas from Large Language Models to Represent Users’ Associative Thesauri


1. Introduction

“Just get more data!” has been the universal principle for improving deep learning (DL) neural network (NN) models for over a decade. Indeed, although the effect of training dataset size on model quality is believed to be logarithmic [1], it has remained quite reliable, even if diminishing, provided of course that the NN architectures were sufficiently advanced. However, in certain domains, we are about to hit the limit of the naturally available data. This mostly concerns data created with the participation of humans—foremost, coherent texts [2]—while, e.g., astronomy, with its potentially unlimited data-generating space, remains relatively safe from data scarcity. As for the recently booming Large Language Models (LLMs), the sizes of pre-training corpora, which are the largest of all dataset types, currently (in 2024) reach the order of 10^14 bytes or 10^12 tokens [3] (Table 1). Meanwhile, the amount of high-quality language data (research papers, books) is estimated at several trillion (10^12) tokens [2].
Since the advent of deep learning models about a decade ago, their sizes (and thus the required amount of training data) have been increasing at an unprecedented rate, far exceeding the production of new information. For instance, during the most recent 5 years, frontier LLMs have grown from about 340 million parameters (BERT) to 175 billion parameters (GPT-3) to over a trillion parameters (GPT-4) [2,4]. This corresponds to an annual growth of about 400%. There are different estimates of the growth rate of online information and data, but most commonly they state that the volume doubles every 1 or 2 years, corresponding to an annual growth of just 41–100% (a doubling time of T years implies an annual growth rate of 2^(1/T) − 1, i.e., 2^(1/2) − 1 ≈ 41% for doubling every two years and 100% for doubling every year). Thus, for Natural Language Processing (NLP) there are predictions that “models will be trained on datasets roughly equal in size to the available stock of public human text data between 2026 and 2032” [5] and that “training dataset size may soon be limited by the amount of text data available on the internet” [6].
Correspondingly, text augmentation approaches based on generated content are becoming increasingly popular [7], and reliance on the output of LLMs appears to be a natural way around this forthcoming “data wall” [8]. However, besides certain technical and organizational challenges in obtaining such added “surrogate” or “synthetic” data (see, e.g., [9]), conceptual problems with its effectiveness have been reported. Foremost, straightforward training with synthetic data is known to lead to “model collapse”—i.e., degradation of performance with each model-data feedback iteration [10,11]. Generally, very careful consideration is required when adding surrogate data to training sets [12], as “either model collapse or stability could be possible depending on the fraction of generated data used at each retraining step” [13]. For instance, it has been demonstrated that repeating as little as 0.1% of the data 100 times resulted in a model achieving only half the accuracy of one trained on non-duplicated content [2].
Indeed, a very important requirement for synthetic data is that its variance matches that of real-world datasets [14], and “raw” LLM-generated data were found to have lower lexical diversity than real data [11]. Current approaches for making synthetic data more realistic (and thus less damaging to models) range from mathematical ones [12,15] to linguistic ones, which seek to bend the LLMs’ output, generally via specifically constructed prompting. Within the latter direction, a natural approach is imitating the diversity of the billions of human authors who have been creating written texts throughout the history of humankind. Recent work in the field by Tencent AI Lab proposes a “persona-driven” data synthesis methodology, which involves inferring/evoking artificial personalities from LLMs and populating a Persona Hub with a billion of them [16]. To the best of our knowledge, based on the related publications that came to our attention, their main focus is on the approaches for evoking the multitude of personas. For instance, a specialized prompting pattern language is being developed [17], which incorporates such features as job descriptions and titles, roles and scenarios, backgrounds and motivations, languages and cultural contexts, etc. Meanwhile, less attention is paid to the actual diversity of the output—e.g., in [16], diversity is evaluated for math problems, but not for generated texts—and to the “personal characteristics” that affect it. In particular, we were unable to find research publications concerned with the diversity of the evoked personas’ “thesauri” and its comparison to what we would expect from different human authors of natural language texts.

In our paper, we put forward and verify several hypotheses related to LLMs’ ability to imitate human-like differences in associative thesauri and explore certain demographic factors (age, gender) that affect them. To this end, we set up an associative experiment with 50 human participants and two LLM-based services, ChatGPT and YandexGPT, each imitating four personas (evoked characters). For the five stimulus words from the domain of Academia/Education, each of the 58 “subjects” supplied 50 associations (in Russian), which we compared using cosine similarity and Mahalanobis distance. Our subsequent statistical analysis suggests that while the different humans’ associations were indeed significantly different, this was not the case for the artificial personas evoked from the LLMs.

The preliminary results of the experiment were reported at the international conference Internet and Modern Society (IMS-2024) in June 2024 [18]. The current paper features additional data collected from YandexGPT and from real humans, and presents a statistical analysis of the differences measured via semantic similarity.
The remainder of our paper is structured as follows. In Section 2, we provide a brief overview of the related work and put the methods applied in our study into perspective. Section 3 describes the experiment with ChatGPT, YandexGPT, and the 50 human participants. In Section 4, we present the analysis of the effects of the extrapersonal and demographic factors on the semantic similarity-based differences between the associations. In the final section of the paper, we summarize and discuss our findings and outline the plan for further work.

4. Results

4.1. Descriptive Statistics

Ten associations were generated for each persona and each stimulus by each service, so a total of 400 = 10 × 5 × 4 × 2 lexical units were collected, reflecting the thesauri of the LLM-evoked personas, and organized into 20 lists. Tables S1 and S2 in the Supplementary Materials present the associations produced by GPT-4 and YandexGPT Pro, respectively (the online Appendix is at https://github.com/gorovuha/personGPT/tree/main). For the human associators, 2500 = 10 × 5 × 50 lexical units were collected and organized into 250 lists.
In Table 2, we provide the pair of association lists with the highest MD value among all the list pairs supplied by ChatGPT. Interestingly, both lists “belong” to the same persona, the Graduate, while the stimulus words were center and science.
In Table 3, we provide examples of associations from some of the human participants. The pair of lists user ID 37—science and ID 4—education had the lowest CS value of all the pairs, whereas ID 37—science and ID 25—university had the highest MD value. This is interesting, since ID 37’s associations for science appear to be rather conventional. For the sake of illustration, we also include the associations given by different humans for the same stimuli.

4.2. The Distance Measures

The mean CS amounted to 0.760 (SD = 0.131) for ChatGPT, 0.737 (SD = 0.069) for YandexGPT, and 0.693 (SD = 0.080) for the human associators. Table 4 presents the means and SDs for cosine similarity (CS) between the associative lists per stimulus word, for the two considered LLMs and the human participants.

The mean Mahalanobis distance (MD) amounted to 211.1 (SD = 51.1) for ChatGPT, 517.0 (SD = 175.8) for YandexGPT, and 167.2 (SD = 23.4) for the human associators. The Pearson correlation with the CS measures for the respective pairs of lists was highly significant for the human associators (r(31125) = −0.947, p < 0.001), but not significant (at α = 0.05) for either ChatGPT or YandexGPT.
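
The excerpt does not reproduce the embedding pipeline from Section 3, so the following is only a minimal sketch of the two measures and their correlation, assuming that each association list has already been pooled into a single embedding vector (the 300-dimensional toy vectors, the pooled representation, and the function names are our assumptions):

```python
import numpy as np
from itertools import combinations
from scipy.spatial.distance import cosine, mahalanobis
from scipy.stats import pearsonr

def pairwise_measures(list_vectors: np.ndarray):
    """CS and MD for every pair of association-list vectors."""
    # Inverse covariance over all list vectors, as required by the
    # Mahalanobis distance; pinv guards against a singular estimate.
    vi = np.linalg.pinv(np.cov(list_vectors, rowvar=False))
    cs, md = [], []
    for a, b in combinations(range(len(list_vectors)), 2):
        u, v = list_vectors[a], list_vectors[b]
        cs.append(1.0 - cosine(u, v))   # scipy's cosine() is a distance
        md.append(mahalanobis(u, v, vi))
    return np.array(cs), np.array(md)

# Toy stand-in: 250 association lists (as for the human associators),
# each represented by a single 300-dimensional embedding vector.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(250, 300))

cs, md = pairwise_measures(vectors)
r, p = pearsonr(cs, md)  # reported above as r = -0.947 for the humans
print(f"mean CS = {cs.mean():.3f}, mean MD = {md.mean():.1f}, r = {r:.3f}")
```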

The effect of Is_diff_stimulus on CS was found to be significant for ChatGPT (F(1,188) = 6.87, p = 0.009) and highly significant for YandexGPT (F(1,188) = 30.9, p < 0.001) and the human associators (F(1,31123) = 531.1, p < 0.001). As expected, the similarities between association lists provided for the same stimulus word were always higher than for different stimuli: considerably so for ChatGPT (0.817 vs. 0.749) and YandexGPT (0.797 vs. 0.726), and somewhat less markedly for the more numerous human associators (0.714 vs. 0.688).

The effect of Is_diff_stimulus on MD was found to be significant for YandexGPT (F(1,188) = 5.29, p = 0.02) and the human associators (F(1,31123) = 581.8, p < 0.001). The effect was not significant for ChatGPT (p = 0.208), though the mean distance for the same stimuli (200.4) was, as expected, lower than for different stimuli (213.2).
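
The single-factor tests above can be reproduced with a standard one-way ANOVA over the table of list pairs; a minimal sketch with simulated CS values (the pair-table layout and the effect sizes in the toy data are our assumptions):

```python
import numpy as np
import pandas as pd
from itertools import combinations
from scipy.stats import f_oneway

# Reconstruct the pair structure of one LLM: 4 personas x 5 stimuli = 20
# lists, giving C(20, 2) = 190 pairs, 30 of which share a stimulus.
lists = [(p, s) for p in range(4) for s in range(5)]
rng = np.random.default_rng(1)
rows = []
for (p1, s1), (p2, s2) in combinations(lists, 2):
    is_diff = int(s1 != s2)
    # Simulated CS: same-stimulus pairs get a slightly higher mean.
    rows.append({"is_diff_stimulus": is_diff,
                 "cs": rng.normal(0.75 + 0.07 * (1 - is_diff), 0.05)})
pairs = pd.DataFrame(rows)

same = pairs.loc[pairs["is_diff_stimulus"] == 0, "cs"]
diff = pairs.loc[pairs["is_diff_stimulus"] == 1, "cs"]

# One-way ANOVA with a binary factor (equivalent to a two-sample F-test).
f_stat, p_value = f_oneway(same, diff)
print(f"F(1,188) = {f_stat:.2f}, p = {p_value:.4f}")
```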

4.3. Qualitative Analysis of the Thesauri

4.3.1. For the LLMs

The qualitative analysis of the associations produced by the language models yields the following highlights:

  • For the university stimulus, the associations mostly focused on knowledge acquisition and career progression. Association words such as knowledge, teachers, lectures, and career suggest that LLMs relate universities to the educational process and preparation for future employment. This is probably explained by the fact that most of the available texts featuring this stimulus word prominently were produced by university staff members.

  • Similarly, for the program stimulus, the associations emphasized the structural and procedural aspects of academic life, such as timetable, subjects, courses, and credits, reflecting the focus on the formal requirements and organization of education.

  • For the science stimulus, the associations were more technical and included words like experiment, theory, and research, which suggests that the available texts concerning scientific activities are generally aimed at a higher education level, at least at the graduate degree.

  • The different personas exhibited seemingly varying trends in their responses. For instance, the Enrollee focused primarily on terms related to studies and future career opportunities, such as library, courses, and diploma. The Graduate’s associations had a slightly different emphasis: although there were terms like exams and modules, some more research-related aspects of graduate university life were mentioned, such as laboratory. Meanwhile, the Parent brought in associations focused on enrollment and the formal steps of the academic process, rather than its content: competition and choice.

  • Almost all the associations were paradigmatic and illustrated a wide range of semantic relations based on similarity or contiguity.

4.3.2. For the Human Associators

The highlights of the qualitative analysis for the human associators are as follows:

  • Across nearly all demographic groups, there was a strong emphasis on terms related to education and research: knowledge, lecture, diploma, research, experiment, theory, etc. This suggests a shared conceptual framework among participants and confirms their familiarity with academic or scientific environments, just as for the LLMs’ personas.

  • There was a balance between practice-oriented associations, like laboratory or practice, and theory-oriented ones, such as theory or hypothesis.

While many associations were shared across the different professional and educational fields, there was also some specificity:

  • The participants whose background or occupation was in Science and Technology (e.g., IT or Mathematics) often mentioned terms like experiment or research study, which seemed to reflect their closer engagement with empirical work;

  • The participants from administrative fields or the humanities exhibited fewer such specialized terms, instead supplying associations more focused on general education and academic work processes;

  • Most associators noticed a contradiction between their linguistic intuition and the instructions they had to follow in the course of the experiment: they were expected to produce paradigmatic associations, but in fact they tended to think of syntagmatic associations first.

4.4. The Effect of the Personal Factors on the Associative Thesauri

Table 5 presents the means and SDs for cosine similarity (CS) between the associative lists output by the LLMs as the evoked personas.

Since the effect of Is_diff_stimulus was found to be significant for both ChatGPT and YandexGPT, we excluded the 30 cases with Is_diff_stimulus = 0 (understandably, these were possible only for Is_diff_person = 1, as each persona produced a single list per stimulus) from the analyses performed for both models. Similarly, 6125 cases with Is_diff_stimulus = 0 were excluded from the analysis of the CS data obtained for the human participants.

Table 6 presents the results of ANOVA tests for the effect of the factors on the cosine similarity for the two LLMs and the human participants (the p-values significant at α = 0.05 are highlighted). Is_diff_person was analyzed separately, while Is_diff_gender and Is_diff_age_group were analyzed together in a univariate model (the interaction between these two factors was not significant). Notably, while none of the differences were significant for the LLM-evoked personas (including the interaction between the age and gender factors), all the factors had significant effects on CS for the human associators.
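
Table 6's tests can be reproduced with a univariate linear model; the following is a minimal sketch over a flat pair table using the statsmodels formula API (the column names mirror the factor names used in the text, and the simulated values are our assumptions):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Toy stand-in for the human pair table (31,125 pairs in the real data),
# with simulated CS values and binary difference flags.
rng = np.random.default_rng(2)
n = 31125
pairs = pd.DataFrame({
    "cs": rng.normal(0.69, 0.08, size=n),
    "is_diff_stimulus": rng.integers(0, 2, size=n),
    "is_diff_gender": rng.integers(0, 2, size=n),
    "is_diff_age_group": rng.integers(0, 2, size=n),
})

# Exclude same-stimulus pairs, mirroring the exclusion described above
# (6125 such cases in the actual human data).
diff_stim = pairs[pairs["is_diff_stimulus"] == 1]

# Univariate model with the two demographic factors and their interaction.
model = smf.ols("cs ~ C(is_diff_gender) * C(is_diff_age_group)",
                data=diff_stim).fit()
print(sm.stats.anova_lm(model, typ=2))
```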

The mean cosine similarity for the associations produced by different humans (Is_diff_person = 1) was 0.688 (SD = 0.078), compared to 0.705 (SD = 0.084) for the same person. Humans of different genders (Is_diff_gender = 1) had a mean CS of 0.686 (SD = 0.084), compared to 0.689 (SD = 0.074) for the same gender. Whereas for humans of different age groups CS was 0.689 (SD = 0.076), it was unexpectedly lower, at 0.687 (SD = 0.082), for representatives of the same age group.

Table 7 presents the results of the same analysis for the effect of the factors on the Mahalanobis distance (MD) for the two LLMs and the human participants (the p-values significant at α = 0.05 are highlighted). Similarly, the effect of Is_diff_person was significant for the human participants (mean Mahalanobis distance of 168.9 for different humans vs. 163.8 for the same person), but not for ChatGPT or YandexGPT. Unlike in the analysis performed for CS, here Is_diff_gender was not a significant factor.

5. Discussion and Conclusions

As the training datasets of real texts are nearing depletion [6], computational linguistics and Natural Language Processing are turning to data augmentation via synthetic data generation. It is believed that in order for the latter to be beneficial for LLMs’ development, the diversity of synthetic texts must be ensured, imitating the diversity that results from the differences between human authors. While billion-scale repositories of such artificial “authors” are being developed [16], it remains unclear whether these “personas” are indeed as linguistically diverse as human beings.

In our project, we came up with descriptions of four personas—typical visitors to university websites—inspired by the techniques suggested by A. Cooper in HCI: each with a name, age, gender, marital status, location, education, current job, future plans, etc. The four personas devised in our study correspond to the classical groups of external target users (i.e., not considering intranet users) of a university website: undergraduate and graduate applicants, their parents, and young researchers (job-seekers in academia).
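
Purely for illustration, a persona description of this kind can be supplied to an LLM-based service as a contextual (system) prompt. The sketch below uses the OpenAI Python client; the persona bio, the instruction wording, and the model choice are our illustrative assumptions, not the study's actual prompts:

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

# Hypothetical Cooper-style persona bio (illustrative only; the study's
# actual persona descriptions and prompt wording are not reproduced here).
persona = (
    "You are Anna, a 17-year-old school graduate from Novosibirsk who is "
    "choosing a university to apply to. Answer strictly as this person."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": persona},
        {
            "role": "user",
            "content": "Name 10 single-word associations (in Russian) "
                       "for the stimulus word: university",
        },
    ],
)
print(response.choices[0].message.content)
```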

We then utilized the five words most frequently encountered in the texts on such websites as the stimuli for two LLM-based services, GPT-4(o) and YandexGPT Pro, which were prompted to imitate the associations of the personas. The same procedure was followed in an associative experiment with 50 human participants of different ages, genders, and majors. We used the cosine similarity and Mahalanobis distance semantic measures to evaluate differences in the humans’ and the LLM-evoked personas’ associations. The outcomes of the hypothesis testing done in our study are as follows.

H1-1: Accepted and H1-2: Accepted. Both the LLMs (ChatGPT: p = 0.009; YandexGPT: p < 0.001) and the human participants (p < 0.001) in our study produced significantly different associations for different stimuli, as measured with CS. This confirms the validity of our experimental setup and the LLMs’ expected capability to differentiate between linguistic tokens. The only exception was the MD measure for ChatGPT (p = 0.208), but even then the mean distance for the same stimuli was, as expected, lower than for different stimuli.

H2-1: Rejected and H2-2: Accepted. The associations produced by different personas were not significantly different from the ones produced by the same persona. At the same time, the difference was, as expected, highly significant (p < 0.001) for the human participants, which reinforces the validity of our study. This is the main finding of our study, supported by both the CS and MD measures of semantic similarity.

H3-1: Rejected and H3-2: Partially Accepted. None of the considered demographic factors (gender and age) had a significant effect on the similarity of the associations produced by the LLMs acting as the personas. Meanwhile, for the human associators, the difference in gender measured with CS was highly significant (p = 0.002), whereas measured with MD it was not (p = 0.649). The factor of age was also significant for the humans only, but the similarities were lower for participants of the same age group. The reason for this unexpected finding might be our somewhat arbitrary rule for defining the age groups (a difference of 8 years or more), which did not consider generational boundaries with respect to cultural/linguistic gaps.

The highlights of the qualitative analysis of our data are in line with the existing knowledge in the field. The apparent differences in the associations produced by LLMs understandably stem from the special features and biases found in the texts—in our case, related to education/academia—and seem to be semantically restricted. For the human associators, the differences rather reflect their backgrounds, careers, and personal experiences. Also, unlike the human participants who leaned towards syntagmatic associations, the LLM-based personas almost always generated paradigmatic ones.

Our findings suggest that the textual data (associations) that we were able to extract from the LLMs in our experiment do not match the corresponding human-produced data, at least in their individual (author-bound) diversity. Existing studies agree that overcoming the limitations of synthetic training data is crucial for preventing model collapse [46] and ensuring the proper positive scaling effect on model performance [47]. This has been demonstrated in a wide range of real-world scenarios: from synthetic data generation with GPT-FL and PAC-GPT in networking technologies [48] to facial recognition systems’ training [49] to synthetic tabular data generation for artificial “students” with CTGAN, GPT2, DistilGPT2, and DialoGPT [50]. Among the problematic characteristics of synthetic data that presumably can be fixed via advanced prompt engineering, diversity is often named first [51,52], before biases, representativeness, efficiency, etc. However, with respect to authorship diversity, the problem has not yet been resolved in either the Humanities [35] or the Social Sciences [52].
The results presented in our work have certain limitations. Foremost, we only used a handful of personas and did not experiment with different ways to describe them in contextual prompts to the LLMs. Although the number and diversity of personas could have been increased with relative technical ease (e.g., by adding a journalist who seeks tech-related news on the website), we decided it would be artificial and foreign to the university website content and the language domain of academia. Instead, we opted for maximum representativeness in terms of HCI, since it is the context prompt-specification paradigm that we explore in our current work. We would like to note that while the human participants were also relatively homogeneous, their associations did differ significantly, unlike the personas’ ones. In our opinion, this reinforces the validity of our experimental design. To support the similarity between the groups, we did not employ the factor of cultural background difference in the current study: all the human participants are Russian by origin, and the personas’ descriptions mention the names of Russian cities. The qualitative analysis of the associations provided by the human participants suggests that the major source of individual differences is clearly the emotional load, which in turn stems from previous life experience (see Table 3, especially the unconventional “Queue” and “Circus” associations for the education stimulus word). With this in mind, one should expect that even longer specifications for the personas, with detailed bios and values, would make the contextual prompts more effective in creating artificial authors with diverse thesauri and writing styles.

Second, we only experimented with two LLM-based services, and only with prompts and outcomes in Russian. Our further research plans involve running the associative experiment with representatives of different nations and cultures, and in the English language. Third, we only considered two differentiating demographic factors for our human participants and the personas. In the future, we also plan to study the effect of majors and occupational fields (the professional background factor) more closely. As for the dependent variables, we would like to note that although there was the expected high negative correlation (r = −0.947) between cosine similarity and Mahalanobis distance for the humans, there were no significant correlations for either of the models. Further research might be needed to explore whether this lack of uniformity between the semantic measures is specific to synthetic linguistic output.

Finally, our study has been performed for the stimuli and associations belonging to a rather specific domain: university websites and academia. We cannot yet be sure how far our findings can generalize for different LLMs, languages and domains, or context-specification prompt techniques.

Still, we believe that the results of our experiment and certain methodological insights obtained in our study might be of interest to both linguists and machine learning researchers and engineers. Further refinement of the persona-driven approach might be necessary to make LLMs produce synthetic training texts of appropriate quality.
