Differentiate ChatGPT-generated and Human-written Medical Texts

Background: Large language models such as ChatGPT are capable of generating grammatically perfect and human-like text content, and a large number of ChatGPT-generated texts have appeared on the Internet. However, medical texts such as clinical notes and diagnoses require rigorous validation, and erroneous medical content generated by ChatGPT could potentially lead to disinformation that poses significant harm to healthcare and the general public. Objective: This research is among the first studies on responsible and ethical AIGC (Artificial Intelligence Generated Content) in medicine. We focus on analyzing the differences between medical texts written by human experts and generated by ChatGPT, and designing machine learning workflows to effectively detect and differentiate medical texts generated by ChatGPT. Methods: We first construct a suite of datasets containing medical texts written by human experts and generated by ChatGPT. In the next step, we analyze the linguistic features of these two types of content and uncover differences in vocabulary, part-of-speech, dependency, sentiment, perplexity, etc. Finally, we design and implement machine learning methods to detect medical text generated by ChatGPT. Results: Medical texts written by humans are more concrete, more diverse, and typically contain more useful information, while medical texts generated by ChatGPT pay more attention to fluency and logic, and usually express general terminologies rather than effective information specific to the context of the problem. A BERT-based model can effectively detect medical texts generated by ChatGPT, and the F1 exceeds 95%.


Background
Since the advent of pretrained language models, such as GPT [1] and bidirectional encoder representations from transformers (BERT) [2], in 2018, transformer-based [3] language models have revolutionized and popularized natural language processing (NLP). More recently, large language models (LLMs) [4,5] have demonstrated superior performance on zero-shot and few-shot tasks. Among LLMs, ChatGPT is favored by users due to its accessibility as well as its ability to produce grammatically correct and human-level answers in different domains. Within a few months of its release by OpenAI in November 2022, ChatGPT gained significant attention, and it has been widely discussed in the NLP community and other fields since then.
To balance the cost and efficiency of data annotation and train an LLM that better aligns with user intent in a helpful and safe manner, researchers used reinforcement learning from human feedback (RLHF) [6] to develop ChatGPT. RLHF uses a ranking-based human preference data set to train a reward model with which ChatGPT can be fine-tuned by proximal policy optimization [7]. As a result, ChatGPT can understand the meaning and intent behind user queries, which empowers ChatGPT to respond to queries in the most relevant and useful way. In addition to aligning with user intent, another factor that makes ChatGPT popular is its ability to handle a variety of tasks in different domains. The massive training corpus from the internet endows ChatGPT with the ability to learn the nuances of human language patterns. ChatGPT seems to be able to successfully generate human-level text content in all domains [8-12].
However, ChatGPT is a double-edged sword [13]. Misusing ChatGPT to generate human-like content can easily mislead users, resulting in wrong and potentially detrimental decisions. For example, malicious actors can use ChatGPT to generate a large number of fake reviews that damage the reputation of high-quality restaurants while falsely boosting the reputation of low-quality competitors. This is an example that can potentially harm consumers [14].
When using ChatGPT, some potential risks need to be considered. First of all, it may limit human creativity. ChatGPT has the ability to debug code or write essays for college students. It is important to consider whether ChatGPT will generate unique creative work or simply copy content from its training set. New York City public schools have banned ChatGPT.
What is more, ChatGPT has the ability to produce text of surprising quality, which can deceive readers, and the end result is a dangerous accumulation of misinformation [15]. Stack Overflow, a popular platform for coders and programmers, banned the use of ChatGPT-generated content because the average rate of correct answers from ChatGPT is too low and could cause significant harm to the site and the users who rely on it for accurate answers.

Development of Language Models
Transformer-based language models have demonstrated a strong language modeling ability. Generally speaking, transformer-based language models are divided into 3 categories: encoder-based models (eg, BERT [2], RoBERTa [16], and ALBERT [17]), decoder-based models (eg, GPT [1] and GPT2 [18]), and encoder-decoder-based models (eg, Transformer [3], BART [19], and T5 [20]). In order to combine biomedical knowledge with language models, many researchers have added biomedical corpora for training [21-25]. Alsentzer et al [26] fine-tuned the publicly released BERT model on the Medical Information Mart for Intensive Care (MIMIC) data set [27] and demonstrated good performance on natural language inference and named entity recognition tasks. Lee et al [28] fine-tuned BERT on the PubMed data set, and it performed well on biomedical named entity recognition, biomedical relation extraction, and biomedical question-answering tasks. Based on the backbone of GPT2 [18], Luo et al [29] continued pretraining on the biomedical data set and showed superior performance on 6 biomedical NLP tasks. Other innovative applications include ClinicalRadioBERT [30] and SciEdBERT [31].
In recent years, decoder-based LLMs have demonstrated excellent performance on a variety of tasks [9,11,32,33]. Compared with previous language models, LLMs contain a large number of trainable parameters; for example, GPT-3 contains 175 billion parameters. The increased model size of GPT-3 makes it more powerful than previous models, boosting its language ability to near human levels in medical applications [34]. ChatGPT belongs to the GPT-3.5 series, which is fine-tuned based on RLHF. Previous research has shown that ChatGPT can achieve a passing score equivalent to that of a third-year medical student on a medical question-answering task [35].
ChatGPT has also demonstrated a strong understanding of high-stakes medical domains, including specialties such as radiation oncology [33]. Medical information typically requires rigorous validation. Indeed, false medical-related information generated by ChatGPT can easily lead to misjudgment of the developmental trend of diseases, delay the treatment process, or negatively affect the life and health of patients [36].
However, ChatGPT lacks the knowledge and expertise necessary to accurately and adequately convey complex scientific concepts and information. For example, human medical writers cannot yet be fully replaced because ChatGPT does not have the same level of understanding and expertise in the medical field [37]. To prevent the misuse of ChatGPT to generate medical texts and avoid the potential risks of using ChatGPT, this study focuses on the detection of ChatGPT-generated text for the medical domain. We collected both publicly available expert-generated medical content and ChatGPT-generated content through the OpenAI interface. This study seeks to answer 2 questions: (1) What is the difference between medical content written by humans and that generated by ChatGPT? (2) Can we use machine learning methods to detect whether medical content is written by human experts or ChatGPT?
In this work, we make the following contributions to academia and industry:
• We constructed 2 data sets to analyze the difference between ChatGPT-generated and human-generated medical text. We have published these 2 data sets to facilitate further analysis and research on ChatGPT for researchers.
• We conducted a language analysis of medical content written by humans and that generated by ChatGPT. From the analysis results, we can grasp the difference between ChatGPT and humans in constructing medical content.
• We built a variety of machine learning models to detect text samples generated by humans and ChatGPT and explained and visualized the model structures.
In summary, this study is among the first efforts to qualitatively and quantitatively analyze and categorize differences between medical text generated by human experts and artificial intelligence-generated content (AIGC). We believe this work can spur further research in this direction and provide pathways toward responsible AIGC in medicine.

Data Set Construction
To analyze and discriminate human- and ChatGPT-generated medical texts, we constructed the following 2 data sets:
• Medical abstract data set: This original data set came from the work of Schopf et al [38] and involves digestive system diseases, cardiovascular diseases, neoplasms, nervous system diseases, and general pathological conditions.
• Radiology report data set: This original data set came from the work of Johnson et al [27], and only a subset of radiology reports were selected to build our radiology report data set.
Both the medical abstract and radiology report data sets are in English. We sampled 2200 text samples from the medical abstract and radiology report data sets as medical texts written by humans. In order to guide ChatGPT to generate medical content, we adopted the method of text continuation with demonstration instead of rephrasing [14] or query [39] with in-context learning because text continuation can produce more human-like text. The prompts used to generate medical abstract and radiology report data sets are shown in Figure 1. We used 2 different prompts to generate ChatGPT texts. In order to avoid the influence of ChatGPT randomness, we generated 2 groups of texts for each prompt. We randomly selected a sample (excluding the sample itself) from the data set as a demonstration. Finally, we obtained medical abstract and radiology report data sets containing 11,000 samples: 2200 written by humans and 8800 generated by ChatGPT. According to the 2 different prompts and 2 different random groupings, these 11,000 samples can form 4 groups of data, each containing the same 2200 samples written by humans and the 2200 samples generated by ChatGPT with one of the prompts and one of the random groups.
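The grouping described above can be illustrated with a minimal sketch. It assumes that each group pairs the same human-written samples with the texts produced by one prompt in one generation run; the function names `pick_demonstration` and `build_groups`, the label convention, and the toy strings are hypothetical and are not taken from the released data sets.

```python
import random

def pick_demonstration(samples, index, rng):
    """Pick a random demonstration sample other than samples[index]."""
    candidates = [i for i in range(len(samples)) if i != index]
    return samples[rng.choice(candidates)]

def build_groups(human_texts, generated):
    """Pair the same human texts with each (prompt, run) batch of generated texts.

    `generated` maps (prompt_id, run_id) -> list of generated texts.
    Labels (an assumption of this sketch): 0 = human-written, 1 = ChatGPT-generated.
    """
    groups = {}
    for key, texts in generated.items():
        groups[key] = [(t, 0) for t in human_texts] + [(t, 1) for t in texts]
    return groups

rng = random.Random(0)
human = [f"human text {i}" for i in range(5)]  # toy stand-in for the human samples
generated = {
    (p, r): [f"generated p{p} r{r} #{i}" for i in range(5)]
    for p in (1, 2) for r in (1, 2)
}
groups = build_groups(human, generated)
demo = pick_demonstration(human, 2, rng)  # demonstration for sample 2, never sample 2 itself
```

With 2 prompts and 2 runs, the sketch yields 4 balanced groups that share their human-written half, so differences between groups reflect only the prompt and the sampling randomness.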

Linguistic Analysis
We performed linguistic analysis of the medical content generated by humans and ChatGPT, including vocabulary and sentence feature analysis, part-of-speech (POS) analysis, dependency parsing, sentiment analysis, and text perplexity.
Vocabulary and sentence feature analysis illuminates the differences in the statistical characteristics of the words and sentences constructed by humans and ChatGPT when generating medical texts. We used the Natural Language Toolkit [40] to perform POS analysis. Dependency parsing is a technique that analyzes the grammatical structure of a sentence by identifying the dependencies between the words of the sentence. We applied CoreNLP (Stanford NLP Group) [41] for dependency parsing and compared the proportions of different dependency relationships and their corresponding dependency distances. We applied a pretrained sentiment analysis model [42] to conduct sentiment analysis for both the medical abstract and radiology report data sets. Perplexity is often used as a metric to evaluate the performance of a language model, with lower perplexity indicating that the language model is more confident in its predictions. We used the BioGPT [29] model to compute the perplexity of the human-written and ChatGPT-generated medical text.
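As a reminder of what the perplexity score measures, the sketch below computes it from per-token log probabilities using the standard definition (the exponential of the negative mean log probability). It does not call BioGPT; the log probabilities are made-up values, and the function name `perplexity` is only illustrative.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the negative mean natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Made-up values: a model that assigns p = 0.5 to every token is more
# confident than one that assigns p = 0.1, so its perplexity is lower.
confident = [math.log(0.5)] * 4
uncertain = [math.log(0.1)] * 4

ppl_confident = perplexity(confident)
ppl_uncertain = perplexity(uncertain)
```

In practice the log probabilities would come from the language model used for scoring, and the same function would be applied to each text sample.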

Detecting ChatGPT-Generated Text
Text content generated by LLMs has become popular on the internet. Since most of the content generated by LLMs is text with a fixed language pattern and language style, a large volume of generated text is not conducive to active human creation and can cause panic if incorrect medical text is generated. We used a variety of methods to detect medical texts generated by ChatGPT to reduce the potential risks to society caused by improper or malicious use of language models. First, we divided the medical abstract and radiology report data sets into a training set, test set, and validation set at a ratio of 7:2:1, respectively. Then, we used a variety of algorithms to train the model with the training set, selected the best model parameters through the validation set, and finally calculated the metrics using the test set. The following models were used:
• Perplexity-classification (Perplexity-CLS): As text written by humans usually has higher text perplexity than that generated by ChatGPT, an intuitive idea was to find an optimal perplexity threshold to detect medical text generated by ChatGPT. This idea is the same as GPTZero [43], but our data is medical-related text, so we used BioGPT [29] as a language model to calculate text perplexity. We found the optimal perplexity threshold of the validation set and calculated the metrics on the test set.
• Classification and Regression Trees (CART): CART is a classic decision tree algorithm that uses the Gini index as the measure of feature division. We vectorized the samples through term frequency-inverse document frequency, and for convenience of visualization, we set the maximum depth of the tree to 4.
• XGBoost [44]: XGBoost is an ensemble learning method, and we set the maximum depth for base learners as 4 and vectorized the samples by term frequency-inverse document frequency.
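To make the Gini criterion mentioned for CART concrete, the following sketch computes the Gini impurity of a label set and the weighted impurity after a split. It is a simplification under stated assumptions: the split is on the presence or absence of a word rather than on a term frequency-inverse document frequency threshold, and the names `gini` and `split_gini` are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_gini(feature_present, labels):
    """Weighted Gini impurity after splitting on a binary feature."""
    left = [y for f, y in zip(feature_present, labels) if f]
    right = [y for f, y in zip(feature_present, labels) if not f]
    n = len(labels)
    return sum(len(part) / n * gini(part) for part in (left, right) if part)

# Toy data: 1 = ChatGPT-generated, 0 = human-written.
labels = [1, 1, 0, 0]
has_word = [True, True, False, False]      # feature that separates the classes
noisy_word = [True, False, True, False]    # feature that carries no signal

before = gini(labels)
after_good = split_gini(has_word, labels)
after_noisy = split_gini(noisy_word, labels)
```

A tree learner prefers the feature that lowers the weighted impurity the most, which in this toy case is `has_word`.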
In addition, we analyzed the CART, XGBoost, and BERT models to explore which features of the text help to detect text generated by ChatGPT.
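The threshold search behind Perplexity-CLS can be sketched as follows. This is a minimal illustration, not the study's implementation: it assumes accuracy as the selection criterion, scans midpoints between sorted validation perplexities, and uses made-up perplexity values; `accuracy_at` and `best_threshold` are hypothetical names.

```python
def accuracy_at(threshold, perplexities, labels):
    """Predict ChatGPT (1) when perplexity is below the threshold, else human (0)."""
    correct = sum(
        (1 if ppl < threshold else 0) == y
        for ppl, y in zip(perplexities, labels)
    )
    return correct / len(labels)

def best_threshold(perplexities, labels):
    """Scan midpoints between sorted validation perplexities; keep the most accurate."""
    values = sorted(set(perplexities))
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return max(candidates, key=lambda t: accuracy_at(t, perplexities, labels))

# Made-up validation data: generated text (label 1) tends to have lower perplexity.
val_ppl = [8.0, 9.5, 11.0, 18.0, 22.0, 30.0]
val_y = [1, 1, 1, 0, 0, 0]
threshold = best_threshold(val_ppl, val_y)

# The threshold chosen on the validation set is then applied unchanged to the test set.
test_ppl = [7.0, 25.0, 12.0, 16.0]
test_y = [1, 0, 1, 0]
test_acc = accuracy_at(threshold, test_ppl, test_y)
```

Selecting the threshold on the validation split and reporting on the test split mirrors the 7:2:1 protocol described above and avoids tuning on the data used for evaluation.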

Ethical Considerations and Data Usage
In this study, we evaluated the proposed method on two medical datasets: medical abstracts describing patients' conditions and radiology reports from the MIMIC-III dataset. Both datasets are extracted from publicly available sources. According to Johnson et al [27], the free texts (including radiology reports) in the MIMIC-III dataset have been deidentified in accordance with Health Insurance Portability and Accountability Act (HIPAA) standards, using an existing, rigorously evaluated system [46]. Using publicly available and fully deidentified data for research purposes aligns with the waiver of human subjects protection issued by the Department of Health and Human Services (45 CFR 46.104) [47], which states that studies utilizing publicly available, anonymized data may not require formal ethics approval. The Institutional Review Board of Mass General Brigham does not require review for research exempted under 45 CFR 46.104 [48]. The datasets collected were used strictly for research purposes within this work, focusing on method development and validation without compromising individual privacy. In conclusion, this research adheres to the ethical guidelines and policies set forth by the Institutional Review Board of Mass General Brigham, ensuring that all data usage is responsible, respectful of privacy, and within the bounds of academic research.

Linguistic Analysis
We conducted linguistic analysis of 2200 human-written samples and 8800 ChatGPT-generated samples from the medical abstract and radiology report data sets.

Vocabulary and Sentence Analysis
As shown in Table 1, from the perspective of statistical characteristics, the main differences between human-written medical text and medical text generated by ChatGPT involved the vocabulary and stem. Human-written medical text vocabulary size and the number of stems were significantly larger than those of ChatGPT-generated medical text. This suggests that the content and expression of medical texts written by humans are more diverse, which is more in line with the actual patient situation, while texts generated by ChatGPT are more inclined to use commonly used words to express common situations.

Part-of-Speech Analysis
The results of POS analysis are shown in Table 2. ChatGPT used more words from the following categories: noun, singular or mass; determiner; noun, plural; and coordinating conjunction. ChatGPT used fewer cardinal digits and adverbs.
Frequent use of nouns (singular or mass and plural) tends to indicate that the text is more argumentative, showing information and objectivity [49]. The high proportion of coordinating conjunctions and determiners in ChatGPT-generated text indicated that the structure of the medical text and the relationship between causality, progression, or contrast was clear. At the same time, a large number of cardinal digits and adverbs appeared in medical texts written by humans, indicating that the expressions were more specific rather than general. For example, doctors will use specific numbers to describe the size of tumors.
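The POS comparison amounts to computing the share of each tag in the two corpora. The sketch below shows that calculation on hand-tagged toy tokens with Penn Treebank tags (NN, NNS, DT, CC, CD, RB); the tokens and the function name `pos_proportions` are illustrative assumptions, and no tagger is run here.

```python
from collections import Counter

def pos_proportions(tagged_tokens):
    """Return the share of each POS tag among all tagged tokens."""
    counts = Counter(tag for _, tag in tagged_tokens)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}

# Hand-tagged toy examples, not output from the data sets or from a tagger.
human_like = [
    ("mass", "NN"), ("measures", "VBZ"), ("approximately", "RB"),
    ("3", "CD"), ("cm", "NN"),
]
chatgpt_like = [
    ("the", "DT"), ("findings", "NNS"), ("and", "CC"),
    ("the", "DT"), ("results", "NNS"),
]

human_props = pos_proportions(human_like)
chatgpt_props = pos_proportions(chatgpt_like)
```

In a full analysis, the tagged tokens would come from a POS tagger applied to every sample, and the proportions for the two sources would be compared tag by tag.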

Dependency Parsing
The results of dependency parsing are shown in Table 3 and Table 4. As shown in Table 3, the comparison of dependencies exhibited similar characteristics to the POS analysis, where ChatGPT used more determiner, conjunct, coordination, and direct object relations while using fewer numeric modifiers and adverbial modifiers. For dependency distance, ChatGPT had obviously shorter conjuncts, coordinations, and nominal subjects, which made the text generated by ChatGPT more logical and fluent.
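Dependency distance can be illustrated with a short sketch. It assumes the common definition (the absolute difference between the positions of a head word and its dependent), which the text above does not spell out; the arcs are written by hand rather than produced by a parser, and `mean_distance_by_relation` is a hypothetical helper.

```python
from collections import defaultdict

def mean_distance_by_relation(arcs):
    """Average |head position - dependent position| for each dependency relation.

    `arcs` is a list of (relation, head_index, dependent_index) with 1-based
    token positions; arcs from the artificial root (head_index 0) are skipped.
    """
    totals = defaultdict(lambda: [0, 0])  # relation -> [sum of distances, count]
    for relation, head, dependent in arcs:
        if head == 0:
            continue
        totals[relation][0] += abs(head - dependent)
        totals[relation][1] += 1
    return {rel: s / n for rel, (s, n) in totals.items()}

# "The lesion and the nodule enlarged": hand-written arcs, not parser output.
arcs = [
    ("det", 2, 1),    # lesion -> The
    ("nsubj", 6, 2),  # enlarged -> lesion
    ("cc", 5, 3),     # nodule -> and
    ("det", 5, 4),    # nodule -> the
    ("conj", 2, 5),   # lesion -> nodule
    ("root", 0, 6),   # ROOT -> enlarged
]
distances = mean_distance_by_relation(arcs)
```

Averaging these per-relation distances over all sentences from each source gives values that can be compared between human-written and ChatGPT-generated text.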

Sentiment Analysis
The results of sentiment analysis are shown in Table 5. Most of the medical texts written by humans or those generated by ChatGPT had neutral sentiments. It should be noted that the proportion of negative sentiments in text written by humans was significantly higher than that in text generated by ChatGPT, while the proportion of positive sentiments in text written by humans was significantly lower than that in text generated by ChatGPT. This may be because ChatGPT has added a special mechanism to carefully filter the original training data set to ensure any violent or sexual content is removed, making the generated text more neutral or positive.

Text Perplexity
The results of text perplexity are shown in Figure 2. It can be observed that for both medical abstract and radiology report data sets, the perplexity of text generated by ChatGPT was significantly lower than that of text written by humans. ChatGPT captures common patterns and structures in the training corpus and is very good at replicating them. Therefore, the text generated by ChatGPT has relatively low perplexity. Humans can express themselves in a variety of ways, depending on the intellectual context, the condition of the patient, and other factors, which may make their text more difficult for BioGPT to predict. Therefore, human-written text had a higher perplexity and wider distribution.
Through the above analysis, we identified the main differences between the human-written and ChatGPT-generated medical text as the following: (1) medical texts written by humans were more diverse, while medical texts generated by ChatGPT were more generic; (2) medical texts generated by ChatGPT had better logic and fluency; (3) medical texts written by humans contained more specific values, and text content was more specific; (4) medical texts generated by ChatGPT were more neutral and positive; and (5) ChatGPT had lower text perplexity because it is good at replicating common expression patterns and sentence structures.

Detecting ChatGPT-Generated Text
The results of detecting ChatGPT-generated medical text are shown in Table 6. The results shown in Table 6 are the average of the accuracy across the 4 groups. Compared with similar works [14,39] for detecting ChatGPT-generated content, our detection performance showed much higher accuracy. Since Perplexity-CLS is an unsupervised learning method, it was less effective than other methods. XGBoost integrates the results of multiple decision trees, so it worked better than CART with a single decision tree. The pretrained BERT model easily recognized differences in the logical structure and language style of medical texts written by humans and those generated by ChatGPT, thus achieving the best performance. Comparing Figure 3 and Table 7, we can see that the decision tree nodes are similar. For example, in the medical abstract data set, "further," "outcomes," "highlights," and "aimed" are important features of the CART and XGBoost models.
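For readers who want to reproduce this kind of summary, the sketch below computes an F1 score from predictions and formats a "mean (SD)" entry over 4 groups. The labels and per-group scores are made up for illustration and are not the values in Table 6; the use of the sample standard deviation is an assumption, and `f1_score` here is a hand-written helper rather than a library call.

```python
from statistics import mean, stdev

def f1_score(y_true, y_pred, positive=1):
    """F1 = harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy predictions: 1 = ChatGPT-generated, 0 = human-written.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
example_f1 = f1_score(y_true, y_pred)

# Made-up per-group scores, only to show the "mean (SD)" summary format.
group_scores = [0.90, 0.92, 0.94, 0.96]
summary = f"{mean(group_scores):.3f} ({stdev(group_scores):.3f})"
```

Reporting the spread alongside the mean shows how sensitive each detector is to the choice of prompt and to the randomness of generation.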
In addition to visualizing the global features of CART and XGBoost, we also used the transformers-interpret toolkit [50] to visualize the local features of the samples, and the results are shown in Figure 4. For BERT, conjuncts were important features for detecting ChatGPT-generated text (eg, "due to," "therefore," and "or"). In addition, the important features of BERT were similar to those of XGBoost. For example, "evidence," "findings," and "acute" were important features in the radiology report data set for detecting medical text generated by ChatGPT.

Principal Results
In this paper, we focused on analyzing the differences between medical texts written by humans and those generated by ChatGPT and designed machine learning algorithms to detect medical texts generated by ChatGPT. The results showed that medical texts generated by ChatGPT were more fluent and logical but had low information content. In contrast, medical texts written by humans were more diverse and specific. Such differences make the two types of text potentially distinguishable.
ChatGPT simply imitates human language and uses general information content, which makes it challenging to generate text on personalized treatment and conditions with high intersubject heterogeneity. Such an issue may potentially lead to decreased patient care quality throughout the whole clinical workflow. For the purpose of medical education, AIGC has led to much awareness and concerns over its possible misuse. Students and trainees could use ChatGPT for assignments and exams. In addition, using such tools can hinder the students' learning process, especially at the current stage, where curriculum design has not been updated accordingly [51]. Finally, as more patients rely on internet searches to seek medical advice, it is important to mark AIGC, especially that related to medicine, with "Generated by AIGC" labels. By doing so, we can further deal with potential issues in ChatGPT-generated text caused by system-wide errors and algorithm biases, such as the "hallucination effect" of generative modeling and outdated information sources.
In order to mitigate and control the potential harm caused by medical AIGC, we developed algorithms to identify content generated by ChatGPT. Although ChatGPT can generate human-like text, due to the differences in language style and content, the text written by ChatGPT can still be accurately detected by machine learning algorithms, with an F1 score exceeding 95%. This study provides a pathway toward trustworthy and accountable use of LLMs in medicine.

Limitations
This paper is dedicated to analyzing the differences between medical texts written by humans and those generated by ChatGPT. We developed various machine-learning algorithms to distinguish the two. However, our work has some limitations. First, this paper only analyzes medical abstracts and radiology reports; however, there exist various other types of medical texts, and these 2 types of medical texts are just examples. Second, ChatGPT is a model that can handle multiple languages, but the data sets we used were only in English. Additionally, we only used ChatGPT as an example to analyze the difference between medical texts generated by an LLM and medical texts written by humans; however, more advanced LLMs, such as GPT-4 and other open-source models, have emerged. It will be part of our future work to analyze more language styles generated by other LLMs and summarize their general language construction rules.

Conclusions
In general, for artificial intelligence (AI) to realize its full potential in medicine, we should not rush into its implementation but advocate for its careful introduction and open debate about its risks and benefits. First, human medical writers will be responsible for ensuring the accuracy and completeness of the information communicated and for complying with ethical and regulatory guidelines. However, ChatGPT cannot be held responsible. Second, training an LLM requires a huge amount of data, but the quality of the data is difficult to guarantee, so the trained ChatGPT is biased. For example, ChatGPT can provide biased output and perpetuate sexist stereotypes [52]. Third, use of ChatGPT may lead to private information leakage. This may be because the LLM remembers personal privacy information in the training set [53]. What is more, the legal framework must be considered. Who shall be held accountable when an AI doctor makes an inevitable mistake? ChatGPT cannot be held accountable for its work, and there is no legal framework to determine who owns the rights to AI-generated work [15].

The medical field is a field related to human health and life. We provided a simple demonstration to identify ChatGPT-generated medical content, which can help reduce the harm caused to humans by erroneous and incomplete ChatGPT-generated information. Assessing and mitigating the risks associated with LLMs and their potential harm is a complex and interdisciplinary challenge that requires combining knowledge from various fields to drive the healthy development of LLMs.

Figure 1. Prompts for building the ChatGPT-generated medical abstract and radiology report data sets.

Figure 2. Text perplexity of human-written and ChatGPT-generated (A) medical abstracts and (B) radiology reports.
c BERT: bidirectional encoder representations from transformers.

Figure 3 presents the visualization of the CART model of the 2 data sets. Through the decision tree with depth 4, the text generated by ChatGPT was detected well. We calculated the contribution of each feature of the XGBoost model, and the top 15 most important features are shown in Tables 7 and 8.

Figure 3. Visualization of the CART model for the (A) medical abstracts and (B) radiology reports data sets. CART: classification and regression trees.

Figure 4. Visualization of the features of the samples for the (A) medical abstracts and (B) radiology reports data sets using BERT. BERT: bidirectional encoder representations from transformers.

Table 1. Vocabulary and sentence analysis of human- and ChatGPT-generated text in the medical abstract and radiology report data sets.
Total number of unique word stems across all samples.

Table 4. Top 20 dependency distances comparison between human-written and ChatGPT-generated text in the medical abstract and radiology report data sets.
a Not in the top 20 dependency distances.

Table 5. Sentiment comparison between human-written and ChatGPT-generated text in the medical abstract and radiology report data sets.

Table 6. Results of detecting ChatGPT-generated medical text in the medical abstract and radiology data sets.
b CART: classification and regression trees.

Table 7. Important features of the medical abstract data set.

Table 8. Important features of the radiology reports data set.