
A controlled trial examining large language model conformity in psychiatric assessment using the Asch paradigm

Abstract

Background

Despite significant advances in AI-driven medical diagnostics, the integration of large language models (LLMs) into psychiatric practice presents unique challenges. While LLMs demonstrate high accuracy in controlled settings, their performance in collaborative clinical environments remains unclear. This study examined whether LLMs exhibit conformity behavior under social pressure across different diagnostic certainty levels, with a particular focus on psychiatric assessment.

Methods

Using an adapted Asch paradigm, we conducted a controlled trial examining GPT-4o’s performance across three domains representing increasing levels of diagnostic uncertainty: circle similarity judgments (high certainty), brain tumor identification (intermediate certainty), and psychiatric assessment using children’s drawings (high uncertainty). The study employed a 3 × 3 factorial design with three pressure conditions: no pressure, full pressure (five consecutive incorrect peer responses), and partial pressure (mixed correct and incorrect peer responses). We conducted 10 trials per condition combination (90 total observations), using standardized prompts and multiple-choice responses. Binomial tests assessed performance against chance, and chi-square analyses compared accuracy across pressure conditions.

Results

Under no pressure, GPT-4o achieved 100% accuracy across all domains. Under full pressure, accuracy declined systematically with increasing diagnostic uncertainty: 50% in circle recognition, 40% in tumor identification, and 0% in psychiatric assessment. Partial pressure showed a similar pattern, with maintained accuracy in basic tasks (80% in circle recognition, 100% in tumor identification) but complete failure in psychiatric assessment (0%). All differences between no pressure and pressure conditions were statistically significant (P <.05), with the most severe effects observed in psychiatric assessment (χ²₁=16.20, P <.001).

Conclusions

This study reveals that LLMs exhibit conformity patterns that intensify with diagnostic uncertainty, culminating in complete performance failure in psychiatric assessment under social pressure. These findings suggest that successful implementation of AI in psychiatry requires careful consideration of social dynamics and the inherent uncertainty in psychiatric diagnosis. Future research should validate these findings across different AI systems and diagnostic tools while developing strategies to maintain AI independence in clinical settings.

Trial registration

Not applicable.


Introduction

Despite significant advances in medical diagnostics, diagnostic errors remain a persistent challenge in healthcare, particularly in psychiatry, where the inherent complexity and subjective nature of assessment create unique vulnerabilities [1,2,3,4]. Studies consistently show that psychiatric diagnoses face substantial reliability challenges, with considerable variability among clinicians [4,5,6]. This diagnostic uncertainty not only affects human clinicians but also poses unique challenges for AI systems, which must navigate the field’s inherent ambiguity while being trained on datasets that reflect its diagnostic inconsistencies and contradictions.

While various interventions have been developed to improve diagnostic accuracy, including structured assessment tools and clinical decision support systems [7, 8], the integration of artificial intelligence (AI) into psychiatric practice presents both unprecedented opportunities and distinct challenges [9]. The promise of AI to enhance psychiatric diagnosis is particularly compelling given the field’s ongoing struggle with diagnostic reliability and the recognized impact of cognitive biases on clinical judgment [10, 11].

Large language models (LLMs) have demonstrated promising clinical application capabilities, including diagnostic reasoning, patient interviewing, and therapeutic dialogue [9,10,11,12]. However, integrating these models also presents significant challenges, including algorithmic biases and ethical considerations that require careful examination. Psychiatric diagnosis presents unique challenges for AI systems compared to other medical domains. While conditions like tumors or fractures have clear physical manifestations, psychiatric disorders often present with overlapping symptoms, complex contextual factors, and significant individual variations [13,14,15,16,17]. This inherent complexity has been reflected in recent clinical trials: despite impressive performance in controlled settings [18], LLM integration into diagnostic practice has shown unexpected limitations [19]. This discrepancy between standalone and collaborative performance suggests that the success of AI in diagnostic practice may be particularly sensitive to implementation contexts and social dynamics.

Diagnostic certainty in medical decision-making lies on a continuum ranging from tasks with clear, objective criteria to those requiring complex clinical judgment [20, 21]. At one end, visual perception tasks, like identifying basic shapes or patterns, demonstrate high inter-rater reliability and clear ground truth. Moving along this continuum, radiological diagnoses, while complex, still rely on physical manifestations that can be systematically categorized and validated through objective measures [e.g., 22, 23]. Psychiatric diagnosis, however, sits at the far end of this spectrum, characterized by inherent uncertainty and diagnostic complexity which stems from multiple factors: symptom overlap between different disorders, the dynamic nature of mental states, and the crucial role of contextual factors in interpretation [24,25,26]. Studies have consistently shown that even experienced psychiatrists frequently disagree on diagnoses, with inter-rater reliability varying significantly across different psychiatric conditions [27,28,29]. This diagnostic uncertainty not only affects human clinicians but also poses unique challenges for AI systems, which must navigate this inherent ambiguity while being trained on datasets that reflect these underlying uncertainties [30, 31].

One potentially crucial factor in these interactions is social conformity, a phenomenon first studied systematically by Solomon Asch [32]. In medical contexts, it has been well documented that conformity effects and hierarchical influences impact diagnostic accuracy and patient safety [33,34,35,36,37,38]. The current study examines whether LLMs exhibit varying degrees of conformity behavior across the diagnostic certainty spectrum, with particular focus on psychiatric assessment. Using Asch’s experimental paradigm, we investigate AI decision-making under social pressure in three distinct domains: basic visual perception, radiological assessment, and psychiatric assessment. We propose two hypotheses. First, we hypothesize that LLMs will demonstrate conformity behavior by modifying their responses under social pressure, even when this contradicts their initial accurate assessment. Second, we hypothesize that this conformity effect will systematically increase as diagnostic certainty decreases, with psychiatric assessments showing the highest vulnerability to social influence. These hypotheses build on established findings that human decision-makers are more susceptible to social influence under conditions of uncertainty [39, 40] and that AI systems may inherit and potentially amplify these behavioral patterns [41,42,43].

Understanding these patterns is crucial for developing implementation guidelines for AI-driven diagnostic tools in psychiatric practice, particularly given the need to balance the potential benefits of AI assistance against the risks of automated decision-making in complex clinical scenarios where independent clinical judgment is essential.

Methods

Study model

This study utilized GPT-4o (by OpenAI), which was selected from available LLMs at the time of data collection (September 2024) as it demonstrated superior visual processing capabilities. The model’s advanced visual perception abilities were essential for this study which required interpretation of geometric shapes, medical images, and clinical drawings. All interactions were conducted through the standard chat interface, maintaining default settings without modifications to parameters such as temperature or top-k. The model’s responses were limited to multiple-choice selections to ensure trial standardization.

Study design

We adapted Asch’s (1951) conformity experiment paradigm [32] to examine peer pressure effects in AI. The study employed a 3 × 3 factorial design with three task domains representing increasing levels of diagnostic uncertainty (circle similarity judgments, high certainty; brain tumor identification, intermediate certainty; psychiatric assessment, high uncertainty) and three peer pressure conditions (no pressure, full pressure, and partial pressure).

Tasks progressed from basic visual perception to complex clinical assessments. Each task presented three response options with predetermined correct answers based on expert consensus. Peer pressure conditions varied systematically: baseline (no pressure) presented direct multiple-choice questions; full pressure included five consecutive incorrect peer responses; and partial pressure combined correct and incorrect peer responses in randomized order.

We conducted each trial in an independent chat session to prevent carryover effects. The study included 10 trials per task–condition combination (90 total observations). All trials used standardized prompts, with peer responses presented sequentially in pressure conditions.
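To make the factorial structure concrete, the following minimal Python sketch enumerates the resulting trial schedule. The domain and condition labels mirror the text; the variable names and dictionary layout are illustrative assumptions, not the authors’ actual tooling.

```python
from itertools import product

# Three task domains ordered by increasing diagnostic uncertainty,
# crossed with three peer pressure conditions (3 x 3 design).
DOMAINS = ["circle_similarity", "brain_tumor", "psychiatric_drawing"]
CONDITIONS = ["no_pressure", "full_pressure", "partial_pressure"]
TRIALS_PER_CELL = 10

# Each trial runs in its own independent chat session to prevent
# carryover effects, as described above.
schedule = [
    {"domain": d, "condition": c, "trial": t, "fresh_session": True}
    for d, c in product(DOMAINS, CONDITIONS)
    for t in range(1, TRIALS_PER_CELL + 1)
]

assert len(schedule) == 90  # 10 trials x 9 cells = 90 total observations
```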

Ethics

This study was approved by the institutional review board of Max Stern Yezreel Valley College (approval number YVC EMEK 2023-77). The study protocol complied with all relevant institutional guidelines. As the study analyzed only LLM outputs without human participants, informed consent was not required.

Materials

The study protocol included three assessment tasks and standardized prompting sequences. All responses were restricted to single letter selections (A, B, or C) to ensure standardization across conditions and tasks.

Assessment tasks

Three tasks assessed different domains representing increasing levels of diagnostic uncertainty:

  1. Basic visual perception: a circle discrimination task requiring identification of the circle most similar to a target circle from three options (high certainty, baseline).

  2. Medical image analysis: a diagnostic task involving brain tumor identification from medical imaging with three options: meningeal tumor, renal cell carcinoma, or no pathology (intermediate certainty). Images were sourced from validated medical imaging datasets [44].

  3. Psychiatric assessment: analysis of children’s house drawings from the house-tree-person (HTP) projective test for emotional state indicators, based on established assessment protocols [45], with options including depression, ADHD, or typical development (high uncertainty).

Prompting sequences

Each task used standardized prompting sequences to maintain consistency across trials while systematically varying social pressure conditions. Table 1 presents the complete prompting sequences used for each task domain and condition. These prompts were designed to precisely control the presentation of peer responses while maintaining ecological validity through domain-appropriate professional contexts (participants for basic visual tasks, doctors for medical imaging, and psychiatrists for psychiatric assessment).

The implementation followed strict procedural guidelines for each trial, as detailed in Table 2. All visual stimuli were presented before the prompting sequence, and each trial was conducted in an independent chat session to prevent carryover effects. In pressure conditions, responses were presented sequentially with mandatory confirmation between each response to ensure proper social information processing.
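Because the verbatim prompts live in Table 1, the sketch below only illustrates the general shape of the pressure manipulation: full pressure prepends five consecutive incorrect peer answers, partial pressure mixes correct and incorrect answers in randomized order, and each peer turn is delivered as a separate message before the final single-letter question. The peer roles follow the Materials section; the three-incorrect/two-correct split in the partial condition and all function and variable names are illustrative assumptions.

```python
import random

# Domain-appropriate professional context for the simulated peers
# (participants, doctors, psychiatrists), per the Materials section.
PEER_ROLE = {
    "circle_similarity": "Participant",
    "brain_tumor": "Doctor",
    "psychiatric_drawing": "Psychiatrist",
}

def build_peer_turns(domain, condition, correct, incorrect, n_peers=5):
    """Return the simulated peer answers for one trial (illustrative)."""
    role = PEER_ROLE[domain]
    if condition == "no_pressure":
        answers = []                                # direct question, no peers
    elif condition == "full_pressure":
        answers = [incorrect] * n_peers             # five consecutive wrong answers
    elif condition == "partial_pressure":
        answers = [incorrect] * 3 + [correct] * 2   # assumed mix; order randomized
        random.shuffle(answers)
    else:
        raise ValueError(f"unknown condition: {condition}")
    # Each turn is sent sequentially, with a confirmation requested
    # between turns (see Table 2), before the final question.
    return [f"{role} {i + 1} answers: {a}" for i, a in enumerate(answers)]

# Example: a full-pressure psychiatric trial where "C" would be correct.
turns = build_peer_turns("psychiatric_drawing", "full_pressure",
                         correct="C", incorrect="A")
```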

Data analysis

We used binomial tests to assess whether GPT-4o’s performance under each pressure condition differed from chance, verifying in particular that its accuracy without social influence was not attributable to random responding. A chi-square test was performed to assess the significance of differences in accuracy between the no pressure, full pressure, and partial pressure conditions, with particular attention to how these effects varied across the diagnostic certainty spectrum. A p-value of less than 0.05 was considered statistically significant. The statistical analyses were performed using SPSS Statistics version 28.
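The authors ran these analyses in SPSS; for readers who want to check the arithmetic, the following scipy sketch reproduces the reported chi-square statistics under the assumption that each comparison is a 2 × 2 table of correct/incorrect counts analyzed with the Yates continuity correction (scipy’s default for 2 × 2 tables).

```python
from scipy.stats import binomtest, chi2_contingency

# Binomial test against the 1-in-3 chance level: 10/10 correct under
# no pressure is far beyond chance (p ~ 1.7e-5, i.e., P < .001).
print(binomtest(10, n=10, p=1/3, alternative="greater").pvalue)

# Chi-square tests on 2 x 2 tables of [correct, incorrect] counts.
# With the default Yates correction these match the Results section.
comparisons = {
    "circles: no pressure vs full": [[10, 0], [5, 5]],   # chi2 = 4.27, p = .039
    "tumors: no pressure vs full":  [[10, 0], [4, 6]],   # chi2 = 5.95, p = .015
    "psych: no pressure vs full":   [[10, 0], [0, 10]],  # chi2 = 16.20, p < .001
}
for label, table in comparisons.items():
    chi2, p, dof, _ = chi2_contingency(table)
    print(f"{label}: chi2({dof}) = {chi2:.2f}, p = {p:.3f}")
```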

Table 1 Complete prompting sequences
Table 2 Implementation procedures

Results

High certainty domain: circle recognition

Under no pressure, GPT-4o achieved 100% accuracy (10/10 correct responses). Performance significantly exceeded chance (binomial test, P <.001; expected chance performance: 3.33 ± 1.49 correct responses). Under full peer pressure, accuracy decreased to 50% (5/10 correct responses; P =.136 vs. chance). With partial peer pressure, accuracy was 80% (8/10 correct responses; P =.030 vs. chance). Accuracy differed significantly between no pressure and full pressure conditions (χ²₁=4.27, P =.039) but not between no pressure and partial pressure conditions (χ²₁=0.56, P =.46).

Intermediate certainty domain: brain tumor identification

Under no pressure, GPT-4o achieved 100% accuracy (10/10 correct responses; P =.001 vs. chance). Accuracy decreased to 40% under full pressure (4/10 correct responses; P =.227 vs. chance) but remained at 100% under partial pressure (10/10 correct responses; P =.001 vs. chance). Performance differed significantly between no pressure and full pressure conditions (χ²₁=5.95, P =.015). No statistical comparison was necessary between no pressure and partial pressure conditions due to identical performance.

High uncertainty domain: psychiatric assessment

Under no pressure, GPT-4o demonstrated 100% accuracy (10/10 correct responses; P =.001 vs. chance). Under both full and partial pressure conditions, accuracy dropped to 0% (0/10 correct responses; P =.017 vs. chance for both conditions). Performance differences were highly significant between no pressure and both full pressure conditions (χ²₁=16.20, P <.001) and partial pressure conditions (χ²₁=16.20, P <.001).

Summary of key findings

GPT-4o exhibited distinct performance patterns that varied systematically with diagnostic certainty (Fig. 1). Under no pressure, the system achieved perfect accuracy (100%) across all three domains. However, performance degradation under social pressure increased markedly with diagnostic uncertainty, with the most pronounced effects observed in psychiatric assessment.

The impact of peer pressure varied systematically across the diagnostic certainty spectrum (Fig. 2). Under full pressure, accuracy declined moderately in the high certainty task (50% in circle recognition), more substantially in the intermediate certainty task (40% in tumor identification), and completely in the high uncertainty psychiatric assessment (0%). Partial pressure showed a similar pattern: relatively high accuracy maintained in high certainty tasks (80% in circle recognition) and intermediate certainty tasks (100% in tumor identification) but complete failure in psychiatric assessment (0%).

All results under no pressure conditions were statistically non-random (P <.001), with chi-square tests confirming significant differences between no pressure and full pressure conditions across all domains (P <.05). The most severe conformity effects were observed in psychiatric assessment, where both full and partial pressure led to complete deterioration of performance (P <.001).

Fig. 1 Overall system performance under different pressure conditions. Pie charts display the aggregate percentage of correct and incorrect responses across all three clinical domains (N = 30 per pressure condition). Performance declined from 100% accuracy under no pressure to 30% under full pressure, with partial pressure showing intermediate effects. Colors represent correct (green) and incorrect (orange) responses.

Fig. 2 Performance accuracy of GPT-4o across three clinical assessment domains under different peer pressure conditions. The graph shows the number of correct and incorrect responses (N = 10 per condition) in circle recognition, brain tumor identification, and psychiatric assessment tasks under no, full, and partial pressure conditions. The decreased accuracy under pressure conditions was most pronounced in psychiatric assessment, where full and partial pressure resulted in complete failure (0% accuracy).

Discussion

This study examined the influence of social pressure on AI diagnostic accuracy across three domains. While the system achieved 100% accuracy under neutral conditions in all domains, exposure to full social pressure reduced accuracy to 50% in circle recognition, 40% in tumor identification, and 0% (complete failure) in psychiatric assessment. This pattern persisted under partial social pressure, where performance remained high in the basic tasks (80–100%) but psychiatric assessment still failed completely.

Analysis of these findings reveals two significant patterns: first, AI systems’ tendency toward conformity, and second, the intensification of this conformity in psychiatric assessment. Regarding the first pattern, AI systems demonstrate substantial sensitivity to social pressure in medical decision-making, exhibiting behavioral patterns comparable to those documented in human medical teams [46,47,48,49,50,51]. However, the underlying mechanisms driving conformity in AI may differ from those in human teams and warrant further investigation. This expands our understanding of the challenges of integrating these systems into clinical settings. While previous research has focused on evaluating AI’s diagnostic capabilities in laboratory conditions [18], our findings emphasize the need to understand how these systems function within the social context of a medical team, aligning with recent research indicating a significant gap between AI performance in laboratory conditions and in clinical work environments [19].

The phenomenon of conformity in AI systems raises significant challenges for their integration into medical teams. Previous research has shown that conformity patterns and hierarchical influences in medical teams are a central factor in medical errors [38, 46]. Adding an AI system with an inherent tendency toward conformity could amplify these patterns and weaken the team’s ability to reach independent, evidence-based decisions [38], highlighting the need for careful implementation and mitigation strategies. Particularly concerning is the system’s tendency to align with the majority opinion even when it contradicts its initial accurate assessment, a pattern reminiscent of how social pressure influences decision-making in medical teams [35, 37, 38].

The second pattern identified, namely, enhanced conformity in psychiatric assessment, reflects a structural challenge in integrating AI into mental health. While AI has shown promising diagnostic capabilities in certain psychiatric contexts and controlled settings [9, 17], the system’s complete failure under social pressure, specifically in psychiatric diagnosis, appears related to the unique nature of assessment in this field. Unlike other medical domains where diagnostic criteria often rely on biological markers or clear imaging findings, psychiatric diagnosis is characterized by higher levels of uncertainty and requires complex clinical judgment [1,2,3,4]. This uncertainty, particularly with subjective tools like projective tests, raises questions about the nature of diagnostic truth itself in such ambiguous cases, where consensus among experts might contribute to the construction of diagnostic reality [52]. This uncertainty, which challenges even human clinicians [5, 6], emerges as a significant vulnerability for AI systems.

The extreme sensitivity of AI systems to social pressure in psychiatric assessment may reflect deeper challenges in existing psychiatric diagnostic methods [7]. The significant variation in psychiatric diagnosis among clinicians [13,14,15] means that AI systems are trained on datasets containing inherent contradictions and inconsistencies. When multiple experts disagree on diagnoses for similar presentations, the very concept of “diagnostic ground truth” becomes fundamentally ambiguous [53,54,55]. This insight suggests the need to examine not only how we develop and implement AI systems in psychiatry but also the existing diagnostic frameworks in the field [55, 56].

Limitations

Several important limitations should be considered when interpreting our findings. First, our results are specific to GPT-4o and may not generalize to other AI architectures. It is also possible that the conformity patterns observed were influenced by the nature of GPT-4o’s vast training data, which may reflect human social dynamics and biases. Second, while our experimental design controlled for many variables, it may not fully capture the complexity of real-world clinical interactions, particularly the subtle social dynamics that emerge in teams. Third, our sample size, while sufficient for detecting large effects, may have limited our ability to identify more subtle conformity patterns.

Of particular note is our choice of diagnostic tools. While we selected commonly used tools, medical imaging for radiological tumor identification and house-tree-person (HTP) drawings for psychiatric assessment, these represent only a small subset of available diagnostic methods. Different assessment tools might yield different patterns of AI conformity. Therefore, the reliance on a limited scope of diagnostic tools restricts the generalizability of our findings, highlighting the need for further investigation with a broader range of methods.

Additionally, the pressure conditions were simulated through text-based interactions, which may differ substantially from the multifaceted social cues present in clinical settings. This limitation is particularly relevant given the complex nature of team dynamics in psychiatric practice. Furthermore, a key limitation is the absence of real-world clinical testing involving multidisciplinary teams, which would introduce additional social and contextual factors not fully captured in this simulated environment.

Future directions

Our findings emphasize the need for a comprehensive research agenda examining both AI conformity and its implications for psychiatric diagnosis across multiple assessment methods. Foremost, there is a need to validate our findings using a broader range of psychiatric diagnostic tools. While our study focused on HTP drawings, future research should examine AI conformity patterns across various assessment approaches, including structured clinical interviews, standardized psychological tests, and other projective techniques.

Building on this foundational validation work, research should investigate the mechanisms through which diagnostic uncertainty influences AI decision-making in psychiatric contexts. This investigation should span different types of uncertainty inherent in various psychiatric assessment methods, from interpretation of projective tests to analysis of structured clinical data.

Longitudinal studies are also needed to examine the interaction between AI systems and clinical teams under real conditions. These studies should systematically vary both the AI systems used and the diagnostic tools employed, providing a more comprehensive understanding of how different assessment methods might influence AI conformity patterns. Such research should examine not only technical performance metrics but also the impact on clinical decision-making processes and quality of care across different assessment contexts.

In addition, comparative studies across different psychiatric subspecialties, each employing its unique diagnostic tools and methods, could provide valuable insights into how domain-specific factors influence AI conformity. This broader perspective could help identify which combinations of AI systems and assessment tools are most resistant to social pressure effects and inform more robust implementation strategies.

Conclusions

This initial study demonstrates that AI systems may exhibit conformity patterns that could impact their effectiveness in clinical settings, particularly in psychiatry. While these findings raise important considerations for current implementation approaches, they should be interpreted as preliminary evidence that requires validation across different AI systems, diagnostic tools, and clinical contexts.

Our results suggest that successful implementation of AI in psychiatry will require careful attention to both the technical aspects of AI integration and the fundamental characteristics of psychiatric diagnosis. The observed pattern of increased conformity in psychiatric assessment, while noteworthy, needs to be validated using a wider range of psychiatric assessment tools and methods.

These preliminary findings invite further investigation into how AI systems interact with the inherent uncertainties of psychiatric diagnosis. The observed conformity patterns underscore the critical need for careful consideration of AI integration into clinical workflows. Beyond technical validation, successful implementation necessitates the development and adoption of structured frameworks and guidelines that explicitly address potential risks such as conformity effects. Exploring hybrid decision-making models, where AI serves as a calibrated support tool rather than an autonomous authority, and developing training protocols designed to promote independent AI judgment will be crucial steps in harnessing AI’s potential while mitigating the identified vulnerabilities and ensuring patient safety. Future research should explore whether similar patterns emerge across different diagnostic tools and clinical contexts while also examining potential strategies for maintaining AI system independence in collaborative clinical settings.

The path forward requires not only technological refinement but also careful consideration of how these tools can best serve the complex needs of psychiatric practice. This includes developing robust validation methods, understanding the limitations of current approaches, and ensuring that implementation strategies account for the unique challenges of psychiatric assessment.

Data availability

The brain tumor images used in this study are available in the dataset published by Prakash et al. (2023) [doi:10.1038/s41598-023-41576-6]. The children’s drawings used in this study are available in the dataset published by Wang et al. (2023) [doi:10.3390/e25091350]. The circle similarity judgment stimuli and the complete dataset of LLM responses from the current study are available from the corresponding author upon reasonable request.

References

  1. Hossain SQ. Complexity analysis of approaching clinical psychiatry with predictive analytics and neural networks. arXiv [preprint]. 2019. Available from: https://arxiv.org/abs/1905.12471v1

  2. Redmond P, Graber ML, Singh H, et al. Contributors to diagnostic error or delay in the acute care setting: a survey of clinical stakeholders. BMJ Qual Saf. 2022;31(8):591–600. https://doi.org/10.1136/bmjqs-2021-013809

  3. Croskerry P, Campbell SG, Petrie DA. The challenge of cognitive science for medical diagnosis. Cogn Res Princ Implic. 2023;8(13):1–14. https://doi.org/10.1186/s41235-022-00460-z

  4. Vally ZI, Khammissa RAG, Feller G, et al. Errors in clinical diagnosis: a narrative review. J Int Med Res. 2023;51(8):1–10. https://doi.org/10.1177/03000605231162798

  5. Young ME, Thomas A, Lubarsky S, et al. Mapping clinical reasoning literature across the health professions: a scoping review. BMC Med Educ. 2020;20(1):107. https://doi.org/10.1186/s12909-020-02012-9

  6. Daniel M, Rencic J, Durning SJ, et al. Clinical reasoning assessment methods: a scoping review and practical guidance. Acad Med. 2019;94(6):902–12. https://doi.org/10.1097/ACM.0000000000002618

  7. Ranji SR. Large language models—misdiagnosing diagnostic excellence? JAMA Netw Open. 2024;7(10):e2440901. https://doi.org/10.1001/jamanetworkopen.2024.40901

  8. Templin T, Perez MW, Sylvia S, et al. Addressing 6 challenges in generative AI for digital health: a scoping review. PLOS Digit Health. 2024;3(5):e0000503. https://doi.org/10.1371/journal.pdig.0000503

  9. Avula VC, Amalakanti S. Artificial intelligence in psychiatry, present trends, and challenges: an updated review. Arch Ment Health. 2024;25(1):85–90. https://doi.org/10.4103/amh.amh_167_23

  10. Tikhomirov L, Semmler C, McCradden M, et al. Medical artificial intelligence for clinicians: the lost cognitive perspective. Lancet Digit Health. 2024;6(8):e589–94. https://doi.org/10.1016/S2589-7500(24)00095-5

  11. Fellowes S. Establishing the accuracy of self-diagnosis in psychiatry. Philos Psychol. 2024. https://doi.org/10.1080/09515089.2024.2327823

  12. Močnik S, Smrke U, Mlakar I, et al. Beyond clinical observations: a scoping review of AI-detectable observable cues in borderline personality disorder. Front Psychiatry. 2024;15:1345916. https://doi.org/10.3389/fpsyt.2024.1345916

  13. Hallyburton A, St John W, Woods C. Diagnostic overshadowing: an evolutionary concept analysis on the misattribution of physical symptoms to pre-existing psychological illnesses. Int J Ment Health Nurs. 2022;31(3):561–73. https://doi.org/10.1111/inm.12961

  14. Andreassen OA, Hindley GF, Frei O, et al. New insights from the last decade of research in psychiatric genetics: discoveries, challenges and clinical implications. World Psychiatry. 2023;22(1):4–24. https://doi.org/10.1002/wps.21034

  15. Hofvander B, Rydén G, Czajkowski N, et al. Overlap of autism spectrum disorder and borderline personality disorder: a systematic review and meta-analysis. J Pers Disord. 2023;37(1):23–47. https://doi.org/10.1521/pedi.2023.37.1.23

  16. Garb HN. Race bias and gender bias in the diagnosis of psychological disorders. Clin Psychol Rev. 2021;90:102087. https://doi.org/10.1016/j.cpr.2021.102087

  17. Luo C, Lei Y, Zhu Q, et al. Deep learning applications for the classification of psychiatric disorders using neuroimaging data: systematic review and meta-analysis. Neurosci Biobehav Rev. 2021;127:657–69. https://doi.org/10.1016/j.neubiorev.2021.02.003

  18. Katz U, Cohen E, Shachar E, et al. GPT versus resident physicians: a benchmark based on official board scores. NEJM AI. 2024;1(5):1–8. https://doi.org/10.1056/AIdbp2300192

  19. Goh E, Gallo R, Hom J, et al. Large language model influence on diagnostic reasoning: a randomized clinical trial. JAMA Netw Open. 2024;7(10):e2440969. https://doi.org/10.1001/jamanetworkopen.2024.40969

  20. Huda AS. The medical model in mental health: an explanation and evaluation. Oxford: Oxford Academic; 2019. Chapter 8, Reliability of diagnosis; pp. 120–43.

  21. Pies RW. How objective are psychiatric diagnoses? (guess again). Psychiatry (Edgmont). 2007;4(10):18–22.

  22. Meng Z, Chen C, Zhu Y, et al. Diagnostic performance of the automated breast volume scanner: a systematic review of inter-rater reliability/agreement and meta-analysis of diagnostic accuracy for differentiating benign and malignant breast lesions. Eur Radiol. 2015;25(10):3638–47. https://doi.org/10.1007/s00330-0

  23. Vosoughi F, Menbari Oskouie I, Rahimdoost N, et al. Intrarater and inter-rater reliability of tibial plateau fracture classifications: systematic review and meta-analysis. JBJS Open Access. 2024;8(1):e2300181. https://doi.org/10.2106/JBJS.OA.23.00181

  24. Lane R. Expanding boundaries in psychiatry: uncertainty in the context of diagnosis-seeking and negotiation. Sociol Health Illn. 2020;42(1):69–83. https://doi.org/10.1111/1467-9566.13044

  25. Matuszak J, Piasecki M. Inter-rater reliability in psychiatric diagnosis. Psychiatric Times. 2012;29(10).

  26. Cooper JE. Diagnostic processes and the classification of mental disorders. Acta Psychiatr Scand. 1982;65(Suppl 295):16–23. https://doi.org/10.1111/j.1600-0447.1982.tb00889.x

  27. First MB, Spitzer RL, Gibbon M, et al. The structured clinical interview for DSM-III-R personality disorders (SCID-II). Part II: multi-site test-retest reliability study. J Pers Disord. 1995;9(2):92–104. https://doi.org/10.1521/pedi.1995.9.2.92

  28. Maffei C, Fossati A, Agostoni I, et al. Interrater reliability and internal consistency of the structured clinical interview for DSM-IV Axis II personality disorders (SCID-II), version 2.0. J Pers Disord. 1997;11(3):279–84. https://doi.org/10.1521/pedi.1997.11.3.279

  29. Lobbestael J, Leurgans M, Arntz A. Inter-rater reliability of the structured clinical interview for DSM-IV Axis I disorders (SCID I) and Axis II disorders (SCID II). Clin Psychol Psychother. 2011;18(1):75–9. https://doi.org/10.1002/cpp.693

  30. Eskandar K. Artificial intelligence in psychiatric diagnosis: challenges and opportunities in the era of machine learning. Debates Em Psiquiatria. 2024;14:1–16. https://doi.org/10.25118/2763-9037.2024.v14.1318

  31. Barzilay R, Israel N, Krivoy A, et al. Predicting affect classification in mental status examination using machine learning face action recognition system: a pilot study in schizophrenia patients. Front Psychiatry. 2019;10:288. https://doi.org/10.3389/fpsyt.2019.00288

  32. Asch SE. Effects of group pressure upon the modification and distortion of judgments. In: Allen RW, Porter LW, Angle HL, editors. Organizational influence processes. New York: Routledge; 2016. pp. 295–303.

  33. Stevens EL, Hulme A, Salmon PM. The impact of power on health care team performance and patient safety: a review of the literature. Ergonomics. 2021;64(8):1072–90. https://doi.org/10.1080/00140139.2021.1906454

  34. Essex R, Kennedy J, Miller D, et al. A scoping review exploring the impact and negotiation of hierarchy in healthcare organisations. Nurs Inq. 2023;30(4):e12571. https://doi.org/10.1111/nin.12571

  35. Sydor DT, Bould MD, Naik VN, et al. Challenging authority during a life-threatening crisis: the effect of operating theatre hierarchy. Br J Anaesth. 2013;110(3):463–71. https://doi.org/10.1093/bja/aes396

  36. McMaster E, Phillips C, Broughton N. Righting the wrongs of traditional medical hierarchy. Anaesthesia. 2015;70(11):1119–29. https://doi.org/10.1111/anae.13352

  37. Beament T, Mercer SJ. Speak up! Barriers to challenging erroneous decisions of seniors in anaesthesia. Anaesthesia. 2016;71(11):1332–40. https://doi.org/10.1111/anae.1354

  38. Calhoun AW, Boone MC, Porter MB, et al. Using simulation to address hierarchy-related errors in medical practice. Perm J. 2014;18(2):14–20. https://doi.org/10.7812/TPP/13-124

  39. Sheer J, Patel D, Johnson E, et al. Drivers and influence of social conformity on decision making in human-AI teams. SSRN. 2024. https://doi.org/10.2139/ssrn.4966041

  40. Orloff MA, Chung D, Gu X, et al. Social conformity is a heuristic when individual risky decision-making is disrupted. PLOS Comput Biol. 2024;20(12):e1012602. https://doi.org/10.1371/journal.pcbi.1012602

  41. Hadar-Shoval D, Asraf K, Shinan-Altman S, et al. Embedded values-like shape ethical reasoning of large language models on primary care ethical dilemmas. Heliyon. 2024;10:e38056. https://doi.org/10.1016/j.heliyon.2024.e38056

  42. Hadar-Shoval D, Asraf K, Mizrachi Y, et al. Assessing the alignment of large language models with human values for mental health integration: cross-sectional study using Schwartz’s theory of basic values. JMIR Ment Health. 2024;11:e55988. https://doi.org/10.2196/55988

  43. Hadar Souval D, Haber Y, Tal A, et al. Transforming perceptions: exploring the multifaceted potential of generative AI for people with cognitive disabilities. JMIR Neurotech. 2025;4:e64182. https://doi.org/10.2196/64182

  44. Prakash BV, Kannan AR, Santhiyakumari N, et al. Meningioma brain tumor detection and classification using hybrid CNN method and RIDGELET transform. Sci Rep. 2023;13:14522. https://doi.org/10.1038/s41598-023-41576-6

  45. Wang H, Zhang J, Huang Y, et al. FBANet: transfer learning for depression recognition using a feature-enhanced bi-level attention network. Entropy (Basel). 2023;25(9):1350. https://doi.org/10.3390/e25091350

  46. Schmutz J, Manser T. Do team processes really have an effect on clinical performance? A systematic literature review. Br J Anaesth. 2013;110(4):529–44. https://doi.org/10.1093/bja/aes513

  47. Künzle B, Zala-Mezö E, Kolbe M, et al. Substitutes for leadership in anaesthesia teams and their impact on leadership effectiveness. Eur J Work Organ Psychol. 2010;19(5):505–31. https://doi.org/10.1080/13594320902986170

  48. Pasarakonda S, Grote G, Schmutz JB, et al. A strategic core role perspective on team coordination: benefits of centralized leadership for managing task complexity in the operating room. Hum Factors. 2021;63(5):506–31. https://doi.org/10.1177/0018720820906041

  49. Ackerhans S, Huynh T, Kaiser C, et al. Exploring the role of professional identity in the implementation of clinical decision support systems—a narrative review. Implement Sci. 2024;19(1):11. https://doi.org/10.1186/s13012-024-01339-x

  50. Samhammer D, Roller R, Hummel P, et al. Nothing works without the doctor: physicians’ perception of clinical decision-making and artificial intelligence. Front Med. 2022;9:1016366. https://doi.org/10.3389/fmed.2022.1016366

  51. Liberati EG, Ruggiero F, Galuppo L, et al. What hinders the uptake of computerized decision support systems in hospitals? A qualitative study and framework for implementation. Implement Sci. 2017;12:113. https://doi.org/10.1186/s13012-017-0644-2

  52. Ophir Y. ADHD is not an illness and Ritalin is not a cure: a comprehensive rebuttal of the (alleged) scientific consensus. World Scientific; 2022. https://doi.org/10.1142/12752

  53. Yan WJ, Ruan QN, Jiang K. Challenges for artificial intelligence in recognizing mental disorders. Diagnostics (Basel). 2023;13(1):2. https://doi.org/10.3390/diagnostics13010002

  54. Urkin B, Parnas J, Raballo A, et al. Schizophrenia spectrum disorders: an empirical benchmark study of real-world diagnostic accuracy and reliability among leading international psychiatrists. Schizophr Bull Open. 2024;5(1):sgae012. https://doi.org/10.1093/schizbullopen/sgae012

  55. Rokham H, Pearlson G, Abrol A, et al. Addressing inaccurate nosology in mental health: a multi-label data cleansing approach for detecting label noise from structural magnetic resonance imaging data in mood and psychosis disorders. Biol Psychiatry Cogn Neurosci Neuroimaging. 2020;5(5):456–67. https://doi.org/10.1016/j.bpsc.2020.05.008

  56. Hyman SE. The diagnosis of mental disorders: the problem of reification. Annu Rev Clin Psychol. 2010;6:155–79. https://doi.org/10.1146/annurev.clinpsy.3.022806.091532


Acknowledgements

Not applicable.

Funding

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

Author information

Authors and Affiliations

Authors

Contributions

D.H.S., Z.E., K.G., and Y.H. conceptualized and designed the study. D.H.S. wrote the main manuscript text. Z.E., Y.H., and K.G. performed critical revisions and validation of the methodology. A.Y. assisted with data collection. D.P. performed the statistical analyses. K.A. conducted the advanced statistical analyses. All authors reviewed and approved the final manuscript.

Corresponding author

Correspondence to Dorit Hadar Shoval.

Ethics declarations

Ethics approval and consent to participate

This study was approved by the institutional review board of Max Stern Yezreel Valley College (approval number YVC EMEK 2023-77). The study protocol complied with all relevant institutional guidelines. As the study analyzed only LLM outputs without human participants, informed consent was not required.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.


About this article


Cite this article

Shoval, D.H., Gigi, K., Haber, Y. et al. A controlled trial examining large language model conformity in psychiatric assessment using the Asch paradigm. BMC Psychiatry 25, 478 (2025). https://doi.org/10.1186/s12888-025-06912-2
