GPTZero, an AI detection startup founded by Edward Tian, has published attention-grabbing findings on these citation inaccuracies. The study focused on submissions to the highly selective Conference on Neural Information Processing Systems (NeurIPS), held in San Diego last month. After scanning all 4,841 accepted papers, GPTZero found a total of 100 hallucinated citations spread across 51 of them.
Follow-up investigation confirmed that the flagged citations were indeed fabricated. The results raise troubling questions about the trustworthiness of large language models (LLMs) in academic publishing. Measured against the tens of thousands of citations across the full set of accepted papers, 100 errors is a small fraction, and one could argue it is statistically negligible. The significance of these findings, however, goes beyond the raw numbers.
When AI researchers publish at top-tier venues such as NeurIPS, a great deal is at stake. Even if only about 1% of papers cite fabricated references, that is a serious problem: it erodes public trust in the integrity of scholarly publishing across machine learning and artificial intelligence. GPTZero also recently flagged a May 2025 paper entitled “The AI Conference Peer Review Crisis,” which addresses these same problems at NeurIPS, one of the field’s top conferences.
Despite the findings, NeurIPS representatives maintain that “even if 1.1% of the papers have one or more incorrect references due to the use of LLMs, the content of the papers themselves is not necessarily invalidated.” The statement acknowledges the importance of getting citations right while arguing that citation accuracy is not the only measure of a paper’s value.
Still, 100 hallucinated citations across 51 papers is a considerable issue for practitioners who want to ensure LLMs are used responsibly. The reputations of today’s leading AI researchers rest in part on their ability to do exactly that. If even experts struggle to keep their citations accurate, the outlook is worse for ordinary users, who are even more likely to run into the same pitfalls.
TechCrunch’s coverage of the findings sparked a public outcry, and the ensuing discussion underscored how academic integrity is being challenged in the era of AI. Growing dependence on LLMs makes rigorous scholarly publishing standards all the more important, and those standards are crucial for public confidence in the scientific enterprise.