EHU CRG events are via MS Teams, and are free to attend. The MS Teams link is sent the day before the event. If you have problems registering, or have any questions, please contact the organiser, Costas Gabrielatos ([email protected]).

—————————————————————————————————————————————————————

MEETING #15 Friday 7 March 2025, 3:15-4:30 pm
Topic: LLMs
Yannis Korkontzelos (Edge Hill University)
Detecting Text Generated by Large Language Models: A novel statistical technique to address paraphrasing

Abstract

Controlling illegitimate use of AI in a multitude of educational and professional contexts requires automated systems able to detect text generated by Large Language Models (LLMs) and to distinguish it from human writing. Current techniques perform well unless the text has been automatically paraphrased. In this talk, we will discuss research conducted jointly with Mr Amir Amini. We will begin by exploring paraphrasing: how much it can diminish the accuracy of detectors of AI-generated text, and why.

We will identify a property of the probability functions of LLMs that can be useful for detecting LLM-generated text, even after paraphrasing. We then embed it in a state-of-the-art detector, DetectGPT (Mitchell et al., 2023), to form a new technique for detecting text generated by a particular LLM. We will discuss experiments and results demonstrating that this technique is more robust against paraphrasing attacks than recently introduced techniques, including DetectGPT and LogRank.
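For readers unfamiliar with rank-based scoring, the toy Python sketch below illustrates the general idea behind the LogRank-style baselines mentioned above (Su et al., 2023): score a text by how highly a language model ranks each observed token among its candidates, with a lower average log-rank suggesting machine generation. The fixed unigram distribution here is an invented stand-in for a real LLM's contextual next-token probabilities; it is not the speakers' new technique.

```python
import math

# Toy stand-in for an LLM's next-token distribution: a fixed unigram model.
# A real log-rank detector would use a scoring LLM's contextual probabilities.
UNIGRAM = {"the": 0.20, "of": 0.15, "model": 0.10, "text": 0.10,
           "a": 0.10, "is": 0.10, "generated": 0.08, "by": 0.07,
           "quokka": 0.05, "zeugma": 0.05}

def token_rank(token, dist):
    """Rank of `token` when candidates are sorted by descending probability (1 = most likely).
    Unseen tokens are assigned the worst possible rank."""
    ordered = sorted(dist, key=dist.get, reverse=True)
    return ordered.index(token) + 1 if token in dist else len(ordered) + 1

def avg_log_rank(tokens, dist=UNIGRAM):
    """Average log-rank of the tokens: lower values mean the model itself ranks
    the observed tokens highly, a signal associated with machine-generated text."""
    return sum(math.log(token_rank(t, dist)) for t in tokens) / len(tokens)
```

A sequence of highly ranked tokens such as `["the", "of", "model", "text"]` yields a lower score than low-ranked tokens such as `["quokka", "zeugma"]`, so thresholding the score separates the two regimes in this toy setting. Paraphrasing attacks work precisely by substituting tokens the scoring model ranks lower, pushing machine text above such a threshold.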

References

E. Mitchell, Y. Lee, A. Khazatsky, C.D. Manning, and C. Finn. DetectGPT: Zero-shot machine-generated text detection using probability curvature. arXiv (Cornell University), 2023.

Sebastian Gehrmann, Hendrik Strobelt, and Alexander Rush. GLTR: Statistical detection and visualization of generated text. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 111–116. Association for Computational Linguistics, 2019.

X. Hu, P.-Y. Chen, and T.-Y. Ho. RADAR: Robust AI-text detection via adversarial learning. Advances in Neural Information Processing Systems, 2023.

Y. Li, Q. Li, L. Cui, W. Bi, L. Wang, L. Yang, S. Shi, and Y. Zhang. Deepfake text detection in the wild. arXiv (Cornell University), 2023.

C. Opara. StyloAI: Distinguishing AI-generated content with stylometric analysis. In International Conference on Artificial Intelligence in Education. Springer Nature Switzerland, 2024.

J. Su, T. Zhuo, D. Wang, and P. Nakov. DetectLLM: Leveraging log rank information for zero-shot detection of machine-generated text. arXiv preprint, arXiv:2306.05540, 2023.

Y. Zhou and J. Wang. Detecting AI-generated texts in cross-domains. In Proceedings of the ACM Symposium on Document Engineering 2024, pages 1–4, 2024.

—————————————————————————————————————————————————————

MEETING #16 Friday 2 May 2025, 2-3 pm
Topic: LLMs and Lexical Priming Theory
Michael Pace-Sigge (University of Eastern Finland)
Large-Language-Model Tools and the Theory of Lexical Priming: Where technology and human cognition meet and diverge

Abstract

This paper revisits Michael Hoey’s Lexical Priming Theory (2005) in the light of recent discussions of Large Language Models as forms of machine learning (commonly referred to as AI), which have attracted much publicity in the wake of tools like OpenAI’s ChatGPT or Google’s BARD/Gemini. Historically, theories of language have faced inherent difficulties, given language’s exclusive use by humans and the complexities involved in studying language acquisition and processing. The intersection between Hoey’s theory and machine-learning tools, particularly those employing Large Language Models (LLMs), has been highlighted by several researchers. Hoey’s theory relies on the psychological concept of priming, aligning with approaches dating back to M. Ross Quillian’s 1960s proposal for a “Teachable Language Comprehender.” The theory posits that every word is primed for discourse based on cumulative effects, a concept mirrored in how LLMs are trained on vast corpora of text data.

This paper tests LLM-produced samples against naturally (human-)produced material in a number of language-use situations, examines results from AI research, and compares these results with Hoey’s own account of his theory. While LLMs can display a high degree of structural integrity and coherence, they still appear to fall short of human-language criteria, which include grounding and the intention to meet a communicative need.

References

Hoey, M. (2005). Lexical Priming. London: Routledge.

Hoey, M. (2009). Corpus-driven approaches to grammar. In Römer, U. & Schulze, R. (eds.), Exploring the lexis-grammar interface. Amsterdam/Philadelphia: John Benjamins, pp. 33-47.

Pace-Sigge, M. & Sumakul, T. (2022). What Teaching an Algorithm Teaches When Teaching Students How to Write Academic Texts. In Jantunen, J. H., et al. (eds.), Diversity of Methods and Materials in Digital Human Sciences: Proceedings of the Digital Research Data and Human Sciences DRDHum Conference 2022.

Quillian, M. R. (1967). Word concepts: A theory and simulation of some basic semantic capabilities. Behavioral Science, 12(5), 410-430. https://doi.org/10.1002/bs.3830120511

Tools

Brezina, V. & Platt, W. (2023). #LancsBox X. Lancaster University. http://lancsbox.lancs.ac.uk.

Google [2023] (2024). BARD/Gemini. https://BARD.google.com/chat

OpenAI [2022] (2024). ChatGPT (GPT-3.5). https://chat.openai.com/

Scott, M. (2023). WordSmith Tools version 8, Stroud: Lexical Analysis Software.