EHU CRG events are held via MS Teams, and are free to attend. Times are GMT. Registration opens three weeks before each event. The MS Teams link is sent to registered participants the day before the event. If you have any questions, contact the organiser: Costas Gabrielatos ([email protected]).
—————————————————————————————————————————————————————
MEETING #17 Thursday 13 November 2025, 3-5 pm (GMT)
CLICK HERE TO REGISTER. If you have problems registering, email [email protected].
Topic: LLMs and Corpus Tools
Mark Davies (English-Corpora.org, USA)
Integrating information from AI / LLMs into English-Corpora.org
In March 2025 I released seven detailed studies that discuss how well the predictions from LLMs (Large Language Models) match the actual data from large, well-known, publicly-accessible corpora (like those from English-Corpora.org). The seven detailed studies dealt with word frequency, phrase frequency, collocates, comparing words (via collocates), genre-based variation, historical variation, and dialectal variation.
But it’s probably not a question of “either/or” with corpora and AI; rather it is probably an issue of “and/with”. Why not take the strengths of AI / LLMs and integrate them right into the corpus interface? As the comparisons between corpora and AI/LLMs indicate, what LLMs are really good at is classifying and explaining linguistic data.
So as of September 2025, English-Corpora.org allows users to combine the depth and reliability of corpus data with the analytic power of LLMs like GPT, Gemini, Claude, Perplexity, Llama, Mistral, and DeepSeek. With just one click, the corpus can send collocates, frequency patterns, phrase lists, or concordance lines to an LLM via an “API call”, and then the LLM instantly groups, explains, and interprets the data, and returns that to the corpus. These AI-powered insights appear directly in the interface, alongside the original corpus results (while still keeping it very clear which is the corpus data, and which are the AI categorizations or analyses).
The following are the types of analyses / categorizations that are now available to end users:
- Classifying and categorizing collocates, such as collocates of cap or identity
- Classifying and categorizing phrases, such as soft NOUN
- Comparing two words (via collocates), such as quandary vs predicament
- Comparing genres, time periods, and dialects (two sections), such as chain + NOUN (fic / acad), ADJ women (1800s / now), or ADJ scheme (US / UK)
- Comparing genres, time periods, and dialects (all sections), such as soft NOUN (genres), ADJ food (historical), or *ism words (dialects)
- Comparing genres, time periods, and dialects (charts), such as the “like construction” (genres), need NEG (historical), or soft day (dialects)
- Analyzing KWIC/concordance lines, such as the patterns for fathom or naked eye (including collocations, semantic prosody, syntactic patterns, and pragmatic functions)
- Generating words and phrases for topics and concepts, such as: climate change, famous actresses, or female jobs in 1800s
- Generating words and phrases via translations, such as German sowohl alt als jung, Russian финансовое состояние, or Korean 중요한 사안
- Generating words and phrases to find “more natural” phrases, such as make a photo (perhaps from Japanese 写真を撮る), pleasing scenery, or tough idea
Users can also seamlessly move from one LLM to another, see the results in any one of 30 different languages, and create a simple “AI profile” (e.g. learner, teacher, translator, or linguist), which helps the AI to provide even more customized and helpful results.
English-Corpora.org already has the most widely used online corpora. But with these new AI-powered features, the corpora should be even more useful for teachers, learners, and researchers.
—————————————————————————————————————————————————————
MEETING #18 Thursday 8 or Friday 9 January 2026, 2-4 pm
Topic: Corpus Tools
Pavel Rychlý (Masaryk University, Czech Republic)
TITLE TBC
—————————————————————————————————————————————————————
MEETING #19 Friday 6 March 2026, 10:00-11:30 am
Topic: LLMs, Corpus Linguistics, and Language Learning
Peter Crosthwaite (University of Queensland, Australia)
From induction to generation: Advances in combining corpus approaches to language learning with LLMs
—————————————————————————————————————————————————————
