All events are held via MS Teams and are free to attend.
Registration for each event opens two weeks in advance.
The MS Teams link will be sent the day before the event.

MEETING #11: Thursday 29 February 2024
Topic: Corpus Methodology
Matteo Di Cristofaro (University of Modena and Reggio Emilia, Italy)
One dataset, many corpora: Problems of scientific validity in corpora and corpus-derived results


Click here to register (free). Registration closes on Wednesday 28 February at 11am (GMT).



Corpus linguistics has, since its inception, recognised the relevance of digital technologies as a major driving force behind corpus techniques and their (r)evolution in the study of language (cf. Tognini-Bonelli 2012). And yet, while both corpus linguistics and digital technologies have frequently benefited from each other (the case of NLP/NLU is one such macro example), their pathways have often diverged. The result is a disconnect between corpus linguistics and digital data processing whose effects directly impinge on the ability to analyse language through software tools. This disconnect is becoming more and more relevant as corpus linguistics is applied to vast amounts of data obtained from manifold sources – including a wide array of social media platforms, each with its own linguistic and technical peculiarities.

As the ground truth of an ever-increasing number of language studies, corpora must be able to correctly treat and represent such peculiarities: e.g. the dialogic dimension of comments or forum posts; the presence (and potential subsequent normalisation) of spelling variations; the use of hashtags and emojis. If they fail to do so, corpus-derived results will likely present researchers with a falsified view of the language under scrutiny.
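As a rough illustration of the point about hashtags and emojis, the sketch below counts tokens in one invented post with two tokenisers. The post, the regular expressions, and the variable names are assumptions made for this example only; they are not taken from the talk or from any particular corpus tool.

```python
import re
from collections import Counter

# Invented example post (an assumption for illustration only).
post = "Love this #CorpusLinguistics talk 😀 #corpuslinguistics"

# Lowercasing is itself a normalisation step: it merges the two hashtag spellings.
text = post.lower()

# Naive tokeniser: keeps only runs of ASCII letters, so '#' and the emoji are lost.
naive_tokens = re.findall(r"[a-z]+", text)

# Alternative tokeniser: hashtags stay whole, and any other symbol
# that is neither a word character nor whitespace becomes its own token.
aware_tokens = re.findall(r"#\w+|\w+|[^\w\s]", text)

naive_counts = Counter(naive_tokens)
aware_counts = Counter(aware_tokens)

print(naive_counts.most_common())
print(aware_counts.most_common())
```

With the naive tokeniser the hashtag is counted as an ordinary word and the emoji disappears from the counts altogether, so the frequencies describe something that was not in the original post.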

What is at stake is not the ability to “count” what is in a corpus, but rather whether what is being counted is or is not a feature present in the original data – of which the corpus should be a faithful representation.

The presentation is consequently devoted to tackling digital technicalities, i.e. “those notions and mechanisms that – while not classically associated with natural language – are i) foundational of the digital environments in which language production and exchanges occur and ii) at the core of the techniques that are used to produce, collect, and process the focus of investigation, that is, digital textual data.” (Di Cristofaro 2023:5). One such example is represented by character encodings: although at the “core” of the whole corpus linguistics enterprise (cf. McEnery and Xiao 2005; Gries 2016:39,111) – since they allow written language to be processed by a computer and understood by humans – these are often overlooked at all stages of corpus compilation and analysis, potentially leading linguists to tamper involuntarily with the data and its linguistic contents.
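A minimal sketch of how an overlooked character encoding can alter the data: the same UTF-8 bytes are decoded once correctly and once as Latin-1. The sample string and variable names are assumptions chosen for illustration and do not come from the presentation.

```python
from collections import Counter

# Invented sample text with accented characters (an assumption for illustration).
original = "città perché naïve"
raw_bytes = original.encode("utf-8")  # what would be stored on disk

# Decoding with the encoding the bytes were written in recovers the text.
right = raw_bytes.decode("utf-8")

# Decoding as Latin-1 raises no error, but each two-byte UTF-8 sequence
# becomes two unrelated characters, so the word forms silently change.
wrong = raw_bytes.decode("latin-1")

right_counts = Counter(right.split())
wrong_counts = Counter(wrong.split())

print(right_counts)
print(wrong_counts)

# In this particular case the damage is reversible, because Latin-1 maps
# every byte to exactly one character.
recovered = wrong.encode("latin-1").decode("utf-8")
```

Because the wrong decoding raises no exception, the corpus would load and could be searched as usual; a query for "città" would simply return nothing.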

Starting from practical examples, the presentation discusses the implications that digital technicalities have on corpora and their analyses – or rather, what happens when they are not properly treated – while outlining (also in the form of Python scripts and practical tools) potential new pathways that a “digital-aware” perspective of corpus linguistics can open up.


Di Cristofaro, Matteo. Corpus Approaches to Language in Social Media. Routledge Advances in Corpus Linguistics. New York: Routledge, 2023.

Gries, Stefan Th. Quantitative Corpus Linguistics with R: A Practical Introduction. 2nd ed. New York: Routledge, 2016.

McEnery, Tony, and Richard Xiao. ‘Character Encoding in Corpus Construction’. In Developing Linguistic Corpora: A Guide to Good Practice, edited by Martin Wynne, 47–58. Oxford: Oxbow Books, 2005.

Tognini Bonelli, Elena. ‘Theoretical Overview of the Evolution of Corpus Linguistics’. In The Routledge Handbook of Corpus Linguistics, edited by Anne O’Keeffe and Michael McCarthy, 14–27. Routledge Handbooks in Applied Linguistics. Milton Park, Abingdon, Oxon; New York: Routledge, 2012.

MEETING #12: Thursday 25 April 2024
Sylvia Jaworska (University of Reading, UK)
Human vs Machine: A critical evaluation of the usefulness of topic modelling vs a corpus-assisted approach to discourse