MEETING #7: Thursday 30 March 2023, 2-4 pm (UK time)

Topic: Corpus Tools & Corpus Processing

Mike Scott (Lexical Analysis Software & Aston University),

News Downloads and Text Coverage: Case Studies in Relevance

Presentation video and slides


Thank goodness it is now quite easy for anyone with the relevant permissions (and patience) to download thousands of text files from online databases such as those of LexisNexis and Factiva. After downloading there are numerous issues to be handled in checking the format of the text, cleaning out remnants of HTML, handling references to images, formulae, reader comments, sorting them by date or source etc.

The problem to be addressed here, however, chiefly concerns a) duplicate contents and b) relevance to the user’s research aims.

News texts in particular suffer increasingly from duplication as minor changes are mad, or as a news story grows hour by hour.

Also, anyone who has looked at such searches will have noticed that some texts are really centrally concerned with the issue being studied, for example the brewing of beer, or the characters in Middlemarch, but others merely make a passing mention, such as “Peter always liked his beer” in an obituary or “most informants reportedly had never read Middlemarch” in a survey of hobbies and interests.

In this presentation, using WordSmith Tools 8.0,  I shall attempt to quantify both the degree of content duplication in three sets of text, and the degree of central relevance to a theme.   


After teaching English for many years in Brazil and Mexico, Mike Scott moved to Liverpool University in 1990, teaching initially Applied Linguistics generally but then specialising in Corpus Linguistics. In 2009 he moved to Aston University. In the early 1980s he learned computer programming and began to develop corpus software: MicroConcord (1993 with Tim Johns), WordSmith Tools 1996, which is now in version 8. His current field is corpus linguistics, sub-field corpus linguistics programming.

MEETING #6: Thursday 2 March 2023, 2-4 pm (UK time)

Topic: Manual Annotation in Discourse-Oriented Corpus Studies

1. Katia Adimora (Edge Hill University): Annotating Mexican immigration discourses


Discourses of Mexican and American newspapers about Mexican immigration to the US during Trump’s presidency were pragmatically annotated according to the attitudes they express about different semantical aspects of immigrants and immigration, such as border, families, and crime.

Firstly, by deploying the Mexican immigration corpus (in Spanish), American immigration corpus (in English) and corpus tool Sketch Engine (SE),  the researcher conducted the search for concordances for the search terms ‘immigration’, ‘immigrant’ and ‘inmigración’ and ‘inmigrante’, respectively. Secondly, the random samples of fifty concordances for each term were extracted. Concordances were transferred to the word table and manual annotated according to the pragmatic perspective they express towards immigrants.  After all 200 instance were annotated and no additional attitudes were identified, the annotation scheme with codes for the expressed attitudes and their definitions were created. Attitudes towards immigration were classified into three levels:

For example, second level ‘attitude positive-anti-Trump’ with a code (Att_ pos-aT) includes instances where the main sentiment is positive towards immigration, which is shown via opposition towards Trump’s anti-immigration policies:

In Fountain Hills, Trump blamed immigrants in the country illegally for “so many killings, so much crime.” He then went after rival presidential candidates Ted Cruz and John Kasich, saying his approach to illegal immigration was tougher than theirs.

2. Dan Malone (Edge Hill University)

A lone wolf from the ISIS pack: Hunting discourses through manual annotation


The process of manual corpus annotation, where researchers add interpretive information to corpus data, is a valuable tool for systematically analysing linguistic or semantic features in a corpus. A core aspect of manual annotation is the annotation scheme – a set of guidelines for labelling corpus content, including annotation categories, definitions, and examples (McEnery & Hardie 2012: 90).

In this talk, I will introduce the manual annotation approach and annotation scheme I developed for analysing the representations of the nexus between lone-wolf terrorists and the extremist groups ISIS and al Qaeda in the British press. The underpinning goal of this annotation scheme is to systematically reveal discourse prosodies, in other words, the implicit and explicit attitudes (Stubbs 2001: 66) towards the lone-wolf terrorist.

The context for this talk is my doctoral research project “Constructing the Lone Wolf Terrorist: A corpus-based critical discourse analysis.” The dataset used in this study is The Lone Wolf Corpus, a purpose-built corpus consisting of approximately 8.5 million words and 8,600 texts from UK national newspapers, published between 2000 and 2019 (Malone 2020).

I will describe the iterative process used to develop the annotation scheme, which involved cycles of annotation. I will also provide a detailed explanation of the scheme’s four distinct categories and illustrate each category with examples. The first two categories identify the type of entity represented by the node LONE WOLF, as well as the collocates ISIS and AL QAEDA, and determine whether each entity is portrayed as an active and dynamic force (Van Leeuwen, 2008: 33). The third category denotes the connection between the node and the collocate, while the fourth category outlines the discursive link.

Additionally, I will address the practical challenges that arose during the development of the annotation scheme, namely: (1) the need to avoid top-down categorisation, (2) the difficulty of balancing scheme richness with the intensity of labour required for its application, and (3) ensuring the reliability of the coding process.


Malone, D. (2020). Developing a complex query to build a specialised corpus: Reducing the issue of polysemous query terms. Corpora and Discourse International Conference 2020.

McEnery, T., & Hardie, A. (2012). Corpus linguistics: Method, theory and practice. Cambridge University Press.

Stubbs, M. (2001). Words and phrases: Corpus studies of lexical semantics. Oxford: Blackwell.

Van Leeuwen, T. (2008). Discourse and Practice: New Tools for Critical Discourse Analysis. Oxford University Press.


MEETING #5: Thursday 15 December 2022

Topic: Corpus Tools & Semi-Automated Annotation

Martin Weisser (University of Salzburg)

Doing Corpus Pragmatics in DART 3


The Dialogue Annotation and Research Tool (DART) is a freeware tool that makes it possible to annotate large amounts of dialogue data semi-automatically on a number of linguistic levels, as well as post-process and analyse the resulting corpora efficiently using various corpus analysis methods. Perhaps arguably, the most interesting and important levels of analysis produced by the tool from the point of view of corpus pragmatics are the syntactic and the speech-act level, but DART annotations also comprise information about the semantics (topics), semantico-pragmatic (modes; Searle’s IFIDs), surface polarity, (completion) status, and disfluency of units. In this talk, I want to begin by providing a brief overview of the background, genesis and development of the tool. Next, we’ll discuss the different levels of annotation, their potential significance in pragmatics, and how they may work together in determining the pragmatic meaning potential in the form of speech acts. I’ll then briefly illustrate how it’s possible to adapt the resources used for the annotation process to new domains, as well as the steps involved in the annotation process itself. And last, but not least, we’ll explore the different analysis options related to speech acts and other patterns the tool has to offer.


MEETING #4: Friday 8 April 2022

Topics: Corpus Tools & Manual Annotation

Encarnación Hidalgo-Tenorio (University of Granada)
Miguel-Ángel Benítez-Castro (Universidad de Zaragoza)

1. Workshop: Manual Annotation with UAM Corpus Tool.

2. Presentation: Analysing Extremism under the Lens of Appraisal Theory.


Appraisal Theory is aimed to understand how social relations are negotiated through alignment, as linguistically realised by the axes of ENGAGEMENT, GRADUATION and ATTITUDE (Martin & White 2005). Of the three subsystems, the latter has attracted more attention so far. ATTITUDE helps classify instances of emotion/al talk through the meaning domains of AFFECT, JUDGEMENT and APPRECIATION. As argued by White (2004) and Bednarek (2009), emotional talk may entail the more indirect expression of emotion by attending to ethical and aesthetic values. Given the omnipresence of affect in language (including Ochs & Schieffelin 1989; Barrett 2017), there is growing consensus about treating AFFECT as a superordinate category, now taken to include the expression of EMOTION (emotional evaluation) and OPINION (ethical and aesthetic evaluation). As emotion permeates all levels of linguistic description (including Alba-Juez & Thompson 2014; Alba-Juez & Mackenzie 2019), and all utterances are produced and interpreted through emotions (Klann-Delius 2015), AFFECT may be enriched through a more explicit focus on affective psychology, thereby proposing more sharply defined categories that may better describe any instance of emotive language (Thompson 2014). This paper shows how Benítez-Castro & Hidalgo-Tenorio’s (2019) more psychologically-driven Appraisal EMOTION sub-system can lead to a user-generated Appraisal scheme allowing a more fine-grained analysis of the complex interplay between (explicit and implicit) EMOTION and OPINION in discourse. To do so, we draw on examples and findings from two research strands we have covered so far: American right-wing populist discourse (Hidalgo-Tenorio & Benítez-Castro 2021b) and Jihadist propaganda (Benítez-Castro & Hidalgo-Tenorio Forthcoming).

Encarnación Hidalgo-Tenorio is Professor in English Linguistics at the University of Granada, Spain. Her main research area is corpus-based CDA, where she focuses on the notions of representation and power enactment in public discourse. She has published on language and gender, Irish studies, political communication, and has also paid attention to the analysis of the way identity is discursively constructed. She has tried to develop, or reconsider, some interesting aspects taken from SFL such as Transitivity, Modality, or Appraisal. Currently, she is working on the lexicogrammar of radicalization. Address for correspondence: Departamento de Filologías Inglesa y Alemana, Facultad de Filosofía y Letras, Campus de Cartuja s/n, 18071, University of Granada, Spain. <[email protected] 

Miguel-Ángel Benítez-Castro is lecturer in English Language at the University of Zaragoza, Spain. His main research interest lies in SFL-inspired discourse analysis, based on corpus-driven methodologies, which he has managed to apply to his general focus on the interface between lexical choice, discourse structure, and evaluation. This is reflected in his previous and ongoing research on shell-noun phrases, on the evaluation of social minorities in public discourse and on the refinement of SFL’s linguistic theory of evaluation. Address for correspondence: Department of English and German Studies, Facultad de Ciencias Sociales y Humanas, Universidad de Zaragoza, Ciudad Escolar, s/n, 44003 Teruel.

MEETING #3: Wednesday 8 February 2022

Topic: Corpus Tools

Andrew Hardie (Lancaster University)

What’s new in CQPweb – 2022 edition

In this informal workshop / presentation, Andrew Hardie will give an overview of the latest new features in CQPweb version 3.3. This includes, most notably, the option for users to upload their own corpora to the system, tagging the data using either CLAWS or TreeTagger – plus the new system that enables other users on the same server to share access to these uploaded corpora. Participants are welcome to try it out “in real time” during the session.

MEETING #2: Wednesday 15 December 2021

Topic: Corpus Tools & Automated Annotation

Paul Rayson (Lancaster University)

Counting words or wording counts?

A wide variety of tools and methods are available across a number of disciplines (e.g. Education, History, Linguistics, Literature, Psychology) for the analysis of text, and many of the techniques (e.g. content analysis, topic modelling, sentiment analysis) rely on counting words. However, words can take different meanings in different contexts, and around 16% of running text counts as semantically meaningful multiword expressions (where the meaning of the whole expression is different from the collection of individual words). In this talk, I will describe what can be achieved by combining methods from computer science (natural language processing) with linguistics (corpus linguistics) to address these issues. The talk will cover the basics of semantic annotation where words and multiword expressions are automatically labelled with coarse-grained word senses using the UCREL Semantic Analysis System (USAS). Then, via a demonstration of the web-based Wmatrix tool, I will show how counting USAS categories and comparing the frequency profiles with those from other texts can be used to quickly gist a text or corpus. Along the way, I will provide some pointers to example case studies in psychology, political discourse analysis, and beyond, describe current research and development on open source USAS multilingual taggers, and provide attendees with pointers for Wmatrix access and further tutorials to follow up later using your own corpora.

Bio note

I am a Professor in Computer Science at Lancaster University, UK and Director of the UCREL interdisciplinary research centre which carries out research in corpus linguistics and natural language processing (NLP). A long term focus of my work is semantic multilingual NLP in extreme circumstances where language is noisy e.g. in historical, learner, speech, email, txt and other CMC varieties. Along with domain experts, I have applied my research in the areas of dementia detection, mental health, online child protection, cyber security, learner dictionaries, and text mining of biomedical literature, historical corpora, and financial narratives. I was a co-investigator of the five-year ESRC Centre for Corpus Approaches to Social Science (CASS) which is designed to bring the corpus approach to bear on a range of social sciences. I’m also a member of the multidisciplinary Institute Security Lancaster, the Lancaster Digital Humanities Hub, and the Data Science Institute.

MEETING #1: Wednesday 10 November 2021

Topic: Constructing Topic-Specific Corpora

1. Dan Malone (Edge Hill University)

Constructing the Lone Wolf Corpus: Using polysemous query terms to compile a topic-specific corpus

This paper is concerned with the process of developing a query to compile a topic-specific corpus from a text database. For a corpus to be topic-specific, its texts must be relevant to the topic(s) for which it was compiled to investigate. However, polysemous query terms are more likely than monosemous query terms to retrieve nonrelevant texts and, therefore, reduce query precision, that is, the ratio of relevant to nonrelevant texts retrieved. More specifically, then, this paper suggests that the issue of polysemous query terms can be addressed through the implementation of a dual-group complex query (hereafter, DGQ).

The motivation for this paper arose while compiling the corpus for my PhD project ‘Constructing the Lone Wolf Terrorist: A corpus-driven study of the British press’. In its actor-based approach, corpus compilation was underpinned by an onomasiological perspective of the connection between lexical items and the concept of ‘the lone wolf terrorist’. According to Geeraerts (2010: 23), “onomasiology takes its starting point in a concept and investigates the different expressions via which the concept can be designated or named”. This is opposed to semasiology, which “takes its starting point in the word as a form and charts the meanings that the word can occur with” (ibid). Indeed, the concept ‘lone wolf terrorist’ can be expressed via a number of polysemous lexical items, such as lone actorlone attacker, and solo actor, with the specificity of their meaning being derived from context.

To compile the Lone Wolf Corpus (LWC), rather than employing a simple query-string, the DGQ was devised to mitigate the polysemy of its query terms. It comprises two distinct groups of terms, with each based around a core semantic component of ‘lone-wolf terrorist’; Group A terms represented lone-wolf actors or actions, whereas Group B represented terrorism. By linking terms within each group with the Boolean operator OR and by then linking the two groups using AND, the query retrieved texts containing at least one term from each group. By drawing on textual context in the form of collocation, the potential for multiplicity of meaning of the polysemous query terms is restricted, leading to a reduction in the number of nonrelevant texts being retrieved by the query.
This paper develops the query formulation technique outlined by Gabrielatos (2007). Central to Gabrielatos’s technique is the metric of relative query term relevance (RQTR), which establishes the degree of relevance of candidate query terms to the topic being investigated. The RQTR technique has been adopted in a number of studies, such as Prentice (2010), Dimmroth, Steiger & Schünemann (2017), and Kreischer (2019), as a means to both expanding queries and establishing the relevancy of candidate terms. This paper expands the applicability of the RQTR method by illustrating how it can be applied to the DGQ and, therefore, cater for polysemous query terms.

From the initial core query terms lone wolf and terrorism, the LWC query was expanded to seventy query terms. When applied to the Lone Wolf Corpus (LWC) query, the DGQ improved query precision at minimal expense to recall, relative to a simple query. Based on a systematic sampling, the results show that the DGQ improved precision from 0.46 to 0.89, which was gained at the minimal expense of a 0.08 decrease in retrieved relevant texts.


Gabrielatos, C. (2007). Selecting query terms to build a specialised corpus from a restricted access database. ICAME Journal, 31, 5-44.

Geeraerts, D. (2010). Theories of Lexical Semantics. Oxford University Press.

Kreischer, K. S. (2019). The relation and function of discourses: a corpus-cognitive analysis of the Irish abortion debate. Corpora, 14(1), 105-130.

Prentice, S. (2010). Using automated semantic tagging in Critical Discourse Analysis: A case study on Scottish independence from a Scottish nationalist perspective. Discourse & Society, 21(4), 405–437.

Steiger, S., Schünemann, W. J., & Dimmroth, K. (2017). Outrage without consequences? Post-Snowden discourses and governmental practice in Germany. Media and Communication, 5(1), 7-16.

2. Katia Adimora (Edge Hill University)

Building bilingual corpora for Critical Discourse Analysis: Mexican immigration to the US

This talk will address the building of topic-specific corpora about Mexican immigration to the US during Donald Trump era. The corpora contain American and Mexican newspaper articles that cover Mexican immigration (44,779 articles, 30 million words). The aim is to analyse how immigrants are represented in them.  

The newspapers included in the corpora are:
US newspapers: New York Times, Washington Post, USA Today, Los Angeles Times, The Arizona Republic and Chicago Tribune. 
Mexican newspapers: El Universal,, Reforma, El Norte, and Mural.

To gather the relevant articles, three-part query was formed based on the reading through various American and Mexican articles, and by identifying the words that are deployed to talk about immigrants or immigration. Bilingual queries: in English and Spanish, needed to be constructed. Spanish query terms are synonyms to English ones, however not necessary the literal translation from English as Mexican newspapers do not use specific expression that is used in English, or they use different expressions to talk about immigration and immigrants.   
Articles were transferred from online database ProQuest (Global Newsstream) to the software tool Sketch Engine (  

American and Mexican corpora were divided in subcorpora to be able to compare how the newspapers in American states with the highest number of Mexican immigrants, represent them in comparison to national newspapers. Similarly, Mexican subcorpora was formed to compare how newspapers from the regions in Mexico with the high number of Mexican migrants that move to the US address them compared to the national newspapers.   

These subcorpora division differs from the one commonly applied to the British press, on broadsheets vs. tabloid, and according to political leaning on leftist, rightist and centrist. This is due to the difficulty to draw the line between these types of grouping of American, and especially, Mexican newspapers.  


Kilgarriff, Adam, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, Vít Suchomel (2014) The Sketch Engine: ten years on. Lexicography, 1: 7-36.