All events are via MS Teams, and are free to attend.
Registration for each event opens two weeks in advance.
The MS Teams link will be sent the day before the event.

MEETING #12: Thursday 25 April 2024, 2:00-3:30 pm (UK time)
Topics: Corpus Methodology, Large Language Models
Sylvia Jaworska (University of Reading, UK) & Mathew Gillings (Vienna University of Economics and Business, Austria)
How humans vs. machines identify discourse topics: an exploratory triangulation

Identifying discourses and discursive topics in a set of texts has been of interest not only to linguists but also to researchers across the social sciences. Traditionally, such analyses have been small-scale and interpretive, involving some form of close reading. Close reading, however, is only feasible when the dataset is small, and it leaves the analyst open to accusations of bias and/or cherry-picking.

Other methods designed to avoid these issues have emerged, involving larger datasets and some form of quantitative component. Within linguistics, this has typically meant corpus-assisted methods, whilst outside of linguistics, topic modelling is one of the most widely used approaches. Increasingly, researchers are also exploring the utility of LLMs (such as ChatGPT) to assist in the analysis and identification of topics. This talk reports on a study assessing the effect that the analytical method has on the interpretation of texts, specifically in relation to the identification of their main topics. Using a corpus of corporate sustainability reports totalling 98,277 words, we asked six researchers, along with ChatGPT, to interrogate the corpus and decide on its main 'topics' via four different methods. Each successive method increases the amount of context available to the analyst.

Method A: ChatGPT was used to categorise the topic model output and assign topic labels;

Method B: Two researchers were asked to view a topic model output and assign topic labels based purely on eyeballing the co-occurring words;

Method C: Two researchers were asked to assign topic labels based on a concordance analysis of 100 randomised lines of each co-occurring word;

Method D: Two researchers were asked to reverse-engineer a topic model output by creating topic labels based on a close reading.
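The concordance analysis in Method C relies on keyword-in-context (KWIC) lines: each occurrence of a co-occurring word is shown with a window of surrounding text, and a random sample of lines is drawn. As a rough illustration of the idea (not the authors' actual tooling), a minimal KWIC sampler might look like the sketch below; the function name, window width, and toy corpus are all hypothetical.

```python
import random
import re

def concordance(text, keyword, width=30, sample=100, seed=0):
    """Return up to `sample` randomised KWIC lines for `keyword`,
    each showing `width` characters of left and right context.
    (Illustrative sketch only; parameters are assumptions.)"""
    # Find every whole-word, case-insensitive occurrence of the keyword.
    hits = [m.start() for m in
            re.finditer(rf"\b{re.escape(keyword)}\b", text, re.IGNORECASE)]
    # Randomise the order so the sample is not biased towards early hits.
    random.Random(seed).shuffle(hits)
    lines = []
    for i in hits[:sample]:
        left = text[max(0, i - width):i].rjust(width)
        right = text[i + len(keyword):i + len(keyword) + width].ljust(width)
        lines.append(f"{left}[{keyword}]{right}")
    return lines

# Toy corpus for illustration (not the sustainability-report data).
corpus = ("The report highlights emissions targets. Emissions fell in 2023. "
          "Reducing emissions remains the core sustainability goal.")
for line in concordance(corpus, "emissions", width=20, sample=3):
    print(line)
```

Sampling randomised lines, as in Method C, trades exhaustiveness for feasibility: the analyst sees enough context to label a topic without reading every occurrence.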

The talk explores how the identified topics differed both between researchers in the same condition and between researchers in different conditions, shedding light on some of the mechanisms underlying topic identification by machines vs. humans, or by machines assisted by humans. We conclude with a series of tentative observations regarding the benefits and limitations of each method, along with suggestions for researchers selecting an analytical approach for discourse topic identification. While this study is exploratory and limited in scope, it opens the way for further, larger-scale methodological triangulations of corpus-based analyses with other computational methods, including AI.

—————————————————————————————————————————————————————–