Thursday 30 March 2023, 2-4pm (UK time)

Topic: Corpus Tools and Corpus Processing

Mike Scott

(Lexical Analysis Software & Aston University)

News Downloads and Text Coverage: Case Studies in Relevance



Thank goodness it is now quite easy for anyone with the relevant permissions (and patience) to download thousands of text files from online databases such as those of LexisNexis and Factiva. After downloading, numerous issues remain to be handled: checking the format of the text, cleaning out remnants of HTML, handling references to images, formulae and reader comments, sorting the texts by date or source, etc.

The problem to be addressed here, however, chiefly concerns a) duplicate contents and b) relevance to the user’s research aims.

News texts in particular suffer increasingly from duplication as minor changes are made, or as a news story grows hour by hour.

Also, anyone who has looked at such searches will have noticed that some texts are really centrally concerned with the issue being studied, for example the brewing of beer, or the characters in Middlemarch, but others merely make a passing mention, such as “Peter always liked his beer” in an obituary or “most informants reportedly had never read Middlemarch” in a survey of hobbies and interests.

In this presentation, using WordSmith Tools 8.0, I shall attempt to quantify both the degree of content duplication in three sets of text, and the degree of central relevance to a theme.


After teaching English for many years in Brazil and Mexico, Mike Scott moved to Liverpool University in 1990, initially teaching Applied Linguistics generally but then specialising in Corpus Linguistics. In 2009 he moved to Aston University. He learned computer programming in the early 1980s and began to develop corpus software: MicroConcord (1993, with Tim Johns) and WordSmith Tools (1996), which is now in version 8. His current field is corpus linguistics, with a sub-field of corpus linguistics programming.