To investigate vulnerability to weather events in the recent and distant past, we analyzed modern and historical newspapers in Quebec, Canada. We searched both French- and English-language newspapers over two periods, 1880-1899 and 1995-2014, for keywords relating to potentially disruptive weather events. The goal was threefold: to create a corpus of millions of words from which machine learning could extract examples of the vulnerability of people and property over long periods, to build timelines and databases of disruptive events for the two time periods, and to compare and contrast the impacts in the two periods.
The extant literature strongly suggests that newspapers are an easy source to access and use. We found instead that newspapers were not simple to use for our purposes, which involved quantitative analyses over long periods and therefore called for machine learning. The preprocessing steps needed to discover, obtain and extract relevant articles can consume much more time and resources than one might expect, compared with the machine learning and other analysis steps.
We discuss the positive and negative aspects of using newspapers as a data source. Issues and roadblocks we encountered included 1) adapting to the various proprietary natures of the sources of digitized archival newspapers, and in some cases (ProQuest and TDM Studio) the proprietary nature of the search platform and analysis environment as well; 2) platform differences in the format of the digitized articles obtained (XML, RTF and PDF); 3) the quality of the optical character recognition (OCR) as well as of the original scanned image; 4) filtering (i.e., selecting articles based on keywords); and 5) extracting individual articles from the OCR files. We draw on the literature to discuss how these issues influenced our analysis and imposed limitations on our results.
Relying on a platform’s built-in filtering, which is based on existing optical character recognition (OCR), yields results that reflect the quality of the OCR rather than the actual contents of the newspapers. This makes it difficult to gather relevant and comprehensive sets of articles. The digitization of newspapers has often taken place over several years, and its quality varies with the technology and precision of the tools used at the time. These factors often affect keyword searches and filtering, and ultimately the resulting datasets.
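As an illustration of how OCR quality can shape keyword filtering, the following minimal Python sketch compares exact keyword matching with a simple fuzzy alternative. It is not the method used in this study: the keyword list, the sample sentences, the function names and the similarity threshold are all hypothetical and chosen only for demonstration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical keywords, one English and one French (assumption for illustration).
KEYWORDS = ["flood", "inondation"]


def exact_match(text, keywords=KEYWORDS):
    """Return True if any keyword appears verbatim in the text."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)


def fuzzy_match(text, keywords=KEYWORDS, threshold=0.75):
    """Return True if any token is similar enough to any keyword.

    The threshold is an arbitrary illustrative value; a lower value
    recovers more OCR-damaged words but admits more false positives.
    """
    tokens = re.findall(r"\w+", text.lower())
    for token in tokens:
        for keyword in keywords:
            if SequenceMatcher(None, token, keyword).ratio() >= threshold:
                return True
    return False


# Invented sample sentences: the second simulates an OCR error ("o" read as "0").
clean = "The flood damaged several houses."
noisy = "The fl0od damaged several houses."

print(exact_match(clean), exact_match(noisy))
print(fuzzy_match(clean), fuzzy_match(noisy))
```

In this sketch, exact matching misses the sentence with the simulated OCR error, while the fuzzy comparison still retrieves it. Any such relaxation trades missed articles for false positives, so a threshold would need to be tuned against manually checked samples.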
Working with digitized newspapers also raises challenges around licensing and the restrictions put in place to prevent data mining. These restrictions limit the user’s ability to export data to a workable space outside of proprietary platforms. For modern newspapers, copyright and licensing issues make the material challenging to access and use. For the data within our scope, we were fortunate that the university already had a Perpetual License Agreement in place with the proprietary platform and the newspaper license holder; other libraries with fewer resources, however, may not have the same recourse to gain access to newspaper data.
We report on the case of floods, which are by far the most written-about type of disruptive weather event found in the corpus for southern Québec to date. Finally, we examine the vocabulary used to describe flooding events, the influence of this vocabulary across time periods and between the French- and English-language communities, and the impact of these differences on discerning disruptive events from newspaper accounts.

