Text-mining for early detection of water-related substances

BTO Workshop on Text-Mining

Text-mining can automatically search through (large) amounts of textual information and bring the information together in a structured way. Many techniques and applications are conceivable.

During the workshop for drinking water companies and NORMAN members on 22 March 2022, techniques and applications from which the water sector can benefit were discussed. The workshop was organized in the context of the project ‘Text-mining for early detection of water-related substances’, part of the Joint Research Programme with the water utilities (BTO).

Figure 1. Words that the participants associated with the concept of text-mining.

The workshop started with a quiz on the time humans need to process text. Participants learned that about 8000 papers containing the keyword ‘drinking water’ were published in 2021 alone (source: Scopus). Including earlier publications, the total comes to more than 30,000 papers to date. On average, a human can evaluate 180 abstracts per hour for usefulness. The average reading speed is 200 words per minute, and for technical documents this drops to about 50 words per minute. This set the stage for text-mining as a valuable and necessary way to sift efficiently through textual information.
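To make the quiz figures concrete, the short calculation below estimates how long one person would need just to screen the abstracts. It is a rough sketch that uses only the numbers quoted above; the function name `screening_hours` is made up for illustration, and reading the full papers is not included.

```python
# Figures quoted in the workshop quiz
PAPERS_2021 = 8000         # papers with keyword 'drinking water' in 2021
PAPERS_TOTAL = 30000       # "more than 30,000", so this is a lower bound
ABSTRACTS_PER_HOUR = 180   # abstracts a human can evaluate per hour


def screening_hours(n_abstracts, per_hour=ABSTRACTS_PER_HOUR):
    """Hours needed to evaluate n_abstracts at the given rate."""
    return n_abstracts / per_hour


hours_2021 = screening_hours(PAPERS_2021)
hours_total = screening_hours(PAPERS_TOTAL)

print(f"2021 only: about {hours_2021:.0f} hours")
print(f"All papers: at least {hours_total:.0f} hours")
```

Under these assumptions, screening the 2021 abstracts alone takes more than a full working week.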

Nienke Meekel from KWR presented possibilities for mining Twitter messages to find news on possible new industrial activities in the Rhine area. This revealed 13 activities that can be investigated further. Web-scraping allowed easy downloading of many documents (for instance, permits) or data files for further processing and integration. Information retrieval made it possible to prioritize which of these documents to read and to extract the relevant text parts, making reading less time-consuming.
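The prioritization and extraction steps can be illustrated with a minimal sketch. It assumes a plain keyword count as the relevance score, which is not necessarily the method presented at the workshop. The document names, texts and query terms are invented examples.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(document, query_terms):
    """Relevance score: total number of query-term occurrences."""
    counts = Counter(tokenize(document))
    return sum(counts[term] for term in query_terms)


def prioritize(documents, query_terms):
    """Return names of matching documents, most relevant first."""
    terms = [t.lower() for t in query_terms]
    scored = [(score(text, terms), name) for name, text in documents.items()]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [name for s, name in ranked if s > 0]


def relevant_sentences(document, query_terms):
    """Return only the sentences that mention at least one query term."""
    terms = {t.lower() for t in query_terms}
    sentences = re.split(r"(?<=[.!?])\s+", document)
    return [s for s in sentences if terms & set(tokenize(s))]


# Invented example documents (hypothetical, for illustration only)
documents = {
    "permit_a.txt": "The permit allows discharge of cooling water. "
                    "Discharge is limited to the river.",
    "permit_b.txt": "The permit concerns a new office building. "
                    "No discharge is foreseen.",
    "notice_c.txt": "Opening hours of the town hall have changed.",
}
query = ["discharge", "river"]

ranking = prioritize(documents, query)
snippets = relevant_sentences(documents["permit_b.txt"], query)
print(ranking)
print(snippets)
```

Real applications would typically use a weighted score (for example TF-IDF) instead of raw counts, but the principle of ranking first and reading only the matching passages stays the same.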

Tessa Pronk from KWR continued with ‘Natural Language Processing’ (NLP) techniques, which use grammar rules to aid text processing. NLP was used to construct subject – verb – object triplets like ‘cumene’ ‘induced’ ‘mutations’ to gather facts about a single chemical of interest. Some work is still needed to optimize this task. She also presented a way to recognize chemicals based on their character sequence, and the option to associate groups of chemicals by their co-occurrence in texts.
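Association by co-occurrence can be sketched as counting how often two chemicals are mentioned in the same text. The example below is a simplified illustration: it recognizes chemicals from a fixed, hypothetical name list rather than from their character sequence, and the example sentences are invented.

```python
from collections import Counter
from itertools import combinations

# Hypothetical list of chemical names; a real application would need
# a proper recognizer or a curated dictionary.
CHEMICALS = {"cumene", "benzene", "toluene"}


def chemicals_in(text):
    """Return the known chemicals mentioned in a text, sorted by name."""
    words = {w.strip(".,;:()").lower() for w in text.split()}
    return sorted(words & CHEMICALS)


def cooccurrence(texts):
    """Count, per pair of chemicals, the number of texts mentioning both."""
    pairs = Counter()
    for text in texts:
        for a, b in combinations(chemicals_in(text), 2):
            pairs[(a, b)] += 1
    return pairs


# Invented example texts (for illustration only)
texts = [
    "Cumene and benzene were detected downstream.",
    "Benzene and toluene are common solvents.",
    "Levels of cumene and benzene rose after the spill.",
]

pairs = cooccurrence(texts)
for (a, b), n in pairs.most_common():
    print(f"{a} + {b}: {n} texts")
```

Pairs that occur together more often than expected by chance are candidates for belonging to the same group; testing that statistically would require more texts than this toy example contains.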

Participants could indicate the technique that could most readily be applied in their work. Figure 2 shows that these were web-scraping and information retrieval.

Figure 2. Voting results for readily useable techniques in the water sector.

The workshop ended with a hands-on exercise in which a selected group worked through a toy example of web-scraping and NLP. This group found the possibility of applying text-mining to find facts about chemicals, and the statistical association of chemicals by co-occurrence, very interesting.

An article in H2O magazine (in Dutch) describes the possibilities of text-mining for the water sector in more detail. A report of the project ‘Text-mining for early detection of water-related substances’ will be delivered in 2022.