Data mining for asset management – survey and water sector examples

Data mining (the search for statistical connections in databases) is one step in ‘Knowledge Discovery in Databases’ (KDD). Both have already been embraced by the marketing, medical care, ICT and financial sectors, but their implementation in the water sector has so far been limited, even though they potentially offer the sector new ways to improve its asset management. Water companies therefore want a better understanding of the added value that data-driven analytical methods could offer them, and seek a knowledge base to guide their decision on whether or not to implement KDD in their asset management.

With its extensive survey of asset management knowledge issues, a literature study of data mining, and the initial practical experiences from two TKI projects, this project represents the first, cautious step towards data-driven operational management for the water companies.

Collecting information on demand, supply and initial practical experiences with data mining

To begin with, a literature study was conducted into the main characteristics, possibilities and limitations of data mining. In addition, the ‘knowledge needs’ were identified in consultation with asset managers at water companies. During a workshop at KWR, these knowledge issues were further prioritised in terms of their urgency and importance. An inventory was thus made of both the demand for and the supply of data mining. This inventory was used to gain insight into the most promising application areas for data mining in water infrastructure asset management. Moreover, the results of two TKI projects were used in this work as models for the implementation of data mining/KDD for asset management issues.

Data mining requires streamlined data management

The key lessons from already completed data mining projects are: (1) the importance of feature engineering (enriching an existing dataset with parameters derived, for example, from model simulations or calculations); (2) the importance of collaboration with professional experts in the field of water infrastructure and operational processes; and (3) the limited amount of available data.
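As an illustration of the first lesson, the sketch below enriches a single mains record with derived attributes. It is a minimal example, not the method used in the TKI projects: the function `derive_features`, the record layout, the units and all values are assumptions made for illustration. Only the attribute names (freq, nstor, length, dP, druk_mean) are borrowed from the figure caption further on.

```python
from statistics import mean


def derive_features(main, pressures, years_observed):
    """Return a copy of a mains record enriched with derived attributes.

    Hypothetical helper: the formulas are plain readings of the attribute
    descriptions (failures per length per year, daily pressure range,
    mean pressure), not a documented procedure.
    """
    enriched = dict(main)
    # Failure frequency: number of failures per unit length per year.
    enriched["freq"] = main["nstor"] / (main["length"] * years_observed)
    # Difference between maximum and minimum pressure in the readings.
    enriched["dP"] = max(pressures) - min(pressures)
    # Mean pressure over the same readings.
    enriched["druk_mean"] = mean(pressures)
    return enriched


# Invented example record (length assumed to be in km).
main = {"id": "M-001", "length": 2.0, "nstor": 3}
# Invented pressure readings for one day (unit assumed to be kPa).
pressures = [310.0, 295.0, 330.0, 305.0]

features = derive_features(main, pressures, years_observed=5)
print(features)
```

In practice the derived parameters could equally come from hydraulic model simulations, and the choice of which ones are meaningful is where the collaboration with domain experts named in the second lesson comes in.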

One cannot yet speak of big data when it comes to the water companies. Still, because of the steady increase in the numbers of sensors in the distribution networks, smart meters, growing databases with measurement data, and failure registrations, more and more possibilities will open up to employ data mining to answer questions of relevance to the drinking water sector.

Water companies can use the collected knowledge when deciding whether or not to implement data mining and data-driven analytical techniques to improve their operational asset management. Given the (qualitative and quantitative) limitations of the currently available data, caution is called for with regard to data mining ambitions. However, data cleansing and data quality controls within the companies could help to overcome these limitations.

Spearman correlation matrix showing the correlation coefficients between a number of data attributes of cement-based mains (mat_CH) and the failures of these mains: failure frequency (per length and per year, freq), failure number (nstor), mains length (length), installation year (broken down by period 1800-1960, 1960-1980, 1980-2015), mains diameter (0 – 80, 80 – 150 and 150 – 600 cm), mean flow volume* (flow_mean), maximum pressure*, difference between maximum and minimum pressure in a day (dP), mean pressure (druk_mean), seriousness of the detected anomalies for volume flow or pressure ({flow/druk}_surprisescore), duration of anomaly in volume flow or pressure ({flow/druk}_Duration), number of anomalies detected for volume flow or pressure ({flow/druk}_nevents). Dark red indicates a strong positive correlation; dark blue a strong negative correlation. *The numbers for volume flow and pressure were calculated from operational data for the day of the failure and the two preceding days.

Spearman correlations for all property combinations of cement-containing water mains in the supply area managed by Brabant Water. Deep red indicates a strong positive correlation, deep blue a strong negative correlation. Failure rate (freq) correlates clearly with the severity of detected anomalies in the measured flow rate (flow_SupriseScore).
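
For readers unfamiliar with the technique, the sketch below shows how a Spearman correlation matrix of this kind can be computed: each attribute is converted to ranks (tied values share their mean rank) and the Pearson correlation of the ranks is taken. It uses only the Python standard library. The attribute names follow the caption above, but the values are invented and the helper functions are illustrative assumptions, not the code used in the project.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def spearman_matrix(columns):
    """Correlate every attribute with every other; returns a nested dict."""
    names = list(columns)
    return {a: {b: spearman(columns[a], columns[b]) for b in names} for a in names}


# Invented toy values for five mains; not data from the study.
mains = {
    "freq": [0.1, 0.4, 0.2, 0.8, 0.5],
    "length": [120, 300, 150, 900, 400],
    "flow_surprisescore": [1.0, 3.5, 2.0, 9.0, 4.0],
    "druk_mean": [5.0, 4.0, 4.5, 2.0, 3.0],
}

matrix = spearman_matrix(mains)
for name, row in matrix.items():
    print(name, {other: round(rho, 2) for other, rho in row.items()})
```

Because the coefficient depends only on ranks, it captures any monotonic relationship rather than only linear ones, which suits attributes with skewed distributions such as anomaly scores. A value near +1 or -1 would correspond to the dark red or dark blue cells in the figure.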