CLN - Focus on Data Science

Developing robust workflows for data science and analytics

Benjamin McFadden, BSc, MRes

In the era of big data and digital transformation, clinical laboratories increasingly use data science and analytics to enhance decision-making and improve operational efficiency. After all, laboratories generate vast amounts of data, including test results, patient information, quality-control metrics, and instrument logs.

But data analysis is only as worthwhile as the time and effort that labs put into it. Without well-structured and robust workflows, lab professionals may find the analytical process overwhelming and inefficient, and their results could be inconsistent and prone to error. Additionally, generating high-quality, reproducible data can be even more challenging when advanced technologies such as machine learning are integrated into the process.

For all of these reasons, designing workflows that streamline the end-to-end analytical process is essential. Doing so ensures that laboratories can maximize the value of their data assets, obtain meaningful insights, and maintain confidence in the results.

This article covers how to build robust analytical workflows on well-structured foundations that support the maintenance, usability, and reproducibility of laboratory data analysis. The process comprises several key components: data acquisition, cleaning, transformation, analysis, validation, and reporting. In workflows that involve machine learning techniques, labs must take additional steps, such as feature engineering, model development, and validation.

Data acquisition and extraction

The first step in any data science or analytics workflow is to acquire data. Whenever possible, collect data directly from source systems in standardized formats. Regardless of where you extract data from, it is essential to create a “data dictionary,” a resource that provides detailed information about the data elements within a system, including how each element was collected, where it came from, how it relates to other datasets, how it should be used, and any caveats to note when applying it.
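As an illustration, a data dictionary can be kept in a simple, machine-readable form alongside the analytical code. The sketch below shows one hypothetical entry in Python; the field names, source systems, and caveats are illustrative, not prescriptive.

```python
# A hypothetical, machine-readable data dictionary entry.
# Field names, systems, and caveats are illustrative only.
data_dictionary = {
    "serum_sodium_mmol_l": {
        "description": "Serum sodium result in mmol/L",
        "source_system": "LIS chemistry result feed",
        "collection_method": "Automated HL7 result message",
        "related_datasets": ["patient_demographics", "qc_metrics"],
        "intended_use": "Trend analysis and quality-control review",
        "caveats": "Exclude results flagged as hemolyzed",
    }
}

# Consult the dictionary before using a data element in an analysis.
entry = data_dictionary["serum_sodium_mmol_l"]
print(entry["description"], "-", entry["caveats"])
```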

Having a data dictionary will help you avoid misusing data when you perform subsequent analytical steps. Those who develop analytical workflows should ensure that access controls and data-security measures are in place to protect the data from being accessed — and potentially corrupted — by unauthorized individuals.

Make sure to document any code or applications you use to extract the data from source systems to support maintenance, continuous improvement, and the identification of any problems that may arise. It’s also critical to apply a consistent version-control method to any extracted data so that it can be audited. Lastly, automate data extraction to reduce the chance of human error, but make it a high priority to test your automated extraction processes, including performing routine audits, to verify that data is being extracted correctly.
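A minimal Python sketch of what documented, auditable extraction might look like appears below. It assumes the source system can export a CSV file; the paths, file names, and audit-log format are illustrative assumptions rather than any specific laboratory system’s interface.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

import pandas as pd


def extract_with_audit(source_path: str, output_dir: str) -> Path:
    """Snapshot an exported source file and record a hash for later audits."""
    raw_bytes = Path(source_path).read_bytes()
    checksum = hashlib.sha256(raw_bytes).hexdigest()
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")

    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    snapshot = out_dir / f"extract_{timestamp}.csv"  # assumes a CSV export
    snapshot.write_bytes(raw_bytes)

    # Append an audit record so every extract can be traced and verified.
    record = {"timestamp": timestamp, "source": source_path,
              "snapshot": str(snapshot), "sha256": checksum}
    with open(out_dir / "extraction_audit.jsonl", "a") as log:
        log.write(json.dumps(record) + "\n")

    # Basic sanity check that the snapshot parses as expected.
    pd.read_csv(snapshot, nrows=5)
    return snapshot
```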

Data cleaning and preprocessing

Given that raw data extracted from source systems is often incomplete, inconsistent, noisy, or inappropriately structured for effective analysis, data cleaning and preprocessing are necessary next steps. Cleaning involves handling missing values and correcting errors as needed, while preprocessing involves standardizing the data for analysis. Poor data quality can lead to misleading insights and incorrect conclusions, making it essential for lab professionals to invest time at this stage. Be sure to do each of the following (a brief code sketch follows the list):

  1. Implement checks to detect missing or inconsistent data.
  2. Carefully consider how to handle missing data, including deciding whether to use imputation or replace missing values.
  3. Check for duplicate records to prevent skewed analysis.
  4. Employ techniques for detecting data inconsistencies before analysis.
  5. Document data preprocessing steps and code, and implement version control.
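Below is a minimal Python sketch of the kinds of checks described above, using pandas and a small hypothetical dataset. The column names and the choice of median imputation are illustrative; the right handling of missing data depends on the analysis.

```python
import pandas as pd


def basic_quality_checks(df: pd.DataFrame) -> dict:
    """Report row counts, missing values per column, and duplicate rows."""
    return {
        "n_rows": len(df),
        "missing_per_column": df.isna().sum().to_dict(),
        "n_duplicate_rows": int(df.duplicated().sum()),
    }


# Hypothetical extract with one missing value and one duplicated row.
df = pd.DataFrame({
    "patient_id": [1, 2, 2, 3],
    "sodium": [140.0, 138.0, 138.0, None],
})
print(basic_quality_checks(df))

# One simple (not universal) handling choice: drop duplicates, impute median.
clean = df.drop_duplicates().copy()
clean["sodium"] = clean["sodium"].fillna(clean["sodium"].median())
```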

Before you move on to modeling and statistical analysis, consider whether to take additional steps based on your plans for downstream analytical processing. For example, you might decide to encode categorical variables appropriately for statistical analysis, normalize or standardize numerical values so they are on comparable scales, or use domain knowledge to create new, meaningful derived variables that enhance your analysis.
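The sketch below illustrates these preprocessing choices on a small hypothetical dataset: indicator encoding of a categorical variable, z-score standardization of numerical columns, and a simple derived variable based on domain knowledge. All column names and cutoffs are illustrative.

```python
import pandas as pd

# Hypothetical example; all column names and cutoffs are illustrative.
df = pd.DataFrame({
    "specimen_type": ["serum", "plasma", "serum"],
    "glucose": [5.1, 7.8, 6.2],
    "age_years": [34, 71, 58],
})

# Encode a categorical variable as indicator columns for statistical models.
encoded = pd.get_dummies(df, columns=["specimen_type"], drop_first=True)

# Standardize numerical values (z-scores) so they are on comparable scales.
for col in ["glucose", "age_years"]:
    encoded[col] = (encoded[col] - encoded[col].mean()) / encoded[col].std()

# A derived variable built from domain knowledge (illustrative example).
encoded["is_age_65_or_over"] = (df["age_years"] >= 65).astype(int)
print(encoded)
```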

Finally, it’s always important to perform exploratory data analysis to investigate how the data is distributed and identify any patterns that emerge from it.
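A quick exploratory pass can be as simple as the following sketch: summary statistics and quantiles give an initial view of distributions and extreme values, and histograms can follow if a plotting library is available. The data shown are hypothetical.

```python
import pandas as pd

# Hypothetical results; describe() summarizes the distribution, and quantiles
# flag extreme values. Histograms (df.hist()) need matplotlib installed.
df = pd.DataFrame({"sodium": [138, 140, 141, 152, 139, 137, 140, 143]})
print(df.describe())
print(df["sodium"].quantile([0.01, 0.5, 0.99]))
```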

Modeling and statistical analysis

Once you’ve thoroughly prepared the data, you’re ready to select the right analytical approach. This crucial stage may involve descriptive analytics, statistical modeling, machine learning, or traditional rule-based algorithms, depending on the problem you’re trying to solve. Choose models that align with your objectives, and start with simple approaches, moving to more complex ones only when required. Workflows should not be any more complex than necessary. Simpler, familiar statistical approaches also tend to be adopted more readily in clinical settings because they are easier to explain.
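To illustrate the “start simple” principle, the sketch below fits a logistic regression baseline on synthetic data using scikit-learn. It is not a recommended clinical model, just an example of establishing a simple, explainable benchmark before trying anything more complex.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for real laboratory features.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# A simple, explainable baseline to beat before anything more complex.
baseline = LogisticRegression().fit(X_train, y_train)
print("Baseline AUC:",
      roc_auc_score(y_test, baseline.predict_proba(X_test)[:, 1]))
```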

Document your methods, results, and any key decisions relating to the analysis. This includes recording “negative” results and clearly identifying which statistical methods were not effective.

If you’re applying machine learning techniques, you’ll need to consider and manage additional complexities, most of which are related to validation and quality control. For this process, be sure to do each of the following (a brief sketch follows the list):

  1. Use training, validation, and test datasets to ensure models are generalizable and not overfitting.
  2. When possible, benchmark models against gold-standard methods to assess performance relative to well-established or non-machine-learning approaches.
  3. Conduct peer reviews and audits of analytical processes using a combination of human oversight and automated software tests.
  4. Ensure that results from model training and validation can be reproduced. Without rigorous validation, results may be misleading or unreliable.
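The sketch below illustrates several of these practices on synthetic data: a fixed random seed for reproducibility, separate training, validation, and test sets, and a benchmark against a trivial non-machine-learning baseline. The model choices and split proportions are illustrative only.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

SEED = 42  # fixed seed so the splits and model training are reproducible
rng = np.random.default_rng(SEED)
X = rng.normal(size=(1000, 5))            # synthetic stand-in features
y = (X[:, 0] - X[:, 1] > 0).astype(int)   # synthetic stand-in labels

# Hold out a test set, then carve a validation set out of the remainder.
X_trval, X_test, y_trval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=SEED)
X_train, X_val, y_train, y_val = train_test_split(
    X_trval, y_trval, test_size=0.25, random_state=SEED)

model = RandomForestClassifier(random_state=SEED).fit(X_train, y_train)
benchmark = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

print("Validation accuracy:", accuracy_score(y_val, model.predict(X_val)))
print("Test accuracy:      ", accuracy_score(y_test, model.predict(X_test)))
print("Benchmark accuracy: ", accuracy_score(y_test, benchmark.predict(X_test)))
```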

Reporting and visualization

Communicating insights effectively is essential for data-driven decision-making. Using visualizations, dashboards, and automated reports helps convey key findings and summarize your analytical results. Without clear reporting, even the most accurate analysis may go underutilized.

Some general approaches to consider when reporting and visualizing results are to:

  1. Keep it simple. Complex visualizations may overwhelm stakeholders, particularly those without technical backgrounds, leading to a loss of engagement in the analytical work.
  2. Automate report generation as much as possible to maintain a consistent, standardized approach to reporting (see the sketch after this list).
  3. When possible, present results as an interactive dashboard so that users can explore the outcomes of the analysis further. This also helps improve user engagement in the end-to-end analytical process.
  4. Provide a transparent method for users to access the documentation relating to your analytical workflow.
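As a simple example of automated reporting, the sketch below writes a dated HTML summary table with pandas. The metrics, file name, and layout are hypothetical; in practice, the same pattern can feed a dashboard or a scheduled report.

```python
from datetime import date

import pandas as pd

# Hypothetical weekly summary; metrics and columns are illustrative only.
results = pd.DataFrame({
    "assay": ["sodium", "potassium", "glucose"],
    "n_results": [1250, 1198, 1402],
    "pct_flagged": [1.2, 0.8, 2.4],
})

report_path = f"weekly_summary_{date.today().isoformat()}.html"
results.to_html(report_path, index=False)
print("Report written to", report_path)
```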

Conclusion

Clinical laboratories should develop robust workflows to support data-driven decision-making, enhance efficiency, and maintain high-quality standards. By following a structured approach to data acquisition, preprocessing, analytics, validation, and reporting, lab professionals can harness the power of data science to drive operational improvements and improve patient outcomes.

Incorporating automation, documentation, version control, and best practices into workflows not only reduces errors; it also helps create an environment where reproducibility and trustworthiness are encouraged. Further, engaging in this process ensures that the products of analytics are timely, accurate, and easy to maintain.

Investing the time to develop robust workflows will pay off in the long term and allow for the effective incorporation of machine learning and other advanced analytical techniques to drive insights. Engaging in this process also fosters trust in data outputs, both within labs and across the spectrum of healthcare professionals who rely on them. As data science continues to evolve, laboratory professionals who adopt structured methodologies will be best equipped to leverage vast amounts of data and gain insights for continuous improvement and innovation.

Benjamin McFadden, BSc, MRes, is an applied data science researcher at the University of Western Australia and a senior data officer at the Western Australia Department of Health. Email: [email protected]

Supported by QuidelOrtho
