Chapter 2 - Data Processing and Cleaning | Data Science for Health Informatists

📄️ 2.1 Pandas for Processing and Cleaning Small to Medium Sized Data

At the heart of many health informatics projects lies Pandas, a robust library for data analysis. It supports data cleaning and preprocessing tasks such as handling missing values, outliers, and inconsistencies, which improves the overall reliability of the data. Its built-in support for time-stamped data, such as medical records and vitals, makes Pandas well suited to time series analysis that uncovers trends and patterns in patient history. Pandas is also effective at merging datasets, for example combining lab results with physician notes or joining tabular metadata from imaging studies. This flexibility makes it a powerful tool for health informatics.
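As a rough illustration of the cleaning, time series, and merging tasks described above, here is a minimal sketch. The column names (`patient_id`, `taken_at`, `heart_rate`, `glucose_mg_dl`), the sample values, and the 20 to 250 bpm plausibility range are all assumptions made up for this example, not taken from a real dataset or clinical guideline.

```python
import numpy as np
import pandas as pd

# Hypothetical vitals readings (all names and values are illustrative).
vitals = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "taken_at": pd.to_datetime([
        "2024-01-01 08:00", "2024-01-01 12:00", "2024-01-02 08:00",
        "2024-01-01 09:00", "2024-01-02 09:00",
    ]),
    "heart_rate": [72.0, np.nan, 300.0, 88.0, 90.0],
})

# Hypothetical lab results, one row per patient.
labs = pd.DataFrame({
    "patient_id": [1, 2],
    "glucose_mg_dl": [95, 140],
})

# Outliers: treat readings outside an assumed plausible range as missing.
cleaned = vitals.copy()
cleaned.loc[~cleaned["heart_rate"].between(20, 250), "heart_rate"] = np.nan

# Missing values: fill with each patient's own median reading.
cleaned["heart_rate"] = cleaned.groupby("patient_id")["heart_rate"].transform(
    lambda s: s.fillna(s.median())
)

# Time series: average heart rate per patient per calendar day.
daily = (
    cleaned.groupby(["patient_id", pd.Grouper(key="taken_at", freq="D")])["heart_rate"]
    .mean()
    .reset_index()
)

# Merging: attach lab results to the daily vitals.
merged = daily.merge(labs, on="patient_id", how="left")
print(merged)
```

Median imputation is only one of several possible strategies; whether it is appropriate depends on the clinical question and on why the values are missing.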

📄️ 2.2 Polars for Large Data

This section covers Polars, which suits large health informatics datasets ranging from tens of millions to a few hundred million records. It parallelizes work across the cores of a single machine and uses a columnar memory format, which makes operations efficient on datasets that are too large for Pandas to handle comfortably. Polars takes advantage of modern CPUs and memory architectures to process and analyze data within a single machine's memory.

📄️ 2.3 Distributed Computation with Dask, Ray, and Modin for Big Data

Intro

📄️ 2.4 Data Confidentiality and Synthetic Data Generation

In the realm of health informatics, maintaining patient data security is of utmost importance. Ensuring data confidentiality while still extracting valuable insights is a complex task. Python offers several tools and techniques to address this challenge:

📄️ 2.5 Data Interoperability in Health

📄️ Resources for Further Exploration

📄️ End of Chapter Exercises