Skip to main content

1.9 General Data Formats

Health informatics deals with a multitude of data formats tailored to specific needs. Let's explore some common data formats, their use cases, and structures:

JSON (JavaScript Object Notation)

JSON is a versatile and human-readable data format. It's commonly used for structured data and APIs due to its simplicity and flexibility. Health data exchanges like FHIR rely on JSON for data interoperability. In python, this is very similiar to the concept of a dictionary click here to learn more, and differ in that they are in single quotes, but still abide by the key:value pair concept.

Structure:

{
"patient": {
"name": "Alice",
"age": 32,
"conditions": ["Diabetes", "Hypertension"]
}
}

Use Case: Sharing structured patient data between healthcare systems.

TOOL HINT!! A tool that I use when investigating the structure of perhaps a complex JSON file is JSON Lint, a web-based tool that checks to make sure that the json does not contain any character or structural errors (this process is called linting), and provides you with a easy navigation view of potentially nested key:value pairs. In addition, IDEs like Visual Studio Code also have plugins that can perform the same funciton. Please use caution if using jsonlint.com with real-world data.

CSV (Comma-Separated Values)

CSV is a plain-text tabular format, suitable for storing large datasets in a lightweight manner. It's widely supported by various applications and can be used for quick data analysis.

Structure:

Name, Age, Gender
Alice, 28, Female
Bob, 35, Male

XML (eXtensible Markup Language)

XML is a markup language that encodes documents in a format both human-readable and machine-readable. It's suitable for structured data and is commonly used in legacy healthcare systems.

Structure:

<patient>
<name>Alice</name>
<age>32</age>
<conditions>
<condition>Diabetes</condition>
<condition>Hypertension</condition>
</conditions>
</patient>

Use Case: Integrating data from diverse sources with different structures.

Parquet

Parquet is a columnar storage file format that's highly efficient for analytics. It's optimized for performance and compression, making it suitable for storing large-scale health datasets.

Use Case: Storing and analyzing large-scale clinical datasets efficiently.

For more information on Parquet, please visit their official documentation, or visit this page for pythonic examples of reading and writing the apache parquet format

Feather

Feather is a lightweight columnar file format designed for speed. It's ideal for quickly reading and writing data, making it suitable for data preprocessing and sharing.

Use Case: Rapid data exchange between different analytical tools.

Installation:

pip install feather-format

For more information on Feather, please visit their official documentation


Each of these data formats caters to specific needs in health informatics, from structured data exchange to efficient storage and analysis. Depending on your requirements, you can choose the format that best suits your data manipulation and processing tasks.