Introduction
Throughout my career, I’ve seen many situations where people use “integrity” and “quality” interchangeably. That’s fine for non-technical stakeholders, but it makes data engineers cringe. Here, I dig deeper into these two concepts, explain each one, and highlight their differences.
Data integrity and data quality are related but distinct concepts in data engineering. Data integrity refers to the accuracy and consistency of data, and involves ensuring that data has not been corrupted or altered during the data processing pipeline. This can include checks that data has not been tampered with or modified, as well as checks for data consistency. So, as you see, the focus is on the pipeline and whether it corrupts the data or not.
Data quality, on the other hand, involves evaluating the quality and reliability of data, and identifying and correcting errors or inconsistencies in it. This can include techniques such as data profiling, data cleansing, and automated data quality checks. You can argue that data integrity is also part of data quality, which is true in a literal sense. However, these two terms are defined separately and treated differently in the industry.
Let’s dig deeper into each one.
Data integrity
As highlighted before, data integrity involves ensuring that data has not been corrupted or altered during the data processing pipeline. So why does it matter, and what would happen if integrity were violated? A few examples make the consequences very clear. Consider a healthcare organization that stores electronic medical records (EMRs) for its patients. If the data in these EMRs is not properly protected and is accidentally or intentionally modified, it could have serious consequences for patient care: if a patient’s medical history is altered, it could lead to incorrect diagnoses and treatments, with serious health consequences for the patient.
Another way data integrity can be broken is when data is lost or deleted due to a technical issue or a natural disaster. Consider a company that stores customer data in a database. If the database is lost to a hardware failure or a natural disaster, the company loses valuable customer data, with serious consequences for the business.
So, I hope these examples clarify why data integrity is important and give you some ideas about how to protect it. Below, I outline a few techniques that are used to ensure integrity:
Data hashing: Data hashing is a technique that involves generating a unique “fingerprint” for each record in a dataset. Any change to the data produces a different fingerprint, so recomputing and comparing fingerprints reveals modifications (see the first sketch after this list).
Checksum algorithms: Checksum algorithms are mathematical functions that generate a compact value for a dataset. Recomputing the checksum and comparing it to the stored value reveals whether the data has changed; checksums are cheaper than cryptographic hashes but only guard against accidental corruption, not deliberate tampering.
Data replication: Data replication involves creating multiple copies of a dataset and storing them in different locations. Changes or corruption in one copy can be detected by comparing it against the others, and an intact copy survives the loss of a single location.
Data backup: Regular backups create a point-in-time copy of the data that can be used to restore it in the event of loss or corruption.
Data encryption: Data encryption involves encoding data with a secret key, making it difficult for unauthorized parties to read or modify the data. Authenticated encryption schemes go further and detect any modification of the ciphertext (see the second sketch after this list).
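To make the first two techniques concrete, here is a minimal sketch using only Python’s standard library. The record content is a hypothetical example, not a real pipeline payload:

```python
import hashlib
import zlib

# Hypothetical serialized record; in practice this would be a row, file,
# or message flowing through the pipeline.
record = b'{"patient_id": 123, "diagnosis": "J45.909"}'

# Data hashing: a SHA-256 fingerprint. Any change to the bytes yields a
# completely different digest.
fingerprint = hashlib.sha256(record).hexdigest()

# Checksum: a CRC-32 value over the same bytes. Much cheaper to compute,
# but only suitable for catching accidental corruption, not tampering.
checksum = zlib.crc32(record)

def verify(data: bytes, expected_fingerprint: str, expected_checksum: int) -> bool:
    """Recompute both values downstream and compare against the stored ones."""
    return (hashlib.sha256(data).hexdigest() == expected_fingerprint
            and zlib.crc32(data) == expected_checksum)

assert verify(record, fingerprint, checksum)             # intact data passes
assert not verify(record + b" ", fingerprint, checksum)  # any change is caught
```

The same compare-the-digest pattern also works for replication: hashing each replica and comparing the digests is a cheap way to confirm the copies have not diverged.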
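Encryption deserves its own sketch. The example below uses Fernet from the third-party cryptography package (an assumption: the package must be installed separately). Fernet is an authenticated scheme, so a tampered ciphertext fails to decrypt instead of silently yielding altered data:

```python
# Requires the third-party "cryptography" package: pip install cryptography
from cryptography.fernet import Fernet, InvalidToken

key = Fernet.generate_key()  # the secret key; store it in a secrets manager
fernet = Fernet(key)

# Hypothetical payload to protect at rest or in transit.
token = fernet.encrypt(b"patient_id=123, diagnosis=J45.909")

print(fernet.decrypt(token))  # b'patient_id=123, diagnosis=J45.909'

# Flip one byte of the ciphertext: decryption now fails loudly, so
# tampering cannot slip through unnoticed.
tampered = token[:-1] + bytes([token[-1] ^ 1])
try:
    fernet.decrypt(tampered)
except InvalidToken:
    print("tampering detected")
```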
Data quality
As mentioned in the introduction, data quality involves evaluating the quality and reliability of data, and identifying and correcting errors or inconsistencies in it. There are many situations in which quality becomes extremely important. What if we receive data and see that some important fields are missing, or even worse, that some of the values are incorrect? For example, in the context of healthcare medical claims, some claims may carry ICD9 codes instead of ICD10, or a source that used to send ICD9 switches to ICD10 without proper notification. That can definitely break a lot of downstream processes. What are your safeguards to prevent it?
Here are a few important checks that should be considered for data quality (a sketch combining them follows the list):
Accuracy: Ensuring the accuracy of data is crucial for data to be trusted. Data engineers should check that values are correct and fall within expected values or ranges.
Completeness: Checking for missing values or incomplete records is important for ensuring that data is reliable. Data engineers should verify that all required fields are present and populated.
Consistency: Data engineers should check that data is consistent with itself and with other sources of data; contradictions across records or systems usually point to an upstream problem.
Timeliness: Data that is out of date or stale may not be reliable. Data engineers should check that data is current and relevant.
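To tie these four checks together, here is a minimal sketch using pandas. The column names, the freshness threshold, and the simplified ICD-10 pattern are illustrative assumptions, not a production rule set; in practice you would drive such checks from a schema or a data-quality framework:

```python
import pandas as pd

now = pd.Timestamp.now(tz="UTC")

# Hypothetical claims extract; column names are illustrative.
claims = pd.DataFrame({
    "claim_id":  [1, 2, 3],
    "diagnosis": ["J45.909", "493.90", None],  # "493.90" is an ICD-9 code
    "amount":    [120.0, -5.0, 310.0],
    "loaded_at": [now, now, now - pd.Timedelta(days=400)],
})

issues = []

# Completeness: all required fields must be populated.
if claims["diagnosis"].isna().any():
    issues.append("missing diagnosis codes")

# Accuracy: values must fall within expected ranges.
if (claims["amount"] < 0).any():
    issues.append("negative claim amounts")

# Consistency: diagnosis codes should match a (simplified) ICD-10 shape,
# which also flags the ICD-9 codes described above.
icd10_pattern = r"^[A-Z]\d{2}(\.\w{1,4})?$"
if (~claims["diagnosis"].dropna().str.match(icd10_pattern)).any():
    issues.append("non-ICD-10 diagnosis codes")

# Timeliness: records older than a freshness threshold are flagged as stale.
if (claims["loaded_at"] < now - pd.Timedelta(days=365)).any():
    issues.append("stale records")

print(issues or "all quality checks passed")
```

Each failed check here simply appends to a list, but the same conditions could just as well raise alerts or quarantine the offending records before they reach downstream consumers.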
Overall, it is important to understand the difference between data integrity and data quality in order to effectively ensure the accuracy and reliability of data in data engineering projects.