-
The CDC collected survey data from adults in Alabama to better understand health issues such as heart disease, smoking, and access to care. The data was gathered through structured phone interviews as part of the Behavioral Risk Factor Surveillance System (BRFSS).
-
The data was stored as a standard CSV file on a secure hard drive. This kept it protected, but it also limited access for others who may have needed to verify or work with it.
-
In June 2018, a newly hired analyst noticed that the number of records in the database didn’t match the number of rows in the original CSV file. This discovery flagged a problem with the data’s completeness.
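A check like the one the analyst performed can be sketched in a few lines. This is only an illustration, not the analyst's actual method: it assumes the database can be reached through Python's built-in `sqlite3` module as a stand-in for the real SQL system, and the table name passed in (for example `responses`) is hypothetical.

```python
import csv
import sqlite3


def csv_row_count(csv_path, encoding="utf-8"):
    """Count data rows in a CSV file, excluding the header row."""
    with open(csv_path, newline="", encoding=encoding) as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row, if there is one
        return sum(1 for _ in reader)


def table_row_count(conn, table):
    """Count rows in a database table.

    The table name is interpolated into the query, so pass only
    trusted, known table names (hypothetical here).
    """
    return conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]


def count_gap(csv_path, conn, table):
    """Return how many more rows the CSV has than the database table."""
    return csv_row_count(csv_path) - table_row_count(conn, table)
```

A result of zero does not prove the two copies are identical, but any nonzero gap is a clear sign that records were lost or duplicated.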
-
With the data passing through multiple people and systems, the potential for small errors added up. These inconsistencies raised concerns about whether the data could be reliably used for analysis or decision-making.
-
When the organization expanded, the data and codebook were moved to the new location on a flash drive. Although the files arrived successfully, the transfer included no verification step to confirm that nothing was missed or corrupted along the way.
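One common form of verification for a transfer like this is to compare a checksum of each file before and after the copy. The sketch below uses SHA-256 from Python's standard library; the function names and any file paths passed to them are illustrative assumptions, not part of the original process.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def transfer_is_intact(source_path, copied_path):
    """True if the copied file is byte-for-byte identical to the source."""
    return sha256_of_file(source_path) == sha256_of_file(copied_path)
```

Recording the digest of the CSV file and the codebook before the move, then recomputing it at the new location, would have shown whether either file changed in transit.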
-
As the data was imported into the new SQL system, several rows failed to upload due to character encoding problems. The resulting data loss wasn’t identified immediately.
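Rows at risk of this kind of failure can often be found before an import by trying to decode each line of the file. The sketch below assumes the target system expects UTF-8, which the source does not state, and it checks physical lines, so a quoted CSV field that contains a line break would be reported by line rather than by record.

```python
def find_undecodable_lines(path, encoding="utf-8"):
    """Return (line_number, reason) for each line that fails to decode.

    The expected encoding (UTF-8 by default) is an assumption.
    """
    problems = []
    with open(path, "rb") as f:
        # Iterating a binary file yields raw bytes, one physical line at a time.
        for line_number, raw in enumerate(f, start=1):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError as exc:
                problems.append((line_number, str(exc)))
    return problems
```

Running a check like this before the upload gives a list of specific lines to inspect or re-encode, instead of discovering the loss afterward.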
-
Because this data relates to real people, it must be handled with care. Ongoing work involves not just cleaning and organizing the data, but also making sure it complies with privacy regulations and is used responsibly.
-
The data currently exists in two forms: the original, complete CSV file and the incomplete SQL version. The two versions will need to be compared and cleaned before use to ensure accuracy.
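The comparison could start by listing which records are present in the CSV file but absent from the database. The sketch below is one possible approach under stated assumptions: it uses `sqlite3` as a stand-in for the actual SQL system, and it assumes each record carries a unique identifier column (the names `record_id` and `responses` used with it are hypothetical).

```python
import csv
import sqlite3


def missing_from_database(csv_path, conn, table, key_column, encoding="utf-8"):
    """Return the sorted keys found in the CSV but not in the database table.

    Assumes key_column uniquely identifies a record in both sources.
    Table and column names are interpolated, so pass only trusted names.
    """
    with open(csv_path, newline="", encoding=encoding) as f:
        csv_keys = {row[key_column] for row in csv.DictReader(f)}

    query = f'SELECT "{key_column}" FROM "{table}"'
    # Compare keys as strings, since CSV values are always read as text.
    db_keys = {str(row[0]) for row in conn.execute(query)}

    return sorted(csv_keys - db_keys)
```

The returned keys identify the rows to re-import once their encoding problems are resolved. Keys are sorted as text, so `"10"` sorts before `"2"`.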