i2b2 Blog: Data Awareness & New/Missing Variables

Changes in variable collection

February 17, 2020

Knowing the history and metadata of the hospital systems is often critical to interpreting data. For instance, new data collection takes a while to propagate through our patient population. If in 2014 you had queried what percentage of patients were Hispanic, your result would have been about 2% of Wake’s overall patient population. However, that figure reflects only how many patients had been recorded so far, not the true share.

Ethnicity was not a chart value collected separately from Race until we switched to using EPIC in September of 2012. At that time, we started at zero patients marked as Hispanic. As patients came in for visits at our hospitals and clinics, the Ethnicity data point would get updated. Some patients may never return to the system, so the system will still not know that they are Hispanic, but the expectation is that the data will become more and more complete over time.


Take, for instance, the chart below. The blue shows the total number of patients with a recorded Ethnicity of Hispanic. The red line represents a simple, un-nuanced calculation of 9% of Wake’s overall patient population (based on the reported 9% of the NC population being Hispanic*). Many factors are not considered in this, such as county or state of residence, or less easily measurable variables like health care utilization and access. However, these metrics give a general sense of how long it takes for a new data collection feature to reach our population.

[Chart: 2013-2020 data for new variable 'ethnicity: Hispanic']

Here in 2020, our data reflects only half the percentage we would expect, but this is a far better situation than the one we were in during the initial rollout of the Ethnicity category.
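The comparison behind the chart reduces to a ratio of the recorded share to the expected share. The following is a minimal sketch of that arithmetic, using the 2014 figure of about 2% recorded and the 9% NC benchmark quoted above. The function name `capture_ratio` is illustrative only and is not part of any i2b2 tooling.

```python
def capture_ratio(recorded_share: float, expected_share: float) -> float:
    """Return the fraction of the expected share that has been recorded so far."""
    if expected_share <= 0:
        raise ValueError("expected_share must be positive")
    return recorded_share / expected_share


# 2014: about 2% of patients recorded as Hispanic, against the 9% benchmark.
ratio_2014 = capture_ratio(0.02, 0.09)
print(f"2014 capture ratio: {ratio_2014:.0%}")
```

A ratio near 1 would suggest the new variable has largely caught up with the benchmark; a low ratio mostly tells you how young the variable is, not how rare the group is.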

Surrogate Variables

To work around issues of absent data, we explored possible surrogate variables.
First, we ran a comparison against patients who were marked with the deprecated Race value of “Spanish Surname” at the conversion to EPIC, which roughly correlates with an ethnicity of Hispanic. Some of the patients not yet marked as Hispanic are expected (namely the last column) simply because they have not accessed the medical system since Ethnicity variable collection started.

Total Patients with a Language recorded | ‘Spanish Surname’ Race | Currently Marked Hispanic | Not Marked | Not Marked and Not Seen Since EPIC
1,157,533 | 80,553 | 33,041 | 47,512 | 43,855
100% | 7% | 2.9% | 4.1% | 3.8%
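The columns of the table above correspond to a simple classification of each patient. The sketch below shows one way that logic could be expressed. Everything in it is an assumption made for illustration: the `Patient` fields, the `classify` function, the sample patients, and the cutoff date (the post gives only "September of 2012", so the first of the month is used as a placeholder). It also assumes that any Ethnicity value other than Hispanic, including no value, counts as "not marked".

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Placeholder cutoff: the exact go-live day is not stated in the post.
EPIC_GO_LIVE = date(2012, 9, 1)


@dataclass
class Patient:
    patient_id: str
    legacy_race: Optional[str]  # deprecated Race value carried over at conversion
    ethnicity: Optional[str]    # None until recorded at a visit
    last_visit: date


def classify(p: Patient) -> str:
    """Place a patient into one of the surrogate-comparison groups."""
    if p.legacy_race != "Spanish Surname":
        return "no surrogate"
    if p.ethnicity == "Hispanic":
        return "marked Hispanic"
    if p.last_visit < EPIC_GO_LIVE:
        return "not marked, not seen since EPIC"
    return "not marked, seen since EPIC"


# Hypothetical patients, for illustration only.
sample = [
    Patient("A", "Spanish Surname", "Hispanic", date(2019, 5, 2)),
    Patient("B", "Spanish Surname", None, date(2010, 3, 14)),
    Patient("C", "Spanish Surname", None, date(2018, 11, 30)),
    Patient("D", None, None, date(2015, 1, 9)),
]
for p in sample:
    print(p.patient_id, classify(p))
```

Patients in the "not marked, not seen since EPIC" group are the ones whose missing Ethnicity value is expected, since they have had no visit at which it could be recorded.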

*Note: 2012 Hispanic population was closer to 8.4%.

Similarly, we looked at patients who have a recorded language of Spanish but are not yet marked as Hispanic:

Total Patients with a Language recorded | Spanish Language | Currently Marked Hispanic | Not Marked | Not Marked and Not Seen Since EPIC
1,560,246 | 59,814 | 44,242 | 15,572 | 12,336
100% | 3.8% | 2.8% | 1% | 0.8%
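The percentage rows in both surrogate tables follow directly from the counts. As a quick check, the sketch below recomputes them from the figures reported above; the helper name `shares` and the dictionary labels are illustrative only.

```python
def shares(total: int, counts: dict[str, int]) -> dict[str, float]:
    """Express each count as a percentage of the total, rounded to one decimal."""
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


# Counts taken from the 'Spanish Surname' Race table.
surname = shares(1_157_533, {
    "surrogate": 80_553,
    "marked Hispanic": 33_041,
    "not marked": 47_512,
    "not marked, not seen since EPIC": 43_855,
})

# Counts taken from the Spanish Language table.
language = shares(1_560_246, {
    "surrogate": 59_814,
    "marked Hispanic": 44_242,
    "not marked": 15_572,
    "not marked, not seen since EPIC": 12_336,
})

print(surname)
print(language)
```

In both tables the "marked" and "not marked" counts sum to the surrogate total, and most of the unmarked patients have not been seen since EPIC, which is the group where a missing value is expected.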

While surrogate variables are not a replacement for explicit data, they can bridge the gap in situations where null results are due to nascency of a defined category. Keep this data consideration in mind as other data collection items are added to EPIC or other hospital systems.  Gender Identity collection began in September 2019, and we are seeing a similar slow uptake of data being recorded; the vast majority of records still have no value for this variable.  It may be a couple of years before we see enough lull in the nulls to do a full mull.

Curious to see what your data might show? 

Reach out today for more information on conducting your own i2b2 queries, or to request a data pull from the Translational Data Warehouse.

For more information, submit a CTSI Service Request
or contact Informatics Program Director Brian Ostasiewski.