Over the past 10 years, the healthcare industry has been engulfed by a sea of data. By some estimates, an average hospital’s electronic health record (EHR) system generates more than 665 terabytes of data every day.1 Data is collected from clinician–patient interactions, lab tests, radiological images, diagnoses, treatment protocols, medical costs, insurance payments, and more.

At MDGuidelines, our work is similarly consumed by large amounts of data. Our team of researchers, medical editors, data analysts, and medical coders works tirelessly to organize content and data in a way that enhances our products and makes sense for our users. We analyze more than 15 million disability claims records and 2.5 billion medical claims, and every month we transmit and receive millions of client data requests through our API platforms that help integrate MDGuidelines content into clinician workflow systems. But one of the most complex data challenges we face as a business has to do with medical coding: specifically, the way that medical codes are used to organize clinical information and help users navigate medical content.

One of the most widely recognized medical classification systems is the International Classification of Diseases (ICD), which was originally created to track health and mortality statistics in populations.2 The ICD system still serves this purpose, but it has also been adopted as the primary system for medical billing and reimbursement. Compared to other biomedical classifications, ICD is a relatively simple system in which each diagnosis (or procedure) can fall within only a single classification. Medical conditions that are difficult to categorize are grouped into ambiguous categories and captured as “not elsewhere classified” or “not otherwise specified.”

The 10th version of the ICD system (ICD-10), adopted in the United States in 2015, is far more complex than older versions, encompassing more than 140,000 codes in contrast to approximately 17,800 in the 9th version (ICD-9). Most of the new codes include clinically relevant information, such as the side of the body that is injured (e.g., meniscus tear in the left or right knee) and the stage of care (first or subsequent doctor’s visit). Other additions, such as “Burn due to water skis catching on fire” (code V91.07XA) and “Struck by duck, initial encounter” (code W61.62XA), are of questionable utility. And just how common is it to be “struck” by waterfowl anyway?
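
To make the code structure concrete, here is a minimal Python sketch that splits an ICD-10-CM-style code into its three-character category, the characters after the decimal point, and the stage of care carried by the seventh character. The function name `split_icd10cm` and the lookup table are hypothetical names chosen for illustration, and the pattern checks only the general shape of a code. It does not validate against the official code set, and it assumes the seventh-character meanings used for injury and external-cause codes (A, D, S).

```python
import re

# Seventh-character meanings assumed here (injury and external-cause codes).
SEVENTH_CHARACTER = {
    "A": "initial encounter",
    "D": "subsequent encounter",
    "S": "sequela",
}

# Three-character category, then an optional dot and up to four more characters.
# This checks shape only; it does not confirm that the code actually exists.
CODE_PATTERN = re.compile(r"^([A-Z][0-9][0-9A-Z])(?:\.([0-9A-Z]{1,4}))?$")


def split_icd10cm(code):
    """Split an ICD-10-CM-style code into category, extension, and encounter."""
    match = CODE_PATTERN.match(code.strip().upper())
    if match is None:
        raise ValueError(f"not an ICD-10-CM-style code: {code!r}")
    category = match.group(1)
    extension = match.group(2) or ""
    encounter = None
    if len(extension) == 4:  # full seven-character code
        encounter = SEVENTH_CHARACTER.get(extension[3])
    return {"category": category, "extension": extension, "encounter": encounter}


print(split_icd10cm("W61.62XA"))
```

For the duck example, the sketch reports category `W61` and an initial encounter. Shorter codes without a seventh character simply come back with no encounter value.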

All MDGuidelines content (including our guidelines, duration tables, and interactive tools) is meticulously coded with the ICD and Current Procedural Terminology (CPT) code sets. This helps users navigate our website, perform data analytics, and locate relevant content. But we know the ICD and CPT systems aren’t the best tools for indexing digitized medical content, so we’re always looking for new ways to help our users perform important tasks.

One of the most exciting classification systems that we’re working with is SNOMED-CT (Systematized Nomenclature of Medicine – Clinical Terms). With more than 300,000 concepts and over 1 million relationships, SNOMED-CT is used in the United States (and around the world) to support consistent representation of clinical information in electronic health records.
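
As a rough illustration of what "relationships" buys you, the Python sketch below models a handful of "is a" links in which one concept has more than one parent, something a single-classification system like ICD cannot express. The concept names and the `IS_A` table are invented for this example and are not taken from SNOMED-CT itself. The real terminology identifies concepts by numeric identifiers and defines many relationship types beyond "is a".

```python
# Toy "is a" relationships. Concept names are illustrative, not real SNOMED-CT content.
IS_A = {
    "meniscus tear of left knee": ["meniscus tear", "disorder of left knee"],
    "meniscus tear": ["knee injury"],
    "disorder of left knee": ["disorder of knee"],
    "knee injury": ["disorder of knee"],
    "disorder of knee": [],
}


def ancestors(concept, is_a=IS_A):
    """Return every concept reachable by following "is a" links upward."""
    found = set()
    stack = list(is_a.get(concept, []))
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(is_a.get(parent, []))
    return found


# The same concept is reachable both as an injury and as a left-knee disorder.
print(sorted(ancestors("meniscus tear of left knee")))
```

Because the toy concept sits under two parents, a search for either "knee injury" or "disorder of left knee" would find it, which is the kind of navigation a one-category-per-diagnosis scheme makes difficult.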

The problem with “big data” in health care is not the volume but the variety and complexity of medical knowledge. Classification systems that capture the relationships between medical concepts, as SNOMED-CT does, are powerful tools because they represent knowledge in a way that both humans and computers can understand. These are the critical building blocks for effective machine learning tools that will help us to sail through the sea of health care data.


  1. Sullivan, T. Healthcare is swimming in data, but what to do with it? Healthcare IT News website. October 23, 2017. Accessed June 20, 2018.
  2. Moriyama IM, Loy RM, Robb-Smith AH, et al. History of the statistical classification of diseases and causes of death. National Center for Health Statistics, 2011.