Bird's Eye View of Research Landscape

This page is a companion to an article published in STAT News, where I describe the value of taking a “bird’s eye perspective” on the clinical trial enterprise.

In December 2018, I downloaded data on all of the clinical trial registration records from 10 large pharmaceutical companies and AERO graphed them all together—by trial start date on the x-axis, and disease/condition of interest on the y-axis. The resulting figure, which includes more than 13,000 trials, is below.

Each bubble in the figure corresponds to a trial. The bubble color corresponds to the company; its size corresponds to the number of human subjects enrolled. Bubble shape indicates the trial’s current status, e.g., circles are completed trials; triangles are still active.

You can interact with the figure by hovering over any bubble to see more information about the study. You can also click on any bubble to open the trial’s registration page on ClinicalTrials.gov.

For more on my interpretation of the figure, please do check out the piece at STAT. The technical details about how this figure is produced are described at the bottom of the page (it’s a long way down there!).

Sponsor Trial Map

Notes and Updates

July 23, 2019: You can now download the data set for this figure from the Harvard Dataverse.

July 18, 2019: Like most of the visualizations on this site, this figure is produced using Python and the Bokeh visualization library. I have scripts that download all of the trial data (in XML format) for these 10 sponsors, extract and combine the data from the XML files into one spreadsheet, and then graph the result.
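As a rough illustration of the "extract and combine" step (a minimal sketch, not the actual scripts behind this figure), the snippet below parses one registration record and writes it out as a spreadsheet row. The sample record is made up, the element names are an assumption based on the legacy ClinicalTrials.gov XML layout, and the function names `extract_record` and `records_to_csv` are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Made-up, abbreviated record. The element names are assumed to follow the
# legacy ClinicalTrials.gov XML layout; real records contain many more fields.
SAMPLE_XML = """<clinical_study>
  <id_info><nct_id>NCT00000000</nct_id></id_info>
  <sponsors><lead_sponsor><agency>Example Pharma</agency></lead_sponsor></sponsors>
  <overall_status>Completed</overall_status>
  <start_date>March 2012</start_date>
  <enrollment>250</enrollment>
  <condition>Alzheimer's disease and dementia</condition>
  <condition_browse>
    <mesh_term>Alzheimer Disease</mesh_term>
    <mesh_term>Dementia</mesh_term>
  </condition_browse>
</clinical_study>"""

FIELDS = ["nct_id", "sponsor", "status", "start_date",
          "enrollment", "conditions", "mesh_terms"]


def extract_record(xml_text):
    """Pull the fields needed for the figure out of one XML record."""
    root = ET.fromstring(xml_text)
    return {
        "nct_id": root.findtext("id_info/nct_id", default=""),
        "sponsor": root.findtext("sponsors/lead_sponsor/agency", default=""),
        "status": root.findtext("overall_status", default=""),
        "start_date": root.findtext("start_date", default=""),
        "enrollment": root.findtext("enrollment", default=""),
        # A record can list several conditions and MeSH terms; join them
        # so that each trial still occupies a single spreadsheet row.
        "conditions": "; ".join(
            e.text or "" for e in root.findall("condition")),
        "mesh_terms": "; ".join(
            e.text or "" for e in root.findall("condition_browse/mesh_term")),
    }


def records_to_csv(xml_documents):
    """Combine many XML records into one CSV-formatted string."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    for doc in xml_documents:
        writer.writerow(extract_record(doc))
    return buffer.getvalue()


if __name__ == "__main__":
    print(records_to_csv([SAMPLE_XML]))
```

In a real pipeline the list passed to `records_to_csv` would come from reading the downloaded XML files from disk, and the resulting table would then be handed to the plotting code.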

But one of the major challenges in analyzing or visualizing ClinicalTrials.gov data is that there is no standardized vocabulary or ontology for describing the condition/disease of interest in a clinical trial. For example, one trial might describe itself as studying “Alzheimer’s disease and dementia” and another as studying “Alzheimer’s and related dementias”. While a human can look at those two descriptions and immediately judge them to be the same, a computer needs some help.

Fortunately, many (although not all) registration records are tagged with MeSH terms, and MeSH is a structured vocabulary. Therefore, to transform the unstructured condition field data into structured data, I wrote an algorithm that uses the MeSH terms and fuzzy text matching with some of the other fields in the registration record to make a judgment about the condition of interest. Although this algorithm is not perfect and there is certainly some noise in the figure, it is nevertheless accurate enough to provide a meaningful picture of more than 13,000 trials, and it does so without any manual coding of the data.
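To give a flavor of how MeSH terms and fuzzy matching can work together, here is a minimal sketch. It is not the algorithm used for the figure: the category list, the 0.8 threshold, and the function names are all hypothetical, and the fuzzy comparison uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for whatever matching technique one prefers.

```python
from difflib import SequenceMatcher

# Hypothetical target categories (the y-axis labels), written as MeSH-style
# headings. This short list is an assumption for illustration only.
CATEGORIES = ["Alzheimer Disease", "Diabetes Mellitus", "Breast Neoplasms"]


def normalize(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    cleaned = "".join(c if c.isalnum() else " " for c in text.lower())
    return " ".join(cleaned.split())


def similarity(a, b):
    """Similarity score between 0 and 1 for two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def classify_condition(mesh_terms, free_text_fields, threshold=0.8):
    """Return the best-matching category, or None if nothing is close.

    Step 1: trust an exact MeSH tag when the record has one.
    Step 2: otherwise, fuzzy-match the free-text fields (for example the
            condition field or the title) against each category.
    """
    by_normalized_name = {normalize(c): c for c in CATEGORIES}
    for term in mesh_terms:
        hit = by_normalized_name.get(normalize(term))
        if hit is not None:
            return hit

    best_category, best_score = None, 0.0
    for text in free_text_fields:
        for category in CATEGORIES:
            score = similarity(text, category)
            if score > best_score:
                best_category, best_score = category, score
    return best_category if best_score >= threshold else None


if __name__ == "__main__":
    # A record tagged with a MeSH term is resolved in step 1.
    print(classify_condition(["Alzheimer Disease"], []))
    # An untagged record falls back to fuzzy matching in step 2.
    print(classify_condition([], ["Alzheimer’s disease"]))
    # The two free-text descriptions from the example above can also be
    # compared with each other directly.
    print(similarity("Alzheimer’s disease and dementia",
                     "Alzheimer’s and related dementias"))
```

The threshold controls the trade-off: a lower value classifies more untagged trials but adds noise, while a higher value leaves more trials unclassified.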
