Many academic medical centers, including over 90% of those funded by the National Institutes of Health Clinical & Translational Science Award program, offer patient cohort discovery to their researchers to facilitate clinical research, usually including electronic health record (EHR) data.1,2 A number of systems are available to facilitate this task, such as i2b2 3,4 and TriNetX.5 However, the performance of systems and algorithms for this EHR use case is not well studied. It has been shown that typical review of patients for study eligibility is a labor-intensive task, and that automated preprocessing of lists of patients may reduce human time and effort for selection of cohorts.
One challenge for evaluating this use case is the lack of test collections that include data, clinical study descriptions, and relevance judgments for retrieved patients, a problem that has hindered many types of research using EHR data, even in the modern era of ubiquitous EHR adoption.9 A major barrier has been the challenge of protecting the privacy of the patients from whom the records originate, along with institutional hesitancy to make such data widely available for informatics research, even in deidentified form.10 This is especially so for use cases involving processing of textual data within records, including those on the scale of information retrieval (IR) experiments, where corpora of thousands to millions of patient records are typically desired.
There are 2 EHR collections that have been publicly available, one from the University of Pittsburgh Medical Center (UPMC) 11 and the other the Medical Information Mart for Intensive Care-III from the Massachusetts Institute of Technology.12 Among the uses of the UPMC corpus has been a task on cohort retrieval for clinical research studies, run as a challenge evaluation within the annual Text REtrieval Conference (TREC). The TREC Medical Records Track ran in 2011 and 2012, attracting 29 and 24 academic and industry research groups, respectively.13,14 Using the University of Pittsburgh collection of 17 264 encounters comprising 93 551 documents (some of which included ICD-9 diagnosis codes, laboratory results, and other structured data), a total of 34 and 47 topics, respectively, by year were developed, and relevance judgments were performed on pooled results from participating research groups using the “Cranfield paradigm” common to IR evaluation research.15 The judgments were performed by physicians enrolled in biomedical informatics educational programs.
A common baseline method for all types of IR experimentation is “word-based” searching, where queries are submitted to the system and output is ranked by a similarity function between query and documents. Word-based methods stand in distinction to Boolean searching, sometimes called set-based searching, where sets of retrieved documents are combined using Boolean operators. In the TREC Medical Records Track, several domain-specific enhancements on top of word-based queries were found to lead to improved retrieval performance. These included vocabulary normalization specific to the clinical domain, synonym-based query expansion from medical controlled terminology systems such as the Unified Medical Language System Metathesaurus, and recognition of negation.16 Follow-on research with the test collection found continued improvement in performance from approaches such as query expansion using additional clinical and other corpora 17 as well as use of learning-to-rank methods.
One limitation of the TREC Medical Records Track was inherited from the UPMC corpus: retrieval operated at the level of the encounter (eg, hospital or emergency department visit) and not the patient. This was due to the deidentification process, which broke the links across encounters and also obscured various protected health elements, such as dates, geographic locations, and provider identifiers.
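
The contrast drawn in this section between set-based Boolean searching and word-based ranked searching, and the effect of synonym-based query expansion, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy encounter texts, the one-entry synonym table (a stand-in for a controlled terminology lookup, not the Unified Medical Language System Metathesaurus itself), the function names, and the scoring, which is a simple TF-IDF sum and not the similarity function of any TREC participant.

```python
import math
import re

# Hypothetical toy "encounters"; not drawn from any real corpus.
DOCS = {
    "enc1": "chest pain and atrial fibrillation treated with warfarin",
    "enc2": "atrial fibrillation follow up visit",
    "enc3": "knee pain after fall",
    "enc4": "coumadin started for stroke prevention",
}

# Hypothetical synonym table standing in for a controlled terminology lookup.
SYNONYMS = {"warfarin": ["coumadin"]}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def boolean_and(query, docs):
    """Set-based search: return ids of documents containing every query term."""
    terms = set(tokenize(query))
    return {doc_id for doc_id, text in docs.items() if terms <= set(tokenize(text))}


def expand(terms, synonyms):
    """Append listed synonyms to the query terms, dropping duplicates."""
    expanded = list(terms)
    for term in terms:
        expanded.extend(synonyms.get(term, []))
    return list(dict.fromkeys(expanded))


def rank(query, docs, synonyms=None):
    """Word-based search: score each document by a TF-IDF sum and sort."""
    terms = list(dict.fromkeys(tokenize(query)))
    if synonyms:
        terms = expand(terms, synonyms)
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(docs)
    scores = {}
    for term in terms:
        # Document frequency: number of documents containing the term.
        df = sum(1 for toks in tokens.values() if term in toks)
        if df == 0:
            continue
        idf = math.log(1 + n_docs / df)
        for doc_id, toks in tokens.items():
            tf = toks.count(term)
            if tf:
                scores[doc_id] = scores.get(doc_id, 0.0) + tf * idf
    # Highest score first; ties broken by document id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    query = "atrial fibrillation warfarin"
    print("Boolean AND:", boolean_and(query, DOCS))
    print("Ranked:", rank(query, DOCS))
    print("Ranked with expansion:", rank(query, DOCS, SYNONYMS))
```

In this toy setting the Boolean AND query matches only the one encounter that contains all three terms, the ranked search also returns the encounter that matches two of them, and the expanded query additionally retrieves the encounter that mentions only the synonym. Recognition of negation and vocabulary normalization, the other enhancements named above, are not modeled here.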