Abstract
BACKGROUND: EMERSE (Electronic Medical Record Search Engine) is a search engine for free text clinical documents. EMERSE is designed for non-technical researchers, with a user interface that allows for simple query building and patient list management. EMERSE is deployed, or is being deployed, at academic medical centers across the U.S. and Europe. Users have appreciated the speed and simplicity of EMERSE but have sought additional capabilities that a traditional search index cannot provide: the most common feature request has been for the system to support negation (e.g., “no evidence of…”, “patient denies…”) so that a user can exclude negated terms from the results.
OBJECTIVES: Incorporate natural language processing (NLP) capabilities into EMERSE.
DESIGN: Using an “aligned-layer retrieval model” approach, we incorporated additional attributes/tokens that are overlaid on the original indexed terms. These attributes include (1) negation; (2) uncertainty; (3) subject (patient vs other); (4) history of; (5) concept unique identifiers (CUIs) referenced to the National Library of Medicine (NLM) Unified Medical Language System (UMLS); (6) UMLS semantic type (e.g., drug, procedure, finding). These are built into the system in a user-friendly manner so that no technical expertise is required to use them (see Figure).
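The idea of attributes aligned with indexed terms can be illustrated with a minimal sketch. It is not the EMERSE implementation: the function names, the "@NEG" and "@CUI:" layer keys, and the in-memory postings structure are assumptions made for illustration only. Each attribute is stored at the same token position as the term it annotates, so a query can keep or drop a term occurrence according to the attributes that share its position.

```python
from collections import defaultdict

def build_index(tokens):
    """Build postings for one document.

    tokens: list of (surface term, set of layer keys) in document order.
    Terms and layer keys are both indexed at the position of the token,
    which is what keeps the layers aligned with the text.
    """
    index = defaultdict(set)
    for pos, (term, layers) in enumerate(tokens):
        index[term.lower()].add(pos)
        for key in layers:          # hypothetical layer keys such as "@NEG"
            index[key].add(pos)
    return index

def positions(index, key, exclude=()):
    """Positions where `key` occurs, minus positions carrying any excluded layer."""
    hits = set(index.get(key, set()))
    for layer in exclude:
        hits -= index.get(layer, set())
    return sorted(hits)

# "Patient denies abdominal pain but reports nausea"
# The annotations are hand-written here; in practice an NLP pipeline would supply them.
doc = [
    ("Patient", set()),
    ("denies", set()),
    ("abdominal", {"@NEG", "@CUI:C0000737"}),
    ("pain", {"@NEG", "@CUI:C0000737"}),
    ("but", set()),
    ("reports", set()),
    ("nausea", set()),
]
index = build_index(doc)

all_pain = positions(index, "pain")                         # every mention
affirmed_pain = positions(index, "pain", exclude=("@NEG",))  # negated mentions dropped
```

In this sketch, excluding the negation layer removes the mention of "pain" that follows "denies", while the un-negated "nausea" is still found.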
RESULTS: Combining the capabilities of the native search engine (including proximity search, fuzzy search, and wildcard search) with the NLP attributes allows powerful queries to be written. Further, CUIs can be mixed with regular terms. For example, the query “C0000737 left” (see Figure) with a proximity of 5 words can identify all of the following phrases: “left abdominal pain”; “left flank abdominal pain”; “left lower abdominal pain”; “left upper quadrant abdominal pain”; “abdominal pain in the left”; “abdominal pain, left”; “abdominal pain, which began in his left”; “left-sided upper quadrant abdominal pain”. In our sample dataset of approximately 635,000 test “documents” (primarily PubMed abstracts), there were 274,871 negation tokens, 324,842 uncertainty tokens, and 87,723 subject tokens. The addition of these tokens increased the size of the index by 26% (2.3 GB without the tokens versus 2.9 GB with the tokens) but made no discernible difference to the time required to identify a cohort based on a query (~1 second).
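The mixed CUI-and-term proximity query above can be approximated in a few lines. This is a simplified sketch under stated assumptions, not the search engine's actual matching logic: the tokenizer, the `cui_positions` helper (a hypothetical stand-in for concept mapping that recognizes only the literal phrase "abdominal pain"), and the reading of "proximity of 5 words" as a difference of at most 5 token positions are all illustrative choices.

```python
import re

ABDOMINAL_PAIN_CUI = "C0000737"  # CUI taken from the example query in the text

def tokenize(text):
    """Lowercase and split on non-alphanumerics, so 'left-sided' yields 'left'."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cui_positions(tokens):
    """Hypothetical concept mapper: tag both tokens of 'abdominal pain' with the CUI."""
    found = []
    for i in range(len(tokens) - 1):
        if tokens[i:i + 2] == ["abdominal", "pain"]:
            found.extend([i, i + 1])
    return found

def matches(text, term="left", window=5):
    """True if `term` occurs within `window` token positions of the concept."""
    tokens = tokenize(text)
    concept = cui_positions(tokens)
    term_pos = [i for i, tok in enumerate(tokens) if tok == term]
    return any(abs(c - t) <= window for c in concept for t in term_pos)

phrases = [
    "left abdominal pain",
    "left flank abdominal pain",
    "left lower abdominal pain",
    "left upper quadrant abdominal pain",
    "abdominal pain in the left",
    "abdominal pain, left",
    "abdominal pain, which began in his left",
    "left-sided upper quadrant abdominal pain",
]
results = [matches(p) for p in phrases]
```

Under these assumptions every phrase listed in the text matches, whereas a phrase such as "right abdominal pain" (no query term) or "left knee pain" (no concept) does not.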
CONCLUSIONS: The EMERSE system, with the addition of NLP components, is anticipated to provide additional value to users and make searches more effective. It is still undergoing testing at the time of this writing, but we anticipate a release to the community sometime in 2024. For access to EMERSE, contact us at EMERSE-team@umich.edu.