Researchers at the University of Pennsylvania have introduced Observer, described as the first multimodal medical dataset designed to capture anonymized, real-time interactions between patients and clinicians. Unlike traditional sources that only provide clinician notes or patient vital signs after a visit, Observer collects video, audio, and transcripts during actual primary care encounters.
Kevin B. Johnson, David L. Cohen University Professor and lead author of a new paper on Observer in the Journal of the American Medical Informatics Association, explained the significance: "So much of what shapes medical visits and their outcomes has been invisible to researchers. Thanks to technology that anonymizes our recordings, enabling HIPAA compliance, Observer lets us watch care unfold. That kind of evidence isn't just the foundation for improving clinical practice, it's crucial for developing responsible AI tools to augment care."
The research team has already distributed pilot grants to other groups interested in using Observer. Johnson said, "These early projects are the start of a flywheel. As researchers generate new insights and recordings, the dataset will grow, letting us ask even more ambitious questions."
For years, researchers have used data from medical visits to find ways to improve health care delivery. For example, the Medical Information Mart for Intensive Care (MIMIC), an MIT-affiliated project launched in the 1990s, now contains tens of thousands of ICU records and has contributed to numerous studies about clinical decision making and hospital operations.
More recently, such datasets have also become important for training artificial intelligence models, which use them to find links between diagnoses, treatments, and outcomes across large populations. Johnson noted: "We've learned a tremendous amount from what gets documented in the medical record. But if we want to understand the full experience of care, we need data that shows what happens in the room."
Observer links multiple forms of data (video footage, audio recordings, and transcripts) with clinical information from electronic health records (EHRs). This allows researchers to investigate questions such as when laughter occurs during a visit or how often clinicians look at patients compared with computer screens.
Strict privacy rules under HIPAA require that any research data be stripped of identifying details. Traditionally, this has been difficult for video and audio material because de-identification required manual review and editing.
To address this challenge, Penn researchers developed MedVidDeID—a tool described in a separate paper in the Journal of Biomedical Informatics—which automatically removes identifying features from video and audio captured during clinical encounters. In testing scenarios cited by Penn’s team, MedVidDeID successfully de-identified over 90% of video frames without human intervention and reduced total review time by more than half.
Sriharsha Mopidevi, Senior Application Developer in Penn’s AI-4-AI Lab and co-author on both papers, stated: "We built a modular pipeline that automates most of the audio-video de-identification process. By keeping a human in the loop, we're able to protect patient privacy while enabling video-informed research at scale."
Before any data were gathered for Observer, participants, including patients’ families, were given the opportunity to opt in; after their visits, they could also provide feedback. Cameras were placed in several positions in the clinics: a fixed camera captured overall activity in the room, clinicians wore head-mounted cameras, and, when participants permitted, an additional camera recorded from the patient’s perspective.
With initial data collection complete and pilot studies ongoing, the team plans to expand access to qualified investigators through an application-based model similar to MIMIC’s.
Johnson summarized: "This is ultimately about changing the health care system. You cannot improve care or build meaningful clinical AI without understanding the encounter itself. When you can see what happens across hundreds or thousands of visits, transformation becomes possible."