Segmed Introduces LLM-based Medical Data De-identification Playground

PALO ALTO, CALIFORNIA, UNITED STATES, June 15, 2023 – Segmed, a real-world data platform that simplifies access to medical imaging data, today announced the launch of its LLM-based data de-identification playground. This free-to-use tool lets users input text-based medical reports and receive a de-identified version of each report, with direct and indirect identifiers classified and redacted.

There is no shortage of potential when it comes to medical data – researchers and developers in medical AI benefit from high-quality, standardized data to train, test and validate their algorithms. However, prior to sharing and collaborating on this data, patient privacy and confidentiality must be maintained.

For this, data must be effectively de-identified by removing any protected health information (PHI). This includes direct identifiers (e.g., patient name, phone number, medical registration number) and indirect identifiers (e.g., patient sex, date of birth, hospital, ZIP code). While de-identification should significantly reduce the chance of re-identifying a patient, it should also preserve the meaning and structure of the original text.

In the context of data de-identification, large language models (LLMs) have shown promise in detecting and removing sensitive information, largely because they can identify it through named entity recognition (NER).

Segmed has used OpenAI’s davinci model to build our LLM-based data de-identification playground. Users can see the PHI detected in their input, because each redacted identifier is tagged and classified. This tool is strictly for demo purposes and is not intended for production PHI removal. Additionally, Segmed does not store or save user data from this tool.
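To illustrate what "tagged and classified" redaction output can look like, here is a minimal Python sketch. It is not Segmed's method: it swaps in a few hypothetical regular expressions for the LLM, and the category names (`PHONE`, `DATE`, `ZIP`) and the `redact` function are assumptions made for this example. Rules this simple cannot find names or other free-text identifiers.

```python
import re

# Hypothetical patterns for illustration only. Real PHI detection needs far
# broader coverage than a handful of regular expressions.
PATTERNS = {
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
    "ZIP": re.compile(r"\b\d{5}\b"),
}

def redact(text):
    """Replace each match with a [CATEGORY] tag; return (redacted_text, findings)."""
    findings = []
    for label, pattern in PATTERNS.items():
        def tag(match, label=label):
            # Record what was removed and under which category.
            findings.append((label, match.group(0)))
            return f"[{label}]"
        text = pattern.sub(tag, text)
    return text, findings

report = "Patient seen 03/14/2022. Callback: 650-555-0100. Home ZIP 94301."
redacted, findings = redact(report)
print(redacted)
for label, value in findings:
    print(label, value)
```

Replacing each identifier with a category tag, rather than deleting it, keeps the sentence structure of the report intact while showing the reader what kind of information was removed.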

We encourage users to test the tool and provide feedback, either by email or through the survey on the site.

Future iterations of this prototype will allow users to run batch operations (e.g., uploading a CSV file of text they would like de-identified) and to perform more customized data redaction. Segmed is excited to scale and fine-tune our data de-identification process, as this will allow us to accelerate collaboration with innovators in medical AI.
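A batch operation of this kind might look roughly like the following Python sketch. It does not describe Segmed's planned implementation: the `deidentify` placeholder, the `deidentify_csv` helper, and the `report` column name are all hypothetical, and the placeholder handles only one hard-coded name.

```python
import csv
import io

def deidentify(text):
    # Placeholder standing in for a real de-identification step.
    return text.replace("Jane Doe", "[NAME]")

def deidentify_csv(src, dst, column="report"):
    """Copy a CSV from src to dst, de-identifying one text column per row."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        row[column] = deidentify(row[column])
        writer.writerow(row)

# In-memory files keep the example self-contained.
src = io.StringIO("id,report\n1,Jane Doe presents with cough.\n")
dst = io.StringIO()
deidentify_csv(src, dst)
print(dst.getvalue())
```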