Automating a Streaming Pipeline with OCR on Databricks Lakehouse

2 years ago

Health systems and payers handle large volumes of clinical documents, which are often delivered as scanned images. Most organizations struggle to build a scalable pipeline for these documents, even though they need them operationally on a daily basis.

In this talk, Amir demonstrates how to build and automate a clinical data pipeline with JSL Healthcare Solutions on the Databricks Lakehouse Platform. The pipeline uses Databricks' Auto Loader, which automates data ingestion into Delta Lake by letting organizations ingest data incrementally.
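
As a rough illustration of the ingestion step, the sketch below shows how an Auto Loader stream over scanned images might be set up in PySpark. It is not the code from the talk: the helper names, the `*.png` file filter, and the choice of the `binaryFile` format are assumptions made for this example, and `spark` is expected to be the session that a Databricks notebook provides.

```python
def autoloader_options(file_glob="*.png"):
    """Build reader options for an Auto Loader stream over image files.

    The glob default is a hypothetical choice; adjust it to the
    actual file types in the landing location.
    """
    return {
        # Auto Loader is exposed as the "cloudFiles" source; binaryFile
        # yields one row per file with its path and raw bytes.
        "cloudFiles.format": "binaryFile",
        # Generic file-source option: only pick up matching file names.
        "pathGlobFilter": file_glob,
    }


def read_scanned_images(spark, source_path, file_glob="*.png"):
    """Return a streaming DataFrame of files not yet ingested.

    `spark` is an existing SparkSession on Databricks; `source_path`
    is a hypothetical object-storage location.
    """
    return (
        spark.readStream.format("cloudFiles")
        .options(**autoloader_options(file_glob))
        .load(source_path)
    )
```

Keeping the options in a small function makes them easy to inspect or reuse without starting a stream. The `cloudFiles` source is only available on Databricks, so `read_scanned_images` will not run on open-source Spark.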

The pipeline retrieves scanned images from object storage, converts them to text, extracts clinical entities, and writes the results back to the same storage location in Delta format, where they can be analyzed further for a variety of clinical applications using Databricks SQL. All of this runs in a fully managed environment, which simplifies the ETL process.
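
The stages described above could be wired together roughly as follows. This is a minimal sketch under stated assumptions rather than the pipeline shown in the talk: `ocr_stage` and `ner_stage` are hypothetical callables standing in for the JSL OCR and entity-extraction steps, which the description does not detail, and the `_checkpoints` subfolder is an illustrative convention, not a requirement.

```python
def checkpoint_path(output_path):
    """Derive a checkpoint folder next to the output table (assumed layout)."""
    return output_path.rstrip("/") + "/_checkpoints"


def run_pipeline(images_df, ocr_stage, ner_stage, output_path):
    """Apply the OCR and entity stages, then stream the result to Delta.

    `images_df` is a streaming DataFrame of scanned images.
    `ocr_stage` and `ner_stage` are hypothetical functions that each
    take a DataFrame and return a transformed DataFrame.
    """
    text_df = ocr_stage(images_df)      # scanned image -> text
    entities_df = ner_stage(text_df)    # text -> clinical entities
    return (
        entities_df.writeStream.format("delta")
        # The checkpoint records which input files were already processed.
        .option("checkpointLocation", checkpoint_path(output_path))
        .outputMode("append")
        # Process everything currently available, then stop
        # (needs a Spark version that supports availableNow).
        .trigger(availableNow=True)
        .start(output_path)
    )
```

With the `availableNow` trigger, each run processes whatever has arrived since the last checkpoint and then stops, which suits a scheduled job. A continuously running stream would use a different trigger. Once written, the Delta location can be registered as a table and queried from Databricks SQL.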