This project focuses on extracting data from scanned vehicle crash incident reports stored in PDFs. By using OCR technology and advanced data parsing, the solution converts scanned images into readable text and then processes it into structured CSV format.
OCR data extraction presented challenges due to varying text quality in scanned images. The unstructured text had to be parsed and cleaned effectively to identify key fields for the final report.
Pytesseract and OpenCV were used for OCR to extract text from scanned PDFs. The extracted text was then cleaned using regex and spacy to structure the data. The system outputs the data into CSV format, which is scheduled to run automatically each month.
The solution successfully automated the extraction of crash incident data from scanned PDFs. Clients reported faster access to critical data for analysis and improved efficiency in compiling monthly incident reports.