This project extracts vehicle crash and incident report data from multiple sources, including websites and PDF files. The data is parsed and converted into structured CSV format for analysis and reporting. The system automates the collection of crash and incident reports, which the client previously compiled by hand.
The reports came in multiple formats, including both text-based and scanned PDFs, which complicated text extraction. Scanned (image-based) PDFs typically carry no embedded text layer, so extracting their content accurately required integrating OCR (Optical Character Recognition).
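The split between text-based and scanned PDFs can be sketched as a two-step fallback: try the embedded text layer first, and run OCR only when that yields almost nothing. This is a minimal sketch, not the project's actual code. The function names, the `min_chars` threshold, and the use of `pdf2image` to render pages for OCR are all assumptions made for illustration.

```python
def needs_ocr(extracted_text: str, min_chars: int = 20) -> bool:
    """Return True when a PDF's text layer looks too sparse to trust.

    min_chars is an illustrative threshold, not a value from the project.
    """
    # Count only non-whitespace characters so blank pages do not pass.
    meaningful = "".join(extracted_text.split())
    return len(meaningful) < min_chars


def extract_pdf_text(pdf_path: str) -> str:
    """Try the embedded text layer first, then fall back to OCR."""
    # Imported lazily so needs_ocr() can be used without these packages.
    from pdfminer.high_level import extract_text

    text = extract_text(pdf_path)
    if not needs_ocr(text):
        return text

    # Assumption: pdf2image (which requires Poppler) renders each page
    # to an image that pytesseract can read.
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path(pdf_path, dpi=300)
    return "\n".join(pytesseract.image_to_string(page) for page in pages)
```

Checking the text layer first keeps OCR, the slower and more error-prone step, off the documents that do not need it.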
A solution was developed in Python, using pdfminer and PyPDF2 for PDF parsing, pytesseract for OCR, and BeautifulSoup and Selenium for web scraping. The data is extracted, cleaned, and formatted into a CSV/Excel file automatically. The process is scheduled to run at the beginning of each month so that the latest reports are available on time.
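The clean-and-format step might look roughly like the sketch below, which normalizes whitespace in each extracted record and writes the result as CSV using only the standard library. The column names in `FIELDS` and both function names are hypothetical, since the actual report schema is not given here.

```python
import csv
import io

# Hypothetical column names; the real report schema is not specified.
FIELDS = ["report_date", "location", "vehicles_involved"]


def clean_record(raw: dict) -> dict:
    """Keep only the known fields and collapse stray whitespace."""
    cleaned = {}
    for field in FIELDS:
        value = raw.get(field, "")
        # split()/join removes line breaks and repeated spaces that
        # PDF and OCR extraction tend to leave behind.
        cleaned[field] = " ".join(str(value).split())
    return cleaned


def records_to_csv(records) -> str:
    """Serialize cleaned records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for record in records:
        writer.writerow(clean_record(record))
    return buffer.getvalue()
```

For example, `records_to_csv([{"report_date": " 2024-01-03 ", "location": "Main  St\n& 5th", "vehicles_involved": 2}])` produces a header line followed by `2024-01-03,Main St & 5th,2`. Writing to a file instead only requires passing an open file handle to `csv.DictWriter` in place of the in-memory buffer.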
The system has streamlined data extraction, reducing manual labor and improving data accuracy. The client now receives up-to-date incident reports every month in an easily accessible CSV/Excel format, which supports better analysis and decision-making.