Most of my work related to database management has been in the construction, cleaning and new feature creation of large databases. One of the most challenging projects was in the creation of a large American wine industry organizational database.
This project was particularly challenging because all of the original data existed in pdf format. In addition, the researcher (Heather Haveman of UC Berkeley’s Haas School of Business and Sociology departments ) wanted the entire dataset, including names of the owners and employees of each winery, converted into a time-series database. The years included in the database were from 1940 through 2006.
The database construction for these types of projects is typically done through a labor-intensive, manual data-input process. I approached the project by proposing and then implementing an innovative and more efficient semi-automatic process using Python and optical character recognition software.
The steps I implement to construct the database are explained below:
1. Optical Character Recognition
Optical character recognition software (OCR) converted the wine industry data pdf file into text format. The text from the OCR was unstructured and untagged, and often contained a lot of noise. I wrote a program to convert the text from this unstructured format into a structured format identical to what existed in the pdf file. I implemented this step for downstream quality control tests.
2. Automatic Parsing and Feature Generation
A program was written to automatically parse the structured text and export it into csv format. I created two dataset – an organizational and personnel dataset for each winery. The personnel dataset included winery, year, name of the person and title. I used Python dictionaries to automatically add a binary variable for gender (male or female).
The organizational dataset contained such variables as winery name, address, type of wines produced, storage capacity, etc. I also added a location feature by writing a program that automatically added the geographic coordinate system (latitude and longitude) for each winery based on their address.
3. Quality Control
To ensure the program converted the pdf files accurately, I took a random sample of the 30 wineries in the csv files and compared the data in the original pdf file. I made corrections to the program when necessary. I also investigated and corrected the program for producing missing data.