Monday, March 19, 2012

Processing Zip Files using Pentaho Data Integration

Recently I was on a call and the requirement was that they received zip files that contain a .csv file from their customers and needed to take the contained .csv file and put it into a database.  In this particular example, the structure of the .csv file is already known.  Currently this process is a manual process where they unzip the file, take that file and run it into a transformation that populates it into a database.  Their goal is to automate this process.  In the example below I have created a transformation that shows how Pentaho Data Integration can automatically take a zip file, unzip it, pass the contents to a transformation, process the file into a database.

The first step is to create a Job that handles the unzipping of the file

The job then passes that file to a transformation that takes the enclosed .csv file, processes it and inserts it into a database:

I have created a file for you to download a working example.  In order to run through this example you need to perform the following:

  1. Download the “PDI - UNZIP AND PROCESS FILE.ZP” file and extract its’ contents into a directory called “PDI - UNZIP AND PROCESS FILE”

  2. Open up Pentaho Data Integration (do not connect to the repository)

  3. Open the following job: “unzip.kjb”

  4. Open the following transformation: “populated database from unzipped file.ktr”

  5. In the transformation, open up the “Table Output Step”

  6. Select the “SQL” button and then select “Execute”, then click “ok” and “close” on the next two dialog screens.
    1. NOTE – We are using a database that comes with Pentaho for testing purposes

  7. Now you can go to the job and run the job.  The following will happen:

    1. The zip file contents will placed in the main directory “PDI - UNZIP AND PROCESS FILE.ZP” and be called sales_data_{timestamp}.csv
    2. The job will then take the location of that .csv file and pass it to the transformation, which will take that file and insert it into the database