Tuesday, May 14, 2013

GHAP-BI: an OLAP project built only with open source tools

We are now finishing the second school year with GHAP in production. GHAP is attendance control software, and more besides.

A previous GHAP feature was the ability to predict student attendance, combining a KNIME decision tree model with the Python lxml module to read the resulting PMML.
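
To make the idea concrete, here is a minimal sketch of how such a PMML decision tree might be walked with lxml. The element names (Node, SimplePredicate) come from the PMML standard; the file name, input fields and the few operators covered are hypothetical and only meant to illustrate the approach.

from lxml import etree

def local(tag):
    # Strip the PMML namespace so tag names can be compared directly.
    return etree.QName(tag).localname if isinstance(tag, str) else None

def matches(predicate, row):
    # Evaluate a node predicate; only True and SimplePredicate are handled here.
    if local(predicate.tag) == 'True':
        return True
    if local(predicate.tag) != 'SimplePredicate':
        return False
    value = row.get(predicate.get('field'))
    target = predicate.get('value')
    op = predicate.get('operator')
    if op == 'equal':
        return str(value) == target
    if op == 'lessThan':
        return float(value) < float(target)
    if op == 'greaterOrEqual':
        return float(value) >= float(target)
    return False

def score(node, row):
    # Descend into the first child Node whose predicate matches;
    # when no child matches, return the score stored on the current node.
    for child in node:
        if local(child.tag) == 'Node' and matches(child[0], row):
            return score(child, row)
    return node.get('score')

tree = etree.parse('attendance_model.pmml')   # hypothetical export from KNIME
first_node = next(n for n in tree.getroot().iter() if local(n.tag) == 'Node')
print(score(first_node, {'weekday': 'Friday', 'previous_absences': '3'}))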


The big new GHAP feature is the Business Intelligence module. GHAP is built for public schools, and for this (and other) reasons only free software tools should be included. I am used to developing BI projects, but this is my first one built entirely with open source tools. In this post I describe the GHAP-BI software components and my experience with them.

Like any BI solution, GHAP-BI combines ETL with data analysis and visualization software.

ETL is split as E+TL. The Extraction phase is done directly by the GHAP application (Django), and Transform and Load are done with pygrametl. pygrametl is praiseworthy: it is easy, fast and powerful, and clear documentation is available.
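
As an illustration, this is roughly what the TL step looks like with pygrametl, assuming a tab-separated export file and hypothetical table and column names (the real GHAP warehouse schema is not shown here):

import psycopg2
import pygrametl
from pygrametl.datasources import CSVSource
from pygrametl.tables import CachedDimension, FactTable

pgconn = psycopg2.connect(dbname='ghap_dw', user='ghap')
# The first ConnectionWrapper created becomes the default target
# for the dimension and fact table objects below.
conn = pygrametl.ConnectionWrapper(connection=pgconn)

studentdim = CachedDimension(name='student', key='student_id',
                             attributes=['student_name', 'group_name'],
                             lookupatts=['student_name'])
datedim = CachedDimension(name='date', key='date_id',
                          attributes=['day', 'month', 'year'])
facttbl = FactTable(name='attendance_fact',
                    keyrefs=['student_id', 'date_id'],
                    measures=['present'])

for row in CSVSource(open('ghap_export.tsv'), delimiter='\t'):
    # ensure() looks the member up in the cached dimension and inserts it if missing.
    row['student_id'] = studentdim.ensure(row)
    row['date_id'] = datedim.ensure(row)
    facttbl.insert(row)

conn.commit()
conn.close()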

I prefer OLAP solutions over reporting solutions, since I appreciate a metadata layer with measures and dimensions; for this reason I have included Saiku as the visualization software. I found Saiku through a post on Stack Overflow.



ETL development: every night GHAP generates a 91 MB text file (>500K lines); the export takes over 4 hours to complete. This file contains one school year of attendance controls in full detail. Then a little Python program loads this data into the warehouse database: in 2 minutes all the data is imported into 5 tables in a snowflake design. pygrametl is the piece that makes this possible, combining a dimension cache with a bulk fact import. Note that pygrametl could also drastically reduce the export time by avoiding database lookups, but I prefer to generate the text file first.
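
The cache + bulk combination mentioned above boils down to swapping the plain FactTable for a BulkFactTable. The sketch below assumes the same hypothetical fact table and psycopg2 connection as before, and follows the bulkloader pattern described in the pygrametl documentation for PostgreSQL's COPY:

from pygrametl.tables import BulkFactTable

def pgcopybulkloader(name, attributes, fieldsep, rowsep, nullval, filehandle):
    # pygrametl hands over a temporary file with the buffered fact rows;
    # COPY loads them in one round trip instead of row-by-row INSERTs.
    cursor = pgconn.cursor()
    cursor.copy_from(file=filehandle, table=name, sep=fieldsep,
                     null=str(nullval), columns=attributes)

facttbl = BulkFactTable(name='attendance_fact',
                        keyrefs=['student_id', 'date_id'],
                        measures=['present'],
                        bulkloader=pgcopybulkloader,
                        bulksize=500000)   # rows buffered before each COPY

The cached dimensions keep lookups in memory while the facts only hit the database in big COPY batches, which is what brings the load down to a couple of minutes.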

Viz: Saiku is the front end for the users. You can learn about Saiku's pros and cons by reading @twn08's post, but this is my experience: I sent the GHAP-BI URL, user and password to a manager, and 20 minutes later I went to the manager's office to check whether the connection had been successful. To my surprise, the Saiku page was already showing a combination of filters, dimensions and measures, and even some sorted data! When the user saw me, the question was: "how can I export this view?" I realized that Saiku is a very friendly tool. Installing Saiku is easy: I installed tomcat6 with apt-get, copied the Saiku "binary" files into the Tomcat webapps folder, and then changed the connection and schema settings.

Data warehouse backend: Saiku sends MDX queries to Mondrian, and Mondrian translates them into SQL for the relational database. I have tested both MySQL and Postgres, and Postgres is definitely the option adopted: it serves in seconds queries that run forever on MySQL.

Cubes: I initially wrote the cube schema with Pentaho Schema Workbench, but I left it behind when I converted the dimensions to shared dimensions.



Summarizing: in 20 hours my open source BI solution was successfully up and running, thanks to Python, Django, Saiku, Postgres and pygrametl.







