Data Collector Consolidator

From E-COMPASS_Info_Guide

Overview

The collected data from the different sources are translated and consolidated into RDF by means of mapping functions. Each data source considered in the scope of the SME E-COMPASS project (Google Analytics, Piwik and the Competitors' Data Collector) has a different method for collecting, gathering, and providing access to its analytical data. Therefore, a different set of mapping functions is required to parse the information provided by each data source into RDF, according to the ontology. Each set of mappings is composed of functions that translate the attributes and their values into the corresponding triple form in RDF.
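As an illustration of how such a mapping can work (the real mapping code ships inside ecompass.jar and is not reproduced in this guide; the class name and the ontology namespace below are assumptions made for the example), a function of this kind takes one attribute/value pair of a collected record and emits one RDF triple in N-Triples form:

```java
// Illustrative sketch only: the real mapping functions are packaged in
// ecompass.jar; the class name and the ontology namespace used here are
// assumptions made for this example.
public class TripleMapper {

    // Hypothetical ontology namespace; the project ontology defines the real one.
    static final String ONT = "http://www.e-compass.eu/ontology#";

    // Translate one attribute/value pair of a collected record into a single
    // N-Triples line: <subject> <predicate> "value" .
    static String toTriple(String subjectUri, String attribute, String value) {
        return "<" + subjectUri + "> <" + ONT + attribute + "> \"" + value + "\" .";
    }

    public static void main(String[] args) {
        System.out.println(toTriple("http://www.e-compass.eu/data/eshop/42",
                                    "visitors", "317"));
    }
}
```

One such function exists per attribute family and data source, so that each source's raw records end up as triples in the shared ontology vocabulary.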

Overview of Data Collector and Consolidator

The Data Collector and Consolidator consists of four main services: the Mapping Functions, the REST API, the RDF Repository (OpenLink Virtuoso) and Piwik (see the installation sections below).

Almost all of these services are currently hosted on the same virtual machine (VM1). The only service on a separate machine (VM2) is the Virtuoso RDF Repository. Nevertheless, each service can also be easily installed and configured on a different machine in order to obtain a fair trade-off between network connectivity and resource requirements, for the sake of good load balancing.

The mapping functions collect data from the digital footprints (Google Analytics, Piwik) and from the Competitors' Data Collector module to generate RDF data and store it in the Virtuoso service. The REST API functions read the processed information from the RDF Repository and return data in JSON format to the remaining modules of the E-COMPASS Data Mining Application.


Physical Hardware Characteristics

Model: HP ProLiant DL380 G5/HP ProLiant DL360 G5
Processor: 8 cores at 2 GHz / 8 cores at 2.5 GHz
RAM: 32GB/40GB
Hard Drive Space: 4TB Shared
Network Connection: 10 Gbit Ethernet
Hypervisor used: VMware ESXi 6 with vSphere vCenter
Physical Load Balancing: none

Virtual Machine Hardware Specifications and Operating System

Guest Operating System: Linux CentOS 6
Processor: dual 2-core Intel processor and 5-core Intel i7
RAM: 16 GB
Hard Drive Space VM: 100 GB internal
Network Connection: 10 Gbit/s Ethernet
Minimum required Network Connection: no info available

Service Environment and Set-up on VM1

For setting up the API please download and install the following software:

Required Software
  • Java 1.7.0_79: http://www.oracle.com/technetwork/es/java/javase/downloads/index.html
  • Apache Web Server 2.4: http://httpd.apache.org/
  • Apache Tomcat 7.0.32: http://tomcat.apache.org/
  • MySQL Server 5.6 (Community Edition): https://dev.mysql.com/downloads/mysql/

Software Licenses

A commercial provider may need to buy commercial licenses for certain software used to operate the service. The OpenLink Software Virtuoso Universal Server (used as the RDF repository in the E-COMPASS Data Mining Services) requires a software license, which is free of cost for academic use only. In order to run this software in production, a commercial license is required; the terms of licensing are available from OpenLink.

OS Environment Variables

Installation of Mapping Functions

The Mapping Functions are deployed in the service as a Java executable file, "ecompass.jar", which should be placed in a directory (recommended: $HOME/ecompass/ecompass.jar). No additional installation tasks are required.

The (.jar) package contains four main applications:

  • Eshopbd: a process that checks whether a new eshop is present in the system database (MySQL). If so, it creates a new RDF graph in the Virtuoso service for this eshop.
  • Google Analytics mappings (getDateAndLaunch.sh)
  • Piwik mappings
  • Scrapping mappings
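The Eshopbd step above can be pictured as follows. This is a conceptual sketch, not the shipped code, and the per-eshop graph URI scheme is an assumption:

```java
// Conceptual sketch of the Eshopbd step: for an eshop that is new in the
// MySQL database, a dedicated named graph is created in Virtuoso via SPARQL.
// The graph URI scheme below is an assumption, not the project's actual one.
public class EshopGraph {

    // Build the SPARQL statement that creates the per-eshop named graph.
    static String createGraphStatement(int eshopId) {
        return "CREATE GRAPH <http://www.e-compass.eu/graph/eshop/" + eshopId + ">";
    }

    public static void main(String[] args) {
        // The resulting statement would then be sent to Virtuoso's SPARQL endpoint.
        System.out.println(createGraphStatement(7));
    }
}
```

The mapping applications can then write each eshop's triples into its own graph, keeping the data of different eshops separated inside the same repository.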

Installation of REST API

The Data Collector & Consolidator REST API is deployed in a Tomcat service as an "ecompass.war" file. To do so, either place this .war file in the webapps directory of Tomcat (for example, /usr/share/tomcat/webapps/) or deploy it by means of the Manager App of the Tomcat service at http://your.service.machine:8080.

Installation of RDF Repository

Download and install OpenLink Virtuoso.

Installation of Piwik

Download and install Piwik on your own machine following the 5-minute installation guide.

Preparation of MySQL database

  1. Open a command line and type: mysql -u root -p
  2. Enter root password (given during the installation of MySQL)
  3. Create database and user for Price Monitor:
    1. mysql > CREATE DATABASE eshopBD;
    2. mysql > CREATE USER 'ecompass'@'localhost' IDENTIFIED BY 'password';
    3. mysql > USE eshopBD;
    4. mysql > GRANT ALL PRIVILEGES ON *.* TO 'ecompass'@'localhost';
    5. mysql > exit;
    6. Create the database table schema from the OS shell: mysql -u ecompass -p eshopBD < script.sql
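The mapping functions reach this database through JDBC. To verify that the database and user created above are reachable from Java, a minimal check could look like the following sketch; host, port and password are placeholders for your installation:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Minimal connectivity check for the eshopBD database created above.
// Host, port, user and password are placeholders; adapt them to your setup.
public class EshopDbCheck {

    // Build the JDBC URL for a MySQL database.
    static String jdbcUrl(String host, int port, String database) {
        return "jdbc:mysql://" + host + ":" + port + "/" + database;
    }

    // Try to open a connection; requires a running MySQL server and the
    // MySQL Connector/J driver on the classpath.
    static boolean canConnect(String url, String user, String password) {
        try (Connection c = DriverManager.getConnection(url, user, password)) {
            return !c.isClosed();
        } catch (SQLException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Only builds and prints the URL, so it runs even without a server.
        System.out.println(jdbcUrl("localhost", 3306, "eshopBD"));
    }
}
```

Call canConnect(jdbcUrl("localhost", 3306, "eshopBD"), "ecompass", "password") on the service machine to confirm the grant from step 4 works.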

The script.sql file contains the SQL statements that generate all tables, as follows:


#
# Structure for table "ecc_eshop_owner"
#

CREATE TABLE `ecc_eshop_owner` (
  `ecc_eshopowner_id` int(11) NOT NULL,
  `eshop_url` varchar(45) NOT NULL DEFAULT '',
  `first_name` varchar(45) DEFAULT NULL,
  `last_name` varchar(45) DEFAULT NULL,
  `street_name` varchar(45) DEFAULT NULL,
  `street_number` varchar(45) DEFAULT NULL,
  `city` varchar(45) DEFAULT NULL,
  `region` varchar(45) DEFAULT NULL,
  `country` varchar(45) DEFAULT NULL,
  `continent` varchar(45) DEFAULT NULL,
  `telephone_number` varchar(45) DEFAULT NULL,
  `telephone_country` varchar(45) DEFAULT NULL,
  `email` varchar(45) DEFAULT NULL,
  PRIMARY KEY (`ecc_eshopowner_id`,`eshop_url`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

#
# Structure for table "ecEshops"
#

CREATE TABLE `ecEshops` (
  `eshopID` int(11) NOT NULL,
  `name` varchar(50) DEFAULT NULL,
  `url` varchar(50) DEFAULT NULL,
  `activity` varchar(50) DEFAULT NULL,
  `currency` varchar(10) DEFAULT NULL,
  `chamber` varchar(20) DEFAULT NULL,
  `analytics` varchar(10) DEFAULT NULL,
  `ecommerceFunc` varchar(20) DEFAULT NULL,
  `newEshop` bit(1) DEFAULT NULL,
  `clustering` bit(1) DEFAULT NULL,
  `tree` bit(1) DEFAULT NULL,
  PRIMARY KEY (`eshopID`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

#
# Structure for table "ecHolisticCompetitors"
#

CREATE TABLE `ecHolisticCompetitors` (
  `idCompetitor` int(11) NOT NULL AUTO_INCREMENT,
  `Company` varchar(50) DEFAULT NULL,
  `Product` varchar(50) DEFAULT NULL,
  `date` date NOT NULL,
  `Price` double DEFAULT NULL,
  `eShopID` int(11) DEFAULT NULL,
  PRIMARY KEY (`idCompetitor`)
) ENGINE=InnoDB AUTO_INCREMENT=471 DEFAULT CHARSET=latin1;

#
# Structure for table "ecProducts"
#

CREATE TABLE `ecProducts` (
  `idProduct` int(11) NOT NULL AUTO_INCREMENT,
  `date` date NOT NULL,
  `city` varchar(50) DEFAULT NULL,
  `region` varchar(50) DEFAULT NULL,
  `country` varchar(50) DEFAULT NULL,
  `productName` blob,
  `clgroup` varchar(50) DEFAULT NULL,
  `eShopID` int(11) DEFAULT NULL,
  PRIMARY KEY (`idProduct`)
) ENGINE=InnoDB AUTO_INCREMENT=3234 DEFAULT CHARSET=latin1;

#
# Structure for table "ecVisitors"
#

CREATE TABLE `ecVisitors` (
  `idVisitors` int(11) NOT NULL AUTO_INCREMENT,
  `date` date NOT NULL,
  `city` varchar(50) DEFAULT NULL,
  `region` varchar(50) DEFAULT NULL,
  `country` varchar(50) DEFAULT NULL,
  `continent` varchar(50) DEFAULT NULL,
  `clgroup` varchar(50) DEFAULT NULL,
  `visitors` varchar(50) DEFAULT NULL,
  `eShopID` int(11) DEFAULT NULL,
  PRIMARY KEY (`idVisitors`)
) ENGINE=InnoDB AUTO_INCREMENT=94458 DEFAULT CHARSET=latin1;

#
# Structure for table "eshop"
#

CREATE TABLE `eshop` (
  `ecc_eshop_id` int(11) NOT NULL,
  `name` varchar(45) DEFAULT NULL,
  `url` varchar(45) DEFAULT NULL,
  `currency` varchar(45) DEFAULT NULL,
  `timezone` varchar(45) DEFAULT NULL,
  PRIMARY KEY (`ecc_eshop_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

#
# Structure for table "eshop_autoinc_ids"
#

CREATE TABLE `eshop_autoinc_ids` (
  `ecc_eshop_id` int(11) NOT NULL,
  `metric` int(11) NOT NULL DEFAULT '0',
  `medium` int(11) NOT NULL DEFAULT '0',
  PRIMARY KEY (`ecc_eshop_id`),
  UNIQUE KEY `eshop_UNIQUE` (`ecc_eshop_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

#
# Structure for table "ids_eshop"
#

CREATE TABLE `ids_eshop` (
  `ecc_eshop_id` int(11) NOT NULL,
  `ids` varchar(30) DEFAULT NULL,
  PRIMARY KEY (`ecc_eshop_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

#
# Structure for table "jobs"
#

CREATE TABLE `jobs` (
  `id` int(11) NOT NULL,
  `eshop_ecc_id` int(11) NOT NULL,
  `product_id` int(11) NOT NULL,
  `days` int(11) NOT NULL,
  `time_stamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

#
# Structure for table "jobs_competitors"
#

CREATE TABLE `jobs_competitors` (
  `id` int(11) NOT NULL,
  `competitor_id` varchar(45) NOT NULL,
  PRIMARY KEY (`id`,`competitor_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

#
# Structure for table "jobs_schedule"
#

CREATE TABLE `jobs_schedule` (
  `job_id` int(11) NOT NULL,
  `next_date` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`job_id`,`next_date`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

#
# Structure for table "piwik_site_data"
#

CREATE TABLE `piwik_site_data` (
  `ecc_eshop_id` int(11) NOT NULL AUTO_INCREMENT,
  `piwik_idsite` varchar(30) DEFAULT NULL,
  PRIMARY KEY (`ecc_eshop_id`)
) ENGINE=InnoDB AUTO_INCREMENT=197 DEFAULT CHARSET=latin1;

#
# Structure for table "piwik_visit_data"
#

CREATE TABLE `piwik_visit_data` (
  `idsite` int(10) unsigned NOT NULL,
  `idvisit` int(10) unsigned NOT NULL,
  `idvisitor` binary(8) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

#
# Structure for table "piwik_visit_records"
#

CREATE TABLE `piwik_visit_records` (
  `last_idvisit` int(10) unsigned NOT NULL,
  `date` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP
) ENGINE=MyISAM DEFAULT CHARSET=latin1;


Service Configuration

Configuration script

The Mapping Functions are launched every night by means of a "crontab" daemon schedule. To edit it, type %> crontab -e. The schedule will appear, identifying how all mapping applications are called and where the log files are redirected:

# run every night at 2am, 3am, 4am, and 6am
0 2 * * * java -jar  /root/ecompass/ecompass.jar eshopdb 1>> /root/ecompass/ecompass.log 2>> /root/ecompass/ecompass.error
0 3 * * * java -jar  /root/ecompass/ecompass.jar piwik 1>> /root/ecompass/ecompass_piwik.log 2>> /root/ecompass/ecompass_piwik.error
0 4 * * * sh /root/ecompass/getDateAndLaunch.sh analytics
0 6 * * * java -jar /root/ecompass/ecompass.jar scrapping 1>> /root/ecompass/ecompass_scrapping.log 2>> /root/ecompass/ecompass_scrapping.error

README / User Manual

availability / location

Configuration steps

Configuration of connections: there are three databases/repositories to connect to: eshopBD, the Piwik database and the RDF Repository. Their connections are configurable by means of the ecompass.properties file inside the ecompass.jar and ecompass.war files. Therefore, to change the configuration it is required to import both files as Eclipse projects and modify them to connect to the specific databases. The only task required afterwards is to export both projects (mappings and REST API) again as .jar and .war files and deploy them on the service machines as specified previously.
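For reference, ecompass.properties follows the standard Java properties format and is read with java.util.Properties. The sketch below shows how such a file is loaded; the key names are illustrative only, since the real file inside ecompass.jar/ecompass.war defines its own:

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Sketch of loading a properties-based configuration such as ecompass.properties.
// The key names used in main() are illustrative, not the project's actual keys.
public class ConfigSketch {

    // Parse properties from text (in the real service, from a file or classpath).
    static Properties load(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) { // cannot happen for an in-memory reader
            throw new UncheckedIOException(e);
        }
        return p;
    }

    public static void main(String[] args) {
        String example =
            "mysql.url=jdbc:mysql://localhost:3306/eshopBD\n" +
            "virtuoso.url=jdbc:virtuoso://localhost:1111\n";
        Properties p = load(example);
        System.out.println(p.getProperty("mysql.url"));
    }
}
```

Because the file is packaged inside the archives, any change to it requires repackaging and redeploying, as described above.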

Parameter setting for database and RDF repository connections

Operation

Service startup procedure

Start the services by typing the following commands:

  • Start apache server: % service httpd start
  • Start tomcat server: % service tomcat start
  • Start mysql: % /etc/init.d/mysqld start

Restarting the service

Restart the services by typing the following commands:

  • Restart apache server: % service httpd restart
  • Restart tomcat server: % service tomcat restart
  • Restart mysql: % /etc/init.d/mysqld restart

Service Logs

  • Apache logs
  • MySQL logs
  • Tomcat logs
  • Files saved via the mapping redirections: ecompass_analytics.log, ecompass_analytics.error (these two generated by the getDateAndLaunch.sh script), ecompass_piwik.log, ecompass_piwik.error, ecompass_scrapping.log, ecompass_scrapping.error

Recurring Manual Actions / Maintenance Tasks

Limitations of the service

With which parameters does the service scale?

How many concurrent E-Shops, how many concurrent products and how many users/E-Shop customers are possible without causing a loss in quality/speed for the hardware described above?
We are currently managing 30 E-Shops, with data for 2 months on average (~30 GB + 20 GB).

If higher scaling were wanted, which of the hardware parameters would need to be increased?
RAM and hard-drive storage.

What else would be adjusted for higher scalability?
Which further configuration would be necessary?

Contact Information Data Collector & Consolidator Service

José Manuel García Nieto, jnieto@lcc.uma.es, +34 951 952924