Data Analyser

From E-COMPASS_Info_Guide
Revision as of 19:13, 14 January 2016 by GarciaNieto (talk | contribs)

Physical Hardware Characteristics

Model: HP ProLiant DL380 G5/HP ProLiant DL360 G5
Processor: 8 cores @ 2 GHz / 8 cores @ 2.5 GHz
RAM: 32GB/40GB
Hard Drive Space: 4TB Shared
Network Connection: 10 Gbit Ethernet
Hypervisor used: VMware ESX 6 with vSphere Center
Physical Load Balancing: none

Virtual Machine Hardware Specifications and Operating System

Guest Operating System:
Processor: Two 2-core Intel processors / 5-core Intel i7
RAM: 16 GB
Hard Drive Space VM: 100 GB internal
Network Connection: 10 Gbit/s Ethernet
Minimum required Network Connection: no info available

Service Environment and Set-up on VM1

Required Software
Software Download
Software-Name Software-Link
Software-Name Software-Link
Software-Name Software-Link

Software Licenses

Please indicate whether a commercial provider would need to buy commercial licenses for any software used to operate the service and, if so, what approximate cost this would incur.

OS Environment Variables

Installation

The following chapter provides detailed explanations of the different technical modules developed by CICtourGUNE for the Data Analyser Module within the E-Compass Data Mining Services.

The following subsections are described:

  • Database
  • REST services in JAVA
  • Data Mining in R

To give a better understanding of the different developments, their location, distribution and the protocols used are described.

DATABASE

First of all, a MySQL or SQL Server database must be configured so that the REST services and the R data mining processes can both work properly. Currently these two database types are supported in both the JAVA REST services and the R processes, but other database types could be considered (if necessary).

1. MySQL general schema

01.png

2. SQL Server general schema

02.png

The following lines describe the required database schema and tables.

MySQL

After creating a new database in a MySQL server, the following statements create the required tables:

1. E-shops table

03.png

Figure 1: E-Shops table CREATE code for MySQL
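Since Figure 1 is provided as an image, the authoritative CREATE code is the one shown there. Purely as orientation, a hypothetical MySQL sketch follows; the table name “ecEshops” and all column types are assumptions, with column names taken from the “Eshop.java” bean described later in this page:

```sql
-- Hypothetical sketch only: the authoritative CREATE code is in Figure 1.
-- Table name "ecEshops" and all column types are assumptions; column names
-- come from the "Eshop.java" bean documented in the REST services section.
CREATE TABLE ecEshops (
    eshopID       INT NOT NULL,
    name          VARCHAR(255),
    url           VARCHAR(255),
    currency      VARCHAR(16),
    chamber       VARCHAR(255),
    analytics     VARCHAR(255),
    ecommerceFunc VARCHAR(255),
    newEshop      BOOLEAN,
    PRIMARY KEY (eshopID)
);
```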

2. Competitors table

04.png

Figure 2: Competitors table CREATE code for MySQL

3. Products table

05.png

Figure 3: Products table CREATE code for MySQL

4. Visitors table

06.png

Figure 4: Visitors table CREATE code for MySQL  

SQL Server

After creating a new database in an SQL Server instance, the following statements create the required tables:

1. E-shops table

07.png

Figure 5: E-Shops table CREATE code for SQL Server

2. Competitors table

08.png

Figure 6: Competitors table CREATE code for SQL Server

3. Products table

09.png

Figure 7: Products table CREATE code for SQL Server

4. Visitors table

10.png

Figure 8: Visitors table CREATE code for SQL Server

REST SERVICES IN JAVA

The following lines describe the existing REST methods developed by CICtourGUNE.

The REST services provided by CICtourGUNE were served at the following base URL address:

However, having transferred the technical infrastructure to UMA, this will be the final base URL address:

Project distribution

Java project for REST services is distributed in the following packages:

1. Bean: Contains Java classes to manage elements with different properties, for example an e-shop

a.	“Eshop.java”: Contains some variables to manage information related to an element of type “Eshop”
 i.	int eshopID
 ii.	String name 
 iii.	String url
 iv.	String currency
 v.	String chamber
 vi.	String analytics
 vii.	String ecommerceFunc
 viii.	boolean newEshop

2. Dao: Contains Java methods to connect to and interact with different databases. Currently “SqlServer” and “MySQL” databases are supported.

3. Math: Contains different mathematical operations required to calculate elasticity over competitors (HolisticCompetitor)

4. Rest API: Contains code that allows working with JSON format. In addition, following REST methods are defined (they are detailed in the following section):

 a.	Get data from existing e-shop for given date period and metric (GET)
 b.	Add new e-shop (POST)
 c.	Update info of existing e-shop (PUT)
 d.	Delete existing e-shop (PUT)

5. Utils: Contains global variables for common use in the whole JAVA project. Thus, variables used in different places across the project can be changed in a single place.

Each “*.java” file is internally documented so that UMA can understand what has been developed and where to make any change (if needed).

11.png

Figure 9: JAVA project for REST services

Description of REST methods

Get data from existing e-shop for given date period and metric (GET)

Base URL: http://ecompass.khaos.uma.es/ecompassAPI/getDataAnalyzerInfo/results.json

  • Variables:
metric: “visitors”, “products” or “competitors”
eshopid: e-shop ID to get data from
startdate: start date threshold for data collection
enddate: end date threshold for data collection
dbtype: “MySQL” or “SqlServer”
  • Pseudo-URL:
http://ecompass.khaos.uma.es/ecompassAPI/getDataAnalyzerInfo/results.json?metric=<metric>&startdate=<yyyy-mm-dd>&enddate=<yyyy-mm-dd>&eshopid=<eshopid>&dbtype=<dbtype>
  • URL example:
http://ecompass.khaos.uma.es/ecompassAPI/getDataAnalyzerInfo/results.json?metric=visitors&startdate=2015-06-01&enddate=2015-06-10&eshopid=168&dbtype=MySQL
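The example URL above can also be assembled programmatically. The following sketch builds the GET query string without sending the request (Python is used purely for illustration; the project's own code is JAVA and R):

```python
from urllib.parse import urlencode

# Base endpoint taken from the documentation above
BASE = "http://ecompass.khaos.uma.es/ecompassAPI/getDataAnalyzerInfo/results.json"

# Query variables as described above
params = {
    "metric": "visitors",       # "visitors", "products" or "competitors"
    "startdate": "2015-06-01",  # yyyy-mm-dd
    "enddate": "2015-06-10",    # yyyy-mm-dd
    "eshopid": 168,             # e-shop ID to get data from
    "dbtype": "MySQL",          # "MySQL" or "SqlServer"
}

# Assemble the full request URL
url = BASE + "?" + urlencode(params)
print(url)
```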

Add new e-shop (POST)

Base URL: http://ecompass.khaos.uma.es/ecompassAPI/getDataAnalyzerInfo/notifyNewEshop
  • Variables:
dbtype: “MySQL” or “SqlServer”
eshopid: new e-shop’s ID number
name: new e-shop’s name
url: new e-shop’s URL address
  • Pseudo-URL:
http://ecompass.khaos.uma.es/ecompassAPI/getDataAnalyzerInfo/notifyNewEshop?dbtype=<dbtype>&eshopid=<eshopid>&name=<name>&url=<url>
  • URL example:

http://ecompass.khaos.uma.es/ecompassAPI/getDataAnalyzerInfo/notifyNewEshop?dbtype=MySQL&eshopid=555&name=hugo&url=mikel

Update info of existing e-shop (PUT)
Base URL: http://ecompass.khaos.uma.es/ecompassAPI/getDataAnalyzerInfo/updateEshop
  • Variables (in the path):
{dbtype}: “MySQL” or “SqlServer”
{eshopid}: e-shop ID to be updated
  • Pseudo-URL:
http://ecompass.khaos.uma.es/ecompassAPI/getDataAnalyzerInfo/updateEshop/{dbtype}/{eshopid}
  • URL example:
http://ecompass.khaos.uma.es/ecompassAPI/getDataAnalyzerInfo/updateEshop/MySQL/999

PAYLOAD (additional content that must be sent in JSON format):
  • Variables:
eshopID: new e-shop ID for the e-shop in the URL address (999 in the example)
name: new e-shop name
url: new e-shop URL address
  • PAYLOAD example:

{"eshopID":171,"name":"Coozina","url":"http://www.coozina.gr"}
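To illustrate how the path variables and the JSON payload fit together, here is a minimal sketch that constructs (but deliberately does not send) the PUT request; Python is used only for illustration:

```python
import json
from urllib.request import Request

# Base URL and path variables from the documentation above
BASE = "http://ecompass.khaos.uma.es/ecompassAPI/getDataAnalyzerInfo/updateEshop"
dbtype, eshopid = "MySQL", 999

# Payload example from the documentation above
payload = {"eshopID": 171, "name": "Coozina", "url": "http://www.coozina.gr"}

# Build the PUT request with the JSON payload in the body
req = Request(
    url=f"{BASE}/{dbtype}/{eshopid}",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="PUT",
)
# The request object could then be sent with urllib.request.urlopen(req).
```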

Delete existing e-shop (PUT)

Base URL: http://ecompass.khaos.uma.es/ecompassAPI/getDataAnalyzerInfo/deleteEshop

  • Variables (in the path):
{dbtype}: “MySQL” or “SqlServer”
{eshopid}: existing e-shop’s ID
{name}: e-shop name to be deleted, corresponding to the “eshopid” variable
  • Pseudo-URL:
http://ecompass.khaos.uma.es/ecompassAPI/getDataAnalyzerInfo/deleteEshop/{dbtype}/{eshopid}/{name}
  • URL example:

http://ecompass.khaos.uma.es/ecompassAPI/getDataAnalyzerInfo/deleteEshop/MySQL/999/Coozina


DATA MINING IN R

The following lines describe required files for using data mining processes in R developed by CICtourGUNE.

File Description

The R project contains the following files:

CODE directory

File Description
classifynew.r Classifies new observations from Google. Auxiliary script called by “Clustergoogle.r”.
classifynewPiw.r Classifies new observations from Piwik. Auxiliary script called by “ClusterPiwik.r”.
classifynewProductsPiw.r Same functionality as “classifynewPiw.r” but oriented to products instead of visitors.
Clustergoogle.r Main script for Google sites: makes clusters and generates the preliminary report.
ClusterPiwik.R Main script for Piwik sites: makes clusters and generates the preliminary report.
ClusterProductsPiwik.R Same functionality as “ClusterPiwik.R” but oriented to products instead of visitors.
createtre.r Generates decision trees in order to rapidly classify new observations of Google sites by using “classifynew.r”. In addition, it inserts the clusters’ names and adds the first cluster’s observations to the DB of the REST service.
createtrepiw.r Generates decision trees in order to rapidly classify new observations of Piwik sites by using “classifynew.r”. In addition, it inserts the clusters’ names and adds the first cluster’s observations to the DB of the REST service.
createtreProductsPiw.r Same functionality as “createtrepiw.r” but oriented to products instead of visitors.
DataTablePiwik.Rmd Auxiliary “Rmd” file for generating the preliminary report in Piwik. Called by “ClusterPiwik.r”.
DataTableProductsPiwik.Rmd Auxiliary “Rmd” file for generating the preliminary report in Piwik. Called by “ClusterProductsPiwik.r”.
DescriptDataTable.Rmd Auxiliary “Rmd” file for generating the preliminary report in Google. Called by “Clustergoogle.r”.
getvisitor.r Auxiliary script for generating the data table in Google. There is no equivalent file for Piwik because it is not needed.
installer.r Script for installing all required packages (check it just in case any is missing).

CODE/conf directory

File Description
confMySQL.yml Configuration file to set the connection configuration for the MySQL database.
confSqlServer.yml Configuration file to set the connection configuration for the Sql Server database.

CODE/lib directory

File Description
loadSource.R Generic client which allows connecting to a certain database specified by the “dbtype” parameter (“MySQL” or “SqlServer”).
loadMySQL.R MySQL driver used by “loadSource.R” for SELECT, INSERT or other operations against the database server specified in “confMySQL.yml”.
loadSqlServer.R Sql Server driver used by “loadSource.R” for SELECT, INSERT or other operations against the database server specified in “confSqlServer.yml”.

Modus Operandi

Continuing with the R data mining section, the following describes the right order in which to execute all processes and obtain results. For a better understanding, code to be executed in the “R console” is shown as code lines.

First execution in the machine

The first step, before executing anything else, is to install all the required R packages:

  • Load “installer.r” library
source("CODE/installer.r")
  • Execute “installPackages” function
installPackages()

Database configuration

In order to interact properly with any database, the database IP address and user credentials have to be configured first. This is done in the corresponding file depending on the database type (“MySQL” or “SqlServer”).

For the Sql Server database type, regardless of whether it is a Windows or Linux server, an ODBC connection is needed.

  • Windows server: configure it at “Control Panel → Administrative Tools → Data Sources (ODBC)”
  • Linux server: the FreeTDS tool must be installed; it allows creating ODBC connections to “SqlServer“ instances
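As a hedged sketch of the Linux set-up: FreeTDS is typically configured with a server entry in freetds.conf plus an ODBC data source whose name must match the “odbcName” value in “confSqlServer.yml”. None of the names or paths below come from the project; they are placeholders/assumptions:

```ini
; /etc/freetds/freetds.conf — hypothetical server entry
[ecompassSqlServer]
    host = <sql-server-host>
    port = 1433
    tds version = 7.2

; /etc/odbc.ini — hypothetical DSN; its name is what goes into "odbcName"
[ecompassODBC]
Driver = FreeTDS
Servername = ecompassSqlServer
Database = <database-name>
```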

These are the parameters to be configured:

  • confMySQL.yml: Configuration file for “MySQL” database server
host: IP address or server name
name: database name
user: user with credentials in the database
pass: password for previously set user
  • confSqlServer.yml: Configuration file for “SqlServer” database server
odbcName: previously configured ODBC connection name
user: user with credentials in the database
pass: password for previously set user
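The exact layout of these two files is not reproduced in this guide; assuming a plain key: value YAML layout with the keys listed above, a minimal sketch would look like this (all values are placeholders):

```yaml
# CODE/conf/confMySQL.yml — values are placeholders
host: <ip-or-hostname>
name: <database-name>
user: <db-user>
pass: <db-password>

# CODE/conf/confSqlServer.yml — values are placeholders
odbcName: <odbc-connection-name>
user: <db-user>
pass: <db-password>
```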

First execution to make clusters with VISITORS of an existing e-shop

1. Load clustering library based on footprint type

a.	“Clustergoogle.r” in case of Google Analytics
 source("CODE/Clustergoogle.r")
b.	“ClusterPiwik.R” in case of Piwik
 source("CODE/ClusterPiwik.R")

2. Execute clustering function based on footprint type

a.	Google Analytics function parameters:
 eshop: eshop ID (required)
 date_ini: init date for clustering data
 date_end: end date for clustering data
 metrics: metrics
 valmetr: value related metrics
 geovar: geography related variables
 geocom: geography related common variables
 force: force to re-create cluster if exists
b.	Call example:
 clusterprocess(eshop=171)
c.	Piwik function parameters:
 eshop: eshop ID (required)
 date_ini: init date for clustering data
 date_end: end date for clustering data
 numetrics: numeric metrics
 nometrics: nominal metrics
 dims: dimensions related to geography
 valmetrics: value related metrics
 force: force to re-create cluster if exists
d.	Call example:
 clusterprocesspiw(eshopid=171)

3. Check the output report in the following directory:

<project root directory>/output/<e-shop code>/

4. Report is named with the following pattern:

“informe<e-shop code>.pdf”

5. Example for ID = 171: “informe171.pdf”

Absolute path: <project root directory>/output/171/informe171.pdf

6. Load the corresponding “createtre” library based on the footprint type

a.	“createtre.r” in case of Google Analytics
 source("CODE/createtre.r")
b.	“createtrepiw.r” in case of Piwik
 source("CODE/createtrepiw.r")

7. Execute the corresponding “createtre” function based on the footprint type, indicating the name of each cluster in the “corresp” variable. The right cluster names are chosen by inspecting the report mentioned in steps 3-5.

a.	Google Analytics function parameters:
 eshop: eshop ID (required)
 corresp: cluster names (required)
 metrics: metrics to use in tree creation
 geovar: geography related variables
 force: force to re-create tree if exists
b.	Call example:
 createtre(eshop=171,corresp=c("Need-Based","Wandering","Loyal","Misplaced"))
c.	Piwik function parameters:
 eshop: eshop ID (required)
 corresp: cluster names (required)
 metrics: metrics to use in tree creation
 geovar: geography related variables
 force: force to re-create tree if exists
d.	Call example:
 createtrepiw(eshop=171,corresp=c("Need-Based","Wandering","Loyal","Misplaced"))


First execution to make clusters with PRODUCTS of an existing e-shop

1. Load clustering library (ONLY Piwik footprint) “ClusterProductsPiwik.R”

source("CODE/ClusterProductsPiwik.R")

2. Execute clustering function

a.	Function parameters:
 eshop: eshop ID (required)
 date_ini: init date for clustering data
 date_end: end date for clustering data
 numetrics: numeric metrics
 nometrics: nominal metrics
 dims: dimensions related to geography
 valmetrics: value related metrics
 force: force to re-create cluster if exists
b.	Call example:
 clusterprocessProductsPiw(eshopid=171)

3. Check the output report in the following directory:

<project root directory>/output/<e-shop code>/

4. Report is named with the following pattern:

“informeProducts<e-shop code>.pdf”

5. Example for Coozina with ID = 171: “informeProducts171.pdf”

Absolute path: <project root directory>/output/171/informeProducts171.pdf

6. Load “createtreProductsPiw.r” library

source("CODE/createtreProductsPiw.R")

7. Execute the “createtreProductsPiw” function, indicating the name of each cluster in the “corresp” variable. The right cluster names are chosen by inspecting the report mentioned in steps 3-5.

a.	Function parameters:
 eshop: eshop ID (required)
 date_ini: init date for clustering data
 date_end: end date for clustering data
 numetrics: numeric metrics
 nometrics: nominal metrics
 dims: dimensions related to geography
 valmetrics: value related metrics
 force: force to re-create cluster if exists
b.	Call example:
 createtreProductsPiw(eshopid=171,corresp=c("Speciality","Convenience","Unsought"))

Future executions

  • Execute the clustering script based on footprint type as explained above.

Getting Prices

1. Load “getvisitor.r” library

source("CODE/getvisitor.r")

2. Load “loadSource.R” library

source("CODE/lib/loadSource.R")

3. Execute “getPrices” function:

a. 	Function parameters:
 eshop: eshop ID (required)
 date_ini: init date for pricing data (required)
 date_end: end date for pricing data (required)

b. 	Call example:
 pricesData <- getPrices(171,"2015-06-01","2015-07-15")

4. Insert the data obtained in the previous step into the corresponding table (ecHolisticCompetitors) of the database:

a. 	Call example MySQL
 loadSource(databaseType="MySQL",queryType="insert",tableName="ecHolisticCompetitors",colNamesVec=c("Company","Product","date","Price","eShopID"),paths=paths,data=pricesData)
b.	 Call example SqlServer:
 loadSource(databaseType="SqlServer",queryType="insert",tableName="[ECompassREST].[dbo].[ecHolisticCompetitors]",colNamesVec=c("Company","Product","date","Price","eShopID"),paths=paths,data=pricesData)

Database library testing

After having configured the “MySQL” or “SqlServer” database connection as presented above, a simple test can be performed to check proper DB access.

1. Set a variable (in this case “paths”) indicating all relative paths of the project:

setwd("<path of the project directory>/Pipesecompass")
rootPath <- getwd()
libPath <- paste0(getwd(),"/CODE/lib/")
dbConfPath <- paste0(getwd(),"/CODE/conf/")
paths <- list(lib=libPath,dbConf=dbConfPath)

2. Load “loadSource.R” library

source("CODE/lib/loadSource.R")

3. Execute “loadSource” function

Function parameters:
 databaseType: “MySQL” or “SqlServer”
 queryType: “select” or “insert” (currently no other query supported)
 tableName: table name to query
 colNamesVec: vector with required column names
 paths: configuration paths previously prepared (step 1)
 data: data set required for INSERT query type (not for SELECT)

a.	Call example MySQL:
 loadSource(databaseType="MySQL",queryType="insert",tableName="ecProducts",colNamesVec=c("date","city","region","country","productName","clgroup","eShopID"),paths=paths,data=data)
b.	Call example Sql Server:
 loadSource(databaseType="SqlServer",queryType="insert",tableName="[ECompassREST].[dbo].[ecProducts]",colNamesVec=c("date","city","region","country","productName","clgroup","eShopID"),paths=paths,data=data)

In case of doubts

  • Execute the clustering script based on footprint type and follow the instructions that appear as warning or error messages; they will indicate what to do.


Preparation

Service Configuration

Configuration script

availability / location

README / User Manual

availability / location

Configuration steps

… Configuration of REST endpoints at: …

Operation

Service startup procedure

Restarting the service

Service Logs

Recurring Manual Actions / Maintenance Tasks

Other

Limitations of the service

With which parameters does the service scale?

How many concurrent E-Shops, how many concurrent products and how many users/E-Shop customers are possible without causing loss in quality/speed for the hardware described above?

The service scales with the overall number of users over all E-Shops.

If higher scaling were wanted, which of the hardware parameters would need to be increased?
What else would be adjusted for higher scalability?
Which further configuration would be necessary?

Contact Information Data Analyser

José Manuel García Nieto, jnieto@lcc.uma.es, +34 951 952924