Data Analyser

From E-COMPASS_Info_Guide
Revision as of 19:13, 14 January 2016 by GarciaNieto (talk | contribs)

Physical Hardware Characteristics

Model: HP ProLiant DL380 G5/HP ProLiant DL360 G5
Processor: 8 cores @ 2 GHz / 8 cores @ 2.5 GHz
RAM: 32GB/40GB
Hard Drive Space: 4TB Shared
Network Connection: 10 Gbit Ethernet
Hypervisor used: VMware ESX 6 with vSphere Center
Physical Load Balancing: none

Virtual Machine Hardware Specifications and Operating System

Guest Operating System:
Processor: Two 2-core Intel processors / 5-core Intel i7
RAM: 16 GB
Hard Drive Space VM: 100 GB internal
Network Connection: 10 Gbit/s Ethernet
Minimum required Network Connection: no info available

Service Environment and Set-up on VM1

Required Software
Software Download
Software-Name Software-Link
Software-Name Software-Link
Software-Name Software-Link

Software Licenses

Please indicate whether a commercial provider would need to buy commercial licenses for any software used to operate the service and, if so, what approximate cost this would incur.

OS Environment Variables

Installation

The following chapter provides detailed explanations of the different technical modules developed by CICtourGUNE for the Data Analyser Module within the E-Compass Data Mining Services.

The following subsections are described:

  • Database
  • REST services in JAVA
  • Data Mining in R

To give a better understanding of the different developments, their location, distribution and the protocols used are described.

DATABASE

First of all, a MySQL or SQL Server database must be configured so that the REST services and the R data mining processes can both work properly. Currently these two database types are supported in both the JAVA REST services and the R processes, but other database types could be considered (if necessary).

1. MySQL general schema

01.png

2. SQL Server general schema

02.png

The following lines describe the required database schema and tables.

MySQL

After creating a new database in a MySQL server, the following statements create the required tables:

1. E-shops table

03.png

Figure 1: E-Shops table CREATE code for MySQL
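Since Figure 1 is provided as an image, the authoritative CREATE code is the one shown there. Purely as orientation, a hypothetical MySQL sketch follows; the table name “ecEshops” and all column types are assumptions, with column names taken from the “Eshop.java” bean described later in this page:

```sql
-- Hypothetical sketch only: the authoritative CREATE code is in Figure 1.
-- Table name "ecEshops" and all column types are assumptions; column names
-- come from the "Eshop.java" bean documented in the REST services section.
CREATE TABLE ecEshops (
    eshopID       INT NOT NULL,
    name          VARCHAR(255),
    url           VARCHAR(255),
    currency      VARCHAR(16),
    chamber       VARCHAR(255),
    analytics     VARCHAR(255),
    ecommerceFunc VARCHAR(255),
    newEshop      BOOLEAN,
    PRIMARY KEY (eshopID)
);
```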

2. Competitors table

04.png

Figure 2: Competitors table CREATE code for MySQL

3. Products table

05.png

Figure 3: Products table CREATE code for MySQL

4. Visitors table

06.png

Figure 4: Visitors table CREATE code for MySQL  

SQL Server

After creating a new database in an SQL Server instance, the following statements create the required tables:

1. E-shops table

07.png

Figure 5: E-Shops table CREATE code for SQL Server

2. Competitors table

08.png

Figure 6: Competitors table CREATE code for SQL Server

3. Products table

09.png

Figure 7: Products table CREATE code for SQL Server

4. Visitors table

10.png

Figure 8: Visitors table CREATE code for SQL Server

REST SERVICES IN JAVA

The following lines describe the existing REST methods developed by CICtourGUNE.

The REST services provided by CICtourGUNE were served at the following base URL address:

However, having transferred the technical infrastructure to UMA, this will be the final base URL address:

Project distribution

Java project for REST services is distributed in the following packages:

1. Bean: Contains Java classes to manage elements with different properties, for example an e-shop

a.	“Eshop.java”: Contains some variables to manage information related to an element of type “Eshop”
 i.	int eshopID
 ii.	String name 
 iii.	String url
 iv.	String currency
 v.	String chamber
 vi.	String analytics
 vii.	String ecommerceFunc
 viii.	boolean newEshop

2. Dao: Contains Java methods to connect to and interact with different databases. Currently “SqlServer” and “MySQL” databases are supported.

3. Math: Contains different mathematical operations required to calculate elasticity over competitors (HolisticCompetitor)

4. Rest API: Contains code that allows working with JSON format. In addition, following REST methods are defined (they are detailed in the following section):

 a.	Get data from existing e-shop for given date period and metric (GET)
 b.	Add new e-shop (POST)
 c.	Update info of existing e-shop (PUT)
 d.	Delete existing e-shop (PUT)

5. Utils: Contains global variables for common use in the whole JAVA project. Thus, variables used in different places across the project can be changed in a single place.

Each “*.java” file is internally documented so that UMA can understand what has been developed and where to make any change (if needed).

11.png

Figure 9: JAVA project for REST services

Description of REST methods

Get data from existing e-shop for given date period and metric (GET)

Base URL: http://ecompass.khaos.uma.es/ecompassAPI/getDataAnalyzerInfo/results.json

  • Variables:
metric: “visitors”, “products” or “competitors”
eshopid: e-shop ID to get data from
startdate: start date threshold for data collection
enddate: end date threshold for data collection
dbtype: “MySQL” or “SqlServer”
  • Pseudo-URL:
http://ecompass.khaos.uma.es/ecompassAPI/getDataAnalyzerInfo/results.json?metric=<metric>&startdate=<yyyy-mm-dd>&enddate=<yyyy-mm-dd>&eshopid=<eshopid>&dbtype=<dbtype>
  • URL example:
http://ecompass.khaos.uma.es/ecompassAPI/getDataAnalyzerInfo/results.json?metric=visitors&startdate=2015-06-01&enddate=2015-06-10&eshopid=168&dbtype=MySQL
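The example URL above can also be assembled programmatically. The following sketch builds the GET query string without sending the request (Python is used purely for illustration; the project's own code is JAVA and R):

```python
from urllib.parse import urlencode

# Base endpoint taken from the documentation above
BASE = "http://ecompass.khaos.uma.es/ecompassAPI/getDataAnalyzerInfo/results.json"

# Query variables as described above
params = {
    "metric": "visitors",       # "visitors", "products" or "competitors"
    "startdate": "2015-06-01",  # yyyy-mm-dd
    "enddate": "2015-06-10",    # yyyy-mm-dd
    "eshopid": 168,             # e-shop ID to get data from
    "dbtype": "MySQL",          # "MySQL" or "SqlServer"
}

# Assemble the full request URL
url = BASE + "?" + urlencode(params)
print(url)
```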

Add new e-shop (POST)

Base URL: http://ecompass.khaos.uma.es/ecompassAPI/getDataAnalyzerInfo/notifyNewEshop
  • Variables:
dbtype: “MySQL” or “SqlServer”
eshopid: new e-shop’s ID number
name: new e-shop’s name
url: new e-shop’s URL address
  • Pseudo-URL:
http://ecompass.khaos.uma.es/ecompassAPI/getDataAnalyzerInfo/notifyNewEshop?dbtype=<dbtype>&eshopid=<eshopid>&name=<name>&url=<url>
  • URL example:

http://ecompass.khaos.uma.es/ecompassAPI/getDataAnalyzerInfo/notifyNewEshop?dbtype=MySQL&eshopid=555&name=hugo&url=mikel

Update info of existing e-shop (PUT)
Base URL: http://ecompass.khaos.uma.es/ecompassAPI/getDataAnalyzerInfo/updateEshop
  • Variables (in the path):
{dbtype}: “MySQL” or “SqlServer”
{eshopid}: e-shop ID to be updated
  • Pseudo-URL:
http://ecompass.khaos.uma.es/ecompassAPI/getDataAnalyzerInfo/updateEshop/{dbtype}/{eshopid}
  • URL example:
http://ecompass.khaos.uma.es/ecompassAPI/getDataAnalyzerInfo/updateEshop/MySQL/999

PAYLOAD (additional content that must be sent in JSON format):
  • Variables:
eshopID: new e-shop ID for the e-shop in the URL address (999 in the example)
name: new e-shop name
url: new e-shop URL address
  • PAYLOAD example:

{"eshopID":171,"name":"Coozina","url":"http://www.coozina.gr"}
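To illustrate how the path variables and the JSON payload fit together, here is a minimal sketch that constructs (but deliberately does not send) the PUT request; Python is used only for illustration:

```python
import json
from urllib.request import Request

# Base URL and path variables from the documentation above
BASE = "http://ecompass.khaos.uma.es/ecompassAPI/getDataAnalyzerInfo/updateEshop"
dbtype, eshopid = "MySQL", 999

# Payload example from the documentation above
payload = {"eshopID": 171, "name": "Coozina", "url": "http://www.coozina.gr"}

# Build the PUT request with the JSON payload in the body
req = Request(
    url=f"{BASE}/{dbtype}/{eshopid}",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="PUT",
)
# The request object could then be sent with urllib.request.urlopen(req).
```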

Delete existing e-shop (PUT)

Base URL: http://ecompass.khaos.uma.es/ecompassAPI/getDataAnalyzerInfo/deleteEshop

  • Variables (in the path):
{dbtype}: “MySQL” or “SqlServer”
{eshopid}: existing e-shop’s ID
{name}: e-shop name to be deleted, corresponding to the “eshopid” variable
  • Pseudo-URL:
http://ecompass.khaos.uma.es/ecompassAPI/getDataAnalyzerInfo/deleteEshop/{dbtype}/{eshopid}/{name}
  • URL example:

http://ecompass.khaos.uma.es/ecompassAPI/getDataAnalyzerInfo/deleteEshop/MySQL/999/Coozina


DATA MINING IN R

The following lines describe required files for using data mining processes in R developed by CICtourGUNE.

File Description

The R project contains the following files:

CODE directory

File Description
classifynew.r Classifies new observations from Google. Auxiliary script called by “Clustergoogle.r”.
classifynewPiw.r Classifies new observations from Piwik. Auxiliary script called by “ClusterPiwik.r”.
classifynewProductsPiw.r Same functionality as “classifynewPiw.r” but oriented to products instead of visitors.
Clustergoogle.r Main script for Google sites: makes clusters and generates the preliminary report.
ClusterPiwik.R Main script for Piwik sites: makes clusters and generates the preliminary report.
ClusterProductsPiwik.R Same functionality as “ClusterPiwik.R” but oriented to products instead of visitors.
createtre.r Generates decision trees in order to rapidly classify new observations of Google sites by using “classifynew.r”. In addition, it inserts the clusters’ names and adds the first cluster’s observations to the DB of the REST service.
createtrepiw.r Generates decision trees in order to rapidly classify new observations of Piwik sites by using “classifynew.r”. In addition, it inserts the clusters’ names and adds the first cluster’s observations to the DB of the REST service.
createtreProductsPiw.r Same functionality as “createtrepiw.r” but oriented to products instead of visitors.
DataTablePiwik.Rmd Auxiliary “Rmd” file for generating the preliminary report in Piwik. Called by “ClusterPiwik.r”.
DataTableProductsPiwik.Rmd Auxiliary “Rmd” file for generating the preliminary report in Piwik. Called by “ClusterProductsPiwik.r”.
DescriptDataTable.Rmd Auxiliary “Rmd” file for generating the preliminary report in Google. Called by “Clustergoogle.r”.
getvisitor.r Auxiliary script for generating the data table in Google. There is no equivalent file for Piwik because it is not needed.
installer.r Script for installing all required packages (check it just in case any is missing).

CODE/conf directory

File Description
confMySQL.yml Configuration file to set the connection configuration for the MySQL database.
confSqlServer.yml Configuration file to set the connection configuration for the Sql Server database.

CODE/lib directory

File Description
loadSource.R Generic client which allows connecting to a certain database specified by the “dbtype” parameter (“MySQL” or “SqlServer”).
loadMySQL.R MySQL driver used by “loadSource.R” for SELECT, INSERT or other operations against the database server specified in “confMySQL.yml”.
loadSqlServer.R Sql Server driver used by “loadSource.R” for SELECT, INSERT or other operations against the database server specified in “confSqlServer.yml”.

Modus Operandi

Continuing with the R data mining section, the following describes the right order in which to execute all processes and obtain results. For a better understanding, code to be executed in the “R console” is shown as code lines.

First execution in the machine

The first step, before executing anything else, is to install all the required R packages:

  • Load “installer.r” library
source("CODE/installer.r")
  • Execute “installPackages” function
installPackages()

Database configuration

In order to interact properly with any database, the database IP address and user credentials have to be configured first. This is done in the corresponding file depending on the database type (“MySQL” or “SqlServer”).

For the Sql Server database type, regardless of whether it is a Windows or Linux server, an ODBC connection is needed.

  • Windows server: configure it at “Control Panel → Administrative Tools → Data Sources (ODBC)”
  • Linux server: the FreeTDS tool must be installed; it allows creating ODBC connections to “SqlServer“ instances
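As a hedged sketch of the Linux set-up: FreeTDS is typically configured with a server entry in freetds.conf plus an ODBC data source whose name must match the “odbcName” value in “confSqlServer.yml”. None of the names or paths below come from the project; they are placeholders/assumptions:

```ini
; /etc/freetds/freetds.conf — hypothetical server entry
[ecompassSqlServer]
    host = <sql-server-host>
    port = 1433
    tds version = 7.2

; /etc/odbc.ini — hypothetical DSN; its name is what goes into "odbcName"
[ecompassODBC]
Driver = FreeTDS
Servername = ecompassSqlServer
Database = <database-name>
```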

These are the parameters to be configured:

  • confMySQL.yml: Configuration file for “MySQL” database server
host: IP address or server name
name: database name
user: user with credentials in the database
pass: password for previously set user
  • confSqlServer.yml: Configuration file for “SqlServer” database server
odbcName: previously configured ODBC connection name
user: user with credentials in the database
pass: password for previously set user
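The exact layout of these two files is not reproduced in this guide; assuming a plain key: value YAML layout with the keys listed above, a minimal sketch would look like this (all values are placeholders):

```yaml
# CODE/conf/confMySQL.yml — values are placeholders
host: <ip-or-hostname>
name: <database-name>
user: <db-user>
pass: <db-password>

# CODE/conf/confSqlServer.yml — values are placeholders
odbcName: <odbc-connection-name>
user: <db-user>
pass: <db-password>
```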

First execution to make clusters with VISITORS of an existing e-shop

1. Load clustering library based on footprint type

a.	“Clustergoogle.r” in case of Google Analytics
 source("CODE/Clustergoogle.r")
b.	“ClusterPiwik.R” in case of Piwik
 source("CODE/ClusterPiwik.R")

2. Execute clustering function based on footprint type

a.	Google Analytics function parameters:
 eshop: eshop ID (required)
 date_ini: init date for clustering data
 date_end: end date for clustering data
 metrics: metrics
 valmetr: value related metrics
 geovar: geography related variables
 geocom: geography related common variables
 force: force to re-create cluster if exists
b.	Call example:
 clusterprocess(eshop=171)
c.	Piwik function parameters:
 eshop: eshop ID (required)
 date_ini: init date for clustering data
 date_end: end date for clustering data
 numetrics: numeric metrics
 nometrics: nominal metrics
 dims: dimensions related to geography
 valmetrics: value related metrics
 force: force to re-create cluster if exists
d.	Call example:
 clusterprocesspiw(eshopid=171)

3. Check the output report in the following directory:

<project root directory>/output/<e-shop code>/

4. Report is named with the following pattern:

“informe<e-shop code>.pdf”

5. Example for ID = 171: “informe171.pdf”

Absolute path: <project root directory>/output/171/informe171.pdf

6. Load the corresponding “createtre” library based on the footprint type

a.	“createtre.r” in case of Google Analytics
 source("CODE/createtre.r")
b.	“createtrepiw.r” in case of Piwik
 source("CODE/createtrepiw.r")

7. Execute the corresponding “createtre” function based on the footprint type, indicating the name of each cluster in the “corresp” variable. The right cluster names are chosen by inspecting the report mentioned in steps 3-5.

a.	Google Analytics function parameters:
 eshop: eshop ID (required)
 corresp: cluster names (required)
 metrics: metrics to use in tree creation
 geovar: geography related variables
 force: force to re-create tree if exists
b.	Call example:
 createtre(eshop=171,corresp=c("Need-Based","Wandering","Loyal","Misplaced"))
c.	Piwik function parameters:
 eshop: eshop ID (required)
 corresp: cluster names (required)
 metrics: metrics to use in tree creation
 geovar: geography related variables
 force: force to re-create tree if exists
d.	Call example:
 createtrepiw(eshop=171,corresp=c("Need-Based","Wandering","Loyal","Misplaced"))


First execution to make clusters with PRODUCTS of an existing e-shop

1. Load clustering library (ONLY Piwik footprint) “ClusterProductsPiwik.R”

source("CODE/ClusterProductsPiwik.R")

2. Execute clustering function

a.	Function parameters:
 eshop: eshop ID (required)
 date_ini: init date for clustering data
 date_end: end date for clustering data
 numetrics: numeric metrics
 nometrics: nominal metrics
 dims: dimensions related to geography
 valmetrics: value related metrics
 force: force to re-create cluster if exists
b.	Call example:
 clusterprocessProductsPiw(eshopid=171)

3. Check the output report in the following directory:

<project root directory>/output/<e-shop code>/

4. Report is named with the following pattern:

“informeProducts<e-shop code>.pdf”

5. Example for Coozina with ID = 171: “informeProducts171.pdf”

Absolute path: <project root directory>/output/171/informeProducts171.pdf

6. Load “createtreProductsPiw.r” library

source("CODE/createtreProductsPiw.R")

7. Execute the “createtreProductsPiw” function, indicating the name of each cluster in the “corresp” variable. The right cluster names are chosen by inspecting the report mentioned in steps 3-5.

a.	Function parameters:
 eshop: eshop ID (required)
 date_ini: init date for clustering data
 date_end: end date for clustering data
 numetrics: numeric metrics
 nometrics: nominal metrics
 dims: dimensions related to geography
 valmetrics: value related metrics
 force: force to re-create cluster if exists
b.	Call example:
 createtreProductsPiw(eshopid=171,corresp=c("Speciality","Convenience","Unsought"))

Future executions

  • Execute the clustering script based on footprint type as explained above.

Getting Prices

1. Load “getvisitor.r” library

source("CODE/getvisitor.r")

2. Load “loadSource.R” library

source("CODE/lib/loadSource.R")

3. Execute “getPrices” function:

a. 	Function parameters:
 eshop: eshop ID (required)
 date_ini: init date for pricing data (required)
 date_end: end date for pricing data (required)

b. 	Call example:
 pricesData <- getPrices(171,"2015-06-01","2015-07-15")

4. Insert the data obtained in the previous step into the corresponding table (ecHolisticCompetitors) of the database:

a. 	Call example MySQL
 loadSource(databaseType="MySQL",queryType="insert",tableName="ecHolisticCompetitors",colNamesVec=c("Company","Product","date","Price","eShopID"),paths=paths,data=pricesData)
b.	 Call example SqlServer:
 loadSource(databaseType="SqlServer",queryType="insert",tableName="[ECompassREST].[dbo].[ecHolisticCompetitors]",colNamesVec=c("Company","Product","date","Price","eShopID"),paths=paths,data=pricesData)

Database library testing

After having configured the “MySQL” or “SqlServer” database connection as presented above, a simple test can be performed to check proper DB access.

1. Set a variable (in this case “paths”) indicating all relative paths of the project:

setwd("<path of the project directory>/Pipesecompass")
rootPath <- getwd()
libPath <- paste0(getwd(),"/CODE/lib/")
dbConfPath <- paste0(getwd(),"/CODE/conf/")
paths <- list(lib=libPath,dbConf=dbConfPath)

2. Load “loadSource.R” library

source("CODE/lib/loadSource.R")

3. Execute “loadSource” function

Function parameters:
 databaseType: “MySQL” or “SqlServer”
 queryType: “select” or “insert” (currently no other query supported)
 tableName: table name to query
 colNamesVec: vector with required column names
 paths: configuration paths previously prepared (step 1)
 data: data set required for INSERT query type (not for SELECT)

a.	Call example MySQL:
 loadSource(databaseType="MySQL",queryType="insert",tableName="ecProducts",colNamesVec=c("date","city","region","country","productName","clgroup","eShopID"),paths=paths,data=data)
b.	Call example Sql Server:
 loadSource(databaseType="SqlServer",queryType="insert",tableName="[ECompassREST].[dbo].[ecProducts]",colNamesVec=c("date","city","region","country","productName","clgroup","eShopID"),paths=paths,data=data)

In case of doubts

  • Execute the clustering script based on footprint type and follow the instructions that appear as warning or error messages; they will indicate what to do.


Preparation

Service Configuration

Configuration script

availability / location

README / User Manual

availability / location

Configuration steps

… Configuration of REST endpoints at: …

Operation

Service startup procedure

Restarting the service

Service Logs

Recurring Manual Actions / Maintenance Tasks

Other

Limitations of the service

With which parameters does the service scale?

How many concurrent E-Shops, how many concurrent products and how many users/E-Shop customers are possible without causing loss in quality/speed for the hardware described above?

The service scales with the overall number of users over all E-Shops.

If higher scaling were wanted, which of the hardware parameters would need to be increased?
What else would be adjusted for higher scalability?
Which further configuration would be necessary?

Contact Information Data Analyser

José Manuel García Nieto, jnieto@lcc.uma.es, +34 951 952924