Data Analyser
Contents
Physical Hardware Characteristics
| Model: | HP ProLiant DL380 G5/HP ProLiant DL360 G5 |
| Processor: | 8 cores @ 2 GHz / 8 cores @ 2.5 GHz |
| RAM: | 32GB/40GB |
| Hard Drive Space: | 4TB Shared |
| Network Connection: | 10 Gbit Ethernet |
| Hypervisor used: | VMware ESX 6 with vSphere Center |
| Physical Load Balancing: | none |
Virtual Machine Hardware Specifications and Operating System
| Guest Operating System: | |
| Processor: | Dual 2-core Intel processor / 5-core Intel i7 |
| RAM: | 16 GB |
| Hard Drive Space VM: | 100 GB internal |
| Network Connection: | 10 Gbit/s Ethernet |
| Minimum required Network Connection: | no info available |
Service Environment and Set-up on VM1
| Software | Download |
|---|---|
| Software-Name | Software-Link |
| Software-Name | Software-Link |
| Software-Name | Software-Link |
Software Licenses
Please indicate whether a commercial provider would need to buy commercial licenses for any software used to operate the service and, if so, what approximate cost this would incur.
OS Environment Variables
…
Installation
The following chapter provides detailed explanations of the different technical modules developed by CICtourGUNE for the Data Analyser Module within the E-Compass Data Mining Services.
The following subsections are described:
- Database
- REST services in JAVA
- Data Mining in R
To give a better understanding of the different developments, their location, distribution and the protocols used are described.
DATABASE
First of all, a MySQL or SQL Server database must be configured so that the REST services and the R data mining processes can both work properly. Currently these two database types are supported by both the JAVA REST services and the R processes, but other database types could be considered (if necessary).
1. MySQL general schema
2. SQL Server general schema
The following lines describe the required database schema and tables.
MySQL
After creating a new database on a MySQL server, the following statements create the required tables:
1. E-shops table
Figure 1: E-Shops table CREATE code for MySQL
2. Competitors table
Figure 2: Competitors table CREATE code for MySQL
3. Products table
Figure 3: Products table CREATE code for MySQL
4. Visitors table
Figure 4: Visitors table CREATE code for MySQL
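The authoritative CREATE statements are the ones shown in the figures above. Purely as a hedged sketch, the e-shops table could be created from R roughly as follows; the table name “ecEshops”, the column types and the connection values are assumptions inferred from the “Eshop.java” bean described later, not the official schema:

```r
# Hedged sketch only: the authoritative CREATE statement is Figure 1.
# Table name, column types and credentials below are assumptions.
library(DBI)
library(RMySQL)

con <- dbConnect(RMySQL::MySQL(), host = "localhost", dbname = "ECompassREST",
                 user = "user", password = "pass")
dbExecute(con, "
  CREATE TABLE IF NOT EXISTS ecEshops (
    eshopID       INT          NOT NULL PRIMARY KEY,
    name          VARCHAR(255),
    url           VARCHAR(255),
    currency      VARCHAR(10),
    chamber      VARCHAR(255),
    analytics     VARCHAR(50),
    ecommerceFunc VARCHAR(50),
    newEshop      BOOLEAN
  )")
dbDisconnect(con)
```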
SQL Server
After creating a new database on a SQL Server instance, the following statements create the required tables:
1. E-shops table
Figure 5: E-Shops table CREATE code for SQL Server
2. Competitors table
Figure 6: Competitors table CREATE code for SQL Server
3. Products table
Figure 7: Products table CREATE code for SQL Server
4. Visitors table
Figure 8: Visitors table CREATE code for SQL Server
REST SERVICES IN JAVA
The following lines describe the REST methods developed by CICtourGUNE.
The REST services provided by CICtourGUNE were originally served at the following base URL address:
However, after the transfer of the technical infrastructure to UMA, this will be the final base URL address:
Project distribution
Java project for REST services is distributed in the following packages:
1. Bean: Contains Java classes to manage elements with different properties, for example an e-shop
a. “Eshop.java”: Contains the variables that hold the information of an “Eshop” element
i. int eshopID
ii. String name
iii. String url
iv. String currency
v. String chamber
vi. String analytics
vii. String ecommerceFunc
viii. boolean newEshop
2. Dao: Contains Java methods to connect to and interact with different databases. Currently “SqlServer” and “MySQL” databases are supported.
3. Math: Contains different mathematical operations required to calculate elasticity over competitors (HolisticCompetitor)
4. Rest API: Contains the code for working with the JSON format. In addition, the following REST methods are defined (they are detailed in the next section):
a. Get data from existing e-shop for given date period and metric (GET)
b. Add new e-shop (POST)
c. Update info of existing e-shop (PUT)
d. Delete existing e-shop (PUT)
5. Utils: Contains global variables for common use across the whole JAVA project, so that variables used in different places of the project can be changed in a single place.
Each “*.java” file is internally documented so that UMA can understand what has been developed and where to make changes (if needed).
Figure 9: JAVA project for REST services
Description of REST methods
Get data from existing e-shop for given date period and metric (GET)
Base URL: http://ecompass.khaos.uma.es/ecompassAPI/getDataAnalyzerInfo/results.json
- Variables:
metric: “visitors”, “products” or “competitors”
eshopid: e-shop ID to get data from
startdate: start date threshold for data collection
enddate: end date threshold for data collection
dbtype: “MySQL” or “SqlServer”
- Pseudo-URL:
http://ecompass.khaos.uma.es/ecompassAPI/getDataAnalyzerInfo/results.json?metric=<metric>&startdate=<yyyy-mm-dd>&enddate=<yyyy-mm-dd>&eshopid=<eshopid>&dbtype=<dbtype>
- URL example:
http://ecompass.khaos.uma.es/ecompassAPI/getDataAnalyzerInfo/results.json?metric=visitors&startdate=2015-06-01&enddate=2015-06-10&eshopid=168&dbtype=MySQL
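As a sketch only, the example request above could be issued from the R console; the httr and jsonlite packages used here are assumptions and not part of the delivered project:

```r
# Sketch: issue the example GET request from R
# (httr/jsonlite are assumptions, not part of the delivered project).
library(httr)
library(jsonlite)

res <- GET("http://ecompass.khaos.uma.es/ecompassAPI/getDataAnalyzerInfo/results.json",
           query = list(metric = "visitors", startdate = "2015-06-01",
                        enddate = "2015-06-10", eshopid = 168, dbtype = "MySQL"))
stop_for_status(res)
data <- fromJSON(content(res, as = "text", encoding = "UTF-8"))
```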
Add new e-shop (POST)
Base URL: http://ecompass.khaos.uma.es/ecompassAPI/getDataAnalyzerInfo/notifyNewEshop
- Variables:
dbtype: “MySQL” or “SqlServer”
eshopid: new e-shop’s ID number
name: new e-shop’s name
url: new e-shop’s URL address
- Pseudo-URL:
http://ecompass.khaos.uma.es/ecompassAPI/getDataAnalyzerInfo/notifyNewEshop?dbtype=<dbtype>&eshopid=<eshopid>&name=<name>&url=<url>
- URL example:
Update info of existing e-shop (PUT)
Base URL: http://ecompass.khaos.uma.es/ecompassAPI/getDataAnalyzerInfo/updateEshop
- Variables (in the path):
{dbtype}: “MySQL” or “SqlServer”
{eshopid}: e-shop ID to be updated
- Pseudo-URL:
http://ecompass.khaos.uma.es/ecompassAPI/getDataAnalyzerInfo/updateEshop/{dbtype}/{eshopid}
- URL example:
http://ecompass.khaos.uma.es/ecompassAPI/getDataAnalyzerInfo/updateEshop/MySQL/999
PAYLOAD (additional content to be sent in JSON format):
- Variables:
eshopID: new e-shop ID for the e-shop in URL address (999 in the example)
name: new e-shop name
url: new e-shop URL address
- PAYLOAD example:
{"eshopID":171,"name":"Coozina","url":"http://www.coozina.gr"}
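Putting the path variables and the JSON payload together, the update call from the example could be sketched from R as follows (the httr package is an assumption, not part of the delivered project):

```r
# Sketch: send the example JSON payload to the PUT endpoint.
# E-shop 999 gets the new ID 171, name "Coozina" and its URL.
library(httr)

res <- PUT("http://ecompass.khaos.uma.es/ecompassAPI/getDataAnalyzerInfo/updateEshop/MySQL/999",
           body = list(eshopID = 171, name = "Coozina",
                       url = "http://www.coozina.gr"),
           encode = "json")
stop_for_status(res)
```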
Delete existing e-shop (PUT)
Base URL: http://ecompass.khaos.uma.es/ecompassAPI/getDataAnalyzerInfo/deleteEshop
- Variables (in the path):
{dbtype}: “MySQL” or “SqlServer”
{eshopid}: existing e-shop’s ID
{name}: e-shop name to be deleted, corresponding to “eshopid” variable
- Pseudo-URL:
http://ecompass.khaos.uma.es/ecompassAPI/getDataAnalyzerInfo/deleteEshop/{dbtype}/{eshopid}/{name}
- URL example:
http://ecompass.khaos.uma.es/ecompassAPI/getDataAnalyzerInfo/deleteEshop/MySQL/999/Coozina
DATA MINING IN R
The following lines describe required files for using data mining processes in R developed by CICtourGUNE.
File Description
The following files are the ones existing in the R Project:
CODE directory
| File | Description |
|---|---|
| classifynew.r | Classifies new observations of Google. Auxiliary script called by “Clustergoogle.r”. |
| classifynewPiw.r | Classifies new observations of Piwik. Auxiliary script called by “ClusterPiwik.R”. |
| classifynewProductsPiw.r | Same functionality as “classifynewPiw.r” but oriented to products instead of visitors. |
| Clustergoogle.r | Main script for Google sites: makes clusters and generates the preliminary report. |
| ClusterPiwik.R | Main script for Piwik sites: makes clusters and generates the preliminary report. |
| ClusterProductsPiwik.R | Same functionality as “ClusterPiwik.R” but oriented to products instead of visitors. |
| createtre.r | Generates decision trees in order to rapidly classify new observations of Google sites by using “classifynew.r”. In addition, it inserts the clusters’ names and adds the first cluster’s observations to the DB of the REST service. |
| createtrepiw.r | Generates decision trees in order to rapidly classify new observations of Piwik sites by using “classifynewPiw.r”. In addition, it inserts the clusters’ names and adds the first cluster’s observations to the DB of the REST service. |
| createtreProductsPiw.r | Same functionality as “createtrepiw.r” but oriented to products instead of visitors. |
| DataTablePiwik.Rmd | Auxiliary “Rmd” file for generating the preliminary report in Piwik. Called by “ClusterPiwik.R”. |
| DataTableProductsPiwik.Rmd | Auxiliary “Rmd” file for generating the preliminary products report in Piwik. Called by “ClusterProductsPiwik.R”. |
| DescriptDataTable.Rmd | Auxiliary “Rmd” file for generating the preliminary report in Google. Called by “Clustergoogle.r”. |
| getvisitor.r | Auxiliary script for generating the data table in Google. There is no equivalent file for Piwik because it is not needed. |
| installer.r | Script for installing all required packages (check it just in case any is missing). |
CODE/conf directory
| File | Description |
|---|---|
| confMySQL.yml | Configuration file to set the connection configuration for the MySQL database. |
| confSqlServer.yml | Configuration file to set the connection configuration for the SQL Server database. |
CODE/lib directory
| File | Description |
|---|---|
| loadSource.R | Generic client which allows connecting to the database specified by the “dbtype” parameter (“MySQL” or “SqlServer”). |
| loadMySQL.R | MySQL driver used by “loadSource.R” for SELECT, INSERT or other operations against the database server specified in “confMySQL.yml”. |
| loadSqlServer.R | SQL Server driver used by “loadSource.R” for SELECT, INSERT or other operations against the database server specified in “confSqlServer.yml”. |
Modus Operandi
Continuing with the R data mining section, the following lines describe the right order in which to execute all processes and obtain the results. For a better understanding, code to be executed in the “R console” is shown as separate code lines.
First execution in the machine
The first step, before executing anything else, is to install all the required R packages:
- Load “installer.r” library
source("CODE/installer.r")
- Execute “installPackages” function
installPackages()
Database configuration
In order to properly interact with any database, first of all database IP address and user credentials have to be configured. This has to be done in the corresponding file depending on the database type (“MySQL” or “SqlServer”).
In the case of the SQL Server database type, regardless of whether it is a Windows or a Linux server, an ODBC connection is needed to establish the connection.
- Windows server: configure it at “Control Panel → Administrative Tools → Data Sources (ODBC)”
- Linux server: the FreeTDS tool must be installed; it allows creating ODBC connections to “SqlServer” instances
These are the parameters to be configured:
- confMySQL.yml: Configuration file for “MySQL” database server
host: IP address or server name
name: database name
user: user with credentials in the database
pass: password for previously set user
- confSqlServer.yml: Configuration file for “SqlServer” database server
odbcName: previously configured ODBC connection name
user: user with credentials in the database
pass: password for previously set user
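Using the parameter names listed above, a filled-in “confMySQL.yml” might look like the following; all values are placeholders, and the database name is an assumption:

```yaml
# Placeholder values only - adapt to the actual server
host: 192.168.1.10
name: ECompassREST
user: ecompass
pass: secret
```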
First execution to make clusters with VISITORS of an existing e-shop
1. Load clustering library based on footprint type
a. “Clustergoogle.r” in case of Google Analytics
source("CODE/Clustergoogle.r")
b. “ClusterPiwik.R” in case of Piwik
source("CODE/ClusterPiwik.R")
2. Execute clustering function based on footprint type
a. Google Analytics function parameters:
eshop: eshop ID (required)
date_ini: init date for clustering data
date_end: end date for clustering data
metrics: metrics
valmetr: value related metrics
geovar: geography related variables
geocom: geography related common variables
force: force to re-create cluster if exists
b. Call example:
clusterprocess(eshop=171)
c. Piwik function parameters:
eshop: eshop ID (required)
date_ini: init date for clustering data
date_end: end date for clustering data
numetrics: numeric metrics
nometrics: nominal metrics
dims: dimensions related to geography
valmetrics: value related metrics
force: force to re-create cluster if exists
d. Call example:
clusterprocesspiw(eshopid=171)
3. Check the output report in the following directory:
<project root directory>/output/<e-shop code>/
4. Report is named with the following pattern:
“informe<e-shop code>.pdf”
5. Example for ID = 171: “informe171.pdf”
Absolute path: <project root directory>/output/171/informe171.pdf
6. Load the corresponding “createtre” library based on the footprint type
a. “createtre.r” in case of Google Analytics
source("CODE/createtre.r")
b. “createtrepiw.r” in case of Piwik
source("CODE/createtrepiw.r")
7. Execute the corresponding “createtre” function based on the footprint type, indicating the name of each cluster in the “corresp” variable. The right cluster names are chosen by inspecting the report mentioned in steps 3–5.
a. Google Analytics function parameters:
eshop: eshop ID (required)
corresp: cluster names (required)
metrics: metrics to use in tree creation
geovar: geography related variables
force: force to re-create tree if exists
b. Call example:
createtre(eshop=171,corresp=c("Need-Based","Wandering","Loyal","Misplaced"))
c. Piwik function parameters:
eshop: eshop ID (required)
corresp: cluster names (required)
metrics: metrics to use in tree creation
geovar: geography related variables
force: force to re-create tree if exists
d. Call example:
createtrepiw(eshop=171,corresp=c("Need-Based","Wandering","Loyal","Misplaced"))
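Putting steps 1–7 together for a Google Analytics e-shop, a complete first run in the R console looks roughly like this (e-shop 171 and the cluster names are taken from the examples above):

```r
# Complete first run for e-shop 171 (Google Analytics footprint)
source("CODE/Clustergoogle.r")
clusterprocess(eshop = 171)   # clusters + report at output/171/informe171.pdf

# ...inspect the report, decide the cluster names, then:
source("CODE/createtre.r")
createtre(eshop = 171,
          corresp = c("Need-Based", "Wandering", "Loyal", "Misplaced"))
```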
First execution to make clusters with PRODUCTS of an existing e-shop
1. Load clustering library (ONLY Piwik footprint) “ClusterProductsPiwik.R”
source("CODE/ClusterProductsPiwik.R")
2. Execute clustering function
a. Function parameters:
eshop: eshop ID (required)
date_ini: init date for clustering data
date_end: end date for clustering data
numetrics: numeric metrics
nometrics: nominal metrics
dims: dimensions related to geography
valmetrics: value related metrics
force: force to re-create cluster if exists
b. Call example:
clusterprocessProductsPiw(eshopid=171)
3. Check the output report in the following directory:
<project root directory>/output/<e-shop code>/
4. Report is named with the following pattern:
“informeProducts<e-shop code>.pdf”
5. Example for Coozina with ID = 171: “informeProducts171.pdf”
Absolute path: <project root directory>/output/171/informeProducts171.pdf
6. Load “createtreProductsPiw.r” library
source("CODE/createtreProductsPiw.R")
7. Execute the “createtreProductsPiw” function, indicating the name of each cluster in the “corresp” variable. The right cluster names are chosen by inspecting the report mentioned in steps 3–5.
a. Function parameters:
eshop: eshop ID (required)
date_ini: init date for clustering data
date_end: end date for clustering data
numetrics: numeric metrics
nometrics: nominal metrics
dims: dimensions related to geography
valmetrics: value related metrics
force: force to re-create cluster if exists
b. Call example:
createtreProductsPiw(eshopid=171,corresp=c("Speciality","Convenience","Unsought"))
Future executions
- Execute the clustering script based on footprint type as explained above.
Getting Prices
1. Load “getvisitor.r” library
source(“CODE/getvisitor.r”)
2. Load “loadSource.R” library
source("CODE/lib/loadSource.R")
3. Execute “getPrices” function:
a. Function parameters:
eshop: eshop ID (required)
date_ini: init date for pricing data (required)
date_end: end date for pricing data (required)
b. Call example:
pricesData <- getPrices(171,"2015-06-01","2015-07-15")
4. Insert the data obtained in the previous step into the corresponding table (ecHolisticCompetitors) of the database:
a. Call example MySQL
loadSource(databaseType="MySQL",queryType="insert",tableName="ecHolisticCompetitors",colNamesVec=c("Company","Product","date","Price","eShopID"),paths=paths,data=pricesData)
b. Call example SqlServer:
loadSource(databaseType="SqlServer",queryType="insert",tableName="[ECompassREST].[dbo].[ecHolisticCompetitors]",colNamesVec=c("Company","Product","date","Price","eShopID"),paths=paths,data=pricesData)
Database library testing
After having configured the “MySQL” or “SqlServer” database connection as presented above, some simple tests can be performed in order to check that the DB can be accessed properly.
1. Set a variable (in this case “paths”) indicating all the relative paths of the project:
setwd("<path of the project directory>/Pipesecompass")
rootPath <- getwd()
libPath <- paste0(getwd(),"/CODE/lib/")
dbConfPath <- paste0(getwd(),"/CODE/conf/")
paths <- list(lib=libPath,dbConf=dbConfPath)
2. Load “loadSource.R” library
source("CODE/lib/loadSource.R")
3. Execute “loadSource” function
a. Function parameters:
databaseType: “MySQL” or “SqlServer”
queryType: “select” or “insert” (currently no other query supported)
tableName: table name to query
colNamesVec: vector with required column names
paths: configuration paths previously prepared (step 1)
data: data set required for INSERT query type (not for SELECT)
b. Call example MySQL:
loadSource(databaseType="MySQL",queryType="insert",tableName="ecProducts",colNamesVec=c("date","city","region","country","productName","clgroup","eShopID"),paths=paths,data=data)
c. Call example Sql Server:
loadSource(databaseType="SqlServer",queryType="insert",tableName="[ECompassREST].[dbo].[ecProducts]",colNamesVec=c("date","city","region","country","productName","clgroup","eShopID"),paths=paths,data=data)
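A read-only smoke test is often safer than an INSERT. Assuming, as the parameter list above suggests, that “select” queries need no “data” argument and return a data frame, a hedged sketch would be:

```r
# Hedged sketch: simple SELECT against the products table to verify DB access.
# Assumes "select" needs no data argument and returns a data frame.
result <- loadSource(databaseType = "MySQL",
                     queryType = "select",
                     tableName = "ecProducts",
                     colNamesVec = c("date", "productName", "eShopID"),
                     paths = paths)
head(result)
```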
In case of doubts
- Execute the clustering script based on footprint type and follow the instructions that appear as warning or error messages. They will indicate what to do.
Preparation
…
Service Configuration
…
Configuration script
availability / location
README / User Manual
availability / location
Configuration steps
… Configuration of REST endpoints at: …
Operation
Service startup procedure
…
Restarting the service
…
Service Logs
…
Recurring Manual Actions / Maintenance Tasks
…
Other
…
Limitations of the service
- With which parameters does the service scale?
How many concurrent E-Shops, how many concurrent products and how many users/E-Shop customers are possible without a loss of quality/speed on the hardware described above?
- The service scales with the overall number of users across all E-Shops
- If higher scaling was wanted, which of the hardware parameters would need to be increased?
- …
- What else would be adjusted for higher scalability?
- …
- Which further configuration would be necessary?
- …
Contact Information Data Analyser
José Manuel García Nieto, jnieto@lcc.uma.es, +34 951 952924