Competitors Data Collector

From E-COMPASS_Info_Guide
 
Latest revision as of 23:39, 30 December 2015

Overview

[Figure: Overview of the Competitors' Data Collector]

The Competitors’ Data Collector comprises four main classes:

  1. Scheduler,
  2. Controller,
  3. Scraper and
  4. ProductResolver.

It provides the ECC function "Price Monitor" for collecting product prices from the competitors' e-shops and comparing them with the prices of the user's own e-shop. The main service and its API are distributed over two separate virtual machines (VMs), as shown in Figure 2. The reason for this distribution is the need to run a VPN service that changes the IP address of the VM running the scraping service (VM2) in a regular cycle, as some e-shop websites do not accept too many requests coming from a single IP address. The VM running the API (VM1), however, needs to be reachable at a fixed address, as the other service modules send and receive data by querying the API.

The components on the two VMs communicate over the REST API of the service module; the scraper thus fetches data from and writes data into the database on VM1 through requests to the service API. The database on VM1 stores the clean data: the data provided by the user, e.g. the URL of the user's own e-shop and of the competitors' e-shops, and the definitions of the data collection tasks (which price data shall be collected, for which products, from which competitors, and in which time period). Additionally, it holds the collected product and price data for the defined products and the specified competitors, which is displayed in the Price Monitor of the ECC. The database on VM2 stores the raw data gathered from the e-shop websites; this data must be matched against the users' product requests in order to find, within the data collected from the whole e-shop websites, the data sets belonging to the collection jobs defined by the users. The VM2 database additionally stores information about running and finished data collection jobs, which is required by the scheduler.

The components of the service module have the following functions:

  1. Scheduler: The Scheduler manages the execution of the data collection jobs defined by the users. It is responsible for ensuring that every job runs on time and that no two jobs collect data from the same website simultaneously, so that the system does not cause a denial of service. The Scheduler is executed by a scheduled Windows task (Windows cronjob) that runs every 30 minutes.
  2. Controller: The Controller fetches from the database on VM1 all information required to run a specific data collection job. Having collected the necessary information, it controls the Scraper and afterwards the ProductResolver, provides them with input, and takes and processes their output. Finally, it writes the result data into the database on VM1.
  3. Scraper: The Scraper walks through all pages of an e-shop website, checks the pages for product records, extracts the product records from the webpages, identifies and extracts defined product attributes within the extracted records, and writes the information to the database on VM2.
    1. Crawler: The Crawler collects all links of an e-shop website down to the third level of the website and stores them in the database on VM2. The collected links are updated every three weeks.
    2. LightExtractor: The LightExtractor analyses a given webpage for the occurrence of product lists. If a webpage includes a product list, it identifies and extracts the product records within it.
    3. AttributeExtractor: The AttributeExtractor analyses the product records extracted by the LightExtractor and identifies and extracts pre-defined product attributes such as current price, regular price, currency, product name, link to the detail page and (link to) the product image.
  4. ProductResolver: Currently, the ProductResolver includes only a single component, the ProductMatcher. Future versions of the service module will include an additional component that identifies and extracts further product attributes such as product colour, manufacturer or units by using semantic data such as ontologies or Linked Open Data (LOD) stores.
    1. ProductMatcher: The ProductMatcher takes the results of one run of a collection job stored in the database on VM2, filters out the price information for the products of the e-shops defined in the collection job, and assigns it to the job data. The resulting data is returned by the ProductMatcher and stored in the database on VM1 by the Controller.
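As a minimal illustration of the Scheduler's core constraint (run due jobs, but never two jobs against the same website at once), the following sketch may help; the class and field names are assumptions, not the actual implementation:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Job:
    job_id: int
    website: str          # competitor e-shop this job scrapes
    due_at: datetime
    running: bool = False

def select_runnable_jobs(jobs, now):
    """Return the due jobs that may start now: at most one new job
    per website, and none for a website that is already busy."""
    busy = {j.website for j in jobs if j.running}
    runnable = []
    for job in sorted(jobs, key=lambda j: j.due_at):
        if job.running or job.due_at > now:
            continue
        if job.website in busy:
            continue  # would hammer the same e-shop -> postpone
        busy.add(job.website)
        runnable.append(job)
    return runnable
```

A scheduled Windows task invoking this selection every 30 minutes would match the behaviour described above.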

Physical Hardware Characteristics

Model: Cisco UCS B200 M4 and M2 / Cisco UCS B230 M2
Processor: 2 socket CPUs with 12, 20 or 24 cores
RAM: 256 GB
Hard Drive Space: SAN Storage mirrored HP 3par System 7400c with 100 TB each
Network Connection: between VMs: 10 Gbit, SAN Storage connection: 8 Gbit FibreChannel at minimum, outward: 10 Gbit Ethernet
Hypervisor used: VMware ESX 6 with vSphere Center 6 in the Cluster
Physical Load Balancing: none

Virtual Machine Hardware Specifications and Operating System

As mentioned in the previous sections, the API of the service is separated from the data collection modules; the two parts therefore run on two separate VMs with the following specifications:

Guest Operating System: Windows 8 Enterprise, 64 Bit
Processor: 2 Processors, 2.27GHz, 4 cores (VM1) + 8 cores (VM2)
RAM: 8 GB (VM1) + 32 GB (VM2)
Hard Drive Space VM: 50 GB (VM1) + 500 GB (VM2)
Network Connection: 10 Gbit/s
Minimum required Network Connection: no info available

VM1 needs to be available from external networks (internet). VM2 should not be available from outside.

Service Environment and Set-up on VM1

The API of the system and the corresponding database are located on VM1. The API is based on the Flask microframework for Python and runs within an Apache webserver. The database is a MySQL database. The interface is implemented in Python. For setting up the API, please download and install the following software:

Required Software
  Apache 2.4: http://httpd.apache.org/
  Python 2.7: https://www.python.org/download/releases/2.7/
  MySQL Server 5.6 (Community Edition): https://dev.mysql.com/downloads/mysql/
  Mod_wsgi for Apache 2.4 and Python 2.7: http://www.lfd.uci.edu/~gohlke/pythonlibs/#mod_wsgi

Software Licenses

Please indicate if a commercial provider would need to buy commercial licenses of a certain software used for operating the service and – if so – what cost this may produce approximately

Windows Environment Variables

To complete the Python installation, the Windows environment variable PATH needs to include the following entries:
C:\Python27\;C:\Python27\Scripts\;"C:\Program Files\MySQL\MySQL Server 5.6\bin";
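A quick way to check that the PATH is complete is to look up the required executables; this small sketch uses only the standard library (the tool names are the ones the set-up below relies on):

```python
import shutil

def check_path(tools):
    """Return the tools that are NOT reachable via the PATH variable."""
    return [tool for tool in tools if shutil.which(tool) is None]

# After setting PATH as above, both executables should be found:
missing = check_path(["python", "mysql"])
if missing:
    print("Not found on PATH:", ", ".join(missing))
```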

Installation of Mod_wsgi

  1. Rename the downloaded file to mod_wsgi.so
  2. Place mod_wsgi.so into C:\Apache2.4\modules
  3. Open the following file
    C:\Apache2.4\conf\httpd.conf
    and insert the following line
    LoadModule wsgi_module modules/mod_wsgi.so
    and define the script alias and the app directory as follows
    <Directory c:/price_monitor>
        Require all granted
    </Directory>
    WSGIScriptAlias / c:/price_monitor/app.wsgi
  4. Save httpd.conf
  5. Restart Apache server
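The WSGIScriptAlias above points Apache at c:/price_monitor/app.wsgi, whose contents are not shown in this guide. A minimal sketch, assuming the Flask application object is named app and lives in a module app.py inside c:/price_monitor, could be:

```python
# app.wsgi -- WSGI entry point loaded by mod_wsgi (sketch; module and
# object names are assumptions, not the actual E-COMPASS code)
import sys

# Make the application package importable for the Apache worker process
sys.path.insert(0, "c:/price_monitor")

# mod_wsgi looks for a module-level callable named "application"
from app import app as application
```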

Installation of Flask

  1. Install easy_install:
    1. Download the ez_setup.py file from https://bootstrap.pypa.io/ez_setup.py
    2. Save it to C:\Python27\Scripts\
    3. Open the command line, go to C:\Python27\Scripts\ and run:
      python ez_setup.py
  2. Install pip by running the following command: easy_install pip
  3. Install flask by running the command: pip install -Iv flask==0.10.1
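With Flask installed, an endpoint of the kind the Price Monitor API exposes can be sketched as follows; the route and payload are assumptions for illustration, not the actual E-COMPASS interface:

```python
# Minimal Flask REST endpoint in the style of the Price Monitor API.
# The route name and response fields are assumed, not documented here.
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/api/prices/<int:product_id>")
def get_prices(product_id):
    # In the real service this would query the MySQL database on VM1
    # for the collected competitor prices of the given product.
    return jsonify({"product_id": product_id, "prices": []})
```

In production the application is not started with app.run() but served by Apache through the app.wsgi entry point described above.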

Preparation of MySQL database

  1. Open the command line and type: mysql -u root -p
  2. Enter the root password (given during the installation of MySQL)
  3. Create the database and user for the Price Monitor:
    1. mysql > CREATE DATABASE ecompass;
    2. mysql > CREATE USER 'ecompass'@'localhost' IDENTIFIED BY 'datamining$2014';
    3. mysql > USE ecompass;
    4. mysql > GRANT ALL PRIVILEGES ON *.* TO 'ecompass'@'localhost';
    5. mysql > exit;
  4. Create the database tables for the Price Monitor:
    1. Open the command line and go to c:/price_monitor
    2. Run the following command: python db_create.py
    3. COMMENT: For migrating the database after a change, run: python db_migrate.py
  5. Restart Apache webserver
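The db_create.py script itself is not reproduced in this guide. Since SQLAlchemy appears in the required-library list, it plausibly looks something like the sketch below; the table and column names are assumptions, not the actual Price Monitor schema:

```python
# Hypothetical db_create.py sketch based on SQLAlchemy.
from sqlalchemy import Column, Float, Integer, String, create_engine
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class CollectedPrice(Base):
    """One scraped price observation for a monitored product (assumed schema)."""
    __tablename__ = "collected_prices"
    id = Column(Integer, primary_key=True)
    product_name = Column(String(255))
    competitor_url = Column(String(255))
    current_price = Column(Float)
    currency = Column(String(8))

def create_tables(uri):
    """Create all missing tables on the database behind the given URI."""
    engine = create_engine(uri)
    Base.metadata.create_all(engine)  # creates only tables that do not exist yet
    return engine
```

For the set-up above the URI would point at the ecompass MySQL database created in the previous steps.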

Installation of the E-COMPASS Price Monitor API

  1. Place the price_monitor folder in c:/
  2. Open the command line and go to c:/
  3. Run the following command to create the database
    ...

Service Environment and Set-up on VM2

The Scraper and ProductResolver (Product Matching Component) are located on VM2. Those components are based on Python and use a MySQL database for storing the collected product and price data. Thus, the following software is required and needs to be installed:
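The extraction step performed by the LightExtractor/AttributeExtractor can be illustrated with a small standard-library sketch; the production code uses the libraries listed below, and the HTML pattern and record format here are simplified assumptions:

```python
import re

# Simplified stand-in for the AttributeExtractor: pull product name,
# price and currency out of one product record's HTML snippet.
RECORD_RE = re.compile(
    r'<span class="name">(?P<name>[^<]+)</span>.*?'
    r'<span class="price">(?P<price>[\d.,]+)\s*(?P<currency>EUR|USD)</span>',
    re.S,
)

def extract_attributes(record_html):
    """Return the assumed product attributes, or None if no record is found."""
    m = RECORD_RE.search(record_html)
    if not m:
        return None
    return {
        "name": m.group("name").strip(),
        "current_price": float(m.group("price").replace(",", ".")),
        "currency": m.group("currency"),
    }
```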

Required Software
  Python 2.7: https://www.python.org/download/releases/2.7/
  MySQL Server 5.6 (Community Edition): https://dev.mysql.com/downloads/mysql/
  Firefox 41.0.2: https://ftp.mozilla.org/pub/firefox/releases/

Software Licenses

Please indicate if a commercial provider would need to buy commercial licenses of a certain software used for operating the service and – if so – what cost this may produce approximately

Python Libraries

Additionally, several Python libraries are required to run the service on VM2. To install them, open the command line and run a pip install command for each:

  1. SQLAlchemy 0.9.7
  2. BeautifulSoup 3.2.1
  3. beautifulsoup4 4.3.2 (both installations of BeautifulSoup are required)
  4. cssselect 0.9.1
  5. cssutils 1.0
  6. chardet 2.3.0
  7. goslate 1.3.0
  8. langdetect 1.0.5
  9. mechanize 0.2.5
  10. nltk 3.0.4
  11. py-translate 1.0.3
  12. python-Levenshtein 0.12.0
  13. rdflib 4.1.2
  14. requests 2.4.3
  15. selenium 2.48.0
  16. simplejson 3.7.3
  17. tinycss 0.3
  18. tld 0.7.2
  19. utils 0.5

Preparation of MySQL database

  1. Open the command line and type: mysql -u root -p
  2. Enter the root password (given during the installation of MySQL)
  3. Create the database and user for the Price Monitor:
    1. mysql > CREATE DATABASE ecompass;
    2. mysql > CREATE USER 'ecompass'@'localhost' IDENTIFIED BY 'datamining$2014';
    3. mysql > USE ecompass;
    4. mysql > GRANT ALL PRIVILEGES ON *.* TO 'ecompass'@'localhost';
    5. mysql > exit;
  4. Create the database tables for the Price Monitor:
    1. Open the command line and go to c:/price_monitor
    2. Run the following command: python db_create.py
    3. COMMENT: For migrating the database after a change, run: python db_migrate.py
  5. Restart Apache webserver

Service Configuration

To start the Scheduler that runs the data collection jobs, a Windows scheduled task is required. Go to the Windows Task Scheduler and create a new scheduled task that triggers the batch file c:/price_monitor/scheduler.bat every 30 minutes.

Configuration script

  availability / location: C:\price_monitor\config.json
  contents: MySQL configuration, URI of API on VM1
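The layout of config.json is not documented here; the sketch below loads it with keys assumed from the description (MySQL configuration plus the URI of the API on VM1):

```python
import json

# Assumed layout of C:\price_monitor\config.json -- the real key
# names are not documented in this guide.
EXAMPLE_CONFIG = """
{
  "mysql": {"host": "localhost", "user": "ecompass",
            "password": "datamining$2014", "database": "ecompass"},
  "api_uri": "http://vm1.example.org/api"
}
"""

def load_config(text):
    """Parse the configuration and fail early if a required section is missing."""
    cfg = json.loads(text)
    for key in ("mysql", "api_uri"):
        if key not in cfg:
            raise KeyError("config.json is missing '%s'" % key)
    return cfg
```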

README / User Manual

availability / location

Configuration steps

  1. Changes in C:\price_monitor\config.json
  2. Set-up of a Windows task which starts C:\price_monitor\scheduler.bat
  3. Configuration of REST endpoints at:

Operation

Service start-up procedure

Start MySQL and Apache, create Windows tasks for starting C:\price_monitor\scheduler.bat

Restarting the service

MySQL and Apache need to be started and the scheduler.bat within c:\price_monitor needs to be started by a Windows task.

Service Logs

  • Apache logs
  • MySQL logs

Recurring Manual Actions / Maintenance Tasks

VPN Service

Some e-shops break the connection to an external machine that produces too many hits on their website. To hide the IP address of VM2 from the e-shops where data is collected, the installation of a VPN service that hides and regularly changes the machine's own IP address is therefore recommended. One of the following services should be installed:

  1. https://www.perfect-privacy.com/
  2. https://www.hidemyass.com/

Limitations of the service

With which parameters does the service scale?

  COMMENT: How many concurrent e-shops, how many concurrent products and how many users/e-shop customers are possible without causing a loss in quality/speed on the hardware described above?

  The parameter most relevant for the scaling of this module is the number of scraping jobs, i.e. the number of monitored products multiplied by the number of competitors for which these products are monitored, or expressed mathematically: Σ E-Shops ( Σ Products ( Number of Competitors ) ). The user testing during the project came near the limits of the CPU capacity of VM2 with 10 monitored competitors, while only 25% of the RAM was in use. Storage is not a limit, as the scraping data does not need to be kept for more than 2 days, and the database limits are not in sight with the current usage.

If higher scaling were wanted, which of the hardware parameters would need to be increased?

  CPU, RAM

What else would be adjusted for higher scalability?

  If more CPUs/cores are available, further parallelization of the program code is necessary. The data extraction is already parallelized; however, this part is not speed-critical and therefore not relevant for scaling.

Which further configuration would be necessary?
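The job-count formula above can be sketched numerically; the shop and product figures below are made-up examples:

```python
# Total number of scraping jobs: for every e-shop, sum over its
# monitored products the number of competitors per product.
def total_jobs(shops):
    """shops: {shop_name: {product_name: number_of_competitors}}"""
    return sum(
        sum(competitors_per_product.values())
        for competitors_per_product in shops.values()
    )

# Made-up example: two e-shops, each product watched at a few competitors
example = {
    "shop-a": {"p1": 10, "p2": 10},   # 20 jobs
    "shop-b": {"p1": 5},              # 5 jobs
}
```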

Contact Information Competitors’ Data Collector Service

Andrea Horch, andrea.horch@iao.fraunhofer.de, +49 711 970-2349