Difference between revisions of "Competitors Data Collector"

From E-COMPASS_Info_Guide

Revision as of 14:44, 9 December 2015

The Competitors’ Data Collector includes four main classes:

  1. Scheduler,
  2. Controller,
  3. Scraper and
  4. ProductResolver.

It provides the ECC function ≫Price Monitor≪, which collects product prices and compares the prices of the own e-shop with those of its competitors. The main service and its API are distributed over two separate virtual machines (VMs), as shown in Figure 2. The reason for this distribution is that the VM running the scraping service (VM2) needs a VPN service that changes its IP address at regular intervals, because some e-shop websites do not accept too many requests from a single IP address. The VM running the API (VM1), however, must be reachable at a fixed address, because other service modules send and receive data by querying the API.

Figure: Overview of the Competitors' Data Collector

The components on the two VMs communicate through the REST API of the service module: the Scraper reads data from and writes data to the database on VM1 by sending requests to the service API. The database on VM1 stores the clean data. This includes the data provided by the user, e.g. the URLs of the own e-shop and of the competitors' e-shops, and the definitions of the data collection tasks (which price data shall be collected for which products, from which competitors, and in which time period). It also includes the collected product and price data for the defined products and the specified competitors, which is displayed in the Price Monitor of the ECC. The database on VM2 stores the data gathered from the e-shop websites. This data still has to be matched against the users' product requests, so that the records belonging to a user-defined data collection job can be found within the data collected from the entire e-shop websites. In addition, the database on VM2 stores information about running and finished data collection jobs, which the Scheduler requires.

The components of the service module have the following functions:

  1. Scheduler: The Scheduler manages the execution of the data collection jobs defined by the users. It ensures that every job runs on time and that several jobs collecting data from the same website never run at the same time, so that the system does not cause a denial of service on that website. The Scheduler is executed by a scheduled Windows task (the Windows counterpart of a cron job), which runs every 30 minutes.
  2. Controller: The Controller fetches all information required to run a specific data collection job from the database on VM1. Having collected the necessary information, it controls the Scraper and afterwards the ProductResolver, provides them with input, and takes and processes their output. Finally, it writes the result data into the database on VM1.
  3. Scraper: The Scraper walks through all pages of an e-shop website, checks the pages for product records, extracts the product records from the webpages, identifies and extracts the defined product attributes from the extracted product records, and writes the information to the database on VM2.
    1. Crawler: The Crawler collects all links of an e-shop website down to the third level of the website and stores them in the database on VM2. The collected links are updated every three weeks.
    2. LightExtractor: The LightExtractor analyses a given webpage for the occurrence of product lists. If a webpage includes a product list, it identifies and extracts the product records within that list.
    3. AttributeExtractor: The AttributeExtractor analyses the product records extracted by the LightExtractor and identifies and extracts pre-defined product attributes such as current price, regular price, currency, product name, link to the detail page and (link to the) product image.
  4. ProductResolver: Currently, the ProductResolver includes only a single component, the ProductMatcher. In future versions of the service module it will include an additional component which identifies and extracts further product attributes, such as product colour, product manufacturer or product units, by using semantic data such as ontologies or Linked Open Data (LOD) stores.
    1. ProductMatcher: The ProductMatcher takes the results of one run of a collection job stored in the database on VM2, filters out the price information for the products and e-shops defined in the collection job, and assigns it to the job data. The resulting data is returned by the ProductMatcher and stored in the database on VM1 by the Controller.
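
The Scheduler rule described above (run jobs that are due, but never two jobs against the same website at once) could be sketched as follows. This is only an illustration under assumptions: the Job record, its field names and the function select_jobs_to_start are hypothetical and are not taken from the actual service code, whose job schema is not described in this guide.

```python
from collections import namedtuple
from datetime import datetime, timedelta

# Hypothetical job record; the real schema is not described in this guide.
Job = namedtuple("Job", ["job_id", "website", "next_run"])


def select_jobs_to_start(pending, running_websites, now):
    """Return the pending jobs that are due and may be started now.

    At most one job per website is selected, and websites that already
    have a running job are skipped.
    """
    busy = set(running_websites)
    started = []
    # Longest-overdue jobs are considered first.
    for job in sorted(pending, key=lambda j: j.next_run):
        if job.next_run > now:
            continue  # not due yet
        if job.website in busy:
            continue  # another job is already scraping this website
        busy.add(job.website)
        started.append(job)
    return started


now = datetime(2015, 12, 9, 14, 30)
pending = [
    Job(1, "shop-a.example", now - timedelta(minutes=40)),
    Job(2, "shop-a.example", now - timedelta(minutes=10)),
    Job(3, "shop-b.example", now - timedelta(minutes=5)),
    Job(4, "shop-c.example", now + timedelta(minutes=20)),
    Job(5, "shop-d.example", now - timedelta(minutes=1)),
]
# shop-d.example already has a running job.
to_start = select_jobs_to_start(pending, ["shop-d.example"], now)
print([job.job_id for job in to_start])  # prints [1, 3]
```

In this sketch, job 2 is held back because job 1 targets the same website, job 4 is not due yet, and job 5 waits for the job already running on its website; the skipped jobs would be picked up by a later 30-minute run.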

Physical Hardware Characteristics

Model:
Processor:
RAM:
Hard Drive Space:
Network Connection:
Hypervisor used: VMware
Physical Load Balancing: none

Virtual Machine Hardware Specifications and Operating System

As mentioned in the previous sections, the API of the service is separated from the data collection modules, so the two parts run on two separate VMs. The VMs have the following specification:

Guest Operating System: Windows 8 Enterprise, 64 Bit
Processor: 2 Processors, 2.27GHz
RAM: 32 GB
Hard Drive Space VM: 50 GB (VM1) + 500 GB (VM2)
Network Connection: 1Gbit
Minimum required Network Connection:

VM1 needs to be available from external networks (internet). VM2 should not be available from outside.

Service Environment and Set-up on VM1

The API of the system and the corresponding database are located on VM1. The API is based on the Flask microframework for Python and runs within an Apache web server. The database is a MySQL database. The interface is implemented in Python. To set up the API, please download and install the following software:

Required Software

Software                                  Download
Apache 2.4                                http://httpd.apache.org/
Python 2.7                                https://www.python.org/download/releases/2.7/
MySQL Server 5.6 (Community Edition)      https://dev.mysql.com/downloads/mysql/
Mod_wsgi for Apache 2.4 and Python 2.7    http://www.lfd.uci.edu/~gohlke/pythonlibs/#mod_wsgi

Software Licenses

Please indicate if a commercial provider would need to buy commercial licenses of a certain software used for operating the service and – if so – what cost this may produce approximately

Windows Environment Variables

To complete the Python installation, the Windows environment variable PATH needs to include the following entries:
C:\Python27\;C:\Python27\Scripts\;"C:\Program Files\MySQL\MySQL Server 5.6\bin";
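
Whether the entries are actually present can be checked with a few lines of Python. The following is a minimal sketch, not part of the service itself; the function names normalise and missing_path_entries are made up for this example, and the comparison assumes that quotes, trailing backslashes and letter case do not matter, as is usual for Windows paths.

```python
# Directories taken from the PATH setting above.
REQUIRED_DIRS = [
    r"C:\Python27",
    r"C:\Python27\Scripts",
    r"C:\Program Files\MySQL\MySQL Server 5.6\bin",
]


def normalise(entry):
    """Strip whitespace, quotes and trailing backslashes; ignore case."""
    return entry.strip().strip('"').rstrip("\\").lower()


def missing_path_entries(path_value, required=REQUIRED_DIRS):
    """Return the required directories that do not occur in path_value."""
    present = set(normalise(e) for e in path_value.split(";") if e.strip())
    return [d for d in required if normalise(d) not in present]


# Example PATH value in which the Scripts directory was forgotten.
example = r'C:\Windows;C:\Python27\;"C:\Program Files\MySQL\MySQL Server 5.6\bin";'
missing = missing_path_entries(example)
print(missing)
```

On the VM itself, the function could be called with os.environ["PATH"] instead of the example string; an empty list would mean that all three directories are present.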

Installation of Mod_wsgi

  1. Rename the downloaded file to ≫mod_wsgi.so≪.
  2. Place mod_wsgi.so in C:\Apache2.4\modules.
  3. Open the file C:\Apache2.4\conf\httpd.conf, insert the following line

    LoadModule wsgi_module modules/mod_wsgi.so

    and define the script alias and the app directory as follows:

    <Directory c:/price_monitor>
        Require all granted
    </Directory>
    WSGIScriptAlias / c:/price_monitor/app.wsgi

  4. Save httpd.conf.
  5. Restart the Apache server.
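
The WSGIScriptAlias directive points Apache at the file c:/price_monitor/app.wsgi. mod_wsgi expects this file to define a callable named application. The contents of the real app.wsgi are not shown in this guide, so the following is only a sketch of what such a file might look like; the module name app in the comment is an assumption, and the plain WSGI function is a stand-in for the Flask application object so that the example runs without Flask.

```python
import sys

# Make the application folder importable (path from the configuration above).
APP_DIR = "c:/price_monitor"
if APP_DIR not in sys.path:
    sys.path.insert(0, APP_DIR)

# In the real service, the Flask app object would presumably be imported
# here instead, for example:
#     from app import app as application   # hypothetical module name
# The stand-in below is a plain WSGI callable with the same interface.


def application(environ, start_response):
    """Answer every request with a short plain-text placeholder."""
    body = b"Price Monitor API placeholder\n"
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

A placeholder like this can also be useful for checking the Apache and mod_wsgi configuration on its own, before the Flask application and the database are in place.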

Installation of Flask

  1. Install easy_install:
    1. Download the ez_setup.py file from https://bootstrap.pypa.io/ez_setup.py
    2. Save it to C:\Python27\Scripts\
    3. Open the command line, go to C:\Python27\Scripts\ and run:

    python ez_setup.py

  2. Install pip by running the following command: easy_install pip
  3. Install Flask by running the following command: pip install -Iv flask==0.10.1

Preparation of MySQL database

  1. Open the command line and type: mysql -u root -p
  2. Enter the root password (set during the installation of MySQL)
  3. Create the database and the user for the Price Monitor:
    1. mysql> CREATE DATABASE ecompass;
    2. mysql> CREATE USER 'ecompass'@'localhost' IDENTIFIED BY 'datamining$2014';
    3. mysql> USE ecompass;
    4. mysql> GRANT ALL PRIVILEGES ON *.* TO 'ecompass'@'localhost';
    5. mysql> exit;
  4. Create the database tables for the Price Monitor:
    1. Open the command line and go to c:/price_monitor
    2. Run the following command: python db_create.py
    Note: To migrate the database after a change, run: python db_migrate.py
  5. Restart the Apache web server

Installation of the E-COMPASS Price Monitor API

  1. Place the price_monitor folder in c:/
  2. Open the command line and go to c:/
  3. Run the following command to create the database