Competitors Data Collector

From E-COMPASS_Info_Guide
Revision as of 23:39, 30 December 2015 by Falkner (talk | contribs)

Overview

Overview of the Competitors' Data Collector

The Competitors’ Data Collector comprises four main classes:

  1. Scheduler,
  2. Controller,
  3. Scraper and
  4. ProductResolver.

It provides the ECC function ≫Price Monitor≪ for collecting the competitors' product prices and comparing them with those of the user's own e-shop. The main service and its API are distributed over two separate virtual machines (VMs), as shown in Figure 2. The reason for this distribution is the need to run a VPN service which changes the IP address of the VM running the scraping service (VM2) in a regular cycle, because some e-shop websites do not accept too many requests coming from a single IP address. The VM running the API (VM1), however, needs to remain reachable under a fixed address, as other service modules send and receive data by querying the API.

The components on the two VMs communicate over the REST API of the service module: the Scraper reads and writes data from and into the database on VM1 through requests to the service API. The database on VM1 stores the clean data, which includes the data provided by the user, e.g. the URLs of the user's own e-shop and the competitors' e-shops, and the definitions of the data collection tasks (which price data shall be collected for which product, from which competitors, and in which time period). It also holds the collected product and price data for the defined products and competitors, which is displayed in the Price Monitor of the ECC. The database on VM2 stores the raw data gathered from the e-shop websites; this data must be matched against the users' product requests in order to find, within the data collected from the whole e-shop websites, the data sets belonging to the collection jobs defined by the users. In addition, the VM2 database stores information about running and finished data collection jobs, which the Scheduler requires.
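Since the Scraper on VM2 reaches the VM1 database only through the REST API, a result upload amounts to an HTTP POST with a JSON payload. A minimal sketch of such a call (the endpoint path, host name and payload layout are assumptions not given in this guide, and the snippet uses Python 3's standard library for self-containedness, while the service itself targets Python 2.7):

```python
import json
import urllib.request

# Hypothetical API location on VM1; the real host and routes are not documented here.
API_BASE = "http://vm1.example.org/api"

def build_payload(job_id, records):
    """Serialize the price records of one collection-job run for the VM1 REST API."""
    return json.dumps({"job_id": job_id, "records": records}).encode("utf-8")

def push_results(job_id, records):
    """POST the collected price data from VM2 to the service API on VM1."""
    request = urllib.request.Request(
        API_BASE + "/jobs/%d/results" % job_id,
        data=build_payload(job_id, records),
        headers={"Content-Type": "application/json"},
    )
    return urllib.request.urlopen(request)  # network call; not executed in this sketch
```

The same pattern applies in the other direction: the Controller fetches job definitions from the API instead of opening a direct database connection across the VM boundary.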

The components of the service module have the following functions:

  1. Scheduler: The Scheduler manages the execution of the data collection jobs defined by the users. It is responsible for ensuring that every job runs on time and that no two jobs collecting data from the same website run simultaneously, so that the system does not cause what would amount to a denial-of-service attack. The Scheduler is executed by a scheduled Windows task (the Windows counterpart of a cron job) which runs every 30 minutes.
  2. Controller: The Controller fetches from the database on VM1 all information required to run a specific data collection job. Having collected this information, it controls first the Scraper and afterwards the ProductResolver, providing them with input and taking over and processing their output. Finally, it writes the result data into the database on VM1.
  3. Scraper: The Scraper walks through all pages of an e-shop website, checks the pages for product records, extracts the product records from the webpages, identifies and extracts defined product attributes within the extracted product records, and writes the information to the database on VM2.
    1. Crawler: The Crawler collects all links of an e-shop website until the third level of the website and stores the information to the database on VM2. The collected links are updated every three weeks.
    2. LightExtractor: The LightExtractor analyses a given webpage for the occurrence of product lists. If a webpage includes a product list, it identifies and extracts the product records within the list.
    3. AttributeExtractor: The AttributeExtractor analyses the product records extracted by the LightExtractor and identifies and extracts pre-defined product attributes such as current price, regular price, currency, product name, link to the detail page and (link to the) product image.
  4. ProductResolver: Currently, the ProductResolver includes only a single component, the ProductMatcher. Future versions of the service module will include an additional component which identifies and extracts further product attributes, such as product colour, manufacturer or product units, using semantic data sources such as ontologies or Linked Open Data (LOD) stores.
    1. ProductMatcher: The ProductMatcher takes the results of one run of a collection job stored in the database on VM2, filters out the product price information for the products of the e-shops defined in the collection job, and assigns it to the job data. The resulting data is returned by the ProductMatcher and written to the database on VM1 by the Controller.
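The Scheduler's rule that no two jobs may scrape the same website concurrently can be sketched as a small selection step over the pending jobs (the job representation and function name here are illustrative, not taken from the actual code):

```python
def select_runnable(pending_jobs, running_jobs):
    """Pick pending jobs whose target site is not already being scraped.

    Each job is a (job_id, site) tuple. A site already claimed by a running
    job, or by a job selected earlier in this cycle, is skipped so that no
    e-shop receives request bursts from several parallel jobs.
    """
    busy_sites = {site for _, site in running_jobs}
    selected = []
    for job_id, site in pending_jobs:
        if site not in busy_sites:
            selected.append(job_id)
            busy_sites.add(site)  # reserve the site for this scheduling cycle
    return selected
```

A job skipped in one cycle simply stays pending and is reconsidered when the scheduled Windows task fires again 30 minutes later.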

Physical Hardware Characteristics

Model: Cisco UCS B200 M4 and M2 / Cisco UCS B230 M2
Processor: 2 socket CPUs with 12, 20 or 24 cores
RAM: 256 GB
Hard Drive Space: SAN Storage mirrored HP 3par System 7400c with 100 TB each
Network Connection: between VMs: 10 Gbit, SAN Storage connection: 8 Gbit FibreChannel at minimum, outward: 10 Gbit Ethernet
Hypervisor used: VMware ESX 6 with vSphere Center 6 in the Cluster
Physical Load Balancing: none

Virtual Machine Hardware Specifications and Operating System

As mentioned in the previous sections, the API of the service is separated from the data collection modules; the two parts therefore run on two separate VMs with the following specifications:

Guest Operating System: Windows 8 Enterprise, 64 Bit
Processor: 2 processors, 2.27 GHz, 4 cores (VM1) + 8 cores (VM2)
RAM: 8 GB (VM1) + 32 GB (VM2)
Hard Drive Space VM: 50 GB (VM1) + 500 GB (VM2)
Network Connection: 10 Gbit/s
Minimum required Network Connection: no info available

VM1 needs to be available from external networks (internet). VM2 should not be available from outside.

Service Environment and Set-up on VM1

The API of the system and the corresponding database are located on VM1. The API is implemented in Python, based on the Flask microframework, and runs within an Apache web server; the database is a MySQL database. To set up the API, download and install the following software:

Required Software
Software Download
Apache 2.4 http://httpd.apache.org/
Python 2.7 https://www.python.org/download/releases/2.7/
MySQL Server 5.6 (Community Edition) https://dev.mysql.com/downloads/mysql/
Mod_wsgi for Apache 2.4 and Python 2.7 http://www.lfd.uci.edu/~gohlke/pythonlibs/#mod_wsgi

Software Licenses

Please indicate whether a commercial provider would need to buy commercial licenses for any software used to operate the service and, if so, approximately what cost this would incur.

Windows Environment Variables

To complete the Python installation, the Windows environment variable PATH needs to include the following entries:
C:\Python27\;C:\Python27\Scripts\;"C:\Program Files\MySQL\MySQL Server 5.6\bin";
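A quick way to check that all required directories actually ended up on PATH is a small helper like the following (the function name is illustrative; the directory list mirrors the entries above):

```python
# Directories the guide requires on PATH; adjust to the local install locations.
REQUIRED = [
    r"C:\Python27",
    r"C:\Python27\Scripts",
    r"C:\Program Files\MySQL\MySQL Server 5.6\bin",
]

def missing_path_entries(path_value, required=REQUIRED):
    """Return the required directories absent from a Windows PATH string."""
    entries = {entry.strip().strip('"').rstrip("\\").lower()
               for entry in path_value.split(";") if entry}
    return [need for need in required
            if need.rstrip("\\").lower() not in entries]
```

Run it against the value of PATH (e.g. `os.environ["PATH"]`); an empty result means the variable is set correctly.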

Installation of Mod_wsgi

  1. The downloaded file should be renamed to ≫mod_wsgi.so≪
  2. mod_wsgi.so needs to be placed into C:\Apache2.4\modules
  3. Open the following file
    C:\Apache2.4\conf\httpd.conf
    and insert the following line
    LoadModule wsgi_module modules/mod_wsgi.so
    and define the script alias and the app directory as follows
    <Directory c:/price_monitor>
        Require all granted
    </Directory>
    WSGIScriptAlias / c:/price_monitor/app.wsgi
  4. Save httpd.conf
  5. Restart Apache server
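The WSGIScriptAlias above points at c:/price_monitor/app.wsgi, a file the guide does not show. A typical minimal version for a Flask application looks roughly as follows (the module name `app` and the variable name `app` are assumptions; mod_wsgi only requires that the file expose a callable named `application`):

```python
# c:/price_monitor/app.wsgi — WSGI entry point loaded by mod_wsgi.
import sys

# Make the application package importable for the Apache worker process.
sys.path.insert(0, "c:/price_monitor")

# mod_wsgi looks for a module-level callable named "application".
from app import app as application  # assumes the Flask app object lives in app.py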

Installation of Flask

  1. Install easy_install:
    1. Download the ez_setup.py file from https://bootstrap.pypa.io/ez_setup.py
    2. Save it to C:\Python27\Scripts\
    3. Open the command line, go to C:\Python27\Scripts\ and run:
      python ez_setup.py
  2. Install pip by running the following command: easy_install pip
  3. Install Flask by running the command: pip install -Iv flask==0.10.1

Preparation of MySQL database

  1. Open the command line and type: mysql -u root -p
  2. Enter root password (given during the installation of MySQL)
  3. Create database and user for Price Monitor:
    1. mysql > CREATE DATABASE ecompass;
    2. mysql > CREATE USER 'ecompass'@'localhost' IDENTIFIED BY 'datamining$2014';
    3. mysql > USE ecompass;
    4. mysql > GRANT ALL PRIVILEGES ON *.* TO 'ecompass'@'localhost';
    5. mysql > exit;
  4. Create database tables for Price Monitor:
    1. Open command line and go to c:/price_monitor
    2. Run the following command: python db_create.py
    3. COMMENT: For migrating the database after a change run: python db_migrate.py
  5. Restart Apache webserver
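The script db_create.py itself is not reproduced in this guide. A rough sketch of what such a script does is shown below; the table names and columns are invented for illustration, and an in-memory SQLite database stands in for the MySQL server so the snippet is self-contained:

```python
import sqlite3

# Illustrative schema only; the real Price Monitor tables are defined in db_create.py.
SCHEMA = """
CREATE TABLE IF NOT EXISTS collection_job (
    id INTEGER PRIMARY KEY,
    product_url TEXT NOT NULL,
    competitor_url TEXT NOT NULL,
    interval_hours INTEGER NOT NULL
);
CREATE TABLE IF NOT EXISTS price_record (
    id INTEGER PRIMARY KEY,
    job_id INTEGER REFERENCES collection_job(id),
    current_price TEXT,
    currency TEXT,
    collected_at TEXT
);
"""

def create_tables(connection):
    """Create the schema and return the resulting table names, sorted."""
    connection.executescript(SCHEMA)
    return [row[0] for row in connection.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
```

Against the real installation, db_create.py would connect to the `ecompass` MySQL database created above instead of SQLite.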

Installation of the E-COMPASS Price Monitor API

  1. Place the price_monitor folder to c:/
  2. Open the command line and go to c:/
  3. Run the following command to create the database
    ...

Service Environment and Set-up on VM2

The Scraper and ProductResolver (Product Matching Component) are located on VM2. Those components are based on Python and use a MySQL database for storing the collected product and price data. Thus, the following software is required and needs to be installed:

Required Software
Software Download
Python 2.7 https://www.python.org/download/releases/2.7/
MySQL Server 5.6 (Community Edition) https://dev.mysql.com/downloads/mysql/
Firefox 41.0.2 https://ftp.mozilla.org/pub/firefox/releases/

Software Licenses

Please indicate whether a commercial provider would need to buy commercial licenses for any software used to operate the service and, if so, approximately what cost this would incur.

Python Libraries

Additionally, several Python libraries are required to run the service on VM2. To install them, open the command line and run a pip install command for each of the following:

  1. SQLAlchemy 0.9.7
  2. BeautifulSoup 3.2.1
  3. beautifulsoup4 4.3.2 (both installations of BeautifulSoup are required)
  4. cssselect 0.9.1
  5. cssutils 1.0
  6. chardet 2.3.0
  7. goslate 1.3.0
  8. langdetect 1.0.5
  9. mechanize 0.2.5
  10. nltk 3.0.4
  11. py-translate 1.0.3
  12. python-Levenshtein 0.12.0
  13. rdflib 4.1.2
  14. requests 2.4.3
  15. selenium 2.48.0
  16. simplejson 3.7.3
  17. tinycss 0.3
  18. tld 0.7.2
  19. utils 0.5

Preparation of MySQL database

  1. Open the command line and type: mysql -u root -p
  2. Enter root password (given during the installation of MySQL)
  3. Create database and user for Price Monitor:
    1. mysql > CREATE DATABASE ecompass;
    2. mysql > CREATE USER 'ecompass'@'localhost' IDENTIFIED BY 'datamining$2014';
    3. mysql > USE ecompass;
    4. mysql > GRANT ALL PRIVILEGES ON *.* TO 'ecompass'@'localhost';
    5. mysql > exit;
  4. Create database tables for Price Monitor:
    1. Open command line and go to c:/price_monitor
    2. Run the following command: python db_create.py
    3. COMMENT: For migrating the database after a change run: python db_migrate.py

Service Configuration

To start the Scheduler that runs the data collection jobs, a Windows scheduled task is required. Open the Windows Task Scheduler and create a new scheduled task which triggers the batch file c:/price_monitor/scheduler.bat every 30 minutes.
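The contents of scheduler.bat are not shown in this guide; a plausible minimal version simply changes into the service directory and launches the Scheduler's Python entry point (the script name scheduler.py is an assumption):

```
@echo off
REM c:/price_monitor/scheduler.bat — triggered every 30 minutes by the Windows task.
cd /d c:\price_monitor
python scheduler.py
```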

Configuration script

availability / location: C:\price_monitor\config.json
contents: MySQL configuration, URI of the API on VM1

README / User Manual

availability / location

Configuration steps

  1. Changes in C:\price_monitor\config.json
  2. Set up a Windows task which starts C:\price_monitor\scheduler.bat
  3. Configuration of REST endpoints at:

Operation

Service start-up procedure

Start MySQL and Apache, and create a Windows task which starts C:\price_monitor\scheduler.bat.

Restarting the service

MySQL and Apache need to be started and the scheduler.bat within c:\price_monitor needs to be started by a Windows task.

Service Logs

  • Apache logs
  • MySQL logs

Recurring Manual Actions / Maintenance Tasks

VPN Service

Some e-shops break the connection to an external machine that produces too many hits on their website. To hide the IP address of VM2 from the e-shops where data is collected, it is therefore recommended to install a VPN service which hides and regularly changes the machine's own IP address, for example one of the following:

  1. https://www.perfect-privacy.com/
  2. https://www.hidemyass.com/

Limitations of the service

With which parameters does the service scale?

How many concurrent E-Shops, how many concurrent products and how many users/E-Shop customers are possible without causing loss in quality/speed for the hardware described above?

The parameter most relevant for the scaling of this module is the number of scraping jobs, i.e. the number of monitored products multiplied by the number of competitors for which these products are monitored, or expressed mathematically: Σ E-Shops ( Σ Products ( Number of Competitors ) ). The user testing during the project came near the limits of the CPU capacity of VM2 with 10 monitored competitors, while only 25% of the RAM was in use. Storage is not a limiting factor, as the scraping data does not need to be kept for more than two days, and the database limits are not in sight at the current usage.
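The job-count formula above can be written as a small worked example (the shop, product and competitor numbers are made up for illustration):

```python
def total_scraping_jobs(shops):
    """Total jobs = sum over e-shops of the per-product competitor counts.

    `shops` maps an e-shop name to a list with one entry per monitored
    product, each entry being that product's number of competitors:
    sum over e-shops of (sum over products of (number of competitors)).
    """
    return sum(sum(competitors) for competitors in shops.values())

# Hypothetical load: two shops monitoring a few products each.
example = {
    "shopA": [3, 3, 2],  # three products with 3, 3 and 2 competitors
    "shopB": [4, 4],     # two products with 4 competitors each
}
```

With these numbers the module would have to run 16 scraping jobs per collection cycle.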
If higher scaling were wanted, which of the hardware parameters would need to be increased?

CPU and RAM.

What else would be adjusted for higher scalability?

If more CPUs/cores are available, further parallelization of the program code is necessary. The data extraction is already parallelized; however, it is not speed-critical and therefore not relevant for scaling.

Which further configuration would be necessary?

Contact Information Competitors’ Data Collector Service

Andrea Horch, andrea.horch@iao.fraunhofer.de, +49 711 970-2349