Competitors’ Data Collector
Contents
- 1 Overview
- 2 Physical Hardware Characteristics
- 3 Virtual Machine Hardware Specifications and Operating System
- 4 Service Environment and Set-up on VM1
- 5 Service Environment and Set-up on VM2
- 6 Service Configuration
- 7 Operation
- 8 VPN Service
- 9 Limitations of the service
- 10 Contact Information Competitors’ Data Collector Service
Overview
The Competitors’ Data Collector comprises four main classes:
- Scheduler,
- Controller,
- Scraper and
- ProductResolver.
It provides the ECC function ≫Price Monitor≪, which collects the product prices of the competitors and compares them with those of the operator’s own e-shop. The main service and its API are distributed over two separate virtual machines (VMs) as shown in Figure 2. The reason for this distribution is the need to run a VPN service which changes the IP address of the VM running the scraping service (VM2) in a regular cycle, because some e-shop websites do not accept too many requests coming from a single IP address. The VM running the API (VM1), however, needs to be reachable under a fixed address, as other service modules send and receive data by querying the API.
The components on the two VMs communicate over the REST API of the service module: the Scraper fetches data from and writes data into the database on VM1 through requests to the service API. The database on VM1 stores the clean data. This includes the data provided by the user, e.g. the URL of the own e-shop and those of the competitors, and the definitions of the data collection tasks (which price data shall be collected for which product, for which competitors and in which time period). Additionally, it holds the collected product and price data for the defined products and competitors, which is displayed in the Price Monitor of the ECC. The database on VM2 stores the raw data gathered from the e-shop websites; this raw data must be assigned to the product requests of the user so that the right data sets for the user-defined collection jobs can be found within the data collected from the whole e-shop websites. The database on VM2 also stores information about running and finished data collection jobs, which is required by the Scheduler.
The components of the service module have the following functions:
- Scheduler: The Scheduler manages the execution of the data collection jobs defined by the users. It is responsible for ensuring that every job runs on time and that no two jobs collecting data from the same website run at the same time, so that the system does not effectively launch a denial-of-service attack. The Scheduler is executed by a scheduled Windows task (the Windows equivalent of a cron job) which runs every 30 minutes.
- Controller: The Controller fetches from the database on VM1 all information required to run a specific data collection job. Having collected the necessary information, it controls first the Scraper and afterwards the ProductResolver, provides them with their input and takes and processes their output. Finally, it writes the result data into the database on VM1.
- Scraper: The Scraper walks through all pages of an e-shop website, checks the pages for product records, extracts the product records from the webpages, identifies and extracts defined product attributes within the extracted product records, and writes the information to the database on VM2.
  - Crawler: The Crawler collects all links of an e-shop website down to the third level of the site and stores them in the database on VM2. The collected links are updated every three weeks.
  - LightExtractor: The LightExtractor analyses a given webpage for the occurrence of product lists. If a webpage includes a product list, it identifies and extracts the product records within that list.
  - AttributeExtractor: The AttributeExtractor analyses the product records extracted by the LightExtractor and identifies and extracts pre-defined product attributes such as current price, regular price, currency, product name, link to the detail page and (link to the) product image.
- ProductResolver: Currently, the ProductResolver includes only a single component, the ProductMatcher. Future versions of the service module will include an additional component which identifies and extracts further product attributes such as product colour, product manufacturer or product units by using semantic data such as ontologies or Linked Open Data (LOD) stores.
  - ProductMatcher: The ProductMatcher takes the results of one run of a collection job stored in the database on VM2, filters out the price information for the products of the e-shops defined in the collection job and assigns it to the job data. The resulting data is returned by the ProductMatcher and stored in the database on VM1 by the Controller.
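A minimal sketch of how these components could interact in one job run is shown below. All class, method and field names (`run_job`, `get_job`, `post_results`, the job dictionary keys) are illustrative assumptions, not the actual implementation; the API client and the scraping side are reduced to stubs:

```python
# Illustrative sketch of one data collection job run (all names are assumptions).

class ApiClient:
    """Stands in for the REST API of the service module on VM1."""
    def __init__(self, jobs):
        self.jobs = jobs        # job definitions stored in the VM1 database
        self.results = {}       # result data written back by the Controller
    def get_job(self, job_id):
        return self.jobs[job_id]
    def post_results(self, job_id, matches):
        self.results[job_id] = matches

class Scraper:
    """Stands in for the crawling/extraction components on VM2."""
    def scrape(self, urls):
        # A real scraper would walk the shop pages; here we return canned records.
        return [{"shop": u, "name": "Widget", "price": 9.99} for u in urls]

class ProductResolver:
    """Matches raw scraped records to the products defined in the job."""
    def match(self, products, records):
        wanted = {p.lower() for p in products}
        return [r for r in records if r["name"].lower() in wanted]

class Controller:
    def __init__(self, scraper, resolver, api):
        self.scraper, self.resolver, self.api = scraper, resolver, api
    def run_job(self, job_id):
        job = self.api.get_job(job_id)                           # 1. fetch job from VM1
        records = self.scraper.scrape(job["competitor_urls"])    # 2. scrape competitors
        matches = self.resolver.match(job["products"], records)  # 3. resolve products
        self.api.post_results(job_id, matches)                   # 4. write back to VM1
        return matches

api = ApiClient({1: {"competitor_urls": ["http://shop-a.example"],
                     "products": ["Widget"]}})
result = Controller(Scraper(), ProductResolver(), api).run_job(1)
```
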
Physical Hardware Characteristics
| Model: | Cisco UCS B200 M4 and M2 / Cisco UCS B230 M2 |
| Processor: | 2 socket CPUs with 12, 20 or 24 cores |
| RAM: | 256 GB |
| Hard Drive Space: | SAN Storage mirrored HP 3par System 7400c with 100 TB each |
| Network Connection: | between VMs: 10 Gbit, SAN Storage connection: 8 Gbit FibreChannel at minimum, outward: 10 Gbit Ethernet |
| Hypervisor used: | VMware ESX 6 with vSphere Center 6 in the Cluster |
| Physical Load Balancing: | none |
Virtual Machine Hardware Specifications and Operating System
As mentioned in the previous sections, the API of the service is separated from the data collection modules; the two parts therefore run on two separate VMs. The VMs have the following specifications:
| Guest Operating System: | Windows 8 Enterprise, 64 Bit |
| Processor: | 2 Processors, 2.27GHz, 4 cores (VM1) + 8 cores (VM2) |
| RAM: | 8 GB (VM1) + 32 GB (VM2) |
| Hard Drive Space VM: | 50 GB (VM1) + 500 GB (VM2) |
| Network Connection: | 10 Gbit/s |
| Minimum required Network Connection: | no info available |
VM1 needs to be available from external networks (internet). VM2 should not be available from outside.
Service Environment and Set-up on VM1
The API of the system and the corresponding database are located on VM1. The API is based on the Flask microframework for Python and runs within an Apache webserver. The database is a MySQL database. The interface is implemented in Python. For setting up the API, please download and install the following software:
| Software | Download |
|---|---|
| Apache 2.4 | http://httpd.apache.org/ |
| Python 2.7 | https://www.python.org/download/releases/2.7/ |
| MySQL Server 5.6 (Community Edition) | https://dev.mysql.com/downloads/mysql/ |
| Mod_wsgi for Apache 2.4 and Python 2.7 | http://www.lfd.uci.edu/~gohlke/pythonlibs/#mod_wsgi |
Software Licenses
Please indicate whether a commercial provider would need to buy commercial licenses for any of the software used for operating the service and, if so, what approximate cost this may produce.
Windows Environment Variables
To complete the Python installation, the Windows environment variable PATH needs to be extended as follows:
C:\Python27\;C:\Python27\Scripts\;"C:\Program Files\MySQL\MySQL Server 5.6\bin";
Installation of Mod_wsgi
- Rename the downloaded file to ≫mod_wsgi.so≪
- Place mod_wsgi.so into C:\Apache2.4\modules
- Open the file C:\Apache2.4\conf\httpd.conf and insert the following line:

      LoadModule wsgi_module modules/mod_wsgi.so

  and define the script alias and the app directory as follows:

      <Directory c:/price_monitor>
          Require all granted
      </Directory>
      WSGIScriptAlias / c:/price_monitor/app.wsgi

- Save httpd.conf
- Restart the Apache server
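The Apache configuration above points WSGIScriptAlias at c:/price_monitor/app.wsgi. A minimal sketch of such a WSGI entry file is shown below; the module name `app` and its Flask object are assumptions about the Price Monitor code, and a stub fallback is included so the file can be exercised standalone:

```python
# Sketch of c:/price_monitor/app.wsgi: mod_wsgi imports this file and calls
# the object named "application". The "app" module name is an assumption.
import sys

# Make the application package importable for the Apache worker process.
sys.path.insert(0, "c:/price_monitor")

try:
    from app import app as application  # the Flask app object
except ImportError:
    # Fallback stub so the entry file can be inspected and tested standalone.
    def application(environ, start_response):
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"price_monitor placeholder"]
```
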
Installation of Flask
- Install easy_install:
  - Download the file ez_setup.py from https://bootstrap.pypa.io/ez_setup.py
  - Save it to C:\Python27\Scripts\
  - Open the command line, go to C:\Python27\Scripts\ and run: python ez_setup.py
- Install pip by running the following command: easy_install pip
- Install flask by running the command: pip install -Iv flask==0.10.1
Preparation of MySQL database
- Open the command line and type: mysql -u root -p
- Enter the root password (given during the installation of MySQL)
- Create the database and a user for the Price Monitor:

      mysql> CREATE DATABASE ecompass;
      mysql> CREATE USER 'ecompass'@'localhost' IDENTIFIED BY 'datamining$2014';
      mysql> USE ecompass;
      mysql> GRANT ALL PRIVILEGES ON *.* TO 'ecompass'@'localhost';
      mysql> exit;

- Create the database tables for the Price Monitor:
  - Open the command line and go to c:/price_monitor
  - Run the following command: python db_create.py
  - Note: for migrating the database after a change run python db_migrate.py
- Restart the Apache webserver
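The db_create.py script mentioned above ships with the Price Monitor, so its contents are not reproduced here. For illustration only, a schema-creation script of this general shape could look as follows; the table and column names are invented, and sqlite3 stands in for the MySQL database so the sketch runs standalone:

```python
# Illustrative only: the real db_create.py targets the MySQL database
# configured above. This sketch shows the general idea of a schema-creation
# script, using the standard-library sqlite3 module.
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS collection_job (
    id INTEGER PRIMARY KEY,
    product_name TEXT NOT NULL,
    competitor_url TEXT NOT NULL,
    interval_minutes INTEGER NOT NULL
);
CREATE TABLE IF NOT EXISTS price_record (
    id INTEGER PRIMARY KEY,
    job_id INTEGER REFERENCES collection_job(id),
    current_price REAL,
    currency TEXT,
    collected_at TEXT
);
"""

def create_schema(conn):
    """Create all tables and return their names for verification."""
    conn.executescript(SCHEMA)
    return [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]

tables = create_schema(sqlite3.connect(":memory:"))
```
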
Installation of the E-COMPASS Price Monitor API
- Place the price_monitor folder into c:/
- Open the command line and go to c:/
- Run the following command to create the database
...
Service Environment and Set-up on VM2
The Scraper and the ProductResolver (product matching component) are located on VM2. These components are based on Python and use a MySQL database for storing the collected product and price data. Thus, the following software is required and needs to be installed:
| Software | Download |
|---|---|
| Python 2.7 | https://www.python.org/download/releases/2.7/ |
| MySQL Server 5.6 (Community Edition) | https://dev.mysql.com/downloads/mysql/ |
| Firefox 41.0.2 | https://ftp.mozilla.org/pub/firefox/releases/ |
Software Licenses
Please indicate whether a commercial provider would need to buy commercial licenses for any of the software used for operating the service and, if so, what approximate cost this may produce.
Python Libraries
Additionally, several Python libraries are required to run the service on VM2. To obtain each library, open the command line and run the corresponding pip install command:
- SQLAlchemy 0.9.7
- BeautifulSoup 3.2.1
- beautifulsoup4 4.3.2 (both installations of BeautifulSoup are required)
- cssselect 0.9.1
- cssutils 1.0
- chardet 2.3.0
- goslate 1.3.0
- langdetect 1.0.5
- mechanize 0.2.5
- nltk 3.0.4
- py-translate 1.0.3
- python-Levenshtein 0.12.0
- rdflib 4.1.2
- requests 2.4.3
- selenium 2.48.0
- simplejson 3.7.3
- tinycss 0.3
- tld 0.7.2
- utils 0.5
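Several of the listed libraries serve the extraction steps described in the Overview (for example BeautifulSoup for HTML parsing, selenium and mechanize for fetching pages, python-Levenshtein for string similarity). The core idea of the AttributeExtractor, pulling a price out of a product record, can be sketched with the standard library alone; the HTML snippet and the regular expression below are illustrative assumptions, not the service's actual extraction rules:

```python
import re

# Illustrative product record as it might be cut out of a category page.
record_html = """
<div class="product">
  <a href="/p/123">Garden Chair</a>
  <span class="price">29,99 &euro;</span>
</div>
"""

# Matches an amount followed by a euro marker (entity, symbol or ISO code).
PRICE_RE = re.compile(r"(\d+(?:[.,]\d{2}))\s*(?:&euro;|€|EUR)")

def extract_price(html):
    """Return (price, currency) found in a product record, or None."""
    m = PRICE_RE.search(html)
    if not m:
        return None
    # Normalise the decimal comma used by many German-language shops.
    return float(m.group(1).replace(",", ".")), "EUR"

price = extract_price(record_html)
```
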
Preparation of MySQL database
- Open the command line and type: mysql -u root -p
- Enter the root password (given during the installation of MySQL)
- Create the database and a user for the Price Monitor:

      mysql> CREATE DATABASE ecompass;
      mysql> CREATE USER 'ecompass'@'localhost' IDENTIFIED BY 'datamining$2014';
      mysql> USE ecompass;
      mysql> GRANT ALL PRIVILEGES ON *.* TO 'ecompass'@'localhost';
      mysql> exit;

- Create the database tables for the Price Monitor:
  - Open the command line and go to c:/price_monitor
  - Run the following command: python db_create.py
  - Note: for migrating the database after a change run python db_migrate.py
Service Configuration
To start the Scheduler which runs the data collection jobs, a Windows scheduled task is required. Go to the Windows Task Scheduler and create a new scheduled task which triggers the batch file c:/price_monitor/scheduler.bat every 30 minutes.
Configuration script
- Availability / location: C:\price_monitor\config.json
- Contents: MySQL configuration, URI of the API on VM1

README / User Manual
- Availability / location:

Configuration steps
- Changes in C:\price_monitor\config.json
- Set-up of a Windows task which starts C:\price_monitor\scheduler.bat
- Configuration of REST endpoints at:
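The contents of config.json are not reproduced in this document. Based on the two items named above (MySQL configuration and the URI of the API on VM1), a hypothetical structure could look like this; every key and value below is an assumption for illustration, not the actual file format:

```python
import json

# Hypothetical structure for C:\price_monitor\config.json -- the real keys
# used by the Price Monitor may differ.
example_config = """
{
  "mysql": {
    "host": "localhost",
    "port": 3306,
    "user": "ecompass",
    "password": "datamining$2014",
    "database": "ecompass"
  },
  "api_uri": "http://vm1.example.org/api"
}
"""

# The service would load the file at start-up roughly like this.
config = json.loads(example_config)
```
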
Operation
Service start-up procedure
Start MySQL and Apache, then create a Windows task which starts C:\price_monitor\scheduler.bat.
Restarting the service
MySQL and Apache need to be started, and scheduler.bat within c:\price_monitor needs to be started by a Windows task.
Service Logs
- Apache logs
- MySQL logs
Recurring Manual Actions / Maintenance Tasks
…
VPN Service
In order to hide the IP address of VM2 from the e-shops where data is collected (some e-shops break the connection to an external machine which produces too many hits on the website), the installation of a VPN service that hides and regularly changes the machine's own IP address is recommended. Thus, a service such as one of the following should be installed:
Limitations of the service
- With which parameters does the service scale?
How many concurrent E-Shops, how many concurrent products and how many users/E-Shop customers are possible without causing loss in quality/speed for the hardware described above?
- The parameter most relevant for the scaling of this module is the number of scraping jobs, i.e. the number of monitored products multiplied by the number of competitors for which these products are monitored, or expressed mathematically: Σ E-Shops ( Σ Products ( number of competitors ) ). The user testing during the project came near the limits of the CPU capacity of VM2 with 10 monitored competitors, while only 25% of the RAM was in use. Storage is not a limit, as the scraping data does not need to be kept for more than 2 days, and the database limits are not in sight with the current usage.
- If higher scaling was wanted, which of the hardware parameters would need to be increased?
- CPU, RAM
- What else would be adjusted for higher scalability?
- If more CPUs/cores are available, further parallelization of the program code is necessary. The data extraction is already parallelized; however, it is not speed-critical and therefore not relevant for scaling.
- Which further configuration would be necessary?
- …
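The scaling formula from the first answer can be made concrete with a small calculation; the shop and product counts below are invented for illustration:

```python
# Total number of scraping jobs per full cycle: for every e-shop, each
# monitored product is scraped once per competitor (illustrative numbers).
shops = {
    "shop_a": {"products": 50, "competitors": 10},
    "shop_b": {"products": 20, "competitors": 5},
}

total_jobs = sum(s["products"] * s["competitors"] for s in shops.values())
# 50 * 10 + 20 * 5 = 600 scraping jobs per full cycle
```
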
Contact Information Competitors’ Data Collector Service
Andrea Horch, andrea.horch@iao.fraunhofer.de, +49 711 970-2349