Competitors Data Collector
The Competitors’ Data Collector consists of four main classes:
- Scheduler,
- Controller,
- Scraper and
- ProductResolver.
It provides the ECC function “Price Monitor” for collecting product prices and comparing those of the operator’s own e-shop with those of the competitors. The main service and its API are distributed over two separate virtual machines (VMs), as shown in Figure 2. The reason for this distribution is a VPN service which changes the IP address of the VM running the scraping service (VM2) in a regular cycle, because some e-shop websites reject too many requests coming from a single IP address. The VM running the API (VM1), in contrast, needs to be reachable under a fixed address, as other service modules send and receive data by querying the API.
The components on the two VMs communicate over the REST API of the service module: the Scraper fetches data from and writes data into the database running on VM1 through requests to the service API. The database on VM1 stores the clean data, which includes the data provided by the user, e.g. the URLs of the operator’s own e-shop and the competitors’ e-shops, as well as the definitions of the data collection tasks (which price data shall be collected for which product, for which competitors, and in which time period). In addition, it holds the collected product and price data for the defined products and the specified competitors, which is displayed in the Price Monitor of the ECC. The database running on VM2 stores the raw data gathered from the e-shop websites; these data have to be matched against the users’ product requests in order to find, within the data collected from an entire e-shop website, the data sets belonging to the collection jobs defined by the users. The database on VM2 also stores information about running and finished data collection jobs, which is required by the Scheduler.
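The write path from VM2 to VM1 can be sketched as a single REST call. The base URL, endpoint path, and payload fields below are illustrative assumptions made for this sketch; they are not the actual API of the service module.

```python
import json
from urllib.parse import urljoin

# Illustrative internal base URL; the real address and API paths
# of the service on VM1 are assumptions here.
API_BASE = "http://vm1.internal:8080/api/"

def build_price_update(job_id, product, price, currency):
    """Assemble the request the Scraper on VM2 would send to the
    service API on VM1 to store one collected price record."""
    url = urljoin(API_BASE, f"jobs/{job_id}/prices")
    payload = {"product": product, "price": price, "currency": currency}
    return url, json.dumps(payload)

# Sending would then be one HTTP POST, e.g. with the requests library:
#   requests.post(url, data=body, headers={"Content-Type": "application/json"})
url, body = build_price_update(42, "USB-C cable", 9.99, "EUR")
print(url)
print(body)
```

Keeping all writes behind the API in this way means the Scraper never needs direct database credentials for VM1.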
The components of the service module have the following functions:
- Scheduler The Scheduler manages the execution of the data collection jobs defined by the users. It is responsible for ensuring that every job runs on time and that no two jobs collecting data from the same website run simultaneously, so that the system does not effectively mount a denial-of-service attack on the target site. The Scheduler is executed by a Scheduled Windows Task (Windows cron job) which runs every 30 minutes.
- Controller The Controller fetches from the database on VM1 all information required to run a specific data collection job. Having collected this information, it invokes first the Scraper and then the ProductResolver, provides them with their input, and takes and processes their output. Finally, it writes the result data into the database on VM1.
- Scraper The Scraper walks through all pages of an e-shop website, checks the pages for product records, extracts the product records from the web pages, identifies and extracts defined product attributes within the extracted records, and writes the information to the database on VM2. It comprises three sub-components:
- Crawler: The Crawler collects all links of an e-shop website down to the third level of the site and stores them in the database on VM2. The collected links are updated every three weeks.
- LightExtractor: The LightExtractor analyses a given web page for the occurrence of product lists. If a page includes a product list, it identifies and extracts the product records within that list.
- AttributeExtractor: The AttributeExtractor analyses the product records extracted by the LightExtractor and identifies and extracts pre-defined product attributes such as current price, regular price, currency, product name, link to the detail page, and (link to the) product image.
- ProductResolver Currently, the ProductResolver includes only a single component called ProductMatcher. Future versions of the service module will include an additional component which identifies and extracts further product attributes, such as product colour, manufacturer, or product units, by using semantic data sources such as ontologies or Linked Open Data (LOD) stores.
- ProductMatcher: The ProductMatcher takes the results of one run of a collection job stored in the database on VM2, filters out the product price information for the products of the e-shops defined in the collection job, and assigns it to the job data. The resulting data is returned by the ProductMatcher and stored in the database on VM1 by the Controller.
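The Scheduler’s rule that no two jobs may collect data from the same website at the same time can be sketched as a simple filter over the due jobs. The job representation below (dicts with a `url` key) is a simplifying assumption for illustration, not the actual data model.

```python
from urllib.parse import urlparse

def select_runnable(due_jobs, running_jobs):
    """Pick due jobs whose target host is not already being scraped.

    due_jobs / running_jobs: lists of dicts with a 'url' key
    (an assumed representation of a collection job).
    At most one job per host runs at a time, so the service never
    floods a single e-shop with parallel requests.
    """
    busy = {urlparse(j["url"]).netloc for j in running_jobs}
    selected = []
    for job in due_jobs:
        host = urlparse(job["url"]).netloc
        if host not in busy:
            selected.append(job)
            busy.add(host)  # also deduplicate hosts within this batch
    return selected

running = [{"url": "https://shop-a.example/catalog"}]
due = [
    {"url": "https://shop-a.example/sale"},      # skipped: host busy
    {"url": "https://shop-b.example/products"},  # selected
    {"url": "https://shop-b.example/offers"},    # skipped: same host in batch
]
print([j["url"] for j in select_runnable(due, running)])
```

In the described setup, the Scheduled Windows Task would invoke such a selection every 30 minutes; jobs skipped in one cycle are simply retried in the next.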
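The AttributeExtractor step can be illustrated with a minimal sketch that pulls a few of the pre-defined attributes out of one product record. The regexes and the HTML snippet below are assumptions for demonstration; real e-shop markup varies, and the actual extraction rules of the AttributeExtractor are not reproduced here.

```python
import re

def extract_attributes(record_html):
    """Extract a few pre-defined attributes from one product record.
    The patterns below are illustrative only."""
    attrs = {}
    name = re.search(r'class="name"[^>]*>([^<]+)<', record_html)
    price = re.search(r'(\d+[.,]\d{2})\s*(€|EUR|\$|USD)', record_html)
    link = re.search(r'href="([^"]+)"', record_html)
    if name:
        attrs["product_name"] = name.group(1).strip()
    if price:
        # Normalise a German-style decimal comma to a float value.
        attrs["current_price"] = float(price.group(1).replace(",", "."))
        attrs["currency"] = price.group(2)
    if link:
        attrs["detail_link"] = link.group(1)
    return attrs

record = ('<a href="/p/123"><span class="name">USB-C cable</span>'
          '<span class="price">9,99 €</span></a>')
print(extract_attributes(record))
```

In the real service these extracted attributes would be what the ProductMatcher later filters and assigns to the user-defined collection jobs.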
Physical Hardware Characteristics
| Model: | |
| Processor: | |
| RAM: | |
| Hard Drive Space: | |
| Network Connection: | |
| Hypervisor used: | VMware |
| Physical Load Balancing: | none |
Virtual Machine Hardware Specifications and Operating System
As mentioned in the previous sections, the API of the service is separated from the data collection modules; the two parts therefore run on two separate VMs. The VMs have the following specifications:
| Guest Operating System: | Windows 8 Enterprise, 64 Bit |
| Processor: | 2 Processors, 2.27GHz |
| RAM: | 32 GB |
| Hard Drive Space VM: | 50 GB (VM1) + 500 GB (VM2) |
| Network Connection: | 1Gbit |
| Minimum required Network Connection: | |
VM1 needs to be reachable from external networks (the internet). VM2 should not be reachable from outside.