How It Works

Let's take a closer look at Orbit Web Spider and examine each part separately. You'll see that any search engine owner can use our software to deliver their own distinctive search results, bringing users back to their search engine over and over again.

Anyone can add their site to the Spider database

The Orbit Web Spider Consists of Four Parts:

A Crawler that finds and fetches web pages.
A Parser that analyzes each page.
An Indexer that sorts through every word on every page and stores the resulting index of words in a database.
A Query Processor that evaluates search queries and returns the matching pages.

The Crawler: Orbit Web Spider's Highly Sophisticated Web Crawling Robot

The crawler is the automated component of the Orbit Web Spider that combs the internet for web pages. It finds and retrieves pages and hands them off to the parser. Think of the crawler as a little squirrel scurrying through the forest of cyberspace, collecting links to the pages it visits. When it finds a web server, the crawler works much like an everyday web browser: it sends the server a request for a web page, downloads the entire page, and hands it off to the parser.

However, a crawler also works differently from, and more efficiently than, a basic browser. A web browser is the translation link between humans and the machine language of the Internet: it converts bits and bytes into the words and pictures we understand. That conversion is time- and resource-intensive, at least by the standards of page-loading benchmarks. To conserve resources, the crawler skips that extra step. It stores and manipulates the data it gathers from web servers in its original state, which lets it gather data from the web more efficiently than a browser. Furthermore, the crawler can request thousands of different pages simultaneously.
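
As a rough illustration of the idea above (download the raw bytes, skip rendering, and issue many requests at once), here is a minimal Python sketch. It is not Orbit Web Spider's actual code: the names `fetch_raw` and `fetch_many`, the thread pool, and the worker count are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable
from urllib.request import urlopen


def fetch_raw(url: str, timeout: float = 10.0) -> bytes:
    """Download a page and return its bytes untouched (no rendering)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_many(urls: Iterable[str],
               fetch: Callable[[str], bytes] = fetch_raw,
               max_workers: int = 32) -> Dict[str, bytes]:
    """Fetch several URLs concurrently; URLs whose fetch fails are skipped."""
    results: Dict[str, bytes] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every request first so they run in parallel.
        pending = [(url, pool.submit(fetch, url)) for url in urls]
        for url, future in pending:
            try:
                results[url] = future.result()
            except Exception:
                continue  # a real crawler would log this and retry later
    return results
```

The `fetch` parameter is injectable so the function can be tried out without network access, for example by passing a dictionary lookup in place of `fetch_raw`.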

Orbit Web Spider's crawler collects pages in two ways: through our “Add Url” form and by following links it finds while crawling the web. The system administrator can choose which of these methods to enable. The crawler continuously re-crawls all links stored in the database. To conserve system resources, the administrator can set longer intervals between re-crawls for any or all pages. However, to keep information up to date, some sites should be re-crawled more often than others.

News and current-events pages should be downloaded daily, and pages with stock quotes much more often. To keep your index current, the system lets administrators list pages that will be re-crawled more frequently than others. Combining regular and frequent re-crawling lets the search engine use its resources wisely and maintain an up-to-date index.
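
The scheduling idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 30-day default, the example URLs, their intervals, and the function names are hypothetical and not taken from the product.

```python
from datetime import datetime, timedelta
from typing import Dict

# Assumed default; the real default interval is set by the administrator.
DEFAULT_FETCH_INTERVAL = timedelta(days=30)

# Hypothetical list of pages that should be re-crawled more often.
FREQUENT_PAGES: Dict[str, timedelta] = {
    "http://news.example.com/": timedelta(days=1),
    "http://quotes.example.com/": timedelta(minutes=15),
}


def next_fetch_time(url: str, last_fetched: datetime) -> datetime:
    """Return when the page should be fetched again."""
    interval = FREQUENT_PAGES.get(url, DEFAULT_FETCH_INTERVAL)
    return last_fetched + interval


def is_due(url: str, last_fetched: datetime, now: datetime) -> bool:
    """True if the page's re-crawl time has arrived."""
    return now >= next_fetch_time(url, last_fetched)
```

Pages missing from the frequent list simply fall back to the default interval, so most of the index is refreshed slowly while a few volatile pages are refreshed often.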

Several Parameters That Govern the Crawler:

  • URL-Filters - rules that determine which pages will be processed and which will be omitted
  • Maximum Outlinks Per Page - specifies the maximum number of links to be obtained from one page
  • Ignore Robots - indicates whether robots exclusion rules are ignored or honored
  • Default Fetch Interval - establishes time intervals for page re-crawling
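
To make the parameters above more concrete, here is a hedged Python sketch of how such settings might be represented and applied. The field names, the default values, and the "+"/"-" regular-expression rule format are assumptions for illustration; the product's real configuration syntax is not described here.

```python
import re
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class CrawlerConfig:
    # Hypothetical rule format: ("+", regex) accepts, ("-", regex) rejects.
    # The first rule whose regex matches the URL decides.
    url_filters: List[Tuple[str, str]] = field(default_factory=lambda: [
        ("-", r"\.(gif|jpg|png|css|js)$"),
        ("+", r"^https?://"),
    ])
    max_outlinks_per_page: int = 100
    ignore_robots: bool = False
    default_fetch_interval_days: int = 30


def url_allowed(config: CrawlerConfig, url: str) -> bool:
    """Apply the URL filters; URLs matching no rule are omitted."""
    for sign, pattern in config.url_filters:
        if re.search(pattern, url):
            return sign == "+"
    return False


def select_outlinks(config: CrawlerConfig, links: List[str]) -> List[str]:
    """Keep allowed links only, capped at the per-page maximum."""
    allowed = [link for link in links if url_allowed(config, link)]
    return allowed[:config.max_outlinks_per_page]
```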

Orbit Web Spider's HTML Parser

The crawler returns the full text of the pages it finds to the parser. First, the parser checks the page's content type. If the content is neither text nor HTML, the parser stops processing that page; otherwise, it continues parsing the page's content. While parsing, the parser takes into account tags such as meta-robots and base.

The parser looks for hyperlinks in the following tags:

<a> <form> <area> <frame>
<iframe> <script> <link> <img>
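
The parsing steps described in this section can be sketched with Python's standard `html.parser` module. This is an illustration, not the product's parser: the mapping from each tag to a URL-bearing attribute, the handling of "nofollow", and the names `LinkExtractor` and `extract_links` are assumptions.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin

# Assumed mapping from each tag to the attribute that carries its URL.
LINK_ATTRS = {
    "a": "href", "area": "href", "link": "href",
    "form": "action",
    "frame": "src", "iframe": "src", "script": "src", "img": "src",
}


class LinkExtractor(HTMLParser):
    def __init__(self, page_url: str) -> None:
        super().__init__()
        self.base_url = page_url   # replaced if a <base> tag is found
        self.follow = True         # cleared by <meta name="robots" content="nofollow">
        self.raw_links: List[str] = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "base" and attr.get("href"):
            self.base_url = urljoin(self.base_url, attr["href"])
        elif tag == "meta" and (attr.get("name") or "").lower() == "robots":
            if "nofollow" in (attr.get("content") or "").lower():
                self.follow = False
        elif tag in LINK_ATTRS and attr.get(LINK_ATTRS[tag]):
            self.raw_links.append(attr[LINK_ATTRS[tag]])

    def links(self) -> List[str]:
        """Return absolute URLs, or nothing if the page forbids following."""
        if not self.follow:
            return []
        return [urljoin(self.base_url, link) for link in self.raw_links]


def extract_links(page_url: str, content_type: str, html: str) -> List[str]:
    """Skip non-text content, then collect links from the supported tags."""
    ctype = content_type.lower()
    if not (ctype.startswith("text/") or "html" in ctype):
        return []
    parser = LinkExtractor(page_url)
    parser.feed(html)
    parser.close()
    return parser.links()
```

Relative links are resolved only after the whole page has been read, so a base tag applies to every link on the page regardless of where it appears.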

Orbit Web Spider's Indexer

The purpose of the indexer is to create an index of web pages and keep it available for search query processing. The index is where Orbit Web Spider stores the data used in searches: keywords, URLs of pages, and a relevance metric (the range of correspondence) between each keyword and page. To improve search performance, Orbit Web Spider does not index common words called stop words (e.g., is, on, or, of, how, and why, as well as certain single digits and single letters). Stop words are so common that they do little to narrow a search, so they can be safely discarded during indexing. The range of correspondence is established after the content of all pages has been analyzed. Let's call this range 'the score'.
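
A toy inverted index shows how stop words drop out during indexing. This Python sketch makes several simplifying assumptions: the stop-word set contains only the examples named above, every single-character token is discarded (the product discards only certain ones), and a plain occurrence count stands in for the relevance metric.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Only the examples named in the text; a real list would be longer.
STOP_WORDS = {"is", "on", "or", "of", "how", "and", "why"}


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def is_stop_word(token: str) -> bool:
    # Simplification: treat every single letter or digit as a stop word.
    return token in STOP_WORDS or len(token) == 1


def build_index(pages: Dict[str, str]) -> Dict[str, Dict[str, int]]:
    """Map each keyword to {url: occurrence count}, skipping stop words."""
    index: Dict[str, Dict[str, int]] = defaultdict(dict)
    for url, text in pages.items():
        for token in tokenize(text):
            if is_stop_word(token):
                continue
            index[token][url] = index[token].get(url, 0) + 1
    return dict(index)
```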

The Score of each individual page is calculated using the following rules:

Score(A) =

Where:

  • nS(1) .. nS(n) - the nextScores of the pages pointing to the current page
  • C - the number of outlinks on a page that points to the current page
  • d - the damping factor, which is 0.85

The value of nextScore is calculated using the formula:

nextScore(A) =

Where:

  • nS(1) .. nS(n) - the nextScores of the pages pointing to page A
  • C1 - the number of links on page 1 to pages that have outlinks of their own
  • d - the damping factor, which is 0.85

To obtain more precise values, Score and nextScore are calculated over several iterations.
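
The formulas themselves are not reproduced in the text above, so the following Python sketch should be read with care. It assumes the classic PageRank-style update, in which each page receives (1 - d) plus d times the sum of each inlinking page's score divided by that page's number of outlinks. This matches the listed symbols, but it may differ from the exact formulas Orbit Web Spider uses. The function name and the 20-iteration default are hypothetical.

```python
from typing import Dict, List

DAMPING = 0.85  # the damping factor d given in the text


def compute_scores(outlinks: Dict[str, List[str]],
                   iterations: int = 20) -> Dict[str, float]:
    """Iteratively compute link-based scores (assumed PageRank-style update)."""
    pages = set(outlinks)
    for targets in outlinks.values():
        pages.update(targets)

    scores = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # Every page starts the round with the (1 - d) base value.
        next_scores = {page: 1.0 - DAMPING for page in pages}
        for source, targets in outlinks.items():
            if not targets:
                continue
            # A page splits its current score evenly among its outlinks.
            share = DAMPING * scores[source] / len(targets)
            for target in targets:
                next_scores[target] += share
        scores = next_scores
    return scores
```

Each round uses the previous round's values to produce the next ones, which is one plausible reading of the Score and nextScore pairing. More iterations bring the values closer to a fixed point.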

Orbit Web Spider's Query Processor

The query processor has several parts: the user interface (delivered in a single mode), the “engine” that evaluates queries and matches them to relevant documents, and the results formatter.

The Orbit Web Spider considers several factors in determining which documents are most relevant to a query. Factors considered are the popularity of the page, the position and size of the search terms within the page, and the proximity of the search terms to one another on the page.

Popularity is evaluated using a mechanism that resembles Google's page ranking system. Because it indexes the full text of each web page, Orbit Web Spider can go beyond matching single search terms: it can also match multi-word phrases and sentences, and it supports Boolean search operators (+, -, and quotes for exact matches of a term or phrase).
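
A small Python sketch can show how a query using these operators might be split into its parts. The exact semantics are assumed here ("+" marks a required term, "-" an excluded term, and quotes an exact phrase), and the `ParsedQuery` structure and `parse_query` function are hypothetical names.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class ParsedQuery:
    required: List[str] = field(default_factory=list)   # +term
    excluded: List[str] = field(default_factory=list)   # -term or -"phrase"
    phrases: List[str] = field(default_factory=list)    # "exact phrase"
    optional: List[str] = field(default_factory=list)   # plain terms


# An optional sign, followed by either a quoted phrase or a bare word.
TOKEN = re.compile(r'([+-]?)(?:"([^"]*)"|(\S+))')


def parse_query(query: str) -> ParsedQuery:
    """Split a query into required, excluded, phrase, and optional terms."""
    parsed = ParsedQuery()
    for sign, phrase, word in TOKEN.findall(query):
        term = (phrase or word).lower()
        if not term:
            continue
        if sign == "-":
            parsed.excluded.append(term)
        elif phrase:
            parsed.phrases.append(term)
        elif sign == "+":
            parsed.required.append(term)
        else:
            parsed.optional.append(term)
    return parsed
```

For example, `parse_query('+crawler -spam "web spider" index')` yields one required term, one excluded term, one phrase, and one optional term.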

Orbit Web Spider processes a search query in the following order:

  • A user submits a search query to the web server.
  • The web server passes the query to Orbit Web Spider.
  • Orbit Web Spider's query processor removes all stop words, checks whether the user employed Boolean operators, and reformulates the query so that it can be run against the index.
  • Orbit Web Spider retrieves the stored documents and generates a snippet that describes each search result.
  • Orbit Web Spider returns the list of search results to the user.

Search results can be returned in two formats: HTML (if Orbit Web Spider is used as a standalone product) and XML, which is useful for distributed systems based on web services.
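
The flow above can be illustrated end to end with a short Python sketch. Everything here is an assumption made for the example: a plain dictionary of page text stands in for the index, matching is a simple substring test, and the XML element names (`results`, `result`) are invented rather than taken from the product's real output format.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

STOP_WORDS = {"is", "on", "or", "of", "how", "and", "why"}


def clean_query(query: str) -> List[str]:
    """Lowercase the query and drop stop words."""
    return [word for word in query.lower().split() if word not in STOP_WORDS]


def make_snippet(text: str, terms: List[str], width: int = 60) -> str:
    """Cut a short window of text around the earliest matching term."""
    lowered = text.lower()
    positions = [lowered.find(term) for term in terms if term in lowered]
    start = max(min(positions) - width // 2, 0) if positions else 0
    return text[start:start + width].strip()


def search(query: str, documents: Dict[str, str]) -> List[Dict[str, str]]:
    """Return url/snippet pairs for documents containing every query term."""
    terms = clean_query(query)
    results = []
    for url, text in documents.items():
        if terms and all(term in text.lower() for term in terms):
            results.append({"url": url, "snippet": make_snippet(text, terms)})
    return results


def to_xml(results: List[Dict[str, str]]) -> str:
    """Format results as XML (hypothetical element names)."""
    root = ET.Element("results")
    for result in results:
        item = ET.SubElement(root, "result", url=result["url"])
        item.text = result["snippet"]
    return ET.tostring(root, encoding="unicode")
```

An HTML formatter would follow the same pattern as `to_xml`, which is why keeping formatting separate from matching makes it easy to offer both output formats.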

An Example of Orbit Web Spider in Work Mode

Example:

We have 3 sites: A, B, and C. Orbit Web Spider's database only contains a link for site A.

  • Site A: Orbit Web Spider only processes pages specified by the administrator.
  • Site B: Orbit Web Spider only processes the internal pages of specified domains. It can't walk out of the site: it ignores all links to external sites and follows internal links in the order shown in the picture above.
  • Site C: Orbit Web Spider processes all pages associated with the specified sites and may crawl beyond the borders of these sites.

Here is how it will process the links:

  • In the 1st step Orbit Web Spider processes the 1st page of site A (A1) and adds the links to A2, B2 into the database
  • In the 2nd step Orbit Web Spider processes A2 and B2 and adds A3, B3 into the database
  • In the 3rd step Orbit Web Spider processes A3, B3 and adds A4, B4, C1
  • In the 4th step Orbit Web Spider processes A4, B4, C1 and adds C2
  • And so on.
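
The steps above amount to a breadth-first walk over the link graph, which the following Python sketch reproduces. The link structure in `LINKS` is an assumption chosen so that the output matches the four steps listed; the text does not say which page links to which.

```python
from typing import Dict, List, Tuple

# Assumed link graph that is consistent with the steps in the example.
LINKS: Dict[str, List[str]] = {
    "A1": ["A2", "B2"],
    "A2": ["A3"],
    "B2": ["B3"],
    "A3": ["A4"],
    "B3": ["B4", "C1"],
    "C1": ["C2"],
}


def crawl_steps(seed: str, links: Dict[str, List[str]],
                max_steps: int) -> List[Tuple[List[str], List[str]]]:
    """Return (pages processed, links added) for each crawl step."""
    known = {seed}          # URLs already stored in the database
    frontier = [seed]       # pages to process in the current step
    steps = []
    while frontier and len(steps) < max_steps:
        added = []
        for page in frontier:
            for target in links.get(page, []):
                if target not in known:
                    known.add(target)
                    added.append(target)
        steps.append((frontier, added))
        frontier = added    # newly added links are processed next step
    return steps


for number, (processed, added) in enumerate(crawl_steps("A1", LINKS, 4), 1):
    print(f"Step {number}: processes {processed}, adds {added}")
```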



Intranet Crawling Mode

Intranet Crawling Mode can be preferable if you want to process a large quantity of pages from a small number of sites. Orbit Web Spider enables you to specify the depth to which the spider is allowed to reach. The Spider assumes these pages will not be renewed.
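
A depth limit of this kind can be sketched in Python as follows. The sketch assumes that the spider stays on the seed page's host and that depth is counted in link hops from the seed; neither detail is confirmed by the description above, and `crawl_site` is a hypothetical name.

```python
from typing import Dict, List
from urllib.parse import urlparse


def crawl_site(seed: str, links: Dict[str, List[str]],
               max_depth: int) -> List[str]:
    """Visit pages on the seed's host, at most max_depth hops from the seed."""
    host = urlparse(seed).netloc
    visited = [seed]
    seen = {seed}
    frontier = [seed]
    for _ in range(max_depth):
        next_frontier = []
        for page in frontier:
            for target in links.get(page, []):
                # Skip pages already seen and pages on other hosts.
                if target in seen or urlparse(target).netloc != host:
                    continue
                seen.add(target)
                visited.append(target)
                next_frontier.append(target)
        frontier = next_frontier
    return visited
```

With a depth of 2, for example, the spider would process the seed page, the pages it links to, and the pages those link to, and then stop.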

In any of the available modes, Orbit Web Spider sets a future fetch date for each page it processes. On every pass, it checks whether a page should be re-indexed and, if so, adds the page's URL to its list of pages to be re-crawled.

We have described only one work mode of the Orbit Web Spider here. If you would like additional information, please contact us for details.