Orbit Spider overview
Orbitscripts Spider consists of four parts:
- Crawler, which finds and fetches web pages.
- Parser, which analyzes each page.
- Indexer, which sorts every word on every page and stores the resulting index of words in a database.
- Query processor.
Let's take a closer look at Orbit Spider and each of its parts. Search engine owners need this software to deliver their own distinctive search results, which bring searchers back to the search engine over and over again. Orbit Spider is supported by OrbitScripts and can be customized to provide the best solution.
Crawler - Orbit Spider's crawling robot.
Crawler is Orbit Spider's crawling robot, which finds and retrieves pages from the web and hands them off to the parser. It's easy to imagine the crawler as a little worm scurrying across the strands of cyberspace and collecting links to the pages it visits. Once it finds a web server, it works much like an ordinary web browser: it sends a request to the web server for a web page, downloads the entire page, and then hands it off to the parser.
The crawler may consist of many computers, requesting and fetching pages much more quickly than a single browser can. In fact, the crawler can request thousands of different pages simultaneously. To avoid overwhelming web servers or crowding out requests from human users, the crawler deliberately makes requests to each individual web server more slowly than it is capable of.
The crawler finds pages in two ways: through an "Add Url" form and by following links found while crawling the web. The system administrator can choose which of these to use. The crawler continuously recrawls all links stored in the database. To avoid overloading the system, the administrator may set a longer recrawling interval for all pages. But some sites need to be recrawled more often than others.
Newspaper pages must be downloaded daily, and pages with stock quotes much more frequently. To keep the index current, the system allows the administrator to create lists of pages that are recrawled much more often than others. This combination of two types of crawling processes allows the search engine to make efficient use of its resources and keep its index reasonably current.
Several parameters affect the crawler:
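The per-server slowdown described above can be sketched as a small throttle that remembers when each host may be contacted again. This is only an illustration: the class name `HostThrottle`, the 2-second delay, and the injectable clock are assumptions, not part of Orbit Spider.

```python
import time
from urllib.parse import urlparse


class HostThrottle:
    """Tracks the earliest time each host may be requested again (hypothetical helper)."""

    def __init__(self, min_delay_seconds=2.0, clock=time.monotonic):
        self.min_delay = min_delay_seconds  # assumed delay, not an Orbit Spider default
        self.clock = clock
        self.next_allowed = {}  # host -> earliest time of the next request

    def wait_time(self, url):
        """Return how many seconds to wait before requesting this URL."""
        host = urlparse(url).netloc
        now = self.clock()
        return max(0.0, self.next_allowed.get(host, now) - now)

    def record_request(self, url):
        """Note that a request to the URL's host is being sent now."""
        host = urlparse(url).netloc
        self.next_allowed[host] = self.clock() + self.min_delay
```

A fetch loop would call `wait_time` before each request and `record_request` right after sending it. Requests to different hosts are never delayed by each other, so many pages can still be fetched in parallel.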
- URL-filter. Contains rules that declare which pages should be processed and which pages should be omitted.
- Max outlinks per page. The maximum number of links to be taken from one page.
- Ignore Robots. Indicates whether robots exclusion rules should be ignored or honored.
- Default Fetch Interval. The time interval between re-crawls of a page.
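To make these parameters concrete, here is a minimal sketch of how they could be grouped and applied. The field names, the default values, and the "+"/"-" regular-expression rule format are assumptions chosen for illustration. They do not describe Orbit Spider's actual configuration format.

```python
import re
from dataclasses import dataclass, field


@dataclass
class CrawlerSettings:
    # Hypothetical names; each maps to one parameter from the list above.
    url_filter: list = field(default_factory=list)  # ordered ("+" or "-", regex) rules
    max_outlinks_per_page: int = 100                # assumed default
    ignore_robots: bool = False                     # assumed default
    default_fetch_interval_days: int = 30           # assumed default

    def accepts(self, url):
        """The first matching rule decides; URLs matching no rule are omitted."""
        for sign, pattern in self.url_filter:
            if re.search(pattern, url):
                return sign == "+"
        return False

    def select_outlinks(self, links):
        """Apply the URL filter, then cap the number of links kept."""
        allowed = [u for u in links if self.accepts(u)]
        return allowed[: self.max_outlinks_per_page]
```

With rules such as `("-", r"\.(jpg|png)$")` followed by `("+", r"^http://example\.com/")`, image links and links to other sites are dropped, and only the first `max_outlinks_per_page` remaining links are kept.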
Orbit Spider's HTML parser
The crawler gives the parser the full text of the pages it has found. First, the parser checks the content type of the page, and if it is not text/html it stops processing the current page. Then it parses the page's content. During parsing it takes into account tags such as meta-robots and base. It looks for hyperlinks in the following tags: <a>, <form>, <area>, <frame>, <iframe>, <script>, <link>, <img>.
Orbit Spider's Indexer
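The parsing steps above can be sketched with Python's standard `html.parser` module. This is an illustrative approximation, not Orbit Spider's parser: the names `LinkExtractor` and `extract_links` are hypothetical, and treating a meta-robots `nofollow` as "return no links" is an assumption about what "takes into account" means.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Attribute that carries the URL for each tag listed above.
LINK_ATTRS = {
    "a": "href", "area": "href", "link": "href", "form": "action",
    "frame": "src", "iframe": "src", "script": "src", "img": "src",
}


class LinkExtractor(HTMLParser):
    def __init__(self, page_url):
        super().__init__()
        self.base_url = page_url  # replaced if the page has a <base href>
        self.follow = True        # cleared by <meta name="robots" content="nofollow">
        self.raw_links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "base" and attrs.get("href"):
            self.base_url = urljoin(self.base_url, attrs["href"])
        elif tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            if "nofollow" in (attrs.get("content") or "").lower():
                self.follow = False
        elif tag in LINK_ATTRS and attrs.get(LINK_ATTRS[tag]):
            self.raw_links.append(attrs[LINK_ATTRS[tag]])

    def links(self):
        """Return absolute URLs, or nothing if the page asked not to be followed."""
        if not self.follow:
            return []
        return [urljoin(self.base_url, link) for link in self.raw_links]


def extract_links(page_url, content_type, html):
    """Skip anything that is not text/html, otherwise collect the page's links."""
    if not content_type.lower().startswith("text/html"):
        return []
    parser = LinkExtractor(page_url)
    parser.feed(html)
    parser.close()
    return parser.links()
```

Relative links are resolved only after the whole page has been read, so a `<base>` tag applies to every link on the page regardless of where it appears.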
The main aim of the indexer is to create the index and keep it usable for search. The index is a structure where Orbit Spider stores the information used during search. This structure contains keywords, URLs of pages, and the degree of correspondence between each keyword and each page. To improve search performance, Orbit Spider ignores (doesn't index) common words called stop words (such as the, is, on, or, of, how, why, as well as certain single digits and single letters). Stop words are so common that they do little to narrow a search, and therefore they can safely be discarded. The degree of correspondence is established after analyzing the content of all pages. Let's call it the score. The score of each individual page is calculated using the following rule:
Score(A) = (1 - d) + d * (nS(1) / C(1) + ... + nS(n) / C(n)),
where:
- nS(1) ... nS(n) - the nextScore values of the pages pointing to the current page;
- C(i) - the number of outlinks on the i-th page that points to the current page;
- d - the damping factor, which is 0.85.
The value of the nextScore is calculated by the formula:
nextScore(A) = (1 - d) + d * (nS(1) / Cl(1) + ... + nS(n) / Cl(n)),
where:
- nS(1) ... nS(n) - the nextScore values of the pages pointing to the page A;
- Cl(i) - the number of links on the i-th such page to pages which have their own outlinks;
- d - the damping factor, which is 0.85.
To obtain more precise values, score and nextScore are calculated over several iterations.
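The two formulas can be read as an iterative computation over the link graph. The sketch below is one possible reading. The function name `compute_scores`, the starting value of 1.0, the fixed count of 20 iterations, and the choice to skip a term whose Cl value is zero are all assumptions that the text does not specify.

```python
def compute_scores(outlinks, d=0.85, iterations=20):
    """outlinks maps each page to the list of pages it links to."""
    pages = set(outlinks) | {t for targets in outlinks.values() for t in targets}

    # inlinks[p] lists the pages pointing to p.
    inlinks = {p: [] for p in pages}
    for source, targets in outlinks.items():
        for target in set(targets):
            inlinks[target].append(source)

    # C: number of outlinks on a page.
    c = {p: len(set(outlinks.get(p, ()))) for p in pages}
    # Cl: number of links on a page to pages that have their own outlinks.
    cl = {p: sum(1 for t in set(outlinks.get(p, ())) if c[t] > 0) for p in pages}

    # nextScore depends on itself, so it is refined over several iterations.
    next_score = {p: 1.0 for p in pages}
    for _ in range(iterations):
        next_score = {
            p: (1 - d) + d * sum(next_score[q] / cl[q] for q in inlinks[p] if cl[q] > 0)
            for p in pages
        }

    # Score uses the converged nextScore values and the plain outlink counts.
    score = {
        p: (1 - d) + d * sum(next_score[q] / c[q] for q in inlinks[p])
        for p in pages
    }
    return score, next_score
```

For two pages that link only to each other, both scores settle at 1. For a page A linking to a page B that has no outlinks, A keeps the minimum value 1 - d = 0.15 and B receives 0.15 + 0.85 * 0.15 = 0.2775.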
Orbit Spider's Query Processor
The query processor has several parts, including the user interface (this part is supplied when Orbit Spider is used as a single, standalone product), the "engine" that evaluates queries and matches them to relevant documents, and the results formatter.
Orbit Spider considers several factors in determining which documents are most relevant to a query, including the popularity of the page, the position and size of the search terms within the page, and the proximity of the search terms to one another on the page.
Popularity is evaluated using a mechanism resembling Google's PageRank. Indexing the full text of the web allows Orbit Spider to go beyond simply matching single search terms. Orbit Spider can also match multi-word phrases and sentences, and it supports boolean search operators (+, -, and quotes for exact-match search).
Orbit Spider processes the query in the following order:
- The surfer submits a search query to the web server. The web server passes it to Orbit Spider.
- Orbit Spider's query processor deletes all stop words, checks whether the visitor used special boolean operators, and forms a query suitable for searching the index.
- Then it retrieves the stored documents and generates snippets to describe each search result.
- The last step is the presentation of the results. Search results can be returned in two formats: HTML (if Orbit Spider is used as a single product) and XML, which is very useful for distributed systems based on web services.
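The second step above (dropping stop words and recognizing the +, - and quote operators) might look roughly like this. The function `parse_query`, the shape of its result, and the short stop-word sample are hypothetical. Orbit Spider's real stop-word list and query format are not given in the text.

```python
import re

# Small sample taken from the stop words mentioned earlier; the real list is longer.
STOP_WORDS = {"the", "is", "on", "or", "of", "how", "why"}

# An optional +/- sign followed by either a quoted phrase or a single word.
TOKEN = re.compile(r'([+-]?)(?:"([^"]*)"|(\S+))')


def parse_query(query):
    """Split a raw query into required, excluded, optional terms and exact phrases."""
    parsed = {"required": [], "excluded": [], "optional": [], "phrases": []}
    for sign, phrase, word in TOKEN.findall(query):
        if phrase:
            # Quoted text is kept whole, including any stop words inside it.
            key = "excluded" if sign == "-" else "phrases"
            parsed[key].append(phrase.lower())
            continue
        term = word.strip('"').lower()
        if not term or term in STOP_WORDS:
            continue  # stop words are deleted from the query
        if sign == "+":
            parsed["required"].append(term)
        elif sign == "-":
            parsed["excluded"].append(term)
        else:
            parsed["optional"].append(term)
    return parsed
```

For the query `+Orbit -"free trial" "search engine" the spider -ads`, this yields `orbit` as a required term, `free trial` and `ads` as exclusions, `spider` as an optional term, and `search engine` as an exact phrase. The stop word `the` is dropped.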
Orbit Spider may work in 4 modes. In the example used below we have 3 sites: A, B, and C, and the spider has a link only to A.
- Spider processes only the pages specified by the admin.
- Spider processes only internal pages of the specified domains. It can't walk out of the site. It ignores all links to external sites and follows internal links in the order shown in the picture above.
- Spider processes all pages which it finds at the specified sites, and it may go beyond the borders of these sites. Here is the algorithm of how it will process the links:
  - At the 1st step it processes the 1st page of site A (A1) and adds links to A2 and B2 to the database.
  - At the 2nd step it processes A2 and B2 and adds A3 and B3 to the database.
  - At the 3rd step it processes A3 and B3 and adds A4, B4, and C1.
  - At the 4th step it processes A4, B4, and C1 and adds C2.
  - And so on.
- Intranet Crawling. This mode is preferred if you want to process a huge number of pages from a small number of sites. You may specify the depth the spider is allowed to reach. It is assumed that pages will not be updated.
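The step-by-step example for the mode that follows links beyond the specified sites is a breadth-first traversal, and the other modes can be seen as restrictions of it. The sketch below assumes a made-up link graph that reproduces the four steps listed above (the original picture is not available), identifies a page's site by the first letter of its name, and uses hypothetical parameter names.

```python
def crawl_steps(links, start, allowed_sites=None, max_depth=None):
    """Return the pages processed at each step of a breadth-first crawl.

    links: page -> list of pages it links to (hypothetical link graph).
    allowed_sites: if given, links to other sites are ignored (internal-only mode);
                   if None, external links are followed as well.
    max_depth: optional limit on the number of steps (depth-limited crawling).
    """
    seen = {start}
    frontier = [start]
    steps = []
    while frontier and (max_depth is None or len(steps) < max_depth):
        steps.append(frontier)
        found = []
        for page in frontier:
            for target in links.get(page, []):
                # The site is assumed to be the first letter of the page name.
                if allowed_sites is not None and target[0] not in allowed_sites:
                    continue
                if target not in seen:
                    seen.add(target)
                    found.append(target)
        frontier = found
    return steps


# Assumed link graph, chosen to match the steps described in the text.
LINKS = {
    "A1": ["A2", "B2"], "A2": ["A3"], "B2": ["B3"],
    "A3": ["A4"], "B3": ["B4", "C1"], "C1": ["C2"],
}
```

Calling `crawl_steps(LINKS, "A1")` processes A1, then A2 and B2, then A3 and B3, then A4, B4 and C1, and finally C2. Passing `allowed_sites={"A"}` keeps the spider inside site A, and `max_depth` limits how deep it may go.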
In any mode, the spider sets the next fetch date for each processed page. On every run the spider checks whether a page should be reindexed and, if so, adds the URL of this page to the segment, where it stores the list of pages that should be crawled.
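The scheduling described above can be sketched in a few lines. The dictionary-based page records, the 30-day default interval, and the function names are assumptions made for this example. The only behavior taken from the text is that each processed page gets a next fetch date and that pages whose date has arrived are added to the segment.

```python
from datetime import datetime, timedelta

DEFAULT_FETCH_INTERVAL = timedelta(days=30)  # assumed value of the default interval


def schedule_next_fetch(page, fetched_at, frequent_intervals):
    """Set the next fetch date; pages on the frequent list get a shorter interval."""
    interval = frequent_intervals.get(page["url"], DEFAULT_FETCH_INTERVAL)
    page["next_fetch"] = fetched_at + interval


def build_segment(pages, now):
    """Collect the URLs of pages whose next fetch date has arrived."""
    return [page["url"] for page in pages if page["next_fetch"] <= now]
```

Here `frequent_intervals` plays the role of the list of pages that must be recrawled more often, for example a news page mapped to `timedelta(days=1)`.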
