
Sunday, July 22, 2012

How A Search Engine Works

Internet search engines are special sites on the Web that are designed to help people find information stored on other sites. There are differences in the ways various search engines work, but they all perform three basic tasks:
They search the Internet -- or select pieces of the Internet -- based on important words.
They keep an index of the words they find, and where they find them.
They allow users to look for words or combinations of words found in that index.


In order for search engines to tell you where to find information, they must find the information first. To find the millions and millions of web pages that exist, search engines use programs called 'spiders', or robots. These spiders crawl the Web and build indexes of URLs, keywords found on web pages and other pertinent information.
A search engine is basically made of three parts:
  • The spider
  • The index
  • The software
The spider visits a web page, reads it, and then follows links to other pages within the website. This is what is generally referred to as 'crawling'. The spiders return to the site to look for changes.
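The crawling loop described above can be sketched in a few lines of Python. This is a toy illustration, not how a real spider is built: instead of fetching pages over the network, it walks a `site` dictionary of URL-to-HTML entries (a hypothetical stand-in for the Web), extracting links with the standard library's HTML parser and visiting each page once.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(site, start):
    """Breadth-first crawl over `site`, a dict mapping URL -> HTML.
    Returns the set of URLs visited; links pointing outside the
    known site are skipped, and no page is read twice."""
    visited, queue = set(), [start]
    while queue:
        url = queue.pop(0)
        if url in visited or url not in site:
            continue
        visited.add(url)
        parser = LinkExtractor()
        parser.feed(site[url])
        queue.extend(parser.links)
    return visited
```

For example, starting `crawl` at a site's home page makes it follow the "About" and "Blog" links it finds there, exactly the spreading-out behaviour described above, while a dead link to a page that does not exist is simply ignored.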
Once the spider has crawled your site, it adds all the information it found to the index. Sometimes referred to as a catalogue, the index is like a giant book containing a copy of every single page a search engine spider finds. If a web page changes, the index is updated with the new content. It can take some time for new pages or content to be included in the index, which means you may see a visit by a search engine spider in your log files even though the page is not yet indexed. Until your web pages are indexed, they are not available to people searching with search engines.
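The index described here is what is usually called an inverted index: instead of mapping pages to words, it maps each word to the pages that contain it. A minimal sketch in Python (the function names and the simple whitespace tokenizing are illustrative assumptions, not how any real engine does it):

```python
def build_index(pages):
    """pages: dict mapping URL -> page text.
    Returns an inverted index: word -> set of URLs containing it."""
    index = {}
    for url, text in pages.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(url)
    return index

def update_page(index, url, new_text):
    """Re-index one page after its content changes: first drop the
    URL's stale entries, then add the words from the new content."""
    for urls in index.values():
        urls.discard(url)
    for word in new_text.lower().split():
        index.setdefault(word, set()).add(url)
```

`update_page` mirrors what the paragraph above describes: when a page changes, its old entries must be removed before the new content is indexed, which is one reason updates do not appear in search results instantly.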
So your pages are now indexed - how do the search engines decide who comes first?

While each search engine has its own unique way of ranking web pages (its algorithm), there are common themes that they all share.

Web Crawling:
When most people talk about Internet search engines, they really mean World Wide Web search engines. Before the Web became the most visible part of the Internet, there were already search engines in place to help people find information on the Net. Programs with names like "gopher" and "Archie" kept indexes of files stored on servers connected to the Internet, and dramatically reduced the amount of time required to find programs and documents.
To find information on the hundreds of millions of Web pages that exist, a search engine employs special software robots, called spiders, to build lists of the words found on Web sites. When a spider is building its lists, the process is called Web crawling. In order to build and maintain a useful list of words, a search engine's spiders have to look at a lot of pages.


The Spider:
How does any spider start its travels over the Web? The usual starting points are lists of heavily used servers and very popular pages. The spider will begin with a popular site, indexing the words on its pages and following every link found within the site. In this way, the spidering system quickly begins to travel, spreading out across the most widely used portions of the Web.

Google began as an academic search engine. In the paper that describes how the system was built, Sergey Brin and Lawrence Page give an example of how quickly their spiders can work. They built their initial system to use multiple spiders, usually three at one time. Each spider could keep about 300 connections to Web pages open at a time. At its peak performance, using four spiders, their system could crawl over 100 pages per second, generating around 600 kilobytes of data each second.
(Fig.: A spider collecting the keywords, URLs and other information from other sites)
When the Google spider looked at an HTML page, it took note of two things:
  • The words within the page
  • Where the words were found
Words occurring in the title, subtitles, meta tags and other positions of relative importance were noted for special consideration during a subsequent user search. The Google spider was built to index every significant word on a page, leaving out the articles "a," "an" and "the." Other spiders take different approaches.
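The two things the Google spider noted, which words appear and where they appear, can be sketched as a small tokenizer. The function below records each significant word together with its position, skipping the articles "a," "an" and "the" that the paragraph above says were left out (the function name and word-splitting rule are illustrative assumptions):

```python
STOP_WORDS = {"a", "an", "the"}

def significant_words(text):
    """Return (word, position) pairs for a page's text, skipping
    the articles the original Google spider is said to ignore.
    Positions are kept so later ranking can favour words that
    appear early or in important places."""
    out = []
    for pos, word in enumerate(text.lower().split()):
        if word not in STOP_WORDS:
            out.append((word, pos))
    return out
```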

Building the Index:
Once the spiders have completed the task of finding information on Web pages, the search engine must store the information in a way that makes it useful. There are two key components involved in making the gathered data accessible to users:
  • The information stored with the data.
  • The method by which the information is indexed.
A search engine could just store the word and the URL where it was found. In reality, this would make for an engine of limited use, since there would be no way of telling whether the word was used once or many times, or whether the page contained links to other pages containing the word. In other words, there would be no way of building the ranking list that tries to present the most useful pages at the top of the list of search results.
An engine might store the number of times that the word appears on a page. The engine might assign a weight to each entry, with increasing values assigned to words as they appear near the top of the document, in sub-headings, in links, in the meta tags or in the title of the page.
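The weighting idea above can be illustrated with a short scoring function. The specific weight values below are made up for the example; real engines keep their weighting schemes secret, but the shape of the computation, counting a word's occurrences per field and multiplying by that field's importance, matches the description:

```python
# Hypothetical weights: a word in the title counts for more than
# the same word in the body text.
FIELD_WEIGHTS = {"title": 5, "heading": 3, "meta": 2, "body": 1}

def score_word(word, fields):
    """fields: dict mapping a field name ('title', 'body', ...) to
    that field's text. Returns the word's weighted score: its count
    in each field multiplied by the field's weight, summed."""
    score = 0
    for field, text in fields.items():
        count = text.lower().split().count(word.lower())
        score += count * FIELD_WEIGHTS.get(field, 1)
    return score
```

With these example weights, one occurrence of "spider" in a title outranks several occurrences buried in the body, which is exactly the intuition behind position-based weighting.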
Regardless of the precise combination of additional pieces of information stored by a search engine, the data will be encoded to save storage space. For example, the original Google paper describes using 2 bytes, of 8 bits each, to store information on weighting.
An index has a single purpose: It allows information to be found as quickly as possible. There are quite a few ways for an index to be built, but one of the most effective ways is to build a hash table.
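A hash table maps each word to a bucket via a hash function, so a lookup jumps straight to the right bucket instead of scanning the whole index. The sketch below is a deliberately tiny illustration (the class, the bucket count, and the toy polynomial hash are all assumptions for the example; Python's built-in `hash()` for strings is randomized per process, so a deterministic one is used instead):

```python
def simple_hash(word, num_buckets):
    """Toy polynomial string hash mapping a word to a bucket slot."""
    h = 0
    for ch in word:
        h = (h * 31 + ord(ch)) % num_buckets
    return h

class HashIndex:
    """Minimal hash-table index: each bucket holds (word, urls) pairs."""
    def __init__(self, num_buckets=64):
        self.buckets = [[] for _ in range(num_buckets)]

    def add(self, word, url):
        slot = self.buckets[simple_hash(word, len(self.buckets))]
        for entry in slot:
            if entry[0] == word:
                entry[1].add(url)
                return
        slot.append((word, {url}))

    def lookup(self, word):
        slot = self.buckets[simple_hash(word, len(self.buckets))]
        for w, urls in slot:
            if w == word:
                return urls
        return set()
```

Because the hash computes the bucket directly from the word, lookup time stays roughly constant no matter how many words the index holds, which is exactly why an index built this way finds information "as quickly as possible."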
There are a few factors that we can be sure will be included in every search engine's ranking algorithm:
  • Content
  • Title of the article/content
  • Proper grammar & spelling
  • A working website