SEO Chapter 1

Search Engines

Crawler-based search engines

     Search engines are software programs that provide users with the URLs of internet web pages relevant to the keyword used to perform the search.

     A crawler-based search engine consists of:

  1. Spider/Crawler: visits a web page, stores a mirror image of all the information gathered from the page along with the date and time of the visit, and follows URLs to other web pages within the site and on other web sites. The mirror copy is called a cached page. The spider periodically returns to previously crawled pages to keep its information about them up to date.
  2. Indexer (index): a catalog of copies of all web pages crawled by the spider, each with a date and time stamp. There is generally a delay between spidering a web page and adding it to the index. Search results are derived from the index, so they may not reflect newly spidered pages until the index is updated.
  3. Search software: searches through all pages recorded in the index in response to a query and returns the URLs of related web pages, ranked in an order determined by the search engine's relevancy algorithm.
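
The three components above can be sketched in a few lines of Python. This is a toy illustration only, not how any real search engine is built: the in-memory `WEB` dictionary, the `.example` URLs, and the function names `crawl`, `build_index` and `search` are all assumptions made for the example, and the "relevancy algorithm" is simply a count of matching words.

```python
from collections import deque
from datetime import datetime, timezone

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
WEB = {
    "a.example/": ("free image library for commercial use",
                   ["a.example/about", "b.example/"]),
    "a.example/about": ("about our image library", ["a.example/"]),
    "b.example/": ("stock photos and free images", []),
}


def crawl(seed):
    """Spider: visit pages reachable from the seed and keep a time-stamped cached copy."""
    cache, queue = {}, deque([seed])
    while queue:
        url = queue.popleft()
        if url in cache or url not in WEB:
            continue  # already crawled, or the link leads nowhere
        text, links = WEB[url]
        cache[url] = {"text": text, "crawled_at": datetime.now(timezone.utc)}
        queue.extend(links)  # follow URLs to other pages
    return cache


def build_index(cache):
    """Indexer: map each word to the URLs containing it, with an occurrence count."""
    index = {}
    for url, page in cache.items():
        for word in page["text"].split():
            by_url = index.setdefault(word, {})
            by_url[url] = by_url.get(url, 0) + 1
    return index


def search(index, query):
    """Search software: rank URLs by the number of query-word occurrences."""
    scores = {}
    for word in query.lower().split():
        for url, count in index.get(word, {}).items():
            scores[url] = scores.get(url, 0) + count
    return sorted(scores, key=lambda url: (-scores[url], url))


cache = crawl("a.example/")
index = build_index(cache)
results = search(index, "free image library")
print(results)
```

Because the search step reads only the index, a page that has been crawled but not yet indexed cannot appear in the results, which mirrors the delay described in item 2.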

     A relevant search engine result may be defined as a set of URLs, displayed in response to a user query, from which the user clicks one or more URLs. The relevancy of a search engine result is relative to the user: two users querying the same search engine with exactly the same keyword may be searching for very different information.

Web directories

     Most search engine spiders use web directories as the seed, or starting point, for their crawl. A web directory is a human-compiled listing of URLs to thousands of websites, categorized into different groups. The best-known directories are DMOZ (www.dmoz.org) and Yahoo! Directory (dir.yahoo.com).

DMOZ is the largest and most comprehensive human-edited directory on the internet. Listing a website in DMOZ is free, but it takes from around 3 weeks to 6 months for a listing to be approved, and an improperly prepared submission is likely to be denied. Yahoo! Directory is a paid directory with an annual recurring cost of US$299 for commercial sites and US$600 for sites with adult content. Yahoo! also provides a free listing option, but there is no guarantee that the listing will be accepted. Search engines view directory listings as a vote of confidence in web sites [9]. Being listed in either of these directories is crucial, since most popular search engines spider them; a site owner can be confident that the site will be spidered if these directories link to it.

Search engine relationship chart

     Figure 4 illustrates the relationship between major search engines and directories. One can conclude that:

  1. DMOZ acts as the seed for Lycos (www.lycos.com), HotBot (www.hotbot.com), AOL Search (search.aol.com), Teoma (www.teoma.com), Google, iWon (www.iwon.com) and Netscape Search (search.netscape.com)
  2. Yahoo! Directory acts as the seed for Yahoo! Search and AltaVista (www.altavista.com)
  3. Google AdWords (adwords.google.com) provides paid search results to Google Search, HotBot, AOL Search, Lycos, Ask Jeeves (www.askjeeves.com), Teoma, iWon and Netscape Search
  4. Yahoo! Search Marketing (searchmarketing.yahoo.com) provides paid search results to Yahoo! Search, AllTheWeb (www.alltheweb.com), AltaVista and MSN Search

    Organic and inorganic searches

         Key terms that one may encounter in the study of search engine results are:

    1. Organic search results: Non-sponsored results returned by a search engine in response to a user query. The ranking of the results is determined by the relevancy algorithm of the search engine.
    2. Inorganic search results: Results returned by a search engine where the ranking of the results is determined by the amount the advertiser pays to the advertising network.

    Figure 5: Organic and Inorganic search results – Google snapshot

    Search engine user trends

         Figures 6 and 7 are compiled from data collected by the comScore Media Metrix (www.comscore.com/metrix/) qSearch service, which monitors the web activities of 1.5 million English-speaking internet surfers worldwide. Both figures highlight the significance of Google, Yahoo! Search and MSN Search as search engines. Figure 7, “percent share of searches trend”, indicates that Google is the most popular search engine and that its popularity increased between January 2005 and July 2005. The popularity of Google makes it necessary to understand its proprietary search algorithms.

    Figure 6: Percent share of searches conducted by U.S. surfers in July 2005 [5]

    Page Rank and Trust Rank Algorithms

         Google determines the ranking of its search result listings using the PageRank and TrustRank algorithms. It is important to understand these algorithms, since the higher a website ranks in search engine results, the higher its potential to gain targeted visitors.

    PageRank [6]: The rank of a webpage in organic search results of Google is determined by PageRank.

    PR(A)=(1-d) + d[PR(T1)/C(T1) + … + PR(Tn)/ C(Tn)]

    where

    PR(A) is the PageRank of web page A

    T1…Tn are the web pages that point to page A

    d is the damping factor, which can be set between 0 and 1. It is usually set to 0.85

    C(A) is the number of links going out from web page A

    PR(A) is based on the model of a random surfer who starts at a web page and keeps clicking on links at random, never hitting the back button, until he gets bored. On getting bored, the random surfer requests a random web page. The probability that the surfer visits page A is PR(A). In the formula above, the damping factor d is the probability that, at each page, the surfer follows one of its links; with probability 1-d the surfer gets bored and requests another random web page. One variation of the PageRank calculation assigns different damping factors to the different pages T1…Tn that link to page A.

    One can conclude from the PageRank equation that:

    1. The more inbound links a web page has, the higher the PageRank
    2. It is better to have an inbound link from a web page that has high PageRank and few outbound links than from a web page that has high PageRank and many outbound links.
      e.g. PR(X) = 4 and C(X) = 5, then d[PR(X)/C(X)] = 0.8d
           PR(Y) = 8 and C(Y) = 100, then d[PR(Y)/C(Y)] = 0.08d
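
The comparison can be checked numerically. The short sketch below evaluates the single term d·PR(T)/C(T) that one linking page T contributes to PR(A), using the usual d = 0.85; the function name `link_contribution` is made up for this example.

```python
def link_contribution(page_rank, out_links, d=0.85):
    """Term d * PR(T) / C(T) that one linking page T adds to PR(A)."""
    return d * page_rank / out_links


# Page X: PageRank 4, 5 outbound links. Page Y: PageRank 8, 100 outbound links.
from_x = link_contribution(page_rank=4, out_links=5)
from_y = link_contribution(page_rank=8, out_links=100)
print(round(from_x, 3), round(from_y, 3))
```

Although Y has twice the PageRank of X, a link from X is worth ten times as much here (about 0.68 against about 0.068), because X divides its PageRank among far fewer outbound links.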

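    PageRank forms a probability distribution over web pages, so the sum of all web pages’ PageRanks will be 1. Strictly, this holds when the (1-d) term is divided by the total number of pages N; with the formula exactly as written above, the values sum to N and must be divided by N to obtain a distribution. PR(A) can be calculated using an iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix of the web [6].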

    PR(A1) + PR(A2) + PR(A3) + … + PR(An) = 1

    PR(A) = (1-d)              if web page A has no inbound links.
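
The iterative calculation can be sketched as follows. This is a minimal illustration on a made-up four-page graph (`LINKS`), not Google's implementation. It applies the formula PR(A) = (1-d) + d[PR(T1)/C(T1) + … + PR(Tn)/C(Tn)] exactly as written, which yields values that sum to the number of pages, and then divides by the page count to obtain values that sum to 1. Every page in the sketch is assumed to have at least one outbound link.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A."""
    ranks = {page: 1.0 for page in links}
    for _ in range(iterations):
        new_ranks = {page: 1.0 - d for page in links}
        for source, targets in links.items():
            for target in targets:
                # Each outbound link passes on an equal share of the source's rank.
                new_ranks[target] += d * ranks[source] / len(targets)
        ranks = new_ranks
    return ranks


# Hypothetical link graph: page -> pages it links to. D has no inbound links.
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}

raw = pagerank(LINKS)
normalized = {page: value / len(raw) for page, value in raw.items()}

for page in sorted(raw):
    print(page, round(raw[page], 4), round(normalized[page], 4))
```

Page D, which nothing links to, ends up with a raw value of 1-d = 0.15, and page C, which has the most inbound links, ends up with the highest value.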

    Hundreds of web pages are added to the World Wide Web every moment. Since the sum of the PageRank of all web pages on the WWW is a constant, i.e. 1, the PageRank of each existing web page is continually adjusted to accommodate the PageRank of new web pages. A web page with no inbound links has only the baseline PageRank of (1-d). Inbound links increase the PageRank of a web page. Outbound links, in turn, pass a share of a page's PageRank on to the pages they point to, and when those pages belong to other sites, that PageRank no longer circulates among the site's own pages. This loss is called PageRank Leak.
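
One way to see the leak effect is to compare the combined PageRank of a small site with and without an outbound link to an external page. The sketch below is an illustration under stated assumptions: the three-page graphs, the page names `home`, `product` and `external`, and the helper `site_total` are all hypothetical, and the values are the raw ones produced by the formula as written (not divided by the number of pages).

```python
def pagerank(links, d=0.85, iterations=200):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A."""
    ranks = {page: 1.0 for page in links}
    for _ in range(iterations):
        new_ranks = {page: 1.0 - d for page in links}
        for source, targets in links.items():
            for target in targets:
                new_ranks[target] += d * ranks[source] / len(targets)
        ranks = new_ranks
    return ranks


def site_total(links):
    """Combined PageRank of the site's own pages (home and product)."""
    ranks = pagerank(links)
    return ranks["home"] + ranks["product"]


# The external page links to the site in both cases.
without_outbound = {"home": ["product"], "product": ["home"], "external": ["home"]}
# Same graph, but the home page now also links out to the external page.
with_outbound = {"home": ["product", "external"], "product": ["home"], "external": ["home"]}

print(round(site_total(without_outbound), 3))
print(round(site_total(with_outbound), 3))
```

In this toy graph the site's combined value falls from about 2.85 to about 2.23 once the home page links out, because half of the rank the home page passes on now goes to the external page.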

         To ensure a high PageRank it is necessary that:

    1. A web page should have a high number of inbound links
    2. A web page should have a low number of outbound links

           The PageRank algorithm determines the importance of a web page from its inbound links. This can be manipulated by artificially inflating the number of inbound links to a web page. PageRank also does not incorporate the quality of the web page in its calculations. Hence Google is developing the TrustRank algorithm, and it registered the trademark for TrustRank on March 16, 2005.

      TrustRank [7]: According to Gyongyi, Garcia-Molina and Pedersen, the proposed algorithms for TrustRank rely on the PageRank algorithm. TrustRank takes into account not only the inbound links to a web page but also the quality of the web page. To determine quality, a panel of human experts identifies a set of reputable web pages that act as the seed. The algorithm is based on the empirical observation that good pages seldom point to bad ones.

           One can conclude that a web page can achieve higher TrustRank if:

      1. Reputable (good) web pages link to the web page
      2. The web page does not link to any bad web pages
      3. The web page does not mislead the search engine or employ search engine spam
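
The idea of propagating trust from a hand-picked seed set can be sketched as a seed-biased variant of the PageRank iteration. This is a simplified reading of the approach described in [7], not Google's implementation: the four-page graph, the page names, the parameter name `alpha`, and the function `trust_scores` are assumptions made for the example.

```python
def trust_scores(links, seeds, alpha=0.85, iterations=50):
    """Propagate trust along links, restarting only at the trusted seed pages."""
    # Seed pages share the initial trust equally; all other pages start at zero.
    seed_share = {page: (1.0 / len(seeds) if page in seeds else 0.0) for page in links}
    trust = dict(seed_share)
    for _ in range(iterations):
        new_trust = {page: (1.0 - alpha) * seed_share[page] for page in links}
        for source, targets in links.items():
            for target in targets:
                # A page splits the trust it passes on among its outbound links.
                new_trust[target] += alpha * trust[source] / len(targets)
        trust = new_trust
    return trust


# Hypothetical graph: the seed and the good page link to each other;
# the two spam pages link to each other and to the good page.
LINKS = {
    "seed": ["good"],
    "good": ["seed"],
    "spam": ["good", "spam2"],
    "spam2": ["spam"],
}

trust = trust_scores(LINKS, seeds={"seed"})
for page in sorted(trust):
    print(page, round(trust[page], 4))
```

The spam pages have plenty of links between them, yet they receive no trust at all, because no trusted page links to them. Linking out to the good page does not help them either, since trust flows only in the direction of the links.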

      Overlap analysis

      A study conducted by Dogpile.com in collaboration with the University of Pittsburgh and Pennsylvania State University in April 2005 and July 2005 reveals that only 1.1% of 485,460 first-page search results were the same across Google, Yahoo!, MSN Search and Ask Jeeves [8]. The study of search engine results for a given keyword across different search engines at the same time is termed overlap analysis, and it forms the basis of meta search engines like Dogpile.com. Meta search engines send search queries to popular search engines and display their results together on a single page. Since Google, Yahoo! Search and MSN Search are significant in terms of percent share of search queries answered, it is important to optimize web pages to achieve top rankings in all three search engines.
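
A basic overlap calculation can be sketched with set operations. The result lists below are invented for illustration (the engine names and `.example` URLs are placeholders), and measuring the share against all distinct URLs is one possible definition, not necessarily the one used in the study [8].

```python
def overlap(results_by_engine):
    """URLs that appear in every engine's first-page results."""
    result_sets = [set(urls) for urls in results_by_engine.values()]
    return set.intersection(*result_sets)


def overlap_share(results_by_engine):
    """Fraction of all distinct first-page URLs that every engine returned."""
    all_urls = set().union(*results_by_engine.values())
    return len(overlap(results_by_engine)) / len(all_urls)


# Hypothetical first-page results for one keyword on three engines.
FIRST_PAGE = {
    "engine_a": ["one.example", "two.example", "three.example"],
    "engine_b": ["two.example", "four.example", "one.example"],
    "engine_c": ["five.example", "two.example", "six.example"],
}

shared = overlap(FIRST_PAGE)
share = overlap_share(FIRST_PAGE)
print(shared, round(share, 3))
```

Here only one of the six distinct URLs is returned by all three engines, so a page that ranks well on one engine cannot be assumed to rank well on the others.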

      Figure 8 is a snapshot of overlap analysis performed on Google, Yahoo! Search and MSN Search on August 25, 2005 at 18:50 EST for the keyword “free image library” and URL pattern “imageblowout.com”.

      Figure 8: Overlap analysis of imageblowout.com – Googlerankings.com snapshot

      1. Yahoo! Search displayed “imageblowout.com” on page# 1 of search results
      2. MSN Search displayed “imageblowout.com” on page# 2 of search results
      3. Google displayed “imageblowout.com” on page# 55 of search results

      The substantial difference between the ranking in Google search results and the rankings in Yahoo! Search and MSN Search results is due to the proprietary relevancy algorithms used by these search engines. Yahoo! Search and MSN Search use content-based relevancy algorithms. The title tag of imageblowout.com is “Imageblowout – Free Image Library for Commercial Use”. It contains an exact match for the keyword used to perform the search query, hence the higher rankings in Yahoo! Search and MSN Search.
