Google runs on a distributed network of thousands of low-cost computers; this parallel execution is the key to the fast processing of queries. Google's spider is called Googlebot, and it is made up of several computers that continuously request pages, which are downloaded and then passed to Google's indexer. There are two different kinds of crawling technique: in deep crawling, Googlebot fetches a page and then creates a list of all the links found on that page, which is a good way to retrieve all the pages of a single website; in fresh crawling, by contrast, popular and frequently changing pages, for example those of newspaper websites, are crawled at a rate roughly proportional to how often they change.
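The core step of deep crawling can be sketched in a few lines: given a fetched page, extract its links and keep only those that stay on the same site, so the crawler can recursively retrieve the rest of the website. This is a minimal illustration, not Googlebot's actual code; the class and function names are my own.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collects the href target of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def same_site_links(base_url, html):
    """Deep-crawl step: the links on a page that stay on the same host."""
    parser = LinkExtractor()
    parser.feed(html)
    host = urlparse(base_url).netloc
    # Resolve relative links against the page URL, then keep same-host ones.
    absolute = (urljoin(base_url, href) for href in parser.links)
    return [url for url in absolute if urlparse(url).netloc == host]
```

For example, given a page on `http://example.com/` containing one internal and one external link, only the internal one survives as a candidate for the next fetch.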
The indexer stores all the crawled text in a database; the index is then sorted alphabetically by search term, with each index entry storing a list of the documents in which that term appears and the locations within the text where it occurs. As in all the major search engines, common words, called stop words, are ignored to save disk space and speed up search results. Since they would be present in the majority of pages, they are not useful for narrowing the list of results. Some stop words are, for instance, "how", "when", "where", "the", "a", and single characters.
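The structure described above is an inverted index: a mapping from each term to the documents and positions where it appears, with stop words skipped. A minimal sketch, using a tiny illustrative stop-word list rather than any real engine's:

```python
STOP_WORDS = {"how", "when", "where", "the", "a"}  # illustrative, not Google's list

def build_index(documents):
    """Build an inverted index: term -> {doc_id: [word positions]}.
    Stop words are dropped, saving space, since they appear almost everywhere."""
    index = {}
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split()):
            if word in STOP_WORDS:
                continue
            index.setdefault(word, {}).setdefault(doc_id, []).append(position)
    return index
```

Indexing `{1: "the quick fox", 2: "a quick brown fox"}` yields an entry `"quick" -> {1: [1], 2: [1]}`, while `"the"` and `"a"` are absent; a query can then intersect the document lists of its terms.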
Hundreds of factors are involved in Google's ranking system, but the main one is the so-called PageRank algorithm, which considers inbound and outbound links; I will analyse this algorithm in depth further on. Ranking is one of the functions of Google's query processor; another is the spelling correction system. This is one of the ways Google uses machine learning techniques to automatically learn relationships and associations within the stored data: the spelling correction system checks whether you have typed a misspelt word, or whether a similar query would return a much larger number of results.
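One simple way to realise a "did you mean" check of this kind, and only a sketch of the general idea rather than Google's method, is to compare the query against known terms by edit distance and prefer terms that appear in many more documents:

```python
def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming (one row at a time)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def suggest(query, term_counts, max_distance=2):
    """Propose the known term closest to the query; on ties, prefer the term
    with the larger document count, mimicking 'many more results' as a signal."""
    best = None
    for term, count in term_counts.items():
        d = edit_distance(query, term)
        if d <= max_distance and (best is None or (d, -count) < best[:2]):
            best = (d, -count, term)
    return best[2] if best else None
```

With hypothetical counts such as `{"google": 1000, "goose": 50}`, the misspelling `"googel"` is corrected to `"google"` because the candidates tie on distance but differ sharply in frequency.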
Approximately once a month Google's index is updated: the PageRank of all the crawled pages is recalculated, and since this requires on average 40 iterations of the algorithm, the calculation takes several days to complete. During these days the ranking positions of pages fluctuate, sometimes minute by minute, which is why the period is nicknamed the "Google Dance".
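The iterative calculation in question is power iteration over the link graph. A small, self-contained sketch, assuming the damping factor of 0.85 used in the original PageRank paper (on the full web graph, the same 40 or so iterations run over billions of pages, hence the days of computation):

```python
def pagerank(links, damping=0.85, iterations=40):
    """Power-iteration PageRank on a link graph {page: [outbound links]}.
    Each iteration redistributes every page's rank along its outbound links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # uniform starting ranks
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for page, outbound in links.items():
            if outbound:
                share = damping * rank[page] / len(outbound)
                for target in outbound:
                    if target in new:
                        new[target] += share
            else:
                # Dangling page with no outbound links: spread its rank evenly.
                for p in pages:
                    new[p] += damping * rank[page] / n
        rank = new
    return rank
```

On a toy graph such as `{"a": ["b"], "b": ["a", "c"], "c": ["a"]}` the ranks sum to 1, and "a", which receives links from both other pages, ends up with the highest score.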
There is another circumstance in which Google dances: the updates. An update occurs every time Google's ranking algorithm is changed to improve the service to users. In particular, each change aims to detect SEO spam techniques better and better. These techniques are particularly dangerous for a search engine when they are used in unethical and unscrupulous ways to mislead it into thinking a website has different, relevant content.
The first update was the world-famous "Florida Update" (November 2003). It was followed by the Austin, Brandy, Bourbon, and Jagger updates, but the biggest one was the first. It took eight weeks for stable listings to re-establish themselves; a quick search on the Internet is enough to see how many upset webmasters were affected by this change. The update consisted of the introduction of semantic contextualization software and a variation of the Hilltop "expert document" algorithm.
The semantic components were quite mild; it is a general belief that Google is moving towards a semantic-type system, but even today you cannot search for "glass, nails, and wood" hoping to find windows.
Hilltop was more important: it cut the importance of reciprocal links. Once the importance of inbound links for PageRank was understood, many websites had started to exchange links; after the update the weight of those links was dramatically reduced.
Perhaps the greatest effect of the Florida Update was the sudden rise in the popularity of blogs. The reason is that every blog owner places a link on the website to each blog entry, and these are not reciprocal links. Every link increases PageRank a little, and when the blogs number in the thousands, the ranking of the website goes sky high. Incidentally, Google owns the main blog software developer, Blogger.
The latest change to the Google search engine is called Bigdaddy (February 2006), but it was not a real update. As the Google engineer Matt Cutts said: "in this case the changes are relatively subtle (less ranking changes and more infrastructure changes). Most of the changes are under the hood, and this infrastructure prepares the framework for future improvements throughout the year". However, in other threads of his forum Cutts said about a site: "Someone sent in a health care directory domain. It seems like a fine site, and it's not linking to anything junky. But it only has six links to the entire domain. With that few links, I can believe that out toward the edge of the crawl, we would index fewer pages." This means that links into and out of a site are being used to determine how many pages of the site should be included in the index. Since the Web grows exponentially day after day, not all pages can be indexed and kept in a distributed database. In order to keep more and more websites in the index, the idea is to index many more pages for the major websites and only a few pages for the minor ones; the latter will at least be represented by their home page and a few other major pages. This solves a problem which could otherwise lead to only large websites being indexed, with small or young websites not indexed at all.
In this change a new attempt was made to fight reciprocal links; in particular, links which connect unrelated websites are now considered junk. Cutts gives the example of free ring-tone websites linking to Omega-3 fish oil sites.
Even though Google had been working on Bigdaddy since February, during the summer of 2006 the new system did not yet seem stable: there were still many complaints about duplicated results, non-existent links, and results that differed after only a few seconds with the same search term.
Now let us analyse the most important aspects of optimization from Google's standpoint.