2.3 Links

When you think of the Web you think of a jungle of pages interconnected by links. If you create a website with few links to the external world, not only will it hardly be found by a spider, but even when it is eventually crawled the assigned rank will almost certainly be pretty poor.
You can distinguish two kinds of links on a page: an inbound link, also known as a backlink, is a link from another page to this one; an outbound link, on the contrary, is a link which points from this page to another one. Backlinks are the important ones because the webmaster has little control over them; there are some ways they can be faked, but it is definitely harder than faking a meta tag or keywords. In particular, inbound links are fundamental for the PageRank algorithm used by Google.

2.3.1 PageRank

For a deeper treatment of the subject I suggest consulting Phil Craven's article.
When Google arrived with its link-based PageRank, link popularity took off and became an essential ingredient in achieving top rankings. The idea behind PageRank is that it is not only important how many inbound links a page has, but also where those links come from: a website which is pointed to only by other websites with a poor PageRank will have a low PageRank as well, but if it is pointed to by even a single link from the NASA website, it is likely to get a great rank. You can consider an external link as a vote for the quality of the pointed page; this is the reason why an external link to a website guilty of web spam is punished with a drop in ranking. It may not be your fault if you have a backlink from a bad website, but you do have to control the websites you link to. Every page has a PageRank, a numeric value representing the importance of that page: the higher the PageRank, the more weight its votes for other websites carry. To roughly estimate the value of a page you can use the following equation:
PR(A) = (1-d) + d(PR(t1)/C(t1) + ... + PR(tn)/C(tn))
This equation was published when PageRank was first being developed. Obviously Google now uses a variation of it, but this one is good enough to understand the underlying mechanism.
In the equation, PR(A) is the PageRank of page A; t1…tn are the pages that link to A (its backlinks), C(ti) is the number of links the corresponding page has, and d is a damping factor, usually set to 0.85. As you can see from the equation, the more links a page has, the less value its vote gives to each pointed page; hence, when you check whether your page has a fair number of inbound links, you should also consider the number of other links on the pages the backlinks come from.
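As a hypothetical worked example (the numbers are invented purely for illustration): suppose page A has a single backlink from a page T1 with PR(T1) = 4 and C(T1) = 10 outbound links. With d = 0.85 the equation gives PR(A) = (1 - 0.85) + 0.85 * (4/10) = 0.49, while if T1 had only 2 outbound links the same backlink would be worth PR(A) = 0.15 + 0.85 * (4/2) = 1.85.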
If you check the Google toolbar you will find a value from 0 to 10 for the PageRank, but this is more like a label which says which interval the PageRank of the page falls in. In fact the billions of pages on the web average out to a PageRank of about 1.0 per page; the difference between the page with the highest rank and the one with the lowest is divided into ten intervals, and these are the ten values shown in the Google toolbar. How these ten intervals are divided is unknown, but experts are fairly confident that the divisions are based on a logarithmic scale, or something very similar; one possible piece of evidence is that it is much harder to move up a toolbar point at the higher end than at the lower end. This means it is much better to have a backlink from a page with a PageRank of 8 where there are another 20 or 30 links than an inbound link from a page with a PageRank of 4 where there are only another 5 links.
So far the lesson is that every inbound link will increase the ranking of your website.
Unfortunately this is not always true. If your website has backlinks from a link farm you risk being penalized or banned by Google. We generally speak of a link farm when a group of web pages is created only to provide links to a group of websites, so that a single webmaster, who was paid, or a group of webmasters, who agreed to promote each other, put one or more pages of links on their websites. Since search engines cannot understand the reason for the great number of links with no content on the page, they usually exclude such websites from the index or decrease their ranking. An example is given by "LinksToYou.com".
Going back to the algorithm behind PageRank, it is interesting to notice that the calculation must be done iteratively. Imagine two pages, A and B, which link to each other while no other page points to them. Every time a PageRank is calculated the old value is thrown away; accordingly, if you want to calculate the value of A you need to consider its backlinks, but its only backlink is from B, and B's new PageRank has not been calculated yet, so A's new PageRank will be inaccurate, and when B's PageRank is calculated it will be inaccurate for the same reason. As Phil Craven showed in his famous paper "Google's PageRank explained", the solution is to calculate the PageRank iteratively using the above equation, with a starting value of 1 for a page which has no PageRank yet. It was shown that 40 or 50 iterations are enough to reach a point where further iterations do not change the result in a meaningful way. This is the reason for the "Google Dance": it takes a lot of time and work to calculate the final value of the PageRank, and during the first iterations the value is not stable, so, even for the same search terms, the result list may change after a few minutes.
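To make the iterative process concrete, here is a minimal sketch in JavaScript (used here only because the examples later on this page are in JavaScript; it illustrates the published equation, not Google's actual code) applied to the two-page example above, where A and B link only to each other:
// Iterative PageRank sketch for two pages A and B that link only to each other.
// Each page therefore has exactly one backlink, and each linking page has C = 1.
function iteratePageRank(startValue, iterations) {
    var d = 0.85;                              // damping factor
    var pr = { A: startValue, B: startValue }; // initial guess for every page
    for (var i = 0; i < iterations; i++) {
        var newA = (1 - d) + d * (pr.B / 1);   // B is A's only backlink, C(B) = 1
        var newB = (1 - d) + d * (pr.A / 1);   // A is B's only backlink, C(A) = 1
        pr = { A: newA, B: newB };
    }
    return pr;
}
console.log(iteratePageRank(1, 50)); // starting from 1 the values are already stable at 1
console.log(iteratePageRank(0, 50)); // starting from 0 they converge to almost exactly 1
Whatever the starting guess, after a few dozen iterations the values settle on the same figures, which is exactly the behaviour described above.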
Another thing to remember is that the results you obtain this way are only proportions among the PageRank values of the different pages; the real value is found using a scale known only to Google. The results are nevertheless useful to study the effects of inbound and outbound links on the overall ranking of a page. Every website has a maximum total PageRank equal to its number of pages; that value can only be increased with backlinks from external websites. Using the equation we have seen before, you can calculate the effects of the internal links on the PageRank of every page. There are tools on the Web, like Phil Craven's PageRank Calculator, that can be used to simulate how the PageRank changes when links are added to or removed from a page. In general, a link to an external website will decrease the overall average PageRank of your website: the vote of the page is divided among its links, and if, on the same page, you have links both to other pages of the same website and to another website, only part of the page's vote weight is kept inside the site.
Obviously, if the number of pages increases, the maximum PageRank will increase too, but you should be careful about adding new pages with similar content or you may be penalized. In general it is a good idea to try to concentrate the PageRank in a few pages, so that you have at least one well-ranked "entry point" to your website, leaving all the other pages with a poorer rank. It is also important that any inbound link points to the pages where you want a high PageRank and not to a random page, because in the latter case the benefit will be spread among all the pages of the website.

2.3.2 Outbound links and Robots.txt

You have just seen how much of a drain outbound links are on a site's total PageRank. In general, also for all the other search engines, it is not a good idea to have a page where the outbound links outnumber the inbound links. So you may decide to hide the outbound links from the spiders. You can do this using JavaScript or using a particular attribute of the anchor tag.
An example of the first case is:
<a href="javascript:goto('http://www.hiddensite.com')"> link text </a>
To be totally sure that the spider will not be able to see the link, the JavaScript code may be loaded from a file stored in a directory that is barred to Google's spider by the Robots.txt file, instead of being written inside the tag.
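As a sketch of what such a script could contain (the goto name is simply the one used in the example above, and the file name is invented for illustration), the external file needs little more than a redirect:
// hidden-links.js - stored in a directory that Robots.txt bars to crawlers,
// so a spider that respects the file never sees how the link is resolved.
function goto(url) {
    // Send the visitor's browser to the target page; the anchor itself
    // exposes no plain href for the spider to follow.
    window.location.href = url;
}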
An example of the second solution is:
<a href="http://www.hiddensite.com" rel="nofollow">link text</a>
Since January 2005 this attribute has been recognized by Google and other search engines; it tells the spider not to consider the link at all in the ranking of the destination page or of the page itself.
Another reason you may want Google not to consider a link is that it points to a "bad neighbourhood". If there is a website you want to link to but which has a bad reputation, for example because it is banned from Google's index, you should place the link inside JavaScript. The reason is that search engines do not penalize your website for inbound links from suspicious, perhaps over-optimized pages, because you do not have much control over other webmasters' actions, but it is up to you to avoid outbound links to bad sites. The concept is similar to real life: if you choose bad guys as friends, you are considered to be one of them. For this reason all the outbound links that can be seen by a spider should also be carefully analyzed by the webmaster, to check whether they point to a banned website. Verifying this is trivial: you just need to search for "site:nameOfTheSite" and if the website does not appear in the result list it means that it is banned. If your website has too many links to verify all of them manually, you can use automatic tools like the one at http://www.bad-neighborhood.com/.
Now consider the case where you do not want the spider to read a whole page, or directory, rather than just keeping it from following a link on another page: what can you do? You can use the "Robots.txt" file. This is a file which states which pages are supposed to be analyzed and which ones are not. It is not a means of hiding sensitive data, since it does not protect the website the way a firewall or a password does; the crawler may in fact decide to ignore the file. But since it usually will not, the file is useful whenever you have a page in two different versions, for instance one for printing and one for on-screen viewing, to avoid being banned for duplicate content by the search engine. However, you should understand the difference between "index" and "read": using the Robots file you can prevent a spider from reading, or analyzing, a page, so that it will not, for instance, calculate the keywords of the page, but you cannot prevent it from indexing the page if it has a backlink from a page which is not protected by a Robots file. In the latter case the address of the page will still be present in the search engine's database, among the pages connected to the one which was analyzed.
To use a Robots.txt you have to follow some rules. First of all, it has to be stored in the main directory, because the crawler will look for it only there. If nothing is found at that address, for instance "http://mydomain.com/robots.txt", the crawler will assume that no Robots.txt is used and it will read all the pages of the website. Then you have to follow a protocol: it is a text format where, for every kind of crawler, it is stated which pages are to be ignored. The structure is the following:
User-agent:
Disallow:
Where the "User-agent" is the name of the crawler (i.e. Googlebot for Google, the character * means everyone), and "Disallow" is the directory or the page that should not be read.
Since the smallest grammatical mistake is enough to prevent the spider to understand your indications, there are several tools to validate the file or even to automatically generate it using a graphical approach where you have just to point and select which files and folders are to be excluded.

2.3.3 Inbound links

As we have already seen, inbound links are fundamental for the ranking position of your website: the more backlinks a page has, the higher the ranking it will receive. However, the quality of the links matters even more than their quantity: a single link of high quality is always more important than a bunch of links from a link farm. The quality of a link is determined by different factors: the ranking of the page it comes from, the relevance of the content of the two connected websites, and how well the anchor text of the backlink incorporates keywords relating to your site. Regarding the last point, you may need to contact the other webmaster to suggest the best anchor text to use when linking to your website; in particular, it is a bad habit to use "click here" instead of a meaningful sentence. The "Google bombing" practice is enlightening in this respect: a group of people linked many pages to a particular target page, all using the same link text phrase. The target page had nothing to do with that phrase, did not contain it and was not about it; nevertheless, it reached the top position in Google's result list for that phrase.
You can obtain good inbound links in a few different ways.
First of all, you can use a forum on a website with a good ranking and with content relevant to your website. You only need to write a post in a topic and place a link to your site in the signature line. However, before spending time posting on lots of websites, you should check the Robots.txt to see whether the forum is readable by the spider and make sure the links in the posts are not hidden, for instance displayed inside JavaScript.
Then you may submit your website to a directory like Yahoo or DMOZ. Unfortunately these directories are managed by hand rather than by automatic software like spiders, so you will have to wait a while before your pages are included. Furthermore, DMOZ is an open project run by volunteers where every website and page that is added has to be manually reviewed. The editors are divided into categories and each editor can only edit in his or her own categories, which means that for minor categories the queue can be much shorter than for the major ones. Since the Web grows faster than the editors can review, there seem to be some categories where literally hundreds of thousands of websites are waiting for a review; if your website falls under one of those categories, it can take years before it is included! For this reason you have to be careful to submit to the right category: if you do not, your submission will be forwarded to the right category only after an editor starts to review it, so it will have to queue twice.
Another way is link exchange. You can join a link exchange centre like LinkPartners.com, where you can find other websites, with content relevant to yours, which want to exchange links. With this kind of service it is usually best to sign up only with centres where you can approach other sites personally, and where they can approach you personally. However, you should always avoid link farms, and avoid exchanging links with websites whose content is not relevant to your own, or you risk being penalized. You also have to pay attention if you have several websites hosted on the same server that are not related in content: some search engines, like Google, check the IP addresses of the connected websites, so try to keep the number of links between websites on the same host that are not highly content-related to a bare minimum.
You can always buy inbound links from highly ranked websites. You may find a good site with content related to your website and buy links directly from it, or you can use a third party, usually called a broker. Obviously you can choose the topic of the website where the links will appear, and you can also sell links through a broker.
However, even though links are pretty important, they are not the only way to optimize a website. Keywords are another important factor to consider, in particular for Yahoo.