Why should you influence search engine indexing?
There are a variety of reasons to control indexing and thus dictate how a search engine deals with websites and links:
- Allow or disallow following links
- Prevent indexing of irrelevant websites
- Index duplicate content under only one URL
The goal, of course, is to deliver only relevant HTML pages to the engine. But this doesn’t always happen properly. Duplicate content quickly arises through technical problems or the ubiquitous ‘human factor’, which is all too common. But there are ways to keep an index clean and counteract this.
Which methods work?
I will cover three methods for influencing the indexing of your site: what they are and how they can be used.
The /robots.txt file is like a ‘bouncer’ for search engine crawlers. It specifies which crawlers may access which pages or sections of a domain. Most crawlers follow the /robots.txt file, but it is a suggestion rather than an order.
The /robots.txt file primarily uses two instructions:
User-agent: – specifies which crawler the following rules apply to
Allow/Disallow: – specifies the file or directory concerned
A blank line closes each block of rules.
A /robots.txt file generally looks like this:
# robots.txt for http://www.example.de/
User-agent: ROBOTNAME
Disallow: /pictures/

User-agent: *
Disallow: /confidentialData/
Disallow: /allPasswords.html
If you want to address all crawlers at once, use the following expression: User-agent: *
Be careful: with Disallow: / you block all robots from the entire domain. If a site receives no organic traffic, this could be the reason.
As long as work is being done in a test environment and the data should not yet be discoverable, it is useful to exclude entire directories from indexing.
Crawlers from dubious providers usually ignore /robots.txt, but established search engines do observe the instructions.
But why should I disallow crawlers access to parts of my domain?
It’s very simple: not all webserver content should appear in a search engine index. The instructions request that the crawler not index certain paths. This could be the case, for instance, when there are test pages on the webserver that are not yet ready for the public, or when not all pictures in a folder should be indexed.
/robots.txt is particularly suited to preventing the indexing of non-relevant HTML pages. However, page URLs can still end up in the index, for example if the pages are linked externally. In that case, no snippet is displayed in the SERPs. If individual URLs should be excluded from the index, the following methods are suitable.
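Whether a given URL is blocked by a /robots.txt can also be checked programmatically. A minimal sketch using Python's standard urllib.robotparser; the rules and URLs are illustrative and mirror the example above:

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules, matching the /robots.txt example above
rules = """
User-agent: *
Disallow: /confidentialData/
Disallow: /allPasswords.html
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

rp.can_fetch("*", "http://www.example.de/index.html")          # → True
rp.can_fetch("*", "http://www.example.de/confidentialData/x")  # → False
```

This is the same check well-behaved crawlers perform before fetching a page, which makes it handy for auditing your own rules.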
Two meta tag attributes are useful for controlling crawlers and the indexing of HTML pages. For every HTML page, you can set how the crawler should handle indexing and the links on that page.
The meta instruction <meta name="robots" content="index,follow" /> addresses the crawler for each HTML page individually and supports the following combinations:
| Value | Meaning |
| --- | --- |
| content="index,follow" | index the HTML page, follow its links |
| content="noindex,follow" | do not index the HTML page, follow its links |
| content="index,nofollow" | index the HTML page, do not follow its links |
| content="noindex,nofollow" | do not index the HTML page, do not follow its links |
This tells the crawler whether it may take the HTML page into the index and whether it can follow the links in the HTML page. Links from “nofollow” HTML pages do not pass PageRank or other forms of link equity. The “nofollow” attribute can be specifically used to devalue links on an HTML page.
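To see what a crawler sees, the robots directives of a page can be extracted with a few lines of Python. A sketch using the standard html.parser module; the HTML document here is illustrative:

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect the directives from <meta name="robots" ...> tags."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "robots":
            self.directives += [d.strip().lower()
                                for d in a.get("content", "").split(",")]

html = '<html><head><meta name="robots" content="noindex,follow"></head><body></body></html>'
p = RobotsMetaParser()
p.feed(html)
# p.directives → ["noindex", "follow"]
```

A small audit script built on this can flag pages that were accidentally shipped with "noindex".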
When dealing with documents that have no <head> section, the X-Robots-Tag HTTP header can help. It allows non-HTML documents, such as pictures or PDF files, to be excluded from indexing or have their indexing restricted.
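The header has to be attached by the server itself. A minimal sketch using Python's standard http.server; the file extensions and the header value are illustrative assumptions:

```python
from http.server import SimpleHTTPRequestHandler

def x_robots_header(path):
    """Return an X-Robots-Tag value for a path, or None to leave indexing alone."""
    # Illustrative: keep PDF and Word documents out of the index
    if path.lower().endswith((".pdf", ".doc", ".docx")):
        return "noindex, nofollow"
    return None

class NoindexDocsHandler(SimpleHTTPRequestHandler):
    """Serves files as usual, but adds X-Robots-Tag for non-HTML documents."""
    def end_headers(self):
        value = x_robots_header(self.path)
        if value:
            self.send_header("X-Robots-Tag", value)
        super().end_headers()
```

In practice, the same header is usually configured directly in the webserver (Apache, nginx) rather than in application code.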
Meta tags are best used to prohibit following links or indexing individual HTML pages.
The canonical tag is primarily an aid to prevent duplicate content in the index. Canonicals tell search engines that, instead of the page they found, the original (more relevant) page should be indexed.
The canonical tag belongs in the <head> section of an HTML page and is used as follows:
<link rel="canonical" href="http://www.example.de/rightpage.html">
Duplicate content arises, for instance, when:
- URLs can be reached both with and without www.
- session IDs are used in the URLs
- several HTML pages carry very similar content
- the same product is offered in several categories
It is useful to give every HTML page a canonical tag pointing to its own URL, so that pages refer to themselves. That way, potential URL tracking parameters do not cause duplicate content.
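The effect of a self-referencing canonical can be mirrored server-side by normalizing URLs before emitting the tag. A sketch using Python's standard urllib.parse; the list of tracking parameters is an illustrative assumption:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Illustrative set of parameters that should never create a new "page"
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid"}

def canonical_url(url):
    """Strip tracking parameters so URL variants collapse to one canonical URL."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query)
             if k not in TRACKING_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(query), ""))

canonical_url("http://www.example.de/rightpage.html?utm_source=mail&sessionid=42")
# → "http://www.example.de/rightpage.html"
```

Parameters that genuinely change the content (such as pagination) are kept, while tracking noise is discarded.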
All three methods help control the crawler. Primarily, this means avoiding duplicate content and only indexing the HTML pages that should appear in the index.
The /robots.txt defines the basic framework for the crawler. Meta tags refine it and can give precise instructions for individual HTML pages. Canonicals prevent duplicate content created by URL manipulation from getting into the index.
There can be a downside, of course. It is important not to issue contradictory instructions when combining these methods. For example, no canonical tag should point to an HTML page that is excluded via /robots.txt, and the sitemap contents should also be checked for contradictions. The canonical tag should always be the last resort for avoiding duplicate content: it is more of a band-aid for bad architecture and content management systems, and it is much better to build clean projects from the very beginning.
Controlling indexing is essential for good search engine results, and the three methods are good aids in avoiding errors. The main effort is to give the search engine crawlers instructions via /robots.txt and meta tags; canonicals help keep duplicate content out of the index.