Saturday, November 15, 2008

Search Engine Spiders

The Internet is big. In fact, the Internet is very big. Google recently announced that the systems it uses to process links on the web and find new content had hit a milestone: 1 trillion (as in one thousand billion) unique URLs on the web at once. That's quite some going, considering that the first Google index in 1998 held only 26 million pages, and by 2000 it had reached just one billion.

Clearly that's a lot of ground to cover. The way that search engines collect that information is by using search engine spiders.

All search engines, including Google, use spidering as a means of providing up-to-date data. Spiders, otherwise known as bots or web crawlers, are software programs that request pages much like regular browsers do. These pages are indexed to provide fast searches.
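To give a feel for that indexing step, here is a rough sketch (not any search engine's actual code) of an inverted index in Python: each word points to the pages it appears on, so a query only has to look up its terms rather than scan every page. The sample pages are invented for illustration.

    # Minimal sketch of an inverted index: map each word to the pages it
    # appears on, so a search looks up query terms instead of scanning pages.
    from collections import defaultdict

    def build_index(pages):
        """pages: dict of {url: page_text}. Returns {word: set of urls}."""
        index = defaultdict(set)
        for url, text in pages.items():
            for word in text.lower().split():
                index[word].add(url)
        return index

    def search(index, query):
        """Return the URLs that contain every word in the query."""
        words = query.lower().split()
        if not words:
            return set()
        results = index.get(words[0], set()).copy()
        for word in words[1:]:
            results &= index.get(word, set())
        return results

    # Invented sample pages, purely to show the lookup in action.
    pages = {
        "http://example.com/a": "search engine spiders crawl the web",
        "http://example.com/b": "spiders follow links between pages",
    }
    index = build_index(pages)
    print(search(index, "spiders links"))   # {'http://example.com/b'}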

The usual starting points are lists of heavily used servers and very popular pages, known as the seeds. The spider begins with a popular site, collecting information, indexing the words on its pages and following every link found within the site. In this way, the spidering system quickly begins to travel, spreading out across the most widely used portions of the Web.

There are different types of spiders and different types of spider behaviour. For example, submission checkup spiders run a simple check on a submitted URL: is it valid, is the server available and accessible, is a redirect in effect, and so on. If the URL passes this test, it will typically be stored in a task queue for later crawling.
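As a rough illustration of that kind of check (not the code any particular engine uses), the Python sketch below tests whether a URL is well formed, whether the server answers, and whether a redirect is in place. The URL is a placeholder.

    # Rough sketch of a submission check: is the URL well formed, does the
    # server answer, and is a redirect in effect? Example URL is a placeholder.
    import urllib.request
    import urllib.parse
    import urllib.error

    def check_submission(url):
        parts = urllib.parse.urlparse(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            return "invalid URL"
        try:
            # urlopen follows redirects automatically, so compare the final
            # URL with the one that was submitted.
            with urllib.request.urlopen(url, timeout=10) as response:
                if response.geturl() != url:
                    return f"redirects to {response.geturl()}"
                return f"OK ({response.getcode()})"
        except urllib.error.HTTPError as err:
            return f"server error {err.code}"
        except (urllib.error.URLError, OSError) as err:
            return f"unreachable: {err}"

    print(check_submission("http://www.example.com/"))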

When it's time for the crawl, the spider first checks the page's head section - title tag, keyword phrases, description tag, other meta tags and robots instructions. It then reads the page content - headings, alt tags, link titles, keywords and phrases. Finally it locates the hyperlinks on the page and adds them to the list of URLs to visit, known as the crawl frontier; URLs from the frontier are visited recursively according to a set of policies, repeating the process on each destination page.

Continually crawling the Internet, each spider can keep hundreds of web connections open at one time, crawling thousands of pages a second and generating huge amounts of data that is encoded to save space. It's an endless task, with websites continually being edited, added and deleted, and dynamic pages altered.
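To make the loop concrete, here is a heavily simplified crawler sketched in Python. It takes a URL from the frontier, reads the title and meta description from the head, collects the links on the page and queues any it hasn't seen. The seed URL is a placeholder, and a real crawler would add robots.txt handling, politeness delays and far more besides.

    # Simplified crawl loop: pull a URL from the frontier, parse the head
    # section and the links, then push unseen links back onto the frontier.
    import urllib.request
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class PageParser(HTMLParser):
        def __init__(self, base_url):
            super().__init__()
            self.base_url = base_url
            self.title = ""
            self.description = ""
            self.links = []
            self._in_title = False

        def handle_starttag(self, tag, attrs):
            attrs = {name: (value or "") for name, value in attrs}
            if tag == "title":
                self._in_title = True
            elif tag == "meta" and attrs.get("name", "").lower() == "description":
                self.description = attrs.get("content", "")
            elif tag == "a" and attrs.get("href"):
                self.links.append(urljoin(self.base_url, attrs["href"]))

        def handle_endtag(self, tag):
            if tag == "title":
                self._in_title = False

        def handle_data(self, data):
            if self._in_title:
                self.title += data

    def crawl(seed, max_pages=10):
        frontier = deque([seed])          # URLs waiting to be visited
        seen = {seed}                     # URLs already queued or visited
        while frontier and max_pages > 0:
            url = frontier.popleft()
            max_pages -= 1
            try:
                with urllib.request.urlopen(url, timeout=10) as response:
                    html = response.read().decode("utf-8", errors="replace")
            except (OSError, ValueError):
                continue                  # unreachable or malformed URL
            parser = PageParser(url)
            parser.feed(html)
            print(f"{url}: {parser.title!r}")
            for link in parser.links:
                if link.startswith("http") and link not in seen:
                    seen.add(link)
                    frontier.append(link)

    crawl("http://www.example.com/")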

The search engines use this information to identify the most relevant web sites, prioritising them based on factors such as link popularity and the quality of their content. Without the crawling, indexing and encoding carried out by the spiders, the search engines would take far longer to retrieve their results.

As well as being employed to collect information to assist in relevant searching, crawlers are also used for automating maintenance tasks, such as checking links or validating HTML code. Crawlers are also used to gather specific types of information from Web pages, for example harvesting e-mail addresses (usually for spam).

Clearly spiders have an essential role in an organisation's web presence. Understanding how they work, and what they like and don't like, plays an important part in any search engine optimisation strategy. At SEO Consult we have extensive experience of spider behaviour and spider requirements.

It's important to make life as easy as possible for the spiders. Here are some aspects of spider-related SEO that we at SEO Consult consider when running a campaign. Spiders love fresh content: the more frequently a site is updated with new content, the more frequently the spiders will explore it.

Each addition or change of content sets off a chain reaction that results in a visit from Googlebot, Yahoo!'s Slurp or one of the other crawlers. The spiders compare the recent changes with the last cached snapshot of your pages, noting the revisions and integrating them.
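That comparison amounts to change detection against a stored copy. A minimal sketch, assuming the crawler simply keeps a hash of the last version it indexed (the in-memory dictionary here stands in for real storage):

    # Sketch of snapshot comparison: keep a hash of the last indexed version
    # of a page and re-index only when the hash changes.
    import hashlib

    last_snapshot = {}   # url -> hash of the content last indexed

    def needs_reindex(url, content):
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if last_snapshot.get(url) == digest:
            return False         # nothing changed since the cached snapshot
        last_snapshot[url] = digest
        return True              # content is new or revised, so index it again

    print(needs_reindex("http://example.com/", "<html>old copy</html>"))  # True
    print(needs_reindex("http://example.com/", "<html>old copy</html>"))  # False
    print(needs_reindex("http://example.com/", "<html>new copy</html>"))  # True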

A neglected page, or one with stale content that would benefit from a content and link audit, suggests to the search engines that on-page factors are not a priority for you. It tells them the page has become irrelevant.

Regular fresh content is all part of a healthy site strategy, one that ideally promotes the creation of islands of related information: subject pages that act as compelling visitor destinations in their own right. Each should be strong enough to earn its own weight and rank while contributing to the overall authority of the site, and traffic and exposure improve as a consequence. Ideally, include some juicy outbound links (as well as internal links) to contextually relevant sites of authority.

It's important that you don't make the spiders work too hard to find the really useful information on your site. Spiders have a very limited attention span: on average they are only really interested in around 16% of each site they visit. Make the relevant information accessible by not burying it too deep in the site, and make all of your mission-critical information reachable through a flat site architecture, allowing spiders (and humans) to navigate the site with ease.

Add a site map where the spiders can find every single link. The site map should contain text links, not graphics, and should also contain some text relevant to the site. After crafting poor titles with irrelevant naming conventions, the next surest way to cripple your optimisation efforts is to have no site map clearly linked from every page of your site.
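As a simple illustration, the Python sketch below writes a plain site map page of text links from a list of URLs and titles. The pages listed are placeholders; in practice you would generate the list from your actual site structure.

    # Sketch: generate a site map page of plain text links from a list of
    # (url, title) pairs. The pages listed here are placeholders.
    pages = [
        ("/", "Home"),
        ("/services/", "SEO services"),
        ("/contact/", "Contact us"),
    ]

    rows = "\n".join(
        f'  <li><a href="{url}">{title}</a></li>' for url, title in pages
    )
    sitemap_html = (
        "<h1>Site map</h1>\n"
        "<p>Every page on the site, listed as plain text links.</p>\n"
        f"<ul>\n{rows}\n</ul>\n"
    )

    with open("sitemap.html", "w", encoding="utf-8") as handle:
        handle.write(sitemap_html)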

Keep the load time down. With billions of pages to index, the faster a page loads, the greater the chance of it being picked up and indexed.
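A quick way to keep an eye on this is simply to time the fetch. A minimal sketch (the URL is a placeholder, and this only measures the raw download, not rendering):

    # Time how long a page takes to download; the URL is a placeholder.
    import time
    import urllib.request

    def load_time(url):
        start = time.perf_counter()
        with urllib.request.urlopen(url, timeout=10) as response:
            response.read()
        return time.perf_counter() - start

    print(f"{load_time('http://www.example.com/'):.2f} seconds")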

Databases and dynamic pages are extremely hard work for spiders - in fact they'll rarely bother with them. They like light HTML pages with links and keywords that can be easily sifted through.

Dense amounts of HTML code and graphic placeholders act as obstructions. With clean, well-written code, spiders can easily access your important information.
