Saturday, November 15, 2008

How Search Engines Index Websites

The competition's fierce. There's only one first page for any keyword or keyword phrase, and ideally you'll be somewhere at the top of it. As well as the quality of the competition you face, there's also the quantity: with over 21 billion web pages to choose from, the last thing you need is to be an also-ran. Remember, of the 800 million plus search queries conducted every day, many come from users searching for products or services your company may supply; if you're not there on that first page, you can forget it. Of these searchers, nearly 70% would rather click on a natural listing, and more than 50% of online shoppers rely primarily on search engines when looking for a product to buy online.

It's a cruel yet simple fact of search engine life that if you aren't on the first couple of pages returned there's little point in being in the engines at all. 80% of web searchers look no further than the end of the third page.

All-singing, all-dancing search engine optimisation (SEO) is all well and good, but to register as even the smallest blip on the search engine radar, the engines must first acknowledge your existence by indexing you.

Indexing

When you use a search engine to look for something, it sorts through millions of pages in an instant and presents you with the ones that best match your topic. The matches are even ranked, with the most relevant coming first.

Of course, the search engines don't always get it right, but considering the vast amounts of information they're dealing with and the timescales they operate to, they usually do a pretty amazing job.

WebCrawler founder Brian Pinkerton puts it like this - "Imagine walking up to a librarian and saying, 'travel.' They're going to look at you with a blank face."

So what stops a search engine looking at you with a blank face (or melting or doing whatever search engines do when they're confused) when you enter 'travel' in the query box?

The answer lies in the cunning application of their secret search algorithms to vast collections of indexed data. Search engine software sifts through the millions of pages recorded in the index to find matches to a search and rank them in order of what it believes is most relevant.
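To make that concrete, here is a minimal sketch of the idea behind an index lookup, written in Python purely for illustration (the example.com URLs and page wording are invented, and real engines use vastly more sophisticated structures and ranking signals): each word maps to the pages that contain it, and a query is answered by pulling those lists together and ranking pages by how many of the query words they match.

    # Toy inverted index: each term maps to the set of pages containing it.
    # Purely illustrative; nothing like a real engine's index or ranking.
    from collections import defaultdict

    index = defaultdict(set)

    def add_page(url, text):
        """Record every word of a page against its URL."""
        for word in text.lower().split():
            index[word].add(url)

    def search(query):
        """Return pages matching the query, most relevant (highest score) first."""
        scores = defaultdict(int)
        for word in query.lower().split():
            for url in index.get(word, set()):
                scores[url] += 1
        return sorted(scores, key=scores.get, reverse=True)

    add_page("http://example.com/spain", "cheap travel deals to spain")
    add_page("http://example.com/guides", "travel guides and travel tips")
    print(search("travel deals"))  # the Spain page matches both words, so it ranks first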

Bots, robots, web crawlers or spiders are software agents that the search engines regularly send out to 'crawl' or 'spider' websites and web pages. They suck in page content, following links from page to page, and return the information to the engines, which incorporate it into the enormous indexes of content maintained in their databases.

Everything the spider finds goes into the index. The index, sometimes called the 'catalog', is like a giant book containing a copy of every web page that the spider finds. If a web page changes, then this book is updated with new information.

Whilst continually crawling the Internet, each spider can keep hundreds of web connections open at once, crawling thousands of pages a second and generating huge amounts of data that is compressed to save space. There are armies of these bots endlessly spidering the web, registering new, amended, deleted and dynamic pages as they go.
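As a rough illustration of what a spider's core loop looks like, here is a toy, single-threaded crawler in Python (example.com is a stand-in starting point; a real spider runs thousands of parallel connections, respects robots.txt and revisits pages to pick up changes): fetch a page, keep its content for the index, extract the links, and queue anything it hasn't seen before.

    # Toy breadth-first crawler: fetch pages, store content, follow links.
    from urllib.request import urlopen
    from urllib.parse import urljoin
    from html.parser import HTMLParser

    class LinkParser(HTMLParser):
        """Collect the href of every <a> tag on a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(start_url, max_pages=10):
        """Spider outwards from start_url, returning {url: html} for the index."""
        queue, seen, pages = [start_url], {start_url}, {}
        while queue and len(pages) < max_pages:
            url = queue.pop(0)
            try:
                html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
            except Exception:
                continue  # skip pages that fail to load
            pages[url] = html  # this content would feed the index
            parser = LinkParser()
            parser.feed(html)
            for link in parser.links:
                absolute = urljoin(url, link)
                if absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)
        return pages

    print(len(crawl("http://example.com/")), "pages fetched")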

It's these databases that you search when you enter a query on Google, Yahoo! or any other search engine.

Bots even have names, referred to as 'user agent identifiers'; for example, Google's Googlebot or Yahoo!'s Slurp. (See the database of bots at robotstxt.org, though it may not always be fully up to date or comprehensive.)
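A site's robots.txt file is where these named bots are told what they may and may not crawl, and you can check it yourself. The sketch below uses Python's standard robotparser module, with example.com and a made-up page URL standing in for your own domain, to ask whether Googlebot and Slurp are allowed to fetch a given page.

    # Ask a site's robots.txt which named bots may crawl a given page.
    # example.com and the page URL are placeholders for your own site.
    from urllib.robotparser import RobotFileParser

    robots = RobotFileParser()
    robots.set_url("http://example.com/robots.txt")
    robots.read()  # fetch and parse the robots.txt file

    for bot in ("Googlebot", "Slurp"):
        allowed = robots.can_fetch(bot, "http://example.com/some-page.html")
        print(bot, "may crawl the page" if allowed else "is blocked by robots.txt")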

Not everything retrieved by a bot gets indexed immediately. There is usually a time lag between a crawl and the index being updated with all the information the bots have retrieved.

It's also unlikely that the bots will spider the whole of each website they visit. Statistics indicate that they index only about 16% of the information available, and they rarely explore below the third layer of content.

Submitting a site and its sitemaps to the search engines is a simple way of getting the bots' attention. However, it doesn't guarantee that they will actually decide to visit you and index you straight away.

Though submitting or adding your site to Google, for example, is better than not doing so, it doesn't necessarily mean your site will be indexed. What you really need are some nice, authoritative inbound links.

The higher the ranking and the more authority the linking site commands, the better. There are numerous ways to entice or cultivate quality inbound links. Essentially, your Internet presence strategy should be built around quality site content, quality being the cornerstone of any successful online campaign. Here are a few examples of how to generate those inbound links to attract the spiders and get rapidly indexed:

  • Profile Sites - New style "Web 2.0" sites often allow for the creation of profiles with links, from MySpace and Yahoo! 360 to Digg
  • Blogosphere - Identify the leading movers and bloggers in your marketplace. Link to them from your blog and let them know you've done so.
  • University sites - .edu or .ac sites carry great weight.
  • Government sites - .gov sites likewise carry great weight.
  • Professional contacts - friends' sites, suppliers or industry bodies. It's always good to network your URL.
  • Directories - Yahoo!, yell.com, DMOZ etc. - there are numerous paid-for and free directory submission opportunities. At SEO Consult, we manually submit to quality directories that have relevant sub-categories.
  • Linkbaiting - embed copy or media with links contained within the site content.
  • High-quality, exclusive content - Content is king, and high-quality, exclusive content reigns supreme.
  • Article submission, syndication and article exchange - Articles like this one are an extremely useful way of generating inbound links.
  • Widgets - screen savers and wallpapers, feed displays, viral applications, niche applications that site owners can include on their own sites are a great way to build links.

These inbound links should quickly yield bot visits. Feeding the bots a sitemap, which sets out the structure of your site and lists the internal URLs you want crawled, makes your site more search engine-friendly. Bots are creatures of habit; once you manage to attract them, as long as you keep the site updated and relevant, they will usually keep coming back to feed on fresh data.
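By way of example, a sitemap is simply an XML file listing the URLs you want crawled, in the format described at sitemaps.org. The sketch below, in Python with placeholder example.com pages, writes a minimal sitemap.xml that you could then submit to the engines.

    # Write a minimal sitemap.xml listing the pages you want the bots to crawl.
    # The URLs are placeholders; swap in your own site's pages.
    from xml.sax.saxutils import escape

    urls = [
        "http://example.com/",
        "http://example.com/products.html",
        "http://example.com/contact.html",
    ]

    with open("sitemap.xml", "w", encoding="utf-8") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for url in urls:
            f.write("  <url><loc>%s</loc></url>\n" % escape(url))
        f.write("</urlset>\n")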
