June | 2015 | csharpblog.net

So you have built your great new wine-selling site. You followed best practices and invested time in making sure your software engineers used the best frameworks available to them. Your UI engineers have ensured that the site is fully responsive and gives your users the best possible experience on any device. Whether it is a mobile phone, a tablet, a fridge door with internet connectivity or even something as exotic as a desktop computer – they made sure your site is accessible and performs well. But what if you are a bot? No, really, what if you are a bot? A search engine bot, say Googlebot or YandexBot? After all, you want your users to find your site on their favourite search engine, so you made sure that all your links are crawlable and that you have a reasonable robots.txt file. But are you sure you haven’t provided far too many links?

So your site sells wines from all over the world. Imagine a global wine selling site, where every vineyard can sell directly to its connoisseurs. Your site allows your users to:

– Search for wines by name or colour (to keep things simple)
– View wine prices in local currencies (25 major currencies)
– View wines in their local language (20 major languages)
– And sort these results by price

Clearly you anticipate that the site will be a massive success, which is why you made sure caching is used properly to keep performance high. But how much can you cache?

Let us assume that you have managed to sign up 1,000 vineyards from around the world, and that each of them sells three types of wine (white, red and rosé). So your site can sell 1,000 x 3 = 3,000 unique bottles of wine. Each of these bottles comes with a great description, ratings and various tags. Let us assume that each wine has 200 KB of data attached to it. So far your site can return 3,000 bottles x 200 KB = 600,000 KB (600 MB) of data. Great, you can cache all of that and your site will be super fast. But what about the languages, currencies and sorting? Ah yes, each combination creates another unique cached result set. Actually a lot more of them! 20 languages x 25 currencies x 2 sort directions x 600 MB = 600,000 MB (roughly 586 GB). Can you still cache all of that? No, you can’t. But then you most likely don’t need to. Most users will not convert the prices or change the sorting very often. You can afford to produce these result sets when needed and cache them for a short time.
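The arithmetic above can be checked with a few lines of Python. This is only a sketch using the figures assumed in this post; the function names are made up for illustration, and it follows the same unit convention as the text (1 MB = 1,000 KB, 1 GB = 1,024 MB).

```python
# Figures assumed in the post
VINEYARDS = 1_000
WINES_PER_VINEYARD = 3   # white, red, rosé
KB_PER_WINE = 200
LANGUAGES = 20
CURRENCIES = 25
SORT_DIRECTIONS = 2


def base_cache_mb() -> float:
    """Size of one full copy of the catalogue in MB (1 MB = 1,000 KB)."""
    bottles = VINEYARDS * WINES_PER_VINEYARD
    return bottles * KB_PER_WINE / 1_000


def variant_count() -> int:
    """Number of distinct language/currency/sort combinations."""
    return LANGUAGES * CURRENCIES * SORT_DIRECTIONS


def full_cache_gb() -> float:
    """Size of every variant cached at once, in GB (1 GB = 1,024 MB)."""
    return base_cache_mb() * variant_count() / 1_024


if __name__ == "__main__":
    print(f"One copy of the catalogue: {base_cache_mb():,.0f} MB")
    print(f"Variants per result set:   {variant_count():,}")
    print(f"Everything cached at once: {full_cache_gb():,.1f} GB")
```

The point to notice is the multiplier: every extra language, currency or sort option multiplies the cache, it does not just add to it.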

What about bots? Have you made sure that all your links have rel=“nofollow”? Yes, all your A tags have that attribute, but what about the select elements that you included for your mobile users? Their options cannot carry rel=“nofollow”, so bots will crawl all of these extra links, which don’t really alter the result sets and don’t add any SEO value. Initially your site will perform fine, but over time it will start to buckle. If bots find all the currency and sort-order parameters in your URLs, your servers will slowly cache more and more data. And because it is highly unlikely that you have 1 TB of RAM, you will run out of memory pretty quickly. That means your system’s page file comes into use, and that’s when your site will really slow down. Well, until bots like Googlebot notice and slow down their crawl rate to let your site catch up. Or maybe they don’t? Some bots, like Yandex, will actually make 20-30 simultaneous calls to your site. Can you imagine the load?
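If you want to check a rendered page for this problem, a small audit script can help. The sketch below uses Python’s standard html.parser to list A tags that carry a view-only parameter without rel=“nofollow”, and option values that expose such URLs. The parameter names are assumptions: curr and locale are borrowed from the robots.txt example in this post, sort is hypothetical, and your own site will use its own names.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlsplit

# Assumed names of parameters that only change the view, not the results
NON_ALTERING_PARAMS = {"curr", "locale", "sort"}


def has_view_param(url: str) -> bool:
    """True if the URL's query string contains any view-only parameter."""
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    return not NON_ALTERING_PARAMS.isdisjoint(query)


class LinkAudit(HTMLParser):
    """Collects crawlable URLs that carry view-only parameters."""

    def __init__(self):
        super().__init__()
        self.unprotected_links = []  # <a> without rel="nofollow"
        self.option_urls = []        # <option value="..."> URLs

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "a":
            href = attr.get("href") or ""
            rel = (attr.get("rel") or "").lower().split()
            if has_view_param(href) and "nofollow" not in rel:
                self.unprotected_links.append(href)
        elif tag == "option":
            value = attr.get("value") or ""
            if has_view_param(value):
                self.option_urls.append(value)


if __name__ == "__main__":
    page = """
    <a href="/wines?colour=red&amp;curr=EUR">EUR</a>
    <a rel="nofollow" href="/wines?colour=red&amp;curr=USD">USD</a>
    <select><option value="/wines?colour=red&amp;sort=asc">Price</option></select>
    """
    audit = LinkAudit()
    audit.feed(page)
    print("Links missing nofollow:", audit.unprotected_links)
    print("URLs exposed in options:", audit.option_urls)
```

Anything this reports is a URL a bot could pick up from the markup and request.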

So please make sure of the following:
– All your links that don’t alter the results (sorts, currency conversions, locales) need to have rel=“nofollow”.
– If you need to offer these options through a select element, use JavaScript to construct the options, so that bots cannot crawl them.
– Upload an appropriate robots.txt file to your site. Make sure it excludes these parameters and, where the bot supports it, sets the crawl frequency. Some bots, like Yandex, allow you to slow down the crawl by providing extra directives in your robots.txt file.

User-agent: Yandex
Crawl-delay: 4.5
Clean-param: curr&rad&locale

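By adding the above statements to your robots.txt you are telling Yandex to allow at least 4.5 seconds between calls to your site and to ignore the specified params, so URLs that differ only in those params are treated as the same page. This doesn’t mean that your site will be crawled every 4.5 seconds; the delay is a minimum gap between calls, not a schedule.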
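To make the effect of Clean-param concrete, here is a rough Python sketch of what “ignoring the specified params” amounts to: URLs that differ only in those params collapse to a single address. This only illustrates the idea and is not how Yandex implements it; canonical_url is a hypothetical helper and example.com is a placeholder domain.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# The params listed in the Clean-param line above
CLEAN_PARAMS = {"curr", "rad", "locale"}


def canonical_url(url: str) -> str:
    """Return the URL with the Clean-param parameters stripped out."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in CLEAN_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


if __name__ == "__main__":
    variants = [
        "https://example.com/wines?colour=red&curr=EUR&locale=de",
        "https://example.com/wines?colour=red&curr=USD&locale=en",
        "https://example.com/wines?colour=red",
    ]
    # All three variants collapse to a single address
    print({canonical_url(u) for u in variants})
```

With the params stripped, a thousand language and currency variants of one result page look like one page to the crawler instead of a thousand.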

I hope this helps you stop bots from controlling and “stretching” your site’s resources.


Yandex stretching your site like Spandex?!?!