Crawler and Bot Blocking in the Sueetie Addon Pack

We’ve previously introduced the Client Access Control module in the Sueetie Addon Pack and shown how it can block access to your community site by country, based on client IP address. Any Client Access Control system worth its salt should also let you block bots and crawlers, and that is the latest addition to the Addon Pack, now online at Sueetie.com.

By agent we mean crawlers and bots. Some agents are good and others are bad, so the Agent Access Control system lets the good crawlers in and keeps the bad ones out. Below is a reprint of the new Sueetie Wiki page describing bot blocking and filtering coming in the Sueetie Addon Pack.

__________________

Blocking versus Filtering

Sueetie gives you the ability to block crawler and bot access to your community using string excerpts of the Agent description. Here are a few agent string examples: Mail.Ru, googlebot, msnbot and R6_ScrapBox. Some of these agents are good and some are bad. We have two options for handling bots and crawlers in Sueetie:

  1. Block them completely by throwing up a 404 page
  2. Let them do their thing on our site, but filter them by not entering the page request into the Sueetie Request Logs

Blocking is easy enough to understand. They don’t get in. Period. Filtering requires a small bit of elaboration.
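
To make the two options concrete, here is a minimal C# sketch of how an agent rule might be matched against a request's User-Agent string and turned into one of those decisions. The AgentRule, AgentAction and AgentPolicy names are illustrative only; they are not the actual Sueetie Addon Pack types.

    using System;
    using System.Collections.Generic;

    // Illustrative types only, not the actual Sueetie Addon Pack classes.
    public class AgentRule
    {
        public string Excerpt { get; set; }        // e.g. "Mail.Ru" or "googlebot"
        public bool AllowSiteAccess { get; set; }  // true = filter from logs, false = block
    }

    public enum AgentAction { Normal, FilterFromLogs, Block404 }

    public static class AgentPolicy
    {
        // Match the request's User-Agent against the configured excerpts and
        // decide how the request should be treated.
        public static AgentAction Decide(string userAgent, IEnumerable<AgentRule> rules)
        {
            if (string.IsNullOrEmpty(userAgent))
                return AgentAction.Normal;

            foreach (var rule in rules)
            {
                if (userAgent.IndexOf(rule.Excerpt, StringComparison.OrdinalIgnoreCase) >= 0)
                {
                    // Option 2: allowed agents get in but stay out of the Request Logs.
                    // Option 1: disallowed agents are blocked outright.
                    return rule.AllowSiteAccess ? AgentAction.FilterFromLogs : AgentAction.Block404;
                }
            }

            return AgentAction.Normal;  // ordinary visitors are served and logged as usual
        }
    }

The sketch uses a simple case-insensitive substring match, since the Agent Access Management entries are excerpts of the full agent string rather than exact matches.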

Why Filter Bot Page Requests

There are two primary reasons to keep bot page requests out of the Sueetie Request Logs: 1) to slow the growth of our SQL database, and 2) to improve Sueetie Analytics reporting data.

Sueetie Analytics (in development) records all page requests and will provide you with detailed information on your most popular pages. We don't care whether bots find our pages popular; we care whether PEOPLE find our pages popular, and with Sueetie Analytics, precisely which people in our community. So: no bots in our Sueetie Request Logs.
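
As a rough illustration of what filtering means in practice, the sketch below guards the request-logging step so that filtered agents never reach the Request Logs. It reuses the hypothetical AgentPolicy sketch above; IRequestLogService and its Log method are assumptions for the example, not the real Sueetie logging API.

    using System;
    using System.Collections.Generic;

    // Assumed logging abstraction for the example; not the actual Sueetie API.
    public interface IRequestLogService
    {
        void Log(string url, int userId, DateTime requestedUtc);
    }

    public class FilteredRequestLogger
    {
        private readonly IEnumerable<AgentRule> _rules;
        private readonly IRequestLogService _logService;

        public FilteredRequestLogger(IEnumerable<AgentRule> rules, IRequestLogService logService)
        {
            _rules = rules;
            _logService = logService;
        }

        public void LogPageRequest(string userAgent, string url, int userId)
        {
            // Filtered agents are skipped, which keeps the SQL database small
            // and keeps the analytics data people-only.
            if (AgentPolicy.Decide(userAgent, _rules) == AgentAction.FilterFromLogs)
                return;

            _logService.Log(url, userId, DateTime.UtcNow);
        }
    }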

Managing Bot and Crawler Access

Below is a screenshot of the Sueetie Addon Pack’s Agent Access Management page. Here we enter a string excerpt found in the UserAgent request description; our only remaining decision is whether to allow site access. Allowing site access gives an agent free rein of the site (subject to your /robots.txt file), but none of its page requests are logged in the Sueetie Request Logs. Marking a crawler as not having site access on the Agent Access Management page blocks it from the site entirely via the Sueetie Addon Pack’s AgentBlocker HttpModule.

[Screenshot: Agent Access Management page]
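
A bare-bones blocking module might look like the following sketch. It is written in the spirit of the AgentBlocker HttpModule but is not the actual Addon Pack code; it reuses the AgentRule and AgentPolicy types from the first sketch, and AgentRuleStore is a placeholder for however the Agent Access Management entries are really loaded.

    using System;
    using System.Collections.Generic;
    using System.Web;

    // Placeholder rule source for the sketch; the real entries come from
    // the Agent Access Management page.
    public static class AgentRuleStore
    {
        public static IEnumerable<AgentRule> LoadRules()
        {
            return new[]
            {
                new AgentRule { Excerpt = "R6_ScrapBox", AllowSiteAccess = false },
                new AgentRule { Excerpt = "googlebot",   AllowSiteAccess = true }
            };
        }
    }

    public class AgentBlockerModuleSketch : IHttpModule
    {
        public void Init(HttpApplication app)
        {
            app.BeginRequest += OnBeginRequest;
        }

        private void OnBeginRequest(object sender, EventArgs e)
        {
            var app = (HttpApplication)sender;
            string userAgent = app.Context.Request.UserAgent ?? string.Empty;

            if (AgentPolicy.Decide(userAgent, AgentRuleStore.LoadRules()) == AgentAction.Block404)
            {
                // Blocked crawlers get a 404 and nothing else.
                app.Context.Response.StatusCode = 404;
                app.CompleteRequest();
            }
        }

        public void Dispose() { }
    }

An HttpModule like this is registered in web.config, under <httpModules> for classic ASP.NET or <system.webServer>/<modules> when running in IIS 7 integrated mode.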

Determining Which Bots to Block

That’s coming in Client Access Reporting, currently under development, so check back soon to learn how to analyze site access and take full advantage of Sueetie Agent Blocking and Filtering in the Sueetie Addon Pack.

Article written by

A long-time developer, I was an early adopter of Linux in the mid-90s for a few years, until I entered corporate environments and worked with Microsoft technologies like ASP and then .NET. In 2008 I released Sueetie, an Online Community Platform built in .NET. In late 2012 I returned to my Linux roots and locked in on Java development. Much of my work is available on GitHub.