Dave Burke : Online Community and Social Business Specialist

Sueetie Analytics Logging Upgrade: Filter Me Gently

Well that didn't take long. With all that great data flowing into the Sueetie Log for Analytics Reporting I had to play with it a bit and do even more with it. I knew from what I was seeing in the log table that there was non-human activity going on I needed to learn more about. I'm interested in what my USERS are doing. Bots and crawlers need not apply.

For Analytics Logging to help me better understand the crawler activity in Sueetie, I needed to employ filtering on user agent. And while I was at it, I put everything in place to go the additional step and impose blocking based on both user agent and IP if I wanted to. That is now complete as well.

So in summary, Analytics Logging received an out-of-the-gate upgrade today that I think everyone's going to like a lot, with tools to easily manage filtering for their Sueetie Communities. I added filtering not just on user agent, but on site url as well, and the tools to manage url filtering are real easy, too. Read on for a reprint of the official Analytics Logging Filtering Document from the Sueetie Wiki below.

___________________

A Clean Analytics Log is a Happy Analytics Log

Sueetie Analytics Logging was online for less than 24 hours when I knew we needed additional data and controls to ensure more accurate Analytics Reports. Bottom line, the logging algorithms needed to be more intelligent. We achieved that by adding User Agent and Remote IP logging, along with creating a url filter file to prevent unwanted urls from being logged.

User Agent Filtering

Sueetie Analytics are called USER Analytics, not crawler analytics. We do not wish to report on crawler page loads, so we need to prevent them from being logged. We could always do a post cleanup with a Sueetie Background Task, but it's more efficient to prevent crawler activity from being logged in the first place.

There is no static, defined list of crawler agents, so we're going to use a new SueetieConfiguration Core CrawlerAgents Property to manage our crawler agent list which will change over time. Here's what the initial list looks like in the Sueetie.config file.  Expect it to change in short order.

 CrawlerAgents="(Reeder|msnbot|Googlebot|Baiduspider|ScrapeBox)"

Before logging the page request we run a Regex match of the request's User Agent against the Sueetie Config CrawlerAgents pattern. If the match indicates the agent is a web crawler, we do not log the page load.
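The check described above can be sketched in a few lines. This is an illustration in Python, not the Sueetie source (which is .NET). The function names `is_crawler` and `should_log` are hypothetical, and the case-insensitive match and the handling of an empty User Agent are assumptions the post does not state.

```python
import re

# Pattern copied from the CrawlerAgents value shown in Sueetie.config.
CRAWLER_AGENTS = "(Reeder|msnbot|Googlebot|Baiduspider|ScrapeBox)"

# Assumption: match case-insensitively; the post does not say which mode is used.
_CRAWLER_RE = re.compile(CRAWLER_AGENTS, re.IGNORECASE)


def is_crawler(user_agent):
    """Return True when the User Agent contains any configured crawler token."""
    if not user_agent:
        # Assumption: a missing User Agent is not treated as a crawler here.
        return False
    return _CRAWLER_RE.search(user_agent) is not None


def should_log(user_agent):
    """Hypothetical helper: log the page load only for non-crawler agents."""
    return not is_crawler(user_agent)
```

Because the pattern lives in configuration, adding a newly spotted crawler only means appending another `|token` to the CrawlerAgents string.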

Url Filtering

There are certain application files we will not want to include in our Analytics Logs, such as the auto-refresh page that ScrewTurn Wiki requests while a user is editing a wiki page. To manage which urls we do not wish to log, we've added a NoLog.config file to the Sueetie /util/config directory. Here is the initial NoLog.config file, where we are filtering the wiki refresh file.

<?xml version="1.0"?>
<nolog>

    <!-- Enter string returning true on Request.RawUrl.ToLowerInvariant().Contains(uniquePathExcerpt)-->

    <!-- wiki -->

    <url name="wiki_refresh"  uniquePathExcerpt="sessionrefresh.aspx" />

</nolog>
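
The comment in the file spells out the matching rule: a url is skipped when its lowercased raw url contains the `uniquePathExcerpt` value. Here is a minimal sketch of that rule in Python, for illustration only. The function names `load_excerpts` and `is_filtered_url` are hypothetical, and the config is inlined as a string rather than read from /util/config.

```python
import xml.etree.ElementTree as ET

# Inlined copy of the NoLog.config sample so the sketch is self-contained.
NOLOG_XML = """<?xml version="1.0"?>
<nolog>
    <!-- wiki -->
    <url name="wiki_refresh" uniquePathExcerpt="sessionrefresh.aspx" />
</nolog>
"""


def load_excerpts(xml_text):
    """Collect the lowercased uniquePathExcerpt value of every <url> entry."""
    root = ET.fromstring(xml_text)
    return [
        url.get("uniquePathExcerpt").lower()
        for url in root.findall("url")
        if url.get("uniquePathExcerpt")
    ]


def is_filtered_url(raw_url, excerpts):
    """Return True when the lowercased raw url contains any excerpt."""
    lowered = raw_url.lower()
    return any(excerpt in lowered for excerpt in excerpts)
```

A request for `/wiki/SessionRefresh.aspx?Page=Main` would be skipped under this rule, while `/blog/default.aspx` would still be logged.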

Additional Data Now Logged

To give us more knowledge of site traffic, both human and machine, we are logging two additional request properties: User Agent and Remote IP address. We are logging User Agent so we can learn what is hitting our site and what we need to enter into our CrawlerAgents string. We are logging the Remote IP address because we need to know the origin of suspicious behavior so we can take steps to prevent future attacks. Not yet announced is a Sueetie Add-on Pack (online at Sueetie.com) which includes managed IP blocking. The data gathered here can be added to it to prevent site access.

Those of you who know Sueetie intimately, and know that one of its core principles is a small footprint with a small database, may think logging the User Agent and IP address on each request breaks a major Sueetie Rule. I agree, so we've taken that into consideration in the design of the logging tables. We created a new table called Sueetie_RequestLog which stores User Agent and IP on each request. A Guid serves as a key value between this new RequestLog table and the ReportLog table. We use the Request Log for monitoring site activity and tweaking filtering, but it is not used in Analytics Reporting, so we can create a simple Sueetie Background Task to truncate this table periodically and keep our database size as small as possible.
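
The two-table idea can be sketched as follows, using an in-memory SQLite database in Python purely for illustration. Only the Sueetie_RequestLog table name, the Guid key, and the User Agent and IP columns come from the description above. The report table's name and columns, the function names, and the use of a plain DELETE in place of a truncate are all assumptions.

```python
import sqlite3
import uuid


def create_tables(conn):
    # Hypothetical report table; the real ReportLog schema is not shown in the post.
    conn.execute("CREATE TABLE ReportLog (RequestGuid TEXT PRIMARY KEY, Url TEXT)")
    # Per-request details kept apart so they can be purged independently.
    conn.execute(
        "CREATE TABLE Sueetie_RequestLog ("
        "RequestGuid TEXT PRIMARY KEY, UserAgent TEXT, RemoteIP TEXT)"
    )


def log_request(conn, url, user_agent, remote_ip):
    """Write one row to each table, linked by a shared Guid."""
    key = str(uuid.uuid4())
    conn.execute("INSERT INTO ReportLog VALUES (?, ?)", (key, url))
    conn.execute(
        "INSERT INTO Sueetie_RequestLog VALUES (?, ?, ?)",
        (key, user_agent, remote_ip),
    )
    return key


def purge_request_log(conn):
    """Stand-in for the background task: empty the request log only."""
    conn.execute("DELETE FROM Sueetie_RequestLog")
```

Purging the request log leaves the report rows untouched, which is why the extra columns do not have to cost database size over the long run.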

New and Improved Results

With url and user agent filtering we ensure that our database stays small and our analytics reports are clean and accurate. By using the Sueetie Configuration "CrawlerAgents" property and NoLog.config file, we can manage Sueetie Analytics Log Filtering with ease.


Posted on 9/21/2010 3:49:03 PM by Dave Burke
Categories: Sueetie

Copyright © 2013 Dave Burke.  All Rights reserved.