RssHarvest

RssHarvest is a feature new to Aggie RC5. It allows Aggie to scrape web pages and create virtual RSS feeds for them.

The Problem

People all over the web generate information daily. Very few of them create RSS feeds that describe their work. Most of them know nothing of RSS (and RSS news aggregators), or couldn't care less.

Now assume there is an author on the web whose work you track regularly, but who does not announce new items through RSS. You could write to that author and ask for an RSS feed, but chances are slim that it would get you anywhere. Are you doomed to manually poll that author's site every once in a while?

Of course not. There are a few tools on the web that can do that for you. We're not talking of primitive tools that let you know when a web page changes. No, we're talking of sophisticated tools that note when a web page changes, and then create an RSS feed of these changes for you to subscribe to. Two examples are rssify and RssDistiller.

RssHarvest is Aggie's integrated version of these tools.

RssHarvest was inspired by RssDistiller and borrows several good ideas from it.

The Solution

Starting with RC5, Aggie can accept RssHarvest URLs as valid RSS URLs to subscribe to. The RssHarvest URLs (technically called "Virtual Site URLs") indicate to Aggie the information it needs to know in order to generate an RSS feed from a "normal" web site.

Basic Concepts

Before setting Aggie to scrape web pages, you need to familiarize yourself with the following terms.

Virtual Site URL
A URL which provides Aggie with all the information it needs to locate the web page to scrape, and how to scrape the page.
Site URL
The URL of the web page that you want Aggie to scrape. The site URL is a necessary component of the virtual site URL.
Transform File
A web resource or file that describes how to scrape a web page.
Pattern
A particular text pattern to look for in the web page when scraping. Aggie uses several patterns (as described below) which are kept in the transform file.
Named Parts
A named part is a name associated with a value. Aggie derives named parts from the virtual site URL (where they appear as name=value pairs), and from the transform file (where they appear as <name>value</name> elements).

Setting Up a Feed

To set up a virtual feed, you need to create a virtual site URL, create a transform file, and subscribe Aggie to the virtual site URL.

Here's how you do it.

Create a Virtual Site URL

A virtual site URL is a URL that minimally follows these rules:

As an example (which we will also use for the rest of this document), let's assume that we want to scrape the recently published page from MSDN. This is an HTML page which can be used to track new material as it is published by MSDN. As it does not currently have a corresponding RSS feed, let's create one.

MSDN's recently published page is at http://msdn.microsoft.com/recent.asp. (You might want to go and check this site now.)

Say you decide to put your transform file on your web server, at http://server/msdn.xml.

The virtual site URL will have the following form:

RssHarvest:?url=http://msdn.microsoft.com/recent.asp&transform=http://server/msdn.xml
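To make the structure of the URL concrete, here is a minimal Python sketch that assembles the example above from its named parts and splits it back again. The function names are hypothetical and are not part of Aggie. The sketch assumes the name=value pairs are joined with "&" and left unescaped, as in the example. How Aggie treats a site URL that itself contains "&" is not covered here, so the sketch does not try to handle that case.

```python
SCHEME = "RssHarvest:?"  # prefix taken from the example URL above


def build_virtual_site_url(parts):
    """Join name=value pairs into a virtual site URL (no escaping, as in the example)."""
    return SCHEME + "&".join(f"{name}={value}" for name, value in parts.items())


def parse_virtual_site_url(virtual_url):
    """Split a virtual site URL back into a dict of named parts."""
    if not virtual_url.startswith(SCHEME):
        raise ValueError("not a virtual site URL")
    parts = {}
    for pair in virtual_url[len(SCHEME):].split("&"):
        # Split on the first "=" only, so values may contain "=" themselves.
        name, _, value = pair.partition("=")
        parts[name] = value
    return parts


virtual_url = build_virtual_site_url({
    "url": "http://msdn.microsoft.com/recent.asp",
    "transform": "http://server/msdn.xml",
})
print(virtual_url)
print(parse_virtual_site_url(virtual_url))
```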

Create a Transform File

Besides url and transform, Aggie uses four patterns when scraping. You provide them in the form of named parts.

startPattern
A pattern whose first appearance in the web page signals the beginning of the area to scrape. (The pattern string itself is not considered part of the area to scrape.)
endPattern
A pattern whose first appearance in the web page signals the ending of the area to scrape. (The pattern string itself is not considered part of the area to scrape.)
itemStartPattern
A pattern whose appearance in the area to scrape defines the starting point for a new item. (The pattern string itself is not considered part of the item.)
itemEndPattern
A pattern whose appearance in the area to scrape defines the ending point for an item. (The pattern string itself is not considered part of the item.)

Note that only itemStartPattern is mandatory. The others are optional.
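The way the four patterns work together can be sketched in a few lines of Python. This is only an illustration of the definitions above, not Aggie's actual code. It assumes that patterns are literal strings rather than regular expressions, that endPattern is searched for after startPattern, that a missing optional pattern leaves the area unrestricted, and that without itemEndPattern an item runs up to the next itemStartPattern. The function name and the sample page are made up for the example.

```python
def scrape_items(page, item_start, item_end=None, start=None, end=None):
    """Return the text of each item found in the area to scrape."""
    if not item_start:
        raise ValueError("itemStartPattern is mandatory")

    # Narrow the page down to the area to scrape; the patterns are excluded.
    area = page
    if start is not None:
        pos = area.find(start)
        if pos != -1:
            area = area[pos + len(start):]
    if end is not None:
        pos = area.find(end)
        if pos != -1:
            area = area[:pos]

    items = []
    cursor = 0
    while True:
        begin = area.find(item_start, cursor)
        if begin == -1:
            break
        begin += len(item_start)  # the item starts after the pattern
        # Assumption: with no itemEndPattern, the next item start closes the item.
        closer = item_end if item_end is not None else item_start
        stop = area.find(closer, begin)
        if stop == -1:
            stop = len(area)
        items.append(area[begin:stop].strip())
        cursor = stop
    return items


page = (
    "<h1>Site</h1><h3>All Headlines</h3><ul>"
    "<li>First</li><li>Second</li>"
    "</ul></table><li>Footer link</li>"
)
items = scrape_items(page, "<li>", "</li>",
                     start="<h3>All Headlines</h3>", end="</table>")
print(items)  # ['First', 'Second']
```

Note how the footer link is ignored because it lies after the end pattern, outside the area to scrape.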

While it is possible to provide these parts (indeed, all parts) in the virtual site URL, it is strongly recommended that they be placed in the transform file. Other named parts that may appear in the transform file are:

title
The channel's title.
description
The channel's description.
link
The channel's link.
base
A base URL to use for relative URLs found on the page.
itemHeader
A prefix which is automatically added to the description of each item.
itemFooter
A suffix which is automatically appended to the description of each item.
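As a rough illustration of what base, itemHeader and itemFooter do to a scraped item, consider the Python sketch below. It is not Aggie's implementation. It assumes that base is applied to double-quoted href attributes only, and the helper names and sample values are invented for the example.

```python
import re
from urllib.parse import urljoin


def resolve_links(html, base):
    """Rewrite relative href values against the base URL."""
    return re.sub(
        r'href="([^"]*)"',
        lambda m: 'href="%s"' % urljoin(base, m.group(1)),
        html,
    )


def decorate_description(body, parts):
    """Wrap an item's description in the optional itemHeader and itemFooter."""
    return parts.get("itemHeader", "") + body + parts.get("itemFooter", "")


parts = {
    "base": "http://msdn.microsoft.com",
    "itemHeader": "<p>",
    "itemFooter": "</p>",
}
body = '<a href="/library/new.asp">New article</a>'
description = decorate_description(resolve_links(body, parts["base"]), parts)
print(description)
# <p><a href="http://msdn.microsoft.com/library/new.asp">New article</a></p>
```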

For the MSDN example, here's how the transform file looks:

<?xml version="1.0"?>
<transform>
  <title>MSDN Headlines</title>
  <description>Hot from MSDN's oven</description>
  <link>http://msdn.microsoft.com/</link>
  <base>http://msdn.microsoft.com</base>
  
  <startPattern>&lt;h3&gt;All MSDN Headlines&lt;/h3&gt;</startPattern>
  <endPattern>&lt;/table&gt;</endPattern>

  <itemStartPattern>&lt;li type="disc" style="margin-top:0;margin-bottom:0;padding-top:0;padding-bottom:0;"&gt;</itemStartPattern>
  <itemEndPattern>&lt;/li&gt;</itemEndPattern>
</transform>
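Because the patterns contain HTML, they have to be escaped in the transform file (&lt; for <, &gt; for >). An XML parser undoes that escaping when it reads the file, so the patterns come back out as the literal HTML to look for. The Python sketch below shows this with a shortened, hypothetical transform file. The function name is an assumption, and the way Aggie itself loads the file may differ.

```python
import xml.etree.ElementTree as ET

# A shortened stand-in for a transform file, embedded here for illustration.
TRANSFORM_XML = """<?xml version="1.0"?>
<transform>
  <title>MSDN Headlines</title>
  <base>http://msdn.microsoft.com</base>
  <itemStartPattern>&lt;li&gt;</itemStartPattern>
  <itemEndPattern>&lt;/li&gt;</itemEndPattern>
</transform>
"""


def load_named_parts(xml_text):
    """Map each child element of <transform> to its text: name -> value."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "") for child in root}


parts = load_named_parts(TRANSFORM_XML)
print(parts["title"])             # MSDN Headlines
print(parts["itemStartPattern"])  # <li>
```

A side benefit of parsing the file this way is that a malformed transform file, for example one with mismatched tags, fails with a parse error instead of silently producing an empty feed.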

At this point it would be constructive to go to MSDN Headlines, check out the HTML source code of the site, and verify that you understand why the patterns have been built in this manner.

Subscribe Aggie to the Virtual Site URL

Start Aggie. Make sure that your transform file is accessible (for example, if it lives on a web server, navigate to it using a browser). In Aggie's URL edit box (the one to the left of the "Add Channel" button), insert the virtual site URL. Then click on the "Add Channel" button. If all goes well, Aggie will refresh the list of subscribed feeds with the new feed you've just created. Congratulations!