I’m very taken with the general move towards more data from primary sources. Councils, government orgs etc. are putting stats, facts, figures and information online for us to use and mash up. Those orgs savvy enough to drive this stuff through RSS make it even easier for us to harvest it and add an extra dimension to our news gathering.

Of course the public sector moves slowly when it comes to IT and it’s no surprise that there are still a majority of orgs that hide their content away on static pages. No RSS feed to help there. So what do we do?

Well, we could resign ourselves to adding them to the list of pages that we bookmark and visit. A bit like those regular calls we make to keep our contacts book fresh; no bad thing. But another solution is to use one of the many RSS services on the web to ‘scrape’ the page for content and convert it into a feed.

Preston City Council (the council nearest to me at work) has a few feeds but none around the basic operation of the council – meetings, decisions etc. This kind of thing would be great to get a feed of. So I thought I would give it a go with their published decisions page using Feed43.

[![No feed for the dull stuff!](https://i0.wp.com/www.andydickinson.net/wp-content/uploads/2009/10/Preston-City-Council-•-Decisions-300x217.jpg?resize=300%2C217 "Preston City Council • Decisions")](https://i2.wp.com/www.andydickinson.net/wp-content/uploads/2009/10/Preston-City-Council-•-Decisions.jpg)
No feed for the dull stuff!
The first thing I did was set the search so that it showed all results. That way any new ones would show up by default. I did this by putting a `*` in the search box. The `*` is a standard wildcard operator meaning ‘any matches’, so it seemed a logical punt to try it.

The next step was to copy the web address to feed my RSS maker. The URL looks complex but it contains all the information needed to drive the search.
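
To see how a URL carries its own instructions, here is a minimal Python sketch that pulls one apart. The search URL itself isn't reproduced in this post, so the example uses a decision-details link from the same site instead; the idea is the same, with everything after the `?` being a set of named parameters.

```python
from urllib.parse import urlsplit, parse_qs

# A decision-details link from the council site (not the search URL itself).
url = "http://preston.moderngov.co.uk/ieDecisionDetails.aspx?ID=348&displaypref=0"

parts = urlsplit(url)
params = parse_qs(parts.query)  # each parameter maps to a list of values

print(parts.netloc)  # preston.moderngov.co.uk
print(parts.path)    # /ieDecisionDetails.aspx
print(params)        # {'ID': ['348'], 'displaypref': ['0']}
```

A search URL works the same way, just with more parameters, which is why pasting it into a feed maker is enough to re-run the search each time.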

[![Feed43 grabs the whole page for you to explore](https://i2.wp.com/www.andydickinson.net/wp-content/uploads/2009/10/Feed43-_-Edit-Feed-300x187.jpg?resize=300%2C187 "Feed43 _ Edit Feed")](https://i0.wp.com/www.andydickinson.net/wp-content/uploads/2009/10/Feed43-_-Edit-Feed.jpg)
Feed43 grabs the whole page for you to explore
The first step with Feed43 is to feed it the URL then click *Reload*. It pulls in the whole page and then you get the hard bit. The idea with a feed scraper is to give it enough information about the way the stuff you want is presented that it can ‘spot’ that stuff and ignore the rest. This means trawling through some HTML.

You get two options.

The global search pattern looks for HTML that ‘wraps’ the content you want to turn into a feed. It could be the whole table that contains the search results. But this doesn’t really help in this case.

Better to go straight to the second option which defines the specific things to look for to define an item to be added to the feed. Here’s what I put.


In Feed43 language, {*} means this could be anything, just ignore it. {%} means this is important, so store it.

I could see from the HTML that each decision in the list looked like this:

`<a href="http://preston.moderngov.co.uk/ieDecisionDetails.aspx?ID=348&displaypref=0" title="Link to decision details for North West England Regional Spatial Strategy Partial Review Consultation">North West England Regional Spatial Strategy Partial Review Consultation</a>`

So I told Feed43 to look for anything between the tags regardless of what ‘class=’ said. Then I told it to grab the href link as the actual weblink, ignore the title attribute and then grab the text inside the tag to use as a title.
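
For anyone curious what that matching amounts to, here is a rough Python sketch that turns a Feed43-style pattern into a regular expression. The `pattern_to_regex` helper and the `item_pattern` string are my own illustrations, not Feed43's code and not the exact pattern I typed into the form; the sketch only approximates how {*} and {%} behave.

```python
import re

def pattern_to_regex(pattern):
    """Translate a Feed43-style pattern into a compiled regular expression.

    Hypothetical helper: {*} becomes 'skip anything', {%} becomes
    'capture anything', and all other text is matched literally.
    """
    pieces = []
    for token in re.split(r"(\{\*\}|\{%\})", pattern):
        if token == "{*}":
            pieces.append(r".*?")      # ignore whatever is here
        elif token == "{%}":
            pieces.append(r"(.*?)")    # store whatever is here
        else:
            pieces.append(re.escape(token))  # literal HTML
    return re.compile("".join(pieces), re.DOTALL)

# The sample decision link from the council page.
html = (
    '<a href="http://preston.moderngov.co.uk/ieDecisionDetails.aspx?ID=348&displaypref=0" '
    'title="Link to decision details for North West England Regional Spatial '
    'Strategy Partial Review Consultation">'
    'North West England Regional Spatial Strategy Partial Review Consultation</a>'
)

# Illustrative pattern (an assumption): keep the href, skip the title
# attribute, keep the link text.
item_pattern = '<a href="{%}"{*}>{%}</a>'

items = pattern_to_regex(item_pattern).findall(html)
for link, title in items:
    print(link)
    print(title)
```

Each match comes back as a pair: the first stored value is the link and the second is the decision title.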

[![Finding the useful bits on the page means working through the HTML](https://i0.wp.com/www.andydickinson.net/wp-content/uploads/2009/10/Feed43-_-Edit-Feed-2-300x204.jpg?resize=300%2C204 "Feed43 _ Edit Feed-2")](https://i1.wp.com/www.andydickinson.net/wp-content/uploads/2009/10/Feed43-_-Edit-Feed-2.jpg)
Finding the useful bits on the page means working through the HTML
Clicking *Extract* will filter the content and show you the results. You can see they are split into {%1} for the link and {%2} for the title of the decision.
[![The filtered results display in a list](https://i0.wp.com/www.andydickinson.net/wp-content/uploads/2009/10/Feed43-_-Edit-Feed-3-300x193.jpg?resize=300%2C193 "Feed43 _ Edit Feed-3")](https://i0.wp.com/www.andydickinson.net/wp-content/uploads/2009/10/Feed43-_-Edit-Feed-3.jpg)
The filtered results display in a list
The last step is to define which of these makes up the key parts of the feed. You can see it’s pretty straightforward to fill the gaps at this point. Your [feed is then ready to go](http://feed43.com/prestoncitycouncildecisions.xml). All you need to do is subscribe in the normal way.
[![The filtered results can be added to the feed template](https://i0.wp.com/www.andydickinson.net/wp-content/uploads/2009/10/Feed43-_-Edit-Feed-4-300x240.jpg?resize=300%2C240 "Feed43 _ Edit Feed-4")](https://i1.wp.com/www.andydickinson.net/wp-content/uploads/2009/10/Feed43-_-Edit-Feed-4.jpg)
The filtered results can be added to the feed template
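
Behind the template, the result is just a small XML file. As a sketch of what filling those gaps produces, the Python below builds a bare-bones RSS 2.0 document from link and title pairs. The `build_rss` function, the feed title, the description and the channel link are placeholders I've made up for illustration; this is not how Feed43 itself generates its output.

```python
import xml.etree.ElementTree as ET

def build_rss(feed_title, feed_link, feed_description, items):
    """Build a minimal RSS 2.0 document from (link, title) pairs."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = feed_link
    ET.SubElement(channel, "description").text = feed_description
    for link, title in items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title   # the {%2} slot
        ET.SubElement(item, "link").text = link     # the {%1} slot
    return ET.tostring(rss, encoding="unicode")

items = [(
    "http://preston.moderngov.co.uk/ieDecisionDetails.aspx?ID=348&displaypref=0",
    "North West England Regional Spatial Strategy Partial Review Consultation",
)]

xml_text = build_rss(
    "Preston City Council decisions",   # placeholder feed title
    "http://preston.moderngov.co.uk/",  # placeholder channel link
    "Scraped list of published decisions",
    items,
)
print(xml_text)
```

Note that the `&` in the link is written out as `&amp;` in the XML, which is what a feed reader expects.
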
**Moving beyond the basics**

The thing that makes scraping pages difficult is picking through the HTML. Feed43 makes this easier by limiting the number of options to filter by. But if you need to push further then you will need to explore other options. One to consider is Yahoo Pipes, which has a page grabber option. But you will also need to invest some time in understanding regular expressions.
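
To give a flavour of what a regular expression looks like, here is a minimal Python sketch that picks the link and title out of the same decision HTML. The expression and the `link` and `title` group names are my own choices for illustration; real pages are messier, and a pattern like this will need adjusting for each site.

```python
import re

# Verbose mode lets the pattern be spread out and commented.
ANCHOR = re.compile(
    r"""
    <a\s+                      # start of the link tag
    href="(?P<link>[^"]*)"     # store the address inside href="..."
    [^>]*>                     # skip any other attributes, e.g. title=
    (?P<title>.*?)             # store the visible link text
    </a>                       # end of the link
    """,
    re.VERBOSE | re.DOTALL | re.IGNORECASE,
)

html = (
    '<a href="http://preston.moderngov.co.uk/ieDecisionDetails.aspx?ID=348&displaypref=0" '
    'title="Link to decision details for North West England Regional Spatial '
    'Strategy Partial Review Consultation">'
    'North West England Regional Spatial Strategy Partial Review Consultation</a>'
)

results = [m.groupdict() for m in ANCHOR.finditer(html)]
for row in results:
    print(row["title"], "->", row["link"])
```

The `[^"]*` and `[^>]*` pieces do the same job as {*} and {%} in Feed43, but they are stricter about where a match is allowed to stop.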

I think this kind of stuff is more and more important for orgs and journalists, especially when it comes to councils and government orgs. We all know how ‘mundane’ many find this stuff (important as it is). So making it into a feed would be more conducive to newsgathering by stealth. It encourages more ‘passive aggressive newsgathering’, as Paul Bradshaw once described it.