I’ve been playing with Blogpulse, Bloglines, Technorati, and Google Blogsearch. I’ve come away feeling underwhelmed. The state of Blog Search is pretty poor, in my opinion. I’d like to change that a bit, but I need some help.
I wrote this hack last week and discussed it in this post (the indentation was lost in the blockquote).
# @author: Pete Abilla
# @date: 7, June, 2006
# @function: crawls hard-coded feed url and
# computes basic linguistic statistics on given word
# in posts.
require 'rubygems'
require 'feed_tools'

feed = FeedTools::Feed.open('http://scobleizer.wordpress.com/feed/')
keyword = 'microsoft'
total_occurrences = 0
total_posts = 0
keyword_pattern = Regexp.new(keyword, Regexp::IGNORECASE)

feed.entries.each do |entry|
  puts "Entry: #{entry.title}"
  # String#scan always returns an array (empty when nothing matches),
  # so no nil check is needed; to_s guards against an entry with no content.
  matches = entry.content.to_s.scan(keyword_pattern)
  puts "Occurrences of '#{keyword}': #{matches.size}"
  total_occurrences += matches.size
  total_posts += 1 if matches.size > 0
end

puts "Total number of posts in feed: #{feed.entries.size}"
puts "Total occurrences of '#{keyword}': #{total_occurrences}"
puts "Total number of posts in which '#{keyword}' appeared: #{total_posts}"
I know that I need to store a cache of the feed in my local MySQL DB and run the analysis off that datastore. I have to figure out how to do this in Ruby; I’m just learning it right now. What I’m really interested in is doing some heavy-duty computational linguistic analysis on feeds. In the past when I did this type of analysis, I hacked code in Perl or Python and used a corpus like the Brown Corpus as my data set, typically unstructured data. But feeds are structured, so the game is a little different.
Here’s where I need help:
What are some search use cases that are needed but not provided by the blog search engines? Ideas?
Oh, and if you can throw me a bone on how to integrate the code above with my local MySQL DB, that would be great too. I understand that the code above is just a hack for now, but I want to build on it and extend it. Pat, help?
{ 2 comments }
Peter,
If you want to get all existing entries off a blog you can’t simply read their feeds, because (as you already observed with Scoble’s blog) any feed will usually only contain the last n entries. And 50 entries, as seen on the Scobleizer feed, is an unusually high number — more common is 10-20 entries per feed.
So to get more blog entries than are in a feed you have two options:
1. extract articles off the site’s HTML pages
2. subscribe to the site’s feed for a longer time and cache each new article in a database
#1 is usually done faster, but it’s a messy process and requires a little experience (some basic knowledge of the HTTP protocol, because you want to be polite in your page requests; some knowledge of text parsers/regular expressions/etc to extract the data).
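A rough sketch of what option #1 could look like, assuming a hypothetical page layout in which each post body sits in a `<div class="entry">` element with no nested divs. Real blogs differ, and a proper HTML parser is usually more robust than a regular expression. The names `extract_entries` and `polite_fetch`, the markup, and the one-second delay are all illustrative assumptions, not part of any library.

```ruby
require 'net/http'
require 'uri'

# Hypothetical markup: each post body is wrapped in <div class="entry">...</div>.
# The non-greedy match stops at the first </div>, so nested divs would break it.
ENTRY_PATTERN = %r{<div class="entry">(.*?)</div>}m

# Pull the plain text of each entry out of one HTML page.
def extract_entries(html)
  html.scan(ENTRY_PATTERN).flatten.map do |body|
    # Drop remaining tags, then collapse runs of whitespace.
    body.gsub(/<[^>]+>/, ' ').gsub(/\s+/, ' ').strip
  end
end

# Fetch pages one at a time, pausing after each request to be polite.
def polite_fetch(urls, delay_seconds = 1)
  urls.map do |url|
    html = Net::HTTP.get(URI.parse(url))
    sleep(delay_seconds)
    html
  end
end
```

The delay is the simplest form of politeness; checking the site's robots.txt before crawling would be a sensible next step.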
#2 is more straightforward, but takes time; you frequently request the same feed (e.g., once a day) and then store new entries in a database. You can use an article’s permalink URL as primary key to make sure you don’t store an article more than once.
FeedTools comes with a feed cache mechanism that stores feeds in a database, but that’s not really what you’re looking for — because this mechanism will discard old entries as they disappear from a feed. You will need to write some database code yourself (I suggest looking into ActiveRecord, which makes database access with Ruby rather painless).
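The caching approach in option #2 might be sketched like this. A plain Hash stands in for the database table so the example runs on its own. `EntryStore`, `add_new`, `count_keyword`, and the entry fields `:link`, `:title`, and `:content` are hypothetical names; with a real database the permalink would be the table's primary key or a unique column.

```ruby
# In-memory stand-in for a database table keyed by permalink URL.
class EntryStore
  def initialize
    @entries = {}
  end

  # Store entries whose permalink has not been seen before.
  # Returns the number of entries that were actually added.
  def add_new(entries)
    added = 0
    entries.each do |entry|
      next if @entries.key?(entry[:link])
      @entries[entry[:link]] = entry
      added += 1
    end
    added
  end

  def size
    @entries.size
  end

  # Count case-insensitive occurrences of keyword across all stored entries.
  def count_keyword(keyword)
    pattern = Regexp.new(Regexp.escape(keyword), Regexp::IGNORECASE)
    @entries.values.sum { |entry| entry[:content].to_s.scan(pattern).size }
  end
end
```

Calling `add_new` once a day with whatever the feed currently contains would let the store grow past the feed's last n entries, and the keyword analysis could then run over everything collected so far rather than over a single fetch.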
Let me know how you like Ruby (using Rails?) as compared to PHP. I’ve only read books about it, but I’ve heard lots of good things and am always looking to speed up development.