Before you read the rest of this article: take a look at my attempt to wrangle the nptech feed. It’s only the first step in a long list of steps that I think we need to take as a community towards resolving tagging issues.
I’ve been reading with interest how Michele Martin is trying to create another nptech resource, this time using Rollyo. I think that, like me, she made the mistake of building the search engine before setting up the parameters for successful operation. More on that later. After thinking about this issue for some time, I’ve come to believe that the nptech tag and del.icio.us are our community’s collective blind spot. I think we’ve been seduced by the notion that del.icio.us has an open API without considering the possibility that the supposedly open API is actually quite limited and requires some programming skills in order to extract useful data.
As a programmer myself, I’m obsessed with finding ways for nonprogrammers (who make up a large portion of the nptech community) to participate in tag parsing and tag analysis. We should provide tools that let relative nptech novices enter our tagstream as quickly as possible while still allowing relatively sophisticated data analysis should a novice eventually decide to plunge further into the data. Unfortunately, there are too many reasons why del.icio.us is not a proper resource for a progressive and open community such as ours.
- The API doesn’t allow for automated extraction over the entire timeline of the tag. That is, we have to screen scrape in order to traverse all the possible tags in the timeline. If you take a look at the Perl code that is shown at unthinkingly.com, you’ll see some of the acrobatics that have to be performed to access the tags themselves. This prevents the entire nptech tag database from being easily accessed via outside services such as the much ballyhooed Yahoo! Pipes. Right now, there’s a del.icio.us bug that prevents something like a thousand tagged items from showing up. This is not good.
- The API doesn’t allow for automated URL extraction from the del.icio.us Web site. This is much the same problem as issue #1: right now you can’t simply enter something like http://del.icio.us/tags/nptech/sites to retrieve all the sites that have ever been given the nptech tag. This is a problem, but it could be easily surmounted with some screenscraping.
I suggest that those issues above will actually lead us to discerning what is important when we maintain a tag database for our community. I propose the following parameters for a “definitive” and “canonical” source of nonprofit technology tags.
- We should have a common, easily accessible database of tagged sites and resources. As long as we use del.icio.us as one of our main tag databases, we will not be able to achieve this end. Basically, I want to traverse the entire database without using awkward programming workarounds that keep our peers who are not programmers from accessing or analyzing that data.
- Any API to the data source should allow us to extract the following pieces of information for each tagged item:
  - URL of the resource
  - tagger ID or username
  - date tagged
  - multiple tag list, but with nptech as the sole required tag
- The new source should be just as easy to use and as well supported as our current workflow. This means support for newly tagged resources going into the database and RSS feeds coming out of the database in the same manner as we are currently working. It would also allow for enhancements to our workflow, such as filtering and the opportunity to edit a tag after it has been presented to the community.
- Tagged site lists should be easily converted into XML or OPML.
- We should have a way to re-edit nptech tags and, probably, a Digg-like interface to help us evaluate the more popular of these tags.
All of that should be done WITHOUT screenscraping methods and just through an API method call alone.
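To make the parameters above a little more concrete, here is a minimal Python sketch of what one tagged record and an OPML export might look like. Everything in it is an assumption for illustration: the `TaggedResource` class, the `to_opml` function, and the `tagger` and `dateTagged` attributes are hypothetical names, not part of del.icio.us or of the OPML specification.

```python
from dataclasses import dataclass, field
from datetime import date
from xml.etree import ElementTree as ET


@dataclass
class TaggedResource:
    """One tagging event, carrying the four fields listed above."""
    url: str
    tagger: str
    date_tagged: date
    tags: list[str] = field(default_factory=lambda: ["nptech"])

    def __post_init__(self):
        # nptech is the sole required tag; any other tags are optional.
        if "nptech" not in self.tags:
            raise ValueError("every record must carry the nptech tag")


def to_opml(resources, title="nptech tagged resources"):
    """Render records as an OPML outline, one <outline> per resource."""
    opml = ET.Element("opml", version="2.0")
    head = ET.SubElement(opml, "head")
    ET.SubElement(head, "title").text = title
    body = ET.SubElement(opml, "body")
    for r in resources:
        ET.SubElement(
            body,
            "outline",
            type="link",
            text=r.url,
            url=r.url,
            category=",".join(r.tags),
            # Custom attributes (assumed for this sketch, not standard OPML).
            tagger=r.tagger,
            dateTagged=r.date_tagged.isoformat(),
        )
    return ET.tostring(opml, encoding="unicode")
```

A nonprogrammer would never see this code, of course. The point is only that if the data source handed back records in this shape, the XML or OPML conversion would be a few lines rather than a scraping project.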
All of the above means that the del.icio.us tag stream cannot and should not be the ultimate resource for finding out about nptech-tagged resources. The del.icio.us API is too brittle to be used in a transactional mode, and the workarounds are too high a barrier to entry. What we should consider is the following:
1. Pligg with an automated RSS feed – this gives us the ability to still use our current workflow (dumping things into Technorati or del.icio.us) but then rate all the existing items via Pligg
2. Conversion of that RSS feed via a text parser (and no, it can’t be Yahoo! Pipes, that technology is too high-level) into what Google Custom Search Engine calls an augmented feed
3. Automated export of that feed into a Google CSE, which then gives us an easily accessible database from which the site list can be exported via XML.
Unfortunately, even in the scenario above we’d have to do some work in order to make the Pligg give up tagged site info in point #2. However, it would only have to be done once as that could be written as an open API call as well.
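As a rough idea of what the text parser in step 2 might do, here is a minimal Python sketch that reads an RSS 2.0 feed and emits one annotation per unique site. It rests on several assumptions: the function name `rss_to_annotations` is made up, the `Annotations`/`Annotation`/`Label` element names follow my recollection of the Google CSE annotations file and should be checked against Google's documentation, the `nptech` label is a placeholder, and the feed is assumed to be plain RSS 2.0 with a `<link>` in each `<item>`.

```python
from urllib.parse import urlparse
from xml.etree import ElementTree as ET


def rss_to_annotations(rss_text, label="nptech"):
    """Turn RSS items into a de-duplicated, CSE-style site list."""
    root = ET.fromstring(rss_text)
    seen = set()
    annotations = ET.Element("Annotations")
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        host = urlparse(link).netloc
        # Skip items without a usable link and sites already listed.
        if not host or host in seen:
            continue
        seen.add(host)
        # One entry per site, matching every page under that host.
        ann = ET.SubElement(annotations, "Annotation", about=f"{host}/*")
        ET.SubElement(ann, "Label", name=label)
    return ET.tostring(annotations, encoding="unicode")
```

This only covers the conversion itself. Getting Pligg to give up the tagged site info in the first place is the one-time work mentioned above.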