Yesterday, Dave Sifry announced Technorati's related tag feature. Tags are simple, one-word descriptors that people apply to their own and others' web articles for later recall. Services like Technorati, del.icio.us, de.lirio.us, and scuttle aggregate these tags into folksonomies that exhibit useful emergent structure. The simplest example of one such emergent structure arises from assuming that people mean the same thing when they use the same tag and then using the aggregated tags as a sort of navigation hierarchy. All of the services mentioned do this. With Technorati's announcement, they all now also offer a related tags feature. The simplest version of related tags is to say that tags are related when enough people use them to label the same set of artifacts. For instance, if one set of users is labeling a set of sites with “baseball” and another group is labeling the same sites with “MLB”, it is reasonable to assume the two tags are related. These relations can be added to enrich the navigation structure.
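To make the "simplest version" of related tags concrete, here is a minimal sketch in Python. It is not how any of the services named above actually compute related tags; the function name, the `min_shared` threshold, the example URLs, and the choice to compare tags case-insensitively are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def related_tags(taggings, min_shared=2):
    """Return tag pairs that label at least `min_shared` of the same URLs.

    `taggings` is an iterable of (tagger, url, tag) triples. The threshold
    is a hypothetical stand-in for "enough people use them".
    """
    # Collect the set of URLs each tag has been applied to, by anyone.
    urls_by_tag = defaultdict(set)
    for _tagger, url, tag in taggings:
        urls_by_tag[tag.lower()].add(url)  # assumption: ignore case

    related = {}
    for a, b in combinations(sorted(urls_by_tag), 2):
        shared = urls_by_tag[a] & urls_by_tag[b]
        if len(shared) >= min_shared:
            related[(a, b)] = len(shared)
    return related


# Made-up data: two people label the same sites with different tags.
taggings = [
    ("alice", "http://example.com/yankees", "baseball"),
    ("alice", "http://example.com/redsox", "baseball"),
    ("bob", "http://example.com/yankees", "MLB"),
    ("bob", "http://example.com/redsox", "MLB"),
    ("carol", "http://example.com/recipes", "cooking"),
]
print(related_tags(taggings))
```

With this data, only "baseball" and "mlb" share enough sites to count as related. A real aggregator would likely want a threshold that scales with how popular each tag is, but that goes beyond the simple case described here.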
There are many more examples of such emergence through aggregation, currently less explored, that have potentially greater commercial value. xFolk is an xhtml microformat that makes it easier for people to share their vision of the world by tagging URLs in their web articles. People will derive value from xFolk in two ways: (1) from information that emerges by aggregating individual tags across many people or even just one person; (2) from the ease of being able to tag things for their own personal recall. Note that these two value propositions require xFolk to address two very different concerns. The first is that of aggregators who must collect together all of these tagged artifacts. The second is that of users who must wrestle with the microformat to tag URLs in their web articles.
This article focuses on enabling the first part of the value proposition to users; specifically, what is the underlying data schema that allows for the highest value aggregations? This schema provides an indication of the desired data elements to be included in any xhtml microformat that end-users will employ. The next article in this series, “xhtml microformat for Emergence”, will consider how this schema can be translated into a format that end-users and, more likely, end-user tools can employ in a wide variety of use cases.
Underlying Data Formats
The data formats used by most tag aggregation services appear very simple. Consider the following pages for the tag folksonomy:
- http://del.icio.us/tag/folksonomy
- http://de.lirio.us/rubric/entries/tags/folksonomy
- http://www.niallkennedy.com/scuttle/tags.php/folksonomy (Niall Kennedy has kindly agreed to make this instance of scuttle available).
- http://www.technorati.com/tags/folksonomy
Using a perspective inspired by entity-relationship (E-R) modeling, it would appear that we could start with just three base entities:
- tagger — the person doing the tagging, identified by a unique ID.
- tagged — the item tagged, identified by a unique URL.
- tag — a tag uniquely identified by its label.
Examining any of the first three examples (del.icio.us, de.lirio.us, or scuttle), it is apparent that combinations of these entities can be used to uniquely identify the data for each folksonomy entry as follows:
- tagger & tagged — uniquely identify an entry tagged by a specific individual and thus the “title” and “comment” the individual wishes to apply to that specific entry.
- tagger & tagged & tag — uniquely identify one instance of one individual applying a specific tag to a specific resource.
Because folksonomy aggregations are frequently updated, folksonomy systems typically use a relational database to manage their data efficiently. Were one to take this analysis and reduce it to a set of relational database tables required to implement the back-end of any folksonomy service, the following schema (table structure) expressed in functional notation would suffice:
Tagger(taggerID*)

TaggerTags(taggerID*, tagName*, tagURL)
    where taggerID* links back to Tagger

TaggedComment(taggerID*, taggedURL*, title, comment)
    where taggerID* links back to Tagger

TaggedTags(taggerID*, taggedURL*, tagName*)
    where taggerID* links back to Tagger
    and taggerID* & tagName* in combination link to TaggerTags
To read this, the names outside the parentheses indicate the table names. Tables contain collections of similar entities such as people tagging (Tagger) or tags applied to web resources (TaggedTags). The items inside the parentheses are the attributes or characteristics of each entity. Within tables, the combination of starred attributes uniquely identifies each entity, meaning no two entities within a table may have the same combination of starred attribute values. Specifically for this example, each person tagging an item may only give it one title and one comment, and each person may apply a tag only once to an item.
The links under each table indicate how the tables may be related to facilitate processing and reporting. For instance, the link between TaggedTags and TaggerTags allows taggers to change a specific tag only once and have that change cascade down to all instances of the tag's use, a feature available in del.icio.us, de.lirio.us, and scuttle.
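As a rough illustration, the schema can be written out as SQLite tables using Python's built-in sqlite3 module. This is a sketch under stated assumptions, not the actual back-end of any of the services: the column types, the example data, and the use of ON UPDATE CASCADE to model the rename behavior are my own choices. The starred attributes become primary keys, and the links become foreign keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Tagger (
    taggerID TEXT PRIMARY KEY
);
CREATE TABLE TaggerTags (
    taggerID TEXT NOT NULL REFERENCES Tagger(taggerID),
    tagName  TEXT NOT NULL,
    tagURL   TEXT,
    PRIMARY KEY (taggerID, tagName)
);
CREATE TABLE TaggedComment (
    taggerID  TEXT NOT NULL REFERENCES Tagger(taggerID),
    taggedURL TEXT NOT NULL,
    title     TEXT,
    comment   TEXT,
    PRIMARY KEY (taggerID, taggedURL)
);
CREATE TABLE TaggedTags (
    taggerID  TEXT NOT NULL REFERENCES Tagger(taggerID),
    taggedURL TEXT NOT NULL,
    tagName   TEXT NOT NULL,
    PRIMARY KEY (taggerID, taggedURL, tagName),
    -- Assumption: the rename feature is modeled as a cascading update.
    FOREIGN KEY (taggerID, tagName)
        REFERENCES TaggerTags(taggerID, tagName) ON UPDATE CASCADE
);
""")

# Made-up data: one tagger applies one tag to two URLs.
conn.execute("INSERT INTO Tagger VALUES ('alice')")
conn.execute("INSERT INTO TaggerTags VALUES ('alice', 'folksonomy', NULL)")
conn.executemany(
    "INSERT INTO TaggedTags VALUES ('alice', ?, 'folksonomy')",
    [("http://example.com/a",), ("http://example.com/b",)],
)

# Rename the tag once in TaggerTags; the change cascades to TaggedTags.
conn.execute(
    "UPDATE TaggerTags SET tagName = 'folksonomies' "
    "WHERE taggerID = 'alice' AND tagName = 'folksonomy'"
)
renamed = conn.execute(
    "SELECT taggedURL, tagName FROM TaggedTags ORDER BY taggedURL"
).fetchall()
print(renamed)
```

After the single update, both rows in TaggedTags carry the new tag name. The composite primary key on TaggedTags also rejects a second attempt by the same tagger to apply the same tag to the same URL.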
Relevant use cases
The schema I just described is designed for the implicit use case built into del.icio.us, de.lirio.us, and scuttle. In all three of these services, the user creates an account and then enters individual bookmarks with a title, a comment, and as many tags (possibly none) as desired. In this use case, the user essentially free associates tags based on all of the things that come to mind as he or she enters the bookmark. Since del.icio.us was the first of the services, we'll refer to this as the del.icio.us use case.
Technorati differs slightly from the del.icio.us use case in that users apply tags to articles they have written in the body of the article. As users write articles, they tell Technorati, and Technorati processes the web page for the article URL, title, body text, and tags. Considering the last sentence carefully, it is apparent that users of Technorati tags are essentially supplying the same information as users of the other three services. Thus, despite differences in mode of interaction, the Technorati use case is functionally equivalent to the del.icio.us use case.
The del.icio.us use case and resulting schema are fundamentally limited. The use case forces people to explicitly enter their bookmarks in an archive. As a result, it fails to capture a lot of tagging behavior that could and does occur as people write web articles with links. Further, it fails to capture the very common case where people list relevant tagged links in link blogs or as part of articles they are writing. Finally, it does not allow the user to indicate a definition for their tags other than the implicit one that is gained by reading all of the bookmarks with that tag.
The first two cases provide contextual information about link usage and the intended meaning of the links in that context. As suggested by Jeremy Zawodny, such information would make it possible to determine things like an individual's most frequently used links and possibly the most relevant links by context. The last case provides an opportunity for people publishing their tags to give some hint, if they want to, of what the tags mean to them.
An alternative schema with more emergent power
The following schema allows for storage of contextual data with links as well as explicit tag definitions.
Tagger(taggerID*)

TaggerTags(taggerID*, tagName*, tagURL, tagDef)
    where taggerID* links back to Tagger

TaggedComment(taggerID*, taggedURL*, whereTagged*, title, comment)
    where taggerID* links back to Tagger

TaggedTags(taggerID*, taggedURL*, whereTagged*, tagName*)
    where taggerID* links back to Tagger
    and taggerID* & tagName* in combination link to TaggerTags
Three attributes are new: tagDef in TaggerTags, and whereTagged in both TaggedComment and TaggedTags. The tagDef attribute lets users specify explicit tag definitions if they so desire. The whereTagged attributes are part of each table's identifier, which allows tags, titles, and comments to be accumulated across contexts. If the context is always the same, then the schema reverts to the del.icio.us case.
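Here is a small sketch of what the added context buys an aggregator, again using SQLite through Python's sqlite3 module. Several things are assumptions on my part: the column types, the example data, and in particular the idea that whereTagged holds the URL of the article in which the link was tagged. TaggedComment is left out to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce the links between tables
conn.executescript("""
CREATE TABLE Tagger (taggerID TEXT PRIMARY KEY);
CREATE TABLE TaggerTags (
    taggerID TEXT NOT NULL REFERENCES Tagger(taggerID),
    tagName  TEXT NOT NULL,
    tagURL   TEXT,
    tagDef   TEXT,  -- optional explicit definition of the tag
    PRIMARY KEY (taggerID, tagName)
);
CREATE TABLE TaggedTags (
    taggerID    TEXT NOT NULL REFERENCES Tagger(taggerID),
    taggedURL   TEXT NOT NULL,
    whereTagged TEXT NOT NULL,  -- assumption: URL of the containing article
    tagName     TEXT NOT NULL,
    PRIMARY KEY (taggerID, taggedURL, whereTagged, tagName),
    FOREIGN KEY (taggerID, tagName)
        REFERENCES TaggerTags(taggerID, tagName) ON UPDATE CASCADE
);
""")

conn.execute("INSERT INTO Tagger VALUES ('alice')")
conn.execute(
    "INSERT INTO TaggerTags VALUES "
    "('alice', 'folksonomy', NULL, 'tags aggregated across many people')"
)
# Made-up data: the same link is tagged in two different articles.
rows = [
    ("alice", "http://example.com/a", "http://alice.example/post-1", "folksonomy"),
    ("alice", "http://example.com/a", "http://alice.example/post-2", "folksonomy"),
    ("alice", "http://example.com/b", "http://alice.example/post-1", "folksonomy"),
]
conn.executemany("INSERT INTO TaggedTags VALUES (?, ?, ?, ?)", rows)

# One tagger's most frequently used links, counted by distinct context.
link_counts = conn.execute("""
    SELECT taggedURL, COUNT(DISTINCT whereTagged) AS contexts
    FROM TaggedTags
    WHERE taggerID = 'alice'
    GROUP BY taggedURL
    ORDER BY contexts DESC, taggedURL
""").fetchall()
print(link_counts)
```

Under the original schema the second row would be rejected as a duplicate, so the fact that the link was used in two different articles would be lost. With whereTagged in the key, both uses are kept and can be counted.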
Next steps
In reformulating xFolk, this post represents an important first step in defining the information needs of folksonomy aggregators. Discovering emergent structure through aggregation is one of folksonomy's key value propositions. The other is providing a service such as bookmarking that is solely of benefit to the end-user and represents an immediate payoff. The next post will refocus on the microformat proper as part of the user experience.
I welcome all feedback, via trackback, comment, or tagback.
