Two-level tagging

Have you ever had trouble deciding where to store a file on your hard disk? Or worse, had trouble finding it later?

When you store a file on your hard disk, you have to decide which folder to put it in. That folder can in turn live inside other folders. This results in a hierarchy, known in computer science as a *tree*.

The main problem with trees is that sometimes you want things to live in multiple places.

Tagging provides an alternate system. Tags are a lot like folders, except that things can belong to multiple tags. However, but the tags can’t themselves belong to anything. So you have just one level of organisation with no nesting.

The main problem with single-level tagging is that it’s too simple. We want to be able to use fine-grained categories (e.g. ‘lesser spotted greeb’) that themselves belong to higher-level categories (e.g. ‘greeb’, or even ‘bird’ or ‘animal’). But we said that tags can’t themselves belong to tags.

Described like this, perhaps the solution will seem obvious to you too. We want things to belong to multiple tags, and for those tags to sometimes belong to other tags.

I built this into Emacs Freex, my note-taking system.

For instance, I have tagged this blog post with ‘data structure’ and ‘blog’. In turn ‘data structure’ is tagged with ‘computer science’ and ‘blog’ is tagged with ‘writing’. So I can find this blog post later in various ways, including by intersecting ‘computer science’ and ‘writing’.

This gives you the best of both worlds: things belong to multiple categories, along with a hierarchy of categories.

It has to be easy, and worth it, for you to add tags

Whoever adopted the idea that “there’s a place for everything, and everything in its place” when it came to organizing files and ideas on a computer suffered from a failure of imagination. Or maybe they were just over-wedded to the desktop and filing cabinet metaphors. Fortunately, the idea of ‘tagging’ (or ‘labels’ in Google’s parlance) blew that whole banal tidiness away. In short, tagging lets you assign things to multiple categories, or if you prefer, put things in multiple places. Rashmi describes this well – tagging is popular because there’s a lower cognitive cost when you can put things in multiple categories, rather than having to decide on just one.

We’ve only just started to scratch the surface of how categorization schemes could work. I’m going to propose a few ways in which things might grow from here, focusing on the restricted case where you’re tagging your own files privately, ignoring all the interesting goodness that happens when those tags are available to others, delicious-style.

N.B. I’m going to use the term ‘category’ rather than ‘tag’, since it’s easier to think of things belonging to categories than being labelled with a tag. The key notion is that things can belong to multiple categories simultaneously.

The more tags the better

Jon Udell has a great post on building up a taxonomy of categories by hand, starting with a smallish corpus of documents, and just letting the taxonomy emerge, combined with a little judicious weeding. The dataset he has in mind is pretty small, and so he’s aiming for 15-40 categories. The kinds of datasets I have in mind are much larger.

For instance, I have a few thousand text files with notes on topics ranging from Ubuntu troubleshooting to the symptoms of schizophrenia to my travel arrangements for the summer. I could maybe try and shoe-horn things into a few tens of categories, with each category holding many items, and each item belonging to maybe one or two categories. But I very quickly found this to be unsatisfying. We want to be able to differentiate things more finely than that. For instance, how would I categorize a document containing hotel bookings in Florence last summer for the HBM conference? Just by ‘travel’? Or also ‘Florence’, ‘conference’, ‘hotel’, ‘HBM’, and ‘2007’. Remember the argument about lower cognitive cost though – it’s much less effort just to include all those categories. If I do that, I’ll end up with many hundreds or even thousands of categories, some of which will have tens or hundreds of members and some of which might only have one or two members. I think one might raise two main objections to this approach:

Can you really be bothered to add a bunch of categories each time you write something?
How do you begin to find anything now? Sometimes filtering by a category doesn’t help because it returns way too many members, and sometimes it doesn’t help because it returns hardly any. Where’s Goldilocks when you need her?

I’ll address these in turn.

Can you be bothered to add a bunch of categories each time?

People are lazy. Any system that requires people to be assiduous book-keepers while they’re writing is doomed. Dave Winer talks about how he should be categorizing all his posts, and yet he doesn’t do it – and this makes him feel guilty. He knows that he won’t be able to trust the categories to find that thing later. The value of the whole system has dropped. Squirrels wouldn’t go to the effort of hoarding nuts for the winter if they knew that they wouldn’t remember where those nuts are when they need them. So what’s the point of hoarding nuts any more? All of a sudden, the system has broken down. We need to find a way to make the system less brittle.

Let’s look at Dave Winer’s guilty confession a little more closely:

“I have a very easy category routing system built-in to my blogging software. To route an item to a category, I just right-click and choose a category from a hierarchy of menus. I can’t imagine that it could be easier. Yet I don’t do it.”

If you ask me, that’s not easy enough. Navigating hierarchical menus with a mouse is slow and distracting. Blogger does it right – there’s a ‘labels’ text box that you can tab to, into which you can write a comma-delimited list of tags. As you type, it auto-suggests – pressing ‘return’ fills in the rest of the tag and puts a comma and space after for you. So that’s step 1.

But it should be even easier. What should happen is that the machine should automatically throw up a list of tags that it thinks might be appropriate for this post. It should put the ones it’s most confident about to the left, and less confident ones to the right, with the cursor positioned at the end to make it easy for the user to delete false positives and add new categories it missed. And if you’re feeling lazy, then you can just accept the machine’s suggestions without glancing at them. The cost of a false positive is low, so it’ll deliberately suggest too many. This brings us neatly to our second concern.

But then how do you find anything?

So now every document belongs to a bajillion categories, none of which is particularly useful on its own. But a conjunction of categories should narrow things down nicely. If I’m trying to find that hotel booking in Florence, I don’t have to worry about remembering whether it’s tagged with ‘travel’, ‘hotel’, ‘Florence’, ‘2007’ or ‘HBM conference’, since it’s tagged with all of them. So I’ll try filtering by the conjunction of ‘hotel’+’Florence’+’2007’ and that’ll probably winnow things down sufficiently for me to pick the file out manually (see also: make tags not trees). .

But maybe we never made a ‘Florence’ category. It seems like such a natural cue to use now, but at the time, ‘Florence’ didn’t spring to mind as a salient category, despite our liberal categorizing policy. If the system auto-completes in a handy way, we’d already know this, and our fingers would already be backspacing and trying ‘Italy’ or ‘HBM conference’. There are many points of failure, but there are also many points of entry. If we make it easy enough to cue for conjunctions of categories, then there’s a very low cognitive cost to having to backtrack once or twice, since our brain effortlessly supplies us with so many possible cues to use.

We could make things even less brittle in lots and lots of ways. Perhaps the system notices that only one item in the whole database is tagged with ‘Florence’, so it’s probably too restrictive a category. No matter. It could just ignore ‘Florence’, or suggest that we omit ‘Florence’ from our search. Better still, and less intrusively, it could now grep through all the files that match one or more of the tags to see if ‘Florence’ appears in the text, and automatically suggest any matches as partial matches.

Conclusions

I keep coming back to the same feeling – for the most part, people don’t write notes because they don’t think they’ll be able to find those notes later when they need them – so why bother writing the notes in the first place?

All of these suggestions are geared towards:

Reducing the cognitive cost at both writing and retrieval. If it’s less effort, you’ll feel less lazy about adding category metadata.
Making the system less brittle, so that if you were lazy about your category metadata, you still have a good chance of finding things later. This is the key to ensuring that you don’t end up losing faith and give up on writing things down in a structured way altogether.

Taken together, I hope that it will become easier to categorize your notes in a way that helps you find them later, which is going to make you much more likely to write them down in the first place.

Make tags not trees – filesystem idea based on tags instead of hierarchical directories

Until recently, it was easier to find something amidst the five zillion pages on the web than it was to find something on your own hard disk. It would be faster to Google for something than to burrow through subdirectories looking for it.

Could this be because the files on my hard disk are poorly organized? Bah. Maybe so. But that’s not my fault – it’s more or less inevitable once you have a lot of files, because hierarchical filesystems require each file to live in a single location. If I download a paper on memory for a class, should I organize by:

the context, e.g. the name of the class, lumping together all my writings and reading materials from that context together – ~/psy330/reading/
or by the type – things I’ve written vs reading materials – ~/reading/psy330/
or by good/bad or date produced or something else entirely?

Whichever decision you make, there’ll be times when you’ll wish things were organized some other way. This is why tagging is so popular. It’s because things inherently belong to multiple categories. And, because tagging is easy.
Google Desktop, Spotlight, Beagle and other offerings have helped considerably with all this. If you want to locate a single file, and you can’t remember where you put it, then full-text search is the way to go. But let’s consider the case where you have files that you want to treat as related, even if their contents aren’t obviously similar. We want this all the time. Take the reading list for a particular course or project as an example. This is why we needed directories and filing cabinets in the first place.
My proposal here is to replace the hierarchical filesystem with a completely flat space and lots of tags. Each file would be tagged with one or more tags, just like on http://del.icio.us/. The ‘save as’ dialog would look a little different. Instead of a list of directories that you can burrow into, there’d be a list of tags. When saving a file, you’d select as few or as many as you like, give the file a name just as now, and you’re done. To open a document, you filter using some tags, watching the list of files that match being winnowed down, and select from an alphabetized list. Or, use wildcards to winnow down by filename directly. Or some combination.
Converting an existing hierarchical filesystem would be easy in most cases. You could just grab all the subdirectory names in a path and treat them like unordered words in a bag. Let’s keep the same ‘/’ file separator we’re used to, but change its implicit meaning from ‘contains-this-directory’ to ‘and-also-this-tag’, so:

~/reading/psy330/hippocampus/blah.pdf

would now be equally accessible from:

~/reading/psy330/hippocampus/blah.pdf
~/psy330/reading/hippocampus/blah.pdf
~/reading/hippocampus/psy330/blah.pdf

All these locations would end up meaning the same thing. In this way, a subdirectory is really a conjunction of tags. In our simple example of storing .doc and .pdf files for documents and reading materials for a class, we’d simply tag some of them ‘doc’ and some of them ‘reading’, and give them both the ‘psy330’ tag for the class.
Upon looking at this, it’s clear you’ve lost some information, but I don’t think it’s information we’d miss much. The assumption underlying a lot of this is that where we now have hierarchy, we could manage just as well with intersecting sets, which would require considerably less effort to memorize.
There are, inevitably, unanswered questions and lurking gotchas.

I think we’d probably want to create a default/preferred way of expressing things, so that tags with more items or that are more discriminative go on the left, or something akin.
You shouldn’t need to specify all the tags for a given file. Just enough to specify it uniquely, given its filename. So, if there are no other blah.pdf files in the ‘reading’ tag, then you should probably be able to access it straightforwardly at ~/reading/blah.pdf though this has the unfortunate implication that if you were to add a new blah.pdf that also had a ‘reading’ tag, the above location would become ambiguous.
If there are multiple blah.pdf files in the reading tag, then the system would need to prompt you with a list of tags that would help disambiguate them. Wikipedia’s interface might have some lessons about disambigation that could be learned from.
At this stage, a tags-not-trees system seems better-suited for home directories (‘My Documents’ for Windows users) than system directories. In home directories, most of the organization is human-generated and needs to be human-readable, whereas /etc directories are mostly machine-generated to be uncomplicatedly machine-readable.
The only way metadata-entry systems work is if they require little work on the user’s part. The nice thing about tagging is that it should be relatively easy for the computer to make guesses about which tags you’ll want to put something in, based on your tagging of previous files. So when you click ‘save as’, it will prompt you with a list of tags that it thinks you’ll want to use, ordered in terms of certainty. You delete a couple, add a couple more, and leave the rest in place.
This is not a trivial problem, but you’ll have a large corpus from which to do your Bayesian learning (or whatever). And you can seed the corpus from day one with information from the existing file hierarchy, and with some clustering applied to the full text of the files.
This is the kind of problem that machine learning can really help with. There’s a decent amount of data, it’s getting feedback on each guess from the user and it’s doesn’t matter if it’s occasionally off-base because it’s only making suggestions.

I like this idea. I even think it might work, though I admit to feeling a little unsettled by the notion that all the files on my hard disk would effectively live in one place. Well, that’s not strictly true. Our notion of ‘space’ in filesystems would have to warp a little. It’s easy enough to imagine a filesystem now as a ramifying rabbit warren. This would require us to think of file locations in terms of boolean queries, and I can’t come up with a nice metaphor. I think it’s easy enough to grasp, but there’s nothing outside the computer that implements tags, because they inherently incorporate the idea of superposition (one thing existing in multiple places).
I would love to see a FUSE implementation of this. It would have to be open source and run on Linux, and I’d consider trying it. The closest I’ve seen (from this list) are:

OpenomyFS – propietary and web-based. Otherwise, looks interesting
TagsFs – seems to be focused on mp3 tags
RelFS – a full relational database
LFS – the most interesting of the bunch

If it turns out that any of those projects are alive and easy to try, I’d be pretty gung-ho about it.

UPDATE: there are some great links and comments below, and also at:

Greg Detre

Category Archives: tagging