Make tags not trees – filesystem idea based on tags instead of hierarchical directories

Until recently, it was easier to find something amidst the five zillion pages on the web than it was to find something on your own hard disk. It would be faster to Google for something than to burrow through subdirectories looking for it.

Could this be because the files on my hard disk are poorly organized? Bah. Maybe so. But that’s not my fault – it’s more or less inevitable once you have a lot of files, because hierarchical filesystems require each file to live in a single location. If I download a paper on memory for a class, should I organize by:

  • the context, e.g. the name of the class, lumping together all my writings and reading materials from that context – ~/psy330/reading/
  • or by the type – things I’ve written vs reading materials – ~/reading/psy330/
  • or by good/bad or date produced or something else entirely?

Whichever decision you make, there’ll be times when you’ll wish things were organized some other way. This is why tagging is so popular: things inherently belong to multiple categories, and tagging is easy.
Google Desktop, Spotlight, Beagle and other offerings have helped considerably with all this. If you want to locate a single file, and you can’t remember where you put it, then full-text search is the way to go. But let’s consider the case where you have files that you want to treat as related, even if their contents aren’t obviously similar. We want this all the time. Take the reading list for a particular course or project as an example. This is why we needed directories and filing cabinets in the first place.
My proposal here is to replace the hierarchical filesystem with a completely flat space and lots of tags. Each file would be tagged with one or more tags, just like on http://del.icio.us/. The ‘save as’ dialog would look a little different. Instead of a list of directories that you can burrow into, there’d be a list of tags. When saving a file, you’d select as few or as many as you like, give the file a name just as now, and you’re done. To open a document, you filter using some tags, watching the list of files that match being winnowed down, and select from an alphabetized list. Or, use wildcards to winnow down by filename directly. Or some combination.
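
To make the open-a-document flow concrete, here is a minimal sketch of a flat store that winnows by tags and by filename wildcard. The TagStore class and its save/find methods are names I've made up for illustration, and for simplicity it assumes filenames are unique, which a real system could not.

```python
from fnmatch import fnmatchcase


class TagStore:
    """A flat collection of files, each carrying a set of tags (hypothetical)."""

    def __init__(self):
        # filename -> set of tags; filenames are assumed unique in this sketch
        self._files = {}

    def save(self, name, *tags):
        self._files[name] = set(tags)

    def find(self, *tags, pattern="*"):
        """Alphabetized names that carry every given tag and match the wildcard."""
        wanted = set(tags)
        return sorted(
            name
            for name, file_tags in self._files.items()
            if wanted <= file_tags and fnmatchcase(name, pattern)
        )


store = TagStore()
store.save("blah.pdf", "reading", "psy330", "hippocampus")
store.save("essay.doc", "doc", "psy330")
store.save("notes.doc", "doc", "phil101")

print(store.find("psy330"))          # everything for the class
print(store.find("psy330", "doc"))   # winnowed down by a second tag
print(store.find(pattern="*.doc"))   # winnowed by filename alone
```

Each extra tag can only shrink the list, which is what makes the winnowing feel predictable.
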
Converting an existing hierarchical filesystem would be easy in most cases. You could just grab all the subdirectory names in a path and treat them like unordered words in a bag. Let’s keep the same ‘/’ file separator we’re used to, but change its implicit meaning from ‘contains-this-directory’ to ‘and-also-this-tag’, so:

  • ~/reading/psy330/hippocampus/blah.pdf

would now be equally accessible from:

  • ~/reading/psy330/hippocampus/blah.pdf
  • ~/psy330/reading/hippocampus/blah.pdf
  • ~/reading/hippocampus/psy330/blah.pdf

All these locations would end up meaning the same thing. In this way, a subdirectory is really a conjunction of tags. In our simple example of storing .doc and .pdf files for documents and reading materials for a class, we’d simply tag some of them ‘doc’ and some of them ‘reading’, and give them both the ‘psy330’ tag for the class.
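
As a rough sketch of the conversion, assuming we simply drop the leading ‘~’ and treat every remaining directory name as a tag, the three locations above collapse to the same key. The parse function is a hypothetical helper, not part of any existing tool.

```python
from pathlib import PurePosixPath


def parse(path):
    """Split a path into (unordered set of tags, filename)."""
    *dirs, name = PurePosixPath(path).parts
    # '~' and '/' only mark the root of the flat space, so they are not tags
    tags = frozenset(d for d in dirs if d not in ("~", "/"))
    return tags, name


paths = [
    "~/reading/psy330/hippocampus/blah.pdf",
    "~/psy330/reading/hippocampus/blah.pdf",
    "~/reading/hippocampus/psy330/blah.pdf",
]

keys = {parse(p) for p in paths}
print(len(keys))  # the three spellings share a single key
```

Because the tags go into a set, the order in which the directories were written is discarded.
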
Upon looking at this, it’s clear you’ve lost some information (the order in which the directories were nested), but I don’t think it’s information we’d miss much. The assumption underlying a lot of this is that where we now have hierarchy, we could manage just as well with intersecting sets, which would require considerably less effort to memorize.
There are, inevitably, unanswered questions and lurking gotchas.

  • I think we’d probably want to create a default/preferred way of expressing things, so that tags with more items or that are more discriminative go on the left, or something similar.
  • You shouldn’t need to specify all the tags for a given file, just enough to identify it uniquely, given its filename. So, if there are no other blah.pdf files with the ‘reading’ tag, you should be able to access it straightforwardly at ~/reading/blah.pdf. The unfortunate implication is that if you were to add a new blah.pdf that also had a ‘reading’ tag, that location would become ambiguous.
    If there are multiple blah.pdf files with the ‘reading’ tag, then the system would need to prompt you with a list of tags that would help disambiguate them. Wikipedia’s interface might have some lessons to offer about disambiguation.
  • At this stage, a tags-not-trees system seems better-suited for home directories (‘My Documents’ for Windows users) than system directories. In home directories, most of the organization is human-generated and needs to be human-readable, whereas /etc directories are mostly machine-generated to be uncomplicatedly machine-readable.
  • The only way metadata-entry systems work is if they require little work on the user’s part. The nice thing about tagging is that it should be relatively easy for the computer to make guesses about which tags you’ll want to put something in, based on your tagging of previous files. So when you click ‘save as’, it will prompt you with a list of tags that it thinks you’ll want to use, ordered in terms of certainty. You delete a couple, add a couple more, and leave the rest in place.
    This is not a trivial problem, but you’ll have a large corpus from which to do your Bayesian learning (or whatever). And you can seed the corpus from day one with information from the existing file hierarchy, and with some clustering applied to the full text of the files.
    This is the kind of problem that machine learning can really help with. There’s a decent amount of data, the system gets feedback from the user on each guess, and it doesn’t matter if it’s occasionally off-base because it’s only making suggestions.
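
Two of the points above, partial-tag lookup and tag suggestion, can be sketched roughly as follows. Everything here is an assumption made for illustration: the FILES catalogue, the resolve and suggest_tags names, and above all the use of plain co-occurrence counting as a stand-in for the Bayesian learning, which a real system would presumably do better.

```python
from collections import Counter

# Hypothetical catalogue of (filename, tags); duplicate filenames are allowed.
FILES = [
    ("blah.pdf", frozenset({"reading", "psy330", "hippocampus"})),
    ("blah.pdf", frozenset({"reading", "phil101"})),
    ("essay.doc", frozenset({"doc", "psy330"})),
    ("notes.doc", frozenset({"doc", "psy330"})),
]


def resolve(files, name, tags):
    """Look a file up by name plus a partial list of tags.

    Returns (matches, extra_tags). extra_tags lists the tags carried by
    some but not all matches, i.e. the ones worth prompting for; it is
    empty when the lookup is already unambiguous.
    """
    wanted = frozenset(tags)
    matches = [(n, t) for n, t in files if n == name and wanted <= t]
    if len(matches) <= 1:
        return matches, []
    tag_sets = [t for _, t in matches]
    shared = frozenset.intersection(*tag_sets)
    return matches, sorted(frozenset.union(*tag_sets) - shared)


def suggest_tags(files, chosen, limit=3):
    """Rank other tags by how often they co-occurred with the chosen ones."""
    chosen = frozenset(chosen)
    counts = Counter()
    for _, tags in files:
        if chosen & tags:
            counts.update(tags - chosen)
    # Most frequent first; ties broken alphabetically so the order is stable
    return sorted(counts, key=lambda tag: (-counts[tag], tag))[:limit]


matches, extra = resolve(FILES, "blah.pdf", ["reading"])
print(len(matches), extra)              # ambiguous, so offer the extra tags
print(suggest_tags(FILES, ["psy330"]))  # guesses for a new psy330 file
```

Since the suggestions are only defaults in the ‘save as’ dialog, a crude ranking like this costs the user nothing when it guesses wrong.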

I like this idea. I even think it might work, though I admit to feeling a little unsettled by the notion that all the files on my hard disk would effectively live in one place. Well, that’s not strictly true. Our notion of ‘space’ in filesystems would have to warp a little. It’s easy enough to imagine a filesystem now as a ramifying rabbit warren. This would require us to think of file locations in terms of boolean queries, and I can’t come up with a nice metaphor. I think it’s easy enough to grasp, but there’s nothing outside the computer that implements tags, because they inherently incorporate the idea of superposition (one thing existing in multiple places).
I would love to see a FUSE implementation of this. It would have to be open source and run on Linux for me to consider trying it. The closest I’ve seen (from this list) are:

  • OpenomyFS – proprietary and web-based. Otherwise, looks interesting
  • TagsFs – seems to be focused on mp3 tags
  • RelFS – a full relational database
  • LFS – the most interesting of the bunch

If it turns out that any of those projects are alive and easy to try, I’d be pretty gung-ho about it.

UPDATE: there are some great links and comments below, and also at:


15 thoughts on “Make tags not trees – filesystem idea based on tags instead of hierarchical directories”

  1. Good article!
    One of the other reasons that files “live” in folders is for security (permission) settings. Drop everything in that folder that only the “admin”-(user)group can access and you have set permissions on all those files. This is something that could probably be done with tagging as well (permission tagging), but I am curious to learn what your take is on this.

  2. I totally agree; in fact, I posted something similar back in 2007 (here), and even then the idea had been around for ages. It's good to see that people are working on implementations of one sort and another, and I look forward to seeing how they progress.

  3. @uzair: thanks for the link to Tagsistant. That looks very interesting. However, I'm not convinced about the inclusion of the disjunctive boolean operator (OR).

    Our filesystem hierarchies solely provide a kind of conjunctive (AND) search. Adding the possibility for disjunctive (OR) searches might add some power and flexibility, but only if you also add infrastructure for precedence operators (i.e. parentheses).

    If you were to remove the OR operators, then you wouldn't need to specify the boolean operators at all (i.e. you could assume AND all the time, as all modern search engines do). Less typing, less complexity!

  4. @Bram: The question about security is a good one. Hmmmm. It's trickier than it seems at first.

    My knee-jerk response is that you might have to resort to a file-level permissions system, but maybe someone cleverer than me can come up with a consistent and stable system based on tags.

  5. @Robert Rossney: There are two good reasons to keep the filename:

    – My aim here was to describe an interface that provided the power of tagging, while still allowing all existing applications to work perfectly. Changing/removing the notion of the filename would massively disrupt all our existing applications, and confuse users.

    – I think you still need filenames to provide unique disambiguation, kind of like a primary key. It's easy to imagine having two files that have the same tags – at that point, you'd need some extra tag or identifier to separate them. That might as well be the filename.

  6. Good article. I have no doubt that Tags-not-trees is the future of file systems.

    There are a lot of ways to achieve good performance that wouldn't require complex relational database schemes. For example, you could maintain an a*b* index for files that have tags starting with both a and b (and so on).

    I agree that the tags “xxx, yyy” would almost always work for things in both folders xxx/yyy and yyy/xxx. However, hierarchy does sometimes matter for relationships. For example, “Ted\friends” contains friends of Ted's while “friends\Ted” contains information about friends named Ted. Maybe the way to handle this when converting from a tree system is to encode the immediate parent folder as a tag. So, “friends\Ted” files get tags of “friends, Ted, friends.Ted” and the other gets “friends, Ted, Ted.friends”.

    Likewise, disambiguating file names that share tags is a challenge (as you said), especially for homographs. For example, searching for “grant” might find:
    ~/friends/grant/info.txt and ~/funding/grant/info.txt

    To help, the search results could cluster results by common, shared tags. So, the above would lump results into “friends” and “funding” groups making it easy to dismiss one group or the other.

    Also, when saving new files, the “File Save” dialog could prompt for more tags and/or a different name when name collisions occur. As Windows does, it could default to a non-colliding name like “info~copy2.txt”.

    Despite the shortcomings, I'd be willing to put up with the minor annoyances for all the benefits. Sure, you get quick searches but tags open up other possibilities like simple, built-in versioning (tag: _ver:12.34) or modification history (tag: _modified:##-##-##) for very little effort.

  7. I'd like to add here that there is one production filesystem that explored metadata this way (but didn't encode it into the path): the Be FileSystem, as used in BeOS. Dominic Giampaolo wrote an interesting book on the topic, and I'd highly recommend that people read up on it, as many of the ideas explored in BeFS could find application in a “tagging” filesystem.
