organizing-knowledge

Until recently, it was easier to find something amidst the five zillion pages on the web than it was to find something on your own hard disk. It would be faster to Google for something than to burrow through subdirectories looking for it.

Could this be because the files on my hard disk are poorly organized? Bah. Maybe so. But that’s not my fault – it’s more or less inevitable once you have a lot of files, because hierarchical filesystems require each file to live in a single location. If I download a paper on memory for a class, should I organize by:

the context, e.g. the name of the class, lumping together all my writings and reading materials from that context together – ~/psy330/reading/
or by the type – things I’ve written vs reading materials – ~/reading/psy330/
or by good/bad or date produced or something else entirely?

Whichever decision you make, there’ll be times when you’ll wish things were organized some other way. This is why tagging is so popular. It’s because things inherently belong to multiple categories. And, because tagging is easy.
Google Desktop, Spotlight, Beagle and other offerings have helped considerably with all this. If you want to locate a single file, and you can’t remember where you put it, then full-text search is the way to go. But let’s consider the case where you have files that you want to treat as related, even if their contents aren’t obviously similar. We want this all the time. Take the reading list for a particular course or project as an example. This is why we needed directories and filing cabinets in the first place.
My proposal here is to replace the hierarchical filesystem with a completely flat space and lots of tags. Each file would be tagged with one or more tags, just like on http://del.icio.us/. The ‘save as’ dialog would look a little different. Instead of a list of directories that you can burrow into, there’d be a list of tags. When saving a file, you’d select as few or as many as you like, give the file a name just as now, and you’re done. To open a document, you filter using some tags, watching the list of files that match being winnowed down, and select from an alphabetized list. Or, use wildcards to winnow down by filename directly. Or some combination.
Converting an existing hierarchical filesystem would be easy in most cases. You could just grab all the subdirectory names in a path and treat them like unordered words in a bag. Let’s keep the same ‘/’ file separator we’re used to, but change its implicit meaning from ‘contains-this-directory’ to ‘and-also-this-tag’, so:

~/reading/psy330/hippocampus/blah.pdf

would now be equally accessible from:

~/reading/psy330/hippocampus/blah.pdf
~/psy330/reading/hippocampus/blah.pdf
~/reading/hippocampus/psy330/blah.pdf

All these locations would end up meaning the same thing. In this way, a subdirectory is really a conjunction of tags. In our simple example of storing .doc and .pdf files for documents and reading materials for a class, we’d simply tag some of them ‘doc’ and some of them ‘reading’, and give them both the ‘psy330’ tag for the class.
Upon looking at this, it’s clear you’ve lost some information, but I don’t think it’s information we’d miss much. The assumption underlying a lot of this is that where we now have hierarchy, we could manage just as well with intersecting sets, which would require considerably less effort to memorize.
There are, inevitably, unanswered questions and lurking gotchas.

I think we’d probably want to create a default/preferred way of expressing things, so that tags with more items or that are more discriminative go on the left, or something akin.
You shouldn’t need to specify all the tags for a given file. Just enough to specify it uniquely, given its filename. So, if there are no other blah.pdf files in the ‘reading’ tag, then you should probably be able to access it straightforwardly at ~/reading/blah.pdf though this has the unfortunate implication that if you were to add a new blah.pdf that also had a ‘reading’ tag, the above location would become ambiguous.
If there are multiple blah.pdf files in the reading tag, then the system would need to prompt you with a list of tags that would help disambiguate them. Wikipedia’s interface might have some lessons about disambigation that could be learned from.
At this stage, a tags-not-trees system seems better-suited for home directories (‘My Documents’ for Windows users) than system directories. In home directories, most of the organization is human-generated and needs to be human-readable, whereas /etc directories are mostly machine-generated to be uncomplicatedly machine-readable.
The only way metadata-entry systems work is if they require little work on the user’s part. The nice thing about tagging is that it should be relatively easy for the computer to make guesses about which tags you’ll want to put something in, based on your tagging of previous files. So when you click ‘save as’, it will prompt you with a list of tags that it thinks you’ll want to use, ordered in terms of certainty. You delete a couple, add a couple more, and leave the rest in place.
This is not a trivial problem, but you’ll have a large corpus from which to do your Bayesian learning (or whatever). And you can seed the corpus from day one with information from the existing file hierarchy, and with some clustering applied to the full text of the files.
This is the kind of problem that machine learning can really help with. There’s a decent amount of data, it’s getting feedback on each guess from the user and it’s doesn’t matter if it’s occasionally off-base because it’s only making suggestions.

I like this idea. I even think it might work, though I admit to feeling a little unsettled by the notion that all the files on my hard disk would effectively live in one place. Well, that’s not strictly true. Our notion of ‘space’ in filesystems would have to warp a little. It’s easy enough to imagine a filesystem now as a ramifying rabbit warren. This would require us to think of file locations in terms of boolean queries, and I can’t come up with a nice metaphor. I think it’s easy enough to grasp, but there’s nothing outside the computer that implements tags, because they inherently incorporate the idea of superposition (one thing existing in multiple places).
I would love to see a FUSE implementation of this. It would have to be open source and run on Linux, and I’d consider trying it. The closest I’ve seen (from this list) are:

OpenomyFS – propietary and web-based. Otherwise, looks interesting
TagsFs – seems to be focused on mp3 tags
RelFS – a full relational database
LFS – the most interesting of the bunch

If it turns out that any of those projects are alive and easy to try, I’d be pretty gung-ho about it.

UPDATE: there are some great links and comments below, and also at:

Greg Detre

Category Archives: organizing-knowledge

Two-level tagging

Make tags not trees – filesystem idea based on tags instead of hierarchical directories