Why use hashing to create pathnames for large collections of files?

Hash/B:Tree A hash has the advantage of being faster to look at when you're only going to use the "=" operator for searchs If you're going to use things like "" or anything else than "=", you'll want to use a B:Tree because it will be able to do that kind of searchs Directory structure If you have hundreds of thousands of files to store on a filesystem and you put them all in a single directory, you'll get to a point where the directory inode will grow so fat that it will takes minutes to add/remove a file from that directory, and you might even get to the point where the inode won't fit in memory, and you won't be able to add/remove or even touch the directory You can be assured that for hashing method foo, foo("something") will always return the same thing, say, "grbezi". Now, you use part of that hash to store the file, say, in gr/be/something. Next time you need that file, you'll just have to compute the hash and it will be directly available.

Plus, you gain the fact that with a good hash function, the distribution of hashes in the hash space is pretty good, and, for a large number of files, they will be evenly distributed inside the hierarchy, thus splitting the load.

Hash/B:Tree A hash has the advantage of being faster to look at when you're only going to use the "=" operator for searchs. If you're going to use things like "" or anything else than "=", you'll want to use a B:Tree because it will be able to do that kind of searchs. Directory structure If you have hundreds of thousands of files to store on a filesystem and you put them all in a single directory, you'll get to a point where the directory inode will grow so fat that it will takes minutes to add/remove a file from that directory, and you might even get to the point where the inode won't fit in memory, and you won't be able to add/remove or even touch the directory.

You can be assured that for hashing method foo, foo("something") will always return the same thing, say, "grbezi". Now, you use part of that hash to store the file, say, in gr/be/something. Next time you need that file, you'll just have to compute the hash and it will be directly available.

Plus, you gain the fact that with a good hash function, the distribution of hashes in the hash space is pretty good, and, for a large number of files, they will be evenly distributed inside the hierarchy, thus splitting the load.

I liked your original answer better. Why did you change it? – Stephen Dec 3 '08 at 22:29 Hum, yes, I was not pleased with it, I'll mix the two of them :-) – mat Dec 3 '08 at 22:54 thanks, does this technique of using a hash to determine a location in the filesystem have an name?(is it a pattern?) – Stephen Dec 3 '08 at 23:29 I have no idea if it has a name :-) – mat Dec 3 '08 at 0:53.

I think we need a little bit closer look at what you're trying to do. In general, a hash and a B-Tree abstractly provide two common operations: "insert item", and "search for item". A hash performs them, asymptotically, in O(1) time as long as the hash function is well behaved (although in most cases, a very poorly behaved hash against a particular workload can be as bad as O(n).) A B tree, by comparison, requires O(log n) time for both insertions and searches.

So if those are the only operations you perform, a hash table is the faster choice (and considerably simpler than implementing a B tree if you must write it yourself. ) The kicker comes in when you want to add operations. If you want to do anything that requires ordering (which means, say, reading the elements in key order), you have to do other things, the simplest being to copy and sort the keys, and then access the keys using that temporary table.

The problem there is that the time complexity of sorting is O(n log n), so if you have to do it very oten, the hash table no longer has a performance advantage.

A hash is faster to check than it is to traverse a B-tree. So if frequent existence checks are made, this method might be useful. Other than that, I don't really understand the situation because hash tables don't preserve ordering or hierarchies.

Therefore, storing a directory structure in them doesn't seem feasable if directories need to be traversed individually.

I don't think he's talking about a hash table, but rather hashing some aspect of the data and using it as a filename in a directory structure. I would ordinarily think that it's really generating a GUID rather than a hash, but I'd need more detail on what the actual problem is. – tvanfosson Dec 3 '08 at 22:20 No existance checks are ever made.

Generally you put stuff in and keep its location in your db. – Stephen Dec 3 '08 at 22:21.

Hashes also gives a unique'ness to the pathname. Very few name clashes.

Thinking that would be a very bad idea, if a hypothetical hashing function, say, one which returns modulo 2**64, which will gives you a 64 bit number, you won't have clashes for the numbers from 0 to 2^64-1, then, when you get to 2**64, it will clash with 0, and so on. – mat Dec 3 '08 at 23:07 Yeah, "very few name clashes"! = uniqueness.

– C. Dragon 76 Dec 3 '08 at 23:40 Hash tables always have to account for the odd nameclash. Nothing new.It's still better than whatever else your choose as filenames.

– joveha Dec 4 '08 at 15:25.

I cant really gove you an answer,but what I can give you is a way to a solution, that is you have to find the anglde that you relate to or peaks your interest. A good paper is one that people get drawn into because it reaches them ln some way.As for me WW11 to me, I think of the holocaust and the effect it had on the survivors, their families and those who stood by and did nothing until it was too late.

Related Questions