I am stunned at the number of responses that imply that such a thing is impossible. Have these people never heard of "compressed file systems", which have been around since before Microsoft was sued in 1993 by Stac Electronics over compressed file system technology? I hear that LZS and LZJB are popular algorithms for people implementing compressed file systems, which necessarily require both random-access reads and random-access writes.
Perhaps the simplest and best thing to do is to turn on file system compression for that file and let the OS deal with the details. But if you insist on handling it manually, perhaps you can pick up some tips by reading about NTFS transparent file compression. Also check out the StackOverflow question "Compression formats with good support for random access within archives?"
I don't know of any compression algorithm that allows random reads, never mind random writes. If you need that sort of ability, your best bet would be to compress the file in chunks rather than as a whole. For example:
We'll look at the read-only case first. Let's say you break up your file into 8K chunks. You compress each chunk and store each compressed chunk sequentially.
You will need to record where each compressed chunk is stored and how big it is. Then, say you need to read N bytes starting at offset O. You will need to figure out which chunk it's in (O / 8K), decompress that chunk and grab those bytes.
The data you need may span multiple chunks, so you have to deal with that scenario. Things get complicated when you want to be able to write to the compressed file. You have to deal with compressed chunks getting bigger and smaller.
You may need to add some extra padding to each chunk in case it expands (it's still the same size uncompressed, but different data will compress to different sizes). You may even need to move chunks if the compressed data is too big to fit back in the original space it was given. This is basically how compressed file systems work.
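The read-only case described above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: it assumes zlib as the compressor, keeps everything in memory, and the names `compress_chunked` and `read_range` are made up for the example.

```python
import zlib

CHUNK = 8 * 1024  # uncompressed chunk size (the 8K from the example above)

def compress_chunked(data: bytes):
    """Compress data chunk by chunk.

    Returns (blob, index), where index[i] = (offset, length) of
    compressed chunk i inside blob.
    """
    blob = bytearray()
    index = []
    for start in range(0, len(data), CHUNK):
        comp = zlib.compress(data[start:start + CHUNK])
        index.append((len(blob), len(comp)))
        blob += comp
    return bytes(blob), index

def read_range(blob: bytes, index, offset: int, n: int) -> bytes:
    """Read n uncompressed bytes at offset, decompressing only the chunks touched."""
    if n <= 0:
        return b""
    first = offset // CHUNK
    last = min((offset + n - 1) // CHUNK, len(index) - 1)
    out = bytearray()
    for i in range(first, last + 1):  # the request may span several chunks
        pos, size = index[i]
        out += zlib.decompress(blob[pos:pos + size])
    skip = offset - first * CHUNK  # position of the request inside the first chunk
    return bytes(out[skip:skip + n])
```

A read that crosses a chunk boundary simply decompresses both neighbours and slices the result; the index is the "where is each chunk and how big is it" record mentioned above.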
You might be better off turning on file system compression for your files and just reading and writing them normally.
I had posted an answer about Huffman coding. Reading your response made me pause and think about how Huffman coding is done, and you're right, random writes would spoil the encoding. – Bill the Lizard Oct 25 '08 at 13:55
In the case of writes you will never need extra padding. You just will have to re-compress both blocks who share the boundary that is crossed. This is because there is no API that will insert data into a position of a file. – Brian R. Bondy Oct 25 '08 at 13:57
@Brian R. Bondy: Surely writes are worse than that because they can change the size of the compressed file (even if the uncompressed data remains the same size). – Hugh Allen Oct 25 '08 at 14:07
Ah right, thanks Hugh. – Brian R. Bondy Oct 25 '08 at 14:18
@Brian - I was thinking you could remap the block to a different position in the file. – Ferruccio Oct 25 '08 at 14:20
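The remapping idea from the comments could look something like the sketch below. It is only one possible reading of it: a rewritten chunk is recompressed and appended at the end, and the index is pointed at the new copy, so the old copy becomes dead space that would need compacting later. It assumes zlib, in-memory storage, and overwrites that stay within the existing data; the class name `ChunkStore` is invented for the example.

```python
import zlib

CHUNK = 8 * 1024  # uncompressed chunk size

class ChunkStore:
    """Append-only blob plus an index: chunk number -> (offset, length)."""

    def __init__(self, data: bytes):
        self.blob = bytearray()
        self.index = []
        for start in range(0, len(data), CHUNK):
            self.index.append(self._append(data[start:start + CHUNK]))

    def _append(self, raw: bytes):
        """Compress raw, append it to the blob, return its (offset, length)."""
        comp = zlib.compress(raw)
        pos = len(self.blob)
        self.blob += comp
        return pos, len(comp)

    def _chunk(self, i: int) -> bytes:
        pos, size = self.index[i]
        return zlib.decompress(bytes(self.blob[pos:pos + size]))

    def write(self, offset: int, new: bytes):
        """Overwrite bytes in place (no insertion); touched chunks are remapped."""
        end = offset + len(new)
        for i in range(offset // CHUNK, (end - 1) // CHUNK + 1):
            base = i * CHUNK
            raw = bytearray(self._chunk(i))
            lo, hi = max(offset, base), min(end, base + len(raw))
            raw[lo - base:hi - base] = new[lo - offset:hi - offset]
            # The recompressed chunk may be bigger or smaller than before,
            # so it goes to a fresh position; the old copy is now dead space.
            self.index[i] = self._append(bytes(raw))

    def read_all(self) -> bytes:
        return b"".join(self._chunk(i) for i in range(len(self.index)))
```

This sidesteps the padding question entirely, at the cost of a file that grows until something compacts it.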
A dictionary-based compression scheme, with each dictionary entry's code being encoded with the same size, will result in being able to begin reading at any multiple of the code size, and writes and updates are easy if the codes make no use of their context/neighbors. If the encoding includes a way of distinguishing the start or end of codes then you do not need the codes to be the same length, and you can start reading anywhere in the middle of the file. This technique is more useful if you're reading from an unknown position in a stream.
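As a rough illustration of the fixed-size-code case, here is a toy sketch. The assumptions are mine: tokens are whole words, every code is a 2-byte big-endian integer, the dictionary is built up front, and an update may only use a token that is already in the dictionary. The function names are hypothetical.

```python
import struct

CODE = struct.Struct(">H")  # every code is exactly 2 bytes

def build_dictionary(tokens):
    """Assign each distinct token a code; returns (token -> code, code -> token)."""
    codes, table = {}, []
    for t in tokens:
        if t not in codes:
            codes[t] = len(table)
            table.append(t)
    return codes, table

def encode(tokens, codes) -> bytearray:
    return bytearray(b"".join(CODE.pack(codes[t]) for t in tokens))

def read_token(encoded, table, k: int):
    """Random-access read: the k-th code sits at byte offset k * CODE.size."""
    (code,) = CODE.unpack_from(encoded, k * CODE.size)
    return table[code]

def write_token(encoded, codes, k: int, token):
    """Random-access update: overwrite one code without touching its neighbours."""
    CODE.pack_into(encoded, k * CODE.size, codes[token])
```

Because no code depends on its neighbours, both the read and the update are a single seek to a multiple of the code size.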
I think Stephen Denne might be onto something here. Imagine:
- zip-like compression of sequences to codes
- a dictionary mapping code -> sequence
- the file will be like a filesystem
- each write generates a new "file" (a sequence of bytes, compressed according to the dictionary)
- the "filesystem" keeps track of which "file" belongs to which bytes (start, end)
- each "file" is compressed according to the dictionary
- reads work filewise, uncompressing and retrieving bytes according to the "filesystem"
- writes make "files" invalid; new "files" are appended to replace the invalidated ones
- this system will need:
  - a defragmentation mechanism for the filesystem
  - compacting of the dictionary from time to time (removing unused codes)
- done properly, housekeeping could be done when nobody is looking (idle time) or by creating a new file and "switching" eventually
One positive effect would be that the dictionary would apply to the whole file. If you can spare the CPU cycles, you could periodically check for sequences overlapping "file" boundaries and then regroup them.
This idea is for truly random reads. If you are only ever going to read fixed sized records, some parts of this idea could get easier.
No compression scheme will allow fine-grained random access, for two related reasons: you can't know exactly how far into the compressed file your desired piece of data is, and therefore there is no way to know where a symbol starts (at what bit position for Huffman; worse for arithmetic coding). I can only suggest treating the file like a broadcast stream and inserting frequent synchronization / position markers, with obvious overhead (the sync marks not only take up space themselves, but complicate the encoding because it has to avoid "accidental" sync marks!).
Alternatively, and to avoid seeking being something like a binary search (with the optimization that you can take a better guess where to start than the middle), you could include a "table of contents" at the start or end of the file. As for random-access writing... I can't think of any neat solution :(.
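One way to try the "table of contents" idea with an off-the-shelf compressor is sketched below. Note that this swaps in a different mechanism from in-band sync marks: it uses zlib's `Z_FULL_FLUSH`, which byte-aligns the output and resets the compressor's history, and records the compressed offset of each such point. The spacing `SPAN` and the function names are assumptions made for the example, and raw deflate (`wbits=-15`) is used so that decoding can restart at a recorded point without a header.

```python
import zlib

SPAN = 4096  # uncompressed bytes between restart points (arbitrary choice)

def compress_with_toc(data: bytes):
    """Return (blob, toc); toc[i] is the compressed offset at which
    uncompressed offset i * SPAN begins."""
    c = zlib.compressobj(6, zlib.DEFLATED, -15)  # raw deflate, no header
    out = bytearray()
    toc = []
    for start in range(0, len(data), SPAN):
        toc.append(len(out))
        out += c.compress(data[start:start + SPAN])
        # Full flush: byte-align the output and forget earlier history,
        # so a decoder can start fresh at the next recorded offset.
        out += c.flush(zlib.Z_FULL_FLUSH)
    out += c.flush()
    return bytes(out), toc

def read_at(blob: bytes, toc, offset: int, n: int) -> bytes:
    """Decode from the nearest restart point at or before offset."""
    i = offset // SPAN
    skip = offset - i * SPAN
    d = zlib.decompressobj(-15)
    raw = d.decompress(blob[toc[i]:], skip + n)  # stop after skip + n bytes
    return raw[skip:skip + n]
```

The trade-off is the one already described: more restart points mean a bigger table and worse compression, but less data to decode and throw away on each read.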
Compression is all about removing redundancy from the data. Unfortunately, it's unlikely that the redundancy is going to be distributed with monotonous evenness throughout the file, and that's about the only scenario in which you could expect compression and fine-grained random access. However, you could get close to random access by maintaining an external list, built during the compression, which shows the correspondence between chosen points in the uncompressed datastream and their locations in the compressed datastream.
You'd obviously have to choose a method where the translation scheme between the source stream and its compressed version does not vary with the location in the stream (i.e. no LZ77 or LZ78; instead you'd probably want to go for Huffman or byte-pair encoding). Obviously this would incur a lot of overhead, and you'd have to decide on just how you wanted to trade off between the storage space needed for "bookmark points" and the processor time needed to decompress the stream starting at a bookmark point to get the data you're actually looking for on that read. As for random-access writing... that's all but impossible.
As already noted, compression is about removing redundancy from the data. If you try to replace data that could be and was compressed because it was redundant with data that does not have the same redundancy, it's simply not going to fit. However, depending on how much random-access writing you're going to do, you may be able to simulate it by maintaining a sparse matrix representing all data written to the file after the compression.
On all reads, you'd check the matrix to see if you were reading an area that you had written to after the compression. If not, then you'd go to the compressed file for the data.
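A minimal sketch of that overlay, assuming the "sparse matrix" is simply a dictionary from byte offset to byte value, and that some function for reading from the compressed file already exists (here the hypothetical `read_base(offset, n)`):

```python
class Overlay:
    """Writes go into a sparse patch table; reads consult it first."""

    def __init__(self, read_base):
        # read_base(offset, n) -> bytes is assumed to read from the
        # compressed file; how it does so is outside this sketch.
        self.read_base = read_base
        self.patches = {}  # byte offset -> new byte value

    def write(self, offset: int, data: bytes):
        for i, b in enumerate(data):
            self.patches[offset + i] = b

    def read(self, offset: int, n: int) -> bytes:
        out = bytearray(self.read_base(offset, n))
        for i in range(len(out)):
            b = self.patches.get(offset + i)
            if b is not None:  # this byte was written after compression
                out[i] = b
        return bytes(out)
```

A per-byte dictionary is the simplest thing that works; with many writes you would probably want ranges instead, and eventually a recompression pass that folds the patches back into the file.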
The razip format supports random access reads with better performance than gzip/bzip2 which have to be tweaked for this support: sourceforge.net/projects/razip.