Lots of ideas. However, if you want practical help, edit your question to show ALL of your code. Also tell us what the "it" is that shows memory used, what it shows when you load a file with zero entries, what platform you are on, and what version of Python.

You say that "the word can be 1-5 words long". What is the average length of the key field in BYTES? Are the ids all integers? If so, what are the min and max integer? If not, what is the average length of the ID in bytes? To enable cross-checking of all of the above, how many bytes are there in your 6.5M-line file?

Looking at your code, a 1-line file word1,1 will create a dict d['1'] = 'word1' ... isn't that bassackwards?

Update 3: More questions: How is the "word" encoded? Are you sure you are not carrying a load of trailing spaces on either of the two fields?

Update 4 ... You asked "how to most efficiently store key/value pairs in memory with python" and nobody's answered that yet with any accuracy.

You have a 168 MB file with 6.5 million lines. That's 168 * 1.024 ** 2 / 6.5 = 27.1 bytes per line. Knock off 1 byte for the comma and 1 byte for the newline (assuming it's a *x platform) and we're left with 25 bytes per line. Assuming the "id" is intended to be unique, and as it appears to be an integer, let's assume the "id" is 7 bytes long; that leaves us with an average size of 18 bytes for the "word". Does that match your expectation?
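The back-of-the-envelope arithmetic above can be written out in a few lines of Python. This is only a sketch: the function name and its parameters are made up for illustration, and the 7-byte id length is the same assumption as in the text.

```python
def estimate_key_bytes(file_bytes, n_lines, id_bytes=7, overhead_bytes=2):
    """Return (bytes per line, estimated average key length in bytes).

    overhead_bytes covers the comma and a 1-byte newline;
    id_bytes is the ASSUMED average length of the id field.
    """
    per_line = file_bytes / float(n_lines)
    key_bytes = per_line - overhead_bytes - id_bytes
    return per_line, key_bytes


# 168 MB file (taking 1 MB = 1024 ** 2 bytes) with 6.5 million lines
per_line, key_bytes = estimate_key_bytes(168 * 1024 ** 2, 6500000)
print("%.1f bytes per line, about %.0f bytes per key" % (per_line, key_bytes))
```

If the real average key length is far from that figure, one of the assumptions (id length, newline size, encoding) is probably off.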
So, we want to store an 18-byte key and a 7-byte value in an in-memory look-up table.

Let's assume a 32-bit CPython 2.6 platform.

    >>> K = sys.getsizeof('123456789012345678')
    >>> V = sys.getsizeof('1234567')
    >>> K, V
    (42, 31)

Note that sys.getsizeof(str_object) => 24 + len(str_object)

Tuples were mentioned by one answerer. Note carefully the following:

    >>> sys.getsizeof(())
    28
    >>> sys.getsizeof((1,))
    32
    >>> sys.getsizeof((1,2))
    36
    >>> sys.getsizeof((1,2,3))
    40
    >>> sys.getsizeof(("foo", "bar"))
    36
    >>> sys.getsizeof(("fooooooooooooooooooooooo", "bar"))
    36
    >>>

Conclusion: sys.getsizeof(tuple_object) => 28 + 4 * len(tuple_object) ... it only allows for a pointer to each item, it doesn't allow for the sizes of the items.

A similar analysis of lists shows that sys.getsizeof(list_object) => 36 + 4 * len(list_object) ... again it is necessary to add the sizes of the items. There is a further consideration: CPython overallocates lists so that it doesn't have to call the system realloc() on every list.append() call. For sufficiently large size (like 6.5 million!) the overallocation is 12.5 percent -- see the source (Objects/listobject.c). This overallocation is not done with tuples (their size doesn't change).

Here are the costs of various alternatives to dict for a memory-based look-up table:

List of tuples:

Each tuple will take 36 bytes for the 2-tuple itself, plus K and V for the contents. So N of them will take N * (36 + K + V); then you need a list to hold them, so we need 36 + 1.125 * 4 * N for that.

Total for list of tuples: 36 + N * (40.5 + K + V)

That's 36 + 113.5 * N (about 709 MB when N is 6.5 million)

Two parallel lists:

(36 + 1.125 * 4 * N + K * N) + (36 + 1.125 * 4 * N + V * N), i.e. 72 + N * (9 + K + V)

Note that the difference between 40.5 * N and 9 * N is about 200MB when N is 6.5 million.

Value stored as int not str:

But that's not all.
If the IDs are actually integers, we can store them as such.

    >>> sys.getsizeof(1234567)
    12

That's 12 bytes instead of 31 bytes for each value object. That difference of 19 * N is a further saving of about 118MB when N is 6.5 million.

Use array.array('l') instead of list for the (integer) value:

We can store those 7-digit integers in an array.array('l'). No int objects, and no pointers to them -- just a 4-byte signed integer value. Bonus: arrays are overallocated by only 6.25% (for large N). So that's 1.0625 * 4 * N instead of the previous (1.125 * 4 + 12) * N, a further saving of 12.25 * N, i.e. 76 MB.

So we're down to 709 - 200 - 118 - 76 = about 315 MB.

N.B. Errors and omissions excepted -- it's 0127 in my TZ :-(
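To make the last variant concrete, here is a minimal sketch of a look-up table built from a sorted list of keys and a parallel array.array('l') of integer ids. The class name CompactLookup and the use of bisect for the search are assumptions made for illustration (the text above only costs out the storage, it does not say how to search it), and the sketch assumes every id fits in a signed C long.

```python
from array import array
from bisect import bisect_left


class CompactLookup(object):
    """Hypothetical read-only word -> integer id table.

    Keys live in one sorted list; ids live in a parallel array('l'),
    so no per-id int objects or pointers to them are kept.
    """

    def __init__(self, pairs):
        # Sort once by key so that bisect can be used for lookups.
        ordered = sorted(pairs)
        self._keys = [word for word, _ in ordered]
        self._ids = array('l', (int(id_) for _, id_ in ordered))

    def get(self, word, default=None):
        # Binary search: O(log N) comparisons instead of a dict's O(1).
        i = bisect_left(self._keys, word)
        if i < len(self._keys) and self._keys[i] == word:
            return self._ids[i]
        return default


table = CompactLookup([("word1", 1), ("another phrase", 1234567), ("zebra", 42)])
print(table.get("another phrase"))
print(table.get("missing"))
```

The trade-off is lookup speed: each get() does a binary search rather than a hash probe, so this only pays off if memory matters more than raw query rate.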
Ok I made the edit. Hope that helps show something I'm doing dumb :) – James Feb 6 '10 at 4:21

doh, yea I wrote it wrong, I edited it to show how the data is stored on disk 1,word – James Feb 6 '10 at 4:38

yes it's all stripped before it gets saved using wordstr.strip() – James Feb 6 '10 at 4:46

@beagleguy: Please post the code that you are ACTUALLY RUNNING, exactly, i.e. prefer copy/paste. And answer the other questions. – John Machin Feb 6 '10 at 4:47

that code is the code that's actually running to load the term file – James Feb 6 '10 at 4:47
Take a look (Python 2.6, 32-bit version)...:

    >>> sys.getsizeof('word,1')
    30
    >>> sys.getsizeof(('word', '1'))
    36
    >>> sys.getsizeof(dict(word='1'))
    140

The string (taking 6 bytes on disk, clearly) gets an overhead of 24 bytes (no matter how long it is, add 24 to its length to find how much memory it takes). When you split it into a tuple, that's a little bit more. But the dict is what really blows things up: even an empty dict takes 140 bytes -- pure overhead of maintaining a blazingly-fast hash-based lookup table. To be fast, a hash table must have low density -- and Python ensures a dict is always low density (by taking up a lot of extra memory for it).

The most memory-efficient way to store key / value pairs is as a list of tuples, but lookup of course will be very slow (even if you sort the list and use bisect for the lookup, it's still going to be far slower than a dict).

Consider using shelve instead -- that will use little memory (since the data reside on disk) and still offer pretty spiffy lookup performance (not as fast as an in-memory dict, of course, but for a large amount of data it will be much faster than lookup on a list of tuples, even a sorted one, can ever be!-).
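A minimal sketch of the shelve suggestion follows. The file name terms.shelf and the sample data are made up for illustration; shelve.open and the dict-style access are standard library, but which dbm backend ends up underneath depends on the installation.

```python
import os
import shelve
import tempfile

# Hypothetical location for the on-disk table.
path = os.path.join(tempfile.mkdtemp(), "terms.shelf")

# Build the shelf once; values are pickled and written to disk.
db = shelve.open(path)
try:
    db["word1"] = 1
    db["another phrase"] = 1234567
finally:
    db.close()

# Later: reopen read-only and look keys up much as with a dict.
db = shelve.open(path, flag="r")
try:
    found = db.get("another phrase")
    missing = db.get("no such word")
finally:
    db.close()

print(found, missing)
```

Only the entries you touch are read from disk, which is why memory use stays small regardless of how many keys the shelf holds.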
– John Machin Feb 6 '10 at 11:55

protocol=-1 is a good suggestion, but I don't know what "started dying" means -- probably just the thrashing behavior causing slow operation once you exhaust physical memory. So the next thing to try is a real database, whether relational or not -- bsddb is one (non-relational), and implicitly used by anydbm (and, with one indirection, shelve), but the version that comes with Python is not the best or fastest. I'd try the sqlite that does come with Python, first, to avoid installing any further pieces; if still not enough, third-party DBs are next. – Alex Martelli Feb 6 '10 at 16:44
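As a rough sketch of the sqlite suggestion: one table with the word as primary key, queried through the sqlite3 module that ships with Python. The table and column names (terms, word, id) are invented for illustration, and an in-memory database is used only so the example is self-contained; a real run would pass a file path to connect().

```python
import sqlite3

# ":memory:" keeps the example self-contained; use a file path for real data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE terms (word TEXT PRIMARY KEY, id INTEGER NOT NULL)")

rows = [("word1", 1), ("another phrase", 1234567)]
# Load everything in a single transaction rather than committing per row.
conn.executemany("INSERT INTO terms (word, id) VALUES (?, ?)", rows)
conn.commit()


def lookup(word):
    """Return the id for word, or None if it is not in the table."""
    row = conn.execute("SELECT id FROM terms WHERE word = ?", (word,)).fetchone()
    return row[0] if row else None


print(lookup("another phrase"))
print(lookup("no such word"))
```

The primary key gives an index on word, so each lookup is an index search rather than a table scan.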
Convert your data into a dbm (import anydbm, or use Berkeley DB via import bsddb ...), and then use the dbm API to access it.

The reason for the explosion is that Python keeps extra meta information for every object, and the dict needs to construct a hash table (which requires more memory). You created so many objects (6.5M) that the metadata becomes huge.

    import bsddb
    a = bsddb.btopen('a.bdb')  # you can also try bsddb.hashopen
    for x in xrange(10500):
        a['word%d' % x] = '%d' % x
    a.close()

This code takes only 1 second to run, so I think the speed is OK (since you said 10500 lines per second). btopen creates a db file 499,712 bytes in length, and hashopen creates one of 319,488 bytes.

With xrange input as 6.5M and using btopen, I got 417,080KB in output file size and around 1 or 2 minutes to complete insertion. So I think it's totally suitable for you.
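For readers on Python 3, where bsddb is no longer in the standard library and anydbm has been renamed dbm, the same idea might look like the sketch below. The file name is made up, and which backend dbm.open picks (and therefore the file size and speed) depends on the installation, so the timings quoted above should not be assumed to carry over.

```python
import dbm
import os
import tempfile

# Hypothetical file name; the backend may add its own extension(s).
path = os.path.join(tempfile.mkdtemp(), "terms")

db = dbm.open(path, "c")  # "c": create the database if it does not exist
try:
    for x in range(10500):
        # dbm stores bytes; str keys and values are encoded on the way in.
        db["word%d" % x] = "%d" % x
finally:
    db.close()

db = dbm.open(path, "r")  # reopen read-only for lookups
try:
    value = db["word42"]        # values come back as bytes
    has_missing = "nope" in db  # membership test, no KeyError raised
finally:
    db.close()

print(value, has_missing)
```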
The reason I moved to in memory from the database is I had it in a mysql table with the word as the primary key (there can only be one unique word or phrase). I'm doing around 20 million queries to process one hour of data. Moving it to memory saved around 10 minutes off the processing time. Do you have an idea that you were thinking of to help with that situation? – James Feb 6 '10 at 4:11

You don't need an SQL database (they are powerful but slower). The DBMs are hash-value style databases so they should be fast (very similar to dict objects) and need a smaller memory footprint than SQL. Berkeley DB would be my best choice (import bsddb). – Francis Feb 6 '10 at 4:17

@beagleguy: dbm-style databases are key-value oriented, not SQL. Modifying your program to work with one of them, now that you've got it working on a dictionary, won't take much time (vs. going from MySQL to dict), and then you can compare with MySQL. It will always be better than having 1.3G of RAM taken over by your data :) – Heim Feb 6 '10 at 4:18

Also, if you plan to analyze various datasets concurrently from a number of different computers, you may want to have a look at key-value databases like Redis – Heim Feb 6 '10 at 4:20

Well, I think you don't need to worry about 3.0 unless you are very determined to use 3.0 "now". Shelve is more like a Python native storage, so it should work similarly, but there's no promise on performance. – Francis Feb 6 '10 at 5:04