"I figured that Python would store the first part only once if it would occur in combination with different positions"

Not necessarily:

    >>> ("S", "NP", "VP") is ("S", "NP", "VP")
    False

You might want to intern all strings referring to non-terminals, since you seem to be creating a lot of these in rcgrules.py. If you want to intern a tuple, then turn it into a string first:

    >>> intern("S NP VP") is intern(' '.join(('S', 'NP', 'VP')))
    True

Otherwise, you'll have to "copy" the tuples instead of constructing them afresh.

(If you're new to C++, then rewriting such an algorithm in it is unlikely to provide much of a memory benefit. You'd have to evaluate various hash table implementations first and learn about the copying behavior in their containers. I've found boost::unordered_map to be quite wasteful with lots of small hashtables.)
That has the unfortunate drawback of requiring splits in lots of places. I wish there were an equivalent of intern for tuples. I already have interned nonterminals in my current code.
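(For illustration: an equivalent of intern for tuples can be approximated with a global canonicalisation dict that always hands back the first instance of an equal tuple. This is only a sketch of such a helper, not code from rcgrules.py:)

    _tuple_cache = {}

    def intern_tuple(tup):
        """Return a canonical instance of tup, so equal tuples share one object."""
        return _tuple_cache.setdefault(tup, tup)

    a = intern_tuple(("S", "NP", "VP"))
    b = intern_tuple(("S", "NP", "VP"))
    print(a is b)   # True: both names refer to the same stored tuple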
The memory usage of rcgrules.py alone was 67M, 78M and 100M for the different values of n respectively, so it is not the culprit. – Andreas Mar 21 at 16:42

@Andreas: since this is a variant of CKY, do you need an explicit agenda? (I couldn't run your code, btw, because dopg.py was missing. The version on your website doesn't work with disco-dop.) – larsmans Mar 21 at 16:51

dopg.py is in the repository eodop, also on my github. In other parsers you can walk through the sentence from left to right, but with this formalism, which supports discontinuities and word-order variations, that is not possible, so I think the agenda is needed. Either way, I implemented this parser straightforwardly from a publication, although the authors also discuss heuristics which do help. However, I'm convinced I'm having a problem with overhead due to something in Python, I just don't know what. I made a function for interning tuples by storing them in a global dict, but it didn't help. – Andreas Mar 21 at 17:34
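(For context, the agenda-driven deduction discussed in these comments typically amounts to a best-first loop over a priority queue of chart items. The following is a generic sketch, not the code from the repository; `deduce` stands in for the grammar-specific inference rules:)

    import heapq
    from itertools import count

    def agenda_parse(initial_items, deduce):
        """Generic weighted-deduction loop: pop the best item from the agenda,
        add it to the chart, and push any newly deduced items.
        `deduce(item, chart)` is an assumed callback yielding (new_item, weight) pairs."""
        chart = {}
        tiebreak = count()   # keeps heap entries comparable when weights tie
        agenda = [(w, next(tiebreak), item) for item, w in initial_items.items()]
        heapq.heapify(agenda)
        while agenda:
            weight, _, item = heapq.heappop(agenda)
            if item in chart:            # already derived with a better weight
                continue
            chart[item] = weight
            for new_item, new_weight in deduce(item, chart):
                if new_item not in chart:
                    heapq.heappush(agenda, (new_weight, next(tiebreak), new_item))
        return chart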
PyPy is a lot smarter than CPython about noticing commonalities and avoiding the memory overhead associated with duplicating things unnecessarily. It's worth trying, anyway: pypy.org.
Yeah, I did, and if I remember correctly it consumed about twice as much memory... Either way, I discovered some new optimizations and memory is no longer a concern, but speed is. I'm trying some more things with Cython. – Andreas Mar 27 at 19:09

PyPy is proven faster than CPython for tasks like this; if you have the memory, then it's about time to look into memoisation.
– Jakob Bowyer Mar 29 at 14:57.
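(To illustrate the memoisation suggestion: cache the results of expensive, side-effect-free computations in a dict keyed by their arguments, so repeated queries are answered from memory. The function names below are made up for the example, not taken from the parser:)

    _cache = {}

    def expensive_score(rule, span):
        """Stand-in for a costly, pure computation."""
        return sum(span) * len(rule)

    def memoised_score(rule, span):
        # hashable arguments double as the cache key
        key = (rule, span)
        if key not in _cache:
            _cache[key] = expensive_score(rule, span)
        return _cache[key]

    print(memoised_score(("S", "NP", "VP"), (0, 5)))  # computed once
    print(memoised_score(("S", "NP", "VP"), (0, 5)))  # returned from the cache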
The first thing to do in these cases is always to profile:

    15147/297    0.032    0.000    0.041    0.000 tree.py:102(__eq__)
    15400/200    0.031    0.000    0.106    0.001 tree.py:399(convert)
            1    0.023    0.023    0.129    0.129 plcfrs_cython.pyx:52(parse)
    6701/1143    0.022    0.000    0.043    0.000 heapdict.py:45(_min_heapify)
        18212    0.017    0.000    0.023    0.000 plcfrs_cython.pyx:38(__richcmp__)
    10975/10875  0.017    0.000    0.035    0.000 tree.py:75(__init__)
         5772    0.016    0.000    0.050    0.000 tree.py:665(__init__)
          960    0.016    0.000    0.025    0.000 plcfrs_cython.pyx:118(deduced_from)
        46938    0.014    0.000    0.014    0.000 tree.py:708(_get_node)
    25220/2190   0.014    0.000    0.016    0.000 tree.py:231(subtrees)
        10975    0.013    0.000    0.023    0.000 tree.py:60(__new__)
        49441    0.013    0.000    0.013    0.000 {isinstance}
        16748    0.008    0.000    0.015    0.000 {hasattr}

The first thing I noticed is that very few functions are from the Cython module itself. Most of them come from the tree.py module, and maybe that is the bottleneck.
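(A table like the one above is what cProfile prints when sorted by internal time. One way to collect it, as a sketch; `parse()` here is only a stand-in for the real entry point being measured:)

    import cProfile
    import pstats

    def parse():
        """Stand-in for the real parsing call being profiled."""
        return sum(i * i for i in range(100000))

    cProfile.run('parse()', 'parse.prof')                      # write raw timings to a file
    pstats.Stats('parse.prof').sort_stats('time').print_stats(15)  # top 15 by internal time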
Focusing on the Cython side, I see the __richcmp__ function: we can optimize it simply by adding the types of the values in the method declaration

    def __richcmp__(ChartItem self, ChartItem other, int op):
        ....

This brings the value down:

    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    ....
    18212    0.011    0.000    0.015    0.000 plcfrs_cython.pyx:38(__richcmp__)

Adding the elif syntax instead of single if statements will enable Cython's switch optimization:

    if   op == 0: return self.label < other.label or self.vec < other.vec
    elif op == 1: return self.label <= other.label or self.vec <= other.vec
    elif op == 2: return self.label == other.label and self.vec == other.vec
    elif op == 3: return self.label != other.label or self.vec != other.vec
    elif op == 4: return self.label > other.label or self.vec > other.vec
    elif op == 5: return self.label >= other.label or self.vec >= other.vec

obtaining:

    17963    0.002    0.000    0.002    0.000 plcfrs_cython.pyx:38(__richcmp__)

Trying to figure out where that tree.py:399(convert) comes from, I found out that this function inside dopg.py takes all that time:

    def removeids(tree):
        """ remove unique IDs introduced by the Goodman reduction """
        result = Tree.convert(tree)
        for a in result.subtrees(lambda t: '@' in t.node):
            a.node = a.node.rsplit('@', 1)[0]
        if isinstance(tree, ImmutableTree): return result.freeze()
        return result

Now I am not sure whether each node in the tree is a ChartItem and whether the __getitem__ value is being used somewhere else, but adding this changes things:

    cdef class ChartItem:
        cdef public str label
        cdef public str root
        cdef public long vec
        cdef int _hash
        __slots__ = ("label", "vec", "_hash")
        def __init__(ChartItem self, label, int vec):
            self.label = intern(label) #.rsplit('@', 1)[0])
            self.root = intern(label.rsplit('@', 1)[0])
            self.vec = vec
            self._hash = hash((self.label, self.vec))
        def __hash__(self):
            return self._hash
        def __richcmp__(ChartItem self, ChartItem other, int op):
            if   op == 0: return self.label < other.label or self.vec < other.vec
            elif op == 1: return self.label <= other.label or self.vec <= other.vec
            elif op == 2: return self.label == other.label and self.vec == other.vec
            elif op == 3: return self.label != other.label or self.vec != other.vec
            elif op == 4: return self.label > other.label or self.vec > other.vec
            elif op == 5: return self.label >= other.label or self.vec >= other.vec
        def __getitem__(ChartItem self, int n):
            if n == 0: return self.root
            elif n == 1: return self.vec
        def __repr__(self):
            # would need bitlen for proper padding
            return "%s%s" % (self.label, bin(self.vec)[2:][::-1])

and inside mostprobableparse:

    from libc.math cimport pow

    def mostprobableparse...
        ...
        cdef dict parsetrees = defaultdict(float)
        cdef float prob
        m = 0
        for n, (a, prob) in enumerate(derivations):
            parsetrees[a] += pow(e, prob)
            m += 1

I get:

    189345 function calls (173785 primitive calls) in 0.162 seconds

    Ordered by: internal time

    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    6701/1143    0.025    0.000    0.037    0.000 heapdict.py:45(_min_heapify)
            1    0.023    0.023    0.120    0.120 plcfrs_cython.pyx:54(parse)
          960    0.018    0.000    0.030    0.000 plcfrs_cython.pyx:122(deduced_from)
     5190/198    0.011    0.000    0.015    0.000 tree.py:102(__eq__)
         6619    0.006    0.000    0.006    0.000 heapdict.py:67(_swap)
         9678    0.006    0.000    0.008    0.000 plcfrs_cython.pyx:137(concat)

so the next steps are to optimize heapify and deduced_from.

deduced_from can be optimized a bit more:

    cdef inline deduced_from(ChartItem Ih, double x, pyCx, pyunary, pylbinary, pyrbinary, int bitlen):
        cdef str I = Ih.label
        cdef int Ir = Ih.vec
        cdef list result = []
        cdef dict Cx = pyCx
        cdef dict unary = pyunary
        cdef dict lbinary = pylbinary
        cdef dict rbinary = pyrbinary
        cdef ChartItem Ilh
        cdef double z
        cdef double y
        cdef ChartItem I1h
        for rule, z in unary[I]:
            result.append((ChartItem(rule[0][0], Ir), ((x+z, z), (Ih,))))
        for rule, z in lbinary[I]:
            for I1h, y in Cx[rule[0][2]].items():
                if concat(rule[1], Ir, I1h.vec, bitlen):
                    result.append((ChartItem(rule[0][0], Ir ^ I1h.vec), ((x+y+z, z), (Ih, I1h))))
        for rule, z in rbinary[I]:
            for I1h, y in Cx[rule[0][1]].items():
                if concat(rule[1], I1h.vec, Ir, bitlen):
                    result.append((ChartItem(rule[0][0], I1h.vec ^ Ir), ((x+y+z, z), (I1h, Ih))))
        return result

I will stop here, although I am confident that we can keep optimizing as more insight is acquired on the problem. A series of unit tests would be useful to assert that each optimization doesn't introduce any subtle error.
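(A minimal sketch of such tests, assuming the ChartItem class above is compiled into an importable module named plcfrs_cython; the module name and the example labels are placeholders:)

    import unittest
    # assumes the Cython code above has been built, e.g. with `python setup.py build_ext --inplace`
    from plcfrs_cython import ChartItem

    class TestChartItem(unittest.TestCase):
        def test_equality_and_hash(self):
            a = ChartItem("NP@7", 5)
            b = ChartItem("NP@7", 5)
            c = ChartItem("VP@2", 5)
            self.assertEqual(a, b)              # same label and bit vector
            self.assertEqual(hash(a), hash(b))  # equal items must hash alike
            self.assertNotEqual(a, c)

        def test_root_and_getitem(self):
            a = ChartItem("NP@7", 5)
            self.assertEqual(a[0], "NP")        # __getitem__(0): label without the Goodman ID
            self.assertEqual(a[1], 5)           # __getitem__(1): the bit vector

    if __name__ == '__main__':
        unittest.main()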
As a side note, try to use spaces instead of tabs.
I had wanted to give you the bounty, but now it has already been assigned automatically :( Anyway, your suggestions improved the runtime from 7:59 to 7:44 with a grammar of 3600 sentences; a little less than I had expected. I plan to do the bit operations using inline assembly, and I should replace all of the tuples with cdef classes. Thanks for the suggestions.
– Andreas Mar 30 at 21:45.