Edit distance algorithm in Haskell - performance tuning?

Ignoring that this is a bad algorithm (should be memoizing, I get to that second).

Ignoring that this is a bad algorithm (should be memoizing, I get to that second)... Use O(1) Primitives and not O(n) One problem is you use a whole bunch calls that are O(n) for lists (haskell lists are singly linked lists). A better data structure would give you O(1) operations, I used Vector: import qualified Data. Vector as V -- standard levenshtein distance between two lists editDistance :: Eq a => a -> a -> Int editDistance s1 s2 = editDistance' 1 1 1 (V.

FromList s1) (V. FromList s2) -- weighted levenshtein distance -- ins, sub and del are the costs for the various operations editDistance' :: Eq a => Int -> Int -> Int -> V. Vector a -> V.

Vector a -> Int editDistance' del sub ins s1 s2 | V. Null s2 = ins * V. Length s1 | V.

Null s1 = ins * V. Length s2 | V. Last s1 == V.

Last s2 = editDistance' del sub ins (V. Init s1) (V. Init s2) | otherwise = minimum editDistance' del sub ins s1 (V.

Init s2) + del -- deletion , editDistance' del sub ins (V. Init s1) (V. Init s2) + sub -- substitution , editDistance' del sub ins (V.

Init s1) s2 + ins -- insertion The operations that are O(n) for lists include init, length, and last (though init is able to be lazy at least). All these operations are O(1) using Vector. While real benchmarking should use Criterion, a quick and dirty benchmark: str2 = replicate 15 'a' ++ replicate 25 'b' str1 = replicate 20 'a' ++ replicate 20 'b' main = print $ editDistance str1 str2 shows the vector version takes 0.09 seconds while strings take 1.6 seconds, so we saved about an order of magnitude without even looking at your editDistance algorithm.

Now what about memoizing results? The bigger issue is obviously the need for memoization. I took this as an opportunity to learn the monad-memo package - my god is that awesome!

For one extra constraint (you need Ord a), you get a memoization basically for no effort. The code: import qualified Data. Vector as V import Control.Monad.

Memo -- standard levenshtein distance between two lists editDistance :: (Eq a, Ord a) => a -> a -> Int editDistance s1 s2 = startEvalMemo $ editDistance' (1, 1, 1, (V. FromList s1), (V. FromList s2)) -- weighted levenshtein distance -- ins, sub and del are the costs for the various operations editDistance' :: (MonadMemo (Int, Int, Int, V.

Vector a, V. Vector a) Int m, Eq a) => (Int, Int, Int, V. Vector a, V.

Vector a) -> m Int editDistance' (del, sub, ins, s1, s2) | V. Null s2 = return $ ins * V. Length s1 | V.

Null s1 = return $ ins * V. Length s2 | V. Last s1 == V.

Last s2 = memo editDistance' (del, sub, ins, (V. Init s1), (V. Init s2)) | otherwise = do r1 for "calls" and eD for editDistance'): eD s1 s2 --> eD s1 (init s2) -- The parent , eD (init s1) s2 , eD (init s1) (init s2) eD (init s1) s2 --> eD (init s1) (init s2) -- The first "child" , eD (init (init s1)) s2 , eD (init (init s1)) (init s2) eD s1 (init s2) --> eD s1 (init (init s2)) , eD (init s1) (init s2) , eD (init s1) (init (init s2)) Just by considering the parent and two immediate children we can see the call ed (init s1) (init s2) is done three times.

The other child share calls with the parent too and all children share many calls with each other (and their children, cue Monty Python skit). It would be a fun, perhaps instructive, exercise to make a runMemo like function that returns the number of cached results used.

Wow, that's great. I've heard about momoization before, but I'd never guessed that it is that easy! When you say "Ignoring that this is a bad algorithm (should be memoizing, I get to that second)...", do you mean the overall structure of the algorithm or only the fact that it should use memoizing?

To me, the algorithm itself looks quite nice. :) – bzn Apr 1 '11 at 21:14 bzn: I just ment that fact that it wasn't memoizing. If you haven't seen memoization before then see either the Haskell wiki, a CS algorithm book, or both.

Without memoization you compute most values many times but with memoization you compute them only once and lookup previously-computed results otherwise. For example, to compute the first element of your list (editDist s1 (init s2)) the function will eventually compute editDist (init s1) (init s2). That's the 2nd element in the callers list and the third element in the callees list!

– Thomas M. DuBuisson Apr 1 '11 at 22:09 @bzn I added an edit talking about why the issue is "obviously" memoization. – Thomas M.

DuBuisson Apr 1 '11 at 22:39 Thanks, your answer is getting better and better! Coming from C++ land, memoization somehow never crossed my way, but I guess that's due to the completely different ways algorithms are written in imperative languages (just look at the imperative levenshtein implementation). – bzn Apr 2 '11 at 6:32 @bzn It's been fun playing with monad-memo.

Just a note, the algorithm you linked is memoizing too - most memoization is done using a mutable array in exactly that manner. The monad-memo implementation just uses Data. Map instead of an array and hides all the uglyness for you.(Also note altering monad-memo to use unordered-containers gives up to 8x performance increase over what we already did in the answer).

– Thomas M. DuBuisson Apr 2 '11 at 7:48.

You need to memoize editDistance'. There are many ways of doing this, e.g. , a recursively defined array.

– Thomas M. DuBuisson Apr 1 '11 at 16:06 I wish I could claim any responsibility for the unicorn. – augustss Apr 1 '11 at 18:41 @TomMD it's an SO april fools gift.

– sclv Apr 1 '11 at 20:26 Thanks for the answer! I think I'll use the Memo-Monad TomMD proposed. – bzn Apr 1 '11 at 21:17.

As already mentioned, memoization is what you need. In addition you are looking at edit distance from right to left, wich isn't very efficient with strings, and edit distance is the same regardless of direction. That is: editDistance (reverse a) (reverse b) == editDistance a be For solving the memoization part there are very many libraries that can help you.In my example below I chose MemoTrie since it is quite easy to use and performs well here.

Import Data. MemoTrie(memo2) editDistance' del sub ins = memf where memf = memo2 f f s1 = ins * length s1 f s2 = ins * length s2 f (x:xs) (y:ys) | x == y = memf xs ys | otherwise = minimum del + memf xs (y:ys), sub + memf (x:xs) ys, ins + memf xs ys As you can see all you need is to add the memoization. The rest is the same except that we start from the beginning of the list in staid of the end.

Thanks for the hint. – bzn Apr 7 '11 at 19:47 +1 for MemoTrie. This one is awesome!

– Rotsor Jul 16 '11 at 22:25 But why did you use f instead of memf in the first equation for f (x:xs) (y:ys)? – Rotsor Jul 16 '11 at 22:28 @Rotsor, it was a mistake (although it could add no more then a O(min(length xs,length ys) factor to the running time). Changed it so that memf is used in all the recursive cases.

– HaskellElephant Jul 17 '11 at 8:59.

I'd suggest you try simply devising a mapping from token to character. There are 10,646 characters at your disposal, after all.

Thanks, but for now I'm going to use my solution, since tuning it like TomMD proposed, should make it fast enough - and that's all I need after all. :P – bzn Apr 1 '11 at 21:20.

I cant really gove you an answer,but what I can give you is a way to a solution, that is you have to find the anglde that you relate to or peaks your interest. A good paper is one that people get drawn into because it reaches them ln some way.As for me WW11 to me, I think of the holocaust and the effect it had on the survivors, their families and those who stood by and did nothing until it was too late.

Related Questions