Avoiding accidental removal of duplicates when mapping a Set?

You might wish to use the scalaz foldMap for this purpose, as it works on anything for which there is a Foldable typeclass available. The usage in your case will look like this.

Up vote 22 down vote favorite 5 share g+ share fb share tw.

I really like functional programming concepts, but I've been bitten on two separate occasions now by the same gotcha, when mapping across a collection which happens to be a Set (i.e. Automatically removes duplicates). The issue is that after transforming the elements of such a set, the output container is also a set, and so removes any duplicates of the tranformed output.

A very brief REPL session to illustrate the issue: scala> case class Person(name: String, age: Int) defined class Person scala> val students = Set(Person("Alice", 18), Person("Bob", 18), Person("Charles", 19)) students: scala.collection.immutable. SetPerson = Set(Person(Alice,18), Person(Bob,18), Person(Charles,19)) scala> val totalAge = (students map (_. Age)).

Sum totalAge: Int = 37 I would of course expect the total age to be 18 + 18 + 19 = 55, but because the students were stored in a Set, so were their ages after the mapping, hence one of the 18s disappeared before the ages were summed. In real code this is often more insidious and harder to spot, especially if you write utility code which simply takes a Traversable and/or use the output of methods which are declared to return a Traversable (the implementation of which happens to be a Set). It seems to me that these situations are almost impossible to spot reliably, until/unless they manifest as a bug.

So, are there any best practices which will reduce my exposure to this issue? Am I wrong to think about map-ping over a general Traversable as conceptually transforming each element in place, as opposed to adding the transformed elements in turn into some new collection? Should I call .

ToStream on everything before mapping, if I want to keep this mental model? Any tips/recommendations would be greatly appreciated. Update: Most of the answers so far have focused on the mechanics of including the duplicates in the sum.

I'm more interested in the practices involved when writing code in the general case - have you drilled yourself to always call toList on every collection before calling map? Do you fastidiously check the concrete classes of all the collections in your app before calling methods on them? Etc.

Fixing up something that's already been identified as a problem is trivial - the hard part is preventing these errors from creeping in in the first place. Scala collections functional-programming link|improve this question edited Apr 3 at 15:20 asked Apr 3 at 11:27Andrzej Doyle30.8k34686 93% accept rate.

1, nice example. Leaving as a Set will obviously bring surprising results, so you "must" do something, creating a new collection, or other means to ensure that what-you-see-is-what-you-will-get – virtualeyes Apr 3 at 11:36 2 Exact duplicate of: stackoverflow.com/questions/7040806/…. Though the title is chosen better here.

– ziggystar Apr 3 at 12:18 Thanks for the pointer, @ziggy. I did try to search for existing questions first, but didn't come up with anything (it's difficult to arrive at a definitive description). – Andrzej Doyle Apr 3 at 14:45.

You might wish to use the scalaz foldMap for this purpose, as it works on anything for which there is a Foldable typeclass available. The usage in your case will look like this: persons foldMap (_. Age) The signature of foldMap is as follows: trait MAM_, A { val value: MA def foldMapB(f: A => B)(implicit f: FoldableM, m: MonoidB) } So; as long as you have some collection CCA where CC can be folded over (i.e.

Traversed), a function from A => B where B is a monoid, you can accumulate a result.

I like this - Dave Griffith's answer was on the right tracks but is still quite specific. This covers every situation where a collection is mapped and then combined into a single result (via Monoid), which in practice is where I've seen this problem. – Andrzej Doyle Apr 3 at 15:19 (0 /: students) { case (sum, s) => sum + s.

Age } – Viktor Klang Apr 3 at 15:28 1 (0 /: students)(_ + _. Age) is even shorter – oxbow_lakes Apr 3 at 15:55 1 I think the issue with the standard fold is that it's less readable when used in place (e.g. As an argument to a method). It also requires a break in the flow of thinking (for me).

That is, you think I want to perform an operations on my collection, so you want cc.op. I suppose cc. Foldl(0)(_ + _.

Age) would do that – oxbow_lakes Apr 3 at 15:57 2 Yup, you _ +_. Age is shorter, but I think there's a balance to be had between terseness and readability. – Viktor Klang Apr 3 at 16:28.

As not to drag extra dependencies into the picture: (0 /: students) { case (sum, s) => sum + s. Age }.

– Andrzej Doyle Apr 3 at 15:45 Your example doesn't need map at all, using map also introduces the creating of an intermediate data structure, which in your example is unnecessary. So in this example you solve the problem by more carefully and efficiently expressing your intent. – Viktor Klang Apr 3 at 15:52 That's a good point; most if not all of these problems arise when I immediately transform the mapped collection into something else (e.g. The .

Sum here). In which case the intermediate collection doesn't need to exist and so you might be right that avoiding . Map altogether is a practical solution.

– Andrzej Doyle Apr 3 at 16:03.

You can breakOut the collection type scala> import collection. BreakOut import collection. BreakOut scala> val ages = students.

Map(_. Age)(breakOut): ListInt ages: ListInt = List(18, 18, 19) Then you can sum as expected Based on the update to the question, the best practice to prevent these types of errors is good unit test coverage with representative data, together with sensible API's coupled with knowledge of how the scala compiler maintains source types through map/for generators etc. If you're returning a Set of something you should make that obvious, as returning a Collection/Traversable is hiding a relevant implementation detail.

2 You can just say toList - I think the point is that the OP keeps forgetting about it – oxbow_lakes Apr 3 at 15:00.

You might like to use the toIterable or toList methods to convert the set to another datastructure first. scala-lang.org/api/current/scala/collect... (Note that toIterable may return any Iterable, although the reference implementation will not, according to the linked documentation. @Debilski informs me in the comments that it nevertheless returns a Set.

).

ToIterable may still yield a Set. – Debilski Apr 3 at 12:22 @Debilski The linked documentation suggests that the reference implementation will not do that. I will also slightly update the answer.

– Marcin Apr 3 at 12:25 1 (students. ToIterable map (_. Age)) == Set(18, 19) on Scala 2.9.1-1.

Also, I don’t see where it says otherwise in the documentation. – Debilski Apr 3 at 12:39 Why creating an intermediate collection? Set.

Iterator is enough when used with sum. – Antoras Apr 3 at 13:23 @Antoras the question asks about mapping. – Marcin Apr 3 at 13:46.

If you find yourself repeatedly hitting the same error, your first problem isn't the error, but rather that you're repeating yourself. Map(). Sum is a common enough use case (particularly in data analysis contexts) to merit it's own method on Traversable.

From my personal, never-go-anywhere-without-it Traversable pimp class. Implicit def traversable2RichTraversableA(t: TraversableA) = new { ///many many methods deleted def sumOfC: Numeric(g: A => C): C = t.view.toList. Map(g).

Sum ///many many more methods deleted } (That . View may not be necessary, but can't hurt. ).

2 sumBy seems like a name that's more in line with the std lib. – ziggystar Apr 3 at 13:10 1 +1, I would remove the . View unless there is actually an example where it helps.

– huynhjl Apr 3 at 13:46 2 could you share your personal, never-go-anywhere-without-it Traversable pimp class? – Adam Rabung Apr 3 at 14:36 I use "sumBy" for my pimped traversable method that's equivalent (but more efficient than) traversable.toList. GroupBy(f).

MapValues(g). Sum . This splits a traversable into buckets based on some function f, then sums some function g over the values in each bucket.

Again, very common for data analysis (more so than sumOf, actually). – Dave Griffith Apr 3 at 15:27 It will take some work to get that class into shape for public viewing (and get work permission to make it public), but I'll see what I can do. As an exercise, based on what you've seen so far, ponder what goodies it might contain (and then write a few of them yourself).

– Dave Griffith Apr 3 at 15:31.

A clumsy but possibly faster way of transforming it (as compared to an explicit toList/toSeq) would be to use collection. BreakOut (more information) with a type ascription (students map (_. Age))(collection.

BreakOut) : SeqInt.

I cant really gove you an answer,but what I can give you is a way to a solution, that is you have to find the anglde that you relate to or peaks your interest. A good paper is one that people get drawn into because it reaches them ln some way.As for me WW11 to me, I think of the holocaust and the effect it had on the survivors, their families and those who stood by and did nothing until it was too late.

Related Questions