christangrant christangrant - 2 months ago 8
Scala Question

Example of the Scala aggregate function

I have been looking and I cannot find an example or discussion of the

function in Scala that I can understand. It seems pretty powerful.

Can this function be used to reduce the values of tuples to make a multimap-type collection? For example:

val list = Seq(("one", "i"), ("two", "2"), ("two", "ii"), ("one", "1"), ("four", "iv"))

After applying aggregate:

Seq(("one" -> Seq("i","1")), ("two" -> Seq("2", "ii")), ("four" -> Seq("iv"))

Also, can you give example of parameters
, and
? I'm unclear on what these parameters do.


The aggregate function does not do that (except that it is a very general function, and it could be used to do that). You want groupBy. Close to at least. As you start with a Seq[(String, String)], and you group by taking the first item in the tuple (which is (String, String) => String), it would return a Map[String, Seq[(String, String)]). You then have to discard the first parameter in the Seq[String, String)] values.



There you get a Map[String, Seq[(String, String)]. If you want a Seq instead of Map, call toSeq on the result. I don't think you have a guarantee on the order in the resulting Seq though

Aggregate is a more difficult function.

Consider first reduceLeft and reduceRight. Let as be a non empty sequence as = Seq(a1, ... an) of elements of type A, and f: (A,A) => A be some way to combine two elements of type A into one. I will note it as a binary operator @, a1 @ a2 rather than f(a1, a2). as.reduceLeft(@) will compute (((a1 @ a2) @ a3)... @ an). reduceRight will put the parentheses the other way, (a1 @ (a2 @... @ an)))). If @ happens to be associative, one does not care about the parentheses. One could compute it as (a1 @... @ ap) @ (ap+1 @...@an) (there would be parantheses inside the 2 big parantheses too, but let's not care about that). Then one could do the two parts in parallel, while the nested bracketing in reduceLeft or reduceRight force a fully sequential computation. But parallel computation is only possible when @ is known to be associative, and the reduceLeft method cannot know that.

Still, there could be method reduce, whose caller would be responsible for ensuring that the operation is associative. Then reduce would order the calls as it sees fit, possibly doing them in parallel. Indeed, there is such a method.

There is a limitation with the various reduce methods however. The elements of the Seq can only be combined to a result of the same type: @ has to be (A,A) => A. But one could have the more general problem of combining them into a B. One starts with a value of type b of type B, and combine it with every elements of the sequence. The operator @ is (B,A) => B, and one computes (((b @ a1) @ a2) ... @ an). foldLeft does that. foldRight does the same thing but starting with an. There, the @ operation has no chance to be associative. When one writes b @ a1 @ a2, it must mean (b @ a1) @ a2, as (a1 @ a2) would be ill-typed. So foldLeft and foldRight have to be sequential.

Suppose however, that each A can be turned into a B, let's write it with !, a! is of type B. Suppose moreover that there is a + operation (B,B) => B, and that @ is such that b @ a is in fact b + a!. Rather than combining elements with @, one could first transform all of them to B with !, then combine them with +. That would be!).reduceLeft(+). And if + is associative, then that can be done with reduce, and not be sequential:!).reduce(+). There could be an hypothetical method as.associativeFold(b, !, +).

Aggregate is very close to that. It may be however, that there is a more efficient way to implement b@a than b+a! For instance, if type B is List[A], and b@a is a::b, then a! will be a::Nil, and b1 + b2 will be b2 ::: b1. a::b is way better than (a::Nil):::b. To benefit from associativity, but still use @, one first splits b + a1! + ... + an!, into (b + a1! + ap!) + (ap+1! + ..+ an!), then go back to using @ with (b @ a1 @ an) + (ap+1! @ @ an). One still needs the ! on ap+1, because one must start with some b. And the + is still necessary too, appearing between the parantheses. To do that, as.associativeFold(!, +) could be changed to as.optimizedAssociativeFold(b, !, @, +).

Back to +. + is associative, or equivalently, (B, +) is a semigroup. In practice, most of the semigroups used in programming happen to be monoids too, i.e they contain a neutral element z (for zero) in B, so that for each b, z + b = b + z = b. In that case, the ! operation that make sense is likely to be be a! = z @ a. Moreover, as z is a neutral element b @ a1 ..@ an = (b + z) @ a1 @ an which is b + (z + a1 @ an). So is is always possible to start the aggregation with z. If b is wanted instead, you do b + result at the end. With all those hypotheses, we can do as.aggregate(z, @, +). That is what aggregate does. @ is the seqop argument (applied in a sequence z @ a1 @ a2 @ ap), and + is combop (applied to already partially combined results, as in (z + a1@...@ap) + (z + ap+1@...@an)).

To sum it up, as.aggregate(z)(seqop, combop) computes the same thing as as.foldLeft(z)( seqop) provided that

  • (B, combop, z) is a monoid
  • seqop(b,a) = combop(b, seqop(z,a))

aggregate implementation may use the associativity of combop to group the computations as it likes (not swapping elements however, + has not to be commutative, ::: is not). It may run them in parallel.

Finally, solving the initial problem using aggregate is left as an exercise to the reader. A hint: implement using foldLeft, then find z and combo that will satisfy the conditions stated above.