God of Source - 24 days ago
C# Question

Reason for 11 overloads of string.Concat

I just noticed that there are 11 overloads of the method

string.Concat()


public static string Concat(IEnumerable<string> values);
public static string Concat<T>(IEnumerable<T> values);
public static string Concat(object arg0);
public static string Concat(params object[] args);
public static string Concat(params string[] values);
public static string Concat(object arg0, object arg1);
public static string Concat(string str0, string str1);
public static string Concat(object arg0, object arg1, object arg2);
public static string Concat(string str0, string str1, string str2);
public static string Concat(object arg0, object arg1, object arg2, object arg3);
public static string Concat(string str0, string str1, string str2, string str3);


What is the reason for that? These two

public static string Concat(params object[] args);
public static string Concat<T>(IEnumerable<T> values);


should be the only ones needed, because they are equally convenient and powerful. MSDN doesn't give an answer, and if the nine "duplicate" overloads were removed from the framework, no one would notice.
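(For reference, the overload list can be enumerated with reflection. The exact count depends on the framework version -- later versions of .NET add more, such as ReadOnlySpan<char> overloads.)

```csharp
using System;
using System.Linq;

// Enumerate every public static method on string named "Concat".
var overloads = typeof(string).GetMethods()
    .Where(m => m.Name == "Concat")
    .ToList();

Console.WriteLine(overloads.Count); // 11 on older frameworks; more on modern .NET
foreach (var m in overloads)
    Console.WriteLine(m);
```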

Answer

The primary motivator of this implementation decision is performance.

As you correctly note, there could be only two:

public static string Concat(params object[] args);
public static string Concat<T>(IEnumerable<T> values);

And if C# implemented the "params enumerable" feature -- whereby a variadic method could have an IEnumerable<T> instead of a T[] as the expanded parameter -- then we could get down to only one. Or, we could lose the enumerable overload and just use the object array version.
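Absent that hypothetical "params enumerable" feature, the two surviving overloads would be called quite differently today; a bare argument list can only expand into an array, never into an IEnumerable<T>:

```csharp
using System;
using System.Collections.Generic;

// A comma-separated argument list binds to a fixed-arity or params
// overload; it cannot expand into an IEnumerable<T>.
string viaArgList = string.Concat("a", "b", "c");

// The IEnumerable<T> overload requires an actual collection object.
IEnumerable<string> seq = new[] { "a", "b", "c" };
string viaEnumerable = string.Concat(seq);

Console.WriteLine(viaArgList == viaEnumerable); // True
```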

Suppose we did the latter. You say

string x = Foo();
string y = Bar();
string z = x + y;

and what happens? In a world with only the variadic Concat, this could only be codegen'd as

string x = Foo();
string y = Bar();
object[] array = new object[2];
array[0] = x;
array[1] = y;
string z = string.Concat(array);

So: let's review. Presumably the calls to Foo and Bar each allocate a string. Then we allocate a short-lived array, copy references over to it, pass that to the variadic method, and so on. That method needs to be written to handle an array of any size, to handle a null array, and so on.

We've not only added new short-lived garbage to the gen zero heap; we've also created two new edges in the liveness analysis graph that might have to be traversed. We may have either decreased the amount of time between collections by adding pressure, or increased the cost of collections by adding edges, or, most likely, collections become both more frequent and more expensive: a double-whammy.

But wait, there's more. We must consider what the implementation of the called Concat method looks like.

The object array is -- surprise -- an array of objects, not an array of strings. So what do we need to do? The callee needs to convert each to a string. By calling ToString on each? No, that might crash. By checking for null first, and then calling ToString.

We've passed in strings, but the callee doesn't know that. ToString is an identity for strings, but the compiler doesn't know that, and the call is virtualized so the jitter can't easily optimize it away either. So we've taken on another unnecessary few nanoseconds of checks and indirections. Not to mention that we needed to check the array for null, get the length of the array, make a loop to loop over each element of the array, and so on.
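A minimal sketch of what such a variadic callee has to do (ConcatObjects is a hypothetical name, and StringBuilder is used here for brevity; the actual BCL implementation computes lengths and copies characters directly):

```csharp
using System;
using System.Text;

static string ConcatObjects(params object[] args)
{
    // The callee cannot assume anything about the array it receives.
    if (args == null)
        throw new ArgumentNullException(nameof(args));

    var sb = new StringBuilder();
    for (int i = 0; i < args.Length; i++)
    {
        // Each element must be null-checked before the virtual
        // ToString call -- even if the caller only passed strings.
        object arg = args[i];
        if (arg != null)
            sb.Append(arg.ToString());
    }
    return sb.ToString();
}

Console.WriteLine(ConcatObjects("a", null, 1, "b")); // a1b
```

Note the per-element work: a null check, a virtual ToString dispatch, and a loop bound check, none of which is necessary when the operands are statically known to be two strings.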

These expenses are very small but they are per concatenation, and they could add up to real time spent and memory pressure.

A great many programs have their performance gated upon string manipulations and memory pressure. How can we eliminate or mitigate these costs?

We could observe that most string concatenations are two strings, so it makes sense to create an overload specifically to handle that situation:

public static string Concat(string str0, string str1);

Now we can codegen the fragment above as:

string x = Foo();
string y = Bar();
string z = string.Concat(x, y);

Now there is no array created, so there is no extra garbage created, no collection pressure, no new edges in the reference graph. In the callee, the strings need to be checked for nullity but we don't need to call ToString in the implementation because we have the type system to enforce that the operands are already strings, we don't need to check an array for nullity, we don't need a loop variable being checked against the array length, and so on.
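A sketch of that specialized callee (ConcatTwo is a hypothetical name; the real method allocates the result string directly rather than going through an intermediate buffer):

```csharp
using System;

static string ConcatTwo(string str0, string str1)
{
    // Null or empty operands short-circuit; no ToString calls,
    // no argument array, no loop over elements.
    if (string.IsNullOrEmpty(str0))
        return str1 ?? string.Empty;
    if (string.IsNullOrEmpty(str1))
        return str0;

    // Allocate the result once and copy both operands into it.
    char[] buffer = new char[str0.Length + str1.Length];
    str0.CopyTo(0, buffer, 0, str0.Length);
    str1.CopyTo(0, buffer, str0.Length, str1.Length);
    return new string(buffer);
}

Console.WriteLine(ConcatTwo("foo", "bar")); // foobar
```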

Thus we have a good justification for having two overloads: one that takes a params array, and one that takes exactly two strings.

And now we repeat that analysis for another scenario that is common and could be more performant. Each additional overload is designed to produce a more efficient alternative for a common scenario. As more common scenarios are identified that can be made faster and less resource intensive, there is incentive to produce more overloads, and to fix compilers so that they generate code that takes advantage of these overloads. The net result is about a dozen seemingly redundant overloads, each of which can be tuned for high performance; these cover the cases most commonly seen in real programs.
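You can see the compiler exploiting these overloads: a chain of + over three strings compiles to a single call to the three-string overload, not to nested two-string calls and not to an object[] allocation.

```csharp
using System;

string a = "x", b = "y", c = "z";

// The C# compiler turns this whole chain into one call:
//     string.Concat(a, b, c)
// i.e. the three-string overload shown in the list above.
string viaOperator = a + b + c;

// The equivalent explicit call:
string viaConcat = string.Concat(a, b, c);

Console.WriteLine(viaOperator == viaConcat); // True
```

(You can confirm the generated call by inspecting the IL with a disassembler such as ildasm.)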

If this subject interests you, I've written a short series of articles on how I redesigned the string concatenation optimizer back in 2006.

https://ericlippert.com/2013/06/17/string-concatenation-behind-the-scenes-part-one/
