Dylan Cristy - 8 days ago
C# Question

Faster way to get distinct values in LINQ?

I have a web part in SharePoint, and I am trying to populate a drop-down control with the unique/distinct values from a particular field in a list.

Unfortunately, due to the nature of the system, it is a text field, so there is no other definitive source for the values (if it were a choice field, I could just read them from the field definition). I am using the chosen value of the drop-down in a subsequent CAML query, so the values must match exactly what is present on the list items. Currently the list has approx. 4K items, and it will continue to grow slowly.

It's also part of a sandbox solution, so it is restricted by the user code service time limit, and it's timing out more often than not. In my dev environment I stepped through the code in the debugger, and the line of LINQ where I actually get the distinct values seems to be the most time-consuming. When I commented out the call to this method entirely, the timeouts stopped, so I am fairly certain this is where the problem is.

Here's my code:

private void AddUniqueValues(SPList list, SPField filterField, DropDownList dropDownControl)
{
    SPQuery query = new SPQuery();
    query.ViewFields = string.Format("<FieldRef Name='{0}' />", filterField.InternalName);
    query.ViewFieldsOnly = true;

    SPListItemCollection results = list.GetItems(query); // retrieves ~4K items

    // this takes too long with 4K items
    List<string> uniqueValues = results.Cast<SPListItem>()
        .Select(item => item[filterField.Id].ToString())
        .Distinct()
        .ToList();

    uniqueValues.Sort();

    dropDownControl.Items.AddRange(uniqueValues.Select(itm => new ListItem(itm)).ToArray());
}


As far as I am aware, there's no way to get "distinct" values directly in a CAML query, so how can I do this more quickly? Is there a way to restructure the LINQ to run faster?

Is there an easy/fast way to do this from the client side? (REST would be preferred, but I'd do JSOM if necessary).




Thought I'd add some extra information here since I did some further testing and found some interesting results.

First, to address the questions of whether the Cast() and Select() are needed: yes, they are.

SPListItemCollection is IEnumerable but not IEnumerable<T>, so we need to cast just to be able to use LINQ at all.

Then, after it's cast to IEnumerable<SPListItem>, SPListItem is a fairly complex object, and I am looking to find distinct values from just one property of that object. Using Distinct() directly on the IEnumerable<SPListItem> yields... all of them. So I have to Select() just the single values I want to compare.

So yes, the Cast() and Select() are absolutely necessary.
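
To illustrate the point without a SharePoint server, here is a minimal sketch in plain C#. The Row class and the CastSelectDemo helper are hypothetical stand-ins (not SharePoint types): Row plays the role of SPListItem, and an ArrayList plays the role of the non-generic SPListItemCollection.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for SPListItem: a reference type with one value of interest.
public sealed class Row
{
    public string Category { get; private set; }
    public Row(string category) { Category = category; }
}

public static class CastSelectDemo
{
    // Accepts a non-generic IEnumerable, which is all SPListItemCollection offers.
    public static List<string> DistinctCategories(IEnumerable rows)
    {
        return rows.Cast<Row>()              // IEnumerable -> IEnumerable<Row>
                   .Select(r => r.Category)  // project the single value to compare
                   .Distinct()
                   .ToList();
    }

    // Without Select(), Distinct() compares Row references, so nothing collapses.
    public static int DistinctRowCount(IEnumerable rows)
    {
        return rows.Cast<Row>().Distinct().Count();
    }
}
```

With three rows whose categories are A, B and A, DistinctCategories returns two values, while DistinctRowCount still reports three, because Row does not override Equals().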

As noted in the comments by M.kazem Akhgary, in my original line of code, calling ToString() every time (for 4K items) did add some time. But in testing some other variations:

// original
List<string> uniqueValues = results.Cast<SPListItem>().Select(item => item[filterField.Id].ToString()).Distinct().ToList();

// hash set alternative
HashSet<object> items = new HashSet<object>(results.Cast<SPListItem>().Select(itm => itm[filterField.Id]));

// don't call ToString(), just deal with base objects
List<object> obs = results.Cast<SPListItem>().Select(itm => itm[filterField.Id]).Distinct().ToList();

// alternate LINQ syntax from Pieter_Daems' answer; looks like it removes the Cast(),
// but the explicitly typed range variable compiles to the same Cast<SPListItem>() call
var things = (from SPListItem item in results select item[filterField.Id]).Distinct().ToList();


I found that all of those methods took multiple tens of seconds to complete. Strangely, the DataTable/DataView method from Pieter_Daems' answer, to which I added a bit to extract the values I wanted:

DataTable dt = results2.GetDataTable();
DataView vw = new DataView(dt);
DataTable udt = vw.ToTable(true, filterField.InternalName);
List<string> rowValues = new List<string>();
foreach (DataRow row in udt.Rows)
{
    rowValues.Add(row[filterField.InternalName].ToString());
}
rowValues.Sort();


took only 1-2 seconds!

In the end, I am going with Thriggle's answer, because it deals nicely with SharePoint's 5000 item list view threshold, which I will probably be dealing with some day, and it is only marginally slower (2-3 seconds) than the DataTable method. Still much, much faster than all the LINQ.

Interesting to note, though, that the fastest way to get distinct values from a particular field from a SPListItemCollection seems to be the DataTable/DataView conversion method.
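
For anyone who wants to try the DataView step in isolation, here is a small self-contained sketch that works on any DataTable. The DistinctColumnDemo class and its DistinctSorted method are names I made up for illustration; the null check and the ordinal sort are my own choices, not part of Pieter_Daems' answer.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

public static class DistinctColumnDemo
{
    // Returns the distinct, sorted values of one column of a DataTable.
    public static List<string> DistinctSorted(DataTable table, string columnName)
    {
        // ToTable(true, column) builds a new table with distinct rows for that column only.
        DataTable distinct = new DataView(table).ToTable(true, columnName);

        var values = new List<string>(distinct.Rows.Count);
        foreach (DataRow row in distinct.Rows)
        {
            // Skip DBNull so ToString() never turns it into an empty entry.
            if (!row.IsNull(columnName))
            {
                values.Add(row[columnName].ToString());
            }
        }

        values.Sort(StringComparer.Ordinal);
        return values;
    }
}
```

In the web part, the table would come from results.GetDataTable() and the column name from filterField.InternalName, as in the snippet above.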

Answer

You're potentially introducing a significant delay by retrieving all items first before checking for distinctness.

An alternative approach would be to perform multiple CAML queries against SharePoint; this would result in one query per unique value (plus one final query that returns no results).

  1. Make sure your list has column indexing applied to the field whose values you want to enumerate.
  2. In your initial CAML query, sort by the field you want to enumerate and impose a row limit of one item.
  3. Get the value of the field from the item returned by that query and add it to your collection of unique values.
  4. Query the list again, sorting by the field and imposing a row limit of 1, but this time add a filter condition such that it only retrieves items where the field value is greater than the field value you just detected.
  5. Add the value of the field in the returned item to your collection of unique values.
  6. Repeat steps 4 and 5 until the query returns an empty result set, at which point your collection of unique values should contain all current values of the field (assuming more haven't been added since you started).
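
The steps above can be sketched roughly as follows. This is only an outline under some assumptions: the DistinctByPaging class, BuildCaml and Collect are hypothetical names, the field is assumed to be a text field that is populated on every item (blank values would end the loop early and would need an extra filter), and the actual SharePoint call is hidden behind a delegate so the loop can be tried without a server.

```csharp
using System;
using System.Collections.Generic;
using System.Security;

public static class DistinctByPaging
{
    // Builds the CAML for SPQuery.Query: always sort ascending by the field, and
    // after the first pass only match values greater than the last one seen.
    public static string BuildCaml(string internalName, string lastValue)
    {
        string orderBy = string.Format(
            "<OrderBy><FieldRef Name='{0}' Ascending='TRUE' /></OrderBy>", internalName);

        if (lastValue == null)
        {
            return orderBy; // steps 2-3: no filter on the first query
        }

        // Step 4: escape the value so characters such as & or < keep the CAML well-formed.
        string where = string.Format(
            "<Where><Gt><FieldRef Name='{0}' /><Value Type='Text'>{1}</Value></Gt></Where>",
            internalName, SecurityElement.Escape(lastValue));

        return where + orderBy;
    }

    // fetchFirstAfter receives the last value seen (null on the first call) and
    // returns the field value of the single item the query yields, or null when
    // the result set is empty.
    public static List<string> Collect(Func<string, string> fetchFirstAfter)
    {
        var values = new List<string>();
        string last = null;

        while (true)
        {
            string next = fetchFirstAfter(last);
            if (next == null || next == last)
            {
                break; // step 6: empty result set (or no progress), so stop
            }

            values.Add(next); // steps 3 and 5
            last = next;
        }

        return values;
    }
}
```

In the web part, the delegate would run an SPQuery whose Query is BuildCaml(filterField.InternalName, last) and whose RowLimit is 1, then return the field value of the first item, or null if the collection is empty. Because the queries are sorted, the values come back already in order.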

Will this be any faster? That depends on your data, and how frequently duplicate values occur.

If you have 4000 items and only 5 unique values, you'll be able to gather those 5 values in only 6 lightweight CAML queries, returning a total of 5 items. This makes a lot more sense than querying for all 4000 items and enumerating through them one at a time to look for unique values.

On the other hand, if you have 4000 items and 3000 unique values, you're looking at querying the list 3001 times. This might well be slower than retrieving all the items in a single query and using post-processing to find the unique values.