Enfyve Enfyve - 11 months ago 39
ASP.NET (C#) Question

Optimum way to validate DataTable for duplicate or invalid fields in a specific column with LINQ

I am trying to find the best way to determine if a


  1. Contains duplicate data in a specific column


  1. If the fields within said column are not found in an external
    Dictionary<string, string>
    and the resulting value matches a string literal.

This is what I've come up with:

List<string> dtSKUsColumn = _dataTable.Select()
.Select(x => x.Field<string("skuColumn"))

bool hasError = dtSKUsColumn.Distinct().Count() != dtSKUsColumn.Count() ||
.All(r => allSkuTypes
.Any(s => s.Value == "normalSKU" &&
s.Key == r.Field<string>("skuColumn")));

is a
Dictionary<string, string>
where the key is the SKU itself, and the value is the SKU type.

I cannot just operate on a 'distinct'
, because there is a column that must contain identical fields (Said column cannot be removed and inferred, since I need to preserve the state of _dataTable).

So my question:

Am I handling this in the best possible way, or is there a simpler and faster method?


The DataTable is not obtained via an SQL query, rather it is generated by a set of rules from an spreadsheet or csv. I have to make do with only the
objects as my only 'outside information.'

Answer Source

Your solution is not optimal.

Let N = _dataTable.Rows.Count and M = allSkuTypes.Count. Your algorithm has O(2 * N) space complexity (the memory allocated by ToList and Disctinct calls) and O(N * M) time complexity (due to linear search in the allSkuTypes for each _dataTable record).

Here is IMO the optimal solution. It uses single pass over the _dataTable records, a HashSet<string> for detecting the duplicates and TryGetValue method of the Dictionary for checking the second rule, thus ending up with O(N) space and time complexity:

var dtSkus = new HashSet<string>();
bool hasError = false;
foreach (var row in _dataTable.AsEnumerable())
    var sku = row.Field<string>("skuColumn");
    string type;
    if (!dtSkus.Add(sku) || !allSkuTypes.TryGetValue(sku, out type) || type != "normalSKU")
        hasError = true;

The additional benefit is that you have the row with the broken rule and the code can easily be modified to take different actions depending of the which rule is broken, collect/count only the first or all invalid records etc.