fmrjrhd - 1 month ago
C# Question

Find duplicates in datatable with multiple columns except two

I am new to coding and am trying to check a spreadsheet for duplicate rows. The spreadsheet has 50 columns, and every column has to be compared except two. If rows are duplicated, they should be combined into one row, with the amounts in the REQNUM and AUTHNUM columns summed. Most of the samples I found use "Field("a column name")". Because of the large number of columns, I want to use a variable that excludes the two columns I don't need in the comparison.

Example:

Before (the dots represent more columns):

COL1 | COL2 | COL3 | ... | REQNUM | AUTHNUM
:---: | :---: | :---: | :---: | :---: | :---:
x | y | z | ... | 1 | 1
x | y | z | ... | 2 | 3


After:

COL1 | COL2 | COL3 | ... | REQNUM | AUTHNUM
:---: | :---: | :---: | :---: | :---: | :---:
x | y | z | ... | 3 | 4

This is the code I have; it seems close but is not quite right. I was expecting a result containing only the duplicate rows, so that later I can run it through a foreach that sums the amounts and deletes the extra rows. dtRow gets me the columns I want (thanks to the question "Linq Excluding a column"). When I use that variable in my query, I get no results, and if I remove the "g.Count() > 1" I get all the rows, but without the two excluded columns. I would like to keep those two columns in the results and not have to add them back in later.

    var dtRow = dtExcel.Columns.Cast<DataColumn>()
        .Where(c => c.ColumnName != "REQNUM" && c.ColumnName != "AUTHNUM")
        .ToList();

    var checkExcel = dtExcel.Rows.Cast<DataRow>()
        .GroupBy(x => dtRow.Select(c => x[c]))
        .Where(g => g.Count() > 1)
        .Select(gr => gr);
        //.CopyToDataTable();


Thanks to Ken for the help. This worked great for what I needed. I used the GroupBy clause so that I can combine the duplicates into one row and sum the number fields. GroupBy also creates a key that I use in an if statement.

    var dtRow = dtExcel.Columns.Cast<DataColumn>()
        .Where(c => c.ColumnName != "REQNUM" && c.ColumnName != "AUTHNUM")
        .ToList();

    var excelDup = dtExcel.Rows.Cast<DataRow>()
        .GroupBy(x => String.Join("", dtRow.Select(c => x[c])))
        .Select(g =>
        {
            var row = g.First();
            row.SetField("REQNUM", g.Sum(x => x.Field<double>("REQNUM")));
            row.SetField("AUTHNUM", g.Sum(x => x.Field<double>("AUTHNUM")));
            return row;
        })
        .CopyToDataTable();
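
One side effect worth knowing: g.First() returns a row that still belongs to dtExcel, so SetField writes the totals into the source table as well as into the copy. If that is not wanted, a variant along the following lines builds the merged rows in a cloned table instead. This is only a sketch: the sample data, the names keyColumns and merged, and the "\u001F" separator are assumptions made for illustration and are not part of the original code.

```csharp
using System;
using System.Data;
using System.Linq;

// Hypothetical sample data standing in for the spreadsheet.
var dtExcel = new DataTable();
dtExcel.Columns.Add("COL1", typeof(string));
dtExcel.Columns.Add("COL2", typeof(string));
dtExcel.Columns.Add("REQNUM", typeof(double));
dtExcel.Columns.Add("AUTHNUM", typeof(double));
dtExcel.Rows.Add("x", "y", 1.0, 1.0);
dtExcel.Rows.Add("x", "y", 2.0, 3.0);
dtExcel.Rows.Add("p", "q", 5.0, 6.0);

// Every column except the two amount columns takes part in the comparison.
var keyColumns = dtExcel.Columns.Cast<DataColumn>()
    .Where(c => c.ColumnName != "REQNUM" && c.ColumnName != "AUTHNUM")
    .ToList();

// Clone() copies the schema only, so the totals are written to new rows
// and the rows of dtExcel keep their original amounts.
var merged = dtExcel.Clone();

// The separator keeps adjacent values from running together in the key.
var groups = dtExcel.Rows.Cast<DataRow>()
    .GroupBy(r => string.Join("\u001F", keyColumns.Select(c => r[c])));

foreach (var g in groups)
{
    merged.ImportRow(g.First());
    var target = merged.Rows[merged.Rows.Count - 1];
    target["REQNUM"] = g.Sum(r => (double)r["REQNUM"]);
    target["AUTHNUM"] = g.Sum(r => (double)r["AUTHNUM"]);
}

foreach (DataRow r in merged.Rows)
    Console.WriteLine(string.Join(" | ", r.ItemArray));
```

With the sample rows, merged ends up with two rows (the x/y pair combined, the p/q row unchanged), while dtExcel still holds its three original rows.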


I also used a Where clause to build a column list for the DataRow comparison, so no key is needed.

    // Build a list of all columns except three; it is used in the next queries
    var dtExcelRow = dtExcel.Columns.Cast<DataColumn>()
        .Where(c => c.ColumnName != "TITLE" && c.ColumnName != "REQSTR" && c.ColumnName != "AUTHSTR")
        .ToList();
    var dtListRow = dtList.Columns.Cast<DataColumn>()
        .Where(c => c.ColumnName != "TITLE" && c.ColumnName != "REQSTR" && c.ColumnName != "AUTHSTR")
        .ToList();

    // Queries that create the DataRow lists for the comparison
    IEnumerable<DataRow> eRow = dtExcel.AsEnumerable()
        .Where(w => dtExcelRow.Select(c => w[c]).Any())
        .Select(x => x);
    IEnumerable<DataRow> lRow = dtList.AsEnumerable()
        .Where(w => dtListRow.Select(c => w[c]).Any())
        .Select(x => x);

    // The first comparison returns the records that are new or have changes.
    // The second returns the old records that are being changed.
    var newRecords = eRow.AsEnumerable().Except(lRow.AsEnumerable(), DataRowComparer.Default);
    var oldRecords = lRow.AsEnumerable().Except(eRow.AsEnumerable(), DataRowComparer.Default);
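
A caveat about the snippet above: Select(c => w[c]).Any() is true for every row as long as the column list is not empty, so the two Where calls keep all rows, and DataRowComparer.Default then compares every column, including TITLE, REQSTR and AUTHSTR. If the intent is to ignore those columns, one possible approach is to compare rows through a key built only from the remaining columns. The sketch below assumes that string comparison of the values is good enough; RowKey and MakeTable are hypothetical helpers, and the sample tables and the "\u001F" separator are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;

// Hypothetical helper: joins every non-ignored value into one comparison key.
// Columns are ordered by name so the two tables may list them in a different order.
string RowKey(DataRow row, ISet<string> ignored) =>
    string.Join("\u001F", row.Table.Columns.Cast<DataColumn>()
        .Where(c => !ignored.Contains(c.ColumnName))
        .OrderBy(c => c.ColumnName, StringComparer.Ordinal)
        .Select(c => Convert.ToString(row[c])));

// Hypothetical helper that builds small sample tables standing in for dtExcel and dtList.
DataTable MakeTable(params object[][] rows)
{
    var t = new DataTable();
    t.Columns.Add("ID", typeof(int));
    t.Columns.Add("NAME", typeof(string));
    t.Columns.Add("TITLE", typeof(string));
    foreach (var values in rows) t.Rows.Add(values);
    return t;
}

var dtExcel = MakeTable(
    new object[] { 1, "ann", "t1" },
    new object[] { 2, "bob", "t2" },
    new object[] { 3, "cy", "t3" });
var dtList = MakeTable(
    new object[] { 1, "ann", "other" },   // differs only in the ignored column
    new object[] { 2, "bobby", "t2" });   // differs in a compared column

var ignored = new HashSet<string> { "TITLE" };
var excelKeys = new HashSet<string>(dtExcel.Rows.Cast<DataRow>().Select(r => RowKey(r, ignored)));
var listKeys = new HashSet<string>(dtList.Rows.Cast<DataRow>().Select(r => RowKey(r, ignored)));

// Rows of dtExcel with no match in dtList (new or changed), and the reverse.
var newRecords = dtExcel.Rows.Cast<DataRow>()
    .Where(r => !listKeys.Contains(RowKey(r, ignored))).ToList();
var oldRecords = dtList.Rows.Cast<DataRow>()
    .Where(r => !excelKeys.Contains(RowKey(r, ignored))).ToList();

Console.WriteLine($"{newRecords.Count} new or changed, {oldRecords.Count} old");
// prints "2 new or changed, 1 old"
```

Unlike Except, the Where filter does not remove duplicates within the same table, so identical rows in dtExcel would each show up in newRecords.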

Answer

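You cannot group the data by dtRow.Select(c => x[c]) directly, because that expression returns an IEnumerable<object>, and GroupBy compares such keys by reference. Two sequences may have the same content, but they are still different IEnumerable instances, so every row ends up in its own group and g.Count() > 1 never matches.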

If the values are strings (or have a usable string form), you can group the data by the joined string instead:

x => String.Join("", dtRow.Select(c => x[c]))
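
The difference between the two keys can be seen on a few in-memory rows. The sketch below also shows one thing to watch with an empty separator: values such as "ab", "c" and "a", "bc" join to the same string, so unrelated rows can be merged. Using a separator that is unlikely to occur in the data (here "\u001F", which is an assumption and not part of the original answer) avoids that. The sample rows are made up for illustration.

```csharp
using System;
using System.Linq;

// Three rows of key values: rows 0 and 1 are true duplicates, row 2 is not.
var rows = new[]
{
    new object[] { "ab", "c" },
    new object[] { "ab", "c" },
    new object[] { "a", "bc" },
};

// Sequence as key: keys are compared by reference, so nothing is grouped.
int bySequence = rows.GroupBy(r => r.Select(v => v)).Count();

// Empty separator: "ab" + "c" and "a" + "bc" both give "abc", so all rows merge.
int byPlainJoin = rows.GroupBy(r => string.Join("", r)).Count();

// With a separator, only the true duplicates share a key.
int bySeparatedJoin = rows.GroupBy(r => string.Join("\u001F", r)).Count();

Console.WriteLine($"{bySequence} {byPlainJoin} {bySeparatedJoin}"); // prints "3 1 2"
```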