H.Fadlallah H.Fadlallah - 10 days ago 8
Vb.net Question

Remove Empty rows and Columns From Excel Files Faster using Interop

I have a lot of excel files that contains data and it contains empty rows and empty columns.
like shown bellow

Excel preview

I am trying to remove Empty rows and columns from excel using interop.
I create a simple winform application and used the following code and it works fine.

Dim lstFiles As New List(Of String)
lstFiles.AddRange(IO.Directory.GetFiles(m_strFolderPath, "*.xls", IO.SearchOption.AllDirectories))

Dim m_XlApp = New Excel.Application
Dim m_xlWrkbs As Excel.Workbooks = m_XlApp.Workbooks
Dim m_xlWrkb As Excel.Workbook

For Each strFile As String In lstFiles
m_xlWrkb = m_xlWrkbs.Open(strFile)
Dim m_XlWrkSheet As Excel.Worksheet = m_xlWrkb.Worksheets(1)
Dim intRow As Integer = 1

While intRow <= m_XlWrkSheet.UsedRange.Rows.Count
If m_XlApp.WorksheetFunction.CountA(m_XlWrkSheet.Cells(intRow, 1).EntireRow) = 0 Then
m_XlWrkSheet.Cells(intRow, 1).EntireRow.Delete(Excel.XlDeleteShiftDirection.xlShiftUp)
Else
intRow += 1
End If
End While

Dim intCol As Integer = 1
While intCol <= m_XlWrkSheet.UsedRange.Columns.Count
If m_XlApp.WorksheetFunction.CountA(m_XlWrkSheet.Cells(1, intCol).EntireColumn) = 0 Then
m_XlWrkSheet.Cells(1, intCol).EntireColumn.Delete(Excel.XlDeleteShiftDirection.xlShiftToLeft)
Else
intCol += 1
End If
End While
Next

m_xlWrkb.Save()
m_xlWrkb.Close(SaveChanges:=True)

Marshal.ReleaseComObject(m_xlWrkb)
Marshal.ReleaseComObject(m_xlWrkbs)
m_XlApp.Quit()
Marshal.ReleaseComObject(m_XlApp)


But when cleaning big excel files it takes a lot of time.
Any suggestions for optimizing this code? or another way to clean this excel files faster? Is there a function that can delete empty rows in one click?

I don't have problem if answers are using C#

EDIT:

I uploaded a sample file Sample File. But not all files have same structure.

Answer

H.Fadlallah… you did say…. I don't have problem if answers are using C#. Well this one is… I hope it helps.

I found that looping through the excel worksheet can take some time if the worksheet is large. So my solution tried to avoid any looping in the worksheet. A more detailed explanation is below. To avoid looping through the worksheet, I made a 2 dimensional object array from the cells returned from usedRange with:

Excel.Range targetCells = worksheet.UsedRange;
object[,] allValues = (object[,])targetCells.Cells.Value;

This is the array I loop through to get the indexes of the empty rows and columns. I make 2 int lists, one keeps the row indexes to delete the other keeps the column indexes to delete.

List<int> emptyRows = GetEmptyRows(allValues, totalRows, totalCols);
List<int> emptyCols = GetEmptyCols(allValues, totalRows, totalCols);

These lists will be sorted from high to low to simplify deleting rows from the bottom up and deleting columns from right to left. Then simply loop through each list and delete the appropriate row/col.

DeleteRows(emptyRows, worksheet);
DeleteCols(emptyCols, worksheet);

Finally after all the empty rows and columns have been deleted, I SaveAs the file to a new file name.

Hope this helps.

EDIT:

Addressed the UsedRange issue such that if there are empty rows at the top of the worksheet, those rows will now be removed. Also this will remove any empty columns to the left of the starting data. This allows for the indexing to work properly even if there are empty rows or columns before the data starts. This was accomplished by taking the address of the first cell in UsedRange this will be an address of the form “$A$1:$D$4”. This will allow the use of an offset if the empty rows at the top and empty columns to the left are to remain and not be deleted. In this case I am simply deleting them. To get the number of rows to delete from the top can be calculated by the first “$A$4” address where the “4” is the row that the first data appears. So we need to delete the top 3 rows. The Column address is of the form “A”, “AB” or even “AAD” this required some translation and thanks to How to convert a column number (eg. 127) into an excel column (eg. AA) I was able to determine how many columns on the left need to be deleted.

class Program
{
  static void Main(string[] args)
  {
    Excel.Application excel = new Excel.Application();
    string originalPath = @"H:\ExcelTestFolder\Book1_Test.xls";
    Excel.Workbook workbook = excel.Workbooks.Open(originalPath);
    Excel.Worksheet worksheet = workbook.Worksheets["Sheet1"];
    Excel.Range usedRange = worksheet.UsedRange;

    RemoveEmptyTopRowsAndLeftCols(worksheet, usedRange);

    DeleteEmptyRowsCols(worksheet);

    string newPath = @"H:\ExcelTestFolder\Book1_Test_Removed.xls";
    workbook.SaveAs(newPath, Excel.XlSaveAsAccessMode.xlNoChange);

    workbook.Close();
    excel.Quit();
    System.Runtime.InteropServices.Marshal.ReleaseComObject(workbook);
    System.Runtime.InteropServices.Marshal.ReleaseComObject(excel);
    Console.WriteLine("Finished removing empty rows and columns - Press any key to exit");
    Console.ReadKey();
  }

  private static void DeleteEmptyRowsCols(Excel.Worksheet worksheet)
  {
    Excel.Range targetCells = worksheet.UsedRange;
    object[,] allValues = (object[,])targetCells.Cells.Value;
    int totalRows = targetCells.Rows.Count;
    int totalCols = targetCells.Columns.Count;

    List<int> emptyRows = GetEmptyRows(allValues, totalRows, totalCols);
    List<int> emptyCols = GetEmptyCols(allValues, totalRows, totalCols);

    // now we have a list of the empty rows and columns we need to delete
    DeleteRows(emptyRows, worksheet);
    DeleteCols(emptyCols, worksheet);
  }

  private static void DeleteRows(List<int> rowsToDelete, Excel.Worksheet worksheet)
  {
    // the rows are sorted high to low - so index's wont shift
    foreach (int rowIndex in rowsToDelete)
    {
      worksheet.Rows[rowIndex].Delete();
    }
  }

  private static void DeleteCols(List<int> colsToDelete, Excel.Worksheet worksheet)
  {
    // the cols are sorted high to low - so index's wont shift
    foreach (int colIndex in colsToDelete)
    {
      worksheet.Columns[colIndex].Delete();
    }
  }

  private static List<int> GetEmptyRows(object[,] allValues, int totalRows, int totalCols)
  {
    List<int> emptyRows = new List<int>();

    for (int i = 1; i < totalRows; i++)
    {
      if (IsRowEmpty(allValues, i, totalCols))
      {
        emptyRows.Add(i);
      }
    }
    // sort the list from high to low
    return emptyRows.OrderByDescending(x => x).ToList();
  }

  private static List<int> GetEmptyCols(object[,] allValues, int totalRows, int totalCols)
  {
    List<int> emptyCols = new List<int>();

    for (int i = 1; i < totalCols; i++)
    {
      if (IsColumnEmpty(allValues, i, totalRows))
      {
        emptyCols.Add(i);
      }
    }
    // sort the list from high to low
    return emptyCols.OrderByDescending(x => x).ToList();
  }

  private static bool IsColumnEmpty(object[,] allValues, int colIndex, int totalRows)
  {
    for (int i = 1; i < totalRows; i++)
    {
      if (allValues[i, colIndex] != null)
      {
        return false;
      }
    }
    return true;
  }

  private static bool IsRowEmpty(object[,] allValues, int rowIndex, int totalCols)
  {
    for (int i = 1; i < totalCols; i++)
    {
      if (allValues[rowIndex, i] != null)
      {
        return false;
      }
    }
    return true;
  }

  private static void RemoveEmptyTopRowsAndLeftCols(Excel.Worksheet worksheet, Excel.Range usedRange)
  {
    string addressString = usedRange.Address.ToString();
    int rowsToDelete = GetNumberOfTopRowsToDelete(addressString);
    DeleteTopEmptyRows(worksheet, rowsToDelete);
    int colsToDelete = GetNumberOfLeftColsToDelte(addressString);
    DeleteLeftEmptyColumns(worksheet, colsToDelete);
  }

  private static void DeleteTopEmptyRows(Excel.Worksheet worksheet, int startRow)
  {
    for (int i = 0; i < startRow - 1; i++)
    {
      worksheet.Rows[1].Delete();
    }
  }

  private static void DeleteLeftEmptyColumns(Excel.Worksheet worksheet, int colCount)
  {
    for (int i = 0; i < colCount - 1; i++)
    {
      worksheet.Columns[1].Delete();
    }
  }

  private static int GetNumberOfTopRowsToDelete(string address)
  {
    string[] splitArray = address.Split(':');
    string firstIndex = splitArray[0];
    splitArray = firstIndex.Split('$');
    string value = splitArray[2];
    int returnValue = -1;
    if ((int.TryParse(value, out returnValue)) && (returnValue >= 0))
      return returnValue;
    return returnValue;
  }

  private static int GetNumberOfLeftColsToDelte(string address)
  {
    string[] splitArray = address.Split(':');
    string firstindex = splitArray[0];
    splitArray = firstindex.Split('$');
    string value = splitArray[1];

    return ParseColHeaderToIndex(value);
  }

  private static int ParseColHeaderToIndex(string colAdress)
  {
    int[] digits = new int[colAdress.Length];
    for (int i = 0; i < colAdress.Length; ++i)
    {
      digits[i] = Convert.ToInt32(colAdress[i]) - 64;
    }
    int mul = 1; int res = 0;
    for (int pos = digits.Length - 1; pos >= 0; --pos)
    {
      res += digits[pos] * mul;
      mul *= 26;
    }
    return res;
  }

}

EDIT 2: For testing I made a method that loops thru the the worksheet and compared it to my code that loops thru an object array. It shows a significant difference.

enter image description here

Method to Loop thru the worksheet and delete empty rows and columns.

enum RowOrCol { Row, Column };
private static void ConventionalRemoveEmptyRowsCols(Excel.Worksheet worksheet)
{
  Excel.Range usedRange = worksheet.UsedRange;
  int totalRows = usedRange.Rows.Count;
  int totalCols = usedRange.Columns.Count;

  RemoveEmpty(usedRange, RowOrCol.Row);
  RemoveEmpty(usedRange, RowOrCol.Column);
}

private static void RemoveEmpty(Excel.Range usedRange, RowOrCol rowOrCol)
{
  int count;
  Excel.Range curRange;
  if (rowOrCol == RowOrCol.Column)
    count = usedRange.Columns.Count;
  else
    count = usedRange.Rows.Count;

  for (int i = count; i > 0; i--)
  {
    bool isEmpty = true;
    if (rowOrCol == RowOrCol.Column)
      curRange = usedRange.Columns[i];
    else
      curRange = usedRange.Rows[i];

    foreach (Excel.Range cell in curRange.Cells)
    {
      if (cell.Value != null)
      {
        isEmpty = false;
        break; // we can exit this loop since the range is not empty
      }
      else
      {
        // Cell value is null contiue checking
      }
    } // end loop thru each cell in this range (row or column)

    if (isEmpty)
    {
      curRange.Delete();
    }
  }
}

Then a Main for testing/timing the two methods.

enum RowOrCol { Row, Column };

static void Main(string[] args)
{
  Excel.Application excel = new Excel.Application();
  string originalPath = @"H:\ExcelTestFolder\Book1_Test.xls";
  Excel.Workbook workbook = excel.Workbooks.Open(originalPath);
  Excel.Worksheet worksheet = workbook.Worksheets["Sheet1"];
  Excel.Range usedRange = worksheet.UsedRange;

  // Start test for looping thru each excel worksheet
  Stopwatch sw = new Stopwatch();
  Console.WriteLine("Start stopwatch to loop thru WORKSHEET...");
  sw.Start();
  ConventionalRemoveEmptyRowsCols(worksheet);
  sw.Stop();
  Console.WriteLine("It took a total of: " + sw.Elapsed.Milliseconds + " Miliseconds to remove empty rows and columns...");

  string newPath = @"H:\ExcelTestFolder\Book1_Test_RemovedLoopThruWorksheet.xls";
  workbook.SaveAs(newPath, Excel.XlSaveAsAccessMode.xlNoChange);
  workbook.Close();
  Console.WriteLine("");

  // Start test for looping thru object array
  workbook = excel.Workbooks.Open(originalPath);
  worksheet = workbook.Worksheets["Sheet1"];
  usedRange = worksheet.UsedRange;
  Console.WriteLine("Start stopwatch to loop thru object array...");
  sw = new Stopwatch();
  sw.Start();
  DeleteEmptyRowsCols(worksheet);
  sw.Stop();

  // display results from second test
  Console.WriteLine("It took a total of: " + sw.Elapsed.Milliseconds + " Miliseconds to remove empty rows and columns...");
  string newPath2 = @"H:\ExcelTestFolder\Book1_Test_RemovedLoopThruArray.xls";
  workbook.SaveAs(newPath2, Excel.XlSaveAsAccessMode.xlNoChange);
  workbook.Close();
  excel.Quit();
  System.Runtime.InteropServices.Marshal.ReleaseComObject(workbook);
  System.Runtime.InteropServices.Marshal.ReleaseComObject(excel);
  Console.WriteLine("");
  Console.WriteLine("Finished testing methods - Press any key to exit");
  Console.ReadKey();
}
Comments