user3440244 user3440244 - 1 month ago 15
R Question

count the unique values in one column in EXCEL 2010 or R with 1 million rows

After searching the forum, I did not find a good solution for this question. If I missed it, please tell me.

I need to count the unique values in one column in EXCEL 2010.

The worksheet has 1 million rows and 10 columns. All cell values are string or numbers.

I used the solution at Count unique values in a column in Excel

=SUMPRODUCT((A2:A1000000<>"")/COUNTIF(A2:A100000,A2:A1000000&""))


But, it runs so long time that the EXCEL is almost frozen. And, it generates 25 processes in Win 7.

Are there more efficient ways to do it?

Also, in the column, all values have for format of

AX_Y

here, A is a character, X is an integer, Y is an integer from 1 to 10.

For example, A5389579_10


I need to cut off the part after (including) undersocre. for the example,

A5389579


This is what I need to count as unique values in all cells in one column.

For example, A5389579_10
A1543848_6
A5389579_8


Here, the unique value has 2 after removing the part after underscore.

How to do it in EXCEL VBA and R (if no efficient solution for EXCEL)?

Answer

If you want to do this by VBA, you can take advantage of the Collection object. Since collections can only contain unique values, trying to add all of your input data to a collection will result in an array of unique values. The code below takes all the variables in a selected range and then outputs an array with distinct values to an other sheet (in this case a sheet named Output).

Sub ReturnDistinct()
    Dim Cell As Range
    Dim i As Integer
    Dim DistCol As New Collection
    Dim DistArr()
    Dim OutSht As Worksheet
    Dim LookupVal As String

    Set OutSht = ActiveWorkbook.Sheets("Output") '<~~ Define sheet to putput array

    If TypeName(Selection) <> "Range" Then Exit Sub

    'Add all distinct values to collection
    For Each Cell In Selection
        If InStr(Cell.Value, "_") > 0 Then
            LookupVal = Mid(Cell.Value, 1, InStr(Cell.Value, "_") - 1)
        Else
            LookupVal = Cell.Value
        End If
        On Error Resume Next
        DistCol.Add LookupVal, CStr(LookupVal)
        On Error GoTo 0
    Next Cell

    'Write collection to array
    ReDim DistArr(1 To DistCol.Count, 1 To 1)
    For i = 1 To DistCol.Count Step 1
        DistArr(i, 1) = DistCol.Item(i)
    Next i

    'Outputs distinct values
    OutSht.Range("A1:A" & UBound(DistArr)).Value = DistArr
End Sub

Note that since this code writes all the distinct values to a single column in the OutSht-sheet, this will return an error if there are more than 1,048,576 distinct values in your dataset. In that case you would have to split the data to be filled into multiple output columns.