Jon Koeter Jon Koeter - 27 days ago 8
C# Question

Millions of rows in the database, only so much needed

Problem summary:


  • C# (MVC), entity framework 5.0 and Oracle.

  • I have a couple of million rows in a view which joins two tables.

  • I need to populate dropdownlists with filter-posibilities.

  • The options in these dropdownlists should reflect the actual contents
    of the view for that column, distinct.

  • I want to update the dropdownlists whenever you select something, so
    that the new options reflect the filtered content, preventing you
    from choosing something that would give 0 results.

  • Its slow.



Question: whats the right way of getting these dropdownlists populated?

Now for more detail.

-- Goal of the page --

The user is presented with some dropownlists that filter the data in a grid below. The grid represents a view (see "Database") where the results are filtered.

Each dropdownlist represents a filter for a column of the view. Once something is selected, the rest of the page updates. The other dropdownlists now contain the posible values for their corresponding columns that complies to the filter that was just applied in the first dropdownlist.

Once the user has selected a couple of filters, he/she presses the search button and the grid below the dropdownlists updates.

-- Database --

I have a view that selects almost all columns from two tables, nothing fancy there. Like this:

SELECT tbl1.blabla, tbl2.blabla etc etc
FROM table1 tbl1, table2 tbl2
WHERE bsl.bvz_id = bvz.id AND bsl.einddatum IS NULL;


There is a total of 22 columns. 13 VARCHARS (mostly small, 1 - 20, one of em has a size of 2000!), 6 DATES and 3 NUMBERS (one of them size 38 and one of them 15,2).

There are a couple of indexes on the tables, among which the relevant ID's for the WHERE clause.

Important thing to know: I cannot change the database. Maybe set an index here and there, but nothing major.

-- Entity Framework --

I created a Database first EDMX in my solution and also mapped the view. There are also classes for both tables, but I need data from both of them, so I don't know if I need them. The problem by selecting things from either table would be that you can't apply half of the filtering, but maybe there are smart way's I didn't think of yet.

-- View --

My view is strongly bound to a viewModel. In there I have a IEnumerable for each dropdownlist. The getter for these gets its data from a single IEnumerable called NameOfViewObjects. Like this:

public string SelectedColumn1{ get; set; }

private IEnumerable<SelectListItem> column1Options;
public IEnumerable<SelectListItem> Column1Options
{
get
{
if (column1Options == null)
{
column1Options= NameOfViewObjects.Select(item => item.Column1).Distinct()
.Select(item => new SelectListItem
{
Value = item,
Text = item,
Selected = item.Equals(SelectedColumn1, StringComparison.InvariantCultureIgnoreCase)
});
}
return column1Options;
}
}


The two solutions I've tried are:

- 1 -
Selecting all columns in a linq query I need for the dropdownlists (the 2000 varchar is not one of them and there are only 2 date columns), do a distinct on them and put the results into a Hashset. Then I set NameOfViewObjects to point towards this hashset. I have to wait for about 2 minutes for that to complete, but after that, populating the dropdownlists is almost instant (maybe a second for each of them).

model.Beslissingen = new HashSet<NameOfViewObject>(dbBes.NameOfViewObject
.DistinctBy(item => new
{
item.VarcharColumn1,
item.DateColumn1,
item.DateColumn2,
item.VarcharColumn2,
item.VarcharColumn3,
item.VarcharColumn4,
item.VarcharColumn5,
item.VarcharColumn6,
item.VarcharColumn7,
item.VarcharColumn8
}
)
);


The big problem here is that the object NameOfViewObject is probably quite large, and even though using distinct here, resulting in less than 100.000 results, it still uses over 500mb of memory for it. This is unacceptable, because there will be a lot of users using this screen (a lot would be... 10 max, 5 average simultaniously).

- 2 -
The other solution is to use the same linq query and point NameOfViewObjects towards the IQueryable it produces. This means that every time the view wants to bind a dropdownlist to a IEnumerable, it will fire a query that will find the distinct values for that column in a table with millions of rows where most likely the column it's getting the values from is not indexed. This takes around 1 minute for each dropdownlist (I have 10), so that takes ages.

Don't forget: I need to update the dropdownlists every time one of them has it's selection changed.

-- Question --
So I'm probably going at this the wrong way, or maybe one of these solutions should be combined with indexing all of the columns I use, maybe I should use another way to store the data in memory, so it's only a little, but there must be someone out there who has done this before and figured out something smart. Can you please tell me what would be the best way to handle a situation like this?

Acceptable performance:


  • having to wait for a while (2 minutes) while the page loads, but
    everything is fast after that.

  • having to wait for a couple of seconds every time a dropdownlist
    changes

  • the page does not use more than 500mb of memory


Answer Source

Of course you should have indexes on all columns and combinations in WHERE clauses. No index means table scan and O(N) query times. Those cannot scale under any circumstance.

You do not need millions of entries in a drop down. You need to be smarter about filtering the database down to manageable numbers of entries.

I'd take a page from Google. Their type ahead helps narrow down the entire Internet graph into groups of 25 or 50 per page, with the most likely at the top. Maybe you could manage that, too.

Perhaps a better answer is something like a search engine. If you were a Java developer you might try Lucene/SOLR and indexing. I don't know what the .NET equivalent is.