I have a table with the columns [ID, ITEM_NAME, ITEM_PRICE, ITEM_STATUS, ITEM_TYPE, ITEM_OWNER, ITEM_DATE]
The application can query the table with any number of search conditions, e.g. by item date and/or item owner, etc.
In the result set, I also need to fetch the counts grouped by ITEM_STATUS.
It's often causing timeouts when I try to get the counts based on status.
How is this case generally handled in large-volume applications? Take Y mail, for example: I can see how many messages are in the inbox, how many are read/unread/sent and so on, almost instantly. How can such an experience be achieved?
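For concreteness, the count part is roughly this shape (the table and bind-variable names here are hypothetical):

SELECT item_status, COUNT(*) AS status_count
FROM   items
WHERE  item_owner = :owner
AND    item_date >= :from_date
GROUP BY item_status;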
Other than indexing (bitmap indexes are the most flexible and performant option for this kind of thing, if you can deal with the concurrency issues in maintaining them), consider materialised views to cover the most common aggregation levels.
Defining multi-level materialised views can give you almost instant response times, and even allows effective indexing for queries with HAVING clauses.
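A minimal sketch of the materialised-view idea, assuming Oracle (where bitmap indexes and fast-refresh materialised views live) and an ITEMS table with the columns from the question; treat it as a starting point rather than a drop-in:

-- The log lets Oracle fast-refresh the aggregate view on commit.
CREATE MATERIALIZED VIEW LOG ON items
  WITH SEQUENCE, ROWID (item_owner, item_status)
  INCLUDING NEW VALUES;

-- Pre-aggregated counts per owner and status.
CREATE MATERIALIZED VIEW item_status_counts
  BUILD IMMEDIATE
  REFRESH FAST ON COMMIT
  ENABLE QUERY REWRITE
AS
SELECT item_owner, item_status, COUNT(*) AS status_count
FROM   items
GROUP  BY item_owner, item_status;

With ENABLE QUERY REWRITE, a count query grouped by owner and status can then be answered from the pre-aggregated view instead of scanning the base table.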
Related
I am planning to setup a database for some rudimentary aggregations. The plan is to provide queries like SELECT SUM(energy) WHERE... to our users.
energy is a plain numerical field. The WHERE clause is more interesting, because we expose some (limited) customizability to the user, basically just ANDing some fields for equality. Like A=3, A=3 AND B=92, etc.
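For concreteness, the queries look roughly like this (the table name is hypothetical; a and b stand for the exposed fields):

SELECT SUM(energy) FROM measurements WHERE a = 3 AND b = 92;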
I'm no DBA, but my performance sense is tingling. As it stands, I'm anticipating a load of O(user * record) on the database, should the queries be fired off all at once. Is there a way to optimize this better?
If the WHERE condition were fixed, then we could simply provide a view or otherwise precalculate and cache the sum. Unfortunately, in this case, we will be providing limited ability to customize the WHERE expression, basically offering some fields for the users to AND at will.
Looks to me like each of these aggregation queries would traverse basically the whole table, or significant subsections of the table, one traversal per user query. Does that make sense?
What are some ways to optimize for this kind of workload? Should I just shard over many of my fields? I'm also considering how many replicas to use, though I am unsure whether the number of replicas could outpace the growth of users or data, given that each aggregation query touches most of the data.
In terms of low-level performance, would it make sense to structure these queries as SELECT SUM(energy)>N WHERE..., and hope that PostgreSQL is smart enough to terminate early when the subtotal is found to already exceed the threshold N?
Finally, would a NoSQL or TSDB offer advantages for this workflow, or would their performance be comparable to a SQL database?
Update
Since most of these queries will be run on a schedule, I think I will stagger them to spread the load throughout the day. But I'm still keen on finding better ways to optimize the tables for this load, should a bunch of active users suddenly submit aggregation queries all at once.
We are developing a CRUD-like web interface for our application. For this, we need to show data from different tables. Some are huge and very "alive", with many rows (millions); some are small configuration tables.
Now we want to allow our users to filter, refine, sort and paginate the grids we show. From the user's selections we build SELECT queries.
For obvious reasons, filtering on non-indexed fields will produce rather long-running queries. On the other hand, indexing every column of a table looks a bit "weird", and we do have tables with more than 50 columns.
We are looking into Apache Lucene, but as far as I understand it will help us with text indexing. What about numbers, dates and ranges? Are there any solutions or discussions available for this issue?
Also, I must point out that this issue is specific to the UX only. For the application's own needs, we are doing fine.
You are correct: in general you don't want to allow arbitrary predicates on non-indexed fields. However, how much effect this has depends a lot on the table size, the database engine being used and the machine driving the database. Some engines are not too bad with non-indexed columns, but in the worst case each query degenerates to a sequential scan, and sequential scans aren't always as bad as they sound, either.
Some ideas
Investigate using a column-store database engine. These store data column-wise rather than row-wise, which can be much faster for arbitrary predicates on non-indexed columns. Column stores aren't a universal solution, though, if you often need all the fields of a row.
Index the main columns that will be queried by users and indicate in the UX layer that queries on some columns will be slower. Users will be more accepting, especially if they know in advance that a query on a given column will be slow.
If possible, just throw memory at it. Engines like Oracle or SQL Server will be pretty good while most of your database fits in memory. The only problem is that once your database exceeds available memory, performance will fall off a cliff (without warning).
Consider using vertical partitioning if possible. This lets you split a row into 2 or more pieces for storage, which can reduce IO for predicates (see the sketch after this list).
I'm sure you know this, but make sure the columns used for joins are indexed.
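A minimal sketch of the vertical-partitioning idea, with hypothetical table and column names: the narrow "hot" table carries the columns users actually filter on, and the wide table holds the rest.

-- Narrow table: the columns that appear in predicates.
CREATE TABLE item_core (
    id         BIGINT PRIMARY KEY,
    status     VARCHAR(20),
    owner_id   BIGINT,
    created_at DATE
);

-- Wide table: everything else, sharing the same primary key.
CREATE TABLE item_detail (
    id          BIGINT PRIMARY KEY REFERENCES item_core (id),
    description TEXT,
    payload     TEXT
);

-- Filtering scans only the narrow table; the wide columns are joined in just for the surviving rows.
SELECT c.id, d.description
FROM   item_core c
JOIN   item_detail d ON d.id = c.id
WHERE  c.status = 'OPEN' AND c.created_at >= '2023-01-01';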
We're currently looking at trying to improve the performance of queries for our site. The core hierarchical data structure has 5 levels, and each type has about 20 fields.
level1: rarely added, updated infrequently, ~ 100 children
level2: rarely added, updated fairly infrequently, ~ 200 children
level3: added often, updated fairly often, ~ 1-50 children (average ~10)
level4: added often, updated quite often, ~1-50 children (average <10)
level5: added often, updated often (a single item might update once a second)
We have a single data pipeline which performs all of these updates and inserts (i.e. we have full control over the data going in).
The queries we need to do on this are:
fetch single items from a level + parents
fetch a slice of items across a level (either by PK, or sometimes filtering criteria)
fetch multiple items from level3 and parts of their children (usually by complex criteria)
fetch level3 and all children
We read from this data source a lot, as in hundreds of times a second. All of the queries we need to perform are known and optimised as well as they can be for the current data structure.
We're currently using MySQL queries behind memcached for this, just doing additional queries to get children/parents. I'm thinking that some sort of tree-based or document-based database might be more suitable.
My question is: what's the best way to model this data for efficient read performance?
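For reference, if the five level tables were folded into a single adjacency-list table, the "item plus its parents" fetch becomes one recursive query; a hypothetical sketch, using the recursive-CTE syntax available in MySQL 8+ and PostgreSQL:

WITH RECURSIVE ancestry AS (
    SELECT id, parent_id, tree_level, name
    FROM   nodes
    WHERE  id = 12345                          -- the item we start from
    UNION ALL
    SELECT n.id, n.parent_id, n.tree_level, n.name
    FROM   nodes n
    JOIN   ancestry a ON n.id = a.parent_id    -- walk upwards through the parents
)
SELECT * FROM ancestry;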
Sounds like your data belongs in an OLAP (On-Line Analytical Processing) database. The way you're describing levels, slices, and performance concerns seems to lend itself to OLAP. It's probably modeled fine (not sure though), but you need a different tool to boost performance.
I currently manage a system like this. We have a standard relational database for input, and then copy the pertinent data for reporting to an OLAP server. Our combo is Microsoft SQL Server (input, raw data), Microsoft Analysis Services (pre-calculates then stores the analytical data to increase speed), and Microsoft Excel/Access Pivot Tables and/or Tableau for reporting.
OLAP servers:
http://en.wikipedia.org/wiki/Comparison_of_OLAP_Servers
Combining relational and OLAP:
http://en.wikipedia.org/wiki/HOLAP
Tableau:
http://www.tableausoftware.com/
*Tableau is a superb product, and can probably replace an OLAP server if your data isn't terribly large (even then it can handle a lot of data). It will make local copies as necessary to improve performance. I strongly advise giving it a look.
If I've misunderstood the issue you're having, then by all means please ignore this answer :\
UPDATE: After more discussion, an Object DB might be a solution as well. Your data sounds multi-dimensional in nature, one way or the other, but I think the difference would be whether you're doing analytic aggregate calculations and retrieval (SUMs, AVGs), or just storing and fetching categorical or relational data (shopping cart items, or friends of a family member).
ODBMS info: http://en.wikipedia.org/wiki/Object_database
InterSystems Caché is one object database I know of that sounds like a more appropriate fit based on what you've said.
http://www.intersystems.com/cache/
If conversion to a different system isn't feasible (entirely understandable), then you might have to look at normalization and the types of data your queries are processing in order to gain further improvements in speed. In fact, that's probably a good first step before jumping to a different type of system (sorry I didn't get to this sooner).
In my case, I know that on MS SQL, switching some core queries from a VARCHAR field to an INTEGER field made a huge difference in speed. Text data is one of THE most expensive types of data to process. So for instance, if you have a query doing a lot of INNER JOINs on text fields, you might consider normalizing to the point where you're using INTEGER IDs that link to the text data.
An example of high normalization could be using ID numbers for a person's First or Last Name. Most DB designs store these names directly and don't attempt to reduce duplication, but you could normalize to the point where Last Name and/or First Name have their own tables (or one table to hold both First and Last names) and IDs for each unique name.
The point in your case would be more for performance than de-duplication of data, but something like switching from VARCHAR to INTEGER might have huge gains. I'd try it with a single field first, measure the before and after cases, and make your decision carefully from there.
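As a rough illustration of that kind of change (table and column names here are hypothetical; the join key moves from a text column to an integer surrogate key):

-- The names live once in a lookup table...
CREATE TABLE last_name (
    last_name_id INT PRIMARY KEY,
    last_name    VARCHAR(100) NOT NULL UNIQUE
);

-- ...and the main table carries only the INTEGER key.
CREATE TABLE person (
    person_id    INT PRIMARY KEY,
    first_name   VARCHAR(100),
    last_name_id INT NOT NULL REFERENCES last_name (last_name_id)
);

-- Joins now compare integers instead of strings.
SELECT p.person_id, n.last_name
FROM   person p
JOIN   last_name n ON n.last_name_id = p.last_name_id
WHERE  n.last_name = 'Smith';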
And of course, in general you should be sure to have appropriate indexes on your data.
Hope that helps.
A document/tree-based database is designed to perform hierarchical queries. Do you have any hierarchical queries in your design? I fail to see any. Querying one level up or down doesn't count: that is a simple join. Please bear in mind that by going the "document/tree-based database" route you would compromise your general querying ability. To summarize: just hire a competent DB specialist to analyze your performance bottlenecks -- they are usually cured with mundane index additions.
There's not really enough info here to say much that's useful -- you'd need to measure things, look at "explain" plans, etc. -- but one option that goes beyond the usual indexing would be to shard by level3 instances. At its simplest (separate disks), that would give you better performance on parallel queries that hit different shards, or you could use separate machines if you want to throw more resources at each shard.
The only reason I mention this is that your use cases suggest sharding at that level would work quite well (it looks like it would be simple enough to do in your application layer, if you wanted -- I don't know offhand what tools MySQL has for this).
And if your data volume isn't too high, then with sharding you might be able to fit each shard onto SSDs...
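For what it's worth, within a single server MySQL's built-in table partitioning can give a comparable split by level3; a hedged sketch with hypothetical table and column names (the primary key must include the partitioning column):

CREATE TABLE level5_item (
    id        BIGINT NOT NULL,
    level3_id BIGINT NOT NULL,
    payload   VARCHAR(255),
    PRIMARY KEY (id, level3_id)       -- must contain the partitioning column
)
PARTITION BY HASH (level3_id)
PARTITIONS 8;

-- Queries that filter on level3_id are pruned to a single partition.
SELECT * FROM level5_item WHERE level3_id = 42;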
On sites like SO, I'm sure it's absolutely necessary to store as much aggregated data as possible to avoid performing all those complex queries/calculations on every page load. For instance, storing a running tally of the vote count for each question/answer, or storing the number of answers for each question, or the number of times a question has been viewed so that these queries don't need to be performed as often.
But does doing this go against db normalization, or any other standards/best-practices? And what is the best way to do this, e.g., should every table have another table for aggregated data, should it be stored in the same table it represents, when should the aggregated data be updated?
Thanks
Storing aggregated data is not itself a violation of any Normal Form. Normalization is concerned only with redundancies due to functional dependencies, multi-valued dependencies and join dependencies. It doesn't deal with any other kinds of redundancy.
The phrase to remember is "Normalize till it hurts, Denormalize till it works"
It means: normalise all your domain relationships (to at least Third Normal Form (3NF)). If measurement shows a lack of performance, then investigate (and measure) whether denormalisation will provide performance benefits.
So, Yes. Storing aggregated data 'goes against' normalisation.
There is no 'one best way' to denormalise; it depends what you are doing with the data.
Denormalisation should be treated the same way as premature optimisation: don't do it unless you have measured a performance problem.
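As an illustration, one common form of this denormalisation is a tally column kept in step by the write path; a minimal sketch with hypothetical table names (MySQL-style trigger syntax):

ALTER TABLE question ADD COLUMN answer_count INT NOT NULL DEFAULT 0;

-- Bump the tally whenever an answer is added; a matching AFTER DELETE trigger would decrement it.
CREATE TRIGGER answer_after_insert
AFTER INSERT ON answer
FOR EACH ROW
  UPDATE question
  SET    answer_count = answer_count + 1
  WHERE  id = NEW.question_id;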
Too much normalization will hurt performance, so in the real world you have to find your balance.
I've handled a situation like this in two ways.
1) Using DB2, I used an MQT (Materialized Query Table) that works like a view, except it's driven by a query and you can schedule how often you want it to refresh, e.g. every 5 minutes. That table then stored the count values (a sketch follows below).
2) In the software package itself I set information like that as a system variable. In Apache, for example, you can set a system-wide variable and refresh it every 5 minutes. It's then only approximately accurate, but you're only running your "count(*)" query once every five minutes. You can have a daemon run it or have it driven by page requests.
I used a wrapper class to do it, and it's been a while, but I think in PHP it was as simple as:
$_SERVER['report_page_count'] = array('timeout'=>1234569783, 'count'=>15);
Nonetheless, however you store that single value, it saves you from running the count with every request.
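As a rough sketch of the MQT approach from 1) (DB2-style syntax; the table names are hypothetical):

-- Deferred-refresh materialized query table holding the pre-computed counts.
CREATE TABLE item_status_count AS
  (SELECT item_status, COUNT(*) AS cnt
   FROM   items
   GROUP BY item_status)
  DATA INITIALLY DEFERRED REFRESH DEFERRED;

-- Populate/refresh it on whatever schedule suits, e.g. every 5 minutes from a scheduled job.
REFRESH TABLE item_status_count;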
Here's the issue:
The database is highly normalized, and one particular query relies on the multiple relationships in the database. The query is designed to join all the tables, construct the entire object, and then return a list of those objects.
In other words this particular query does a lot of work.
Now, the query only returns X items at a time, since it supports pagination, but we also need to know the total count of matching items.
Currently these are two independent but highly similar queries in our Domain Service. Ideally what I'd like to do is combine them so that the call to the server happens once rather than twice, and the joins happen only once.
Output/Reference parameters don't work, and since the function is designed to return an IQueryable of items, I'm stuck on how to return this list of items as well as the total count.
I'm sure someone's come across this before - any thoughts?
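For reference, at the plain SQL level the page and the total can sometimes be fetched in one statement with a window function (SQL Server 2012+ syntax, hypothetical table name); whether that is reachable through RIA/LINQ is another question:

SELECT  i.*,
        COUNT(*) OVER () AS total_count   -- total rows matching the WHERE, repeated on every row
FROM    items AS i
WHERE   i.status = 'OPEN'
ORDER BY i.id
OFFSET 20 ROWS FETCH NEXT 10 ROWS ONLY;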
A count over the joined tables is not the same thing as returning a subset of those records; they just happen to share a certain amount of SQL code (specifically, the joins). RIA does the actual paging server-side, so you are actually getting a slightly different query for every paging call.
A count operation will also run much faster than the record query, as SQL counts can often be satisfied from database indexes alone (although LINQ may well optimise this for you to the same end result... clever LINQ coders!).
As you would only be requesting the total count once (on page load, I assume) and then paging through multiple queries over different portions of the data, you are hitting different parts of the database with every call.
You are better off treating them as two distinct functions (as you were) and wearing the slight overhead of an additional server call. There is always somewhere else you could make bigger gains (caching, etc.).
When in doubt: Do not overcomplicate any process for the sake of only a very small gain.
If the problem is the client-server communication, you can put the count result in a header of the response.