LIFO / sorted database design pattern - database

I want to store data (as an archive) in two seperate lists one is to be a sort of LIFO stack where new data just gets pushed on top and the other is sorted by a temporally independent value. Data may be retreived at a later point in time, but I'm generally only interested in the topmost N values. Both lists can get very long but contain very simple values (document ids with priority). Is there a database to implement this pattern efficiently? I hear HBase does sorted storage, would it be useful for this kind of application?
At least the LIFO storage could be implemented as a plain file. Is this wise?
Or is this concern about retreival speed premature optimization, i.e. are there commands in SQL with which i can retreive first N by time of insertion / sorted by a value . Or should I shard / paginate?

Rows or "tuples" if you like, are specifically not ordered in a relational database. It is considered an implementation detail. Of course, we often need to impose an order of the rows anyway, but we have to do it when we query the data, not when we store it.
I have no knowledge of hbase, but I noticed it was free, so if you can consider MySQL an alternative, here is one way to do what you want.
Create an InnoDB table with an auto-incrementing primary key. InnoDB tables are clustered on the primary key, meaning that the rows are stored sorted by the key. Since you use an auto-incrementing key, newer rows will always have higher values, and rows added in sequence will be stored "near" each other. Those properties make for fast retreival of the X newest or oldest rows, since they will likely be co-located on the same data pages (reduces I/O).
It would be something like this:
create table mytab(
id int not null auto_increment
,the int
,rest varchar
,of char
,your tinyint
,columns varchar
,primary key(id)
)Engine=InnoDB;
To get the 10 latest rows added, you would query it like:
select *
from mytab
order
by id desc
limit 10;
Note that even if you are deleting the rows, the ID will keep on increasing. So if the MAX(id) is 5000, it doesn't mean you have 5000 rows.

Related

Covering index including rowversion? Good or bad

I'm working on synchronizing clients with data for eventual consistency. The server will publish a list of database ids and rowversion/timestamp. Client will then request data with incorrect version number. The primary reason for inconsistent data is networking issues between broker nodes, split brain, etc.
When I read data from my tables, I request data based on a predicate that is not the primary key.
I iterate available regions to read data per region. This is my select:
SELECT DatabaseId, VersionTimestamp, OperationId
FROM TableX
WHERE RegionId = 1
Since this leads to an index scan per query, I'm wondering if a non-clustered index on my RegionId column, and include the selected columns in that index:
CREATE NONCLUSTERED INDEX [ID_TableX_RegionId_Sync]
ON [dbo].[TableX] ([RegionId])
INCLUDE ([DatabaseId],[VersionTimestamp],[OperationId])
VersionTimestamp is rowversion/timestamp column, and will of course change whenever a row is updated, so I'm wondering if it is a poor design choice to include this column in an index since it will need to be updated at every insert/update/delete?
Since this will result in n index scans, rather than n index seeks, it might be better to read all the data once, and then group by regionId and fill in empty lists of rows where a regionId doesn't have any data.
The real life scenario is a bit more complicated, as there are table relationships that will also have to be queried. I haven not yet looked at including one to many relationships in my version queries.
This is primarily about better understanding the impact of covering indexes and figuring out how to better use them. Since I am going to read all the data from the table in any case, it is probably cheaper to load them all at once. However, reading them as from the query above, it makes my code a lot cleaner for this simple no-relationship example alone.
Edit:
Alternative 2
Another option that came to mind, is creating a covering index on RegionId, and include my primary key (DatabaseId).
SELECT DatabaseId
FROM TableX WHERE RegionId=1
And then a new query where I select the needed columns WHERE DatabaseId IN(list, of, databaseId)
For the current scenario, there are only max thousands of rows in the table, and not in the millions. Network traffic for the two (x n) queries might most likely outweigh the benefits of using indexes, and be premature optimization.

Querying a High Cardinality Field

I am designing a data model for our orders for our upcoming Cassandra migration. An order has an orderId (arcane UUID field) and an orderNumber (user-friendly number). A getOrder query can be done by using any of the two.
My partition key is the orderId, so getByOrderId is not a problem. By getByOrderNumber is - there's a one-to-one mapping b/w the orderId and the orderNumber (high-cardinality field), so creating a local secondary index on each node would slow down my queries.
What I was wondering was that I could create a new table with the orderNumber as the partition key and the orderId as the only column (kind of a secondary index but maintained by me). So now, a getByOrderNumber query can be resolved in two calls.
Bear with me if the above solution is egregiously wrong, I am extremely new to Cassandra. As I understand, for such a column, if I used local secondary indices, Cassandra would have to query each node for a single order. So I thought why not create another table that stores the mapping.
What would I be missing on by managing this index myself? One thing I can see if for every write, I'll now have to update two tables. Anything else?
I thought why not create another table that stores the mapping.
That's okay. From Cassandra documentation:
Do not use an index in these situations:
On high-cardinality columns because you then query a huge volume of
records for a small number of results. See Problems using a
high-cardinality column index below.
Problems using a high-cardinality column index
If you create an index on a high-cardinality column, which has many
distinct values, a query between the fields incurs many seeks for very
few results. In the table with a billion songs, looking up songs by
writer (a value that is typically unique for each song) instead of by
their recording artist is likely to be very inefficient..
It would probably be more efficient to manually maintain the table as
a form of an index instead of using the built-in index. For columns
containing unique data, it is sometimes fine performance-wise to use
an index for convenience, as long as the query volume to the table
having an indexed column is moderate and not under constant load.
Conversely, creating an index on an extremely low-cardinality column,
such as a boolean column, does not make sense. Each value in the index
becomes a single row in the index, resulting in a huge row for all the
false values, for example. Indexing a multitude of indexed columns
having foo = true and foo = false is not useful.
It's normal for Cassandra data modelling to have a denormalized data.

Clustered index consideration in regards to distinct valus and large result sets and a single vertical table for auditing

I've been researching best practices for creating clustered indexes and I'm just trying to totally understand these two suggestions that's listed with pretty much every BLOG or article on the matter
Columns that contain a large number of distinct values.
Queries that return large result sets.
These seem to be slightly contrary or I'm guessing maybe it just depends on how you're accessing the table.. Or my interpretation of what "large result sets" mean is wrong....
Unless you're doing range queries over the clustered column it seems like you typically won't be getting large result sets that matter. So in cases where SQL Server defaults the clustered indexes on the PK you're rarely going to fulfill the large result set suggestion but of course it does the large number of distinct values..
To give the question a little more context. This quetion stems from a vertical auditing table we have that has a column for TABLE.... Every single query that's written against this table has a
WHERE TABLE = 'TABLENAME'
But the TableName is highly non distinct... Each result set of tablenames is rather large which seems to fulfill that second conditon but it's definitely not largerly unique.... Which means all that other stuff happens with having to add the 4 byte Uniquifer (sp?) which makes the table a lot larger etc...
This situation has come up a few times for me when I've come upon DBs that have say all the contact or some accounts normalized into a single table and they are only separated by a TYPE parameter. Which is on every query....
In the case of the audit table the queries are typically not that exciting either they are just sorted by date modified, sometimes filtered by column, user that made the change etc...
My other thought with this auditing scenario was to just make the auditing table a HEAP so that inserting is fast so there's not contention between tables being audited and then to generate indexed views over the data ...
Index design is just as much art as it is science.
There are many things to consider, including:
How the table will be accessed most often: mostly inserts? any updates? more SELECTs than DML statements? Any audit table will likely have mostly inserts, no updates, rarely deletes unless there is a time-limit on the data, and some SELECTs.
For Clustered indexes, keep in mind that the data in each column of the clustered index will be copied into each non-clustered index (though not for UNIQUE indexes, I believe). This is helpful as those values are available to queries using the non-clustered index for covering, etc. But it also means that the physical space taken up by the non-clustered indexes will be that much larger.
Clustered indexes generally should either be declared with the UNIQUE keyword or be the Primary Key (though there are exceptions, of course). A non-unique clustered index will have a hidden 4-byte field called a uniqueifier that is required to make each row with a non-unique key value addressable, and is just wasted space given that the order of your rows within the non-unique groupings is not apparently obvious so trying to narrow down to a single row is still a range.
As is mentioned everywhere, the clustered index is the physical ordering of the data so you want to cater to what needs the best I/O. This relates also to the point directly above where non-unique clustered indexes have an order but if the data is truly non-unique (as opposed to unique data but missing the UNIQUE keyword when the index was created) then you miss out on a lot of the benefit of having the data physically ordered.
Regardless of any information or theory, TEST TEST TEST. There are many more factors involved that pertain to your specific situation.
So, you mentioned having a Date field as well as the TableName. If the combination of the Date and TableName is unique then those should be used as a composite key on a PK or UNIQUE CLUSTERED index. If they are not then find another field that creates the uniqueness, such as UserIDModified.
While most recommendations are to have the most unique field as the first one (due to statistics being only on the first field), this doesn't hold true for all situations. Given that all of your queries are by TableName, I would opt for putting that field first to make use of the physical ordering of the data. This way SQL Server can read more relevant data per read without having to seek to other locations on disk. You would likely also being ordering on the Date so I would put that field second. Putting TableName first will cause higher fragmentation across INSERTs than putting the Date first, but upon an index rebuild the data access will be faster as the data is already both grouped ( TableName ) and ordered ( Date ) as the queries expect. If you put Date first then the data is still ordered properly but the rows needed to satisfy the query are likely spread out across the datafile(s) which would require more I/O to get. AND, more data pages to satisfy the same query means more pages in the Buffer Pool, potentially pushing out other pages and reducing Page Life Expectancy (PLE). Also, you would then really need to inculde the Date field in all queries as any queries using only TableName (and possibly other filters but NOT using the Date field) will have to scan the clustered index or force you to create a nonclustered index with TableName being first.
I would be weary of the Heap plus Indexed View model. Yes, it might be optimized for the inserts but the system still needs to maintain the data in the indexed view across all DML statements against the heap. Again you would need to test, but I don't see that being materially better than a good choice of fields for a clustered index on the audit table.

Database optimization: Hashing all the values

Typically, the databases are designed as below to allow multiple types for an entity.
Entity Name
Type
Additional info
Entity name can be something like account number and type could be like savings,current etc in a bank database for example.
Mostly, type will be some kind of string. There could be additional information associated with an entity type.
Normally queries will be posed like this.
Find account numbers of this particular type?
Find account numbers of type X, having balance greater than 1 million?
To answer these queries, query analyzer will scan the index if the index is associated with a particular column. Otherwise, it will do a full scan of all the rows.
I am thinking about the below optimization.
Why not we store the hash or integral value of each column data in the actual table such that the ordering property is maintained, so that it will be easy for comparison.
It has below advantages.
1. Table size will be lot less because we will be storing small size values for each column data.
2. We can construct a clustered B+ tree index on the hash values for each column to retrieve the corresponding rows matching or greater or smaller than some value.
3. The corresponding values can be easily retrieved by having B+ tree index in the main memory and retrieving the corresponding values.
4. Infrequent values will never need to retrieved.
I am still having more optimizations in my mind. I will post those based on the feedback to this question.
I am not sure if this is already implemented in database, this is just a thought.
Thank you for reading this.
-- Bala
Update:
I am not trying to emulate what the database does. Normally indexes are created by the database administrator. I am trying to propose a physical schema by having indexes on all the fields in the database, so that database table size is reduced and its easy to answer few queries.
Updates:(Joe's answer)
How does adding indexes to every field reduce the size of the database? You still have to store all of the true values in addition to the hash; we don't just want to query for existence but want to return the actual data.
In a typical table, all the physical data will be stored. But now by generating a hash value on each column data, I am only storing the hash value in the actual table. I agree that its not reducing the size of the database, but its reducing the size of the table. It will be useful when you don't need to return all the column values.
Most RDBMSes answer most queries efficiently now (especially with key indices in place). I'm having a hard time formulating scenarios where your database would be more efficient and save space.
There can be only one clustered index on a table and all other indexes have to unclustered indexes. With my approach I will be having clustered index on all the values of the database. It will improve query performance.
Putting indexes within the physical data -- that doesn't really make sense. The key to indexes' performance is that each index is stored in sorted order. How do you propose doing that across any possible field if they are only stored once in their physical layout? Ultimately, the actual rows have to be sorted by something (in SQL Server, for example, this is the clustered index)?
The basic idea is that instead of creating a separate table for each column for efficient access, we are doing it at the physical level.
Now the table will look like this.
Row1 - OrderedHash(Column1),OrderedHash(Column2),OrderedHash(Column3)
Google for "hash index". For example, in SQL Server such an index is created and queried using the CHECKSUM function.
This is mainly useful when you need to index a column which contains long values, e.g. varchars which are on average more than 100 characters or something like that.
How does adding indexes to every field reduce the size of the database? You still have to store all of the true values in addition to the hash; we don't just want to query for existence but want to return the actual data.
Most RDBMSes answer most queries efficiently now (especially with key indices in place). I'm having a hard time formulating scenarios where your database would be more efficient and save space.
Putting indexes within the physical data -- that doesn't really make sense. The key to indexes' performance is that each index is stored in sorted order. How do you propose doing that across any possible field if they are only stored once in their physical layout? Ultimately, the actual rows have to be sorted by something (in SQL Server, for example, this is the clustered index)?
I don't think your approach is very helpful.
Hash values only help for equality/inequality comparisons, but not less than/greater than comparisons, compared to pretty much every database index.
Even with (in)equality hash functions do not offer 100% guarantee of having given you the right answer, as hash collisions can happen, so you will still have to fetch and compare the original value - boom, you just lost what you wanted to save.
You can have the rows in a table ordered only one way at a time. So if you have an application where you have to order rows differently in different queries (e.g. query A needs a list of customers ordered by their name, query B needs a list of customers ordered by their sales volume), one of those queries will have to access the table out-of-order.
If you don't want the database to have to work around colums you do not use in a query, then use indexes with extra data columns - if your query is ordered according to that index, and your query only uses columns that are in the index (coulmns the index is based on plus columns you have explicitly added into the index), the DBMS will not read the original table.
Etc.

Approaches to table partitioning in SQL Server

The database I'm working with is currently over 100 GiB and promises to grow much larger over the next year or so. I'm trying to design a partitioning scheme that will work with my dataset but thus far have failed miserably. My problem is that queries against this database will typically test the values of multiple columns in this one large table, ending up in result sets that overlap in an unpredictable fashion.
Everyone (the DBAs I'm working with) warns against having tables over a certain size and I've researched and evaluated the solutions I've come across but they all seem to rely on a data characteristic that allows for logical table partitioning. Unfortunately, I do not see a way to achieve that given the structure of my tables.
Here's the structure of our two main tables to put this into perspective.
Table: Case
Columns:
Year
Type
Status
UniqueIdentifier
PrimaryKey
etc.
Table: Case_Participant
Columns:
Case.PrimaryKey
LastName
FirstName
SSN
DLN
OtherUniqueIdentifiers
Note that any of the columns above can be used as query parameters.
Rather than guess, measure. Collect statistics of usage (queries run), look at the engine own statistics like sys.dm_db_index_usage_stats and then you make an informed decision: the partition that bests balances data size and gives best affinity for the most often run queries will be a good candidate. Of course you'll have to compromise.
Also don't forget that partitioning is per index (where 'table' = one of the indexes), not per table, so the question is not what to partition on, but which indexes to partition or not and what partitioning function to use. Your clustered indexes on the two tables are going to be the most likely candidates obviously (not much sense to partition just a non-clustered index and not partition the clustered one) so, unless you're considering redesign of your clustered keys, the question is really what partitioning function to choose for your clustered indexes.
If I'd venture a guess I'd say that for any data that accumulates over time (like 'cases' with a 'year') the most natural partition is the sliding window.
If you have no other choice you can partition by key module the number of partition tables.
Lets say that you want to partition to 10 tables.
You will define tables:
Case00
Case01
...
Case09
And partition you data by UniqueIdentifier or PrimaryKey module 10 and place each record in the corresponding table (Depending on your unique UniqueIdentifier you might need to start manual allocation of ids).
When performing a query, you will need to run same query on all tables, and use UNION to merge the result set into a single query result.
It's not as good as partitioning the tables based on some logical separation which corresponds to the expected query, but it's better then hitting the size limit of a table.
Another possible thing to look at (before partitioning) is your model.
Are you in a normalized database? Are there further steps which could improve performance by different choices in the normalization/de-/partial-normalization? Are there options to transform the data into a Kimball-style dimensional star model which is optimal for reporting/querying?
If you aren't going to drop partitions of the table (sliding window, as mentioned) or treat different partitions differently (you say any columns can be used in the query), I'm not sure what you are trying to get out of the partitioning that you won't already get out of your indexing strategy.
I'm not aware of any table limits on rows. AFAIK, the number of rows is limited only by available storage.

Resources