How to implement database engine independent paging?

Task: implement paging of database records suitable for different RDBMSs. The method should work for mainstream engines: MSSQL 2000+, Oracle, MySQL, etc.
Please don't post RDBMS-specific solutions; I know how to implement this for most of the modern database engines. I'm looking for a universal solution. Only temporary-table-based solutions come to mind at the moment.
EDIT:
I'm looking for an SQL solution, not a 3rd-party library.

There would be a universal solution if the SQL specification had included paging as part of the standard. Paging support is not among the requirements a language must meet to be called an RDBMS language.
Many database products support SQL with proprietary extensions to the standard language. Some of them support paging, such as MySQL with its LIMIT clause and Oracle with ROWNUM; each is handled differently. Other DBMSs require you to add a field called rowid or something like that.
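For illustration, here is a hedged sketch of how two of those engines would page the same query (the employees table and name column are made-up names):
-- MySQL: rows 21-30 of the ordered result
SELECT name FROM employees ORDER BY name LIMIT 10 OFFSET 20;
-- Oracle: the classic ROWNUM wrapper for the same page
SELECT name FROM (
    SELECT name, ROWNUM AS rn FROM (
        SELECT name FROM employees ORDER BY name
    ) WHERE ROWNUM <= 30
) WHERE rn > 20;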
I don't think you can have a universal solution (anyone is free to prove me wrong here; open to debate) unless it is built into the database system itself, or unless a company that uses, say, Oracle, MySQL, and SQL Server decides to have its database developers implement paging on each system behind a universal interface for the code that uses it.

The most natural and efficient way to do paging is with the LIMIT/OFFSET construct (TOP in the Sybase world). A DB-independent way would have to know which engine it's running on and apply the proper SQL construct.
At least, that's the way I've seen it done in DB independent libraries' code. You can abstract away the paging logic once you get the data from the engine with the specific query.
If you really are looking for a single-SQL-statement solution, could you show what you have in mind, like the SQL for the temp table solution? That would probably get you more relevant suggestions.
EDIT:
I wanted to see what you were thinking because I couldn't see a way to do it with temp tables without using an engine-specific construct. You used specific constructs in the example. I still don't see a way to implement paging in the database with only (implemented) standard SQL. You could bring in the whole table with standard SQL and page in the application, but that is obviously stupid.
So the question would now be more like "Is there a way to implement paging without using LIMIT/OFFSET or equivalent?", and I guess the answer is "Sanely, no." You could try using cursors, but you'll fall prey to database-specific statements/behavior there as well.
A wacko (read: stupid) idea that just occurred to me would be to add a page column to the table, say create table test (id int, name varchar, phone varchar, page int), and then you could get page 1 with select * from test where page = 1. But that means adding code to maintain that column, which, again, could only be done by either bringing in the whole table or using database-specific constructs. That's besides having to add a different column for each possible ordering, among many other flaws.
I can't provide proof, but I really think you just can't do it sanely.

Proceed as usual:
Start by implementing it according to the standard, then handle the corner cases, i.e. the DBMSs that don't implement the standard. How to handle the corner cases depends on your development environment.
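For what it's worth, a minimal sketch of the standard form (OFFSET ... FETCH was added in the 2008/2011 revisions of the SQL standard; the table and column names here are made up):
-- Skip 20 rows, return the next 10, in standard SQL
SELECT name
FROM employees
ORDER BY name
OFFSET 20 ROWS
FETCH NEXT 10 ROWS ONLY;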
You are looking for a "universal" approach. The most universal way to paginate is through the use of cursors, but cursor-based pagination doesn't fit very well with a non-stateful environment like a web application.
I've written about the standard and the implementations (including cursors) here:
http://troels.arvin.dk/db/rdbms/#select-limit-offset

SubSonic can do this for you if you can tolerate Open Source...
http://subsonicproject.com/querying/webcast-using-paging/
Other than that, I know NHibernate does as well.

JPA lets you do it with the Query class:
Query q = ...; // any JPA query
q.setFirstResult(0);  // offset: index of the first result to retrieve
q.setMaxResults(10);  // page size: maximum number of results to retrieve
gives you the first 10 results in the result set.
If you want a DBMS independent raw SQL solution, I'm afraid you're out of luck. All the vendors do it differently.

#Vinko Vrsalovic,
as I wrote in the question, I know how to do it in most DBs. I want to find a universal solution or get a proof that it doesn't exist.
Here is one stupid solution based on temporary table. It's obviously bad, so no need to comment on it.
N - upper bound
M - lower bound
create table #temp (Id int identity, originalId int)
insert into #temp (originalId)
select top N KeyColumn from MyTable
where ...
select MyTable.* from MyTable
join #temp t on t.originalId = MyTable.KeyColumn
where t.Id between M and N
order by t.Id asc
drop table #temp

Related

Databases: Effectively implement string contains query

I need a way to effectively do a string contains query like:
# In SQL
LIKE '%some-string%'
# In mongo
{ $regex: /some-string/ }
But it's very slow when the dataset size is big. E.g. I tried in a dummy DB (with and without an index; no index is surprisingly faster on Mongo) and generated 100m rows (in reality there's more). It seems reasonable if I use ElasticSearch, but I am wondering if there's a DB, or a way I can structure my data, to optimise this use case? I asked, and I really need contains instead of a prefix match ...
PostgreSQL offers so-called trigram indexes. Those indexes can accelerate SQL col LIKE '%search%' predicates efficiently enough. Notice that indexing can, in all makes of server, speed up col LIKE 'string%' (without the leading wildcard character).
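A hedged sketch of the trigram approach (the docs table and body column are assumptions):
-- PostgreSQL: enable the extension and build a trigram index
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX docs_body_trgm_idx ON docs USING gin (body gin_trgm_ops);
-- This predicate can now use the index despite the leading wildcard
SELECT * FROM docs WHERE body LIKE '%some-string%';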
MySQL / MariaDB have FULLTEXT indexes that work with a distinctive SQL syntax. That feature works word-by-word, unlike LIKE, which works character-by-character. Microsoft SQL Server has a similar feature with a different syntax; it also works word-by-word.
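For comparison, a sketch of the MySQL/MariaDB flavour (same assumed table):
-- Word-by-word matching via a FULLTEXT index
ALTER TABLE docs ADD FULLTEXT INDEX ft_body (body);
SELECT * FROM docs WHERE MATCH(body) AGAINST ('some-string');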
So, there's no SQL standard way to do this efficiently, and different makes of server do it differently.
If you haven't yet chosen a particular make of server, you should figure out whether one of the full text schemes will serve your purpose. If you must get good performance from LIKE, PostgreSQL's trigram indexing is the way to go.
There's no general solution to this that works for all database systems, I think. As another answer already explains, there are fulltext search extensions for a lot of popular database systems that, while far from being able to do what the likes of Lucene/ElasticSearch can do, should be enough to massively speed up your use case.
Let me explain this from a database internals perspective. Let's say that your selectivity is high, a.k.a. only a very small percentage of your tuples actually match your condition; then you would generally want some kind of index structure. The kind of index structure you would need for this kind of query is some kind of radix tree/trie, but that's not a standard data structure implemented in all SQL databases. The only data structure that is actually implemented in almost all SQL databases is a B-tree. But a B-tree can only do prefix queries, something like LIKE 'test%'. The only chance you have for LIKE '%test%', if your database doesn't have such indexes, is having a very fast runtime system, which none of the traditional (open source) database systems has...
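To make the B-tree point concrete (table and index names are made up):
CREATE INDEX users_name_idx ON users (name);
SELECT * FROM users WHERE name LIKE 'test%';   -- prefix query: index range scan possible
SELECT * FROM users WHERE name LIKE '%test%';  -- leading wildcard: falls back to a full scan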

Does writing the full path in SELECT statements enhance performance in SQL?

Is the performance of queries impacted when writing the full query path? And what is the best practice when writing such queries? Assume the script is way more complex and longer than the following.
Example #1:
SELECT Databasename.Tablename.NameofColumn
FROM databasename.tablename
Example #2:
SELECT NameofColumn
FROM tablename
OR using aliases - example #3:
SELECT t.NameofColumn
FROM tablename t
There are a number of considerations when you're writing queries that are going to be released into a production environment, and how and when to use fully qualified names is one of those considerations.
A fully qualified table name has four parts: [Server].[Database].[Schema].[Table]. You missed Schema in your examples above, but it's actually the one that makes the most difference. SQL Server will allow you to have objects with the same name in different schemas; so you could have dbo.myTable and staging.myTable in the same database. SQL Server doesn't care, but your query probably does.
Even if there aren't identically named objects, adding the schema still helps the engine find the object you're querying a little bit faster, so there's your performance boost, albeit a small one, and only in the compile/execution plan step.
Besides performance, though, you need to worry about readability for your own sake when you need to revisit your code, and conventionality for when somebody else needs to look at your code. Conventions vary slightly from shop to shop, but here are a couple of generalities that will at least make your code easier to look at, say, on Stack Overflow.
1. Use table aliases.
This gets almost unreadable after about three column names:
SELECT
SchemaName.Tablename.NameofColumn1,
SchemaName.Tablename.NameofColumn2,
SchemaName.Tablename.NameofColumn3
FROM SchemaName.TableName
This is just easier on the brain:
SELECT
tn.NameofColumn1,
tn.NameofColumn2,
tn.NameofColumn3
FROM SchemaName.TableName as tn
2. Put the alias in front of every column reference, everywhere in your query.
There should never be any ambiguity about which table a particular column is coming from, either for you, when you're trying to troubleshoot it at 3:00 AM, or for anyone else, when you're sipping margaritas on the beach and your buddy's on call for you.
3. Make your aliases meaningful.
Again, it's about keeping things straight in your head later on. Aaron Bertrand wrote the definitive post on it almost ten years ago now.
4. Include the database name in the FROM clause if you want, but...
If you have to restore a database using a different name, your procedures won't run. In my shop, we prefer a USE statement at the top of each proc. Fewer places to change a name if need be.
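A sketch of that convention, with hypothetical names throughout:
USE MyDatabase;  -- the one line to change if the DB is restored under a different name
GO
CREATE PROCEDURE dbo.usp_GetColumns
AS
BEGIN
    SELECT tn.NameofColumn1
    FROM SchemaName.TableName AS tn;
END
GO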
tl;dr
Your example #3 is pretty close. Just add the table schema to the FROM clause.
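In other words, something like this (assuming the dbo schema; adjust to your actual schema name):
SELECT t.NameofColumn
FROM dbo.tablename AS t;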

Is this a "correct" database design?

I'm working with the new version of a third party application. In this version, the database structure is changed, they say "to improve performance".
The old version of the DB had a general structure like this:
TABLE ENTITY
(
ENTITY_ID,
STANDARD_PROPERTY_1,
STANDARD_PROPERTY_2,
STANDARD_PROPERTY_3,
...
)
TABLE ENTITY_PROPERTIES
(
ENTITY_ID,
PROPERTY_KEY,
PROPERTY_VALUE
)
so we had a main table with fields for the basic properties and a separate table to manage custom properties added by user.
The new version of the DB instead has a structure like this:
TABLE ENTITY
(
ENTITY_ID,
STANDARD_PROPERTY_1,
STANDARD_PROPERTY_2,
STANDARD_PROPERTY_3,
...
)
TABLE ENTITY_PROPERTIES_n
(
ENTITY_ID_n,
CUSTOM_PROPERTY_1,
CUSTOM_PROPERTY_2,
CUSTOM_PROPERTY_3,
...
)
So, now when the user adds a custom property, a new column is added to the current ENTITY_PROPERTIES_n table until the max number of columns (managed by the application) is reached; then a new table is created.
So, my question is: is this a correct way to design a DB structure? Is this the only way to "increase performance"? The old structure required many joins or sub-selects, but this structure doesn't seem very smart (or even correct) to me...
I have seen this done before on the assumed (often unproven) "expense" of joining - it is basically turning a row-heavy data table into a column-heavy table. They ran into their own limitation, as you imply, by creating new tables when they run out of columns.
I completely disagree with it.
Personally, I would stick with the old structure and re-evaluate the performance issues. That isn't to say the old way is the correct way, it is just marginally better than the "improvement" in my opinion, and removes the need to do large scale re-engineering of database tables and DAL code.
These tables strike me as largely static... caching would be an even better performance improvement without mutilating the database and one I would look at doing first. Do the "expensive" fetch once and stick it in memory somewhere, then forget about your troubles (note, I am making light of the need to manage the Cache, but static data is one of the easiest to manage).
Or, wait for the day you run into the maximum number of tables per database :-)
Others have suggested completely different stores. This is a perfectly viable possibility and if I didn't have an existing database structure I would be considering it too. That said, I see no reason why this structure can't fit into an RDBMS. I have seen it done on almost all large scale apps I have worked on. Interestingly enough, they all went down a similar route and all were mostly "successful" implementations.
No, it's not. It's terrible.
until the max number of column (handled by application) is reached,
then a new table is created.
This sentence says it all. Under no circumstance should an application dynamically create tables. The "old" approach isn't ideal either, but since you have the requirement to let users add custom properties, it has to be like this.
Consider this:
You lose all type-safety as you have to store all values in the column "PROPERTY_VALUE"
Depending on your users, you could have them change the schema beforehand and then let them run some kind of database update batch job, so at least all the properties would be declared in the right datatype. Also, you could lose the entity_id/key thing.
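A hedged sketch of what such a batch job could look like (the 'birthday' property and its date type are made-up examples):
-- Promote one custom property to a properly typed column
ALTER TABLE ENTITY ADD BIRTHDAY date;
UPDATE ENTITY
SET BIRTHDAY = (
    SELECT CAST(p.PROPERTY_VALUE AS date)
    FROM ENTITY_PROPERTIES p
    WHERE p.ENTITY_ID = ENTITY.ENTITY_ID
      AND p.PROPERTY_KEY = 'birthday'
);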
Check out this: http://en.wikipedia.org/wiki/Inner-platform_effect. This certainly reeks of it.
Maybe a RDBMS isn't the right thing for your app. Consider using a key/value based store like MongoDB or another NoSQL database. (http://nosql-database.org/)
From what I know of databases (but I'm certainly not the most experienced), it seems quite a bad idea to do that in your database. If you already know the max number of custom properties a user might have, I'd say you'd better set the table's number of columns to that value.
Then again, I'm not an expert, but making new columns on the fly isn't the kind of operation databases like. It's gonna bring you more trouble than anything.
If I were you, I'd either fix the number of custom properties or stick with the old system.
I believe creating a new table for each entity to store properties is a bad design, as you could end up bulking up the database with tables. The only pro of the second method is that you are not traversing all of the redundant rows that do not apply to the entity selected. However, using indexes on the original ENTITY_PROPERTIES table could help greatly with performance.
I would personally stick with your initial design, apply indexes, and let the database engine determine the best methods for selecting the data, rather than separating each entity property into a new table.
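For instance, a composite index covering the typical EAV lookup path (the index name is arbitrary):
CREATE INDEX IX_ENTITY_PROPERTIES_LOOKUP
    ON ENTITY_PROPERTIES (ENTITY_ID, PROPERTY_KEY);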
There is no "correct" way to design a database - I'm not aware of a universally recognized set of standards other than the famous "normal form" theory; many database designs ignore this standard for performance reasons.
There are ways of evaluating database designs though - performance, maintainability, intelligibility, etc. Quite often, you have to trade these against each other; that's what your change seems to be doing - trading maintainability and intelligibility against performance.
So, the best way to find out if that was a good trade off is to see if the performance gains have materialized. The best way to find that out is to create the proposed schema, load it with a representative dataset, and write queries you will need to run in production.
I'm guessing that the new design will not be perceivably faster for queries like "find STANDARD_PROPERTY_1 from entity where STANDARD_PROPERTY_1 = 'banana'".
I'm guessing it will not be perceivably faster when retrieving all properties for a given entity; in fact it might be slightly slower, because instead of a single join to ENTITY_PROPERTIES, the new design requires joins to several tables. You will be returning "sparse" results - presumably, not all entities will have values in the property_n columns in all ENTITY_PROPERTIES_n tables.
Where the new design may be significantly faster is when you need a compound where clause on custom properties. For instance, finding an entity where custom property 1 is true, custom property 2 is banana, and custom property 3 is not in ('kylie', 'pussycat dolls', 'giraffe') is (probably) faster when you can specify columns in the ENTITY_PROPERTIES_n tables instead of rows in the ENTITY_PROPERTIES table. Probably.
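To make that concrete, a sketch of the two shapes of that compound query (the property keys and values are made up):
-- EAV design: one self-join per custom property
SELECT e.ENTITY_ID
FROM ENTITY e
JOIN ENTITY_PROPERTIES p1 ON p1.ENTITY_ID = e.ENTITY_ID
  AND p1.PROPERTY_KEY = 'prop1' AND p1.PROPERTY_VALUE = 'true'
JOIN ENTITY_PROPERTIES p2 ON p2.ENTITY_ID = e.ENTITY_ID
  AND p2.PROPERTY_KEY = 'prop2' AND p2.PROPERTY_VALUE = 'banana';
-- Column-per-property design: a plain WHERE clause on one table
SELECT e.ENTITY_ID
FROM ENTITY e
JOIN ENTITY_PROPERTIES_1 p ON p.ENTITY_ID_1 = e.ENTITY_ID
WHERE p.CUSTOM_PROPERTY_1 = 'true'
  AND p.CUSTOM_PROPERTY_2 = 'banana';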
As for maintainability - yuck. Your database access code now needs to be far smarter, knowing which table holds which property, and how many columns are too many. The likelihood of entertaining bugs is high - there are more moving parts, and I can't think of any obvious unit tests to make sure that the database access logic is working.
Intelligibility is another concern - this solution is not in most developers' toolbox, it's not an industry-standard pattern. The old solution is pretty widely known - commonly referred to as "entity-attribute-value". This becomes a major issue on long-lived projects where you can't guarantee that the original development team will hang around.

Which relational databases exist with a public API for a high level language? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 4 years ago.
We typically interface with an RDBMS through SQL. That is, we create an SQL string and send it to the server through JDBC or ODBC or something similar.
Are there any RDBMS that allow direct interfacing with the database engine through some API in Java, C#, C or similar? I would expect an API that allows constructs like this (in some arbitrary pseudo code):
Iterator iter = engine.getIndex("myIndex").getReferencesForValue("23");
for (Reference ref : iter) {
    Row row = engine.getTable("mytable").getRow(ref);
}
I guess something like this is hidden somewhere in (and available from) open source databases, but I am looking for something that is officially supported as a public API, so one finds at least a note in the release notes, when it changes.
In order to make this a question that actually has a 'best' answer: I prefer languages in the order given above and I will prefer mature APIs over prototypes and research work, although these are welcome as well.
------------------ Update ----------------
Looks like I haven't been clear enough.
What I am looking at is a lower-level API, sort of what the RDBMS probably uses internally. RDBMSs have the concept of an execution plan, and the API I am looking for would allow us to actually execute an execution plan without specifying the intended result using SQL or similar.
The very vague idea behind this is to implement a DSL which translates directly to RDBMS system calls, without going through SQL or similar.
Trying to explain it in yet another way: when e.g. Oracle gets fed an SQL statement, it parses that statement, creates an execution plan out of it, and finally executes the execution plan using some internal API, which probably allows things like retrieving a specific row from a table, retrieving a range of rowids from an index, joining two sets of rows using a hash join, and so on. I am looking for that API (or something similar for an RDBMS where this is available).
---------- Another update after comment by Neil ----------------
I think it would be appropriate to consider the API I am looking for the 'ISAM' level as in the second bullet point on this article: http://en.wikipedia.org/wiki/ISAM
You may want to check out the following Wikipedia article for a list of interfacing alternatives to SQL for relational databases:
Wikipedia: SQL: Alternatives to SQL
The list includes (in alphabetical order):
.QL - object-oriented Datalog
4D Query Language (4D QL)
Datalog
Hibernate Query Language (HQL) - A Java-based tool that uses modified SQL
IBM Business System 12 (IBM BS12)
ISBL
Java Persistence Query Language (JPQL)
LINQ
Object Query Language
QBE (Query By Example)
Quel
Tutorial D
XQuery
There is LINQ for .NET/C#. An example of what that would look like is:
var results = from row in db.SomeTable
              where row.Key == 23
              select row;
Which you can write in "natural" C# like so (that is, the above is syntactic sugar for):
var results = db.SomeTable.Where(row => row.Key == 23);
JDBC and ODBC already have what you need.
Try looking at cursors, for example this.
The syntax will not be as compact as in LINQ for example, but you'll get your rows through which you'll be able to iterate.
You might also want to check for object relational mapping, because I assume that's what you are ultimately looking for.
Furthermore, you will probably find that LINQ and other approaches work well with single-row updates and simple retrievals, but that for general database work (multiple-row updates, complicated retrievals, aggregates, etc.) there is nothing like SQL. SQL is and will remain the most natural way to communicate with an RDBMS, since it is its native language.
If you don't want to use SQL you will have to either use
a) one of the approaches described here or here. Keep in mind that these are full-blown APIs which map to SQL, and that in the end they are essentially no simpler; where they differ in paradigm, they are usually not as effective (unnatural for an RDBMS).
b) write your own data abstraction layer (possibly using one of the frameworks mentioned under a), providing natural methods for your objects to talk to the database. This way you'll get the best of both worlds and exactly what you need, though this really shines in larger projects.
The stuff I was looking for is called ISAM, as correctly pointed out by Neil Butterworth.
So the Wikipedia article http://en.wikipedia.org/wiki/Isam gives some points to start further research.
If you happen to find this useful, please upvote Neil's comment above.

Relational database data explorer / visualization?

Is there a tool that can let one browse relational data as a graph of connected nodes?
For example, I'm faced with trying to cleanse some anomalous data. I can start with two offending rows. In this particular example, the TransactionID should, by business rules, be unique to the table, but I find a transaction that violates that rule:
SELECT * FROM LCTTrans
WHERE TransactionID = 1075048
LCTID     TransactionID
========= =============
4358      1075048
4359      1075048
2 row(s) affected
But really what I want is to begin to hunt down all the related data, to try to see which is right. So this hypothetical software would start by showing me these two rows:
Next, I want to see the transaction that is linked into this table:
Now that transaction points to an MAL, so show me that:
Now let's add those two LCTs that the transaction is "on". A transaction can be on only one LCT, yet this one is pointing to two:
Okay computer, both of those LCTs point to an MAL and the transaction that created them, show me those:
Those last two transactions, they also point at an MAL, and they themselves point to an LCT, show me those:
Okay, now are there any entries in LCTTrans that point to LCTs 4358 or 4359?...
And so on, and so on.
Now, I did all this manually: running single selects, copying and pasting uniqueidentifier keys, and converting them into friendly ID numbers so I could easily see the relationships.
Is there software that can do this?
Ok, well I liked this idea so much that I've written it.
It's not released yet, but when it is it will be free.
Edit
Ok, it's now released. Free relational database exploring goodness # http://www.atlantis-interactive.co.uk/products/datasurf/default.aspx
Edit
Although initially free, this is now part of Pragmatic Works' DBA xPress package.
DBeauty is a powerful data browser (similar to Matt Whitfield's excellent DataSurf but more powerful). It is Java based, so you need to download the JDBC driver. I've found this tool invaluable in quickly navigating data (I fell in love with Microsoft's Quadrant before they killed it off and have been looking for a replacement ever since).
Old but good and free DB subsetting tool Jailer should be able to answer the question.
http://jailer.sourceforge.net/
Yes, I would advise you to look into DbSchema; it's a neat database management tool that will help you.
I can think of a few for relational data (RDF, Topic Map and conceptual graph browsers), but none off-hand for SQL. You could try and translate your queries to a relational language the browsers understand. You also might be able to build something on top of skyrails. Most of the visualisations I've tagged on delicious are for graph or relational data, but again tend to be schema free rather than SQL.
Basically you write a dedup tool where you show both records on the screen side by side, with the ability to pick the record you want to keep but also to check individual data from the other record to keep. Since deduping is very different from database to database, and highly dependent on the specific table structure and business rules you have (as well as knowledge about which things must be looked at for the type of deduping you are doing, as they typically only show the most important relationship tables on screen), I have never seen one that wasn't built in house.
But if you want a quick look at all the data, write a query that left joins to all the child tables and shows all the fields for both transaction IDs. Then read through your results.
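A hedged sketch of that inspection query (ChildTableA and ChildTableB are placeholders for your actual child tables):
SELECT lt.LCTID, lt.TransactionID, a.*, b.*
FROM LCTTrans lt
LEFT JOIN ChildTableA a ON a.LCTID = lt.LCTID
LEFT JOIN ChildTableB b ON b.TransactionID = lt.TransactionID
WHERE lt.TransactionID = 1075048;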
More importantly, how did you end up with a dup if you have a business rule that requires the TransactionID to be unique? Did you forget that all of these types of rules must be enforced through the database and not the application? Why was there no unique index on that field?
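Once the duplicates are resolved, the rule can be enforced in the database so this cannot recur; a minimal sketch:
-- Will fail until the duplicate TransactionIDs are cleaned up
CREATE UNIQUE INDEX UX_LCTTrans_TransactionID ON LCTTrans (TransactionID);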
I've looked for open source software that can do this sort of link analysis, without much success. If you have enough of a budget to go proprietary, you might consider talking to Palantir Technologies, Centrifuge Systems, i2, etc. about analytics platforms and visualization technologies.
Try this tool; it is in Russian, but the interface is comprehensible: http://sourceforge.net/projects/basescan/. Navigation in the database is through drag and drop.
