In-Database Memoization - a good idea? Any experiences?

I have an idea I have yet to implement, because I have some fear I may be barking up the wrong tree... mainly because Googling on the topic returns so few results.
Basically I have some SQL queries that are slow, in large part because they have subqueries that are time-consuming. For example, they might do things like "give me a count of all bicycles that are red and ridden by boys between the ages of 10-15". This is expensive as it sloshes through all of the bicycles, but the end result is a single number. And, in my case, I don't really need that number to be 100% up to date.
The ultimate solution for problems of this sort seems to be to apply an OLAP-based engine to pre-cache these permutations. However, in my case I'm not really trying to slice and dice the data around a ton of metrics, and I'd love not to have to complicate my architecture with yet another process/datastore running.
So... my idea was basically memoizing these subqueries in the database. I might have a table called "BicycleStatistics" and it might store the output of that subquery above as a name/value pair of its inputs and outputs.
Ex name: "c_red_g_male_a_10-15" value: 235
And have a mechanism that memoizes those values to that table as the queries are run.
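Roughly something like this - just a sketch, the table/column names and the cache-key format aren't final, and I've used SQL-Server-ish syntax for concreteness:

CREATE TABLE BicycleStatistics (
    StatName   varchar(200) NOT NULL PRIMARY KEY,  -- encodes the subquery's inputs
    StatValue  bigint       NOT NULL,              -- the memoized result
    ComputedAt datetime     NOT NULL               -- so stale entries can be refreshed
);

-- On a cache miss (or when the row is older than my staleness tolerance),
-- run the expensive subquery once and store the answer under its key
INSERT INTO BicycleStatistics (StatName, StatValue, ComputedAt)
SELECT 'c_red_g_male_a_10-15',
       COUNT(*),
       CURRENT_TIMESTAMP
FROM Bicycles b
JOIN Riders r ON r.RiderId = b.RiderId
WHERE b.Color = 'red'
  AND r.Gender = 'male'
  AND r.Age BETWEEN 10 AND 15;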
Has anyone been in this situation and tried anything similar? The reason I think a solution like this is valuable over the "throw a lot of RAM in your DB and let the database handle it" is (A) my database is bigger than the amount of RAM I can conveniently throw at it, and (B) the database is going to ensure I get the exact right number for these statistics, and my big win, above, is that I'm ok with the numbers being a day or two out of date.
Thanks for any thoughts/feedback.
Tom

Materialized views are a way of achieving this requirement, if your DBMS supports them.
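For example, in PostgreSQL (other engines have equivalents - Oracle materialized views, or indexed views in SQL Server), something along these lines, with made-up table and column names:

-- Precompute the expensive aggregates once
CREATE MATERIALIZED VIEW bicycle_stats AS
SELECT b.color, r.gender, r.age, COUNT(*) AS bike_count
FROM bicycles b
JOIN riders r ON r.rider_id = b.rider_id
GROUP BY b.color, r.gender, r.age;

-- Refresh on whatever schedule matches your staleness tolerance (e.g. nightly)
REFRESH MATERIALIZED VIEW bicycle_stats;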

Related

Strategies for UK Postal Address Matching

I have 2 tables of UK postal addresses (around 300,000 rows each) and need to match one set to another in order to return a unique ID contained in the first set for each address.
The problem is there's a lot of variation in the formats of the addresses and in the spellings.
I've written a lot of T-SQL scripts to pick off the easy matches (exact postcode + house number + street name, etc.) but there are many unmatched records left that are proving difficult to handle. I might end up having as many SQL scripts as there are exceptions!
I've looked at the Levenshtein function and at ranking word for word, but these methods are unreliable and problematic too.
Does anyone have any experience of doing similar work and what was your approach & success rate?
Thank you!
I agree with the commenters that this is largely a business rule thing rather than a programming question, but for what it's worth...
I had a somewhat similar problem with a catalogue many years ago. Entries weren't always consistent in the way we'd hoped, different editions came up weirdly and with a wide variety of variations. All had to be linked.
What I did in the end was a fuzzy matcher. Broke the item down into components. Normalised the data where I could - removing spaces from fields that didn't always have them and could live without them for example. Worked out the distance between near misses - bar and car being 1 apart, for example. I stemmed words - see http://snowball.tartarus.org/algorithms/english/stemmer.html for more info. Think I even played with SQL Server's SOUNDEX matching.
I then went through and scripted the job to produce a list of candidate matches. Anything above a certain level got presented to an administrator, who was shown what the program thought was the best match along with other likely matches. They picked the one that looked best, ticked it and went on to the next one.
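As a rough illustration of the candidate-scoring step (SQL Server flavour, made-up table and column names; DIFFERENCE compares SOUNDEX codes and returns 0-4):

-- Pair up addresses that agree on the cheap, reliable fields,
-- then score the fuzzy field and keep only the likely matches for review
SELECT a.AddressId AS SourceId,
       b.AddressId AS CandidateId,
       DIFFERENCE(a.Street, b.Street) AS StreetScore
FROM SourceAddresses a
JOIN TargetAddresses b
  ON b.Postcode = a.Postcode
 AND b.HouseNumber = a.HouseNumber
WHERE DIFFERENCE(a.Street, b.Street) >= 3
ORDER BY StreetScore DESC;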
At the start of the list everyone thought the job was far too huge to be manageable. They then started going through it, and found it was much faster than they thought and much easier than they'd feared to stay on top of the new data as it came in.
The script to do it all programmatically will never be perfect, and will end up being nearly as long as the source list, given how many exceptions it will have to handle. Don't try to automate it perfectly; automate the easy stuff, put a human in the loop for the uncertain cases. Much easier and safer.

SQLite database questions, problems with design (indexing/multiple fields)

I use Stack Overflow a lot, but this is my first question here, so if I'm doing anything wrong just let me know. I'm not a programmer (I just do programming for my own needs) so I'm open to tutorial suggestions etc. I won't be offended if you just give me something to read and let me find the answers myself.
OK, to the point - I'm trying to write a simple application to track my personal expenses and I have a problem with database design. I'm using Visual Studio to create the database (SQLite). I attached a diagram with my design and I have some questions.
My SQLite diagram
I don't know exactly how to design the "Transactions" table. Fields like Date, Payment Type etc. seem easy enough, but the idea was to store information about transactions in this table, so I need to store multiple products there. I've read about it and created a table "Transactions_Products" that will help with that. My problem is: where do I put the quantity of products in the transaction? I can't think of a place to put it. I tried to find similar databases but couldn't find anything.
Second thing. I've read about indexing a lot, but I still can't grasp the idea. I don't know when to use it. Should I use it only on fields that I will be "querying" a lot?
Last one - is it better for such a small application just for myself to store my account balance in a separate table or should I just calculate it every time?
As I said, I don't need answers like "do this, do that". If you just give me some good tutorials/articles I think I can find the answers on my own, but I couldn't find any. Maybe I'm searching for the wrong thing.
Thank you in advance for any information.
where do I put quantity of products in the transaction?
Transactions is a bad table name as it's vague and has multiple meanings. Consider "payments", "purchase invoices", etc. See https://dba.stackexchange.com/questions/12991/ready-to-use-database-models-example/23831#23831 for some existing patterns.
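As for the quantity: it normally lives on the link table itself, since each row there represents one product on one purchase. A rough SQLite sketch, with illustrative names only:

CREATE TABLE purchases (
    purchase_id   INTEGER PRIMARY KEY,
    purchase_date TEXT NOT NULL,
    payment_type  TEXT NOT NULL
);

CREATE TABLE purchase_lines (
    purchase_id INTEGER NOT NULL REFERENCES purchases (purchase_id),
    product_id  INTEGER NOT NULL REFERENCES products (product_id),
    quantity    INTEGER NOT NULL,   -- how many of this product
    unit_price  REAL    NOT NULL,   -- price at the time of purchase
    PRIMARY KEY (purchase_id, product_id)
);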
Should I use [indexes] only on fields that I will be "querying" a lot?
There's no free lunch. Indexes take space, and can slow down inserts. Start with indexes on your primary keys (which is the default for SQLite), measure what is slow (looking at query plans) and add indexes if they help and if you have room.
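For example, in SQLite (made-up names):

-- See how SQLite plans to execute a query before adding anything
EXPLAIN QUERY PLAN
SELECT * FROM purchases WHERE purchase_date >= '2014-01-01';

-- If that shows a full table scan and the query really is slow,
-- an index on the filtered column may help
CREATE INDEX idx_purchases_date ON purchases (purchase_date);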
is it better for such a small application just for myself to store my account balance in a separate table or should I just calculate it every time?
For an operational/transactional database like you describe, avoid storing calculated values. SQLite can count numbers quickly :)
Premature optimization is premature. Make it work first with full normalization. If you have performance problems, analyze what is really causing the slow-down and go from there.
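For instance, with a schema like the one sketched above, the running total is one query away (column names assumed):

-- Total spent, computed on demand rather than stored anywhere
SELECT COALESCE(SUM(quantity * unit_price), 0) AS total_spent
FROM purchase_lines;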

Searching a nvarchar(max) field

Our application connects to a SQL Server database. There is a column that is nvarchar(max) that has been added and must be included in the search. The number of records in this DB is only in the tens of thousands and there are only a few hundred people using the application. I'm told to explore Full-Text Search; is this necessary?
This is like asking, I work 5 miles away, and I was told to consider buying a car. Is this necessary? Too many variables to give you a simple and correct answer to your question. For example, is it a nice walk? Is there public transit available? Is your schedule flexible? Do you have to go far for lunch or run errands after work?
Full-Text Search can help if your typical searches are going to be WHERE col LIKE '%foo%' - but whether it is necessary depends on how large this column will get, whether your searches are true wildcard searches, your tolerance for millisecond vs. nanosecond queries, the amount of concurrency, even seemingly extraneous stuff like whether the data is always in memory and can be searched more efficiently.
The better answer is that you should try it. Populate a table with a copy of your data, add a full-text index, and see if your typical workload improves by using full-text queries instead of LIKE. It probably will, but there's no way for us to know for sure even if you add more specifics than ballpark row counts.
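A rough sketch of that experiment in T-SQL (object names are placeholders; full-text indexing requires a unique key index on the table):

-- One-time setup: a full-text catalog plus an index on the searchable column
CREATE FULLTEXT CATALOG SearchCatalog AS DEFAULT;
CREATE FULLTEXT INDEX ON dbo.Documents (Body)
    KEY INDEX PK_Documents;  -- the unique index backing the primary key

-- Your existing wildcard search...
SELECT DocumentId FROM dbo.Documents WHERE Body LIKE '%widget%';

-- ...versus the full-text version
SELECT DocumentId FROM dbo.Documents WHERE CONTAINS(Body, '"widget"');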
In a similar situation I ended up making a table structure that was more search friendly and indexable, then setting up a batch job to copy records from the live database to the reporting one.
In my case the original data didn't come close to needing an nvarchar(max) column so I could get away with that. Your mileage may vary. In any case, the answer is "try a few things and see what works for you".

db design: efficiency consideration when adding an intermediate class into a Many-Many relationship

I understand an intermediate class is often introduced to capture information in a situation where, for example, a team has many players, and a player plays for many teams over the years. The intermediate class introduced is Contract, with cardinality as shown:
Team -1----N- Contract -N----1- Player
Let's say however that 98% of all queries only want current information and don't care about historical information. Given the name of a player, they want to know information about his current team, and perhaps current contract.
Given the above relationship, should all the contracts always be looked through to find the current one first, and then from there access information about the team? Or should an optimization be made with direct linkage between the player and his current team?
Thanks
If it is assured that there is only one team for each player at a given time, you just add a currentTeam column to the Player table and that's it. But remember you must update it every time you update the Contracts table! And it must be done within a transaction, so that the database is kept consistent at all times.
You violate some normal form this way, but you know what you are doing and why - for efficiency and optimization. I've used this trick many times.
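Roughly, with assumed table and column names:

-- Keep the denormalized column in sync within the same transaction
BEGIN TRANSACTION;

INSERT INTO Contract (PlayerId, TeamId, StartDate)
VALUES (@PlayerId, @TeamId, @StartDate);

UPDATE Player
SET CurrentTeamId = @TeamId
WHERE PlayerId = @PlayerId;

COMMIT;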
This seems to be under the context of some kind of ORM, so I'll run with that. (Even if it isn't, keep reading.)
Objects are useful for modeling complex operations. For example, adding a new Contract causes all sorts of crazy things to happen to the Team, the Players, and various PayChecks (I made the last one up, but you get the point). This is the perfect kind of thing to handle in code rather than in, say, a hideously complex T-SQL stored procedure.
But when it comes to querying, I find that it often makes sense to write a view/SQL statement/projection that is shamelessly tailored to the set of information that you need to perform a function. As long as you do this for reading data, and not for writing it, then you're not really subverting your object model; you are just looking at it a different way, and making the pragmatic observation that most of the time you only need the information from an IPlayerCurrentContractQuery and not the whole list of Contracts within the Player. Since it is a method that is called a bajillion times, you've written an integration test to make sure that the SQL produces correct results, and you've looked closely at its query plan to make sure that it's not doing awful things like table scans against the database. This commonly-used screen in your app is fast and everyone is happy.
One could make the case that creating such a separate query is a premature optimization, but it probably isn't. I mean, if a player usually only has a few Contracts, then it might not be worth separating out the query and interface. Sucking down all of the Contracts from the database to loop through them and pluck out the current one is going to perform worse than selecting the right one from the database first, but if it's just a handful of Contracts, then a "yeah I'm fully aware it's kinda dumb but it's fast enough" approach is probably good enough, just move on. But if these Contracts stretch back years or are large objects, then separating out the query becomes a no-brainer.
If that starts performing badly because of the joins (which is unlikely unless you start seeing significant traffic), then you add a cache. And if that doesn't work due to lots of writes, then you can start denormalizing your database by adding a direct reference. But unless you are writing the next Facebook of baseball then YAGNI, and at that point you're sharding across servers and throwing away most of the benefits of the relational model anyway so who cares.
A similar situation is posed in my answer to this question.
(If this question isn't about ORM, and really is just about modeling how the tables are designed, then you make sure that you have an index that covers the query that selects the current contract--such as start and stop dates--and you are pretty much done unless you have really exceptional scaling requirements as mentioned above. If you're writing a particular set of joins very often, then you might write a function or stored procedure to remove the boilerplate.)
That's my brain dump. Hope this helps!
Given the above relationship, should all the contracts always be looked through to find the current one first, and then from there access information about the team?
A modern query optimizer will use the most selective index first. Assuming that player_id is in that index in a usable position, the optimizer will probably find all the rows for that player first--and there won't be many, right?--then do another index scan on the contract dates to find the current contract.
If I were you, I'd create a view that returns only the "current" rows. Let application code run against that view.
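Something like this (SQL Server flavour, column names assumed; adjust for however you mark a contract as current):

-- Only contracts that are in force right now
CREATE VIEW CurrentContract AS
SELECT c.PlayerId, c.TeamId, c.StartDate, c.EndDate
FROM Contract c
WHERE c.StartDate <= GETDATE()
  AND (c.EndDate IS NULL OR c.EndDate >= GETDATE());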

Please help explain if I'm destroying my DB Schema for the sake of performance :(

I've got a database that has been in production for nearly 3 years, on SQL Server 2008 (it was 2005 before that). It has been fine, but it isn't very performant. So I'm tweaking the schema and queries to help speed some things up. Also, a score of the main tables contain around 1-3 million rows per table (to give you an estimate of the sizes).
Here's a sample database diagram (soz, under NDA so I can't display the original) :-
[Sample database schema diagram] http://img11.imageshack.us/img11/4608/dbschemaexample.png
Things to note (which are directly related to my problem) :-
A vehicle can have 0 (NULL) or 1 Radio. (Left Outer Join)
A vehicle can have 0 (NULL) or 1 Cupholder (Left Outer Join)
A vehicle has 1 Tyre Type (Inner Join).
Firstly, this looks like a normalised database schema. I suck at DB theory, so I'm guessing this is 3NF (at least) ... famous last words :)
Now, this is killing my database performance because those two outer joins and the inner join are getting called a lot AND there are also a few more joins in many statements.
To try and fix this, I thought I might try an indexed view. Creating the view is a piece of cake. But indexing it doesn't work -> you can't create indexed views with outer joins OR self-referencing tables (also another prob of mine :( ).
So, i've cried for hours (and /wrists, dyed hair and wrote an emo song about it and put it on myfailspace) and did the following...
Added a new row into each 'optional' outer-join table (in this example, Radios and CupHolders). ID = 0, rest of the data = 'Unknown Blah' or 0's.
Updated the parent tables, so that any NULL values now have a 0.
Updated the relationships from outer joins to inner joins.
Now, this works. I can even make my indexed view, which is very fast now.
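In SQL terms, roughly (made-up names, not the real schema):

-- Sentinel row for each 'optional' table
-- (assuming RadioId is an IDENTITY column; skip the SET statements otherwise)
SET IDENTITY_INSERT dbo.Radios ON;
INSERT INTO dbo.Radios (RadioId, Name) VALUES (0, 'Unknown');
SET IDENTITY_INSERT dbo.Radios OFF;

-- No more NULLs in the foreign key column
UPDATE dbo.Vehicles SET RadioId = 0 WHERE RadioId IS NULL;
GO

-- Which means the view can use inner joins, and can therefore be indexed
CREATE VIEW dbo.VehicleSummary WITH SCHEMABINDING AS
SELECT v.VehicleId, v.Registration, r.Name AS RadioName
FROM dbo.Vehicles v
INNER JOIN dbo.Radios r ON r.RadioId = v.RadioId;
GO

CREATE UNIQUE CLUSTERED INDEX IX_VehicleSummary ON dbo.VehicleSummary (VehicleId);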
So ... I'm in pain. This just goes against everything I've been taught. I feel dirty. Alone. Infected.
Is this a bad thing to do? Is this a common scenario of denormalizing a database for the sake of performance?
I would love some thoughts on this, please :)
PS. Those images are random Google finds -- so not mine.
NULL values generally don't play well with indexes. What you've done is provide a sentinel value so that the column always has a value, which allows your indexes to be used more effectively.
You didn't change the structure of your database either, so I wouldn't call this denormalizing. I've done the same with date values, where a NULL "end date" denoted "not ended yet"; instead I made it a known date far in the future, which allowed for indexing.
I think this is fine.
A database should always be designed and initially implemented in 3NF. But the world is a place of reality, not ideals, and it's okay to revert to 2NF (or even 1NF) for performance reasons. Don't beat yourself up about it; pragmatism beats dogmatism in the real world all the time.
Your solution, if it improves performance, is a good one. The idea of having an actual radio (for example), manufactured by nobody and having no features, is not a bad one - it's been done a lot before, believe me :-) The only reason you would leave that field as NULL was to see which vehicles have no radio, and there's little difference between these queries:
select Registration from vehicles where RadioId is null
select Registration from vehicles where RadioId = 0
My first thought was to simply combine the four tables into one and hang the duplicate data issue. Most problems with DBMSs stem from poor performance rather than low storage space.
Maybe keep that as your fallback position if your current de-normalized schema becomes slow as well.
"...So i'm tweaking the schema and queries to help speed some things up..." - I would beg to differ about this. It seems that you're slowing things down. (Just kidding.)
I like the Database Programmer blog. He has two columns for and against normalization that you might find helpful:
http://database-programmer.blogspot.com/2008/10/argument-for-normalization.html
http://database-programmer.blogspot.com/2008/10/argument-for-denormalization.html
I'm not a DBA, but I think the evidence is in front of your eyes: Performance is worse. I don't see what splitting these 1:1 relationships into separate tables is buying you, but I'll be happy to take instruction.
Before I changed anything, I'd ask SQL Server for the execution plan of every query that was slow and use that information to see exactly what should be changed. Don't guess because a normalization guru told you so. Get the data to back up what you're doing. What you're doing sounds like optimizing middle-tier code without profiling. Gut feelings aren't very accurate.
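For example (made-up query and table names, but the pattern is the point):

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Run the slow query, note the logical reads and elapsed time,
-- and look at the plan (Include Actual Execution Plan in SSMS)
SELECT v.Registration, r.Name
FROM dbo.Vehicles v
LEFT JOIN dbo.Radios r ON r.RadioId = v.RadioId
WHERE v.TyreTypeId = 3;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;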
I'm running into the same issue of performance vs. academic excellence. We have a large view on a customer database with 300 columns and 91,000 records. We use outer joins to create the view and the performance is pretty bad. We have considered changing to inner joins by putting in dummy records with a value of zero in the columns we join on (instead of NULL), to enable a unique index on the view.
I have to agree that if performance is important, sometimes strange things have to be done to make it happen. Ultimately, those who pay our bills don't care if the architecture is perfect.
