Lazy Reads in Castle.ActiveRecord

I am writing an application which needs to periodically (each week, for example) loop through several million records in a database and execute code against each row.
Since the table is so big, I suspect that when I call SomeObject.FindAll() it reads all 1.4 million rows and tries to return them all in a single SomeObject[].
Is there a way I can execute a SomeObject.FindAll() expression, but load the values in a more DBMS-friendly way?

Not with FindAll() - which, as you've surmised, will try to load all the instances of the specified type at one time (and, depending on how you've got NHibernate set up, may issue a stupendous number of SQL queries to do it).
Lazy loading works only on properties of objects. So, for example, if you had a persisted type SomeObjectContainer which had as a property a list of SomeObject, mapped so that it matched all SomeObjects and with lazy="true", and then did a foreach on that list property, you'd get what you want, sort of: by default, NHibernate would issue a query for each element in the list, loading only one at a time. Of course, the read cache would grow ginormous, so you'd probably need to flush a lot.
What you can do is issue an HQL (or even embedded SQL) query to retrieve all the IDs for all SomeObjects and then loop through the IDs one at a time fetching the relevant object with FindByPrimaryKey. Again, it's not particularly elegant.
To be honest, in a situation like that I'd probably turn this into a scheduled maintenance job in a stored proc - unless you really have to run code on the object rather than manipulate the data somehow. It might annoy object purists, but sometimes a stored proc is the right way to go, especially in this kind of batch job scenario.

Related

Is there a quicker way to add a column or change fields in SSIS packages?

We have 10 slightly different SSIS packages that transfer data from one database to another. Whenever we make a change to the first DB, such as adding a new field or changing a property of that field (extending a varchar's length, say), we have to update the packages as well.
Each of these packages has a long flow with multiple merge joins, sorts, conditional statements, etc. If the field that needs to be changed is at the beginning of the process, I have to go through each merge and update it with the new change, and each time I do so it takes a few minutes to process before I can move on to the next one. As I get near the end, the process takes longer and longer to compute for each merge join. Doing this for 10 different packages, even if they are done at the same time, still takes upwards of 3 hours. This is time-consuming and very monotonous. There's got to be a better way, right?
BIML is very good for this. BIML is an XML-based language that translates into .dtsx packages. BIMLScript is BIML interleaved with C# or VB to provide control-flow logic, so you can create multiple packages or package elements based on conditions. You can easily query the table structure or custom metadata, so if you are only doing DB-to-DB transformations, you can make structural changes to the database(s) and regenerate your SSIS packages without having to do any manual editing.
The short answer is no. The metadata that SSIS generates makes it very awkward when data sources change. You can go down the road of dynamically generated packages, but it's not ideal.
Your other option is damage reduction. Consider whether you could implement the Canonical Data Model pattern:
http://www.eaipatterns.com/CanonicalDataModel.html
It would involve mapping the data to some kind of internal format immediately on receiving it, possibly via a temporary table or cache, and then only using your internal format from then on. Then you map back to your output format at the end of processing.
While this does increase the overall complexity of your package, it means that an external data source changing will only affect the transforms at the beginning and end of your processing, which may well save you lots of time in the long run.
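To make the idea concrete, here is a minimal T-SQL sketch (the table, column, and database names are all hypothetical): the canonical format is just a staging table whose shape you control, and only the load query knows anything about the source schema.

    -- Canonical staging table: downstream transforms only ever read this shape.
    CREATE TABLE dbo.CustomerCanonical
    (
        CustomerId int           NOT NULL,
        FullName   nvarchar(200) NOT NULL,
        Email      nvarchar(320) NULL,
        LoadedAt   datetime      NOT NULL DEFAULT GETDATE()
    );
    GO

    -- Only this mapping has to change when the source schema changes;
    -- the rest of the SSIS flow keeps reading dbo.CustomerCanonical.
    INSERT INTO dbo.CustomerCanonical (CustomerId, FullName, Email)
    SELECT s.cust_id,
           s.first_nm + ' ' + s.last_nm,
           s.email_addr
    FROM   SourceDb.dbo.customers AS s;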

Managing DB Links inside a view (Oracle PL/SQL)

Let's say you have six Oracle database servers which are basically identical, but represent different factories.
For easier reporting, we could make a nice big view on a seventh server that selects from all of them via dblinks 1-6. That works fine 99% of the time. But if someone kicks the cord at plant 5, your view is dead for all plants. In this case, we want to just show the five that are working.
I cannot push the data from the six servers into the 7th; the 7th has to look out to 1-6. We can't use a materialized view because that's not live data in this case - often it could be, but not with linked servers where the outside server can't push data in.
What can I write into a view that basically says: if this dblink works, union in a select statement; otherwise, don't?
As you've found, a view that queries across a dblink will be marked invalid the first time it is accessed after the dblink becomes inaccessible.
My preferred solution would be to use materialized views so that the seventh server would always have access to at least some data - but in your case, you'd prefer to have no data rather than non-live data, so that's not an option.
In which case you need something to catch the "dblink inaccessible" exception and hide it from the view. The only way I can think of to solve that one is to query the tables using a pipelined function, which would swallow the exception and return zero rows if the dblink is down. Your original view would then UNION ALL the queries across six pipelined functions. Unfortunately I'm pretty sure this solution would perform very poorly in comparison with your original view, because the optimizer will not be able to do things like push predicates into the view (effectively the pipelined function will force the equivalent of a full table scan across every dblink that's available, every time the query is run). Since your purpose is reporting, this may or may not be a big issue.
Note: I've never actually done this, so this answer is just "an idea to try".
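A rough PL/SQL sketch of that idea (the object type, remote table, and dblink names are all placeholders; treat it as a starting point rather than tested code):

    -- Row and collection types for the pipelined function to return.
    CREATE OR REPLACE TYPE plant_row AS OBJECT (
      plant_id  NUMBER,
      item_code VARCHAR2(30),
      qty       NUMBER
    );
    /
    CREATE OR REPLACE TYPE plant_tab AS TABLE OF plant_row;
    /
    CREATE OR REPLACE FUNCTION fetch_plant (p_dblink IN VARCHAR2)
      RETURN plant_tab PIPELINED
    AS
      c          SYS_REFCURSOR;
      v_plant_id NUMBER;
      v_item     VARCHAR2(30);
      v_qty      NUMBER;
    BEGIN
      -- Dynamic SQL so one function can target any of the six dblinks.
      OPEN c FOR 'SELECT plant_id, item_code, qty FROM production_data@' || p_dblink;
      LOOP
        FETCH c INTO v_plant_id, v_item, v_qty;
        EXIT WHEN c%NOTFOUND;
        PIPE ROW (plant_row(v_plant_id, v_item, v_qty));
      END LOOP;
      CLOSE c;
      RETURN;
    EXCEPTION
      WHEN OTHERS THEN
        -- A dead dblink raises here; swallow it and contribute zero rows.
        IF c%ISOPEN THEN CLOSE c; END IF;
        RETURN;
    END fetch_plant;
    /
    -- The reporting view unions the plants; a down link simply returns nothing.
    CREATE OR REPLACE VIEW all_plants_v AS
      SELECT * FROM TABLE(fetch_plant('DBLINK1'))
      UNION ALL
      SELECT * FROM TABLE(fetch_plant('DBLINK2'));
      -- ...and so on for DBLINK3 through DBLINK6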

Splitting a Trigger in SQL Server

Greetings,
Recently I've started to work on an application where 8 different modules use the same table at some point in the workflow. This table has an INSTEAD OF trigger which is 5,000 lines long (the first 500 and last 500 lines are common to all modules, and each module has its own 500 lines of code).
Since the number of modules is going to grow and I want to keep things as clear (and separate) as possible, I was wondering: is there some sort of best practice for splitting a trigger into stored procedures, or should I leave it all in one place?
P.S. Are there going to be any performance penalties for calling procedures from the trigger and passing 15+ parameters to them?
Bearing in mind that the inserted and deleted pseudo-tables are only accessible from within trigger code, and that they can contain multiple rows, you're facing two choices:
Process the rows in inserted and deleted in a RBAR¹ fashion, to be able to pass scalar parameters to the stored procedures, or
Copy all of the data from inserted and deleted into table variables that are then passed to the procedures as appropriate (a sketch follows below).
I'd expect either approach to impose some² performance overhead, just from the copying.
That being said, it sounds like too much is happening inside the trigger itself - does all of this code have to be part of the same transaction that performed the DML statement? If not, consider using some form of queue (a table of requests, or Service Broker, say) in which to place information on the work to perform, and then process the data later - if you use Service Broker, you could have it inspect a shared message and then send messages on to dedicated endpoints for each of your modules as appropriate.
¹ Row By Agonizing Row - using either a cursor or something else that simulates one to access each row in turn - usually frowned upon in a set-based language like SQL.
² How much is impossible to know without getting into the specifics of your code - and probably without trying all possible approaches and measuring the results.
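Here is a minimal sketch of the second option, assuming SQL Server 2008 or later (table-valued parameters); the table, type, and procedure names are all hypothetical:

    -- A table type shared by the trigger and the per-module procedures.
    CREATE TYPE dbo.OrderRows AS TABLE
    (
        OrderId int         NOT NULL,
        Status  varchar(20) NOT NULL
    );
    GO

    -- One procedure per module; table-valued parameters must be passed READONLY.
    CREATE PROCEDURE dbo.ModuleA_ProcessOrders
        @Inserted dbo.OrderRows READONLY,
        @Deleted  dbo.OrderRows READONLY
    AS
    BEGIN
        SET NOCOUNT ON;
        -- Module A's 500 lines move here.
    END
    GO

    CREATE TRIGGER dbo.trg_Orders_InsteadOfUpdate
    ON dbo.Orders
    INSTEAD OF UPDATE
    AS
    BEGIN
        SET NOCOUNT ON;

        -- Common "first 500 lines" stay in the trigger.

        DECLARE @ins dbo.OrderRows, @del dbo.OrderRows;
        INSERT INTO @ins (OrderId, Status) SELECT OrderId, Status FROM inserted;
        INSERT INTO @del (OrderId, Status) SELECT OrderId, Status FROM deleted;

        EXEC dbo.ModuleA_ProcessOrders @Inserted = @ins, @Deleted = @del;
        -- EXEC dbo.ModuleB_ProcessOrders @Inserted = @ins, @Deleted = @del; ...

        -- Common "last 500 lines" stay in the trigger.
    END
    GO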
I don't think there is a meaningful performance penalty in this case.
Anyway, it is bad practice to write it all inside the trigger (when it is 5,000 lines long...).
I think the main consideration is maintainability, which will be much better if you split it into several stored procedures.

Using a Cache Table in SQL Server, am I crazy?

I have an interesting dilemma. I have a very expensive query that involves doing several full table scans and expensive joins, as well as calling out to a scalar UDF that calculates some geospatial data.
The end result is a resultset containing data that is presented to the user. However, I can't return everything I want to show the user in one call, because I subdivide the original resultset into pages and just return a specified page, and I also need to take the entire original dataset and apply GROUP BYs, joins, etc. to calculate related aggregate data.
Long story short, in order to bind all of the data I need to the UI, this expensive query needs to be called about 5-6 times.
So, I started thinking about how I could calculate this expensive query once, and then each subsequent call could somehow pull against a cached result set.
I hit upon the idea of abstracting the query into a stored procedure that would take in a CacheID (Guid) as a nullable parameter.
This sproc would insert the resultset into a cache table using the cacheID to uniquely identify this specific resultset.
This allows sprocs that need to work on this resultset to pass in a cacheID from a previous query; retrieving the data is then a simple SELECT with a single WHERE clause on the cacheID.
A periodic SQL job then flushes out the cache table.
This works great, and really speeds things up under zero-load testing. However, I am concerned that this technique may cause an issue under load, with massive amounts of reads and writes against the cache table.
So, long story short: am I crazy? Or is this a good idea?
Obviously I need to be worried about lock contention, and index fragmentation, but anything else to be concerned about?
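For concreteness, a rough T-SQL sketch of the pattern being described (the table, column, procedure, and UDF names are placeholders for the real objects):

    -- Cache table keyed by a per-resultset GUID; a SQL Agent job purges old rows.
    CREATE TABLE dbo.ResultCache
    (
        CacheId  uniqueidentifier NOT NULL,
        CachedAt datetime         NOT NULL DEFAULT GETDATE(),
        ItemId   int              NOT NULL,
        Score    float            NULL
    );
    CREATE CLUSTERED INDEX IX_ResultCache_CacheId ON dbo.ResultCache (CacheId);
    GO

    CREATE PROCEDURE dbo.GetExpensiveResults
        @CacheId uniqueidentifier = NULL OUTPUT
    AS
    BEGIN
        SET NOCOUNT ON;

        IF @CacheId IS NULL
        BEGIN
            -- First call: run the expensive query once and stash the rows.
            SET @CacheId = NEWID();

            INSERT INTO dbo.ResultCache (CacheId, ItemId, Score)
            SELECT @CacheId, i.ItemId, dbo.ExpensiveGeoUdf(i.Latitude, i.Longitude)
            FROM   dbo.Items AS i;   -- stands in for the real multi-join query
        END

        -- Every call, first or subsequent, reads from the cache by key.
        SELECT ItemId, Score
        FROM   dbo.ResultCache
        WHERE  CacheId = @CacheId;
    END
    GO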
I have done that before, especially when I did not have the luxury of editing the application. I think it's a valid approach sometimes, but in general having a cache/distributed cache in the application is preferred, because it reduces the load on the DB and scales better.
The tricky thing with the naive "just do it in the application" solution is that many times you have multiple applications interacting with the DB, which can put you in a bind if you have no application messaging bus (or something like memcached), because it can be expensive to have one cache per application.
Obviously, for your problem the ideal solution is to be able to do the paging in a cheaper manner, and not need to churn through ALL the data just to get page N. But sometimes it's not possible. Keep in mind that streaming data out of the DB to the application can be cheaper than streaming it out of the DB and back into the same DB. You could also introduce a new service that is responsible for executing these long queries, and then have your main application talk to the DB via that service.
Your tempdb could balloon like crazy under load, so I would watch that. It might be easier to put the expensive joins in a view and index the view than to try to cache the resultset for every user.
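A minimal indexed-view sketch of that suggestion (hypothetical tables and columns). Note that indexed views have strict requirements - SCHEMABINDING, two-part names, no outer joins, COUNT_BIG(*) when grouping, and no scalar UDF calls - so the geospatial UDF itself could not live inside the view:

    CREATE VIEW dbo.vOrderTotals
    WITH SCHEMABINDING
    AS
    SELECT  o.CustomerId,
            SUM(ISNULL(d.Quantity * d.UnitPrice, 0)) AS Total,
            COUNT_BIG(*) AS RowCnt
    FROM    dbo.Orders       AS o
    JOIN    dbo.OrderDetails AS d ON d.OrderId = o.OrderId
    GROUP BY o.CustomerId;
    GO
    -- The unique clustered index is what materializes the view on disk,
    -- letting the optimizer answer the aggregate without redoing the joins.
    CREATE UNIQUE CLUSTERED INDEX IX_vOrderTotals ON dbo.vOrderTotals (CustomerId);
    GO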

What costs more: DataSets or Multiple Updates?

If my program were to hit the database with multiple updates, would it be better to pull the tables into a DataSet, change the values, and then send it back to the database? Does anyone know which is more expensive?
No matter what, the database needs to perform all those updates based on the edits you made to the local DataSet. As I understand it, that will be just as expensive as updating sequentially. The only advantage is that it's easier to iterate over a DataSet than to pull and push one result after another.
What will be expensive is all the workaround code for dealing with the potential exceptions that can occur because you choose "which costs less" over "which is simplest". Premature optimization.
It depends on the size of the DataSet. If your DataSet is too large, it isn't worth it; otherwise, it might be a good approach. However, nothing prevents you from doing multiple updates in a batch even without using a DataSet. You could write a stored procedure with an XML parameter that does the batch updates for you.
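A hedged sketch of that last suggestion (the table, column, and XML element names are hypothetical): the caller shreds its changes into XML and the procedure applies them all in one statement.

    CREATE PROCEDURE dbo.UpdatePricesFromXml
        @Updates xml
    AS
    BEGIN
        SET NOCOUNT ON;

        -- Expected shape: <rows><row id="1" price="9.99"/><row id="2" price="4.50"/></rows>
        UPDATE p
        SET    p.Price = x.r.value('@price', 'decimal(10,2)')
        FROM   dbo.Products AS p
        JOIN   @Updates.nodes('/rows/row') AS x(r)
               ON p.ProductId = x.r.value('@id', 'int');
    END
    GO

    -- Usage:
    -- EXEC dbo.UpdatePricesFromXml @Updates = N'<rows><row id="1" price="9.99"/></rows>';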
