Snowflake Procedures - snowflake-cloud-data-platform

Are Snowflake procedures written in JavaScript precompiled? I am noticing that the procedure compiles on each invocation. Any thoughts?
Also, is it wise to use Snowflake SQL or Java for procedures in production, given that they are still in preview?

Compiled: No
JavaScript is not a compiled language. But then again, neither is SQL, yet many databases have "pre-compiled SQL procedures" and "pre-compiled SQL". Snowflake does not have "pre-compiled" SQL. Java, on the other hand, is compiled.
Which is a long way of saying: no, the SQL/JavaScript is presently evaluated on every run, and thus not "compiled".
Does it matter?
But really, if you are running a JavaScript procedure over 10 rows, yes, the compile time will be a big factor. If you are running a JavaScript function over a data set of a billion rows read from files and filtered down to a million rows, the "cost" of the JavaScript across those rows will be invisible compared to the data read.
So, as with everything, there is no generalized answer: you should test your performance, and when it's "good enough", move on to the next feature/expensive function/procedure.
Is compiled Java worth it?
Probably not; in edge cases, yes - much like the real-world question of whether a project should be written in Java instead of JavaScript. One is fast to develop, the other is slow to develop; one can be fast at runtime, the other can use libraries (there are no external libraries in Snowflake stored procedures). Anyway, I would get a working solution first and then optimize the slow bits once you can see them, versus trying to write the best thing up front.
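For reference, here is a minimal sketch of what a Snowflake JavaScript stored procedure looks like; the procedure and table names are invented. The JavaScript body is stored as text and evaluated when the procedure is called, not compiled ahead of time, which is why you see per-invocation compile time:

    -- Minimal sketch; procedure and table names are hypothetical.
    CREATE OR REPLACE PROCEDURE count_my_table()
    RETURNS FLOAT
    LANGUAGE JAVASCRIPT
    AS
    $$
        // Run a simple query and return the single scalar result.
        var stmt = snowflake.createStatement({
            sqlText: "SELECT COUNT(*) AS CNT FROM MY_TABLE"
        });
        var rs = stmt.execute();
        rs.next();
        return rs.getColumnValue(1);
    $$;

    -- CALL count_my_table();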

Related

stored procedures and testing -- still a problem even today. Why?

Right now this is just a theory, so if I'm way off, feel free to comment and give some ideas (I'm running out of them). I guess it's more of an update to this question, and as I look at the "related questions" list there are a lot of questions with 0 answers. That tells me there's a real gap.
We have multiple problems with our SQL setups in general, the majority of which stem from stored procedures that have grown into monsters from hell, plus some other user functions scattered about the DB. My biggest concern is that they're completely untested -- when something goes wrong, no one can say with 100% certainty, "yes, I know for a fact this works". That makes debugging a recurring nightmare.
This afternoon, I got this crazy idea that we could start writing some assemblies (CLR-ing, yo!) for SQL and test them. I ran into the constraints (static methods only, safe/external/unsafe, etc.) and overall that didn't go all that well. At least not as well as I'd hoped, and it didn't help me move toward my goal.
I've also tried setting up data in a test by hand (they tried it here too before I showed up). Even using an ORM to seed the data, this becomes rather difficult very quickly and turns into a maintenance hassle. Of course, most of this pain is in the data setup and not the actual test.
So what's out there now in 2011 that helps fix/curb this problem, or have we (as devs) abandoned the idea of testing stored procedures because of the heavy cost?
You can actually make stored procedure tests as a project. Our DBEs at work do that - here's a link you might like: Database Unit Testing with Visual Studio
We've had a lot of success with DbFit.
Yes, there is a cost to setting up test data (there is no way to avoid this cost, IMHO), but the FitNesse platform (on which DbFit is based) enables you to reuse data population scripts by including them within multiple tests.
Corporate culture rules the day. Some places test extensively. Other places, well, not so much.
I did a short-term contract with a Fortune 500 a few years ago. Design, build, and deploy internally. My database had to interface with two legacy systems. It was clear early on that I was going to have to spend more time testing than usual. (Some days, a query of historical data would return 35 rows. Other days the identical query would return 20,000 rows.)
I built a tool in Microsoft Access that stored and executed SQL statements. (Access was the only tool I was allowed to use.) I could build a current version of the database, populate it with test data, and run all the tests I'd built--several hundred of them--in about 20 minutes.
It helped a lot to be able to go into meetings armed with a one-page printout that said my code was working exactly like it was when they signed off on it. But it wasn't easily automated--most of the SQL was hand-coded.
Can DBUnit help you?
I've not used it much myself, but you should be able to set the database to a known state, execute the procedure, and then verify that the data has changed as expected.
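In plain T-SQL terms, that known-state / execute / verify pattern might look something like this; the procedure, table, and column names are all hypothetical:

    BEGIN TRANSACTION;

    -- 1. Arrange: put the tables the procedure touches into a known state.
    DELETE FROM dbo.Orders WHERE CustomerId = 42;
    INSERT INTO dbo.Orders (CustomerId, Amount, Status)
    VALUES (42, 100.00, 'PENDING');

    -- 2. Act: run the procedure under test.
    EXEC dbo.usp_CloseOpenOrders @CustomerId = 42;

    -- 3. Assert: check that the data changed as expected.
    IF EXISTS (SELECT 1 FROM dbo.Orders
               WHERE CustomerId = 42 AND Status <> 'CLOSED')
        RAISERROR('usp_CloseOpenOrders left open orders for customer 42', 16, 1);

    -- Leave the database untouched once the test has run.
    ROLLBACK TRANSACTION;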
EDIT: After looking into this more, it would seem you need something like SQLUnit rather than DBUnit. SQLUnit is described as:
SQLUnit is a regression and unit testing harness for testing database stored procedures. An SQLUnit test suite would be written as an XML file. The SQLUnit harness, which is written in Java, uses the JUnit unit testing framework to convert the XML test specifications to JDBC calls and compare the results generated from the calls with the specified results.
There are downsides: it's Java-based, which might not be your preference, and more importantly there doesn't seem to have been much activity on the project since June '06 :(

Best programming language for fast DB reads and fast local data structure manipulation

I have a MySQL database with a variety of different tables, some storing 100k+ rows. I wanted a language that would allow me to read quickly from the database, letting me collate data from various tables and store it in local objects/data structures. I would then do most of the complex processing locally, which I would also like to be optimized.
This is mainly for an analysis project of data that is cleared out every day. Some friends recommended Ruby or Python, but not knowing either, I wanted a second opinion before I took the leap.
Part of the problem here is the latency between the DB and your application-tier code. Ping the DB server from wherever you intend to query the database. Double that, and that's your turnaround time for every operation. If you can live with that time, then you're OK. But you might be better off writing your manipulations in sprocs or something else that lives close to the DB, and using your application code to make the results presentable to a user.
The DB is going to be the bottleneck in most cases in terms of getting data out.
What your "complex processing" actually involves will make the greatest difference in which language you need and what performance you can get.
In terms of getting up and running, Python and Ruby are quick to start with and get something working. They are a bit slower than other languages for raw computation, but even then, both can compute a hell of a lot of stuff before you will notice much difference from a natively compiled language.
100,000 records really isn't that much. Provided you have enough RAM, and multiple local "indexes" into the data reference the same objects rather than copies, you'll be able to cache it locally and access it very quickly without concern. While Ruby and Python are both interpreted languages and operation-for-operation slower than compiled languages, the reality is that when executing an application only a small portion of CPU time is spent in your code; the majority is spent in the built-in libraries you call into, which are often native implementations and thus as fast as compiled code.
Either Ruby or Python would work fine for this, and even if you find, after testing, that your performance is in fact not sufficient, translating from one of these to a faster language like Java, .NET, or even C++ will be significantly faster than rewriting from scratch, since you've really already done the tough work.
One other option is to cache all the data in an in-memory database. Depending on how dynamic the analysis you need to do, this may work well in your situation. SQLite works very well for this.
Also note that since you're asking about caching the data locally and then acting on the local cache only, the performance of calling out to the database doesn't apply.
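If you go the SQLite route, the local analysis can stay in plain SQL against an in-memory database. A rough sketch, assuming the daily rows have already been bulk-loaded from MySQL by whatever host language you pick (table and column names are invented):

    -- SQLite syntax, run against an in-memory database (a connection opened
    -- on ':memory:'); table and column names are hypothetical.
    CREATE TABLE daily_orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER,
        amount      REAL,
        created_at  TEXT
    );
    -- ...bulk-load the rows exported from MySQL here...
    CREATE INDEX idx_orders_customer ON daily_orders (customer_id);

    -- The "complex processing" then runs entirely in memory:
    SELECT customer_id, COUNT(*) AS orders, SUM(amount) AS total
    FROM daily_orders
    GROUP BY customer_id
    ORDER BY total DESC
    LIMIT 10;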

Testing custom ORM solution performance overhead - how to?

I have created a prototype of a custom ORM tool using aspect-oriented programming (PostSharp), achieving persistence ignorance (before compile time). Now I have tried to find out how much overhead it introduces compared to using a plain DataReader and ADO.NET. I made a test case - insert, read, delete data (about 1000 records) in MS SQL Server 2008 and MySQL Community Edition. I ran this test multiple times using plain ADO.NET and my custom tool.
I expected the results to depend on many factors - memory, swapping, CPU, other processes - so I ran the tests many times (20-40). But the results were really unexpected: they just differed too much between runs. If there were just some extreme values I could ignore them (maybe swapping occurred or something like that), but they were so different that I am sure I cannot trust this kind of testing. Almost half the time my ORM showed 10% better performance than plain ADO.NET; other times it was -10%.
Is there any way I can make those tests reliable? I do not have a powerful computer with lots of memory, but maybe I can somehow make MS SQL and MySQL or ADO.NET behave as consistently as possible during those tests? And what about the number of records - which is more reliable, using a small number of records and running more iterations, or the other way around?
Have you seen ORMBattle.NET? See the FAQ there; there are some ideas related to measuring the performance overhead introduced by a particular ORM tool. The test suite is open source.
Concerning your results:
Some ORM tools automatically batch statement sequences (i.e. send several SQL statements together). If this feature is implemented well in an ORM, it's easy to beat plain ADO.NET by 2-4 times on CRUD operations, provided the ADO.NET test does not involve batching. ORMBattle.NET tests both cases.
A lot depends on how you establish transaction boundaries there. Please refer to ORMBattle.NET FAQ for details.
CRUD tests aren't the best performance indicator at all. In general, it's pretty easy to get close to the peak possible performance here, since the RDBMS must do much more work than the ORM in this case.
P.S. I'm one of ORMBattle.NET authors, so if you're interested in details / possible contributions, you can contact me directly (or join ORMBattle.NET Google Groups).
I would run the test for a longer duration and with many more iterations, as small differences will average out over time and you should get a clearer picture. Also, make sure you eliminate any external factors that may be affecting your test, such as other processes running, not enough free memory, cold start vs. warm start, network usage, etc.
Also, make sure that your database file and log file have enough free space pre-allocated so you aren't waiting for the DB to grow the files during certain tests.
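For example (SQL Server syntax; the database and logical file names are hypothetical, check sys.database_files for yours), pre-sizing the files so autogrowth never fires mid-test:

    -- Pre-size data and log files; names are made up for illustration.
    -- Find your logical file names with: SELECT name FROM sys.database_files;
    ALTER DATABASE OrmBenchmark
        MODIFY FILE (NAME = OrmBenchmark, SIZE = 2048MB, FILEGROWTH = 256MB);
    ALTER DATABASE OrmBenchmark
        MODIFY FILE (NAME = OrmBenchmark_log, SIZE = 1024MB, FILEGROWTH = 128MB);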
First of all, you need to find out where the variance comes from: the ORM layer itself or the database?
Many times the source of such variance is the database itself. Databases are very complex systems with many active processes inside that can affect the results of performance measurements. To achieve reproducible results you'll have to place your database under 'laboratory' conditions and make sure nothing unexpected happens. What that means varies from vendor to vendor, and you need to know some pretty advanced topics to tackle something like this. For instance, on a SQL Server database the typical sources of variation are (a short sketch for resetting a couple of these between runs follows the list):
cold cache vs. warm cache (both data and procedures)
log and database growth events
maintenance jobs
ghost cleanup
lazy writer
checkpoints
external memory pressure
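As a rough illustration for the first two items, on a test-only SQL Server instance you can force every run to start from the same cold-cache state (or, alternatively, run one untimed warm-up iteration and measure only warm-cache runs, which is usually closer to production behaviour):

    -- SQL Server, test instance only: these clear server-wide caches.
    CHECKPOINT;              -- flush dirty pages so the buffer drop is complete
    DBCC DROPCLEANBUFFERS;   -- empty the buffer pool (cold data cache)
    DBCC FREEPROCCACHE;      -- empty the plan cache (cold procedure cache)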

How do you maintain large T-SQL procedures

I'm about to inherit a large and complex set of stored procedures that do monthly processing on very large sets of data.
We are in the process of debugging them so that they match the original process, which was written in VB6. The reason they decided to rewrite them in T-SQL is that the VB process takes days while this new process takes hours.
All this is fine, but how can I make these now-massive chunks of T-SQL code (1.5k+ lines) even remotely readable/maintainable?
Any experience making T-SQL less of a headache is very welcome.
First, create a directory full of .sql files and maintain them there. Add this set of .sql files to a revision control system. SVN works well. Have a tool that loads these into your database, overwriting any existing ones.
Have a testing database, and baseline reports showing what the output of the monthly processing should look like. Your tests should also be in the form of .sql files under version control.
You can now refactor your procs as much as you like, and run your tests afterward to confirm correct function.
For formatting/pretty-fying SQL, I've had success with http://www.sqlinform.com/ - free online version you can try out, and a desktop version available too.
SQLinForm is an automatic SQL code formatter for all major databases ( ORACLE, SQL Server, DB2 / UDB, Sybase, Informix, PostgreSQL, MySQL etc) with many formatting options.
Definitely start by reformatting the code, especially the indentation.
Then modularise the SQL. Pull out chunks into smaller, descriptively named procedures and functions in their own stand-alone files. This alone I find works very well for improving my understanding of large SQL files.
ApexSQLScript is a great tool for scripting out an entire database - you can then check that into source control and manage changes.
I've also found that documenting the sprocs consistently lets you pull information about them out of the source code stored in sys.sql_modules - you can use tags or whatever to help document subsystems.
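As a sketch of that idea, if each proc header carries a tag comment such as "-- @subsystem: billing" (a convention invented here for illustration), you can pull an inventory straight out of the metadata:

    -- List all stored procedures tagged as belonging to the "billing" subsystem.
    SELECT
        SCHEMA_NAME(o.schema_id) AS schema_name,
        o.name                   AS proc_name,
        m.definition
    FROM sys.sql_modules AS m
    JOIN sys.objects     AS o ON o.object_id = m.object_id
    WHERE o.type = 'P'                                  -- procedures only
      AND m.definition LIKE '%@subsystem: billing%';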
Also, use Schemas (or even multiple databases) - this will really help divide up your database into logical units and point out architectural issues.
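A small illustrative example of carving out a logical unit with a schema and moving existing objects into it (all object names are made up):

    -- Create a logical area and transfer existing objects into it.
    CREATE SCHEMA Billing AUTHORIZATION dbo;
    GO
    ALTER SCHEMA Billing TRANSFER dbo.Invoices;
    ALTER SCHEMA Billing TRANSFER dbo.usp_MonthlyInvoiceRun;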
As far as large code goes, I've recently found the SQL 2005 CTE feature to be very useful for managing code with lots of nested queries (not even recursive ones). Instead of managing a bunch of nesting and indentation, CTEs can be declared and built up and then used in the final statement. This also helps with refactoring, as it seems a lot easier to remove redundant nested queries and columns.
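A short sketch of the idea, with invented table and column names, showing nested aggregation rewritten as named CTE building blocks:

    -- Each step gets a name instead of another level of nesting.
    WITH MonthlySales AS (
        SELECT CustomerId,
               DATEADD(month, DATEDIFF(month, 0, SaleDate), 0) AS SaleMonth,
               SUM(Amount) AS MonthTotal
        FROM dbo.Sales
        GROUP BY CustomerId, DATEADD(month, DATEDIFF(month, 0, SaleDate), 0)
    ),
    TopCustomers AS (
        SELECT CustomerId
        FROM MonthlySales
        GROUP BY CustomerId
        HAVING SUM(MonthTotal) > 100000
    )
    SELECT ms.CustomerId, ms.SaleMonth, ms.MonthTotal
    FROM MonthlySales AS ms
    JOIN TopCustomers AS tc ON tc.CustomerId = ms.CustomerId
    ORDER BY ms.CustomerId, ms.SaleMonth;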
Stored Procs and UDFs are vital for managing a large code base and eliminating dark corners. I have not found views to be terribly helpful because they are not parameterizable (UDFs can be used in these cases if the result sets are small).
Try to modularise the SQL as much as possible and have a set of tests that will enable you to maintain, refactor and add features when needed. I once had the pleasure of inheriting a stored proc that totalled 5000 lines, and I still have nightmares about it. Once the project was over I printed out the stored proc for a laugh, destroying X trees in the process. During one of our company's weekly stand-up sessions I laid it out end to end and it stretched the entire length of the building. I used this as an example of how not to write and maintain stored procedures.
One thing that you can do is have an automated script that pushes all changes to source control, so that you can review changes to the procedures (using a diff of the previous and current versions).
It's definitely not free, but for keeping your T-SQL formatted in a consistent way, Redgate Software's SQL Prompt is very handy. As long as your proc's syntax is correct, a couple of keystrokes (Ctrl+K,Y) will reformat it all instantly. The options give you a lot of control over how your SQL is formatted.

Does LINQ to SQL provide faster response times than using ADO.NET and OleDb?

LINQ simplifies database programming, no doubt, but does it have a downside? Inline SQL requires one to communicate with the database in a certain way that opens the database to injections. Inline SQL must also be syntax-checked, have a plan built, and then be executed, which takes precious cycles. Stored procedures have also been a rock-solid standard in great database application programming. Many programmers I know use a data layer that simplifies development, though not to the extent LINQ does. Is it time to give up on SPs and go LINQ?
LINQ to SQL actually presents some alarming performance problems in the database. Basically, it creates multiple execution plans based on the length of the parameter you are using. I posted about it a while back on my blog LINQ to SQL may cause performance problems.
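To illustrate the kind of behaviour described there (a generic T-SQL illustration, not the exact SQL that LINQ to SQL emits; table and column names are made up): if the declared length of a string parameter varies with the value, each distinct length gets its own plan cache entry.

    -- Two calls that differ only in the declared parameter length end up as
    -- two separate plan cache entries.
    EXEC sp_executesql
         N'SELECT * FROM dbo.Customers WHERE Name = @name',
         N'@name nvarchar(3)', @name = N'Bob';
    EXEC sp_executesql
         N'SELECT * FROM dbo.Customers WHERE Name = @name',
         N'@name nvarchar(5)', @name = N'Alice';

    -- Declaring one fixed length (e.g. nvarchar(4000)) lets every call share a plan.
    EXEC sp_executesql
         N'SELECT * FROM dbo.Customers WHERE Name = @name',
         N'@name nvarchar(4000)', @name = N'Alice';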
Now, is that to say that LINQ doesn't have a place? Hardly. LINQ definitely has a place in the development toolkit, just like stored procedures. Ultimately, you want to use stored procedures when performance is absolutely necessary and use an ORM tool in any other situation.
As far as inline SQL goes, there are ways to execute inline SQL so that the plan is only built once and is never recompiled. Most ORMs should take care of this aspect of performance tuning as well and using these methods is usually the safest way to execute your SQL since it forces you to use parameterized queries.
Like most database solutions, the right answer depends on the problem you're trying to solve. If you favor development speed over database/application performance, then using LINQ or another DAL/ORM tool is the best way to go. If you favor performance over ease of development, then using stored procedures and pure datasets is going to be your best bet. LLBLGen even provides a LINQ to LLBLGen layer so you can use LINQ to query LLBLGen's objects and have LLBLGen actually handle building your queries and avoid some of the downfalls of LINQ.
Your basic premise is flawed.
Inline SQL requires one to communicate with the database in a certain way that opens the database to injections.
No, it doesn't. Hard-coding user-inputted values into a SQL statement does, but you could do that with stored procedures as well.
Parameterizing your queries guards against injection attacks, and inline SQL can be parameterized just as easily as stored procedures can.
Inline SQL must also be syntax-checked, have a plan built, and then executed.
All SQL (SPs and inline) must be syntax-checked and have a plan built on its first call. Thereafter, the exact text of the request and the execution plan are cached. If another request with the exact same text (not counting parameter values) is received, the cached execution plan is used.
So, if you hard-code values into inline SQL, the text won't match and the query will have to be re-parsed. However, if you use parameters, the text of the query will match and you will get a cache hit. In that case, it doesn't matter whether the query is inline SQL or an SP.
In other words, the only problem with inline SQL is that it is easy to do something slow and insecure. But making inline SQL fast and secure is no more work than using an SP.
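A quick T-SQL illustration of that point, with hypothetical names: parameterized inline SQL submitted via sp_executesql keeps a single query text and therefore reuses one cached plan, just like an SP.

    -- Hard-coded literals: each distinct value changes the query text, so
    -- plans are not reliably reused.
    SELECT * FROM dbo.Orders WHERE CustomerId = 42;
    SELECT * FROM dbo.Orders WHERE CustomerId = 43;

    -- Parameterized: one query text, one cached plan, and no injection risk.
    EXEC sp_executesql
         N'SELECT * FROM dbo.Orders WHERE CustomerId = @id',
         N'@id int', @id = 42;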
Which brings us to LINQ, which always uses parameters, even if you hard-code the values into the LINQ statement, making "fast and secure" inline SQL trivial.
LINQ also has the advantage over SPs of having all your code in one spot, instead of scattered over two different machines.
If you're interested in benchmarking, Rico Mariani has an excellent 5-part study that covers the qualitative and quantitative differences.
He may be an MS guy, but he's known as a performance nut - his benchmarks are thorough and well thought out.
This is a performance run by Maximilian Beller. According to him, LINQ is much much slower.
Read his comprehensive study
Just think about changing a column's name - now change the (n) SPs and the (x) views.
Do everything that is expensive on the database (like searches, sorting, etc.) and you won't notice a problem.
Also, if you want to display a large grid without paging... then use a DataSet - that one is faster.
StackOverflow also uses linq2sql - do you see a problem :) ?
Use an ORM - it's the way to go on most applications.
PS: also, about micro-benchmarks - like "let's select 10,000 rows with an ORM" - DON'T DO IT. That's not why you use an ORM. If you want to select 10,000 rows, use ADO.
It depends on what you're doing. LINQ is going to be less efficient at the actual data/set manipulation than a real database. But you'll save a lot by not having to connect to the database over a network.
If your database is on the same machine or is formally 'well-connected', you're probably better off using it.
But if you're getting back a large result set from a remote db that could mean significant transmission time, or if it's a really short query that won't justify the overhead, LINQ would likely be better.
Because of the structure of LINQ to SQL, there is no possible way it can be faster than using raw SQL, either your own well-formed queries or as a stored procedure. What LINQ buys you is not speed but type safety and organization; in short most of the benefits that ORMs generally grant you.
LINQ to SQL is not about speed, it's about building a more maintainable software system. It's about all the stuff dedicated software engineers and architects care about, stuff like loose coupling and layering.
That's not to say that you can't build some really unmaintainable code with LINQ -- nobody is keeping you from shooting yourself in the foot but you -- but done properly, LINQ can help tremendously. I'm not saying LINQ is a silver bullet, however. It has a host of issues that make it difficult to use in many enterprise situations -- which is why MS offers Entity Framework (ADO.NET 3.0). Of course, even that's not perfect given the recent EF Vote of No Confidence.
Is LINQ to SQL or even EF better than raw SQL? I'd say a resounding Hells Yeah. Are there other solutions that might work better? Maybe.
