What does 'Function definition too long' mean in Snowflake?

I was creating a stored procedure and stumbled on this issue.
What are Snowflake's limits for stored procedures in JavaScript?

Well, I did several tests and I got some confusing results.
I mostly started getting this error when the total length of the procedure definition was over 800000 characters.
The maximum I got was 810000. My assumption is that there is some limitation based on the compressed size.
But for now, all I can say is:
There is a limit on the size of your procedures.
The limit is somewhere over 780K characters.
If you reach that limit, you are better off splitting your procedure.

At the time of writing, the Snowflake manual says
Snowflake limits the maximum size of the JavaScript source code in the body of a JavaScript UDF. Snowflake recommends limiting the size to 100 KB. (The code is stored in a compressed form, and the exact limit depends on the compressibility of the code.)
So it is not the number of lines of code, but the total code size after compression. Removing "uncompressible" parts of your code (e.g. comments), or running your code through a JavaScript minifier, is likely the best way to stay under the limit until/unless Snowflake lifts it in some future release.
For the record, I've found the limit to be closer to 46 KB of JavaScript code rather than the stated 100 KB.
If you hit this limit yourself, then feel free to let Snowflake know and encourage them to raise it :-)

At one point I was implementing a JavaScript UDF to unpack base64 data, and my naïve JS hit this error. In the end I rewrote the JS to build denser table data, which squeezed all of the data into 24K, where each row looked something like:
1054:{type:10,scale:1},1055:{type:10,scale:1},1056:{type:10,scale:1},1057:{type:10,scale:1},1058:{type:10,scale:1},1059:{type:10,scale:1},1060:{type:10,scale:1},1061:{type:10,scale:1},1062:{type:10,scale:1},1063:{type:10,scale:1},1064:{type:10,scale:1},1065:{type:10,scale:1},1066:{type:10,scale:1},1067:{type:3,scale:1},1068:{type:5,scale:0.05},1069:{type:10,scale:1},1070:{type:10,scale:1},1071:{type:10,scale:1},1072:{type:10,scale:1},1073:{type:10,scale:1},1074:{type:10,scale:1},1075:{type:10,scale:1},1076:{type:10,scale:1},1077:{type:10,scale:1},1078:{type:5,scale:1},1079:{type:10,scale:1},1080:{type:10,scale:1},1081:{type:5,scale:1},1082:{type:4,scale:0.01},1083:{type:4,scale:0.01},1085:{type:5,scale:1},1086:{type:5,scale:1},1087:{type:5,scale:1},1088:{type:5,scale:1},1089:{type:5,scale:1},1090:{type:5,scale:1},1091:{type:5,scale:1},1092:{type:5,scale:1},1093:{type:5,scale:1},1094:{type:5,scale:1},1095:{type:3,scale:1},1096:{type:5,scale:1},1097:{type:5,scale:1},1098:{type:5,scale:1},1099:{type:10,scale:1},1100:{type:10,scale:1},4089:{type:10,scale:1},4090:{type:4,scale:0.05},4091:{type:1,scale:1},4092:{type:2,scale:1},4093:{type:5,scale:0.5},4094:{type:10,scale:1},8001:{type:3,scale:0.1},8002:{type:3,scale:0.1},8003:{type:3,scale:0.1},8004:{type:3,scale:0.1},8005:{type:3,scale:0.1},
Now that I think about it, I could have saved a lot more by shortening scale and type to s and t (about 7.5K, it seems).
Anyway, the whole file that was small enough came to 29K.
Depending on whether your SP is data heavy or code heavy, I found I got a really good performance improvement by relying on the fact that state is (mostly) persisted across calls to a JS function, so I wrapped the two large data tables I was using like this:
CREATE OR REPLACE FUNCTION unpack_base64_data(s string)
RETURNS variant
LANGUAGE JAVASCRIPT
AS '
  if (typeof revLookup === "undefined") {
    revLookup = {"43":62,"45":62 /* ... many more entries ... */};
  }
  // ... rest of the UDF body ...
';
Thus, if the object already existed, it was not rebuilt.
So this may not be the exact answer you were looking for, but it is a data point for a size that is known to work, plus some ideas for how to fit under the limit.
It also implies that running the JS through a minifier or a bundler like webpack might be of value.

While I can't see it explicitly documented anywhere, there is a 16MB limit on most (all?) objects in Snowflake, so that probably includes stored procedure definitions as well.
For example, the PROCEDURES view holds the procedure definition in a TEXT column, and those are limited to 16MB.

Related

Lightweight solution to keep one table with a million records

The brief: I need access to a simple table with only one column, a million rows, no relationships, and just simple 6-character entries - postal codes. I will use it to check a user-entered postal code and find out whether it is valid. This will be a temporary solution for a few months, until I can remove this validation and leave it to web services. So right now I am looking for a solution to this.
What I have:
a web portal built on top of Adobe CQ5 (Java, OSGi, Apache Sling, CRX)
a Linux environment where it all runs
a plain text file (9 MB) with these million rows
What I want:
to have fast access to this data (read-only, no writes) for only one purpose: to find a row with a specific value (six characters long, containing only Latin letters and digits).
to create this solution as simply as possible, i.e. using software preinstalled on Linux, or something that can be installed and started quickly, without lengthy setup and configuration.
Currently I have the following options: use a database, or use something like a HashSet to keep these million records. The first requires additional steps to install and configure a database; the second makes me uneasy when I think about a whole million records in a HashSet. So right now I am considering trying SQLite, but I would like to hear some suggestions on this problem.
Thanks a lot.
Storing in the content repository
You could store it in the CQ5 repository to eliminate the external dependency on sqlite. If you do, I would recommend structuring the storage hierarchically to limit the number of peer nodes. For example, the postcode EC4M 7RF would be stored at:
/content/postcodes/e/c/4/m/ec4m7rf
This is similar to the approach that you will see to users and groups under /home.
This kind of data structure might help with autocomplete if you needed it also. If you typed ec then you could return all of the possible subsequent characters for postcodes in your set by requesting something like:
/content/postcodes/e/c.1.json
This will show you the 4 (and the next character for any other postcode in EC).
You can control the depth using a numeric selector:
/content/postcodes/e/c.2.json
This will go down two levels showing you the 4 and the M and any postcodes in those 'zones'.
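If it helps, here is a rough sketch in plain Java of how such a node path could be derived from a postcode. PostcodePath and pathFor are hypothetical names, the lower-case/strip-spaces normalization is assumed from the example path above, and the actual node creation through the JCR/Sling API is left out:
// Sketch: derive a hierarchical repository path for a postcode.
public class PostcodePath {

    static String pathFor(String postcode) {
        // Normalize: lower-case and remove whitespace, e.g. "EC4M 7RF" -> "ec4m7rf"
        String normalized = postcode.toLowerCase().replaceAll("\\s+", "");
        StringBuilder path = new StringBuilder("/content/postcodes");
        // Use the first four characters as intermediate levels to limit peer nodes
        for (int i = 0; i < Math.min(4, normalized.length()); i++) {
            path.append('/').append(normalized.charAt(i));
        }
        return path.append('/').append(normalized).toString();
    }

    public static void main(String[] args) {
        System.out.println(pathFor("EC4M 7RF")); // /content/postcodes/e/c/4/m/ec4m7rf
    }
}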
Checking for non-existence using a Bloom Filter
Also, have you considered using a Bloom filter? A Bloom filter is a space-efficient probabilistic data structure that can quickly tell you whether an item is definitely not in a set. There is a chance of false positives, but you can control the probability-versus-size trade-off when creating the Bloom filter. There is no chance of false negatives.
There is a tutorial that demonstrates the concept here.
Guava provides an implementation of the Bloom filter that is easy to use. It will work like the HashSet, but you may not need to hold the whole dataset in memory.
BloomFilter<Person> friends = BloomFilter.create(personFunnel, 500, 0.01);
for(Person friend : friendsList) {
friends.put(friend);
}
// much later
if (friends.mightContain(dude)) {
// the probability that dude reached this place if he isn't a friend is 1%
// we might, for example, start asynchronously loading things for dude while we do a more expensive exact check
}
Essentially, the Bloom filter could sit in front of the check and obviate the need to make the check for items that are definitely not in the set. For items that may be in the set (~99% accurate depending on setup), the exact check is then made to rule out a false positive.
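As a sketch of how that might look for the postal codes with Guava on the classpath (PostcodeCheck, exactCheck and the sizing numbers are illustrative, not from the original answer):
import com.google.common.base.Charsets;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import java.util.function.Predicate;

public class PostcodeCheck {

    // Sized for ~1,000,000 postcodes with a 1% false-positive rate.
    private final BloomFilter<CharSequence> filter =
            BloomFilter.create(Funnels.stringFunnel(Charsets.UTF_8), 1_000_000, 0.01);

    // The exact (and more expensive) check, e.g. a repository or database lookup.
    private final Predicate<String> exactCheck;

    public PostcodeCheck(Iterable<String> allCodes, Predicate<String> exactCheck) {
        this.exactCheck = exactCheck;
        for (String code : allCodes) {
            filter.put(code);
        }
    }

    public boolean isValid(String code) {
        if (!filter.mightContain(code)) {
            return false; // definitely not a known postcode, skip the expensive check
        }
        return exactCheck.test(code); // maybe in the set, rule out the ~1% false positives
    }
}
At those settings the filter itself needs only on the order of a megabyte of memory, far less than the full dataset.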
I would try the Redis in-memory database, which can handle millions of key/value pairs and is blazing fast for loading and reading. Connectors exist for many languages, and an Apache module also exists (mod_redis).
You said that this is a temporary solution/requirement - so do you need a database?
You already have this as a text file - why not just load it into memory as part of your program, since it's only 9 MB (assuming your process is persistent and always resident), and reference it as an array or just a set of values, as in the sketch below.
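For scale, a minimal sketch of that in-memory approach, assuming a plain Java component that stays resident; the file path, the upper-case normalization and the PostcodeLookup name are all illustrative:
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

public class PostcodeLookup {

    // Load the whole file into a HashSet once; ~1M six-character strings
    // fit comfortably in a modest heap.
    static Set<String> load(String path) throws IOException {
        Set<String> codes = new HashSet<>(1_500_000);
        for (String line : Files.readAllLines(Paths.get(path), StandardCharsets.UTF_8)) {
            String code = line.trim();
            if (!code.isEmpty()) {
                codes.add(code.toUpperCase()); // assumed normalization; match however the entries are stored
            }
        }
        return codes;
    }

    public static void main(String[] args) throws IOException {
        Set<String> codes = load("/path/to/postcodes.txt"); // hypothetical path
        System.out.println(codes.contains("EC4M7R")); // user input, normalized the same way
    }
}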

Big Data Database

I am collecting a large amount of data which is most likely going to be a format as follows:
User 1: (a,o,x,y,z,t,h,u)
Where all the variables change dynamically with respect to time, except u, which is used to store the user name. Since my background is not very strong in "big data", what I am trying to understand is this: the array I end up with will be very large, something like 108000 x 3500, and since I will be performing analysis on each timestep and graphing it, what would be an appropriate database to manage this in? Since this is for scientific research I was looking at CDF and HDF5, and based on what I read from NASA I think I will want to use CDF. But is this the correct way to manage such data for speed and efficiency?
The final data set will have all the users as columns, and the rows will be timestamped, so my analysis program would read row by row to interpret the data and make entries into the dataset. Maybe I should be looking at things like CouchDB or an RDBMS; I just don't know a good place to start. Advice would be appreciated.
This is an extended comment rather than a comprehensive answer ...
With respect, a dataset of size 108000*3500 doesn't really qualify as big data these days, not unless you've omitted a unit such as GB. If it's 108000*3500 8-byte values, that's only 3GB plus change. Any of the technologies you mention will cope with that with ease. I think you ought to make your choice on the basis of which approach will speed your development rather than speeding your execution.
But if you want further suggestions to consider, I suggest:
SciDB
Rasdaman, and
MonetDB
all of which have some traction in the academic big data community and are beginning to be used outside that community too.
I have been using CDF for some similarly sized data and I think it should work nicely. You will need to keep a few things in mind though. Considering I don't really know the details of your project, this may or may not be helpful...
3GB of data is right around the file size limit for the older version of CDF, so make sure you are using an up-to-date library.
While 3GB isn't that much data, depending on how you read and write it, things may be slow going. Make sure you use the hyper read/write functions whenever possible.
CDF supports meta-data (called global/variable attributes) that can hold information such as username and data descriptions.
It is easy to break data up into multiple files. I would recommend using one file per user. This will mean that you can write the user name just once for the whole file as an attribute, rather than in each record.
You will need to create an extra variable called epoch. This is a well-defined timestamp for each record. I am not sure whether the timestamp you have now would be appropriate or whether you will need to process it some, but it is something you need to think about. Also, the epoch variable needs to have a specific type assigned to it (epoch, epoch16, or TT2000). TT2000 is the most recent version, which gives nanosecond precision and handles leap seconds, but most CDF readers that I have run into don't handle it well yet. If you don't need that kind of precision, I recommend epoch16, as that has been the standard for a while.
Hope this helps, if you go with CDF, feel free to bug me with any issues you hit.

Achieving high-performance transactions when extending PostgreSQL with C-functions

My goal is to achieve the highest performance available for copying a block of data from the database into a C-function to be processed and returned as the result of a query.
I am new to PostgreSQL and I am currently researching possible ways to move the data. Specifically, I am looking for nuances or keywords related specifically to PostgreSQL to move big data fast.
NOTE:
My ultimate goal is speed, so I am willing to accept answers outside of the exact question I have posed as long as they get big performance results. For example, I have come across the COPY keyword (PostgreSQL only), which moves data from tables to files quickly, and vice versa. I am trying to stay away from processing that is external to the database, but if it provides a performance improvement that outweighs the obvious drawback of external processing, then so be it.
It sounds like you probably want to use the server programming interface (SPI) to implement a stored procedure as a C language function running inside the PostgreSQL back-end.
Use SPI_connect to set up the SPI.
Now SPI_prepare_cursor a query, then SPI_cursor_open it. SPI_cursor_fetch rows from it and SPI_cursor_close it when done. Note that SPI_cursor_fetch allows you to fetch batches of rows.
SPI_finish to clean up when done.
You can return the result rows into a tuplestore as you generate them, avoiding the need to build the whole table in memory. See examples in any of the set-returning functions in the PostgreSQL source code. You might also want to look at the SPI_returntuple helper function.
See also: C language functions and extending SQL.
If maximum speed is of interest, your client may want to use the libpq binary protocol via libpqtypes so it receives the data produced by your server-side SPI-using procedure with minimal overhead.

What is the exact datastore (and memcache) entity size limit, and how is it calculated?

The limit is 1MB, according to the docs, which I assumed means 1024**2 bytes, but apparently not.
I've got a simple function which stringifies large python objects into JSON, splits the JSON string into smaller chunks, and puts the chunks (as BlobProperty) and a separate index entity to the datastore (and memcache, using ndb). And another function to reverse this.
First I tried splitting into 1024**2-byte chunks but the datastore complained about it. Currently I'm using 1000**2 and it has worked without errors. I could've answered my own question already here, if it weren't for this comment by Guido, with code that splits into 950000-byte chunks. If Guido does it, it must be for a reason, I figured. Why the 50K safety margin?
Maybe we can get a definitive answer on this, to not waste even a single byte. I'm aware of Blobstore.
The limit is 1MB - that is, 2^20 bytes - but that limit is for the encoded version of the entity, which includes all the metadata and encoding overhead.
One option is to leave some wiggle-room, as you're doing. Another is to catch the error and subdivide chunks if necessary.
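As a rough sketch of the wiggle-room option (plain Java here; ChunkSplitter and the 1000*1000 limit are illustrative, the actual datastore puts are not shown, and you could still combine this with catching the over-size error and splitting further):
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ChunkSplitter {

    // Stay below the 1MB (2^20 bytes) entity limit, since the limit applies
    // to the encoded entity including metadata and encoding overhead.
    static final int CHUNK_LIMIT = 1000 * 1000;

    static List<byte[]> split(byte[] payload) {
        List<byte[]> chunks = new ArrayList<>();
        for (int offset = 0; offset < payload.length; offset += CHUNK_LIMIT) {
            int end = Math.min(offset + CHUNK_LIMIT, payload.length);
            chunks.add(Arrays.copyOfRange(payload, offset, end));
        }
        return chunks;
    }
}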
If you're having to split stuff up like this, however, the Blobstore may be a better choice for your data than the datastore.

"MaximumRecievedMessageSize" exceed - Better way of handling this?

The data I'm trying to send out through a WCF service is a Northwind database table (suppliers), which holds about 29 records, and it already exceeds the maximum length of a message I can send. I've looked around for answers and everyone says the same thing: increase maxReceivedMessageSize in the .config file.
However, this seems very wrong to me - it feels too much like a workaround rather than solving the issue (e.g. what if it exceeds the maximum amount I can set it to?). Instead, is there a way to break up the message into chunks? The service itself is modeled with WSSF, so I'm having a hard time finding "where" the message is being serialized in the first place (I am not providing code since, as far as I'm aware, WSSF provides a very strict template to work on).
Side-note/question: I have a "backup" plan where I can execute a stored command against the database that only brings back 10 rows of data (at a specified starting point when calling the function). However, I would have to call that function several times. Would this still be better than breaking the message into chunks?
I apologize for not displaying any code but I feel as though it will only cause more confusion. If it is necessary, then I will try and clear this question up to the best of my ability asap. Thank you for your contribution!
Provide Skip and Take properties on your request object to allow the client to control paging.
