CPU Usage Date Predicate SQL Server - sql-server

I'm retrieving some data using a date field predicate(non clustered) and I'm trying to use 2 alternatives to achieve my goals.
The field is a datetime and in one of the statements I'm using literally the value stored in the field
WHERE dataload ='2012-07-16 10:13:01.307'
In the other statement I'm using this predicate.
WHERE dataload >= DATEADD(DAY,0,datediff(day,0,#Dataload))
and dataload <= dateadd(ms,-3,DATEADD(DAY,1,datediff(day,0,#Dataload)))
Looking at the statistics IO/time output, the reads are the same, but the method #1 used 16ms of CPU and the method #2 used 0ms causing the method #1 to consume 61% of the total cost over the 39%of the method #2.
I'm not understanding really why the CPU are being used in #method1 when the method #2 has so many functions on it and gives me 0ms.
There is any basic explanation for it ?

The first method retrieves the field from disk, taking `6 msec. The second method just fetches the data from memory. If you reversed the order, the other method would take longer.

Related

How do I perform a clustering algorithm (like DBSCAN) in Snowflake?

I need to cluster some data in Snowflake using DBSCAN. I created a UDF but the results won't match with a local run, so I ran an UDF that just creates a list with the row number that is being processed and it results in a list that has repeated values and its max value is much smaller than the number of rows in my table. (The expected result was unique values up to the number of rows)
Can this be a parallelization issue?
If so, is there a way to cluster data using DBSCAN in Snowflake?
Thanks!
EDIT -> code example:
#funcs.pandas_udf(name='DBSCAN_TEST', is_permanent=True,
stage_location='#UDF', replace=True,
packages=['scikit-learn==1.0.2', 'pandas', 'numpy'])
def DBSCAN_TEST(data_x: types.PandasDataFrame[float, float]) -> types.PandasSeries[float]:
data_x.columns = [features]
DBSCAN_cluster = DBSCAN(eps=2.5, min_samples=4)
DBSCAN_cluster.fit(data_x)
return resul
As input data I used this
to test DBSCAN.
If I run that locally (using the exact same code inside the UDF) I end up with 24 clusters and that is the expected result. But if I use the UDF it scales up to 71.
I've tried changing the input types to string, as a coworker suggested but it didn't work.
The only clustering option in Snowflake is a clustering key, and general rule of thumb is that this isn't something typically needed until you eclipse a TB of data in a table and/or the auto clustering proves to be deficient in performance.
https://docs.snowflake.com/en/user-guide/tables-clustering-keys.html

Snowflake UDF execution time limit

I've got an error when calling a UDF when the amount of data get increased:
Database Error in model test_model (models/prep/test_model.sql)
100132 (P0000): JavaScript execution error: UDF execution time limit exceeded by function IMAGE_URL
compiled SQL at target/run/models/test/test_model.sql
As mentioned in the documentation, there is a execution time limit for js UDF, so how long is the time limit and is it configurable?
so lets write a function and use it to test this out.
CREATE OR REPLACE FUNCTION long_time(D DOUBLE)
RETURNS variant
LANGUAGE JAVASCRIPT
AS $$
function sleepFor(sleepDuration){
var now = new Date().getTime();
while(new Date().getTime() < now + sleepDuration){
/* Do nothing */
}
}
sleepFor(D*1000);
return D;
$$;
select long_time(1); -- takes 1.09s
select long_time(10); -- takes 10.32s
select long_time(60); -- Explodes with
JavaScript execution error: UDF execution time limit exceeded by function LONG_TIME
BUT, ran for 31.33s, so it seems you have 30 seconds to complete, which I feel is rather large amount of time, per call to me.
The default timeout limit for JS UDFs is 30 seconds for a row set! Snowflake will send rows in sets to the JS engine, and it will try to process all rows in the set within that time. The size of the row set may vary, but you may assume it will be around 1K (this is just an estimation, the number of rows in a set could be much higher or lower).
The timeout limit is different for Java and Python UDFs. It's 300 seconds for them.
As #Felipe said, you may contact Snowflake Support and share your query ID to get help with this error. The Support may guide you to mitigate the issue.

Inconsistent values for getNumberFound() in Search API

I have a full-text search index with 42 documents like in the screenshot below:
When I query the index for "" it returns all the 42 documents correctly (good), but when I use the limit and offset options in the query, the value returned for the total number of matches found (results.getNumberFound()) varies from time to time. It gives me different values for different offsets!! In short, making the same query just with different offset values gives a different value for results.getNumberFound() function!
NOTE: This happens only in production server after I deploy the app. In local server everything
works perfectly (i.e for the same query, the number of total hits found is the same regardless of the offset option value).
Query query = Query.newBuilder()
.setOptions(QueryOptions.newBuilder()
.setLimit(limit)
.setOffset(offset).build())
.build(searchPhrase);
Results<ScoredDocument> results = INDEX.search(query);
LOG.warning( "Phrase:'" + searchPhrase +
"' limit:" + limit +
" offset:" + offset +
" num:" + results.getNumberFound());
Here's a screenshot of the log output:
So is there something wrong I'm doing or it's a bug in the Search API because the weird thing is that the issue only happens in the production server not the local one.
The python docs say
number_found
Returns an approximate number of documents matching the query. QueryOptions defining post-processing of the search results. If the QueryOptions.number_found_accuracy parameter were set to 100, then number_found <= 100 is accurate.
Similiar api components in exist in Java. From your code it appears you haven't set an accuracy. See java QueryOptions https://developers.google.com/appengine/docs/java/javadoc/com/google/appengine/api/search/QueryOptions
Having said that I have seen many questions/discussions about lack of accuracy on the number of found results.
Surprisingly, this is working as intended (as Tim says).
https://developers.google.com/appengine/docs/java/javadoc/com/google/appengine/api/search/QueryOptions.Builder#setNumberFoundAccuracy(int)
In its default state, the datastore scans the minimal set of data to fulfill the request. The database provides a very rough estimate of match results by multiplying ID range with estimate of matching keys (#keys found that matched / #ids scanned during the query).
For small data sets, set the accuracy value higher (500 or 1000) and call it a day. You can also improve the estimate by making sure key IDs are uniformly distributed and by fetching a higher limit each call (though if you don't need the data, just use the accuracy parameter).
This might not be applicable here but this is a general workaround for larger data sets:
Use num_accuracy == 1000. When queries return an estimate of <1000, you can trust that. When a query returns an estimate of >1000, perform your own estimate using a second query:
Include an extra numeric field with your data, which is a value of a discrete probabilistic event (e.g. #0s in a hash of some randomish data). When you get a large estimate from the first query, repeat your query with the additional constraint (e.g. AND ZERO_COUNT == y), where y is chosen based on the first query's estimate to match <1000 entities, producing an exact count for the second query which you can accurately extrapolate. Since you don't need the results of this data, you can set limit to 1 & num_accuracy == 1000.

Django hits the database for each filter() call

I have some Django 1.3 code that looks up many model instances in a loop, ie.
my_set = myinstance.subitem_set.all()
for value in values:
existing = my_set.filter(attr_name=value)
if len(existing) == 1:
...
This works, but profiling SQL queries shows that it hits the DB on each iteration. According to https://docs.djangoproject.com/en/1.3/ref/models/querysets/ iterating over the related items should eagerly load them, so I tried calling:
list(my_set)
However, this doesn't help. It does do a query to load all the sub-items, but then it still does an individual query for each sub-item inside the loop. How do I get it to use the cached set and not hit the DB each time? The DB is PostgreSQL 8.4.
The problem is in this line:
if len(existing) == 1:
From Django documentation:
len(). A QuerySet is evaluated when you call len() on it. This, as you might expect, returns the length of the result list.
Note: Don't use len() on QuerySets if all you want to do is determine the number of records in the set. It's much more efficient to handle a count at the database level, using SQL's SELECT COUNT(*), and Django provides a count() method for precisely this reason. See count() below.
So in your case it executes the query each time when you call len(existing). The more effective way is:
existing.count() == 1
This will also hit the database each time you call it but it will execute SELECT COUNT(*) which is faster.

key-value store for time series data?

I've been using SQL Server to store historical time series data for a couple hundred thousand objects, observed about 100 times per day. I'm finding that queries (give me all values for object XYZ between time t1 and time t2) are too slow (for my needs, slow is more then a second). I'm indexing by timestamp and object ID.
I've entertained the thought of using somethings a key-value store like MongoDB instead, but I'm not sure if this is an "appropriate" use of this sort of thing, and I couldn't find any mentions of using such a database for time series data. ideally, I'd be able to do the following queries:
retrieve all the data for object XYZ between time t1 and time t2
do the above, but return one date point per day (first, last, closed to time t...)
retrieve all data for all objects for a particular timestamp
the data should be ordered, and ideally it should be fast to write new data as well as update existing data.
it seems like my desire to query by object ID as well as by timestamp might necessitate having two copies of the database indexed in different ways to get optimal performance...anyone have any experience building a system like this, with a key-value store, or HDF5, or something else? or is this totally doable in SQL Server and I'm just not doing it right?
It sounds like MongoDB would be a very good fit. Updates and inserts are super fast, so you might want to create a document for every event, such as:
{
object: XYZ,
ts : new Date()
}
Then you can index the ts field and queries will also be fast. (By the way, you can create multiple indexes on a single database.)
How to do your three queries:
retrieve all the data for object XYZ
between time t1 and time t2
db.data.find({object : XYZ, ts : {$gt : t1, $lt : t2}})
do the above, but return one date
point per day (first, last, closed to
time t...)
// first
db.data.find({object : XYZ, ts : {$gt : new Date(/* start of day */)}}).sort({ts : 1}).limit(1)
// last
db.data.find({object : XYZ, ts : {$lt : new Date(/* end of day */)}}).sort({ts : -1}).limit(1)
For closest to some time, you'd probably need a custom JavaScript function, but it's doable.
retrieve all data for all objects for
a particular timestamp
db.data.find({ts : timestamp})
Feel free to ask on the user list if you have any questions, someone else might be able to think of an easier way of getting closest-to-a-time events.
This is why databases specific to time series data exist - relational databases simply aren't fast enough for large time series.
I've used Fame quite a lot at investment banks. It's very fast but I imagine very expensive. However if your application requires the speed it might be worth looking it.
There is an open source timeseries database under active development (.NET only for now) that I wrote. It can store massive amounts (terrabytes) of uniform data in a "binary flat file" fashion. All usage is stream-oriented (forward or reverse). We actively use it for the stock ticks storage and analysis at our company.
I am not sure this will be exactly what you need, but it will allow you to get the first two points - get values from t1 to t2 for any series (one series per file) or just take one data point.
https://code.google.com/p/timeseriesdb/
// Create a new file for MyStruct data.
// Use BinCompressedFile<,> for compressed storage of deltas
using (var file = new BinSeriesFile<UtcDateTime, MyStruct>("data.bts"))
{
file.UniqueIndexes = true; // enforces index uniqueness
file.InitializeNewFile(); // create file and write header
file.AppendData(data); // append data (stream of ArraySegment<>)
}
// Read needed data.
using (var file = (IEnumerableFeed<UtcDateTime, MyStrut>) BinaryFile.Open("data.bts", false))
{
// Enumerate one item at a time maxitum 10 items starting at 2011-1-1
// (can also get one segment at a time with StreamSegments)
foreach (var val in file.Stream(new UtcDateTime(2011,1,1), maxItemCount = 10)
Console.WriteLine(val);
}
I recently tried something similar in F#. I started with the 1 minute bar format for the symbol in question in a Space delimited file which has roughly 80,000 1 minute bar readings. The code to load and parse from disk was under 1ms. The code to calculate a 100 minute SMA for every period in the file was 530ms. I can pull any slice I want from the SMA sequence once calculated in under 1ms. I am just learning F# so there are probably ways to optimize. Note this was after multiple test runs so it was already in the windows Cache but even when loaded from disk it never adds more than 15ms to the load.
date,time,open,high,low,close,volume
01/03/2011,08:00:00,94.38,94.38,93.66,93.66,3800
To reduce the recalculation time I save the entire calculated indicator sequence to disk in a single file with \n delimiter and it generally takes less than 0.5ms to load and parse when in the windows file cache. Simple iteration across the full time series data to return the set of records inside a date range in a sub 3ms operation with a full year of 1 minute bars. I also keep the daily bars in a separate file which loads even faster because of the lower data volumes.
I use the .net4 System.Runtime.Caching layer to cache the serialized representation of the pre-calculated series and with a couple gig's of RAM dedicated to cache I get nearly a 100% cache hit rate so my access to any pre-computed indicator set for any symbol generally runs under 1ms.
Pulling any slice of data I want from the indicator is typically less than 1ms so advanced queries simply do not make sense. Using this strategy I could easily load 10 years of 1 minute bar in less than 20ms.
// Parse a \n delimited file into RAM then
// then split each line on space to into a
// array of tokens. Return the entire array
// as string[][]
let readSpaceDelimFile fname =
System.IO.File.ReadAllLines(fname)
|> Array.map (fun line -> line.Split [|' '|])
// Based on a two dimensional array
// pull out a single column for bar
// close and convert every value
// for every row to a float
// and return the array of floats.
let GetArrClose(tarr : string[][]) =
[| for aLine in tarr do
//printfn "aLine=%A" aLine
let closep = float(aLine.[5])
yield closep
|]
I use HDF5 as my time series repository. It has a number of effective and fast compression styles which can be mixed and matched. It can be used with a number of different programming languages.
I use boost::date_time for the timestamp field.
In the financial realm, I then create specific data structures for each of bars, ticks, trades, quotes, ...
I created a number of custom iterators and used standard template library features to be able to efficiently search for specific values or ranges of time-based records.

Resources