Almost empty plan cache (SQL Server)

I am experiencing a strange situation - my plan cache is almost empty. I use the following query to see what's inside:
SELECT dec.plan_handle, qs.sql_handle, dec.usecounts, dec.refcounts, dec.objtype,
       dec.cacheobjtype, des.dbid, des.text, deq.query_plan
FROM sys.dm_exec_cached_plans AS dec
JOIN sys.dm_exec_query_stats AS qs ON dec.plan_handle = qs.plan_handle
CROSS APPLY sys.dm_exec_sql_text(dec.plan_handle) AS des
CROSS APPLY sys.dm_exec_query_plan(dec.plan_handle) AS deq
WHERE dec.cacheobjtype = N'Compiled Plan'
  AND dec.objtype IN (N'Adhoc', N'Prepared')
One moment it shows me 82 rows, the next 50, then 40, then 55, and so on, while an hour earlier I couldn't reach the end of the plan cache with the same query. The point is that SQL Server keeps the plan cache very, very small.
The main reason for my investigation is high CPU compared to our baselines, without any heavy load, under the normal during-the-day workload: constantly 65-80%.
Perfmon counters show a low Plan Cache Hit Ratio (around 30-50%), high compilations (400 per second against 2000 batch requests per second), and high CPU (73% average). What could cause this behaviour?
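For reference, those rates can also be sampled from T-SQL rather than Perfmon; a rough sketch (the counters are cumulative, so sample twice and take the difference; the object_name prefix varies between default and named instances):
-- take a first snapshot of the cumulative counters
IF OBJECT_ID('tempdb..#c1') IS NOT NULL DROP TABLE #c1;
SELECT counter_name, cntr_value
INTO #c1
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%SQL Statistics%'
  AND counter_name IN (N'Batch Requests/sec', N'SQL Compilations/sec', N'SQL Re-Compilations/sec');

WAITFOR DELAY '00:00:10';   -- 10-second sample window

-- second snapshot: the delta divided by the window length gives the per-second rate
SELECT c2.counter_name,
       (c2.cntr_value - c1.cntr_value) / 10.0 AS per_second
FROM sys.dm_os_performance_counters AS c2
JOIN #c1 AS c1 ON c1.counter_name = c2.counter_name
WHERE c2.object_name LIKE '%SQL Statistics%';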
The main purpose of the question is to learn the possible reasons for an empty plan cache.
Memory settings are OK - min server memory: 0, max server memory: 245000 MB.
I also didn't notice any signs of memory pressure: PLE, lazy writes, free list stalls and disk activity were all fine, and the logs didn't tell me a thing.
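A quick way to see how much memory the plan cache stores are actually holding is to query sys.dm_os_memory_clerks; a minimal sketch (on SQL Server 2008/2008 R2 substitute single_pages_kb + multi_pages_kb for pages_kb):
SELECT type, name, SUM(pages_kb) / 1024 AS size_mb
FROM sys.dm_os_memory_clerks
WHERE type IN (N'CACHESTORE_SQLCP',   -- ad-hoc and prepared plans
               N'CACHESTORE_OBJCP',   -- object plans (procs, triggers, functions)
               N'CACHESTORE_PHDR')    -- bound trees
GROUP BY type, name
ORDER BY size_mb DESC;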
I came here for possible causes of this so I could proceed with investigation.
EDIT: I have also considered this thread:
SQL Server 2008 plan cache is almost always empty
But none of the recommendations/possible reasons are relevant.

The main purpose of the question is to learn the possible reasons for an empty plan cache.
If it is to learn, the answer from Martin Smith in the thread you referred to will help you.
If you want to know in particular why the plan cache is getting emptied, I recommend using Extended Events; try an extended event like the one below.
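A minimal sketch of such a session, assuming the sqlserver.sp_cache_remove event (the Extended Events counterpart of the SP:CacheRemove trace event) is available on your version; adjust the events and actions to whatever your build exposes:
CREATE EVENT SESSION [TrackPlanCacheRemoval] ON SERVER
ADD EVENT sqlserver.sp_cache_remove
    (ACTION (sqlserver.database_id, sqlserver.session_id, sqlserver.sql_text))
ADD TARGET package0.event_file (SET filename = N'PlanCacheRemoval.xel')
WITH (MAX_DISPATCH_LATENCY = 5 SECONDS);

ALTER EVENT SESSION [TrackPlanCacheRemoval] ON SERVER STATE = START;
-- When finished: ALTER EVENT SESSION [TrackPlanCacheRemoval] ON SERVER STATE = STOP;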

How to join splayed tables in KDB?

I have 2 very large (billions of rows) splayed tables, Trades and StockPrices, on a remote server. I want to do an asof join
h:hopen `:RemoteServer:Port
h"aj[`Stock`Date`Time;
  select from Trades where Date within 2014.04.01 2014.04.13;
  StockPrices
  ]"
But I just get the following error (I'm using Studio for KDB+):
An error occurred during execution of the query.
The server sent the response:
splay
Studio Hint: Possibly this error refers to nyi op on splayed table
So what would be the correct way to do such a join?
Also, performance and efficiency are an issue with such big tables -- what should I be doing to ensure the query doesn't take hours and doesn't consume too much of the server's system resources?
You need to map the splayed StockPrices table into memory. This can be done by using a select query:
q)(`::6060)"aj[`sym`time;select from trade;quote]" / bad
'splay
q)(`::6060)"aj[`sym`time;select from trade;select from quote]" / good
sym time         prx      bid      ask
-------------------------------------------
aea 01:01:16.347 637.7554 866.0131 328.1476
aea 01:59:14.108 819.5301 115.053  208.1114
aea 02:42:44.724 69.38325 641.8554 333.3092
This page may be useful for looking up errors from Kdb+: http://code.kx.com/q/ref/error-list/
Regarding optimising the performance of aj, see http://code.kx.com/q/ref/joins/#aj-aj0-asof-join
Also, if there isn't an overlap of data between days, it may be faster to run the query on a day-by-day basis, possibly in parallel.
If there is an overlap of data across days, combining the date & time columns into a single timestamp column would speed up the lookup.

get_by_key_name() in GAE taking as long as 750ms. Is this expected?

My program fetches ~100 entries in a loop. All entries are fetched using get_by_key_name(). Appstats shows that some get_by_key_name() requests take as long as 750ms! (Other big values are 355ms, 260ms and 230ms.) The average for the other fetches ranges from 30ms to 100ms. These times are real time and hence contribute towards 'ms' and not 'cpu_ms'.
Due to the above, the total time taken to return the web page is very high: ms=5754, whereas cpu_ms=1472. (The above times are seen repeatedly for back-to-back requests.)
Environment: Python 2.7, webapp2, jinja2, High Replication, No other concurrent requests to the server, Frontend Instance Class is F1, No memcache set yet, max idle instances is automatic, min pending latency is automatic, using db (NOT NDB).
Any help will be greatly appreciated, as I based the whole database design on fetching entries from the datastore using only get_by_key_name()!
Update:
I tried profiling using time.clock() before and immediately after every get_by_key_name() call. The difference I get from time.clock() for every single call is 10ms! (Just to clarify: the get_by_key_name() calls are made on different Kinds.)
According to time.clock(), the total execution time (wall-clock) is 660ms. But the real time is 5754ms and cpu_ms is 1472 according to the GAE logs.
Summary of Questions:
[Update: this was addressed by passing a list of keys] Why is get_by_key_name() taking that long?
Why is ms (5754) so much higher than cpu_ms (1472)? Is task execution halted/waiting for 75% (1 - 1472/5754) of the time, which is why the real (wall-clock) time is so long as far as the end user is concerned?
If the above is true, why does time.clock() show that only 660ms (wall-clock time) elapsed between the start of the first get_by_key_name() request and the last (~100th) one, while GAE reports this time as 5754ms?

What's so special about the value 976 in sys.dm_exec_query_stats?

When querying the sys.dm_exec_query_stats DMV, I am observing some interesting behaviour on the last_worker_time column.
Usually, it reports 0 for the specific stored procedure I am monitoring. However, occasionally it returns a non-zero value, and when it does it almost always seems to be 976.
The MSDN documentation says the following about the last_worker_time column:
CPU time, reported in microseconds (but only accurate to
milliseconds), that was consumed the last time the plan was executed.
However, this doesn't explain the strange behaviour. Can anyone explain why the value 976 is so prolific?
The following simplified version of my DMV query produces this phenomenon:
select
qs.last_worker_time
from
sys.dm_exec_query_stats qs
cross apply sys.dm_exec_sql_text(qs.plan_handle) st
where
db_name(st.dbid) = 'IntegrationManagement'
and object_name(st.objectid, st.dbid) in ('GetFromOutQueue')
The SQL Server 2008 R2 instance is hosted on Windows Server 2008 R2 running on VMware.
You'd see this if the measurement was done with a timer that ticks 1024 times a second. Usually there's no tick between the start and end of the procedure, and the reported time is 0us. Sometimes there's one tick, and the reported time is 1/1024 second, rounded down to the nearest microsecond = 976 us.
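To make the arithmetic explicit (a quick check, written here in T-SQL purely for convenience):
SELECT FLOOR(1000000.0 / 1024) AS microseconds_per_tick;  -- 1/1024 s = 976.5625 µs, truncated to 976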

What are some suggested LogParser queries to run to detect sources of high network traffic?

In looking at the network in/out metrics for our AWS/EC2 instance, I would like to find the sources of the high network traffic occurrences.
I have installed Log Parser Studio and run a few queries, primarily looking for responses that took a while:
SELECT TOP 10000 * FROM '[LOGFILEPATH]' WHERE time-taken > 1000
I am also targeting time spans that cover when the network in/out spikes have occurred:
SELECT TOP 20000 * FROM '[LOGFILEPATH]'
WHERE [date] BETWEEN TIMESTAMP('2013-10-20 02:44:00', 'yyyy-MM-dd hh:mm:ss')
AND TIMESTAMP('2013-10-20 02:46:00', 'yyyy-MM-dd hh:mm:ss')
One issue is that the log files are 2-7 GB each (targeting single files per query). When I tried Log Parser Lizard, it crashed with an out-of-memory exception on large files (boo).
What are some other queries, and methodologies I should follow to identify the source of the high network traffic, which would hopefully help me figure out how to plug the hole?
Thanks.
One function that may be of particular use to you is the QUANTIZE() function. This allows you to aggregate stats for a period of time thus allowing you to see spikes in a given time period. Here is one query I use that allows me to see when we get scanned:
SELECT QUANTIZE(TO_LOCALTIME(TO_TIMESTAMP(date, time)), 900) AS LocalTime,
COUNT(*) AS Hits,
SUM(sc-bytes) AS TotalBytesSent,
DIV(MUL(1.0, SUM(time-taken)), Hits) AS LoadTime,
SQRROOT(SUB(DIV(MUL(1.0, SUM(SQR(time-taken))), Hits), SQR(LoadTime))) AS StandardDeviation
INTO '[OUTFILEPATH]'
FROM '[LOGFILEPATH]'
WHERE '[WHERECLAUSE]'
GROUP BY LocalTime
ORDER BY LocalTime
I usually output this to a .csv file and then chart it in Excel to visually see where a period of time is out of the normal range. This particular query breaks things down into 15-minute segments, based on the 900 (seconds) passed to QUANTIZE. The TotalBytesSent, LoadTime and StandardDeviation columns let me spot other aberrations in downloaded content or response times.
Another thing to look at is the number of requests a particular client has made to your site. The following query can help identify scanning or DoS activity coming in:
SELECT
DISTINCT c-ip as ClientIP,
COUNT(*) as Hits,
PROPCOUNT(*) as Percentage
INTO '[OUTFILEPATH]'
FROM '[LOGFILEPATH]'
WHERE '[WHERECLAUSE]'
GROUP BY ClientIP
HAVING (Hits > 50)
ORDER BY Percentage DESC
Adjusting the HAVING clause sets the minimum number of requests an IP needs to make before it shows up. Depending on the activity and the WHERE clause, 50 may be too low. The PROPCOUNT() function gives a particular field's share of the overall total; in this case, it gives the percentage of all requests to the site that were made by a particular IP. Typically this will surface the IP addresses of search engines as well, but those are pretty easy to weed out.
I hope that gives you some ideas on what you can do.

Paginated searching... does performance degrade heavily after N records?

I just tried the following query on YouTube:
http://www.youtube.com/results?search_query=test&search=tag&page=100
and received the error message:
Sorry, YouTube does not serve more than 1000 results for any query.
(You asked for results starting from 2000.)
I also tried Google search for "test", and although it said there were about 3.44 billion results, I was only able to get to page 82 (or about 820 results).
This leads me to wonder, does performance start to degrade with paginated searches after N records (specifically wondering about with ROW_NUMBER() in SQL Server or similar feature in other DB systems), or are YouTube/Google doing this for other reasons? Granted, it's pretty unlikely that most people would need to go past the first 1000 results for a query, but I would imagine the limitation is specifically put in place for some technical reason.
Then again Stack Overflow lets you page through 47k results: https://stackoverflow.com/questions/tagged/c?page=955&sort=newest&pagesize=50
Yes. High offsets are slow and inefficient.
The only way to find the records at an offset is to compute all the records that came before and then discard them.
(I don't know ROW_NUMBER(); in MySQL it would be LIMIT, e.g.
SELECT * FROM table LIMIT 1999,20
)
In the above example, the first 2000 records have to be fetched first and then discarded. Generally the engine can't skip ahead or use indexes to jump right to the correct location in the data, because normally there would be a WHERE clause filtering the results.
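Since the question specifically mentions ROW_NUMBER() in SQL Server, here is a sketch of the equivalent page over a hypothetical Posts table; the cost profile is the same, because every row before the requested window still has to be numbered and then discarded:
WITH numbered AS (
    SELECT p.PostId, p.Title, p.CreatedAt,
           ROW_NUMBER() OVER (ORDER BY p.CreatedAt DESC) AS rn
    FROM dbo.Posts AS p               -- hypothetical table, for illustration only
)
SELECT PostId, Title, CreatedAt
FROM numbered
WHERE rn BETWEEN 2000 AND 2019;       -- rows 1-1999 are still produced, then thrown away

-- SQL Server 2012+ also has OFFSET ... FETCH, which is terser but costs the same:
-- SELECT PostId, Title FROM dbo.Posts
-- ORDER BY CreatedAt DESC
-- OFFSET 1999 ROWS FETCH NEXT 20 ROWS ONLY;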
It is possible to cache the results, which is probably what SO does, so it doesn't actually have to compute the large offsets each and every time. (Most of SO's searches are over a 'small' set of known tags, so it's quite feasible to cache. An arbitrary search query would have far too many variations to cache, making it impractical.)
(Alternatively it might be using some other implementation that does allow arbitrary offsets.)
Other places talking about similar things:
http://sphinxsearch.com/docs/current.html#conf-max-matches
Back-of-the-envelope test:
mysql> select gridimage_id from gridimage_search where moderation_status = "geograph" order by imagetaken limit 100999,3;
...
3 rows in set (11.32 sec)
mysql> select gridimage_id from gridimage_search where moderation_status = "geograph" order by imagetaken limit 3;
...
3 rows in set (4.59 sec)
(Arbitrary query chosen so as not to use indexes very well; if indexes can be used the difference is less pronounced and harder to see. But in a production system running lots of queries, even a 1 or 2ms difference is huge.)
Update (to show an indexed query):
mysql> select gridimage_id from gridimage_search order by imagetaken limit 10;
...
10 rows in set (0.00 sec)
mysql> select gridimage_id from gridimage_search order by imagetaken limit 100000,10;
...
10 rows in set (1.70 sec)
It's effectively a TOP clause, designed to limit the number of physical reads the database has to perform, which limits the time the query takes. Imagine you have 82 billion links to stories about "Japan" in your database. What if someone queries "Japan"? Are all 82 billion results really going to be clicked? No. The user needs the top 1000 most relevant results. When the search is generic, like "test", there is no way to determine relevance. In this case, YouTube/Google has to limit the volume returned so other users aren't affected by generic searches. What's faster: returning 1,000 results or 82,000,000,000 results?
