EF Core version: 3.1.
Here's my method:
public static ILookup<string, int> GetClientCountLookup(DepotContext context, DateRange dateRange)
=> context
.Flows
.Where(e => e.TimeCreated >= dateRange.Start.Date && e.TimeCreated <= dateRange.End.Date)
.GroupBy(e => e.Customer)
.Select(g => new { g.Key, Count = g.Count() })
.ToLookup(k => k.Key, e => e.Count);
All fields used are indexed.
Here's generated query:
SELECT [f].[Customer] AS [Key], COUNT(*) AS [Count]
FROM [Flows] AS [f]
WHERE ([f].[TimeCreated] >= #__dateRange_Start_Date_0) AND ([f].[TimeCreated] <= #__dateRange_End_Date_1)
GROUP BY [f].[Customer]
When that query is executed as SQL, the execution time is 100ms.
When that query is used in code with ToLookup method - execution time is 3200ms.
What's even more weird - the execution time in EF Core seems totally independent from the data sample size (let's say, depending on date range we can count hundreds, or hundreds of thousands records).
WHAT THE HECK IS HAPPENING HERE?
The query I pasted is the real query EF Core sends.
The code fragment I pasted first is executed in 3200ms.
Then I took exact generated SQL and executed in as SQL query in Visual Studio - took 100ms.
It doesn't make any sense to me. I use EF Core for a long time and it seem to perform reasonably.
Most queries (plain, simple, without date ranges) are fast, results are fetched immediately (in less than 200ms).
In my application I built a really HUGE query with like 4 multi-column joins and subqueries... Guess what - it fetches 400 rows in 3200ms. It also fetches 4000 rows in 3200ms. And also when I remove most of the joins, includes, even remove the subquery - 3200ms. Or 4000, depending on my Internet or server momentary state and load.
It's like constant lag and I pinpointed it to the exact first query I pasted.
I know ToLookup method causes to finally fetch all input expression results, but in my case (real world data) - there are exactly 5 rows.
The results looks like this:
|------------|-------|
| Key | Count |
|------------|-------|
| Customer 1 | 500 |
| Customer 2 | 50 |
| Customer 3 | 10 |
| Customer 4 | 5 |
| Customer 5 | 1 |
Fetching 5 rows from database takes 4 seconds?! It's ridiculous. If the whole table was fetched, then rows grouped and counted - that would add up. But the generated query returns literally 5 rows.
What is happening here and what am I missing?
Please, DO NOT ASK ME TO PROVIDE THE FULL CODE. It is confidential, part of a project for my client, I am not allowed to disclose my client's trade secrets. Not here nor in any other question. I know it's hard to understand what happens when you don't have my database and the whole application, but the question here is pure theoretical. Either you know what's going on, or you don't. As simple as that. The question is very hard though.
I can only tell the RDBMS used is MS SQL Express running on Ubuntu server, remotely. The times measured are the times of executing either code tests (NUnit) or queries against the remote DB, all performed on my AMD Ryzen 7 8 core 3.40GHz processor. The server lives on Azure, like 2 core of I5 2.4GHz or something like that.
Here's test:
[Test]
public void Clients() {
var dateRange = new DateRange {
Start = new DateTime(2020, 04, 06),
End = new DateTime(2020, 04, 11)
};
var q1 = DataContext.Flows;
var q2 = DataContext.Flows
.Where(e => e.TimeCreated >= dateRange.Start.Date && e.TimeCreated <= dateRange.End.Date)
.GroupBy(e => e.Customer)
.Select(g => new { g.Key, Count = g.Count() });
var q3 = DataContext.Flows;
var t0 = DateTime.Now;
var x = q1.Any();
var t1 = DateTime.Now - t0;
t0 = DateTime.Now;
var l = q2.ToLookup(g => g.Key, g => g.Count);
var t2 = DateTime.Now - t0;
t0 = DateTime.Now;
var y = q3.Any();
var t3 = DateTime.Now - t0;
TestContext.Out.WriteLine($"t1 = {t1}");
TestContext.Out.WriteLine($"t2 = {t2}");
TestContext.Out.WriteLine($"t3 = {t3}");
}
Here's the test result:
t1 = 00:00:00.6217045 // the time of dummy query
t2 = 00:00:00.1471722 // the time of grouping query
t3 = 00:00:00.0382940 // the time of another dummy query
Yep: 147ms is my grouping that took 3200ms previously.
What happened? A dummy query was executed before.
That explains why the results hardly depended on data sample size!
The huge unexplainable time is INITIALIZATION, not the actual query time. I mean, if not the dummy query before, the whole time would elapse on ToLookup line of code! The line would initialize the DbContext, create connection to database and then perform the actual query and fetch the data.
So as the final answer I can say my test methodology was wrong. I measured the time of the first query to my DbContext. This is wrong, the database should be initialized before the times are measured. I can do it by performing any query before the measured queries.
Well, another question appears - why is THE FIRST query so slow, why is initialization so slow. If my Blazor app would use DbContext as Transient (instantiated each time injected) - would it take so much time each time? I don't think so, because it's how my application worked before major redesign. It didn't have noticeable lags (I would notice 3 seconds lag when changing between pages). But I'm not sure. Now my application uses a scoped DbContext, so it's one for user session. So I won't see the initialization overhead at all, so - the method of measuring time after a dummy query seems to be accurate.
Related
I previously touched on this problem in my post "What's the best way to return a sample of data over the a period?" but while the potential solutions offered there were promising, ultimately they didn't solve the problem I have (which has since changed slightly too).
The problem
The problem I'm trying to solve is returning aggregated data from a large dataset split into time-bound chunks quickly. For example, from a collection of temperature readings, return the average hourly reading for the last 24 hours.
For this example, let's say we have a collection named observations that contains temperature data collected from many devices at a rate of 1 per second. This collection contains a large amount of data (the dataset I'm working with has 120M documents). Each collection contains the following three fields: deviceId, timestamp, temperature.
For clarity, this gives us:
200 devices
3,600 documents per device per hour
86,400 documents per device per day
17,280,000 documents per day (all devices)
120,960,000 documents per week (all devices)
Retrieving data for a device for a period of time is trivial:
FOR o IN observations
FILTER o.deviceId = #deviceId
FILTER o.timestamp >= #start AND o.timestamp <= #end
RETURN o
The issue comes when trying to return aggregated data. Let's say that we want to return three sets of data for a specific deviceId:
The average daily reading for the last week (7 records from 17,280,000)
The average hourly readings for the last day (24 records from 86,400)
The average minutely readings for the last hour (60 records from 3,600)
(In case it affects the potential solutions, some of the data may not be per second, but at a lower rate, for example, per 15 seconds, perhaps even per minute. There may also be missing data for certain periods of time. While I've assumed ideal for the above figures, we all know reality is rarely that simple.)
What I've tried so far
I've looked into using WINDOW (example below) but the query ran very slow, whether this was an issue with the query or with volume of data, I don't know, but I couldn't find much information on it. Also, this still requires a way to perform multiple readings, one per each period of time (what I think of as a step).
FOR o IN observations
FILTER o.deviceId = #deviceId
WINDOW DATE_TIMESTAMP(o.timestamp) WITH { preceding: "PT60M" }
AGGREGATE temperature = AVG(o.temperature)
RETURN {
timestamp: o.timestamp,
temperature
}
I also looked at ways of filtering the timestamps based on a modulus as suggested in the previous thread but that didn't account for averages or for data that may be missing (a missed update, so no record with an exact timestamp, for example).
I've pulled out the data and filtered it outside of ArangoDB but this isn't really a solution, it's slow and especially for large volumes of data (17M for a week of readings) it just wasn't working at all.
So I looked at recreating some of that logic in the query, by stepping through each chunk of data, and returning the average values, which works, but isn't the most performant (taking roughly 10s for me, albeit relatively low powered box but still, slow):
LET steps = 24
LET stepsRange = 0..23
LET diff = #end - #start
LET interval = diff / steps
LET filteredObservations = (
FOR o IN observations
FILTER o.deviceId == #deviceId
FILTER o.timestamp >= #start AND o.timestamp <= #end
RETURN o
)
FOR step IN stepsRange
RETURN (
LET stepStart = start + (interval * step)
LET stepEnd = stepStart + interval
FOR f IN filteredObservations
FILTER f.timestamp >= stepStart AND f.timestamp <= stepEnd
COLLECT AGGREGATE temperature = AVG(f.temperature)
RETURN { step, temperature }
)
I've also tried variations of the above using WINDOW but without much luck. I'm not massively across the graph functionality of ArangoDB, coming from a relational/document database background, so I wonder if there is something in that which could make querying this data quicker and easier.
Summary
I expect this query to be ran simultaneously for several different time ranges and devices by many users, so really the performance needs to be < 1s. In terms of compromises, if this could be achieved by picking one record from each time chunk, that would be okay.
I figured out the answer to this at about 2.00 am last night, waking up and scribbling down a general idea. I've just tested it and it seems to be running quite quickly.
My thought was this: grabbing an average between two timestamps is a quick query, so if we simplify the overall query to simply run a filtered aggregate for each timestep, then the performance of the query will simply be a linear cost depending on the number of datapoints required.
It wasn't much different than the example above:
# Slow method
LET steps = 24
LET stepsRange = 0..23
LET diff = #end - #start
LET interval = diff / steps
LET filteredObservations = (
FOR o IN observations
FILTER o.deviceId == #deviceId
FILTER o.timestamp >= #start AND o.timestamp <= #end
RETURN o
)
FOR step IN stepsRange
RETURN (
LET stepStart = start + (interval * step)
LET stepEnd = stepStart + interval
FOR f IN filteredObservations
FILTER f.timestamp >= stepStart AND f.timestamp <= stepEnd
COLLECT AGGREGATE temperature = AVG(f.temperature)
RETURN { step, temperature }
)
Instead of filtering the observations first, and then filtering this subset again and again to collect the aggregate, we instead just loop through from 0 to n, in this case 23, and run the query to filter and aggregate the result:
# Quick method
LET steps = 24
LET stepsRange = 0..23
LET diff = #end - #start
LET interval = diff / steps
FOR step IN stepsRange
RETURN FIRST(
LET stepStart = start + (interval * step)
LET stepEnd = stepStart + interval
RETURN FIRST(
FOR f IN filteredObservations
FILTER f.timestamp >= stepStart AND f.timestamp <= stepEnd
COLLECT AGGREGATE temperature = AVG(f.temperature)
RETURN temperature
)
)
In my case, the total query time is around 75 ms for 24 datapoints, hourly average from 24 hours worth of data. Increasing this up to 48 points only increases the query run time by about 25%, and returning 1440 steps (per minute averages) runs in 132 ms.
I'd say this qualifies as a performant query.
(It's worth noting that I have a persistent index for this collection, without which the query is very slow.)
I am working on a forecasting model for monthly data which I intend to use in SQL server 2016 (in-database).
I created a simple TBATS model for testing:
dataset <- msts(data = dataset[,3],
start = c(as.numeric(dataset[1,1]),
as.numeric(dataset[1,2])),
seasonal.periods = c(1,12))
dataset <- tsclean(dataset,
replace.missing = TRUE,
lambda = BoxCox.lambda(dataset,
method = "loglik",
lower = -2,
upper = 1))
dataset <- tbats(dataset,
use.arma.errors = TRUE,
use.parallel = TRUE,
num.cores = NULL
)
dataset <- forecast(dataset,
level =c (80,95),
h = 24)
dataset <- as.data.frame(dataset)
Dataset was imported from .csv file I created with SQL query.
Later, I used same code in SQL server, input being the same query I used for .csv file (meaning data was exactly the same aswell)
However, when I executed the script, I noticed I got different results. All numbers look fine and make perfect sense, both SQL and standalone R give a forecast table, but all numbers between two tables differ for few % (about 3% on average).
Is there an explanation for this? It really bothers me as I need best possible results.
EDIT: This is how my data looks for easier understanding. It's basically 3 column table: year, month, value of transactions (numbers are randomised because data is classified). Alltogether I have data for 9 years.
2008 11 1093747561919.38
2008 12 816860005030.31
2009 1 341394536377.06
2009 2 669993867646.25
2009 3 717585597605.75
2009 4 627553319006.03
2009 5 984146176491.78
2009 6 605488762214.33
2009 7 355366795222.40
2009 8 549252969698.07
2009 9 598237364101.23
This is an example of results. Top two rows are from SQL server, bottom two rows are from RStudio.
t Point Lo80 Hi80
1 872379.7412 557105.271 1187654.211
2 1093817.266 778527.1078 1409107.424
1 806050.6884 517606.464 1094494.913
2 1031845.483 743387.015 1320303.95
EDIT 2: I checked each part of code carefully and I figured out that difference in results happens at TBATS model.
SQL server returns:
TBATS(0.684, {0,0}, -, {<12,5>})
RStudio returns:
TBATS(0.463, {0,0}, -, {<12,5>})
This explains difference in forecast values, but the question remains as these should be the same.
I'll answer this for anyone having problems in the future:
Seems like there is an difference in execution in R engine depending on you OS and runtime. I tested this by runing standalone R on my PC and on server using RStudio and Microsoft R Open and runing R in database on my PC and on server. I also tested all different runtimes.
If anyone wants to test themseves, R runtime can be changed in Tools - Global Options - General - R version (for RStudio)
All tests returned a slightly different results. This does not mean the results were wrong (in my case at least, as I'm forecasting for real business data and results have wide intervals anyway).
This may not be an actual solution, but I hope I can prevent someone panicking for a week like I did.
I am running RStudio Server on a 256GB RAM server, and MS-SQL-Server 2012 on another. This DB contains data that allows me to build a graph with ~100 million nodes and ~150 million edges.
I have timed how long it takes to build this graph from that data:
1st SELECT query = ˜22M rows = 12 minutes = df1 (dataframe1)
2nd SELECT query = ˜30M rows = 8 minutes = df2
3rd SELECT query = ˜32M rows = 8 minutes = df3
4th SELECT query = ˜63M rows = 70 minutes = df4
edges = rbind(df1, df2, df3, df4) = 6 minutes
mygraph = graph.data.frame(edges) = 30 minutes
So a little over two hours. Since my data is quite stable, I figured I could speed things up by saving mygraph to disk. But when I tried to load it, it just wouldn't. I gave up after a 4 hour wait, thinking something had gone wrong.
So I reboot the server, delete my .rstudio folder and start over, this time saving the dataframes from each SQL query plus the edges dataframe, in both RData and RDS formats (save() and saveRDS(), compress = FALSE everytime). After each save, I timed load() and readRDS() times of the five dataframes. Times where pretty much the same for load() and readRDS():
df1 = 1.1 GB file = 1 minute
df2 = 1.4 GB file = 2 minutes
df3 = 1.7 GB file = 6 minutes
df4 = 3.1 GB file = 13 minutes
edges = 6.8 GB file = 21 minutes
Good enough, I thought. But today when I started a new session and tried to load(df1) to make some changes to it, again I got that feeling that something was wrong. After 20 minutes waiting for it to load, I gave up. Memory, disk and CPU shouldn't be the issues, as I'm the only one using this server. I have already reboot the server and deleted my .rstudio folder, thinking maybe something in there was hanging my session, but the dataframe still won't load. While load() is supposedly running, iotop shows no disk activity and this is what I get from ps
ps -C rsession -o %cpu,%mem,cmd
%CPU %MEM CMD
99.5 0.3 /usr/lib/rstudio-server/bin/rsession -u myusername
I have no idea what to try next. It makes no sense to me that loading a RData file would take longer than querying a SQL database that lives on a different server. And even if it did, then why was it so fast when I was timing load() and readRDS() times after saving the dataframes?
It's the first time I ask something here at StackOverflow, so sorry if I forgot to mention something important for you to be able to answer this question. If I did, please let me know.
EDIT: some additional info requested by Brandon in the comments. OS is CentOS 7. The dataframes contain lists of edges in the first two columns (col1=node1; col2=node2) and two additional columns for edge attributes. All columns are strings, varying between 5 and 14 characters long. I have also added the approximate number of rows of each dataframe to my original post. Thanks!
I have got into strenge situation. I want to know count of Fetch based on daily, weekly, monthly and All time. In the Datastore, the count is about 2,368,348. Whenever I try to get the count either by Model or GqlQuery I get a 500 error. When rows are less, the code below is working fine.
Can any guru correct me or tell me right solution, please? I am using Python.
The Model:
class Fetch(db.Model):
adid = db.IntegerProperty()
ip = db.StringProperty()
date = db.DateProperty(auto_now_add=True)
Stats Codes:
adid = cgi.escape(self.request.get('adid'))
...
query = "SELECT __key__ FROM Fetch WHERE adid = " + adid + " AND date >= :1"
rows = db.GqlQuery( query, monthlyDate)
fetch_count = 0
for row in rows:
fetch_count = fetch_count + 1
self.response.out.write( fetch_count)
It looks like your query is taking longer than GAE allows a query to run (typically ~60 seconds). From the count() documentation:
Unless the result count is expected to be small, it is best to specify a limit argument; otherwise the method will continue until it finishes counting or times out.
From the Request Timer documentation:
A request handler has a limited amount of time to generate and return a response to a request, typically around 60 seconds. Once the deadline has been reached, the request handler is interrupted.
If a DeadlineExceededError is being raised, this is your problem. If you need to run this query consider using Backends in GAE. With Backends there is no time limit for generating and returning a request.
I was wondering what is a viable database solution for local storage on Windows Phone 7 right now. Using search I stumbled upon these 2 threads but they are over a few months old. I was wondering if there are some new development in databases for WP7. And I didn't found any reviews about the databases mentioned in the links below.
windows phone 7 database
Local Sql database support for Windows phone 7
My requirements are:
It should be free for commercial use
Saving/updating a record should only save the actual record and not the entire database (unlike WinPhone7 DB)
Able to fast query on a table with ~1000 records using LINQ.
Should also work in simulator
EDIT:
Just tried Sterling using a simple test app: It looks good, but I have 2 issues.
Creating 1000 records takes 30 seconds using db.Save(myPerson). Person is a simple class with 5 properties.
Then I discovered there is a db.SaveAsync<Person>(IList) method. This is fine because it doesn't block the current thread anymore.
BUT my question is: Is it save to call db.Flush() immediately and do a query on the currently saving IList? (because it takes up to 30 seconds to save the records in synchronous mode). Or do I have to wait until the BackgroundWorker has finished saving?
Query these 1000 records with LINQ and a where clause the first time takes up to 14 sec to load into memory.
Is there a way to speed this up?
Here are some benchmark results: (Unit tests was executed on a HTC Trophy)
-----------------------------
purging: 7,59 sec
creating 1000 records: 0,006 sec
saving 1000 records: 32,374 sec
flushing 1000 records: 0,07 sec
-----------------------------
//async
creating 1000 records: 0,04 sec
saving 1000 records: 0,004 sec
flushing 1000 records: 0 sec
-----------------------------
//get all keys
persons list count = 1000 (0,007)
-----------------------------
//get all persons with a where clause
persons list with query count = 26 (14,241)
-----------------------------
//update 1 property of 1 record + save
persons list with query count = 26 (0,003s)
db saved (0,072s)
You might want to take a look at Sterling - it should address most of your concerns and is very flexible.
http://sterling.codeplex.com/
(Full disclosure: my project)
try Siaqodb is commercial project and as difference from Sterling, not serialize objects and keep all in memory for query.Siaqodb can be queried by LINQ provider which efficiently can pull from database even only fields values without create any objects in memory, or load/construct only objects that was requested.
Perst is free for non-commercial use.
You might also want to try Ninja Database Pro. It looks like it has more features than Sterling.
http://www.kellermansoftware.com/p-43-ninja-database-pro.aspx