I am running RStudio Server on a server with 256GB of RAM, and MS SQL Server 2012 on another. This DB contains data that allows me to build a graph with ~100 million nodes and ~150 million edges.
I have timed how long it takes to build this graph from that data:
1st SELECT query = ~22M rows = 12 minutes = df1 (dataframe1)
2nd SELECT query = ~30M rows = 8 minutes = df2
3rd SELECT query = ~32M rows = 8 minutes = df3
4th SELECT query = ~63M rows = 70 minutes = df4
edges = rbind(df1, df2, df3, df4) = 6 minutes
mygraph = graph.data.frame(edges) = 30 minutes
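In code, the build looks roughly like this (just a sketch: the DSN and queries are placeholders, and RODBC is only for illustration, since the exact driver doesn't matter for the question):

library(RODBC)
library(igraph)

ch <- odbcConnect("mssql_dsn")        # placeholder DSN

df1 <- sqlQuery(ch, "SELECT ...")     # ~22M rows, 12 min
df2 <- sqlQuery(ch, "SELECT ...")     # ~30M rows, 8 min
df3 <- sqlQuery(ch, "SELECT ...")     # ~32M rows, 8 min
df4 <- sqlQuery(ch, "SELECT ...")     # ~63M rows, 70 min

edges <- rbind(df1, df2, df3, df4)    # ~147M rows, 6 min
mygraph <- graph.data.frame(edges)    # 30 min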
So a little over two hours in total. Since my data is quite stable, I figured I could speed things up by saving mygraph to disk. But when I tried to load it back, it just wouldn't finish. I gave up after a 4-hour wait, thinking something had gone wrong.
So I rebooted the server, deleted my .rstudio folder and started over, this time saving the dataframes from each SQL query plus the edges dataframe, in both RData and RDS formats (save() and saveRDS(), compress = FALSE every time). After each save, I timed the load() and readRDS() times for the five dataframes. Times were pretty much the same for load() and readRDS():
df1 = 1.1 GB file = 1 minute
df2 = 1.4 GB file = 2 minutes
df3 = 1.7 GB file = 6 minutes
df4 = 3.1 GB file = 13 minutes
edges = 6.8 GB file = 21 minutes
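For reference, the save/load pattern was (shown for df1; the same was done for all five dataframes, and the timings above came from system.time()):

# Save uncompressed, once per format.
save(df1, file = "df1.RData", compress = FALSE)
saveRDS(df1, "df1.rds", compress = FALSE)

# Time both load paths; they came out pretty much the same.
system.time(load("df1.RData"))           # restores df1 into the global environment
system.time(df1 <- readRDS("df1.rds"))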
Good enough, I thought. But today, when I started a new session and tried to load() df1 to make some changes to it, I again got that feeling that something was wrong. After 20 minutes waiting for it to load, I gave up. Memory, disk and CPU shouldn't be the issue, as I'm the only one using this server. I have already rebooted the server and deleted my .rstudio folder, thinking maybe something in there was hanging my session, but the dataframe still won't load. While load() is supposedly running, iotop shows no disk activity, and this is what I get from ps:
ps -C rsession -o %cpu,%mem,cmd
%CPU %MEM CMD
99.5 0.3 /usr/lib/rstudio-server/bin/rsession -u myusername
I have no idea what to try next. It makes no sense to me that loading an RData file would take longer than querying a SQL database that lives on a different server. And even if it did, why was it so fast when I timed load() and readRDS() right after saving the dataframes?
This is the first time I've asked something here on StackOverflow, so sorry if I forgot to mention something important for you to be able to answer this question. If I did, please let me know.
EDIT: some additional info requested by Brandon in the comments. The OS is CentOS 7. The dataframes contain edge lists in the first two columns (col1 = node1; col2 = node2) and two additional columns for edge attributes. All columns are strings, varying between 5 and 14 characters in length. I have also added the approximate number of rows of each dataframe to my original post. Thanks!
EF Core version: 3.1.
Here's my method:
public static ILookup<string, int> GetClientCountLookup(DepotContext context, DateRange dateRange)
=> context
.Flows
.Where(e => e.TimeCreated >= dateRange.Start.Date && e.TimeCreated <= dateRange.End.Date)
.GroupBy(e => e.Customer)
.Select(g => new { g.Key, Count = g.Count() })
.ToLookup(k => k.Key, e => e.Count);
All fields used are indexed.
Here's the generated query:
SELECT [f].[Customer] AS [Key], COUNT(*) AS [Count]
FROM [Flows] AS [f]
WHERE ([f].[TimeCreated] >= @__dateRange_Start_Date_0) AND ([f].[TimeCreated] <= @__dateRange_End_Date_1)
GROUP BY [f].[Customer]
When that query is executed as SQL, the execution time is 100ms.
When that query is used in code with the ToLookup method, the execution time is 3200ms.
What's even weirder: the execution time in EF Core seems totally independent of the data sample size (depending on the date range, we can be counting hundreds or hundreds of thousands of records).
WHAT THE HECK IS HAPPENING HERE?
The query I pasted is the real query EF Core sends.
The code fragment I pasted first is executed in 3200ms.
Then I took the exact generated SQL and executed it as a SQL query in Visual Studio - it took 100ms.
It doesn't make any sense to me. I have used EF Core for a long time and it seems to perform reasonably.
Most queries (plain, simple, without date ranges) are fast; results are fetched immediately (in less than 200ms).
In my application I built a really HUGE query with like 4 multi-column joins and subqueries... Guess what - it fetches 400 rows in 3200ms. It also fetches 4000 rows in 3200ms. And even when I remove most of the joins, the includes, even the subquery - still 3200ms. Or 4000ms, depending on the momentary state and load of my Internet connection and the server.
It's like a constant lag, and I pinpointed it to the exact first query I pasted.
I know the ToLookup method forces the whole input expression to execute and fetch its results, but in my case (real-world data) there are exactly 5 rows.
The results look like this:
|------------|-------|
| Key | Count |
|------------|-------|
| Customer 1 | 500 |
| Customer 2 | 50 |
| Customer 3 | 10 |
| Customer 4 | 5 |
| Customer 5 | 1 |
Fetching 5 rows from the database takes 4 seconds?! It's ridiculous. If the whole table were fetched and the rows then grouped and counted client-side, that would add up. But the generated query returns literally 5 rows.
What is happening here and what am I missing?
Please, DO NOT ASK ME TO PROVIDE THE FULL CODE. It is confidential, part of a project for my client, and I am not allowed to disclose my client's trade secrets, not here nor in any other question. I know it's hard to understand what happens when you don't have my database and the whole application, but the question here is purely theoretical. Either you know what's going on, or you don't. As simple as that. The question is very hard though.
I can only say that the RDBMS used is MS SQL Express running on an Ubuntu server, remotely. The times measured are the times of executing either code tests (NUnit) or queries against the remote DB, all performed from my AMD Ryzen 7 (8 cores, 3.40GHz). The server lives on Azure, on something like 2 cores of an i5 at 2.4GHz.
Here's test:
[Test]
public void Clients() {
    var dateRange = new DateRange {
        Start = new DateTime(2020, 04, 06),
        End = new DateTime(2020, 04, 11)
    };
    var q1 = DataContext.Flows; // dummy query, measured first
    var q2 = DataContext.Flows  // the grouping query from the question
        .Where(e => e.TimeCreated >= dateRange.Start.Date && e.TimeCreated <= dateRange.End.Date)
        .GroupBy(e => e.Customer)
        .Select(g => new { g.Key, Count = g.Count() });
    var q3 = DataContext.Flows; // another dummy query, measured last
    var t0 = DateTime.Now;
    var x = q1.Any();           // forces q1 to hit the database
    var t1 = DateTime.Now - t0;
    t0 = DateTime.Now;
    var l = q2.ToLookup(g => g.Key, g => g.Count); // forces q2 to execute
    var t2 = DateTime.Now - t0;
    t0 = DateTime.Now;
    var y = q3.Any();           // forces q3 to hit the database
    var t3 = DateTime.Now - t0;
    TestContext.Out.WriteLine($"t1 = {t1}");
    TestContext.Out.WriteLine($"t2 = {t2}");
    TestContext.Out.WriteLine($"t3 = {t3}");
}
Here's the test result:
t1 = 00:00:00.6217045 // the time of dummy query
t2 = 00:00:00.1471722 // the time of grouping query
t3 = 00:00:00.0382940 // the time of another dummy query
Yep: 147ms is my grouping that took 3200ms previously.
What happened? A dummy query was executed before.
That explains why the results hardly depended on the data sample size!
The huge, unexplainable time is INITIALIZATION, not actual query time. I mean, if not for the dummy query before it, the whole time would have elapsed on the ToLookup line of code! That line would initialize the DbContext, create the connection to the database, and only then perform the actual query and fetch the data.
So, as the final answer, I can say my test methodology was wrong. I measured the time of the first query made through my DbContext. This is wrong; the database connection should be initialized before the times are measured. I can do that by performing any query before the measured ones.
Well, another question appears: why is THE FIRST query so slow, why is initialization so slow? If my Blazor app used the DbContext as Transient (instantiated each time it is injected), would it take that much time on every injection? I don't think so, because that's how my application worked before a major redesign, and it didn't have noticeable lags (I would have noticed a 3-second lag when changing between pages). But I'm not sure. Now my application uses a scoped DbContext, so there is one per user session, and I won't see the initialization overhead at all; thus the method of measuring time after a dummy query seems accurate.
I am working on a forecasting model for monthly data which I intend to use in SQL Server 2016 (in-database).
I created a simple TBATS model for testing:
# Build a multi-seasonal time series from the value column;
# the start year/month come from the first row.
dataset <- msts(data = dataset[, 3],
                start = c(as.numeric(dataset[1, 1]),
                          as.numeric(dataset[1, 2])),
                seasonal.periods = c(1, 12))
# Clean outliers/missing values, with a Box-Cox transform chosen by log-likelihood.
dataset <- tsclean(dataset,
                   replace.missing = TRUE,
                   lambda = BoxCox.lambda(dataset,
                                          method = "loglik",
                                          lower = -2,
                                          upper = 1))
# Fit the TBATS model.
dataset <- tbats(dataset,
                 use.arma.errors = TRUE,
                 use.parallel = TRUE,
                 num.cores = NULL)
# Forecast 24 months ahead with 80% and 95% prediction intervals.
dataset <- forecast(dataset,
                    level = c(80, 95),
                    h = 24)
dataset <- as.data.frame(dataset)
The dataset was imported from a .csv file I created with a SQL query.
Later, I used the same code in SQL Server, the input being the same query I used for the .csv file (meaning the data was exactly the same as well).
However, when I executed the script, I noticed I got different results. All numbers look fine and make perfect sense; both SQL and standalone R give a forecast table, but the numbers in the two tables differ by a few percent (about 3% on average).
Is there an explanation for this? It really bothers me as I need the best possible results.
EDIT: This is how my data looks, for easier understanding. It's basically a 3-column table: year, month, value of transactions (the numbers are randomised because the data is classified). Altogether I have data for 9 years.
2008 11 1093747561919.38
2008 12 816860005030.31
2009 1 341394536377.06
2009 2 669993867646.25
2009 3 717585597605.75
2009 4 627553319006.03
2009 5 984146176491.78
2009 6 605488762214.33
2009 7 355366795222.40
2009 8 549252969698.07
2009 9 598237364101.23
This is an example of results. Top two rows are from SQL server, bottom two rows are from RStudio.
t Point Lo80 Hi80
1 872379.7412 557105.271 1187654.211
2 1093817.266 778527.1078 1409107.424
1 806050.6884 517606.464 1094494.913
2 1031845.483 743387.015 1320303.95
EDIT 2: I checked each part of the code carefully and figured out that the difference in results appears at the TBATS model.
SQL server returns:
TBATS(0.684, {0,0}, -, {<12,5>})
RStudio returns:
TBATS(0.463, {0,0}, -, {<12,5>})
This explains the difference in forecast values, but the question remains, as these should be the same. (If I read the printed model spec right, that first number is the fitted Box-Cox parameter, so the two environments are settling on different transformations.)
I'll answer this for anyone having problems in the future:
It seems there is a difference in execution in the R engine depending on your OS and runtime. I tested this by running standalone R on my PC and on the server (using RStudio and Microsoft R Open), and by running R in-database on my PC and on the server. I also tested all the different runtimes.
If anyone wants to test this themselves, the R runtime can be changed in Tools - Global Options - General - R version (for RStudio).
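A quick way to confirm what each environment is actually running is to execute the same snippet standalone and in-database (wrapped in sp_execute_external_script) and compare the output; differing interpreter or package versions would explain differing TBATS fits:

# Print interpreter and forecast-package versions.
cat(R.version.string, "\n")
cat("forecast:", as.character(packageVersion("forecast")), "\n")
print(sessionInfo())  # platform, attached packages and (on recent R) BLAS/LAPACK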
All tests returned slightly different results. This does not mean the results were wrong (in my case at least, as I'm forecasting real business data and the results have wide intervals anyway).
This may not be an actual solution, but I hope I can prevent someone panicking for a week like I did.
I am working with an Oracle database and would like to fetch a table with 30 million records.
library(RODBC)
ch <- odbcConnect("test", uid="test_user",
pwd="test_pwd",
believeNRows=FALSE, readOnly=TRUE)
db <- sqlFetch(ch, "test_table")
For 1 million records the process needs 1074.58 sec. Thus, it would take quite a while for all 30 million records. Is there any possibility to speed up the process?
I would appreciate any help. Thanks.
You could try making a system call through R, using the system() command, to a database command-line client (for Oracle that would be SQL*Plus rather than a MySQL shell). Process your data externally and only load what you need back into R.
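A sketch of that idea (assuming SQL*Plus is installed on the R host; the script name, credentials and paths are placeholders):

# export.sql would contain something like:
#   SET HEADING OFF FEEDBACK OFF PAGESIZE 0
#   SPOOL /tmp/test_table.csv
#   SELECT col1 || ',' || col2 || ',' || col3 FROM test_table;
#   SPOOL OFF
#   EXIT
system("sqlplus -s test_user/test_pwd@test @export.sql")

# A bulk CSV reader is typically much faster than row-by-row ODBC fetches;
# data.table::fread() would be faster still than read.csv().
db <- read.csv("/tmp/test_table.csv", header = FALSE)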
I was wondering what is a viable database solution for local storage on Windows Phone 7 right now. Using search I stumbled upon these two threads, but they are a few months old. I was wondering if there are some new developments in databases for WP7, as I didn't find any reviews of the databases mentioned in the links below.
windows phone 7 database
Local Sql database support for Windows phone 7
My requirements are:
It should be free for commercial use
Saving/updating a record should only save the actual record and not the entire database (unlike WinPhone7 DB)
Able to quickly query a table with ~1000 records using LINQ.
Should also work in the emulator
EDIT:
Just tried Sterling using a simple test app: it looks good, but I have two issues.
Creating 1000 records takes 30 seconds using db.Save(myPerson). Person is a simple class with 5 properties.
Then I discovered there is a db.SaveAsync<Person>(IList) method. This is fine because it no longer blocks the current thread.
BUT my question is: is it safe to call db.Flush() immediately and run a query on the IList that is still being saved? (It takes up to 30 seconds to save the records in synchronous mode.) Or do I have to wait until the BackgroundWorker has finished saving?
Querying these 1000 records with LINQ and a where clause takes up to 14 seconds the first time they are loaded into memory.
Is there a way to speed this up?
Here are some benchmark results (the unit tests were executed on an HTC Trophy):
-----------------------------
purging: 7,59 sec
creating 1000 records: 0,006 sec
saving 1000 records: 32,374 sec
flushing 1000 records: 0,07 sec
-----------------------------
//async
creating 1000 records: 0,04 sec
saving 1000 records: 0,004 sec
flushing 1000 records: 0 sec
-----------------------------
//get all keys
persons list count = 1000 (0,007)
-----------------------------
//get all persons with a where clause
persons list with query count = 26 (14,241)
-----------------------------
//update 1 property of 1 record + save
persons list with query count = 26 (0,003s)
db saved (0,072s)
You might want to take a look at Sterling - it should address most of your concerns and is very flexible.
http://sterling.codeplex.com/
(Full disclosure: my project)
Try Siaqodb. It is a commercial project and, unlike Sterling, it does not serialize objects and keep everything in memory for querying. Siaqodb can be queried through its LINQ provider, which can efficiently pull even just field values from the database without creating any objects in memory, or load/construct only the objects that were requested.
Perst is free for non-commercial use.
You might also want to try Ninja Database Pro. It looks like it has more features than Sterling.
http://www.kellermansoftware.com/p-43-ninja-database-pro.aspx
We recently got some data back on a benchmarking test from a software vendor, and I think I'm missing something obvious.
If there were 17 transactions per second (I assume they mean successfully completed requests), and 1500 of these requests could be served in 5 minutes, then how do I get the response time for a single user? Is this sort of thing even possible to derive from benchmarking? I have a lot of other data from them, including Apache config settings, but I'm not sure how to do the math.
Given the server setup they sent, I want to know how I can deduce the user response time. I have looked at other similar benchmarking tests, but I'm having trouble measuring requests to response time. What other data do I need to provide here to get that?
If only 1500 of these can be served per 5 minutes then:
1500 / 5 = 300 transactions per min can be served
300 / 60 = 5 transactions per second can be served
so how are they getting 17 completed transactions per second? Last time I checked, 5 < 17!
This doesn't seem to fit. Or am I looking at it wrongly?
I presume by user response time you mean the time it takes to serve a single transaction:
If they can serve 5 per second, then it takes 200ms (1/5 s) per transaction
If they can serve 17 per second, then it takes about 59ms (1/17 s) per transaction
That is all we can tell from the given data. Perhaps ask them to clarify how many transactions are actually being served per second.