Speed up sqlFetch() - database

I am working with an Oracle database and would like to fetch a table with 30 million records.
library(RODBC)
ch <- odbcConnect("test", uid="test_user",
                  pwd="test_pwd",
                  believeNRows=FALSE, readOnly=TRUE)
db <- sqlFetch(ch, "test_table")
For 1 million records the process takes 1074.58 sec, so fetching all 30 million records would take quite a while. Is there any possibility to speed up the process?
I would appreciate any help. Thanks.

You could try making a system call from R with the system() command
to a database command-line client (the question uses Oracle, so SQL*Plus rather than a MySQL shell). Process your data externally and load only what you need as output.
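If the processing stays in the client, another common lever is to fetch in bounded chunks instead of materializing all 30M rows at once (in RODBC, a loop over sqlQuery() with LIMIT-style predicates, or the rows_at_time argument, plays a similar role). A minimal sketch of the chunking pattern in Python with the stdlib sqlite3 module (not Oracle/RODBC; the table and sizes are toy stand-ins):

```python
import sqlite3

# Toy stand-in for the real table: an in-memory database with some rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_table (id INTEGER, val TEXT)")
conn.executemany("INSERT INTO test_table VALUES (?, ?)",
                 [(i, f"row{i}") for i in range(100)])

def fetch_in_chunks(connection, query, chunk_size=25):
    """Yield result rows chunk by chunk instead of materializing them all."""
    cur = connection.execute(query)
    while True:
        chunk = cur.fetchmany(chunk_size)
        if not chunk:
            break
        yield chunk

total = 0
for chunk in fetch_in_chunks(conn, "SELECT id, val FROM test_table"):
    # Process (filter/aggregate) each chunk; memory use stays bounded
    # by the chunk size rather than the table size.
    total += len(chunk)

print(total)  # 100
```

The same shape lets you aggregate or filter as you go, so only the reduced result ever lives in client memory.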

Related

constantUsersPerSecond in Gatling

I'm using Gatling to run loadtest.
I've put all the queries (3 million queries in total) in a log file and loaded it into a feeder:
val feeder = tsv(logFileLocation).circular
val search =
  feed(feeder)
    .exec(
      http("/")
        .get("${params}")
        .headers(sentHeaders)
    ).pause(1)
As for the simulation, I want 50 users concurrently and peak request at 50/seconds so I set up this way
setUp(myservicetest.inject(atOnceUsers(50))).throttle(
reachRps(50) in (40 minutes), jumpToRps(20), holdFor(20 minutes)).maxDuration(60 minutes).protocols(httpProtocol)
My assumption is that these 50 users each load the queries and run them from the first to the last. Because there are enough queries to execute, these 50 users should stay online for the whole duration (60 mins).
But when I ran it, I saw user1 run query1, user2 run query2, ... user50 run query50. Every user ran just 1 query and then quit. So exactly 50 queries were executed in the load test, and it finished quickly.
So my question is: say I have 3 million queries in tsv(logFileLocation).circular and multiple users. Will each user start from query1 and try to execute all 3M queries? Or is each user scheduled to run part of the 3M queries, so that with enough time allocated, by the end of the test all 3M queries are executed exactly once?
Thanks
Disclaimer: Gatling's author here
The latter: "at the end of the test all 3m queries are executed for just once".
Each virtual user performs its scenario.
Injection profiles only control when new virtual users are started.
If you want virtual users to perform more than one request, you have to include that in your scenario, e.g. with loops and pauses.
Those are Gatling basics, I recommend you have a look at the new Gatling Academy courses we've just launched.
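To make the loop idea concrete for the scenario in the question, here is an untested sketch in Gatling's Scala DSL (assuming the usual import scala.concurrent.duration._; during() keeps each virtual user cycling through the circular feeder until the test ends):

```scala
val search = scenario("search")
  .during(60.minutes) {          // loop each virtual user for the whole test
    feed(feeder)                 // the circular feeder hands out the next query
      .exec(
        http("/")
          .get("${params}")
          .headers(sentHeaders)
      )
      .pause(1)
  }
```

With this shape, 50 injected users keep pulling fresh rows from the shared feeder instead of each performing a single request and exiting.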

Why does the log always say "No Data Available" when the cube is built?

In the sample case on the Kylin official website, when I was building the cube, in the first step (Create Intermediate Flat Hive Table) the log always says "No Data Available" and the status is always running.
The cube build has been executing for more than three hours.
I checked the Hive database table kylin_sales and there is data in the table.
And I found that the intermediate flat Hive table kylin_intermediate_kylin_sales_cube_402e3eaa_dfb2_7e3e_04f3_07248c04c10c
has been created successfully in Hive, but there is no data in it.
hive> show tables;
OK
...
kylin_intermediate_kylin_sales_cube_402e3eaa_dfb2_7e3e_04f3_07248c04c10c
kylin_sales
...
Time taken: 9.816 seconds, Fetched: 10000 row(s)
hive> select * from kylin_sales;
OK
...
8992 2012-04-17 ABIN 15687 0 13 95.5336 17 10000975 10000507 ADMIN Shanghai
8993 2013-02-02 FP-non GTC 67698 0 13 85.7528 6 10000856 10004882 MODELER Hongkong
...
Time taken: 3.759 seconds, Fetched: 10000 row(s)
The deploy environment is as follows:
zookeeper-3.4.14
hadoop-3.2.0
hbase-1.4.9
apache-hive-2.3.4-bin
apache-kylin-2.6.1-bin-hbase1x
openssh5.3
jdk1.8.0_144
I deployed the cluster through docker and created 3 containers, one master, two slaves.
Create Intermediate Flat Hive Table step is running.
"No Data Available" means this step's log has not yet been captured by Kylin. Usually the log is only recorded once the step has exited (succeeded or failed); then you will see the data.
In a case like this, it usually indicates the job is pending in Hive, which can happen for many reasons. The simplest way to diagnose it is to watch Kylin's log: you will see the Hive command that Kylin executes, and you can then run it manually in a console to reproduce the problem. Also check whether your Hive/Hadoop cluster has enough resources (CPU, memory) to execute such a query.

Talend Open Studio - Iterate all X rows and not for each row

I used Talend for data extraction and insertion into OTSDB. But I need to cut my file up, and a classic row-by-row iteration takes too much time (40 rows/s, and I have 90 million rows).
Do you know how to send, for example, 50 rows at a time instead of each row individually?
The writing mode can be adjusted for many Talend components.
The tMysqlOutput component, for example, can be configured to perform an insert every X rows.
tFileOutputDelimited (e.g. CSV) has a setting for the flush buffer size.
Have a closer look at your component's Advanced settings.
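The "insert every X rows" option these components expose is the standard batched-write pattern: buffer rows and send them in groups rather than paying one round trip per row. A minimal sketch of the idea in Python with the stdlib sqlite3 module (not Talend or OTSDB; the table and sizes are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (ts INTEGER, value REAL)")

BATCH_SIZE = 50
rows = [(i, i * 0.5) for i in range(230)]  # pretend this streams from a file

batch = []
batches_sent = 0
for row in rows:
    batch.append(row)
    if len(batch) == BATCH_SIZE:
        # One executemany/commit per 50 rows instead of one per row.
        conn.executemany("INSERT INTO metrics VALUES (?, ?)", batch)
        conn.commit()
        batches_sent += 1
        batch.clear()
if batch:  # flush the trailing partial batch
    conn.executemany("INSERT INTO metrics VALUES (?, ?)", batch)
    conn.commit()
    batches_sent += 1

count = conn.execute("SELECT COUNT(*) FROM metrics").fetchone()[0]
print(batches_sent, count)  # 5 230
```

The speedup comes from amortizing per-write overhead (network round trip, commit) over the whole batch, which is exactly what the Talend batch-size settings do.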

RData takes longer to load than querying the database again

I am running RStudio Server on a 256GB RAM server, and MS-SQL-Server 2012 on another. This DB contains data that allows me to build a graph with ~100 million nodes and ~150 million edges.
I have timed how long it takes to build this graph from that data:
1st SELECT query = ~22M rows = 12 minutes = df1 (dataframe1)
2nd SELECT query = ~30M rows = 8 minutes = df2
3rd SELECT query = ~32M rows = 8 minutes = df3
4th SELECT query = ~63M rows = 70 minutes = df4
edges = rbind(df1, df2, df3, df4) = 6 minutes
mygraph = graph.data.frame(edges) = 30 minutes
So a little over two hours. Since my data is quite stable, I figured I could speed things up by saving mygraph to disk. But when I tried to load it, it just wouldn't. I gave up after a 4-hour wait, thinking something had gone wrong.
So I rebooted the server, deleted my .rstudio folder and started over, this time saving the dataframes from each SQL query plus the edges dataframe, in both RData and RDS formats (save() and saveRDS(), compress = FALSE every time). After each save, I timed the load() and readRDS() times of the five dataframes. Times were pretty much the same for load() and readRDS():
df1 = 1.1 GB file = 1 minute
df2 = 1.4 GB file = 2 minutes
df3 = 1.7 GB file = 6 minutes
df4 = 3.1 GB file = 13 minutes
edges = 6.8 GB file = 21 minutes
Good enough, I thought. But today, when I started a new session and tried to load(df1) to make some changes to it, I again got the feeling that something was wrong. After 20 minutes waiting for it to load, I gave up. Memory, disk and CPU shouldn't be the issue, as I'm the only one using this server. I have already rebooted the server and deleted my .rstudio folder, thinking maybe something in there was hanging my session, but the dataframe still won't load. While load() is supposedly running, iotop shows no disk activity, and this is what I get from ps:
ps -C rsession -o %cpu,%mem,cmd
%CPU %MEM CMD
99.5 0.3 /usr/lib/rstudio-server/bin/rsession -u myusername
I have no idea what to try next. It makes no sense to me that loading a RData file would take longer than querying a SQL database that lives on a different server. And even if it did, then why was it so fast when I was timing load() and readRDS() times after saving the dataframes?
This is the first time I've asked something here on Stack Overflow, so sorry if I forgot to mention something important for you to be able to answer this question. If I did, please let me know.
EDIT: some additional info requested by Brandon in the comments. OS is CentOS 7. The dataframes contain lists of edges in the first two columns (col1=node1; col2=node2) and two additional columns for edge attributes. All columns are strings, varying between 5 and 14 characters long. I have also added the approximate number of rows of each dataframe to my original post. Thanks!

What is a viable local database for Windows Phone 7 right now?

I was wondering what is a viable database solution for local storage on Windows Phone 7 right now. Using search I stumbled upon these 2 threads, but they are a few months old. I was wondering if there have been any new developments in databases for WP7. And I didn't find any reviews of the databases mentioned in the links below.
windows phone 7 database
Local Sql database support for Windows phone 7
My requirements are:
It should be free for commercial use
Saving/updating a record should only save the actual record and not the entire database (unlike WinPhone7 DB)
Able to quickly query a table with ~1000 records using LINQ.
Should also work in simulator
EDIT:
Just tried Sterling using a simple test app. It looks good, but I have 2 issues.
Creating 1000 records takes 30 seconds using db.Save(myPerson). Person is a simple class with 5 properties.
Then I discovered there is a db.SaveAsync<Person>(IList) method. This is fine because it doesn't block the current thread anymore.
BUT my question is: is it safe to call db.Flush() immediately and run a query on the IList that is still being saved (given that it takes up to 30 seconds to save the records in synchronous mode)? Or do I have to wait until the BackgroundWorker has finished saving?
Querying these 1000 records with LINQ and a where clause takes up to 14 sec the first time they are loaded into memory.
Is there a way to speed this up?
Here are some benchmark results (unit tests were executed on an HTC Trophy):
-----------------------------
purging: 7,59 sec
creating 1000 records: 0,006 sec
saving 1000 records: 32,374 sec
flushing 1000 records: 0,07 sec
-----------------------------
//async
creating 1000 records: 0,04 sec
saving 1000 records: 0,004 sec
flushing 1000 records: 0 sec
-----------------------------
//get all keys
persons list count = 1000 (0,007)
-----------------------------
//get all persons with a where clause
persons list with query count = 26 (14,241)
-----------------------------
//update 1 property of 1 record + save
persons list with query count = 26 (0,003s)
db saved (0,072s)
You might want to take a look at Sterling - it should address most of your concerns and is very flexible.
http://sterling.codeplex.com/
(Full disclosure: my project)
Try Siaqodb. It is a commercial project and, unlike Sterling, it does not serialize objects and keep everything in memory for querying. Siaqodb can be queried through a LINQ provider, which can efficiently pull even just field values from the database without creating any objects in memory, or load/construct only the objects that were requested.
Perst is free for non-commercial use.
You might also want to try Ninja Database Pro. It looks like it has more features than Sterling.
http://www.kellermansoftware.com/p-43-ninja-database-pro.aspx
