How do I loop this piece of code?

My partners and I have a piece of code that extracts tweets in R and stores them in a database. What we would like to know is how to loop this code so that it runs periodically, preferably every 30 minutes.
Here's our code:
#Load twitter package for R
library(twitteR)
#load MySQL package for R
library(RMySQL)
#Load authentication files for twitter
load(file="twitter_authentication.Rdata")
registerTwitterOAuth(cred)
#Search twitter for tweets e.g. #efteling
efteling <- searchTwitter("#efteling", n=100)
#Store the tweets into a dataframe
dataFrameEfteling <- do.call("rbind", lapply(efteling, as.data.frame))
#Set up the connection to the database
doConnect <- dbConnect(MySQL(), user="root", password="", dbname="portfolio", host="127.0.0.1")
dbWriteTable(doConnect, "tweetsEfteling", dataFrameEfteling)
eftelingResult <- dbSendQuery(doConnect, "select text from tweetsEfteling")
showResultEfteling <- fetch(eftelingResult, n=20)

Do you have access to crontab? If so, you can set it to run the script however frequently you like.
Here is a little information on crontab.
If your server is running Linux, you can just type in
crontab -e
to pull up your personal crontab file. After that, you schedule your command.
For every 30 minutes, you would use this entry.
*/30 * * * * /path/to/script
Save and exit.
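One thing to check (an assumption on my part, since you haven't said how the script is run): cron needs a command it can execute non-interactively, so you would typically save the code above as an .R file and call it through Rscript, adjusting the paths for your system. For example:
*/30 * * * * /usr/bin/Rscript /path/to/script.R >> /path/to/script.log 2>&1
Redirecting the output to a log file makes it much easier to see why a scheduled run failed.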

Have you considered using Twitter's streaming API vs REST? This would likely accomplish the same thing if you leave the connection open for an extended period of time. Plus it would cut down on API pulls. Try the streamR package.
If you still want to set it on a timer, http://statistics.ats.ucla.edu/stat/r/faq/timing_code.htm looks useful.

Related

How Do I Capture and Download Snowflake Query Results?

I'm using Snowflake on a Windows PC.
For example: https://<my_org>.snowflakecomputing.com/console#/internal/worksheet
I have a bunch of queries, the collective output of which I want to capture and load into a file.
Apart from running the queries one-at-a-time and using copy-and-paste to populate the file, is there a way I can run all the queries at once and have the output logged to a file on my PC?
There are many ways to achieve the high-level outcome you are seeking, but you have not provided enough context to know which would be best suited to your situation. For example, by mentioning https://<my_org>.snowflakecomputing.com/console#/internal/worksheet, it is clear that you are currently planning to execute the series of queries through the Snowflake web UI. Is using the web UI a strict requirement of your use case?
If not, I would recommend that you consider using a Python script (along with the Snowflake Connector for Python) for a task like this. One strategy would be to have the Python script serially process each query as follows:
1. Execute the query.
2. Export the result set (as a CSV file) to a stage location in cloud storage via two of Snowflake's powerful features:
   - the RESULT_SCAN() function
   - the COPY INTO <location> command to EXPORT data (the "opposite" of the COPY INTO <table> command used to IMPORT data)
3. Download the CSV file to your local host via Snowflake's GET command.
Here is a sample of what such a Python script might look like...
import snowflake.connector
query_array = [r"""
SELECT ...
FROM ...
WHERE ...
""",r"""
SELECT ...
FROM ...
WHERE ...
"""
]
conn = snowflake.connector.connect(
account = ...
,user = ...
,password = ...
,role = ...
,warehouse = ...
)
file_prefix = "query_export"  # prefix for the generated file names (not defined in the original; pick anything)
file_number = 0
for query in query_array:
    file_number += 1
    file_name = f"{file_prefix}_{file_number}.csv.gz"
    # Execute the query
    rs_query = conn.cursor(snowflake.connector.DictCursor).execute(query)
    query_id = rs_query.sfqid  # Retrieve query ID for query execution
    # Export the result set of that query to a stage location via RESULT_SCAN + COPY INTO <location>
    sql_copy_into = f"""
COPY INTO @MY_STAGE/{file_name}
FROM (SELECT * FROM TABLE(RESULT_SCAN('{query_id}')))
DETAILED_OUTPUT = TRUE
HEADER = TRUE
SINGLE = TRUE
OVERWRITE = TRUE
"""
    rs_copy_into = conn.cursor(snowflake.connector.DictCursor).execute(sql_copy_into)
    for row_copy_into in rs_copy_into:
        file_name_in_stage = row_copy_into["FILE_NAME"]
        # Download the exported file from the stage to the local host
        sql_get_to_local = f"""
GET @MY_STAGE/{file_name_in_stage} file://.
"""
        rs_get_to_local = conn.cursor(snowflake.connector.DictCursor).execute(sql_get_to_local)
Note: I have chosen (for performance reasons) to export and transfer the files as zipped (gz) files; you could skip this by passing the COMPRESSION=NONE option in the COPY INTO <location> command.
Also, if your result sets are much smaller, then you could use an entirely different strategy and simply have Python pull and write the results of each query directly to a local file. I assumed that your result sets might be larger, hence the export + download option I have employed here.
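For that simpler strategy, a minimal sketch might look like this (reusing the conn and query_array variables from the script above; the csv module usage and the query_result_<n>.csv file names are my own illustrative choices):
import csv

file_number = 0
for query in query_array:
    file_number += 1
    cur = conn.cursor()
    cur.execute(query)
    # Write the column headers, then every row of the result set, to a local CSV file
    with open(f"query_result_{file_number}.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([col[0] for col in cur.description])
        for row in cur:
            writer.writerow(row)
This keeps the entire result set flowing through the Python process, which is fine for small results but is exactly what the export-and-download approach above avoids for large ones.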
You can use the SnowSQL client for this. See https://docs.snowflake.com/en/user-guide/snowsql.html
Once you get it configured, then you can make a batch file or similar that calls SnowSQL to run each of your queries and write the output to a file. Something like:
@echo off
>output.txt (
snowsql -q "select blah"
snowsql -q "select blah"
...
snowsql -q "select blah"
)

batch query is not allowed to request data from "".""

I'm getting started with Kapacitor and have been trying to run the first guide in the Kapacitor documentation, but with data I already have. I managed to define a task, but I can neither enable it nor can I run a backfill. I came across this question, which is similar to my problem, but the answer there didn't help. In contrast to the error message there I get empty strings for database, retention policy, and/or measurement.
In the Kapacitor config I set up an InfluxDB connection to the local instance under the name localhost (it has a database mydb with the measurements weather.current.clouds and weather.current.visibility and the default retention policy autogen) and created the following weathertest.tick script:
dbrp "mydb"."autogen"
var clouds = batch
|query('select mean(value) / 100.0 as val from "mydb"."autogen"."weather.current.clouds"')
.period(1h)
.every(1h)
.groupBy(time(1m), *)
.fill(0)
var vis = batch
|query('select mean(value) / 10000.0 as val from "mydb"."autogen"."weather.current.visibility"')
.period(1h)
.every(1h)
.groupBy(time(1m), *)
.fill(0)
clouds
|join(vis)
.as('c', 'v')
|eval(lambda: 100 * (1 - "c.val") * "v.val")
.as('pcent')
|influxDBOut()
.cluster('localhost')
.database('mydb')
.retentionPolicy('autogen')
.measurement('testmetric')
.tag('host', 'myhost.local')
.tag('key', 'weather.current.lightidx')
This is what I came up with after hours of trial and (especially) error. As given in the title, when I try to enable my task with kapacitor enable weathertest, I get the error message enabling task weathertest: batch query is not allowed to request data from "".""". The same thing happens when I try to record as in the "Backfill" example. Also, in that example there are a start and a stop date for limiting the time frame. The time format given there is wrong and is not understood by Kapacitor; instead of e.g. 2015-10-01 I have to put in 2015-10-01T00:00Z to at least get past the time-format error.
In the Kapacitor logs there is not a single line about these errors. Only when I try to remove a recording do I get something like remove /var/lib/kapacitor/replay/1f5...750.brpl: no such file or directory, and that does show up in the logs. There are lots of info lines in the logs showing successful POSTs to/from InfluxDB for the _internal database with HTTP response 204.
Does anyone have an idea what I may be doing wrong?
OK, after the weekend I tried again. Without any change, the previously failing steps now accepted my script; however, this time I was able to find error messages in the log. The node mentioned there was the eval node, and the messages pointed towards a type mismatch. When I changed the line
|eval(lambda: 100 * (1 - "c.val") * "v.val")
to
|eval(lambda: 100.0 * (1.0 - "c.val") * "v.val")
the error messages were gone and the command kapacitor show weathertest now showed rather sane content.
Furthermore, I redefined, recorded, replayed, and deleted the tasks and recordings over and over again during my tests, and I may have forgotten to redefine the task after making changes to the tick script (not really sure). After changing the above, redefining the task, and replaying it, I finally found the expected data in the InfluxDB instance.

Python - Unable to read latest data from database

I have a Django application writing to an sqlite3 database, which is immediately accessed by Python scripts on the same machine.
I want the Python scripts to be able to read the last entry that the Django application has submitted.
Scenario
User logs in to the Django website from a Raspberry Pi (RPi).
User enters desired data.
Data submitted to database.
User pushes a push button connected to the RPi, which triggers a database query.
Separate Python scripts perform calculations based on the user input.
I am sure that the Django app writes to the database before I try to access the data from the Python scripts. I am using sqlitebrowser to check this.
Database:
tact
-----------------
45 <- old value retrieved
60 <- new value not retrieved
60 <- new value not retrieved
Code:
conn = sqlite3.connect('Line3_Data.db')
conn.commit()
c = conn.cursor()
c.execute("SELECT tact FROM LineOEE03 ORDER BY tact desc limit 1")
current_tact = c.fetchone()
print(current_tact) #prints 45
I am aware that conn.commit() should get my session up to date, but why is this not the case here?
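One commonly suggested workaround (a minimal sketch, reusing the database file, table, and query from above) is not to hold the connection open, but to connect immediately before each read, for example every time the button is pressed, so the SELECT cannot be served by a connection that was opened before Django committed its latest write:
import sqlite3

def read_latest_tact(db_path='Line3_Data.db'):
    # Open a fresh connection for each read so the query sees the most
    # recently committed data, then close it again.
    conn = sqlite3.connect(db_path)
    try:
        c = conn.cursor()
        c.execute("SELECT tact FROM LineOEE03 ORDER BY tact DESC LIMIT 1")
        row = c.fetchone()
        return row[0] if row else None
    finally:
        conn.close()
If the value still lags behind, it is worth double-checking that the Django side has actually committed (i.e. its transaction has ended) before the button is pressed.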

Server out-of-memory issue when using RJDBC in parallel computing environment

I have an R server with 16 cores and 8 GB of RAM that initializes a local SNOW cluster of, say, 10 workers. Each worker downloads a series of datasets from a Microsoft SQL server, merges them on some key, then runs analyses on the dataset before writing the results to the SQL server. The connection between the workers and the SQL server runs through an RJDBC connection. When multiple workers are getting data from the SQL server, RAM usage explodes and the R server crashes.
The strange thing is that the RAM usage of a worker loading data seems disproportionately large compared to the size of the loaded dataset. Each dataset has about 8000 rows and 6500 columns. This translates to about 20 MB when saved as an R object on disk and about 160 MB when saved as a comma-delimited file. Yet the RAM usage of the R session is about 2.3 GB.
Here is an overview of the code (some typographical changes to improve readability):
Establish connection using RJDBC:
require("RJDBC")
drv <- JDBC("com.microsoft.sqlserver.jdbc.SQLServerDriver","sqljdbc4.jar")
con <<- dbConnect(drv, "jdbc:sqlserver://<some.ip>","<username>","<pass>")
After this there is some code that sorts the function input vector requestedDataSets (which holds the names of all tables to query) by number of records, so that we load the datasets from largest to smallest:
nrow.to.merge <- rep(0, length(requestedDataSets))
for(d in 1:length(requestedDataSets)){
  nrow.to.merge[d] <- dbGetQuery(con, paste0("select count(*) from ", requestedDataSets[d]))[1,1]
}
merge.order <- order(nrow.to.merge, decreasing = T)
We then go through the requestedDatasets vector and load and/or merge the data:
for(d in merge.order){
  # force reconnect to SQL server
  drv <- JDBC("com.microsoft.sqlserver.jdbc.SQLServerDriver","sqljdbc4.jar")
  try(dbDisconnect(con), silent = T)
  con <<- dbConnect(drv, "jdbc:sqlserver://<some.ip>","<user>","<pass>")
  # remove the to.merge object
  rm(complete.data.to.merge)
  # force garbage collection
  gc()
  jgc()
  # ask database for dataset d
  complete.data.to.merge <- dbGetQuery(con, paste0("select * from ", requestedDataSets[d]))
  # first dataset
  if (d == merge.order[1]){
    complete.data <- complete.data.to.merge
    colnames(complete.data)[colnames(complete.data) == "key"] <- "key_1"
  }
  # later dataset
  else {
    complete.data <- merge(
      x = complete.data,
      y = complete.data.to.merge,
      by.x = "key_1", by.y = "key", all.x = T)
  }
}
return(complete.data)
When I run this code on a series of twelve datasets, the number of rows/columns of the complete.data object is as expected, so it is unlikely that the merge call somehow blows up the usage. For the twelve iterations memory.size() returns 1178, 1364, 1500, 1662, 1656, 1925, 1835, 1987, 2106, 2130, 2217, and 2361. Which, again, is strange, as the dataset at the end is at most 162 MB...
As you can see in the code above, I've already tried a couple of fixes like calling gc() and jgc() (a function that forces a Java garbage collection: jgc <- function(){.jcall("java/lang/System", method = "gc")}). I've also tried merging the data SQL-server-side, but then I run into column-count constraints.
It vexes me that the RAM usage is so much bigger than the dataset that is eventually created, leading me to believe there is some sort of buffer/heap that is overflowing... but I seem unable to find it.
Any advice on how to resolve this issue would be greatly appreciated. Let me know if (parts of) my problem description are vague or if you require more information.
Thanks.
This answer is more of a glorified comment. Simply because the data being processed on one node only requires 160 MB does not mean that the amount of memory needed to process it is 160 MB. Many algorithms require O(n^2) storage space, which would be in the GB range for your chunk of data. So I actually don't see anything surprising here.
I've already tried a couple of fixes like calling GC(), JGC() (which is a function to force a Java garbage collection...
You can't force a garbage collection in Java; calling System.gc() only politely asks the JVM to do a garbage collection, and it is free to ignore the request if it wants. In any case, the JVM usually optimizes garbage collection well on its own, and I doubt this is your bottleneck. More likely, you are simply hitting the overhead which R needs to crunch your data.
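As a rough back-of-the-envelope estimate (my own numbers, assuming mostly numeric columns):
8000 rows × 6500 columns = 52,000,000 values
52,000,000 values × 8 bytes per double ≈ 416 MB
That is a single in-memory copy of one dataset, before counting the copy the JDBC driver holds in the JVM heap while the result set is fetched, or the temporary copies that merge() makes while building complete.data. A couple of gigabytes per worker is therefore not implausible even though the comma-delimited file is only about 160 MB.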

How to Improve Performance and speed

I have written this program to connect and fetch the data into a file, but it is very slow at fetching. Is there any way to improve the performance, or a faster way to load the data into the file? I am targeting around 100,000 to a million records, which is why I am worried about performance. Also, can I use an array fetch size and batch size as we can in Java?
import java.sql as sql
import java.lang as lang

def main():
    driver, url, user, passwd = ('oracle.jdbc.driver.OracleDriver','jdbc:oracle:thin:@localhost:1521:xe','odi_temp','odi_temp')
    ##### Register Driver
    lang.Class.forName(driver)
    ##### Create a Connection Object
    myCon = sql.DriverManager.getConnection(url, user, passwd)
    f = open('c:/test_porgram.txt', 'w')
    try:
        ##### Create a Statement
        myStmt = myCon.createStatement()
        ##### Run a Select Query and get a Result Set
        myRs = myStmt.executeQuery("select emp_id ,first_name,last_name,date_of_join from src_sales_12")
        ##### Loop over the Result Set and print the result in a file
        while (myRs.next()):
            print >> f , "%s,%s,%s,%s" %(myRs.getString("EMP_ID"),myRs.getString("FIRST_NAME"),myRs.getString("LAST_NAME"),myRs.getString("DATE_OF_JOIN") )
    finally:
        myCon.close()
        f.close()

### Entry Point of the program
if __name__ == '__main__':
    main()
Unless you're on the finest, finest gear for the DB and file server, or the worst gear running the script, this application is I/O bound. After the select has returned from the DB, the actual movement of the data will dominate more than any inefficiencies in Jython, Java, or this code.
Your CPU is basically unconscious during this process; you're simply not doing enough data transformation. You could write a process that is slower than the I/O, but this isn't one of them.
You could write this in C and I doubt you'd see a substantial difference.
Can't you just use the Oracle command-line SQL client to directly export the results of that query into a CSV file?
You might use getString with hardcoded indices instead of the column name (in your print statement) so the program doesn't have to look up the names over and over. Also, I don't know enough about Jython/Python file output to say whether this is enabled by default or not, but you should try to make sure your output is buffered.
EDIT:
Code requested (I make no claims about the correctness of this code):
print >> f , "%s,%s,%s,%s" %(myRs.getString(0),myRs.getString(1),myRs.getString(2),myRs.getString(3) )
or
myRs = myStmt.executeQuery("select emp_id ,first_name,last_name,date_of_join from src_sales_12")
hasFirst = myRs.next()
if (hasFirst):
    empIdIdx = myRs.findColumn("EMP_ID")
    fNameIdx = myRs.findColumn("FIRST_NAME")
    lNameIdx = myRs.findColumn("LAST_NAME")
    dojIdx = myRs.findColumn("DATE_OF_JOIN")
    print >> f , "%s,%s,%s,%s" %(myRs.getString(empIdIdx),myRs.getString(fNameIdx),myRs.getString(lNameIdx),myRs.getString(dojIdx) )
    ##### Loop over the Result Set and print the result in a file
    while (myRs.next()):
        print >> f , "%s,%s,%s,%s" %(myRs.getString(empIdIdx),myRs.getString(fNameIdx),myRs.getString(lNameIdx),myRs.getString(dojIdx) )
If you just want to fetch data into files, you can try database tools (for example, "load" or "export").
You may also find that if you build the string that goes into the file inside the SQL select statement itself, you will get better performance.
So your SQL select would be something like SELECT EMP_ID || ',' || FIRST_NAME || ',' || LAST_NAME || ',' || DATE_OF_JOIN MY_DATA ... (depending on your database and separator).
Then in your Java code you just get the one string, empData = myRs.getString("MY_DATA"), and write that to the file. We have seen significant performance benefits doing this.
The other thing you may see benefit from is changing the JDBC connection to use a larger read buffer - rather than 30 rows at a time in the fetch, fetch 5000 rows.
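A minimal Jython sketch of those two suggestions combined (reusing the myCon and f objects from the question; the my_data alias and the 5000-row fetch size are illustrative values):
##### Create a Statement and ask the driver to fetch 5000 rows per round trip
myStmt = myCon.createStatement()
myStmt.setFetchSize(5000)
##### Let the database build each output line, then write it straight to the file
myRs = myStmt.executeQuery(
    "select emp_id || ',' || first_name || ',' || last_name || ',' || date_of_join as my_data "
    "from src_sales_12")
while (myRs.next()):
    print >> f, myRs.getString("MY_DATA")
This moves the string concatenation into the database and cuts the number of getString() calls per row from four to one.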
