I'm using Snowflake on a Windows PC.
For example: https://<my_org>.snowflakecomputing.com/console#/internal/worksheet
I have a bunch of queries, the collective output of which I want to capture and load into a file.
Apart from running the queries one at a time and copying and pasting the output into the file, is there a way I can run all the queries at once and have the output logged to a file on my PC?
There are many ways to achieve the high-level outcome you are seeking, but you have not provided enough context to know which would be best suited to your situation. For example, by mentioning https://<my_org>.snowflakecomputing.com/console#/internal/worksheet, it is clear that you are currently planning to execute the series of queries through the Snowflake web UI. Is using the web UI a strict requirement of your use case?
If not, I would recommend that you consider using a Python script (along with the Snowflake Connector for Python) for a task like this. One strategy would be to have the Python script serially process each query as follows:
Execute the query
Export the result set (as a CSV file) to a stage location in cloud storage via two of Snowflake's powerful features:
RESULT_SCAN() function
COPY INTO <location> command to EXPORT data (which is the "opposite" of the COPY INTO <table> command used to IMPORT data)
Download the CSV file to your local host via Snowflake's GET command
Here is a sample of what such a Python script might look like...
import snowflake.connector

query_array = [r"""
SELECT ...
FROM ...
WHERE ...
""",r"""
SELECT ...
FROM ...
WHERE ...
"""
]

conn = snowflake.connector.connect(
    account = ...
    ,user = ...
    ,password = ...
    ,role = ...
    ,warehouse = ...
)

file_prefix = "query_result"  # any prefix you like for the unloaded file names
file_number = 0
for query in query_array:
    file_number += 1
    file_name = f"{file_prefix}_{file_number}.csv.gz"
    rs_query = conn.cursor(snowflake.connector.DictCursor).execute(query)
    query_id = rs_query.sfqid  # query ID of the query just executed (used by RESULT_SCAN)
    sql_copy_into = f"""
COPY INTO @MY_STAGE/{file_name}
FROM (SELECT * FROM TABLE(RESULT_SCAN('{query_id}')))
DETAILED_OUTPUT = TRUE
HEADER = TRUE
SINGLE = TRUE
OVERWRITE = TRUE
"""
    rs_copy_into = conn.cursor(snowflake.connector.DictCursor).execute(sql_copy_into)
    for row_copy_into in rs_copy_into:
        file_name_in_stage = row_copy_into["FILE_NAME"]
        sql_get_to_local = f"""
GET @MY_STAGE/{file_name_in_stage} file://.
"""
        rs_get_to_local = conn.cursor(snowflake.connector.DictCursor).execute(sql_get_to_local)
Note: I have chosen (for performance reasons) to export and transfer the files as gzip-compressed (.gz) files; you could skip this by passing the COMPRESSION = NONE file format option in the COPY INTO <location> command.
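For example, the COPY INTO statement built inside the loop above could be adjusted like this if you prefer plain, uncompressed CSV files (a sketch only; MY_STAGE and the file naming follow the same illustrative choices as above):

    file_name = f"{file_prefix}_{file_number}.csv"  # no .gz suffix once compression is disabled
    sql_copy_into = f"""
COPY INTO @MY_STAGE/{file_name}
FROM (SELECT * FROM TABLE(RESULT_SCAN('{query_id}')))
FILE_FORMAT = (TYPE = CSV COMPRESSION = NONE)
DETAILED_OUTPUT = TRUE
HEADER = TRUE
SINGLE = TRUE
OVERWRITE = TRUE
"""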
Also, if your result sets are much smaller, then you could use an entirely different strategy and simply have Python pull and write the results of each query directly to a local file. I assumed that your result sets might be larger, hence the export + download option I have employed here.
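For that simpler approach, here is a minimal sketch (it assumes the same conn and query_array as in the script above; the output file names are illustrative):

import csv

file_number = 0
for query in query_array:
    file_number += 1
    cur = conn.cursor()
    cur.execute(query)
    with open(f"query_result_{file_number}.csv", "w", newline="") as out_file:
        writer = csv.writer(out_file)
        writer.writerow([col[0] for col in cur.description])  # header row from the cursor metadata
        writer.writerows(cur.fetchall())                      # write every row of the result set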
You can use the SnowSQL client for this. See https://docs.snowflake.com/en/user-guide/snowsql.html
Once you get it configured, you can make a batch file or similar that calls SnowSQL to run each of your queries and write the output to a file. Something like:
@echo off
>output.txt (
snowsql -q "select blah"
snowsql -q "select blah"
...
snowsql -q "select blah"
)
We're building dynamic data loading statements for Snowflake using the Python interface.
We want to create a stage at query runtime and use that stage in a subsequent statement. Table and stage names are dynamic, using bind variables.
Yet we can't seem to find the correct syntax, even after trying everything on https://docs.snowflake.com/en/user-guide/python-connector-api.html
COPY INTO IDENTIFIER( %(table_name)s )(SRC, LOAD_TIME, ROW_HASH)
FROM (SELECT t.$1, CURRENT_TIMESTAMP(0), MD5(t.$1) FROM "'%(stage_name)s'" t)
PURGE = TRUE;
Is this even possible? Does it work for anyone?
Your code does not actually create a stage as you mentioned, and you don't need to create one; instead, use a table stage or a user stage. The SQL below uses the table stage.
You also need to change your syntax a little and use a more Pythonic approach: f-strings.
sql = f"""COPY INTO {table_name} (SRC, LOAD_TIME, ROW_HASH)
FROM (SELECT t.$1, CURRENT_TIMESTAMP(0), MD5(t.$1) FROM @%{table_name} t)
PURGE = TRUE"""
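For completeness, a minimal sketch of how that might be wired up through the connector (it assumes an existing conn from snowflake.connector.connect; the table name and local file path are illustrative, and the PUT shows how the implicit table stage @%<table_name> gets populated, with no CREATE STAGE needed):

import snowflake.connector

# conn = snowflake.connector.connect(...)  # assumed to already exist
table_name = "MY_TARGET_TABLE"              # illustrative table name

# Upload the local file to the table's implicit stage.
conn.cursor().execute(f"PUT file:///tmp/data.csv @%{table_name}")

# Load from the table stage into the table itself.
sql = f"""COPY INTO {table_name} (SRC, LOAD_TIME, ROW_HASH)
FROM (SELECT t.$1, CURRENT_TIMESTAMP(0), MD5(t.$1) FROM @%{table_name} t)
PURGE = TRUE"""
conn.cursor().execute(sql)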
Code Migration due to Performance Issues :-
SQL Server LIKE Condition ( BEFORE )
SQL Server Full Text Search --> CONTAINS ( BEFORE )
Elastic Search ( CURRENTLY )
Achieved So Far :-
We have a web page created in ASP.NET Core which has an autocomplete drop-down of 2.5+ million companies indexed in Elasticsearch: https://www.99corporates.com/
Due to performance issues we have successfully shifted our code from SQL Server Full Text Search to Elasticsearch, and we are using NEST v7.2.1 and Elasticsearch.Net v7.2.1 in our .NET code.
Still looking for a solution :-
If the user does not select a company from the autocomplete list and simply enters a few characters and clicks Go, then a list should be displayed, which we had done earlier by using SQL Server Full Text Search (CONTAINS).
Can we call the ASP.NET web service which we have created from SQL CLR, with code like SELECT * FROM dbo.Table WHERE Name IN( dbo.SQLWebRequest('') )?
[System.Web.Script.Services.ScriptMethod()]
[System.Web.Services.WebMethod]
public static List<string> SearchCompany(string prefixText, int count)
{
}
Are there any better or alternative options?
While that solution (i.e. the SQL-APIConsumer SQLCLR project) "works", it is not scalable. It also requires setting the database to TRUSTWORTHY ON (a security risk), and it loads a few assemblies as UNSAFE, such as Json.NET. That is risky if any of them use static variables for caching and expect each caller to be isolated in its own App Domain, because SQLCLR is a single, shared App Domain: static variables are shared across all callers, and multiple concurrent threads can cause race conditions. This is not to say that this is definitely happening, since I haven't seen the code; but if you haven't either reviewed the code or tested it with multiple concurrent threads, then it's a gamble with regard to stability and predictable, expected behavior.
To a slight degree I am biased, given that I sell a SQLCLR library, SQL#, whose Full version contains a stored procedure that also does this, but a) handles security properly via signatures (it does not enable TRUSTWORTHY), b) allows for handling scalability, c) does not require any UNSAFE assemblies, and d) handles more scenarios (better header handling, etc.). It doesn't handle any JSON; it just returns the web service response, and you can unpack that using OPENJSON or something else if you prefer. (Yes, there is a Free version of SQL#, but it does not contain INET_GetWebPages.)
HOWEVER, I don't think SQLCLR is a good fit for this scenario in the first place. In your first two versions of this project (using LIKE and then CONTAINS) it made sense to send the user input directly into the query. But now that you are using a web service to get a list of matching values from that user input, you are no longer confined to that approach. You can, and should, handle the web service / Elastic Search portion of this separately, in the app layer.
Rather than passing the user input into the query, only to have the query pause to get that list of 0 or more matching values, you should do the following:
Before executing any query, get the list of matching values directly in the app layer.
If no matching values are returned, you can skip the database call entirely as you already have your answer, and respond immediately to the user (much faster response time when no matches return)
If there are matches, then execute the search stored procedure, sending that list of matches as-is via a Table-Valued Parameter (TVP), which becomes a table variable in the stored procedure. Use that table variable to INNER JOIN against the table rather than building an IN list, since IN lists do not scale well. Also, be sure to send the TVP values to SQL Server using the IEnumerable<SqlDataRecord> method, not the DataTable approach, as the latter merely wastes CPU, time, and memory.
For example code on how to accomplish this correctly, please see my answer to Pass Dictionary to Stored Procedure T-SQL
In C#-style pseudo-code, this would be something along the lines of the following:
List<string> companies = SearchCompany(PrefixText, Count);

if (companies.Count == 0)
{
    Response.Write("Nope");
}
else
{
    using (SqlConnection db = new SqlConnection(connectionString))
    {
        using (SqlCommand batch = db.CreateCommand())
        {
            batch.CommandType = CommandType.StoredProcedure;
            batch.CommandText = "ProcName";

            SqlParameter tvp = new SqlParameter("ParamName", SqlDbType.Structured);
            tvp.Value = MethodThatYieldReturnsList(companies); // yields IEnumerable<SqlDataRecord> built from the list
            batch.Parameters.Add(tvp);

            db.Open();

            using (SqlDataReader results = batch.ExecuteReader())
            {
                if (results.HasRows)
                {
                    // deal with results
                    Response.Write(results....);
                }
            }
        }
    }
}
Done. Got the solution.
Used SQL CLR https://github.com/geral2/SQL-APIConsumer
exec [dbo].[APICaller_POST]
    @URL = 'https://www.-----/SearchCompany'
    ,@JsonBody = '{"searchText":"GOOG","count":10}'
Let me know if there is any other / better options to achieve this.
When I have a query generated like this:
var query = from x in Entities.SomeTable
select x;
I can set a breakpoint and, after hovering the cursor over query, I can see the SQL command that will be sent to the database. Unfortunately, I cannot do this when I use Count:
var query = (from x in Entities.SomeTable
select x).Count();
Of course I could see what reaches SQL Server using Profiler, but maybe someone has an idea of how to do it (if it is possible) in VS.
You can use ToTraceString():
// ToTraceString() is exposed by ObjectQuery<T>, so call it on the query itself (Count() executes immediately and returns an int):
ObjectQuery<SomeTable> query = (ObjectQuery<SomeTable>)(from x in Entities.SomeTable select x);
Console.WriteLine(query.ToTraceString());
You can use Database.Log to log any query made, like this:
using (var context = new MyContext())
{
    context.Database.Log = Console.Write;

    // Your code here...
}
Usually, in my context's constructor, I set that to my logger (whether it is NLog, log4net, or the stock .NET loggers) rather than the console, but the actual logging tool is irrelevant.
For more information, see the Entity Framework documentation on Database.Log.
In EF6 and above, you can use the following before your query:
context.Database.Log = s => System.Diagnostics.Debug.WriteLine(s);
I've found this to be quicker than pulling up SQL Profiler and running a trace.
Also, this post talks more about this topic:
How do I view the SQL generated by the Entity Framework?
I have some simple Scala code that retrieves data from the Hive database and creates an RDD out of the result set. It works fine with HiveContext. The code is similar to this:
val hc = new HiveContext(sc)
val mySql = "select PRODUCT_CODE, DATA_UNIT from account"
hc.sql("use myDatabase")
val rdd = hc.sql(mySql).rdd
The version of Spark that I'm using is 1.3. The problem is that the default setting for hive.execution.engine is 'mr', which makes Hive use MapReduce, which is slow. Unfortunately, I can't force it to use "spark".
I tried to use SQLContext instead, replacing the HiveContext with hc = new SQLContext(sc), to see if performance would improve. With this change, the line
hc.sql("use myDatabase")
is throwing the following exception:
Exception in thread "main" java.lang.RuntimeException: [1.1] failure: ``insert'' expected but identifier use found
use myDatabase
^
The Spark 1.3 documentation says that SparkSQL can work with Hive tables. My question is how to indicate that I want to use a certain database instead of the default one.
use database
is supported in later Spark versions
https://docs.databricks.com/spark/latest/spark-sql/language-manual/use-database.html
You need to put the statement in two separate spark.sql calls like this:
spark.sql("use mydb")
spark.sql("select * from mytab_in_mydb").show
Go back to creating the HiveContext. The HiveContext gives you the ability to create a DataFrame using Hive's metastore. Spark only uses the metastore from Hive; it doesn't use Hive as a processing engine to retrieve the data. So when you create the DataFrame from your SQL query, it is really just asking Hive's metastore "Where is the data, and what is the format of the data?"
Spark takes that information and runs the process against the underlying data on HDFS. So Spark is executing the query, not Hive.
When you create the SQLContext, you remove the link between Spark and the Hive metastore, so the error is saying it doesn't understand what you want to do.
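For reference, a rough PySpark (Python) equivalent of the question's snippet that keeps the HiveContext, and therefore the metastore link, intact (Spark 1.x API; the database and table names come from the question, the app name is illustrative):

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="hive-metastore-example")  # illustrative app name
hc = HiveContext(sc)                                  # keeps the link to the Hive metastore

hc.sql("use myDatabase")                              # works with HiveContext, not with a plain SQLContext
rdd = hc.sql("select PRODUCT_CODE, DATA_UNIT from account").rdd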
I have not been able to implement the use database command, but here is a workaround to use the desired database:
spark-shell --queue QUEUENAME;
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val res2 = sqlContext.sql("select count(1) from DB_NAME.TABLE_NAME")
res2.collect()
I have written this program for connecting and fetching the data into a file, but it is very slow at fetching. Is there any way to improve the performance and load the data into the file faster? I am targeting around 100,000 to a million records, which is why I am worried about performance. Also, can I use an array fetch size and batch size as we can do in Java?
import java.sql as sql
import java.lang as lang

def main():
    driver, url, user, passwd = ('oracle.jdbc.driver.OracleDriver', 'jdbc:oracle:thin:@localhost:1521:xe', 'odi_temp', 'odi_temp')

    ##### Register Driver
    lang.Class.forName(driver)

    ##### Create a Connection Object
    myCon = sql.DriverManager.getConnection(url, user, passwd)
    f = open('c:/test_porgram.txt', 'w')
    try:
        ##### Create a Statement
        myStmt = myCon.createStatement()

        ##### Run a Select Query and get a Result Set
        myRs = myStmt.executeQuery("select emp_id ,first_name,last_name,date_of_join from src_sales_12")

        ##### Loop over the Result Set and print the result in a file
        while (myRs.next()):
            print >> f, "%s,%s,%s,%s" % (myRs.getString("EMP_ID"), myRs.getString("FIRST_NAME"), myRs.getString("LAST_NAME"), myRs.getString("DATE_OF_JOIN"))
    finally:
        myCon.close()
        f.close()

### Entry Point of the program
if __name__ == '__main__':
    main()
Unless you're on the finest, finest gear for the DB and file server, or the worst gear running the script, this application is I/O bound. After the select has returned from the DB, the actual movement of the data will dominate more than any inefficiencies in Jython, Java, or this code.
Your CPU is basically idle during this process; you're simply not doing enough data transformation. You could write a process that is slower than the I/O, but this isn't one of them.
You could write this in C and I doubt you'd see a substantial difference.
Can't you just use the Oracle command-line SQL client to directly export the results of that query into a CSV file?
You might use getString with hardcoded indices instead of the column name (in your print statement) so the program doesn't have to look up the names over and over. Also, I don't know enough about Jython/Python file output to say whether this is enabled by default or not, but you should try to make sure your output is buffered.
EDIT:
Code requested (I make no claims about the correctness of this code):
print >> f, "%s,%s,%s,%s" % (myRs.getString(1), myRs.getString(2), myRs.getString(3), myRs.getString(4))  # JDBC column indices are 1-based
or
myRs = myStmt.executeQuery("select emp_id ,first_name,last_name,date_of_join from src_sales_12")

hasFirst = myRs.next()
if (hasFirst):
    empIdIdx = myRs.findColumn("EMP_ID")
    fNameIdx = myRs.findColumn("FIRST_NAME")
    lNameIdx = myRs.findColumn("LAST_NAME")
    dojIdx = myRs.findColumn("DATE_OF_JOIN")
    print >> f, "%s,%s,%s,%s" % (myRs.getString(empIdIdx), myRs.getString(fNameIdx), myRs.getString(lNameIdx), myRs.getString(dojIdx))

    ##### Loop over the rest of the Result Set and print the result in a file
    while (myRs.next()):
        print >> f, "%s,%s,%s,%s" % (myRs.getString(empIdIdx), myRs.getString(fNameIdx), myRs.getString(lNameIdx), myRs.getString(dojIdx))
If you just want to fetch data into files, you can try database tools (for example, "load" or "export").
You may also find that if you construct the string that goes into the file within the SQL select statement itself, you will get better performance.
So your SQL select would be something like SELECT EMP_ID || ',' || FIRST_NAME || ',' || LAST_NAME || ',' || DATE_OF_JOIN AS EMP_DATA ... (depending on what the database and separator are),
then in your Jython code you just get the one string, empData = myRs.getString("EMP_DATA"), and write that to the file. We have seen significant performance benefits doing this.
The other thing you may see benefit from is changing the JDBC connection to use a larger read buffer - rather than 30 rows at a time in the fetch, fetch 5000 rows.
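A minimal sketch of both suggestions applied to the question's Jython script (it assumes the myCon connection and the f file handle from that script; the EMP_DATA alias and the 5000-row fetch size are illustrative choices):

##### Create a Statement and ask the Oracle driver for 5000 rows per round trip
myStmt = myCon.createStatement()
myStmt.setFetchSize(5000)

##### Let the database build each output line, then write the single string straight to the file
myRs = myStmt.executeQuery(
    "select EMP_ID || ',' || FIRST_NAME || ',' || LAST_NAME || ',' || DATE_OF_JOIN as EMP_DATA "
    "from src_sales_12")
while (myRs.next()):
    print >> f, myRs.getString("EMP_DATA")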