Max LOB size (16777216) exceeded for array_agg - snowflake-cloud-data-platform

I have a table with about 30k rows, and each row is wrapped in {}. In the end I would like the output to look like this:
[
{Object1},
{Object2}
]
This solution worked well while we didn't have that many rows, but now we hit this limit.
COPY INTO FROM (
    SELECT array_agg(*) FROM (
        SELECT OBJECT_CONSTRUCT( ......
            OBJECT_CONSTRUCT(.....) )
        FROM (
            SELECT * FROM (
                SELECT
                    REPLACE(parse_json(OFFER):"spec":"im:offerID", '"')::varchar AS ID,
                    ...,
                    ... )))) )
FILE_FORMAT = (TYPE = JSON COMPRESSION = NONE)
CREDENTIALS = (aws_key_id='' aws_secret_key='')
OVERWRITE = TRUE SINGLE = TRUE
HEADER = FALSE
MAX_FILE_SIZE = 267772160
We deliver this file to an external agency, and that format is the only one they can read.
Is there another solution, or a way to work around this limit?
Thanks

As you've discovered, there is a hard 16 MB limit on array_agg (and in a lot of other places in Snowflake, e.g. it's the maximum size of a VARIANT column).
If it is acceptable to create multiple files, then you can probably achieve this in a stored procedure: find some combination of column values that guarantees the data in each partition will produce an array_agg result smaller than 16 MB, then loop through those partitions, running a COPY INTO for each one and writing to a different file each time.
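Very roughly, something along these lines (the names @my_stage, offers and BUCKET_ID are placeholders rather than objects from your query, and the partition key has to be chosen so each slice aggregates to under 16 MB):
-- Hypothetical sketch: one COPY INTO per partition, driven from a stored procedure loop.
COPY INTO @my_stage/offers_bucket_1.json FROM (
    SELECT array_agg(obj)
    FROM (
        SELECT OBJECT_CONSTRUCT('id', ID) AS obj   -- build the full object here
        FROM offers
        WHERE BUCKET_ID = 1                        -- vary per iteration: 2, 3, ...
    )
)
FILE_FORMAT = (TYPE = JSON COMPRESSION = NONE)
OVERWRITE = TRUE SINGLE = TRUE;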
If you have to produce a single file then I can't think of a way of achieving this in Snowflake (though someone else may be able to). If you can post-process the file once it is written to S3, it would be straightforward to copy the data out as JSON and then edit the file to add the '[' and ']' around it.

Related

flask-sqlalchemy slow paginate count

I have a Postgres 10 database in my Flask app. I'm trying to paginate filtered results on a table with millions of rows. The problem is that the paginate method counts the total number of query results very inefficiently.
Here's an example with a dummy filter:
paginate = Buildings.query.filter(Buildings.height > 10).paginate(1, 10)
Under the hood it performs two queries:
SELECT * FROM buildings where height > 10
SELECT count(*) FROM (
    SELECT * FROM buildings WHERE height > 10
) AS t
The count returns 200,000 rows.
The problem is that the count on the raw SELECT, without the subquery, is quite fast (~30 ms), but the paginate method wraps it in a subquery, which takes ~30 s.
The query plan on a cold database:
Is there a way to use the default paginate method from flask-sqlalchemy in a performant way?
EDIT:
To give a better understanding of my problem, here are the real filter operations used in my case, but with dummy field names:
paginate = Buildings.query.filter_by(owner_id=None).filter(Buildings.address.like('%A%')).paginate(1,10)
So the SQL the ORM produces is:
SELECT count(*) AS count_1
FROM (SELECT foo_column, [...]
FROM buildings
WHERE buildings.owner_id IS NULL AND buildings.address LIKE '%A%' ) AS anon_1
That query is already covered by indexes from:
CREATE INDEX ix_trgm_buildings_address ON public.buildings USING gin (address gin_trgm_ops);
CREATE INDEX ix_buildings_owner_id ON public.buildings USING btree (owner_id)
The problem is just this count, which is very slow.
So it looks like a disk-reading problem. The solutions would be to get faster disks, get more RAM so it can all be cached, or, if you have enough RAM, to use pg_prewarm to get all the data into the cache ahead of need. Or try increasing effective_io_concurrency, so that the bitmap heap scan can have more than one I/O request outstanding at a time.
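For instance, something like this (pg_prewarm is a contrib extension, and the right effective_io_concurrency value depends on your storage):
-- Warm the table and its indexes into the buffer cache ahead of time.
CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('buildings');
SELECT pg_prewarm('ix_buildings_owner_id');
SELECT pg_prewarm('ix_trgm_buildings_address');
-- Allow more I/O requests in flight during bitmap heap scans (tune for your disks).
SET effective_io_concurrency = 200;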
Your actual query seems to be more complex than the one you show, based on the Filter: entry, and on the Rows Removed by Index Recheck: entry in combination with the lack of lossy blocks. There might be some other things to try, but we would need to see the real query and the index definition (which apparently is not just an ordinary btree index on "height").

SSIS - Is there a way to filter data from a flat table?

I have general ledger transactions down to the individual journal bookings. I am working on a process to rebuild these bookings from the bottom up, using tables and other information located in other places. To start, I have an SSIS package that pulls in 3-4 different "Divisions" worth of data.
In one case there are over 600k lines, and I'll need at most 50k. Loading the 600k into a table takes a while, and I was looking for a way to head that off. If I were doing it in SQL Server, I'd do something like:
SELECT * FROM C601
WHERE (COST_CENTER = '5U' AND ACCOUNT = 1100001)
   OR (COST_CENTER = '5U' AND ACCOUNT = 1300001)
I'd have about 12-13 WHERE items in total, but they would reduce the load to maybe 10% of the original rows. Is there a way to filter the flat file in SSIS before I load the SQL Server table, as I would with the SQL above?
Use a Conditional Split Transformation
1st approach
Add a similar expression:
[COST_CENTER] == "5U" && ([ACCOUNT] == 1100001 || [ACCOUNT] == 1300001)
2nd approach
Or you can invert the logic with a single expression that matches the rows you don't want:
[COST_CENTER] != "5U" || ([ACCOUNT] != 1100001 && [ACCOUNT] != 1300001)
Then use the Conditional Split default output to get the desired result.

lua and lsqlite3: speeding up select statement

I'm using the lsqlite3 Lua wrapper and I'm making queries into a database. My DB has ~5 million rows, and the code I'm using to retrieve rows is akin to:
local lsqlite3 = require('lsqlite3')
local db = lsqlite3.open('mydb')
local temp = {}
local sql = "SELECT A, B FROM tab WHERE FOO = BAR ORDER BY A DESC LIMIT N"
-- nrows() returns each row as a table keyed by column name
for row in db:nrows(sql) do temp[row.A] = row.B end
As you can see, I'm trying to get the top N rows, filtered on FOO and sorted in descending order by A (I want to sort first and then apply the LIMIT, not the other way around). I indexed column A, but it doesn't seem to make much of a difference. How can I make this faster?
You need to index the column on which you filter (i.e. the one in the WHERE clause). The reason is that ORDER BY comes into play after filtering, not the other way around.
So you probably should create an index on FOO.
Can you post your table schema?
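In the meantime, a minimal sketch based on the pseudo-query above (the composite index that also covers the ORDER BY column is an assumption about what will help here):
-- Index the WHERE column; including the ORDER BY column lets SQLite
-- satisfy both the filter and the sort from the same index.
CREATE INDEX IF NOT EXISTS idx_tab_foo_a ON tab (FOO, A DESC);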
UPDATE
Also, you can increase the SQLite cache, e.g.:
PRAGMA cache_size=100000
You can adjust this depending on the memory available and the size of your database.
UPDATE 2
If you want a better understanding of how your query is handled by SQLite, you can ask it to provide the query plan:
http://www.sqlite.org/eqp.html
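For example (the literal value is a placeholder):
EXPLAIN QUERY PLAN
SELECT A, B FROM tab WHERE FOO = 'BAR' ORDER BY A DESC LIMIT 10;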
UPDATE 3
I did not understand your context properly in my initial answer. If you ORDER BY over a large data set, you probably want SQLite to use the index that serves the sort rather than the one on the filter, and you can tell it not to use an index on a column by prefixing that column with a unary +, like this:
SELECT a, b FROM foo WHERE +a > 30 ORDER BY b

Groovy sql dataset causes java.lang.OutOfMemory

I have a table with 252759 tuples. I would like to use a DataSet object to make my life easier; however, when I try to create a DataSet for my table, I get java.lang.OutOfMemoryError after about 3 seconds.
I have no experience with DataSets. Are there any guidelines on how to use a DataSet object for big tables?
Do you really need to retrieve all the rows at once? If not, then you could just retrieve them in batches of (for example) 10000 using the approach shown below.
import groovy.sql.Sql

def db = [url: 'jdbc:hsqldb:mem:testDB', user: 'sa', password: '', driver: 'org.hsqldb.jdbcDriver']
def sql = Sql.newInstance(db.url, db.user, db.password, db.driver)

String query = "SELECT * FROM my_table WHERE id > ? ORDER BY id LIMIT 10000"
Integer maxId = 0

// Closure that executes the query and returns true if some rows were processed
Closure executeQuery = {
    def oldMaxId = maxId
    sql.eachRow(query, [maxId]) { row ->
        // Code to process each row goes here.....
        maxId = row.id
    }
    return maxId != oldMaxId
}

// Keep fetching batches until a run returns no new rows
while (executeQuery()) { }
AFAIK LIMIT isn't supported by every RDBMS, but most have an equivalent feature that limits the number of rows returned by a query.
Also, I haven't tested (or even compiled) the code above, so handle with care!
Why not start by giving the JVM more memory?
java -Xms<initial heap size> -Xmx<maximum heap size>
252759 tuples doesn't sound like anything a machine with 4 GB RAM plus some virtual memory couldn't handle in memory.

How to Improve Performance and speed

I have written this program to connect and fetch the data into a file, but it is very slow at fetching. Is there any way to improve the performance, or a faster way to load the data into the file? I am targeting around 100,000 to a million records, so that's why I am worried about performance. Also, can I use an array fetch size and batch size, as we can in Java?
import java.sql as sql
import java.lang as lang

def main():
    driver, url, user, passwd = ('oracle.jdbc.driver.OracleDriver', 'jdbc:oracle:thin:@localhost:1521:xe', 'odi_temp', 'odi_temp')
    ##### Register the driver
    lang.Class.forName(driver)
    ##### Create a Connection object
    myCon = sql.DriverManager.getConnection(url, user, passwd)
    f = open('c:/test_porgram.txt', 'w')
    try:
        ##### Create a Statement
        myStmt = myCon.createStatement()
        ##### Run a SELECT query and get a ResultSet
        myRs = myStmt.executeQuery("select emp_id, first_name, last_name, date_of_join from src_sales_12")
        ##### Loop over the ResultSet and print each row to the file
        while (myRs.next()):
            print >> f, "%s,%s,%s,%s" % (myRs.getString("EMP_ID"), myRs.getString("FIRST_NAME"), myRs.getString("LAST_NAME"), myRs.getString("DATE_OF_JOIN"))
    finally:
        myCon.close()
        f.close()

### Entry point of the program
if __name__ == '__main__':
    main()
Unless you're on the finest, finest gear for the DB and file server, or the worst gear running the script, this application is I/O bound. After the SELECT has returned from the DB, the actual movement of the data will dominate more than any inefficiencies in Jython, Java, or this code.
Your CPU is basically idle during this process; you're simply not doing enough data transformation. You could write a process that is slower than the I/O, but this isn't one of them.
You could write this in C and I doubt you'd see a substantial difference.
Can't you just use the Oracle command-line SQL client to directly export the results of that query into a CSV file?
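For example, a hypothetical SQL*Plus session (the spool file name and formatting settings are illustrative, not taken from your setup):
SET HEADING OFF FEEDBACK OFF PAGESIZE 0
SPOOL c:\emp_export.csv
SELECT emp_id || ',' || first_name || ',' || last_name || ',' || date_of_join
FROM src_sales_12;
SPOOL OFF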
You might use getString with hardcoded indices instead of the column name (in your print statement) so the program doesn't have to look up the names over and over. Also, I don't know enough about Jython/Python file output to say whether this is enabled by default or not, but you should try to make sure your output is buffered.
EDIT:
Code requested (I make no claims about the correctness of this code):
# JDBC ResultSet column indices are 1-based
print >> f, "%s,%s,%s,%s" % (myRs.getString(1), myRs.getString(2), myRs.getString(3), myRs.getString(4))
or
myRs = myStmt.executeQuery("select emp_id, first_name, last_name, date_of_join from src_sales_12")
hasFirst = myRs.next()
if (hasFirst):
    # Look up each column index once instead of resolving the names on every row
    empIdIdx = myRs.findColumn("EMP_ID")
    fNameIdx = myRs.findColumn("FIRST_NAME")
    lNameIdx = myRs.findColumn("LAST_NAME")
    dojIdx = myRs.findColumn("DATE_OF_JOIN")
    print >> f, "%s,%s,%s,%s" % (myRs.getString(empIdIdx), myRs.getString(fNameIdx), myRs.getString(lNameIdx), myRs.getString(dojIdx))
    ##### Loop over the rest of the Result Set and print the result in a file
    while (myRs.next()):
        print >> f, "%s,%s,%s,%s" % (myRs.getString(empIdIdx), myRs.getString(fNameIdx), myRs.getString(lNameIdx), myRs.getString(dojIdx))
If you just want to fetch data into files, you can try database tools (for example, "load" or "export").
You may also find that if you do the construction of the string that goes into the file in the SQL SELECT statement, you will get better performance.
So your SQL SELECT would be something like SELECT EMP_ID || ',' || FIRST_NAME || ',' || LAST_NAME || ',' || DATE_OF_JOIN AS MY_DATA ... (depending on what your database and separator are).
Then in your Java code you just get the one string, empData = myRs.getString("MY_DATA"), and write that to the file. We have seen significant performance benefits doing this.
The other thing you may see a benefit from is changing the JDBC statement to use a larger fetch size (Statement.setFetchSize): rather than fetching 30 rows at a time, fetch 5000.
