I have a table with 252759 tuples. I would like to use DataSet object to make my life easier, however when I try to create a DataSet for my table, after 3 seconds, I get java.lang.OutOfMemory.
I have no experience with Datasets, are there any guidelines how to use DataSet object for big tables?
Do you really need to retrieve all the rows at once? If not, then you could just retrieve them in batches of (for example) 10000 using the approach shown below.
def db = [url:'jdbc:hsqldb:mem:testDB', user:'sa', password:'', driver:'org.hsqldb.jdbcDriver']
def sql = Sql.newInstance(db.url, db.user, db.password, db.driver)
String query = "SELECT * FROM my_table WHERE id > ? ORDER BY id limit 10000"
Integer maxId = 0
// Closure that executes the query and returns true if some rows were processed
Closure executeQuery = {
def oldMaxId = maxId
sql.eachRow(query, [maxId]) { row ->
// Code to process each row goes here.....
maxId = row.id
}
return maxId != oldMaxId
}
while (executeQuery());
AFAIK limit is a MySQL-specific feature, but most other RDBMS have an equivalent feature that limits the number of rows returned by a query.
Also, I haven't tested (or even compiled) the code above, so handle with care!
Why not start with giving the JVM more memory?
java -Xms<initial heap size> -Xmx<maximum heap size>
252759 tuples doesn't sound like anything a maching with 4GB RAM + some virtual memory couldn't handle in memory.
Related
I have a Postgres 10 database in my Flask app. I'm trying to paginate the filtering results on table over milions of rows. The problem is, that paginate method do counting total number of query results totaly ineffective.
Heres the example with dummy filter:
paginate = Buildings.query.filter(height>10).paginate(1,10)
Under the hood if perform 2 queries:
SELECT * FROM buildings where height > 10
SELECT count(*) FROM (
SELECT * FROM buildings where height > 10
)
--------
count returns 200,000 rows
The problem is that count on raw select without subquery is quite fast ~30ms, but paginate method wraps that into subquery that takes ~30s.
The query plan on cold database:
Is there an option of using default paginate method from flask-sqlalchemy in performant way?
EDIT:
To get the better understanding of my problem here is the real filter operations used in my case, but with dummy field names:
paginate = Buildings.query.filter_by(owner_id=None).filter(Buildings.address.like('%A%')).paginate(1,10)
So the SQL the ORM produce is:
SELECT count(*) AS count_1
FROM (SELECT foo_column, [...]
FROM buildings
WHERE buildings.owner_id IS NULL AND buildings.address LIKE '%A%' ) AS anon_1
That query is already optimized by indices from:
CREATE INDEX ix_trgm_buildings_address ON public.buildings USING gin (address gin_trgm_ops);
CREATE INDEX ix_buildings_owner_id ON public.buildings USING btree (owner_id)
The problem is just this count function, that's very slow.
So it looks like a disk-reading problem. The solutions would be get faster disks, get more RAM is it all can be cached, or if you have enough RAM than to use pg_prewarm to get all the data into the cache ahead of need. Or try increasing effective_io_concurrency, so that the bitmap heap scan can have more than one IO request outstanding at a time.
Your actual query seems to be more complex than the one you show, based on the Filter: entry and based on the Row Removed by Index Recheck: entry in combination with the lack of Lossy blocks. There might be some other things to try, but we would need to see the real query and the index definition (which apparently is not just an ordinary btree index on "height").
I am connecting to a universe database (from rocket software) using their .net driver. I would like to fetch data on demand on user request per page i.e. do pagination. With other databases we could use (offset fetch) but universe db doesn't seem to support it. It does not recognize keyword offset, something like
SELECT NAME, AGE FROM CONTACTS WHERE AGE > 25 offset 5 sample 5 does not work. I does not recognize those keywords and there is no good documentation :-(
Note: Although it is traditionally a multi-value database, the one I am using does not use multi value types but the structure is normalized.
This is certainly one of the shortcomings of this platform. I have worked through this in the past with the something similar to the following subroutine. I had to remove a bunch of stuff for brevity but this compiles so it must work completely bug free, right?
Caveats: You need to have #SELECT DICT item in each file you want to use this with containing all of the columns you want to return.
Multivalues get a little tricky. I had flattened the data I was using this with so I did not run into that problem, but this does not do UNNESTs.
Also you might want to add a value saying how many records there are total and possibly work out some kind of token passing and list saving to cut down on executing the query each time you run it but that gets much, much deeper than the basic question at hand.
SUBROUTINE SQLSelectWithOffset(TableName,UVWithClause,Starting,Offset)
***********************************************************************
* PROGRAM ID: SQLSelectWithOffset
*
* PROGRAM TITLE: SQLSelectWithOffset
*
* DESCRIPTION: Universe doesn't support sql commands using starting and offset
* which makes life hard when you want all of a file
* but you choke on the size. Tokens allow for the selectlist to be saved
* TableName = UV FIle to select on. If this is blank program will return the number of records remaining
* UVWithClause = Your critera, WITH or BY criteria you want in a sort select.
* Starting = Holds you place in line
* Offest = How many records to return
************************************************************************
$INCLUDE UNIVERSE.INCLUDE ODBC.H
RETURN.LIST = ""
IF Starting = "" or Starting < 1 THEN
Starting = 1
END
GOSUB GET.MASTER.LIST
FOR X=Starting TO Offset
ID = EXTRACT(FULL.LIST,X,0,0)
IF ID = "" THEN CONTINUE
RETURN.LIST<-1> = ID
NEXT X
SELECT RETURN.LIST TO 9
SQLSTMT ="SELECT * FROM ":TableName:" SLIST 9"
ST=SQLExecDirect(#HSTMT, SQLSTMT)
RETURN
GET.MASTER.LIST:
STMT = "SSELECT ":TableName
IF UVWithClause NE "" THEN
STMT := " ":UVWithClause
END
EXECUTE "CLEARSELECT"
EXECUTE STMT
READLIST FULL.LIST ELSE FULL.LIST = ""
RETURN
END
Good luck, please only use this information for good!
OutOfMemoryError caused when db4o databse has 15000+ objects
My question is in reference to my previous question (above). For the same PostedMessage model and same query.
With 100,000 PostedMessage objects, the query takes about 1243 ms to return first 20 PostedMessages.
Now, I have saved 1,000,000 PostedMessage objects in db4o. The same query took 342,132 ms. Which is non-linearly high.
How can I optimize the query speed?
FYR:
The timeSent and timeReceived are Indexed fields.
I am using SNAPSHOT query mode.
I am not using TA/TP.
Do you sort the result? Unfortunatly db4o doesn't use the index for sorting / orderBy. That means it will run a regular sort algorith, with O(n*log(n)). It won't scala liniearly.
Also db4o doesn't support a TOP operator. That means even without sorting it takes quite a bit of time to copy the ids to the results set, even when you never read the entities afterwards.
So, there's no real good solution for this, except trying to use some criteria which cut down the result size.
Some adventerous people might use a different query evaluation, but personally don't recommend that.
#Gamlor No, I am not sorting at all. The code is as follows:
public static ObjectSet<PostedMessage> getMessagesBetweenDates(
Calendar after,
Calendar before,
ObjectContainer db) {
if (after == null || before == null || db == null) {
return null;
}
Query q = db.query(); //db is pre-configured to use SNAPSHOT mode.
q.constrain(PostedMessage.class);
Constraint from = q.descend("timeRecieved").constrain(new Long(after.getTimeInMillis())).greater().equal();
q.descend("timeRecieved").constrain(new Long(before.getTimeInMillis())).smaller().equal().and(from);
ObjectSet<EmailMessage> results = q.execute();
return results;
}
The arguments to this method are as follows:
after = 13-09-2011 10:55:55
before = 13-09-2011 10:56:10
And I expect only 10 PostedMessages to be returned between "after" and "before". (I am generating dummy PostedMessage with timeReceived incremented by 1 sec each.)
I'm using the lsqlite3 lua wrapper and I'm making queries into a database. My DB has ~5million rows and the code I'm using to retrieve rows is akin to:
db = lsqlite3.open('mydb')
local temp = {}
local sql = "SELECT A,B FROM tab where FOO=BAR ORDER BY A DESC LIMIT N"
for row in db:nrows(sql) do temp[row['key']] = row['col1'] end
As you can see I'm trying to get the top N rows sorted in descending order by FOO (I want to get the top rows and then apply the LIMIT not the other way around). I indexed the column A but it doesn't seem to make much of a difference. How can I make this faster?
You need to index the column on which you filter (i.e. with the WHERE clause). THe reason is that ORDER BY comes into play after filtering, not the other way around.
So you probably should create an index on FOO.
Can you post your table schema?
UPDATE
Also you can increase the sqlite cache, e.g.:
PRAGMA cache_size=100000
You can adjust this depending on the memory available and the size of your database.
UPDATE 2
I you want to have a better understanding of how your query is handled by sqlite, you can ask it to provide you with the query plan:
http://www.sqlite.org/eqp.html
UPDATE 3
I did not understand your context properly with my initial answer. If you are to ORDER BY on some large data set, you probably want to use that index, not the previous one, so you can tell sqlite to not use the index on FOO this way:
SELECT a, b FROM foo WHERE +a > 30 ORDER BY b
I figure one way to do a count is like this:
foo = db.GqlQuery("SELECT * FROM bar WHERE baz = 'baz')
my_count = foo.count()
What I don't like is my count will be limited to 1000 max and my query will probably be slow. Anyone out there with a workaround? I have one in mind, but it doesn't feel clean. If only GQL had a real COUNT Function...
You have to flip your thinking when working with a scalable datastore like GAE to do your calculations up front. In this case that means you need to keep counters for each baz and increment them whenever you add a new bar, instead of counting at the time of display.
class CategoryCounter(db.Model):
category = db.StringProperty()
count = db.IntegerProperty(default=0)
then when creating a Bar object, increment the counter
def createNewBar(category_name):
bar = Bar(...,baz=category_name)
counter = CategoryCounter.filter('category =',category_name).get()
if not counter:
counter = CategoryCounter(category=category_name)
else:
counter.count += 1
bar.put()
counter.put()
db.run_in_transaction(createNewBar,'asdf')
now you have an easy way to get the count for any specific category
CategoryCounter.filter('category =',category_name).get().count
+1 to Jehiah's response.
Official and blessed method on getting object counters on GAE is to build sharded counter. Despite heavily sounding name, this is pretty straightforward.
Count functions in all databases are slow (eg, O(n)) - the GAE datastore just makes that more obvious. As Jehiah suggests, you need to store the computed count in an entity and refer to that if you want scalability.
This isn't unique to App Engine - other databases just hide it better, up until the point where you're trying to count tens of thousands of records with each request, and your page render time starts to increase exponentially...
According to the GqlQuery.count() documentation, you can set the limit to be some number greater than 1000:
from models import Troll
troll_count = Troll.all(keys_only=True).count(limit=31337)
Sharded counters are the right way to keep track of numbers like this, as folks have said, but if you figure this out late in the game (like me) then you'll need to initialize the counters from an actual count of objects. But this is a great way to burn through your free quota of Datastore Small Operations (50,000 I think). Every time you run the code, it will use up as many ops as there are model objects.
I haven't tried it, and this is an utter resource hog, but perhaps iterating with .fetch() and specifying the offset would work?
LIMIT=1000
def count(query):
result = offset = 0
gql_query = db.GqlQuery(query)
while True:
count = gql_query.fetch(LIMIT, offset)
if count < LIMIT:
return result
result += count
offset += LIMIT
orip's solution works with a little tweaking:
LIMIT=1000
def count(query):
result = offset = 0
gql_query = db.GqlQuery(query)
while True:
count = len(gql_query.fetch(LIMIT, offset))
result += count
offset += LIMIT
if count < LIMIT:
return result
We now have Datastore Statistics that can be used to query entity counts and other data. These values do not always reflect the most recent changes as they are updated once every 24-48 hours. Check out the documentation (see link below) for more details:
Datastore Statistics
As pointed out by #Dimu, the stats computed by Google on a periodic basis are a decent go-to resource when precise counts are not needed and the % of records are NOT changing drastically during any given day.
To query the statistics for a given Kind, you can use the following GQL structure:
select * from __Stat_Kind__ where kind_name = 'Person'
There are a number of properties returned by this which are helpful:
count -- the number of Entities of this Kind
bytes -- total size of all Entities stored of this Kind
timestamp -- an as of date/time for when the stats were last computed
Example Code
To answer a follow-up question posted as a comment to my answer, I am now providing some sample C# code that I am using, which admittedly may not be as robust as it should be, but seems to work OK for me:
/// <summary>Returns an *estimated* number of entities of a given kind</summary>
public static long GetEstimatedEntityCount(this DatastoreDb database, string kind)
{
var query = new GqlQuery
{
QueryString = $"select * from __Stat_Kind__ where kind_name = '{kind}'",
AllowLiterals = true
};
var result = database.RunQuery(query);
return (long) (result?.Entities?[0]?["count"] ?? 0L);
}
The best workaround might seem a little counter-intuitive, but it works great in all my appengine apps. Rather than relying on the integer KEY and count() methods, you add an integer field of your own to the datatype. It might seem wasteful until you actually have more than 1000 records, and you suddenly discover that fetch() and limit() DO NOT WORK PAST THE 1000 RECORD BOUNDARY.
def MyObj(db.Model):
num = db.IntegerProperty()
When you create a new object, you must manually retrieve the highest key:
max = MyObj.all().order('-num').get()
if max : max = max.num+1
else : max = 0
newObj = MyObj(num = max)
newObj.put()
This may seem like a waste of a query, but get() returns a single record off the top of the index. It is very fast.
Then, when you want to fetch past the 1000th object limit, you simply do:
MyObj.all().filter('num > ' , 2345).fetch(67)
I had already done this when I read Aral Balkan's scathing review: http://aralbalkan.com/1504 . It's frustrating, but when you get used to it and you realize how much faster this is than count() on a relational db, you won't mind...