I have got into a strange situation. I want to know the count of Fetch entities daily, weekly, monthly and for all time. In the Datastore there are about 2,368,348 of them. Whenever I try to get the count, either through the Model or a GqlQuery, I get a 500 error. When there are fewer rows, the code below works fine.
Can anyone correct me or point me to the right solution, please? I am using Python.
The Model:
class Fetch(db.Model):
    adid = db.IntegerProperty()
    ip = db.StringProperty()
    date = db.DateProperty(auto_now_add=True)
Stats Codes:
adid = cgi.escape(self.request.get('adid'))
...
query = "SELECT __key__ FROM Fetch WHERE adid = " + adid + " AND date >= :1"
rows = db.GqlQuery( query, monthlyDate)
fetch_count = 0
for row in rows:
    fetch_count = fetch_count + 1
self.response.out.write(fetch_count)
It looks like your query is taking longer than GAE allows a request to run (around 60 seconds). From the count() documentation:
Unless the result count is expected to be small, it is best to specify a limit argument; otherwise the method will continue until it finishes counting or times out.
From the Request Timer documentation:
A request handler has a limited amount of time to generate and return a response to a request, typically around 60 seconds. Once the deadline has been reached, the request handler is interrupted.
If a DeadlineExceededError is being raised, this is your problem. If you need to run this query, consider using Backends in GAE. With Backends there is no time limit for generating and returning a response.
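If all you need is the count, a keys-only query fetched in batches (or count() with an explicit limit) at least keeps each Datastore call bounded. Below is a minimal sketch using the db API from the question; the batch size is illustrative, adid must be passed as an int to match the IntegerProperty, and for millions of entities the loop still has to run from a task or Backend (or, better yet, you maintain a running counter as Fetch entities are written) to stay inside the deadline.

import datetime
from google.appengine.ext import db

def count_fetches(adid, since_date, batch_size=1000):
    # Bind the parameters instead of concatenating strings into the GQL text.
    query = db.GqlQuery(
        "SELECT __key__ FROM Fetch WHERE adid = :1 AND date >= :2",
        adid, since_date)
    total = 0
    while True:
        keys = query.fetch(batch_size)
        total += len(keys)
        if len(keys) < batch_size:
            return total
        # Continue the same query where the previous batch ended.
        query.with_cursor(query.cursor())

# e.g. count_fetches(int(adid), datetime.date.today() - datetime.timedelta(days=30))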
Related
We have an API that queries an Influx database and a report functionality was implemented so the user can query data using a start and end date.
The problem is that when a longer period is chosen (usually more than 8 weeks) we get a timeout from Influx; the query takes around 13 seconds to run. When the query returns a dataset successfully, we store it in a cache.
The most time-consuming part of the query is probably the comparisons and averages we do, something like this:
SELECT mean("value") AS "mean", min("value") AS "min", max("value") AS "max"
FROM $MEASUREMENT
WHERE time >= $startDate AND time < $endDate
AND ("field" = 'myFieldValue' )
GROUP BY "tagname"
What would be the best approach to fix this? I can of course limit the number of weeks the user can choose, but I guess that's not the ideal fix.
How would you approach this? Increase timeout? Batch query? Any database optimization to be able to run this faster?
In such cases, where you allow the user to select in days, I would suggest having another table that stores the result (min, max and avg) of each day as a document. This table can be populated by a job that runs after the end of each day.
You can also change the granularity from one document per day to per week or per month, based on how you plot the values, and add more fields, like tagname and the other fields in your case.
The reason this is superior to using a cache: with a cache you only store the result of a specific query, so every different combination still has to be computed in real time. Here, the cumulative results are already available, and there is a much smaller dataset to compute over.
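A rough sketch of such a nightly aggregation job is below. It assumes InfluxDB 1.x and the influxdb Python client; the measurement name, the daily_stats target measurement and the connection details are placeholders, not anything from the original setup.

from datetime import datetime, timedelta
from influxdb import InfluxDBClient

client = InfluxDBClient(host='localhost', port=8086, database='db')

def aggregate_previous_day(measurement='my_measurement'):
    # Aggregate the previous full UTC day into one point per tag value.
    end = datetime.utcnow().replace(hour=0, minute=0, second=0, microsecond=0)
    start = end - timedelta(days=1)
    result = client.query(
        'SELECT mean("value") AS "mean", min("value") AS "min", max("value") AS "max" '
        'FROM "{m}" WHERE time >= \'{s}\' AND time < \'{e}\' GROUP BY "tagname"'.format(
            m=measurement, s=start.isoformat() + 'Z', e=end.isoformat() + 'Z'))
    points = []
    for (name, tags), rows in result.items():
        for row in rows:
            points.append({
                'measurement': 'daily_stats',      # pre-aggregated results live here
                'tags': tags or {},
                'time': start.isoformat() + 'Z',   # one point per day per tagname
                'fields': {'mean': row['mean'], 'min': row['min'], 'max': row['max']},
            })
    client.write_points(points)

The report endpoint then reads daily_stats between the user's start and end dates, which is a tiny fraction of the raw data.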
Based on your query, I assume you are using InfluxDB 1.x. You could try Continuous Queries, which are InfluxQL queries that run automatically and periodically on real-time data and store their results in a specified measurement.
In your case, for each report you could create a CQ and let your users query it.
e.g.:
Step 1: create a CQ
CREATE CONTINUOUS QUERY "cq_basic_rp" ON "db"
BEGIN
SELECT mean("value") AS "mean", min("value") AS "min", max("value") AS "max"
INTO "mean_min_max"
FROM $MEASUREMENT
WHERE "field" = 'myFieldValue' // note that the time filter is not here
GROUP BY time(1h), "tagname" // here you can define the job interval
END
Step 2: Query against that CQ
SELECT * FROM "mean_min_max"
WHERE time >= $startDate AND time < $endDate // here you can pass the user's time filter
Since you already ask InfluxDB to run these aggregates continuously based on the specified interval, you should be able to trade space for time.
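For completeness, reading the downsampled measurement from the application could look roughly like this, again assuming the influxdb Python client; the client settings and the start/end values coming from the user's report request are placeholders.

from influxdb import InfluxDBClient

client = InfluxDBClient(host='localhost', port=8086, database='db')

def report(start_iso, end_iso):
    # Query the CQ's target measurement instead of the raw data.
    result = client.query(
        'SELECT * FROM "mean_min_max" '
        "WHERE time >= '{start}' AND time < '{end}'".format(start=start_iso, end=end_iso))
    return list(result.get_points())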
EF Core version: 3.1.
Here's my method:
public static ILookup<string, int> GetClientCountLookup(DepotContext context, DateRange dateRange)
=> context
.Flows
.Where(e => e.TimeCreated >= dateRange.Start.Date && e.TimeCreated <= dateRange.End.Date)
.GroupBy(e => e.Customer)
.Select(g => new { g.Key, Count = g.Count() })
.ToLookup(k => k.Key, e => e.Count);
All fields used are indexed.
Here's generated query:
SELECT [f].[Customer] AS [Key], COUNT(*) AS [Count]
FROM [Flows] AS [f]
WHERE ([f].[TimeCreated] >= @__dateRange_Start_Date_0) AND ([f].[TimeCreated] <= @__dateRange_End_Date_1)
GROUP BY [f].[Customer]
When that query is executed as SQL, the execution time is 100ms.
When that query is used in code with ToLookup method - execution time is 3200ms.
What's even more weird - the execution time in EF Core seems totally independent from the data sample size (let's say, depending on date range we can count hundreds, or hundreds of thousands records).
WHAT THE HECK IS HAPPENING HERE?
The query I pasted is the real query EF Core sends.
The code fragment I pasted first is executed in 3200ms.
Then I took exact generated SQL and executed in as SQL query in Visual Studio - took 100ms.
It doesn't make any sense to me. I have used EF Core for a long time and it seems to perform reasonably.
Most queries (plain, simple, without date ranges) are fast, results are fetched immediately (in less than 200ms).
In my application I built a really HUGE query with like 4 multi-column joins and subqueries... Guess what - it fetches 400 rows in 3200ms. It also fetches 4000 rows in 3200ms. And when I remove most of the joins and includes, and even remove the subquery - still 3200ms. Or 4000, depending on my Internet connection or the server's momentary state and load.
It's like constant lag and I pinpointed it to the exact first query I pasted.
I know the ToLookup method forces the query to execute and fetch all of its results, but in my case (real-world data) there are exactly 5 rows.
The result looks like this:
|------------|-------|
| Key | Count |
|------------|-------|
| Customer 1 | 500 |
| Customer 2 | 50 |
| Customer 3 | 10 |
| Customer 4 | 5 |
| Customer 5 | 1 |
Fetching 5 rows from the database takes 4 seconds?! It's ridiculous. If the whole table were fetched, then the rows grouped and counted - that would add up. But the generated query returns literally 5 rows.
What is happening here and what am I missing?
Please, DO NOT ASK ME TO PROVIDE THE FULL CODE. It is confidential, part of a project for my client, and I am not allowed to disclose my client's trade secrets. Neither here nor in any other question. I know it's hard to understand what happens when you don't have my database and the whole application, but the question here is purely theoretical. Either you know what's going on, or you don't. As simple as that. The question is very hard though.
I can only say that the RDBMS is MS SQL Express running remotely on an Ubuntu server. The times measured are the times of executing either code tests (NUnit) or queries against the remote DB, all performed from my AMD Ryzen 7 (8 cores, 3.40 GHz). The server lives on Azure, on something like 2 cores of an i5 at 2.4 GHz.
Here's the test:
[Test]
public void Clients() {
    var dateRange = new DateRange {
        Start = new DateTime(2020, 04, 06),
        End = new DateTime(2020, 04, 11)
    };
    var q1 = DataContext.Flows;
    var q2 = DataContext.Flows
        .Where(e => e.TimeCreated >= dateRange.Start.Date && e.TimeCreated <= dateRange.End.Date)
        .GroupBy(e => e.Customer)
        .Select(g => new { g.Key, Count = g.Count() });
    var q3 = DataContext.Flows;
    var t0 = DateTime.Now;
    var x = q1.Any();
    var t1 = DateTime.Now - t0;
    t0 = DateTime.Now;
    var l = q2.ToLookup(g => g.Key, g => g.Count);
    var t2 = DateTime.Now - t0;
    t0 = DateTime.Now;
    var y = q3.Any();
    var t3 = DateTime.Now - t0;
    TestContext.Out.WriteLine($"t1 = {t1}");
    TestContext.Out.WriteLine($"t2 = {t2}");
    TestContext.Out.WriteLine($"t3 = {t3}");
}
Here's the test result:
t1 = 00:00:00.6217045 // the time of dummy query
t2 = 00:00:00.1471722 // the time of grouping query
t3 = 00:00:00.0382940 // the time of another dummy query
Yep: 147ms is my grouping that took 3200ms previously.
What happened? A dummy query was executed before.
That explains why the results hardly depended on data sample size!
The huge, unexplainable time is INITIALIZATION, not the actual query time. I mean, if not for the dummy query before it, the whole time would have elapsed on the ToLookup line of code: that line would initialize the DbContext, create the connection to the database and only then perform the actual query and fetch the data.
So as the final answer I can say my test methodology was wrong. I measured the time of the first query made through my DbContext. That is wrong; the context should be warmed up before the times are measured, which I can do by performing any query before the measured ones.
Well, another question appears: why is THE FIRST query so slow, why is initialization so slow? If my Blazor app used the DbContext as Transient (instantiated each time it is injected), would it take that much time on every request? I don't think so, because that's how my application worked before a major redesign, and it didn't have noticeable lags (I would notice a 3-second lag when changing between pages). But I'm not sure. Now my application uses a scoped DbContext, so there is one per user session, and I won't see the initialization overhead at all; so measuring the time after a dummy query seems to be the accurate method.
I need an access-statistics module for App Engine that tracks a few request handlers and collects statistics to bigtable. I have not found any ready-made solution on GitHub, and Google's examples are either oversimplified (memcached front-page counter with cron) or overkill (accurate sharded counter). Most importantly, no App Engine counter solution discussed elsewhere includes a time component (hourly, daily counts), which is needed for statistics.
Requirements: the system does not need to be 100% accurate and can just ignore memcache loss (if infrequent). This should simplify things considerably. The idea is to just use memcache and accumulate stats in time intervals.
Use case: users on your system create content (e.g. pages). You want to track approximately how often a user's pages are viewed per hour or day. Some pages are viewed often, some never. You want to query by user and timeframe. Subpages may have fixed IDs (query for the user with the most hits on their homepage). You may want to delete old entries (query for entries of year=xxxx).
class StatisticsDB(ndb.Model):
    # key.id() = something like YYYY-MM-DD-HH_groupId_countableId ... contains the date
    # timeframeId = ndb.StringProperty()  # YYYY-MM-DD-HH, needed for cleanup if the counter uses ancestors
    countableId = ndb.StringProperty(required=True)  # name of counter within group
    groupId = ndb.StringProperty()  # counter group (allows single DB query with timeframe prefix inequality)
    count = ndb.IntegerProperty()  # count per specified timeframe

    @classmethod
    def increment(cls, groupId, countableId):
        # increment memcache
        # save hourly to DB (see below)
Note: indexes on groupId and countableId are necessary to avoid two inequalities in one query (query all countables of a groupId/userId, and for chart/high-count queries find the countableId with the highest counts and derive the groupId/user from it); using ancestors in the DB may not support the chart queries.
The problem is how best to save the memcached counter to the DB:
cron: this approach is mentioned in the example docs (the front-page counter example), but it uses fixed counter IDs that are hard-coded in the cron handler. As there is no prefix query for existing memcache keys, determining which counter IDs were created in memcache during the last time interval and need to be saved is probably the bottleneck.
task queue: if a counter is created, schedule a task to collect it and write it to the DB. Cost: 1 task-queue entry per used counter, and one ndb.put per time granularity (e.g. 1 hour) when the queue handler saves the data. This seems the most promising approach for also capturing infrequent events accurately.
infrequently, when increment(id) executes: if a new timeframe starts, save the previous one. This needs at least 2 memcache accesses per increment (get date, incr counter): one for tracking the timeframe and one for the counter. Disadvantage: bursty counters with longer stale periods may lose the cached value.
infrequently, when increment(id) executes, probabilistically: if random % 100 == 0 then save to the DB; but the counter should have uniformly distributed counting events.
infrequently, when increment(id) executes: if the counter reaches e.g. 100, save it to the DB.
Did anyone solve this problem? What would be a good way to design this?
What are the weaknesses and strengths of each approach?
Are there alternate approaches that are missing here?
Assumptions: counting can be slightly inaccurate (cache loss), the counter-ID space is large, and counter IDs are incremented sparsely (some once per day, some many times per day).
Update: 1) I think cron can be used similarly to the task queue. One only has to create the DB model of the counter with memcached=True and run a query in cron for all counters marked that way. Cost: 1 put at the first increment, a query at cron time, and 1 put to update the counter. Without thinking it through fully, this appears slightly more costly/complex than the task approach.
Discussed elsewhere:
High concurrency non-sharded counters - no count per timeframe
Open Source GAE Fast Counters - no count per timeframe, nice performance comparison to sharded solution, expected losses due to memcache loss reported
Yep, your #2 idea seems to best address your requirements.
To implement it you need a task execution with a specified delay.
I used the deferred library for this purpose, via deferred.defer()'s countdown argument. I learned in the meantime that the standard taskqueue library has similar support, by specifying the countdown argument for a Task constructor (I have yet to use this approach, though).
So whenever you create a memcache counter, also enqueue a delayed-execution task (passing the counter's memcache key in its payload) which will:
get the memcache counter value using the key from the task payload
add the value to the corresponding db counter
delete the memcache counter when the db update is successful
You'll probably lose the increments from concurrent requests that happen between the moment the memcache counter is read in the task execution and the moment it is deleted. You could reduce that loss by deleting the memcache counter immediately after reading it, but then you'd risk losing the entire count if the DB update fails for whatever reason, since re-trying the task would no longer find the memcache counter. If neither of these is satisfactory you could refine the solution further:
The delayed task:
reads the memcache counter value
enqueues another (transactional) task (with no delay) for adding the value to the db counter
deletes the memcache counter
The non-delayed task is now idempotent and can be safely re-tried until successful.
The risk of loss of increments from concurrent requests still exists, but I guess it's smaller.
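A compressed sketch of that refined flow, assuming the Python 2 App Engine standard runtime with memcache, ndb and the deferred library; the model, key layout and one-hour delay are illustrative, and the DB write stores the absolute value under the timeframe-scoped key so that retries stay harmless (as in the full implementation further below).

from google.appengine.api import memcache
from google.appengine.ext import deferred, ndb

FLUSH_DELAY_SECONDS = 3600  # roughly one counting timeframe

class Counter(ndb.Model):  # illustrative model, keyed by the timeframe-scoped counter key
    count = ndb.IntegerProperty(default=0)

def increment(counter_key):
    if memcache.incr(counter_key) is None:
        # First hit in this timeframe: create the counter and schedule its flush.
        memcache.incr(counter_key, initial_value=0)
        deferred.defer(read_and_forward, counter_key, _countdown=FLUSH_DELAY_SECONDS)

def read_and_forward(counter_key):
    value = memcache.get(counter_key)
    if value is None:
        return
    # Hand the value to a separate, retry-safe task, then drop the memcache counter.
    # Increments arriving between these two lines can be lost (the accepted inaccuracy).
    deferred.defer(write_to_db, counter_key, value)
    memcache.delete(counter_key)

def write_to_db(counter_key, value):
    # Writing the absolute count under the timeframe-scoped key keeps this safe to re-run.
    Counter(id=counter_key, count=value).put()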
Update:
The Task Queues are preferable to the deferred library; the deferred functionality is available using the optional countdown or eta arguments to taskqueue.add():
countdown -- Time in seconds into the future that this task should run or be leased. Defaults to zero. Do not specify this argument if you specified an eta.
eta -- A datetime.datetime that specifies the absolute earliest time at which the task should run. You cannot specify this argument if the countdown argument is specified. This argument can be time zone-aware or time zone-naive, or set to a time in the past. If the argument is set to None, the default value is now. For pull tasks, no worker can lease the task before the time indicated by the eta argument.
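A minimal illustration of those two options; the URL and parameter names are made up for the example, and both calls go to the default queue.

import datetime
from google.appengine.api import taskqueue

# run the flush handler roughly one hour from now
taskqueue.add(url='/tasks/flush-counter',
              params={'counter_key': counter_key},
              countdown=3600)

# or, equivalently, at an absolute time
taskqueue.add(url='/tasks/flush-counter',
              params={'counter_key': counter_key},
              eta=datetime.datetime.utcnow() + datetime.timedelta(hours=1))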
Counting things in a distributed system is a hard problem. There's some good info on the problem from the early days of App Engine. I'd start with Sharding Counters, which, despite being written in 2008, is still relevant.
Here is the code for the implementation of the task-queue approach, with an hourly timeframe. Interestingly, it works without transactions or other mutex magic.
Supporting priorities for memcache would increase accuracy of this solution.
import datetime
import logging

from google.appengine.api import memcache, taskqueue
from google.appengine.ext import ndb

TASK_URL = '/h/statistics/collect/'  # e.g. '/h/statistics/collect/{counter-key}?groupId=...&countableId=...'
MEMCACHE_PREFIX = "StatisticsDB_"

class StatisticsDB(ndb.Model):
    """
    Memcached counting, saved each hour to the DB.
    """
    # key.id() = 2016-01-31-17_groupId_countableId
    countableId = ndb.StringProperty(required=True)  # unique name of counter within group
    groupId = ndb.StringProperty()  # counter group (allows single DB query for a group of counters)
    count = ndb.IntegerProperty(default=0)  # count per timeframe

    @classmethod
    def increment(cls, groupId, countableId):  # throws InvalidTaskNameError
        """
        Increment a counter. countableId is the unique id of the countable.
        Throws InvalidTaskNameError if ids do not match [a-zA-Z0-9-_]{1,500}.
        """
        # Calculate the memcache key and db key at this time.
        # The counting timeframe is 1h, determined by %H; MUST MATCH the ETA calculation in _add_task()
        counter_key = datetime.datetime.utcnow().strftime("%Y-%m-%d-%H") + "_" + groupId + "_" + countableId
        client = memcache.Client()
        n = client.incr(MEMCACHE_PREFIX + counter_key)
        if n is None:
            cls._add_task(counter_key, groupId, countableId)
            client.incr(MEMCACHE_PREFIX + counter_key, initial_value=0)

    @classmethod
    def _add_task(cls, counter_key, groupId, countableId):
        taskurl = TASK_URL + counter_key + "?groupId=" + groupId + "&countableId=" + countableId
        now = datetime.datetime.utcnow()
        # the counting timeframe is 1h, determined by counter_key; MUST MATCH the ETA calculation
        eta = now + datetime.timedelta(minutes=(61 - now.minute))  # at most 1h later, randomized over 1 minute, throttled by queue parameters
        task = taskqueue.Task(url=taskurl, method='GET', name=MEMCACHE_PREFIX + counter_key, eta=eta)
        queue = taskqueue.Queue(name='StatisticsDB')
        try:
            queue.add(task)
        except taskqueue.TaskAlreadyExistsError:  # may also occur if 2 increments are done simultaneously
            logging.warning("StatisticsDB TaskAlreadyExistsError lost memcache for %s", counter_key)
        except taskqueue.TombstonedTaskError:  # task name is locked for a while
            logging.warning("StatisticsDB TombstonedTaskError some bad guy ran this task prematurely, manually: %s", counter_key)

    @classmethod
    def save2db_task_handler(cls, counter_key, countableId, groupId):
        """
        Save the counter from memcache to the DB. Idempotent method.
        At the time this executes no more increments to this counter occur.
        """
        dbkey = ndb.Key(StatisticsDB, counter_key)
        n = memcache.get(MEMCACHE_PREFIX + counter_key)
        if n is None:
            logging.warning("StatisticsDB lost count for %s", counter_key)
            return
        stats = StatisticsDB(key=dbkey, count=n, countableId=countableId, groupId=groupId)
        stats.put()
        memcache.delete(MEMCACHE_PREFIX + counter_key)  # delete only if the put succeeded
        logging.info("StatisticsDB saved %s n = %i", counter_key, n)
Here is the situation:
I have a model like:
class Content(ndb.Model):
    likeCount = ndb.IntegerProperty(default=0)
    likeUser = ndb.KeyProperty(kind=User, repeated=True)
When new content is generated, a new Content object is created like this:
content_obj_key = Content(parent=objContentData.key,  # where ContentData is another ndb.Model subclass
                          likeUser=[],
                          likeCount=0
                          ).put()
And when any user likes that content, the function below gets called:
def modify_like(contentData_key, user_key):
    like_obj = Content.query(ancestor=contentData_key).get()
    if like_obj:
        like_obj.likeUser.append(user_key)
        like_obj.likeCount += 1
        like_obj.put()
Problem:
Now the problem is that when more than 4 users like the same content at the same time, the object gets written with wrong data.
I mean, let's say userA, userB, userC and userD like this content at the same time, and currently only userE has liked it.
After all four new writes, likeCount is not 5 but always less than 5, and the likeUser list length is also less than 5.
So how can I solve this problem, so that the data always remains consistent?
It may be that some of the updates are stepping on each other, since several users may be incrementing the same count value at the same time.
If userA and userB get the Content object at the same time, both have the same count value (likeCount=1). Then both increment it to a value of 2, when the total should be 3.
One possible solution is to use sharding. This is useful when entities in your application may receive a lot of writes. The count is the total across all shards for that entity. Example code from the documentation:
import random
from google.appengine.ext import ndb

NUM_SHARDS = 5

class SimpleCounterShard(ndb.Model):
    """Shards for the counter"""
    count = ndb.IntegerProperty(default=0)

def get_count():
    """Retrieve the value for a given sharded counter.

    Returns:
        Integer; the cumulative count of all sharded counters.
    """
    total = 0
    for counter in SimpleCounterShard.query():
        total += counter.count
    return total

@ndb.transactional
def increment():
    """Increment the value for a given sharded counter."""
    shard_string_index = str(random.randint(0, NUM_SHARDS - 1))
    counter = SimpleCounterShard.get_by_id(shard_string_index)
    if counter is None:
        counter = SimpleCounterShard(id=shard_string_index)
    counter.count += 1
    counter.put()
More info and examples on sharding counters can be found at:
https://cloud.google.com/appengine/articles/sharding_counters
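The documented example above is a single global counter; for per-content like counts you could key the shards under each Content entity, roughly like this (a sketch covering only the count, not the likeUser list; the names and NUM_SHARDS are illustrative):

import random
from google.appengine.ext import ndb

NUM_SHARDS = 5

class LikeCounterShard(ndb.Model):
    count = ndb.IntegerProperty(default=0)

def get_like_count(content_key):
    # Sum the shards that live under this content's entity group.
    return sum(shard.count for shard in LikeCounterShard.query(ancestor=content_key))

@ndb.transactional
def increment_like(content_key):
    shard_id = str(random.randint(0, NUM_SHARDS - 1))
    shard = LikeCounterShard.get_by_id(shard_id, parent=content_key)
    if shard is None:
        shard = LikeCounterShard(id=shard_id, parent=content_key)
    shard.count += 1
    shard.put()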
Use transactions:
@ndb.transactional(retries=10)
def modify_like(contentData_key, user_key):
    ...
A transaction is an operation or set of operations that are guaranteed to be atomic, which means that transactions are never partially applied. Either all of the operations in the transaction are applied, or none of them are applied. Transactions have a maximum duration of 60 seconds with a 10 second idle expiration time after 30 seconds.
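Applied to the function from the question, that could look roughly like this (a sketch; it assumes, as in the original code, that contentData_key is the parent of the Content entity, and the retries value is illustrative):

from google.appengine.ext import ndb

@ndb.transactional(retries=10)
def modify_like(contentData_key, user_key):
    # Ancestor queries are allowed inside a transaction.
    like_obj = Content.query(ancestor=contentData_key).get()
    if like_obj:
        like_obj.likeUser.append(user_key)
        like_obj.likeCount += 1
        like_obj.put()

Keep in mind that all of these writes land in one entity group, and an entity group only sustains roughly one write per second, so under heavy like traffic the transactions will retry and some may eventually fail; sharding (as in the other answer) trades that contention away.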
Imagine I add a new user in the datastore. I have to add 200 rows for him (they just contain zeros), but it might take 40 seconds. The real user who has registered for my website has to wait all this time before he can proceed. In MySQL it takes a fraction of a second. What do you suggest?
Consider this code. It takes 10 seconds on the Google servers, which is still too slow.
def get(self):
    class Movie(ndb.Model):
        title = ndb.StringProperty(required=True)
        rating = ndb.IntegerProperty(required=True)

        @classmethod
        def populate(cls, n):
            for i in range(n):
                o = cls(title='foo', rating=5)
                o.put()

    t1 = datetime.datetime.now()
    Movie.populate(200)
    t2 = datetime.datetime.now()
    self.response.write(t2 - t1)  # ~10 seconds
As noted in the comment: instead of saving the entities one by one, create a list of entities and save them with a single multi-put.
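A minimal sketch of that change to the populate method from the question:

from google.appengine.ext import ndb

class Movie(ndb.Model):
    title = ndb.StringProperty(required=True)
    rating = ndb.IntegerProperty(required=True)

    @classmethod
    def populate(cls, n):
        # Build all entities in memory, then write them in one batched RPC
        # instead of n sequential put() round trips.
        movies = [cls(title='foo', rating=5) for _ in range(n)]
        ndb.put_multi(movies)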
Frankly, I would suggest using a more sensible data model. There's no reason at all to create a model with 200 fields: not only will the initial setup take ages, but loading each instance will be expensive, and saving it exceedingly so.
In any case, you almost certainly don't need to instantiate all the fields from the start.
(Also, I must say that even with 200 fields, taking 40 seconds to save seems extremely unlikely. You are probably doing something strange, but without seeing any code it's impossible to tell.)