How to solve an aggregate query with a condition in Google App Engine

Suppose, in my app, I ask users to input some string. A user can input a string multiple times. Whenever any user inputs a string, I log it in the database along with the day. Many strings can be the same, even when entered by different users. On the home page, I need to provide an interface where any user can query for the top n (say 50) strings in any time period (say the last 45 days, or 10 Jan 2012 to 30 Jan 2012). If it were SQL, I could have written a query like:
SELECT string, COUNT(*)
FROM userStrings
WHERE day >= d1 AND day <= d2
GROUP BY string
ORDER BY COUNT(*) DESC
LIMIT n
For each user query, I can't process the records at query time - there can be millions of them. If the time-period constraint were not there, I could have done something like this: create a UserString class and maintain a unique object of it for each distinct user string, retrieve the corresponding object for the user's input, and increment its count. [Even with this approach, I assume the datastore would have to process all UserString objects (~100,000) and return me the top n - so it can itself be a very heavy query.]
I am using JDO. My obvious goal is to minimize the App Engine cost: CPU + data.
Thanks,

You can use the App Engine Task Queue to process the strings offline. If you need real-time answers, you can use memcache to keep a temporary count of how many times each string has been entered during the day and push the heavier processing into the background.
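Expressed in the question's own SQL framing, the offline job would maintain a per-day rollup so the top-n query never scans the raw log. This is only a sketch; the table, column, and parameter names are hypothetical:
-- Hypothetical rollup maintained by the offline task: one row per (string, day).
CREATE TABLE daily_string_counts (
    string VARCHAR(500),
    day    DATE,
    cnt    INT,
    PRIMARY KEY (string, day)
);
-- The top-n query then reads one row per distinct string per day in the range.
SELECT string, SUM(cnt) AS total
FROM daily_string_counts
WHERE day >= :d1 AND day <= :d2
GROUP BY string
ORDER BY total DESC
LIMIT :n;
The same shape carries over to the datastore: one counter entity per (string, day), updated by the task queue worker.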

Related

Best way to handle time consuming queries in InfluxDB

We have an API that queries an Influx database, and a report feature was implemented so the user can query data using a start and end date.
The problem is that when a longer period is chosen (usually more than 8 weeks), we get a timeout from Influx; the query takes around 13 seconds to run. When the query returns a dataset successfully, we store it in the cache.
The most time-consuming part of the query is probably the comparisons and averages we do, something like this:
SELECT mean("value") AS "mean", min("value") AS "min", max("value") AS "max"
FROM $MEASUREMENT
WHERE time >= $startDate AND time < $endDate
AND ("field" = 'myFieldValue' )
GROUP BY "tagname"
What would be the best approach to fix this? I can of course limit the number of weeks the user can choose, but I guess that's not the ideal fix.
How would you approach this? Increase the timeout? Batch the query? Any database optimization to be able to run this faster?
In cases like this, where you allow the user to select by day, I would suggest having another table that stores the result (min, max, and avg) of each day as a document. This table can be populated by a job that runs after the end of each day.
You can also think about changing the granularity from one document per day to per week or per month, based on how you plot the values. You can also add more fields, such as tagname and the others in your case.
The reason this is superior to using a cache: a cache can only store the result of each exact query, so every different combination still has to be computed in real time. With pre-aggregation, the cumulative results are already available, and there is a much smaller dataset to compute over.
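A minimal relational sketch of that idea (table, column, and job parameter names are hypothetical; in InfluxDB itself this would be a rollup measurement rather than a table):
-- One pre-aggregated row per (day, tagname), written by an end-of-day job.
CREATE TABLE daily_stats (
    day       DATE,
    tagname   VARCHAR(64),
    min_value DOUBLE PRECISION,
    max_value DOUBLE PRECISION,
    avg_value DOUBLE PRECISION,
    PRIMARY KEY (day, tagname)
);
-- End-of-day job: fold yesterday's raw rows into one row per tag.
INSERT INTO daily_stats (day, tagname, min_value, max_value, avg_value)
SELECT CAST(time AS DATE), tagname, MIN(value), MAX(value), AVG(value)
FROM measurements
WHERE time >= :yesterday AND time < :today
GROUP BY CAST(time AS DATE), tagname;
Report queries then read at most one row per tag per day in the selected range, so even a multi-year window stays small.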
Based on your query, I assume you are using InfluxDB 1.x. You could try Continuous Queries, which are InfluxQL queries that run automatically and periodically on real-time data and store the query results in a specified measurement.
In your case, for each report, you could create a CQ and let your users query it.
e.g.:
Step 1: create a CQ
CREATE CONTINUOUS QUERY "cq_basic_rp" ON "db"
BEGIN
SELECT mean("value") AS "mean", min("value") AS "min", max("value") AS "max"
INTO "mean_min_max"
FROM $MEASUREMENT
WHERE "field" = 'myFieldValue' // note that the time filter is not here
GROUP BY time(1h), "tagname" // here you can define the job interval
END
Step 2: Query against that CQ
SELECT * FROM "mean_min_max"
WHERE time >= $startDate AND time < $endDate // here you can pass the user's time filter
Since you ask InfluxDB to run these aggregations continuously at the specified interval, you are trading space for time.

Flink CEP sql restrict output

I have a use case where I have 2 input topics in Kafka.
Topic schema:
eventName, userId, ingestion_time (will be used as the watermark), orderType, orderCountry
Data for first topic:
{"eventName": "orderCreated", "userId":123, "ingestionTime": "1665042169543", "orderType":"ecommerce","orderCountry": "UK"}
Data for second topic:
{"eventName": "orderSucess", "userId":123, "ingestionTime": "1665042189543", "orderType":"ecommerce","orderCountry": "USA"}
I want to get all the userIds per (orderType, orderCountry) where the user performs the first event but not the second one within a window of 5 minutes, with a maximum of 2 such alerts per user per orderType and orderCountry (i.e. up to 10 minutes only).
I have unioned the data from both topics and created a view on top of it, and I am trying to use Flink CEP SQL to get my output, but somehow I am not able to figure it out.
SELECT *
FROM union_event_table
MATCH_RECOGNIZE(
PARTITION BY orderType,orderCountry
ORDER BY ingestion_time
MEASURES
A.userId AS userId,
A.orderType AS orderType,
A.orderCountry AS orderCountry
ONE ROW PER MATCH
PATTERN (A not followed B) WITHIN INTERVAL '5' MINUTES
DEFINE
A As A.eventName = 'orderCreated'
B AS B.eventName = 'orderSuccess'
)
The first thing I am not able to figure out is what to use in place of A not followed B in SQL. The other thing is how I can restrict the output for a userId to a maximum of 2 events per orderType and orderCountry: if a user doesn't perform the 2nd event after the 1st event in 2 consecutive 5-minute windows, the state of that user should be removed, so that I will not get output for that user for the same orderType and orderCountry again.
I don't believe this is possible using MATCH_RECOGNIZE. It could, however, be implemented with the DataStream CEP library by using its ability to send timed-out patterns to a side output.
This could also be solved at a lower level with a KeyedProcessFunction. The long ride alerts exercise from the Apache Flink Training repo is an example of that -- you can jump straight to the solution if you want.

SQL Server full text search max performance limits within time window

Let's say we have a table with 100 million records, which are sales transactions from multiple sellers. Each record has around 14 columns:
TABLE SellerTransactions
string SellerId,
string ProductId,
DateTime CreateDate,
string BankNumber,
string Name (name + ' ' + surname + ' ' + alias),
string Comments,
decimal Amount
etc...
Each year we will add around 60 million new records, and the record count will increase by roughly 10% yearly.
Search will be done by seller ID, then by product ID or product IDs (for multiple products in a time period for that seller).
Every search will be filtered by time, usually a 1-week period (mostly the last week), but worst-case scenarios are also possible where we search within all the data we have: years.
A search should also be possible by bank number, or a full-text search by name or part of it.
SELECT *
FROM SellerTransactions
WHERE SellerId = 'Seller1Guid'
AND ProductId IN ('ProductGuid1', 'ProductGuid2')
AND CreateDate <= @CurrentDate AND CreateDate >= DATEADD(day, -7, @CurrentDate)
AND CONTAINS(Name, '"BoB Skynet"')
ORDER BY CreateDate
OFFSET 20 ROWS FETCH NEXT 20 ROWS ONLY
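(The CONTAINS predicates assume a full-text index on the searched columns; a minimal setup sketch, where the catalog name, key-index name, and secondary index are hypothetical:)
-- Full-text index over the searched columns; KEY INDEX must name a unique,
-- single-column index, typically the primary key.
CREATE FULLTEXT CATALOG SellerFtCatalog;
CREATE FULLTEXT INDEX ON SellerTransactions (Name, Comments)
    KEY INDEX PK_SellerTransactions
    ON SellerFtCatalog;
-- A covering B-tree index for the seller/date filter narrows the candidate
-- rows before the full-text match, the sort, and the paging.
CREATE INDEX IX_Seller_CreateDate
    ON SellerTransactions (SellerId, CreateDate)
    INCLUDE (ProductId);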
So, regarding search time for these scenarios:
After filtering by IDs & time range, a search over up to 10k records
After filtering by IDs & time range, a search over up to 100k records
After filtering by IDs & time range, a search over up to 1 million records
After filtering by IDs & time range, a search over 1 to 10 million records
In these cases, would it be 0.5s? 1s? Up to 20s?
Also, how would the search performance change if we added search on the Comments column as well as Name?
CONTAINS(Name, '"BoB Skynet"') OR CONTAINS(Comments, 'online')
In most cases the search will run over small record counts; only in very rare cases would we go through millions of rows. But how much time would that take?
When would it be a good idea to move to Elasticsearch, for example?
We are storing a fair amount of data, but the request count for this data is usually small.

strange appengine query result

What am I doing wrong in this query?
SELECT * FROM TreatmentPlanDetails
WHERE
accountId = 'ag5zfmRvbW9kZW50d2ViMnIRCxIIQWNjb3VudHMYtcjdAQw' AND
status = 'done' AND
category = 'chirurgia orale' AND
setDoneCalendarEventStartTimestamp >= [timestamp for 6 june 2012] AND
setDoneCalendarEventStartTimestamp <= [timestamp for 11 june 2012] AND
deleteStatus = 'notDeleted'
ORDER BY setDoneCalendarEventStartTimestamp ASC
I am not getting any records, and I am sure there are records meeting the WHERE clause conditions. To get the correct records I have to widen the timestamp interval by 1 millisecond. Is that normal? Furthermore, if I modify this query by removing the category filter, I get the correct results. This is definitely weird.
I also asked on google groups, but I got no answer. Anyway, for details:
https://groups.google.com/forum/?fromgroups#!searchin/google-appengine/query/google-appengine/ixPIvmhCS3g/d4OP91yTkrEJ
Let's talk specifically about creating the timestamps that go into the query. What code are you using to create the timestamp record? Apparently that's important, because fuzzing with it a little bit affects the query. It may be relevant that in the datastore, timestamps are recorded as integers representing POSIX timestamps with microseconds, i.e. the number of microseconds since 1/1/1970 UTC (not counting leap seconds). It's also relevant that dates (i.e. without a time) are represented as midnight, the earliest time on that day. But please show us the exact code. (It may also be important to show the actual content of the record that you're attempting to retrieve.)
An aside that is not specific to your question: Entity property names count as part of your storage quota. If this is going to be a huge dataset, you might pay more $$ than you'd like for property names like setDoneCalendarEventStartTimestamp.
Because you write:
"if I modify this query by removing the category filter, I am getting the correct results"
this probably means that category was not indexed at the time you wrote the matching records to the datastore. You have to re-write your records to the datastore if you want them added to the newly created index.

Saving duration info to SQL Server

I have a WPF app where one of the fields is a numeric input box for the length of a phone call, called ActivityDuration.
Previously this has been saved as an integer value that represents minutes. However, the client now wishes to record meetings in the same table, but meetings can last 4-5 hours, so entering 240 minutes doesn't seem very user friendly.
I'm currently considering my options: whether to change ActivityDuration to a time value in SQL Server 2008 and try to use a time-mask input box, or keep it as an integer and present the client with 2 numeric input boxes, one for hours and one for minutes, and then do the calculation to save it as integer minutes.
I'm open to comments and suggestions. One further consideration is that I will need to calculate total time based on ActivityDuration, so the field's data type should allow it to be summed easily.
The new time datatype only supports 24 hours, so if you need more you'll have to use datetime.
So if you sum 7 x 4-hour meetings, you'll get "4 hours" back.
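A quick T-SQL illustration of that wraparound (a sketch; SQL Server 2008+):
-- Adding 28 hours (7 x 4-hour meetings) to a midnight time value wraps
-- past midnight and yields 04:00:00 rather than 28 hours.
SELECT DATEADD(hour, 28, CAST('00:00:00' AS time)) AS WrappedTotal;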
How the DB stores it is also different from how you present and capture the data.
Why not display and capture as hh:nn, convert in the client, and store as datetime?
Track the start and end time; there's no need to mask out the date, since the duration is just a calculation over the two dates. You can even do this in "sessions", so that one meeting can have multiple sessions (e.g. one meeting that spans lunch, where the break shouldn't count toward the duration).
The data type, then, is either datetime or smalldatetime.
Then getting the "total duration" is just a query using:
SELECT SUM(DATEDIFF(minute, StartDate, EndDate)) FROM Sessions WHERE MeetingID = 1
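Since the sum is a plain integer count of minutes, it also avoids the 24-hour ceiling and is easy to present as hours and minutes; a small follow-up sketch using the same placeholder names as above:
-- Split the total into whole hours and leftover minutes.
SELECT SUM(DATEDIFF(minute, StartDate, EndDate)) / 60 AS TotalHours,
       SUM(DATEDIFF(minute, StartDate, EndDate)) % 60 AS RemainingMinutes
FROM Sessions
WHERE MeetingID = 1;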
