My GAE app will request weekly data from Google Analytics, such as:
the number of visitors during the last week
the number of visitors to a particular page during the last week
etc.
Then I would like to show this data on my GAE web page with Google Charts. The data will be shown for the last X weeks (say, 10 weeks).
What is the best approach to store this data (the number of metrics multiplied by the number of weeks)? Old data could be deleted.
I don't think I should use a datastore model like:
class Visitors(ndb.Model):
    week1 = ndb.IntegerProperty(default=0)  # should store week start and end dates also
    week2 = ndb.IntegerProperty(default=0)
    ...
Probably, it would be better to store data like:
class Analytics(ndb.Model):
    visitors = ndb.StringProperty(default='')  # comma-separated values like '1000,1001,1002'; last value is the previous week
    page_visitors = ndb.IntegerProperty(repeated=True)  # [1000, 1001, 1002]; a repeated property cannot take a default
    ...
What are you trying to optimize?
With this amount of data, you will pay pennies, or less, for data storage. You are well within the free quota on datastore reads and writes. Performance-wise, the difference is negligible.
I would recommend going with the most straightforward solution: each week is a new entity, each data point is in its own property.
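To make that concrete, here is a minimal sketch of such a model (the WeeklyAnalytics name and its properties are just illustrative assumptions):

from google.appengine.ext import ndb

class WeeklyAnalytics(ndb.Model):
    # One entity per week; each metric gets its own property.
    week_start = ndb.DateProperty()
    week_end = ndb.DateProperty()
    visitors = ndb.IntegerProperty(default=0)
    page_visitors = ndb.IntegerProperty(default=0)

# Fetch the last 10 weeks for charting, newest first.
recent = (WeeklyAnalytics.query()
          .order(-WeeklyAnalytics.week_start)
          .fetch(10))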
I have a sheet "RawCount" with Google Form results that will accumulate over time (people will make entries each week or as their raw number changes).
I need a way to compile the data to obtain the most recent entry for each individual who has entered data via the form.
This data will accumulate with new entries over a period of eight months, from up to 100 or more different people.
I would like to sum the most recent entries for each individual onto another tab in the same Google Sheet that contains a scorecard.
Thanks for any help you can offer. I think I've sprained my brain on this.
I am looking for some help as to the best way to structure data in App Engine ndb using Python, process it, and query it later. I want to store temperature data at hourly intervals for different geographical regions.
I can think of two entity options, but there may be something much better. The first would be to store the hourly temperature in individual properties:
class TempData(ndb.Model):
    region = ndb.StringProperty()
    date = ndb.DateProperty()
    # Python identifiers cannot look like '00:00', so the hourly
    # columns would need names such as hour_00 ... hour_23.
    hour_00 = ndb.FloatProperty()
    hour_01 = ndb.FloatProperty()
    ...
    hour_23 = ndb.FloatProperty()
Or I could store the data as one entity per hourly reading:
class TempData(ndb.Model):
    region = ndb.StringProperty()
    date = ndb.DateProperty()
    time = ndb.TimeProperty()
    temp = ndb.FloatProperty()
(it might be better to store date and time as one property?)
I want to be able to query the datastore to calculate the total, max, min, and average temperature for any given date range. In the first option I could potentially create four more properties to effectively pre-process and store the total, max, etc. for each day, so that if I wanted the total temperature for a year I would only have to sum 365 values as opposed to 8,760. I'm not sure how I would do this with the second option.
I am relatively new to App Engine and the datastore, and I think I am still thinking in terms of relational DBs, so any help would really be appreciated. Later on it might be necessary to store data in different time zones.
Thanks
Paul
Personally, I'd go with a variant of the first approach:
class TempData(ndb.Model):
    region = ndb.StringProperty()
    date = ndb.DateProperty()
    temp = ndb.FloatProperty(repeated=True)
using the temp list to store temperatures by hour in order as you learn about them. I don't think the preprocessing per-date will add anything much: to compute whatever for a year, you'd still need to fetch 365 entities, and the delay for that will swamp the tiny amount of time required to sum up a few thousand numbers anyway.
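For instance, a rough sketch of such a date-range aggregation (the region value and variable names are assumptions):

from datetime import date

start, end = date(2015, 1, 1), date(2015, 12, 31)
entities = TempData.query(TempData.region == 'london',
                          TempData.date >= start,
                          TempData.date <= end).fetch()

# Flatten the per-hour lists and aggregate in memory.
temps = [t for e in entities for t in e.temp]
if temps:
    total, highest, lowest = sum(temps), max(temps), min(temps)
    average = total / len(temps)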
In general, preprocessing is useful if you want to handily query by the new fields you create by such processing (e.g. rapidly answer the question "which dates in locale X had average temperatures greater than 20 Celsius"). That does not seem to be your use case.
If anything, if it's common for you to have to compute many-month values, preprocessing to aggregate things per-month (into simpler TempDataMonth entities) may be more useful. Or, any other several-days period you find useful, of course (weeks, ten-day-groups, whatever). Those could be computed in a background task periodically checking which such periods have become complete since the last check. But, this is a bit beyond your question, so I'm not getting into fine-grained details.
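Purely as an illustration of that idea, a minimal sketch of what a per-month aggregate entity might hold (the fields shown are assumptions):

class TempDataMonth(ndb.Model):
    # One entity per region per month, filled in by a periodic background task.
    region = ndb.StringProperty()
    year = ndb.IntegerProperty()
    month = ndb.IntegerProperty()
    total = ndb.FloatProperty()
    count = ndb.IntegerProperty()    # number of hourly readings aggregated
    max_temp = ndb.FloatProperty()
    min_temp = ndb.FloatProperty()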
The general idea is that minimizing the number of entities to fetch tends to be the single most important optimization; other optimizations are of course also possible, but, they tend to play second fiddle to that:-).
Suppose, in my app, I ask users to input some string. A user can input a string multiple times. Whenever any user inputs a string, I log it in the database along with the day. Many strings can be the same, even though inputted by different users. On the home page, I need to give an interface such that any user can query for the top n (say 50) strings in any time period (say the last 45 days, or 10 Jan 2012 to 30 Jan 2012). If it was SQL, I could have written a query like:
select string, count(*)
from userStrings where day >= d1 and day <= d2
group by string
order by count(*) desc
limit n
For each user query, I can't process the records at query time - there can be millions of records. If the time period constraint were not there, I could have done something like this: create a class for UserString and maintain a unique object of it for each distinct user string, retrieve the corresponding object for the user's inputted string, and increment its count. [Even with this approach, I assume the datastore will have to process all UserString objects (~100,000) and return me the top n - so it can itself be a very heavy query.]
I am using JDO. My obvious goal is to minimize the App Engine cost: CPU + data.
Thanks,
You can use the App Engine Task Queue to process the strings offline. If you need real-time answers, you could use memcache to keep a temporary record of how many times each word has been entered during the day, and move the heavier processing into the background.
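A rough sketch of that combination in Python (the question uses JDO/Java, but the pattern is the same; the key format and the /process_counts handler are assumptions):

from google.appengine.api import memcache, taskqueue

def record_string(s, day):
    # Cheap real-time counter in memcache; initial_value creates it if missing.
    memcache.incr('count:%s:%s' % (day.isoformat(), s), initial_value=0)
    # Durable aggregation happens later in a task queue worker.
    taskqueue.add(url='/process_counts', params={'string': s, 'day': day.isoformat()})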
I have a system where people can pick some stocks and it values their portfolios, but I'm having trouble doing this in an efficient way on a daily basis, because I'm creating entries for days that don't have any changes (think of it as measuring the values with version control, so I can track changes to the way the portfolio is designed).
Here's a example(each day's portfolio with stock name and weight):
Day 1:
ibm = 10%
microsoft = 50%
google = 40%
Day 5:
ibm = 20%
microsoft = 20%
google = 40%
cisco = 20%
I can measure the value of the portfolio on day 1 and understand I need to measure it again on day 5 (when it changed), but how do I measure days 2-4 without recreating day 1's entry in the database?
My approach right now (which I don't like) is to create a temp entry in my database when someone changes the portfolio; then at the end of the day, when I calculate the values, I use the temp entry if it exists, otherwise I create a new entry (for days 2-4) using the last day's data. The issue is that since the data often doesn't change, I'm creating entries that are basically duplicates. The catch is that my stock data is all daily. I also thought of taking the portfolio and, if it hasn't been updated in 3 days, finding the returns of the last 3 days for each stock, but I wasn't sure if there was a better solution.
Any ideas? I think this is a straightforward problem but I just can't see an efficient way of doing it.
Note: in finance terms, it's called creating a NAV, and most firms do it the inefficient way I'm doing it, but that's because the process was created about 50 years ago and hasn't changed. I think this problem is very similar to version control, but I can't seem to come up with a solution.
In storage terms it makes the most sense to just store:
UserId - StockId1 - 23% - 2012-06-25
UserId - StockId2 - 11% - 2012-06-26
UserId - StockId1 - 20% - 2012-06-30
So you can see that stock 1 went down on the 30th. Now if you want to know the StockId1 percentage on the 28th, you just select:
SELECT *
FROM stocks
WHERE datecolumn <= '2012-06-28'  -- in practice, also filter on the user and stock columns
ORDER BY datecolumn DESC LIMIT 0,1
If it gives nothing back, you did not hold it; otherwise you get the last known position back.
BTW, if you need, for example, a graph of stock 1, you could left join against a table full of dates. Then you can fill in the gaps easily.
Found this post here for example:
UPDATE mytable
SET number = (@n := COALESCE(number, @n))
ORDER BY date;
SQL QUERY replace NULL value in a row with a value from the previous known value
I have some reservations with a start and end (the reference to the resource is removed to make the example clearer):
class Reservation(db.Model):
    fromHour = db.DateTimeProperty()
    toHour = db.DateTimeProperty()
    fromToRange = db.ComputedProperty(lambda x: [x.fromHour, x.toHour])
I want to add another reservation, checking that it does not overlap a previous one - how do I express such a query in Google App Engine?
First I tried this query with a list property, but double inequality filters do not work. It should do two matches, from1 < to2 and from1 >= from2, and return one result - in any case it could be costly if there is more data.
fromHour = datetime.datetime(2012, 4, 18, 0, 0, 0)
toHour = datetime.datetime(2012, 4, 18, 2, 0, 0)
reservation = Reservation(fromHour=fromHour, toHour=toHour)
reservation.put()
self.response.out.write('<p>Both %s</p>' % fromHour)
self.response.out.write('<ol>')
for reservation in Reservation.all()\
        .filter('fromToRange >', fromHour)\
        .filter('fromToRange <=', fromHour):
    self.response.out.write('<li>%s</li>' % (reservation.fromToRange))
self.response.out.write('</ol>')
I found another solution: I could use an additional property containing days (a list of the days each reservation's range covers); then I could hit only the days that need to be checked to narrow the scan of the data, and check every matching record to see whether it overlaps the new reservation.
Please help and provide some answer on how to do an optimal query to detect overlapping reservations - maybe there is a quick third solution for time-range queries in Google App Engine, or maybe it is simply not supported.
Yes, the limitation on multiple inequality filters in queries makes some things very hard or suboptimal to do, for instance geo searching.
I'd go with the solution that you proposed: quantize the property value and save all quantized values in a list property, e.g. save all the days that the duration spans (inclusively) into a list property. The tricky part is choosing the right quantization level: days, hours, etc.
Personally I'd start with a universal time scale: Unix time (epoch), round it to seconds, and then quantize it decimally. For example, cut three zeros from it (a 1000-second quantizer spans ~16 minutes) and save all quantized values from the start time to the end time in a list.
If the quantizer is too fine, then use a bigger one: 10,000 sec.
Then you can simply query on the list property with an equality filter and additionally filter the results in memory to account for the exact start and end times.
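A minimal sketch of that approach, reusing the Reservation model from the question (the QUANTUM value and the helper names are assumptions):

import time
from google.appengine.ext import db

QUANTUM = 1000  # seconds per bucket; ~16 minutes, as discussed above

def buckets_for(start, end):
    # All quantized epoch values that the [start, end] range touches.
    start_s = int(time.mktime(start.timetuple()))
    end_s = int(time.mktime(end.timetuple()))
    return range(start_s // QUANTUM, end_s // QUANTUM + 1)

class Reservation(db.Model):
    fromHour = db.DateTimeProperty()
    toHour = db.DateTimeProperty()
    buckets = db.ListProperty(int)  # set to buckets_for(fromHour, toHour) before put()

def find_overlaps(new_from, new_to):
    # The equality filter on the list property only narrows the scan;
    # the exact overlap check happens in memory.
    seen = {}
    for b in buckets_for(new_from, new_to):
        for r in Reservation.all().filter('buckets =', b):
            seen[r.key()] = r
    return [r for r in seen.values()
            if r.fromHour < new_to and r.toHour > new_from]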