I'm mostly asking about databases and caching in the context of a Play! app on Heroku:
What does the cache do for a database and how do I use it?
A cache is used to avoid querying a database too much.
Some queries take an especially long time to run. By caching the result (e.g. saving it in memory), the expensive query doesn't need to be executed again for as long as the data is still valid; validity might mean a few minutes, or until some data in a certain table changes.
A cache is normally implemented as a giant hash table with keys and values. The key is used to look up a value.
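To make that concrete, here's a minimal Python sketch of such a hash-table cache with per-entry expiry (the TTLCache name and API are made up for illustration, not any particular library):

```python
import time

class TTLCache:
    """A minimal hash-table cache: keys map to (value, expiry-time) pairs."""

    def __init__(self, default_ttl=60):
        self._store = {}              # the "giant hash table"
        self._default_ttl = default_ttl

    def set(self, key, value, ttl=None):
        expires_at = time.time() + (ttl if ttl is not None else self._default_ttl)
        self._store[key] = (value, expires_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None               # never cached
        value, expires_at = entry
        if time.time() >= expires_at:
            del self._store[key]      # stale: drop it and report a miss
            return None
        return value

cache = TTLCache(default_ttl=120)
cache.set("user:42", {"name": "Alice"})
cache.get("user:42")                  # a hit until the entry expires
```

Real caches (memcached, the Play cache, etc.) add eviction and shared access across processes, but the lookup model is the same.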
Cache usage is described at http://www.playframework.org/documentation/2.0/ScalaCache, and the code for it is very simple. To store something in the cache:
Cache.set("item.key", connectedUser)
Here you just pass the key to store the object at, and the object.
To retrieve it:
val user: User = Cache.getOrElseAs[User]("item.key") {
  User.findById(connectedUser)
}
Basically, the pattern is getOrElseAs[type to cast the data to](key).
Notice the block you can pass to getOrElseAs: if the key is not found, the block runs, so you can query the database there.
Alternatively, you can use Cache.getAs[User]("item.key"), but you probably want to query the database anyway if the key isn't found.
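The get-or-else pattern itself isn't Play-specific. A rough Python sketch, with a plain dict as the cache and a stand-in find_user function playing the role of User.findById:

```python
def get_or_else(cache, key, compute):
    """Return the cached value for key; on a miss, compute it, cache it, return it."""
    value = cache.get(key)
    if value is None:          # cache miss: fall back to the expensive query
        value = compute()
        cache[key] = value
    return value

calls = []
def find_user():               # stands in for the database query
    calls.append(1)
    return {"id": 7, "name": "Bob"}

cache = {}
u1 = get_or_else(cache, "item.key", find_user)   # miss: runs the query
u2 = get_or_else(cache, "item.key", find_user)   # hit: served from the cache
```

The second call never touches the "database": that's the entire point of wrapping the lookup this way.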
Can I use Elasticsearch to store statistics with less overhead?
It would be used, for example, to record how often a function is called and how long it takes, or how many requests have been made to a specific endpoint and how long each one took, and so on.
My idea would be to store a key, a timestamp, and a takenTime value, and then query the results in different ways.
This would be handled by simple functions like profile_start and profile_done:
void endpointGetUserInformation()
{
    profile_start("requests.GetUserInformation");
    ...
    // profile_done stores to the database
    profile_done("requests.GetUserInformation");
}
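For illustration, here's a rough Python equivalent of the profile_start/profile_done pair, written as a context manager so the two calls can't get out of sync; the records list stands in for whatever store (SQL or Elasticsearch) receives the samples:

```python
import time

records = []   # stand-in for the statistics store

class profile:
    """On exit, records (key, start timestamp, elapsed seconds),
    mirroring what profile_done would write to the database."""

    def __init__(self, key):
        self.key = key

    def __enter__(self):
        self.start = time.time()
        return self

    def __exit__(self, *exc_info):
        records.append((self.key, self.start, time.time() - self.start))
        return False   # don't swallow exceptions from the endpoint body

with profile("requests.GetUserInformation"):
    sum(range(1000))   # the endpoint's actual work would go here
```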
In a normal SQL database I would make a table which holds all the keys, and a second table that holds key_ids, timestamps and timeTaken values. This storage would need less space on disk.
When I store to Elasticsearch it stores a lot of additional data, and the key is redundant. Is there a solution for storing it more simply?
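The two-table layout described above could be sketched like this with Python's sqlite3 (table and column names are made up for the example; any SQL database works the same way):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE keys (
        id   INTEGER PRIMARY KEY,
        name TEXT UNIQUE NOT NULL
    );
    CREATE TABLE samples (
        key_id     INTEGER NOT NULL REFERENCES keys(id),
        ts         REAL NOT NULL,
        time_taken REAL NOT NULL
    );
""")

def record(name, ts, taken):
    # The key string is stored once; every sample carries only the integer id.
    conn.execute("INSERT OR IGNORE INTO keys (name) VALUES (?)", (name,))
    key_id = conn.execute("SELECT id FROM keys WHERE name = ?",
                          (name,)).fetchone()[0]
    conn.execute("INSERT INTO samples VALUES (?, ?, ?)", (key_id, ts, taken))

record("requests.GetUserInformation", 1700000000.0, 0.012)
record("requests.GetUserInformation", 1700000060.0, 0.015)

# Query back per-key aggregates, joining to recover the readable name.
row = conn.execute("""
    SELECT k.name, COUNT(*), AVG(s.time_taken)
    FROM samples s JOIN keys k ON k.id = s.key_id
    GROUP BY k.name
""").fetchone()
```

This is the space saving in the question: each sample row is just three numbers, instead of repeating the full key string per document.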
In one place in the code I do something like this:
FormModel(.. some data here..).put()
And a couple lines below I select from the database:
FormModel.all().filter(..).fetch(100)
The problem I noticed - sometimes the fetch doesn't notice the data I just added.
My theory is that this happens because I'm using high replication storage, and I don't give it enough time to replicate the data. But how can I avoid this problem?
Unless the data is in the same entity group, there is no way to guarantee that the data returned will be the most up to date (if I understand this section correctly).
Shay is right: there's no way to know when the datastore will be ready to return the data you just entered.
However, you are guaranteed that the data will be entered eventually, once the call to put completes successfully. That's a lot of information, and you can use it to work around this problem. When you get the data back from fetch, just append/insert the new entities that you know will be in there eventually! In most cases it will be good enough to do this on a per-request basis, I think, but you could do something more powerful that uses memcache to cover all requests (except cases where memcache fails).
The hard part, of course, is figuring out when you should append/insert which entities. It's obnoxious to have to do this workaround, but a relatively low price to pay for something as astonishingly complex as the HRD.
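A minimal sketch of that append/insert workaround, with plain dicts standing in for entities and a hypothetical merge_pending helper (not an App Engine API):

```python
def merge_pending(fetched, pending, key=lambda e: e["id"]):
    """Union of the query results and the entities this request just put()
    but that may not be visible yet, de-duplicated by key."""
    seen = {key(e) for e in fetched}
    return fetched + [e for e in pending if key(e) not in seen]

fetched = [{"id": 1}, {"id": 2}]   # what the eventually-consistent query saw
pending = [{"id": 2}, {"id": 3}]   # what this request just wrote
result = merge_pending(fetched, pending)
```

De-duplicating matters because the datastore may or may not have caught up by the time the query runs, so a just-written entity can appear in both lists.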
From https://developers.google.com/appengine/docs/java/datastore/transactions#Java_Isolation_and_consistency
This consistent snapshot view also extends to reads after writes
inside transactions. Unlike with most databases, queries and gets
inside a Datastore transaction do not see the results of previous
writes inside that transaction. Specifically, if an entity is modified
or deleted within a transaction, a query or get returns the original
version of the entity as of the beginning of the transaction, or
nothing if the entity did not exist then.
I'm having some issues when I try to insert the 36k French cities into BigTable. I'm parsing a CSV file and putting every row into the datastore using this piece of code:
import csv
from databaseModel import *
from google.appengine.ext.db import GqlQuery

def add_cities():
    spamReader = csv.reader(open('datas/cities_utf8.txt', 'rb'), delimiter='\t', quotechar='|')
    mylist = []
    for i in spamReader:
        region = GqlQuery("SELECT __key__ FROM Region WHERE code=:1", i[2].decode("utf-8"))
        mylist.append(InseeCity(region=region.get(), name=i[11].decode("utf-8"),
                                name_f=strip_accents(i[11].decode("utf-8")).lower()))
    db.put(mylist)
It's taking around 5 minutes (!!!) with the local dev server, and even 10 minutes to delete them with the db.delete() function.
When I try it online by calling a test.py page containing add_cities(), the 30s timeout is reached.
I'm coming from the MySQL world and I think it's a real shame not to be able to add 36k entities in less than a second. I may be going about it the wrong way, so I'm asking you:
Why is it so slow?
Is there any way to do it in a reasonable time?
Thanks :)
First off, it's the datastore, not Bigtable. The datastore uses Bigtable, but it adds a lot more on top of that.
The main reason this is going so slowly is that you're doing a query (on the 'Region' kind) for every record you add. This is inevitably going to slow things down substantially. There are two things you can do to speed things up:
Use the code of a Region as its key_name, allowing you to do a faster datastore get instead of a query. In fact, since you only need the region's key for the reference property, you needn't fetch the region at all in that case.
Cache the region list in memory, or skip storing it in the datastore at all. By its nature, I'm guessing regions is both a small list and infrequently changing, so there may be no need to store it in the datastore in the first place.
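A rough sketch of that second suggestion, with a plain dict standing in for the cached Region list and strings standing in for datastore keys (the csv parsing mirrors the layout the question's reader assumes: region code in column 2, city name in column 11):

```python
import csv
import io

# Stand-in for one tab-delimited CSV line with the assumed column layout.
line = "\t".join(["x"] * 2 + ["75"] + ["y"] * 8 + ["Paris"]) + "\n"
reader = csv.reader(io.StringIO(line), delimiter="\t")

# One upfront lookup table instead of one GqlQuery per row.  In a real app
# this would be filled from a single fetch of the (small) Region kind.
region_keys = {"75": "Region/75"}          # code -> datastore key (hypothetical)

cities = []
for row in reader:
    cities.append({
        "region": region_keys[row[2]],     # O(1) dict hit, no query at all
        "name": row[11],
    })
# A single batched db.put(cities) would follow here.
```

The per-row query is gone entirely; the only datastore traffic left is the batched put at the end.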
In addition, you should use the mapreduce framework when loading large amounts of data to avoid timeouts. It has built-in support for reading CSVs from blobstore blobs, too.
Use the Task Queue. If you want your dataset to process quickly, have your upload handler create a task for each subset of 500 using an offset value.
FWIW we process large CSV's into datastore using mapreduce, with some initial handling/ validation inside a task. Even tasks have a limit (10 mins) at the moment, but that's probably fine for your data size.
Make sure that if you're doing inserts, etc., you batch as much as possible: don't insert individual records, and the same goes for lookups (get_by_key_name allows you to pass in an array of key names). (I believe db.put has a limit of 200 records at the moment?)
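Batching is just a matter of slicing the entity list into fixed-size chunks before each put. A small sketch (the 200-record figure is the answer's own hedged guess, not a verified limit):

```python
def chunks(items, size=200):
    """Yield successive fixed-size slices so each batched call stays small."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

entities = list(range(450))                # stand-ins for datastore entities
batches = list(chunks(entities, 200))      # three batches: 200, 200, 50
# for batch in batches: db.put(batch)      # one RPC per batch, not per entity
```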
Mapreduce might be overkill for what you're doing now, but it's definitely worth wrapping your head around, it's a must-have for larger data sets.
Lastly, timing of anything on the SDK is largely pointless - think of it as a debugger more than anything else!
I have an interesting dilemma. I have a very expensive query that involves several full table scans and expensive joins, as well as calling out to a scalar UDF that calculates some geospatial data.
The end result is a resultset that contains data presented to the user. However, I can't return everything I want to show the user in one call, because I subdivide the original resultset into pages and just return a specified page, and I also need to take the entire original dataset and apply GROUP BYs, joins, etc. to calculate related aggregate data.
Long story short, in order to bind all of the data I need to the UI, this expensive query needs to be called about 5-6 times.
So, I started thinking about how I could calculate this expensive query once, and then each subsequent call could somehow pull against a cached result set.
I hit upon the idea of abstracting the query into a stored procedure that would take in a CacheID (Guid) as a nullable parameter.
This sproc would insert the resultset into a cache table using the cacheID to uniquely identify this specific resultset.
This allows sprocs that need to work on this resultset to pass in a cacheID from a previous query and it is a simple SELECT statement to retrieve the data (with a single WHERE clause on the cacheID).
Then, using a periodic SQL job, flush out the cache table.
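Here's a toy sketch of that cache-table idea using Python's sqlite3: a GUID cache_id, an indexed cache table, and a stand-in for the expensive query. It only illustrates the flow, not the SQL Server sproc specifics:

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE result_cache (
        cache_id TEXT NOT NULL,
        row_data TEXT NOT NULL
    )
""")
conn.execute("CREATE INDEX idx_cache_id ON result_cache (cache_id)")

def run_expensive_query():
    # Stands in for the full scans, joins, and geospatial UDF.
    return ["row-a", "row-b"]

def cached_query(cache_id=None):
    """First call (cache_id=None): materialise the result set into the cache
    table under a fresh GUID.  Later calls: a plain indexed SELECT."""
    if cache_id is None:
        cache_id = str(uuid.uuid4())
        conn.executemany(
            "INSERT INTO result_cache VALUES (?, ?)",
            [(cache_id, r) for r in run_expensive_query()])
    rows = [r for (r,) in conn.execute(
        "SELECT row_data FROM result_cache WHERE cache_id = ? ORDER BY rowid",
        (cache_id,))]
    return cache_id, rows

cid, first = cached_query()       # expensive path; result stored under cid
_, again = cached_query(cid)      # cheap path: single WHERE on the cache id
```

The periodic cleanup job from the question would simply be a DELETE on result_cache for cache_ids older than some threshold.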
This works great, and really speeds things up on zero load testing. However, I am concerned that this technique may cause an issue under load with massive amounts of reads and writes against the cache table.
So, long story short: am I crazy, or is this a good idea?
Obviously I need to be worried about lock contention, and index fragmentation, but anything else to be concerned about?
I have done that before, especially when I did not have the luxury of editing the application. I think it's a valid approach sometimes, but in general a cache/distributed cache in the application is preferred, because it reduces the load on the DB better and scales better.
The tricky thing with the naive "just do it in the application" solution is that you often have multiple applications interacting with the DB, which can put you in a bind if you have no application messaging bus (or something like memcached), because it can be expensive to have one cache per application.
Obviously, for your problem the ideal solution is to be able to do the paging more cheaply, without churning through ALL the data just to get page N. But sometimes that's not possible. Keep in mind that streaming data out of the DB can be cheaper than streaming data out of the DB and back into the same DB. You could introduce a new service responsible for executing these long queries, and have your main application talk to the DB via that service.
Your tempdb could balloon like crazy under load, so I would watch that. It might be easier to put the expensive joins in a view and index the view than trying to cache the table for every user.
I am writing an application which needs to periodically (each week, for example) loop through several million records in a database and execute code on each row.
Since the table is so big, I suspect that when I call SomeObject.FindAll() it is reading all 1.4million rows and trying to return all the rows in a SomeObject[].
Is there a way I can execute a SomeObject.FindAll() expression, but load the values in a more DBMS friendly way?
Not with FindAll() - which, as you've surmised, will try to load all the instances of the specified type at one time (and, depending on how you've got NHibernate set up may issue a stupendous number of SQL queries to do it).
Lazy loading works only on properties of objects. So, for example, if you had a persisted type SomeObjectContainer which had as a property a list of SomeObject, mapped so that it matches all SomeObjects and with lazy="true", and then did a foreach on that list property, you'd get what you want, sort of: by default, NHibernate would issue a query for each element in the list, loading only one at a time. Of course, the read cache would grow enormous, so you'd probably need to flush a lot.
What you can do is issue an HQL (or even embedded SQL) query to retrieve all the IDs for all SomeObjects and then loop through the IDs one at a time fetching the relevant object with FindByPrimaryKey. Again, it's not particularly elegant.
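That IDs-first approach can be sketched generically, with a load_batch callback standing in for the per-key FindByPrimaryKey fetches (batching the key lookups keeps the round-trip count down while still bounding memory):

```python
def iter_in_batches(ids, load_batch, batch_size=100):
    """Stream objects by primary key in small batches instead of
    materialising the whole multi-million-row table at once."""
    for i in range(0, len(ids), batch_size):
        for obj in load_batch(ids[i:i + batch_size]):
            yield obj          # caller processes one object at a time

loaded_batches = []
def load_batch(id_batch):      # stands in for the per-key ORM fetches
    loaded_batches.append(len(id_batch))
    return [{"id": i} for i in id_batch]

all_ids = list(range(250))     # from the cheap IDs-only HQL/SQL query
objs = list(iter_in_batches(all_ids, load_batch, batch_size=100))
```

Because the function is a generator, only one batch of objects is alive at a time; in NHibernate you'd also want to clear the session between batches so the first-level cache doesn't grow unboundedly.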
To be honest, in a situation like that I'd probably turn this into a scheduled maintenance job in a stored proc - unless you really have to run code on the object rather than manipulate the data somehow. It might annoy object purists, but sometimes a stored proc is the right way to go, especially in this kind of batch job scenario.