Does Google App Engine Java datastore cache JPA query results?

I am using DN3 and GAE 1.7.4.
I use JPA2, which according to the documentation has the Level 2 cache enabled by default.
Here is my question:
If I run a query that returns some objects, would these objects be put in the cache automatically by their ID?
If I run em.find() with the id of an object that has already been loaded by another query (createQuery().getResultList()), would it be served from the cache?
Do I need to run my em.find() or query in a transaction in order for the cache to kick in?
I need some clarification on how this cache works and how I could do my queries/finds/persists in order to make the best use of the cache.
Thanks

From Google App Engine: Using JPA with App Engine
Level2 Caching is enabled by default. To get the previous default
behavior, set the persistence property datanucleus.cache.level2.type
to none. (Alternatively include the datanucleus-cache plugin in the
classpath, and set the persistence property
datanucleus.cache.level2.type to javax.cache to use Memcache for L2
caching.)
As for your questions, the answer depends on your query and on the implementation specifics of DataNucleus and the GAE Datastore adapter. As Carol McDonald suggested, I believe the best way to find the answers is the JPA2 Cache interface, specifically its contains method.
Run your query, get access to the Cache interface through the EntityManagerFactory and see if the Level 2 cache contains the desired entity.
Enabling DataNucleus logs will also give you good hints about what is happening behind the scenes.
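A minimal sketch of that check, in case it helps. The class and method names (L2CacheProbe, idsMissingFromCache) are made up for illustration. The contains check is passed in as a predicate so the snippet runs without a persistence unit; in a real app it would be the JPA2 call emf.getCache().contains(entityClass, id). I have not verified what it reports on GAE.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.function.BiPredicate;

// Hypothetical helper, not part of JPA or DataNucleus.
final class L2CacheProbe {
    private L2CacheProbe() {}

    /**
     * Returns the ids that the supplied "contains" check does NOT report as cached.
     * Assumption: in a JPA app the check is emf.getCache()::contains.
     */
    static List<Object> idsMissingFromCache(Class<?> entityClass,
                                            Collection<?> ids,
                                            BiPredicate<Class<?>, Object> contains) {
        List<Object> missing = new ArrayList<>();
        for (Object id : ids) {
            // Ask the cache about each primary key returned by the query.
            if (!contains.test(entityClass, id)) {
                missing.add(id);
            }
        }
        return missing;
    }
}
```

To use it, run the query, collect the primary keys of the returned objects, and call idsMissingFromCache(MyEntity.class, ids, emf.getCache()::contains), where MyEntity is a placeholder for your entity class. An empty result would suggest that the query results were put in the Level 2 cache by id.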

After debugging on the local GAE development server, I found that the Level 2 cache works. There is no need for a transaction begin/commit. The results of my simple query on primary keys, as well as of em.find(), are put in the cache by their primary keys.
However, the default cache timeout on the local development server appears to be only a few seconds, so I had to add this:
<property name="datanucleus.cache.level2.timeout" value="3600000" />
to persistence.xml.

Related

Cloudant database local cache with Java API

I am new to Cloudant. I tried searching the internet but could not find a good reference on caching.
Can you please point me to some Java-based references, such as an API or links that cover caching?
I'm assuming by cache you mean replicating all or a subset of your Cloudant data to a local datastore. If so, have a look here: https://github.com/cloudant/sync-android
Applications use Cloudant Sync to store, index and query local JSON
data on a device and to synchronise data between many devices.
Synchronisation is under the control of the application, rather than
being controlled by the underlying system. Conflicts are also easy to
manage and resolve, either on the local device or in the remote
database.
...
This library is for Android and Java SE; an iOS version is also
available.
There are examples on the above page of using the library.
Where you see the word 'device' on the above page, you can also think 'Java app'.

How to bypass JPA level2 cache for query and find

I'm using JPA2 with the Level 2 cache enabled (jcache/memcache) on GAE. I have to run some update transactions and would like them to rely on datastore data rather than cached data. I tried setting the javax.persistence.cache.retrieveMode property to "BYPASS" when using the JPA find method, but it doesn't seem to work at all. So I wonder whether cache bypass is possible with DataNucleus JPA2?
Code sample :
if (bypassCache) {
    return find(className, Collections.<String, Object>singletonMap("javax.persistence.cache.retrieveMode", CacheRetrieveMode.BYPASS));
} else {
    return find(className);
}
I'm using DataNucleus 3.1.3 and App Engine 1.7.7.1.
Thanks!
After reading the DataNucleus source code, I understand that the JPA cache bypass is not implemented for the find methods for now. Could anyone confirm?
In fact, it seems that EntityManager.find() calls always go through the L2 cache, whatever properties you set. I ran a JPQL query and the query result wasn't L2 cached (datanucleus.query.results.cached is false by default). So my understanding is that I should use queries to get fine-grained control over the L2 cache.

How can I speed up the App Engine bulk downloader?

I'm trying to use the App Engine bulkloader to download entities from the datastore (the high-replication one, if it matters). It works, but it's quite slow (85KB/s). Is there some magical set of parameters I can pass to make it faster? I'm receiving about 5MB/minute or 20,000 records/minute, and given that my connection can do 1MB/second (and hopefully App Engine can serve faster than that), there must be a way to do it faster.
Here's my current command. I've tried high numbers, low numbers, and every permutation:
appcfg.py download_data \
    --application=xxx \
    --url=http://xxx.appspot.com/_ah/remote_api \
    --filename=backup.csv \
    --rps_limit=30000 \
    --bandwidth_limit=100000000 \
    --batch_size=500 \
    --http_limit=32 \
    --num_threads=30 \
    --config_file=bulkloader.yaml \
    --kind=foo
I already tried the suggestions in "App Engine Bulk Loader Performance", and they are no faster than what I already have. The numbers he mentions are on par with what I'm seeing as well.
Thanks in advance.
Did you set an index on the key of the entity you're trying to download?
I don't know if that helps, but check whether you get a warning at the beginning of the download that says something about "using sequential download".
Put this in index.yaml to create an index on the entity key, then upload it and wait for the index to be built.
- kind: YOUR_ENTITY_TYPE
  properties:
  - name: __key__
    direction: desc

Running Solr in read-only mode

I think I'm missing something obvious here. I have to imagine a lot of people open up their Solr servers to other developers and don't want them to be able to modify the index.
Is there something in solrconfig.xml that can be set to effectively make the index read-only?
Update for clarification:
My goal is to use Solr with an existing Lucene index managed by another application. This works just fine, but I want to be sure Solr never tries to write to this index.
Exposing a Solr instance to the public internet is a bad idea. Even though you can strip some components to make it read-only, it just wasn't designed with security in mind. It's meant to be used as an internal service, just as you wouldn't expose an RDBMS.
From the Solr Security wiki page:
First and foremost, Solr does not
concern itself with security either at
the document level or the
communication level. It is strongly
recommended that the application
server containing Solr be firewalled
such the only clients with access to
Solr are your own. A default/example
installation of Solr allows any client
with access to it to add, update, and
delete documents (and of course
search/read too), including access to
the Solr configuration and schema
files and the administrative user
interface.
Even ajax-solr, a Solr client for JavaScript meant to run in a browser, recommends talking to Solr through a proxy.
Take for example guardian.co.uk: it's well-known that they use Solr for searching, but they built an API to let others access their content. This way they can define and control exactly what and how they want people to search for things.
Otherwise, any script kiddie can write a trivial loop to DoS your Solr instance and therefore bring down your site.
You can probably just remove the line that defines your solr.XmlUpdateRequestHandler in solrconfig.xml.
Replication is a nice way to set up a read-only instance while still being able to index. Just set up a master with restricted access and a slave that is read-only (by removing the XmlUpdateRequestHandler from its config). The slave will be replicated from the master but won't accept any indexing requests directly.
UPDATE
I just read that in Solr 1.4 you can disable components. I tried it on the /update requestHandler and was no longer able to index.

GAE Datastore backup

Is it necessary to do backups of GAE's Datastore?
Does anyone have any experience, suggestions, tricks for doing so?
Backups are always necessary to protect against human error. Since App Engine encourages you to build multiple revisions of your code that run against the same dataset, it's important to be able to go back.
A simple dump/restore tool is explained in the Bulkloader documentation.
Something else I've done in the past for major DB refactors is:
Change the entity name in your new code (e.g. User -> Customer, or User2 if you have to).
When looking up an entity by key:
    Try the key with the new class and return the entity if it exists.
    Otherwise, try the key with the old db.Model class. If you find it, migrate the data, put(), and return the new entity.
    Use the entity as usual.
    (You may have to use a task queue to migrate all the data. If you always fetch the entities by key, it's not necessary.)
Deploy a new version of your code so that both coexist server-side. When you activate the new version, it is like a point-in-time snapshot of the old entities. In an emergency, you could reactivate the old version and use the old data.
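The lookup step can be sketched roughly as follows. This is only an illustration in Java with two in-memory maps standing in for the old and new datastore kinds; it does not use db.Model or any App Engine API, and all names (LazyMigrationStore, User, Customer, displayName) are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// In-memory stand-in for two datastore kinds; illustrates migrate-on-read only.
final class LazyMigrationStore {
    // Old kind (hypothetical "User").
    static final class User {
        final String name;
        User(String name) { this.name = name; }
    }

    // New kind (hypothetical "Customer").
    static final class Customer {
        final String displayName;
        Customer(String displayName) { this.displayName = displayName; }
    }

    private final Map<String, User> oldKind = new HashMap<>();
    private final Map<String, Customer> newKind = new HashMap<>();

    void putOld(String key, User user) { oldKind.put(key, user); }

    boolean isMigrated(String key) { return newKind.containsKey(key); }

    // Returns the new-style entity for the key, migrating it on first read.
    // Returns null if the key exists in neither kind.
    Customer get(String key) {
        Customer current = newKind.get(key);
        if (current != null) {
            return current;                          // already migrated
        }
        User old = oldKind.get(key);
        if (old == null) {
            return null;                             // unknown key
        }
        Customer migrated = new Customer(old.name);  // migrate the data
        newKind.put(key, migrated);                  // the put() step
        return migrated;                             // old entity is left untouched
    }
}
```

Leaving the old entity in place is what keeps the old data usable if you have to reactivate the previous version.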
You can now use the managed export and import feature, which can be accessed through gcloud or the Datastore Admin API:
Exporting and Importing Entities
Scheduling an Export
