Is it possible to manually define the logic of the serialization used for AppEngine Datastore?
I am assuming Google is using reflection to do this in a generic way. This works but proves to be quite slow. I'd be willing to write (and maintain) quite some code to speed up the serialization / deserialization of datastore objects (I have large objects and this consumes quite some percentage of the time).
The datastore uses Protocol-Buffers internally, and there is no way round, as its the only way your application can communicate with the datastore.
(The implementation can be found in the SDK/google/appengine/datastore/entity_pb.py)
If you think, (de)serialization is too slow in your case, you probably have two choices
Move to a lower DB API. There's another API next to the two well-documented ext.db and ext.ndb APIs at google.appengine.datastore. This hasn't all the fancy model-stuff and provides a simple (and hopefully fast) dictionary-like api. This will keep your datastore-layout compatible with the other two DB APIs.
Serialize the object yourself, and store it in a dummy entry consisting just of a text-field. But you'll probably need to duplicate data into your base entry, as you cannot filter/sort by data inside your self-serialized text.
Related
I'm trying to build a very simple wiki-like system in Clojure and serving the http using Ring.
Instead of using a regular database i was thinking about using just an atom and serialise it to a file when it gets changed. Something like https://github.com/alandipert/enduro just with a delayed write.
Having the data in-mem in vectors and maps will surely make the service faster and the code simpler/more intuitive to write?
Will that work with a multithreaded Jetty/Ring server?
The content of the atom will surely fit in memory for now, but that might not hold true in the future. Any ideas to how i can structure the code to make it easier to switch to an alternative storage backend in the future?
This is the best guide for keeping data in memory and storing it to a single file: http://www.brandonbloom.name/blog/2013/06/26/slurp-and-spit/
Datomic would give you a few options.
You could use the in-memory db which would give you query power and thread safety. It would also be very easy to switch to a persistent datastore if/when the time comes. However, I'm not sure about serialization of the in-memory db.
Or you could use Datomic just for Datalog, which can be used for querying data structures. In that case, you could use an atom and then serialize as planned. Moving to a persistent datastore would be more work than the first case, but still not much. In either case, most of your code wouldn't need to change.
In my opinion, you'd be better of just starting with the free version of Datomic that uses the file system as a datastore. I don't think using an atom simplifies your code very much.
I second the recommendation for Datomic.
I've been using it on a "real" project for a few weeks now, and the more I use it, the more I realize that it would be a solid foundation for handling your data in any non-trivial project. Even if you never plan to use a "real" database in the future, just having a fact-based data model, powerful querying, and even full-text search built in is a huge win over just using an atom to store some huge map.
I checked and the free version does give you local storage as well as the in-memory database, so that would solve your storage needs perfectly (it uses an H2 database behind the scenes). And if you ever find yourself needing to scale to something bigger, you're already set.
So I'm designing this blog engine and I'm trying to just keep my blog data without considering comments or membership system or any other type of multi-user data.
The blog itself is surrounded around 2 types of data, the first is the actual blog post entry which consists of: title, post body, meta data (mostly dates and statistics), so it's really simple and can be represented by simple json object. The second type of data is the blog admin configuration and personal information. Comment system and other will be implemented using disqus.
My main concern here is the ability of such engine to scale with spiked visits (I know you might argue this but lets take it for granted). So since I've started this project I'm moving well with the rest of my stack except the data layer. Now I've been having this dilemma choosing the database, I've considered MongoDB but some reviews and articles/benchmarking were suggesting slow reads after collections read certain size. Next I was looking at Redis and using its persistence features RDB and AOF, while Redis is good at both fast reading/writing I'm afraid of using it because I'm not familiar with it. And this whole search keeps going on to things like "PostgreSQL 9.4 is now faster than MongoDB for storing JSON documents" etc.
So is there any way I can settle this issue for good? considering that I only need to represent my data in key,value structure and only require fast reading but not writing and the ability to be fault tolerant.
Thank you
If I were you I would start small and not try to optimize for big data just yet. A lot of blogs you read about the downsides of a NoSQL solution are around large data sets - or people that are trying to do relational things with a database designed for de-normalized data.
My list of databases to consider:
Mongo. It has huge community support and based on recent funding - it's going to be around for a while. It runs very well on a single instance and a basic replica set. It's easy to set up and free, so it's worth spending a day or two running your own tests to settle the issue once and for all. Don't trust a blog.
Couchbase. Supports key/value storage and also has persistence to disk. http://www.couchbase.com/couchbase-server/features Also has had some recent funding so hopefully that means stability. =)
CouchDB/PouchDB. You can use PouchDB purely on the client side and it can connect to a server side CouchDB. CouchDB might not have the same momentum as Mongo or Couchbase, but it's an actively supported product and does key/value with persistence to disk.
Riak. http://basho.com/riak/. Another NoSQL that scales and is a key/value store.
You can install and run a proof-of-concept on all of the above products in a few hours. I would recommend this for the following reasons:
A given database might scale and hit your points, but be unpleasant to use. Consider picking a database that feels fun! Sort of akin to picking Ruby/Python over Java because the syntax is nicer.
Your use case and domain will be fairly unique. Worth testing various products to see what fits best.
Each database has quirks and you won't find those until you actually try one. One might have quirks that are passable, one will have quirks that are a show stopper.
The benefit of trying all of them is that they all support schemaless data, so if you write JSON, you can use all of them! No need to create objects in your code for each database.
If you abstract the database correctly in code, swapping out data stores won't be that painful. In other words, your code will be happier if you make it easy to swap out data stores.
This is only an option for really simple CMSes, but it sounds like that's what you're building.
If your blog is super-simple as you describe and your main concern is very high traffic then the best option might be to avoid a database entirely and have your CMS generate static files instead. By doing this, you eliminate all your database concerns completely.
It's not the best option if you're doing anything dynamic or complex, but in this small use case it might fit the bill.
Given that a lot of people use content repositories. There must be a good reason. I'm building out a new web application that will need to store content. Can someone help me understanding this?
What are the advantages to using a content repository like Apache Jackrabbit as opposed to writing your own code/API to store images or text pages? Writing your own requires time etc. but so too does implementing and learning a new framework like the content repository API. A benefit to rolling your own seems to me that you know your code and have immediate expertise if you need to enhance or fix it. Using another framework you need to learn its foibles, and it is always easier to modify code you know that don't know... i.e. you don't know that underlying framework code as well as your own.
As I said a lot of people use them. There must be a reason. I can't see it as being just another "everyone is using them so, so should we." At least I hope it isn't that. :)
A JCR repository allows you to store all your content (from structured database-type data to large multimedia files) in a single place and with a single API, which is extremely convenient and makes your code simpler, avoiding the impedance mismatch between files and data that you usually have in content-based systems.
JCR also provides a lot of infrastructure functionality that you won't have to build or assemble yourself: search (including full-text), observation (callbacks when something changes) versioning, data types including multi-value, ordered nodes, etc...
If you allow a shameless plug, my "JCR - best of both worlds" article at http://java.dzone.com/articles/java-content-repository-best describes this in more detail and also provides a reading list for the JCR spec, that should allow you go get a good overview without reading the whole thing.
The article uses Apache Sling for its examples, which combined with a JCR repository provides a very nice (IMO, but as a Sling committer I'm biased ;-) platform for content-based applications.
My most recent projects have involved both choices: a custom-built data store (MySQL and image files) wtih a multi-level caching mechanism, and a JCR-based commercial repository.
A few thoughts:
In the short run, a DIY solution offers reduced complexity: you only have to build and learn what you need. And there is at least the opportunity to optimize
the data store for your particular application's needs -- more than likely speed of retrieval, but possibly storage footprint, security, or reliability concerns are foremost for you.
However, in the long run, you're looking at a significant increment of work to extend the home-grown system to a new content type (video, e.g.) or to provide new functionality (maybe,
versioning).
Also, it's difficult to separate the choice of a data store approach from the choice of tools that content providers will use to populate and maintain the data store. You'll have to give
your authors something more than an HTML form with a textarea and a submit button.
This is related to the advantages of standardization: compatibility and interchangeability. If everybody writes his own library and API, there is no compatibility and interchangeability, leading to higher cost.
I'm writing a CAD (Computer-Aided Design) application. I'll need to ship a library of 3d objects with this product. These are simple objects made up of nothing more than 3d coordinates and there are going to be no more than about 300 of them.
I'm considering using a relational database for this purpose. But given my simple needs, I don't want any thing complicated. Till now, I'm leaning towards SQLite. It's small, runs within the client process and is claimed to be fast. Besides I'm a poor guy and it's free.
But before I commit myself to SQLite, I just wish to ask your opinion whether it is a good choice given my requirements. Also is there any equivalent alternative that I should try as well before making a decision?
Edit:
I failed to mention earlier that the above-said CAD objects that I'll ship are not going to be immutable. I expect the user to edit them (change dimensions, colors etc.) and save back to the library. I also expect users to add their own newly-created objects. Kindly consider this in your answers.
(Thanks for the answers so far.)
The real thing to consider is what your program does with the data. Relational databases are designed to handle complex relationships between sets of data. However, they're not designed to perform complex calculations.
Also, the amount of data and relative simplicity of it suggests to me that you could simply use a flat file to store the coordinates and read them into memory when needed. This way you can design your data structures to more closely reflect how you're going to be using this data, rather than how you're going to store it.
Many languages provide a mechanism to write data structures to a file and read them back in again called serialization. Python's pickle is one such library, and I'm sure you can find one for whatever language you use. Basically, just design your classes or data structures as dictated by how they're used by your program and use one of these serialization libraries to populate the instances of that class or data structure.
edit: The requirement that the structures be mutable doesn't really affect much with regard to my answer - I still think that serialization and deserialization is the best solution to this problem. The fact that users need to be able to modify and save the structures necessitates a bit of planning to ensure that the files are updated completely and correctly, but ultimately I think you'll end up spending less time and effort with this approach than trying to marshall SQLite or another embedded database into doing this job for you.
The only case in which a database would be better is if you have a system where multiple users are interacting with and updating a central data repository, and for a case like that you'd be looking at a database server like MySQL, PostgreSQL, or SQL Server for both speed and concurrency.
You also commented that you're going to be using C# as your language. .NET has support for serialization built in so you should be good to go.
I suggest you to consider using H2, it's really lightweight and fast.
When you say you'll have a library of 300 3D objects, I'll assume you mean objects for your code, not models that users will create.
I've read that object databases are well suited to help with CAD problems, because they're perfect for chasing down long reference chains that are characteristic of complex models. Perhaps something like db4o would be useful in your context.
How many objects are you shipping? Can you define each of these Objects and their coordinates in an xml file? So basically use a distinct xml file for each object? You can place these xml files in a directory. This can be a simple structure.
I would not use a SQL database. You can easy describe every 3D object with an XML file. Pack this files in a directory and pack (zip) all. If you need easy access to the meta data of the objects, you can generate an index file (only with name or description) so not all objects must be parsed and loaded to memory (nice if you have something like a library manager)
There are quick and easy SAX parsers available and you can easy write a XML writer (or found some free code you can use for this).
Many similar applications using XML today. Its easy to parse/write, human readable and needs not much space if zipped.
I have used Sqlite, its easy to use and easy to integrate with own objects. But I would prefer a SQL database like Sqlite more for applications where you need some good searching tools for a huge amount of data records.
For the specific requirement i.e. to provide a library of objects shipped with the application a database system is probably not the right answer.
First thing that springs to mind is that you probably want the file to be updatable i.e. you need to be able to drop and updated file into the application without changing the rest of the application.
Second thing is that the data you're shipping is immutable - for this purpose therefore you don't need the capabilities of a relational db, just to be able to access a particular model with adequate efficiency.
For simplicity (sort of) an XML file would do nicely as you've got good structure. Using that as a basis you can then choose to compress it, encrypt it, embed it as a resource in an assembly (if one were playing in .NET) etc, etc.
Obviously if SQLite stores its data in a single file per database and if you have other reasons to need the capabilities of a db in you storage system then yes, but I'd want to think about the utility of the db to the app as a whole first.
SQL Server CE is free, has a small footprint (no service running), and is SQL Server compatible
I would like to develop a web-app requiring data persistence using GWT and GAE. As I understand it, my only (or at least by far the most convenient) option for data persistence is GAE's Datastore, using JDO or JPA annotated objects. I would also like to be able to send my objects back and forth client-server using GWT Remote Procedure Calls (RPC), therefore my objects must be able to "detach". However, GWT RPC serialization cannot handle detached JDO/JPA objects and it doesn't appear as though it will in the near future.
My question: what is the simplest and most direct solution to this? Being able to share the same objects client/server with server-side persistence would be extremely convenient.
EDIT
I should clarify that I still wish to use GWT RPC with GAE's Datastore. I am just looking for the best solution that would allow all these technologies to work together.
Try use http://gilead.sourceforge.net/
I've recently found Objectify, which is designed to be a replacement for JDO. Not much experience with it yet but its simpler to use than JDO, seems more lightweight, and claims to get around the need for DTOs with GWT, though I haven't tried that particular feature yet.
Ray Cromwell has a temporary hack up. I've tried it, and it works.
It forces you to use Transient instead of Detachable entities, because GWT can't serialize a hidden Object[] used by DataNucleus; This means that the objects you send to the client can't be inserted back into the datastore, you must retrieve the actual datastore object, and copy all the persistent fields back into it. Ray's method uses reflection to iterate over the methods, retrieve the getBean() and setBean() methods, and apply the entity setBean() with your transient gwt object's getBean().
You should strive to use JDO, the JPA isn't much more than a wrapper class for now. To use this hack, you must have both getter and setter methods for every persistent field, using PROPER getBean and setBean syntax for every "bean" field. Well, ALMOST PROPER, as it assumes all getters will start with "get", when the default boolean field use is "is".
I've fixed this issue and posted a comment on Ray's blog, but it's awaiting approval and I'm not sure if he'll post it. Basically, I implemented a #GetterPrefix(prefix=MethodPrefix.IS) annotation in the org.datanucleus package to augment his work.
In case it doesn't get posted, and this is an issue, email x_AT_aiyx_DOT_info Re: #GetterPrefix for JDO and I'll send you the fix.
Awhile ago I wrote a post Using an ORM or plain SQL?
This came up last year in a GWT
application I was writing. Lots of
translation from EclipseLink to
presentation objects in the service
implementation. If we were using
ibatis it would've been far simpler to
create the appropriate objects with
ibatis and then pass them all the way
up and down the stack. Some purists
might argue this is Badâ„¢. Maybe so (in
theory) but I tell you what: it
would've led to simpler code, a
simpler stack and more productivity.
which basically matches your observation.
But of course that isn't an option with Google App Engine so you're pretty much stuck having a translation layer between client-side objects and your JPA entities.
JPA entities are quite rigid so they're not really appropriate for sending back and forth between the client anyway. Typically you want little bits from several entities when doing this (thus ending up with some sort of presentation-layer value object). That is your path forward.
Try this. It is a module for serializing GAE core types and send them to the GWT client.
You can consider using JSON. GWT has necessary API to parse & generate JSON string in the client side. You get a lot of JSON API for server side. I tried with google-gson, which is fine. It converts your JSON string to POJO model and viceversa. Hope this helps you providing a decent solution for your requirement
Currently, I use the DTO (DataTransferObject) pattern. Not necessarily as clean and plenty more boilerplate but GAE still requires a fair amount of boilerplate at current. ;)
I have a Domain Object mapped (usually) one-to-one with a DTO. When a client needs Domain info, a DAO(DataAccessObject) coughs up a DTO representation of the Domain object and sends that across the wire. When a DTO comes back, I hand the DAO the DTO which then updates all the appropriate Domain Objects.
Not as clean as being able to pass Domain Objects directly across the wire obviously but the limitations of GAE's JDO implementation and GWT's Serialization process means this is the cleanest way for me to handle this currently.
I believe Google's official answer for this is GWT 2.1 RequestFactory.
Given that you are using GWT and GAE, I'd suggest you stick to the official Google framework... I have a similar GWT / GAE based app and that's what I am doing.
By the way, setting up RequestFactory is a bit of pain in the ass. The current Eclipse plug-in doesn't include all the jars but I was able to find the help I needed, in Stackoverflow
I've been using Objectify as well, and I really like it. You still have to do some dancing around with pre/postLoad methods to translate e.g. Text to String and back.
since GWT ultimately compiles to JavaScript, for detached persistence it would need one of a few services available. the best known are HTML5 stores and Gears (both use SQLite!). of course, neither is widely deployed, so you'd have to convince your users to either use a modern browser or install a little-known plugin. be sure to degrade to a usable subset if the user doesn't comply
What about directly using Datastore API to load/store POJO domain objects?
It should be comparable to DTO approach, meaning e.g. that you have to manually handle all fields (if you don't use tricks like reflection-based automation) while it should give you more flexibility and full access to all Datastore features.