App Engine - Import data - google-app-engine

I'm unsure of a good way to import data from an old SQL-based application into App Engine (Bigtable). I'm confused, though I'm sure I'm missing something simple.
The data is not just a simple spreadsheet. It consists of customers, appointments, and a few other things, all tied together by keys, which adds to the complexity.
I realize there is a bulk uploader, but that seems aimed at someone with administrative access, and I was hoping for a solution that would work for an ordinary user.
Uploading a file and processing it would work in principle, but there is a 30-second limit on requests, and adding a few thousand records would likely exceed it. Maybe I could use the task queue? I think that may allow processes that take more than 30 seconds, but then I think I'd have issues synchronizing with the development server?
It's not that I don't know how to do this at all; it's that I have no clue which way will involve the least headache.

From what I understand (and I am a beginner as well), App Engine uses 'denormalized' data. This means there are really no such things as 'joins'. There are some ways to connect entities of different kinds (property settings, I believe), but I can't say for certain how they work, since I haven't tried them.
I believe your only option is to write scripts and rules that convert your SQL data to a denormalized form and then store that in App Engine. If you need two-way sync, this could get messy real quick!
See this article:
http://blog.notdot.net/2010/10/Modeling-relationships-in-App-Engine
or maybe this post
https://dba.stackexchange.com/questions/52/in-google-app-engine-what-is-the-most-effective-many-to-many-join-model
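As a rough illustration of the conversion step, here is a minimal sketch that joins appointments to their customers and emits flat records in small batches, so that each upload request or task stays short. It uses an in-memory SQLite database as a stand-in for the old SQL application; the table names, column names, and the helpers denormalize_appointments and batches are all assumptions made up for the example, and the actual datastore write is left out.

```python
import sqlite3


def denormalize_appointments(conn):
    """Join each appointment to its customer and return flat dicts."""
    conn.row_factory = sqlite3.Row
    rows = conn.execute(
        "SELECT a.id AS appointment_id, a.starts_at, "
        "c.id AS customer_id, c.name AS customer_name "
        "FROM appointments a JOIN customers c ON c.id = a.customer_id "
        "ORDER BY a.id"
    )
    return [dict(row) for row in rows]


def batches(items, size):
    """Yield successive slices so each upload request stays small."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Stand-in for the old SQL database (hypothetical schema and data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE appointments (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        starts_at TEXT
    );
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO appointments VALUES
        (10, 1, '2011-03-01T09:00'),
        (11, 1, '2011-03-08T09:00'),
        (12, 2, '2011-03-02T14:00');
""")

records = denormalize_appointments(conn)
chunks = list(batches(records, 2))  # each chunk would become one upload task
```

Each record carries the customer's name alongside the appointment, so reading an appointment later needs no join. Keeping customer_id in the record is one way to preserve the original key relationship.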

Related

Ideal database for a minimalist blog engine

So I'm designing this blog engine and I'm trying to keep just my blog data, without considering comments, a membership system, or any other type of multi-user data.
The blog itself revolves around 2 types of data. The first is the actual blog post entry, which consists of a title, post body, and metadata (mostly dates and statistics), so it's really simple and can be represented by a simple JSON object. The second is the blog admin configuration and personal information. The comment system and other features will be implemented using Disqus.
My main concern here is the ability of such an engine to scale with spiked visits (I know you might argue this, but let's take it for granted). Since I started this project I've been moving well with the rest of my stack, except the data layer. I've been having a dilemma choosing the database. I considered MongoDB, but some reviews and benchmarking articles suggested slow reads after collections reach a certain size. Next I looked at Redis and its persistence features, RDB and AOF. While Redis is good at both fast reading and writing, I'm afraid of using it because I'm not familiar with it. And this whole search keeps going on to things like "PostgreSQL 9.4 is now faster than MongoDB for storing JSON documents" etc.
So is there any way I can settle this issue for good, considering that I only need to represent my data in a key/value structure, and only require fast reads (not writes) and fault tolerance?
Thank you
If I were you I would start small and not try to optimize for big data just yet. A lot of blogs you read about the downsides of a NoSQL solution are around large data sets - or people that are trying to do relational things with a database designed for de-normalized data.
My list of databases to consider:
Mongo. It has huge community support and based on recent funding - it's going to be around for a while. It runs very well on a single instance and a basic replica set. It's easy to set up and free, so it's worth spending a day or two running your own tests to settle the issue once and for all. Don't trust a blog.
Couchbase. Supports key/value storage and also has persistence to disk. http://www.couchbase.com/couchbase-server/features Also has had some recent funding so hopefully that means stability. =)
CouchDB/PouchDB. You can use PouchDB purely on the client side and it can connect to a server side CouchDB. CouchDB might not have the same momentum as Mongo or Couchbase, but it's an actively supported product and does key/value with persistence to disk.
Riak. http://basho.com/riak/. Another NoSQL that scales and is a key/value store.
You can install and run a proof-of-concept on all of the above products in a few hours. I would recommend this for the following reasons:
A given database might scale and hit your points, but be unpleasant to use. Consider picking a database that feels fun! Sort of akin to picking Ruby/Python over Java because the syntax is nicer.
Your use case and domain will be fairly unique. Worth testing various products to see what fits best.
Each database has quirks and you won't find those until you actually try one. One might have quirks that are passable, one will have quirks that are a show stopper.
The benefit of trying all of them is that they all support schemaless data, so if you write JSON, you can use all of them! No need to create objects in your code for each database.
If you abstract the database correctly in code, swapping out data stores won't be that painful. In other words, your code will be happier if you make it easy to swap out data stores.
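To make the last point concrete, here is a minimal sketch of hiding the data store behind a tiny get/put interface. The two backends (an in-memory dict and a directory of JSON files) are stand-ins chosen so the example runs without any database installed; the class and function names are assumptions for illustration. A MongoDB, Couchbase, CouchDB, or Riak backend would expose the same two methods.

```python
import json
import os
import tempfile


class MemoryStore:
    """Key/value backend that keeps posts in a dict (handy for tests)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def put(self, key, value):
        self._data[key] = value


class JsonFileStore:
    """Key/value backend that keeps one JSON file per key."""

    def __init__(self, directory):
        self._dir = directory

    def _path(self, key):
        return os.path.join(self._dir, key + ".json")

    def get(self, key):
        try:
            with open(self._path(key), encoding="utf-8") as fh:
                return json.load(fh)
        except FileNotFoundError:
            return None

    def put(self, key, value):
        with open(self._path(key), "w", encoding="utf-8") as fh:
            json.dump(value, fh)


def publish(store, slug, title, body):
    """Blog code talks only to the get/put interface, never to a driver."""
    store.put(slug, {"title": title, "body": body})


def render_title(store, slug):
    post = store.get(slug)
    return post["title"] if post else None


# The same blog code runs unchanged against either backend.
results = []
for store in (MemoryStore(), JsonFileStore(tempfile.mkdtemp())):
    publish(store, "hello-world", "Hello", "First post")
    results.append(render_title(store, "hello-world"))
```

Because publish and render_title never import a database driver, trying another product means writing one more small class rather than touching the blog logic.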
This is only an option for really simple CMSes, but it sounds like that's what you're building.
If your blog is super-simple as you describe and your main concern is very high traffic then the best option might be to avoid a database entirely and have your CMS generate static files instead. By doing this, you eliminate all your database concerns completely.
It's not the best option if you're doing anything dynamic or complex, but in this small use case it might fit the bill.
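A static generator for this kind of blog can be very small. The sketch below assumes each post is a dict with slug, title, and body keys (a made-up shape, not something from the question) and writes one HTML file per post; the page template is deliberately bare.

```python
import html
import tempfile
from pathlib import Path

# Bare-bones page template; a real site would use a proper template engine.
PAGE = (
    "<!doctype html>\n"
    "<title>{title}</title>\n"
    "<h1>{title}</h1>\n"
    "<article>{body}</article>\n"
)


def build_site(posts, out_dir):
    """Write one HTML file per post and return the paths written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for post in posts:
        page = PAGE.format(
            title=html.escape(post["title"]),
            body=html.escape(post["body"]),
        )
        target = out / (post["slug"] + ".html")
        target.write_text(page, encoding="utf-8")
        written.append(target)
    return written


posts = [{"slug": "hello-world", "title": "Hello & welcome", "body": "First post."}]
pages = build_site(posts, tempfile.mkdtemp())
```

The output directory can then be served by any web server or CDN, so a traffic spike never reaches a database.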

My App Engine app is showing way more write ops than expected. How can I diagnose what's going on?

95% of my app costs are related to write operations. In the last few days I've paid $150. And I wouldn't consider the amount of data stored to be that huge.
First I suspected that there may be a lot of write operations because of exploding indexes, but I read that this situation usually happens when you have two list properties in the same model. But during the model design phase those were limited to 1 per model max.
I also did try to go through my models and pass indexed=False to all properties that I will not need to order or filter by.
One other thing that I would need to disclose about my app is that I have bundled write operations in the sense that there are some entities that when they need to be stored, I usually call a proxy function that stores that entity and derivative entities along with it. Not in a transactional way since it's not a big deal if there's a write failure every now and then. But I don't really see a way around that given the logic of how the models are related.
So I'm eager to hear whether someone else has faced this problem and which approach, tools, etc. they used to solve it, or whether there are some general things one can do.
You can try the Appstats tool. It shows statistics for the datastore calls made by each request, among other things.
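For reference, this is roughly how Appstats is switched on in the Python 2.7 runtime, as far as I recall from its documentation: a hook in appengine_config.py wraps the WSGI app with the recording middleware. Treat it as a sketch to check against the current docs rather than a verified setup.

```python
# appengine_config.py (Python 2.7 runtime)


def webapp_add_wsgi_middleware(app):
    """Wrap the WSGI app so Appstats records each RPC, including datastore puts."""
    # Imported lazily: this module is only available inside the App Engine SDK.
    from google.appengine.ext.appstats import recording
    return recording.appstats_wsgi_middleware(app)
```

The console is normally exposed by adding "appstats: on" under "builtins:" in app.yaml, after which the recorded requests should appear under /_ah/stats. Looking at which requests issue the most datastore puts is a reasonable first step toward finding where the write ops come from.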

Environmental database design

I've never designed a database before, but I've had experience programming in a few languages and assembler throughout college, as well as some web design, so I'm able to at least pick up what I need to know if I can be pointed in the right direction. One of the tasks of my job is to sort through some data that we've been collecting in the field using a "sonde", which measures temperature, pH, conductivity, and other parameters. The device sits in a stream 24/7 (except when we take it out and swap it with our other sonde every couple of weeks, so that we can put a newly calibrated one in the stream and retrieve the data from the one that was in the field). It collects data every 15 minutes or so, and has done so since 2007. Currently, all of our data is spread across multiple Excel spreadsheets, and we have additional data from a weather station and another instrument that all gets compiled into quarterly documents. My goal is to design as simple a database as possible with most of the functionality of a database like this: http://hudson.dl.stevens-tech.edu/hrecos/d/index.shtml. Ours would be significantly simpler, as it is not live data (it would instead retrieve data from files that we upload once we've finished formatting and compiling all our data). I would very much like the graphing ability that the above site has, but at a minimum I need to be able to select a time range, select as many variables as I want within that range, and then download a spreadsheet with the resulting data (or at least a CSV file).
I realize this is a tough task, and as I have not designed a database before, I suspect it is very much an uphill task. However if I would be able to learn the things necessary to do this, and make it web-accessible, that would be a huge accomplishment and very much impress my boss. Any advice or tips to go off in the right direction would be very much appreciated.
Thanks for your help!
There are actually 2 parts to the solution you're looking for:
The database, which will store your data in a single organized place, and
The application, which is the interface used by people to interact with the database.
Basically, a database by itself is just a container. You need some kind of application which accepts criteria from a user, pulls the data meeting those criteria from the database, and displays it to the user in a meaningful fashion - in this case, a graph or a spreadsheet.
Normally for web-based apps the database and application are two separate components. However, for a small app with a fairly small number of users, and especially for someone just starting out, you may want to consider an all-in-one solution like InfoDome, sort of like MS Access for the web.
Either way, you're still going to need to learn about database design. There are many good tutorials out there; just do some searching. DatabaseAnswers.org has been useful for me. They have a set of tutorials as well as a large collection of sample database schemas.
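To show how the two parts fit together, here is a minimal sketch using SQLite: one table of timestamped readings (the database) and one function that takes a time range plus a list of variables and returns CSV text (the application). The table layout, column names, and sample values are assumptions invented for the example.

```python
import csv
import io
import sqlite3

# Columns a user may request; checked before building the query.
ALLOWED = ("temperature_c", "ph", "conductivity")


def export_csv(conn, start, end, variables):
    """Return CSV text for the chosen variables with start <= time < end."""
    cols = [v for v in variables if v in ALLOWED]
    if not cols:
        raise ValueError("no known variables requested")
    query = (
        "SELECT measured_at, {} FROM readings "
        "WHERE measured_at >= ? AND measured_at < ? "
        "ORDER BY measured_at"
    ).format(", ".join(cols))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["measured_at"] + cols)
    writer.writerows(conn.execute(query, (start, end)))
    return out.getvalue()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings ("
    "measured_at TEXT PRIMARY KEY, "  # ISO 8601 text sorts chronologically
    "temperature_c REAL, ph REAL, conductivity REAL)"
)
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?, ?)",
    [
        ("2007-06-01T00:00", 14.2, 7.1, 310.0),
        ("2007-06-01T00:15", 14.1, 7.2, 312.0),
        ("2007-06-02T00:00", 13.8, 7.0, 305.0),
    ],
)

report = export_csv(conn, "2007-06-01T00:00", "2007-06-02T00:00",
                    ["temperature_c", "ph"])
```

A web front end would call something like export_csv with the values from a form and send the result back as a file download; graphing could be layered on the same query later.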

Should we start with multiple small-grained databases for an app that may scale massively

We're developing a new eCommerce website and are using NHibernate for the first time. At present we are splitting our data into multiple SQL Server databases, divided per area of functionality. So we have one for UserInfo, one for Orders, one for ProductCatalogue and so on...
Our justification for this decision is twofold really:
the website has the potential to be HUGE (it is a new website for one of the largest online brands in the UK) and we feel that by partitioning our data along functional lines we will be able to move the databases onto their own servers which would give us an easy scaling route should we need it;
my team has always worked this way - partly as a consequence of following the MS Commerce Server pattern from previous projects.
However, reading up on this decision on the internet, we find that the normal response to this sort of model is extremely scathing. "Creating more work for the devs now in order to create more work for the devs later" is one sample comment from Stack Overflow!
In addition, NHibernate is much easier to use with only one database (just one SessionFactory needed). And knowing that Stack Overflow ran off just one box for a long time makes me think that maybe we should not try to be so clever.
So, my question is, "are we correct in thinking that using fine-grained databases might increase our ability to scale or should we sacrifice this for easier development"?
Why don't you just design your database properly and put the files on appropriate disk? Use a cluster if necessary. Creating multiple databases is not an inherently scaling solution. Also - cross database referential integrity? Good luck.
What's your definition of "HUGE"? SQL Server can handle massive databases, but one thing I've learnt is that people often have no idea what constitutes a lot of data.
I've never worked on a project like this. I'm used to databases with several hundred tables, which has never been a problem.
Therefore I can't say if your idea is a good idea, I never tried it. The "my team has always worked this way"-argument is a major driver for many decisions, and I can't even say that it is always wrong.
With NHibernate you organize your data in classes. They can be in different namespaces and assemblies. You usually don't work much with the database directly, you don't need this kind of structure there.
About the scalability argument: I'm not sure if it is really scaling well when you need to access several databases every time. I mean: you always need users and orders and probably more. Then you need to get all this data from several databases.
Agree fully with starskythehutch - keep your related tables together in the same DB. BUT, you may want to consider having separate databases for things that are not related or non-critical to your main product; but that are a part of the app.
For example, if you decide to log every visit/hit to the site in a DB, you should probably keep that in a separate DB.
Reasons to consider this:
1. Huge number of transactions, say hundreds of thousands per second. Keeping non-critical, unrelated data in a separate DB ensures that transaction log contention caused by it is avoided.
2. Restore, DBCC CHECKDB, and backup times. If you put unrelated, non-critical data in your main DB, you increase its size, which slows these operations. Keeping that data in a separate DB improves their performance.

CMS Database design - Master database or Multi-Db per site

I am in process of designing my CMS that I am about to create. I was thinking about the database and how I want to go by approaching it.
Do you think it's best to create 1 master database for all my clients' websites, or should I have 1 database per site?
What are the benefits and drawbacks of each approach? I am always thinking about the future, so I was thinking about adding memcache or APC caching to the project as an option for my clients.
I'm just trying to learn the best practices and what other developers' approaches would be.
I've run both. My business chooses to separate client-specific data into separate tables so that if one happens to go corrupt, not all are taken down. In an ideal world this might never happen, but Murphy's law... It also seems very easy to find things with them separated. You will know with 100% certainty that one client's content will never show up on another's page.
If you do go down that route, be prepared to create scripts that build and configure databases for you. There's nothing fun about building a great system and having demand for it, only to spend your time manually setting up DBs and installs all day long. Also, setting DB names is one additional step that's not part of using a single DB table; it's a headache that will repeat itself over and over again.
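A provisioning script of the kind described can start out very small. The sketch below creates one SQLite file per client from a single shared schema; SQLite, the cms_<client>.db naming scheme, and the pages table are all assumptions picked so the example runs anywhere, not a recommendation for production.

```python
import os
import sqlite3
import tempfile

# One schema definition shared by every client database (hypothetical table).
SCHEMA = """
CREATE TABLE IF NOT EXISTS pages (
    id INTEGER PRIMARY KEY,
    slug TEXT UNIQUE NOT NULL,
    body TEXT NOT NULL
);
"""


def provision_client(base_dir, client_name):
    """Create (or update) the database for one client and return its path."""
    if not client_name.isalnum():
        raise ValueError("client name must be alphanumeric")
    path = os.path.join(base_dir, "cms_{}.db".format(client_name))
    conn = sqlite3.connect(path)
    try:
        conn.executescript(SCHEMA)
    finally:
        conn.close()
    return path


base = tempfile.mkdtemp()
paths = [provision_client(base, name) for name in ("acme", "globex")]
```

Because the schema lives in one place and uses IF NOT EXISTS, re-running the script over all clients is safe, which is what makes setting up the next client a one-line job.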
Develop the single master DB. It will take a small amount of additional effort and add a little bit more complexity to the database design, but will give you a few nice features. The biggest is being able to share data between sites.
Designing for a master database means that you have the option to combine sites when it makes sense, but also lets you install a master per site. Best of both worlds.
It depends greatly upon the amount of customization each client will require. If you foresee clients asking for many one-off features specific to their deployment, separate databases based off of a single core structure might make sense. I would highly recommend trying to make any customizations usable by all clients though, and keep all structure defined in one place/database instead of duplicating it across multiple databases. By using one database, you make updating the structure straightforward and the implementation consistent across all sites so they can all use the same CMS code.
