Does making a database for this use case make sense?

So I run my code on a weekly basis. Let's say it's 100 new tasks. Each task works on new data for that week.
I want a failsafe in case something happens, like my computer randomly shutting off or my internet connection dropping 30/100 tasks in.
So my idea was to have the database load the 100 tasks, along with that week's data, into a table at the beginning as a temporary to-do list, then remove them one by one as I go. If the run fails at task 30 of 100, then next week I'll still have the other 70 on my to-do list plus the new 100.
Does this make sense as a design pattern? The table will essentially be empty 99% of the time. We also already use Postgres, so I was thinking of just using that, which I guess feels even worse since it offers so much and I'm using it for such a simple reason.
Do "state machines" fit anywhere here? Someone suggested them to me, but after Googling the term I don't really see how it would help.

Related

NOSQL denormalization datamodel

Many times I read that data in NoSQL databases is stored denormalized. For instance, consider a chess game record. It may contain not only the IDs of the players that participate in the game, but also each player's first and last name. I suppose this is done because joins are not possible in NoSQL, so if you just duplicate the data you can still retrieve everything you want in one call, without manual application-level processing.
What I don't understand is that when you want to update a chess player's name, you have to write a query that updates both the chess-game records in which that player participates and the player record itself. This seems like a huge performance overhead, as the database has to find all games that player participates in and then update each of those records.
Is it true that data is often stored denormalized like in my example?
You are correct, data is often stored denormalized in NoSQL databases.
The update problem you describe is partly where the term "eventual consistency" comes from.
In your example, when you update the player's name (not a common event, but it can happen), you would issue a background job to update the name across all other records. Yes, while the update is happening you may retrieve an older value, but eventually the data will be consistent. Since we're not writing ATM software here, the performance/consistency trade-off is acceptable.
You can find more info here: http://www.allbuttonspressed.com/blog/django/2010/09/JOINs-via-denormalization-for-NoSQL-coders-Part-2-Materialized-views
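As a concrete illustration of such a background job, assuming a MongoDB-style store and a hypothetical games collection that embeds white_player/black_player sub-documents (the names are made up):

from pymongo import MongoClient

db = MongoClient()["chess"]  # hypothetical database name

def rename_player(player_id, first_name, last_name):
    # Update the canonical player record first.
    db.players.update_one(
        {"_id": player_id},
        {"$set": {"first_name": first_name, "last_name": last_name}},
    )
    # Then fan out to every game that embeds a denormalized copy of the name.
    # Until this loop finishes, reads of those games may return the old name;
    # that is the eventual-consistency trade-off described above.
    for side in ("white_player", "black_player"):
        db.games.update_many(
            {side + ".id": player_id},
            {"$set": {
                side + ".first_name": first_name,
                side + ".last_name": last_name,
            }},
        )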
One way to look at it is that the number of times a user changes his or her name is extremely rare.
But the number of times the board data is read and changed is immense.
So it only makes sense to optimize for the case that happens so much more often than the case that happens only ever so rarely.
Another point to note is that by not keeping the name duplicated under the board data, you actually increase the overhead of the read. Every time you fetch the board data, you'd have to take one more step and fetch all the user data too (even if all you really wanted was the first and last name).
Again, the reason to put the first name and last name on the board data is probably that on the screen where the board data is shown, you'll often be showing the user's name too.
For these reasons, duplicating data is accepted in NoSQL DBs. (You can do this in SQL DBs too, but mind you, you'll be frowned upon.) Duplication is fairly common in the NoSQL world and is actively promoted.
I have been working with NoSQL (Firestore) for the past 7 years on two fairly big projects where I was able to write the code from scratch (both around 50k LoC; one has about 15k daily active users). I didn't use denormalization at all. The concept never appealed to me, and document reads are fairly cheap in Firestore.
To come back to your example: loading the other data for the chess game seems far more important than instantly being able to show the name. I would load the name based on the user ID in the background and put a simple client-side memoize/cache around it to prevent fetching the same user document over and over.
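The memoize-around-a-lookup idea, shown with the Python Firestore client for illustration (the original is a web client; the users collection and field names are assumptions):

from functools import lru_cache
from google.cloud import firestore

db = firestore.Client()

@lru_cache(maxsize=1024)
def player_name(uid):
    """Fetch a player's name by user id once, then serve repeats from the cache."""
    snap = db.collection("users").document(uid).get()
    data = snap.to_dict() or {}
    return data.get("firstName", ""), data.get("lastName", "")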
What I did use quite a bit to solve performance issues is generate derived data. I would set a listener on a database document "onWrite" and then store some computed data in another derived document. These documents would automatically update when the source changes, so it doesn't complicate things really. In the case of a chess game, a distilled document could be the leaderboard that is constantly shown to all users of the app.
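A sketch of such a derived document, with hypothetical games and derived/leaderboard names; in practice this would run inside the onWrite listener mentioned above:

from google.cloud import firestore

db = firestore.Client()

def rebuild_leaderboard():
    """Recompute the derived leaderboard document from the source games."""
    top = (
        db.collection("games")
        .order_by("score", direction=firestore.Query.DESCENDING)
        .limit(10)
        .stream()
    )
    db.collection("derived").document("leaderboard").set(
        {"entries": [doc.to_dict() for doc in top]}
    )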
Another optimization I had to do was to distill a long list of titles + metadata for recently opened "projects". Firestore's web client doesn't give you the ability to select individual fields from a document in a query; it only fetches full documents, and that was too much data for the list, so we solved it with an API endpoint that returns just the distilled data.
I'm not saying you should follow my advice, but we seem to be doing well in terms of code complexity and database costs. So when I read that NoSQL requires data denormalization I become skeptical :)
That's my 2 cents.

Which database for my specific use case

My head is exploding from reading about databases. I understand that which one you pick depends on the specific use case.
So here is mine:
I have a webapp. A game.
It's level based, you can only go forward not back. But you can continue off of each level played. E.g. You finish Level2 and then play Level3. Then you start Level3 again and save it as Level3b. You can now continue off of Level3 and Level3b.
Only ONE level can be played at any time.
Three data arrays are stored on the server: 'progress', 'choices' and 'vars'
They are modified while you play the level and then put in cold storage for when you might want to start off of them.
The current MySQL setup is this:
A table 'saves' holds the metadata for each savegame, importantly the saveID and the userID it belongs to.
Each of the data arrays has a corresponding table.
If the player makes a choice, the insert looks like this:
INSERT INTO choices (saveid, choice) VALUES (:saveid, :choice)
Thus the array can be reconstructed by doing a
SELECT * FROM choices WHERE saveid=:saveid
When the level is finished, the data arrays are put in cold storage by serializing them and storing them in the 'saves' table, which has 3 columns dedicated to this.
Their values are cleared from the three other tables.
If the player starts Level4 off of Level3b, the serialized arrays are fetched from the 'saves' table, unserialized and put back in their respective tables, albeit with the new saveID of Level4.
I hope this is somewhat understandable.
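For illustration, the cold-storage step might be sketched like this in Python with PyMySQL (the real app is PHP; the saves column names and the JSON encoding are assumptions standing in for the serialized arrays):

import json
import pymysql

conn = pymysql.connect(host="localhost", user="game", password="secret",
                       database="game")  # placeholder credentials

ARRAY_TABLES = ("progress", "choices", "vars")  # fixed whitelist of table names

def archive_save(save_id):
    """Move a finished level's arrays from the live tables into the saves row."""
    with conn.cursor() as cur:
        arrays = {}
        for table in ARRAY_TABLES:
            cur.execute("SELECT * FROM {} WHERE saveid = %s".format(table),
                        (save_id,))
            arrays[table] = cur.fetchall()
        cur.execute(
            "UPDATE saves SET progress_data = %s, choices_data = %s, vars_data = %s"
            " WHERE saveid = %s",
            (json.dumps(arrays["progress"]), json.dumps(arrays["choices"]),
             json.dumps(arrays["vars"]), save_id),
        )
        for table in ARRAY_TABLES:
            cur.execute("DELETE FROM {} WHERE saveid = %s".format(table),
                        (save_id,))
    conn.commit()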
I reckon that:
There will be many more writes than reads
I don't need consistency, if I understand that correctly, since players can only ever manipulate their own data
I don't think I'll be doing (m)any JOINs, since each table needs to be read individually to populate its respective data array
So I don't think I'll be needing much in the way of a relational DB
It should be really light load for the DB most of the way, since the inserts are small
Data storage must be reliable! I don't think players would stick with us if we started losing their savegames regularly. Though I think Redis's flush to disk every second would suffice, since we're not dealing with mission-critical stuff here. If the game forgets the last action or two of the player, that's not bad; just don't forget a whole savegame.
Can you advise me on a DB for my use case?
I've started on MySQL; now I've read about CouchDB, MongoDB, Riak and Cassandra. I think Redis is out of the picture, since it seems to degrade badly once the dataset outgrows your RAM. But I'm open to everything.
I'm also open to people saying: stick with MySQL, or go to PostgreSQL.
And I will also accept criticism of the way I've set up the storage. If you say: choose Cassandra and store it like this, I will listen.
This is a sanity check, since now is the last time I'll be able to change the DB before the game goes live, and the last thing I want is to have to swap out the DB in 3 months because it scaled badly.
Oh yeah, the app is written in JavaScript; communication with the server goes through PHP.
I don't think you need to worry too much about the database, unless you are SURE you are going to have a massive user base from day one (web apps generally don't get famous overnight).
You'd be far better off continuing with what you know (MySQL), but keep all database commands in a separate wrapper class (which you should be doing anyway).
If you do this, converting to another database later is not that hard, as long as you use standard SQL and don't do anything specific to that database.
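The wrapper idea, sketched in Python for illustration (the app here talks to the server through PHP, so the real wrapper would be a PHP class around PDO, but the shape is the same):

class SaveStore(object):
    """All savegame database access goes through this one class, so swapping
    MySQL for something else later only touches this file."""

    def __init__(self, conn):
        self.conn = conn  # a pymysql-style connection

    def add_choice(self, save_id, choice):
        with self.conn.cursor() as cur:
            cur.execute("INSERT INTO choices (saveid, choice) VALUES (%s, %s)",
                        (save_id, choice))
        self.conn.commit()

    def load_choices(self, save_id):
        with self.conn.cursor() as cur:
            cur.execute("SELECT choice FROM choices WHERE saveid = %s", (save_id,))
            return [row[0] for row in cur.fetchall()]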

Most efficient way to get, modify and put a batch of entities with ndb

In my app I have a few batch operations I perform.
Unfortunately this sometimes takes forever to update 400-500 entities.
What I have is all the entity keys; I need to get them, update a property and save them to the datastore, and saving them can take up to 40-50 seconds, which is not what I'm looking for.
I'll simplify my model to explain what I do (which is pretty simple anyway):
from google.appengine.ext import ndb

class Entity(ndb.Model):
    title = ndb.StringProperty()

# roughly 400-500 keys gathered elsewhere
keys = [key1, key2, key3, key4, ..., key500]

# fetch the whole batch, change one property, write everything back
entities = ndb.get_multi(keys)
for e in entities:
    e.title = 'the new title'
ndb.put_multi(entities)
Getting and modifying does not take too long. I tried get_async, getting inside a tasklet, and whatever else is possible, which only changes whether the get or the for loop takes longer.
But what really bothers me is that the put takes up to 50 seconds...
What is the most efficient way to do this operation (or operations) in a decent amount of time? Of course I know that it depends on many factors, like the complexity of the entity, but the time it takes to put is really over the acceptable limit to me.
I already tried async operations and tasklets...
I wonder if doing smaller batches of e.g. 50 or 100 entities will be faster. If you make that into a tasklet, you can try running those tasklets concurrently.
I also recommend looking at this with Appstats to see if that shows something surprising.
Finally, assuming this uses the HRD, you may find that there is a limit on the number of entity groups per batch. This limit defaults very low. Try raising it.
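A sketch of the smaller-batches idea using the async variants (the batch size is a guess to tune):

from google.appengine.ext import ndb

def put_in_batches(entities, batch_size=100):
    """Issue several smaller async writes instead of one 500-entity put."""
    futures = []
    for i in range(0, len(entities), batch_size):
        futures.extend(ndb.put_multi_async(entities[i:i + batch_size]))
    ndb.Future.wait_all(futures)  # block until every batch has been written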
Sounds like what MapReduce was designed for. You can do this fast, by simultaneously getting and modifying all the entities at the same time, scaled across multiple server instances. Your cost goes up by using more instances though.
I'm going to assume that you have the entity design that you want (i.e. I'm not going to ask you what you're trying to do and how maybe you should have one big entity instead of a bunch of small ones that you have to update all the time). Because that wouldn't be very nice. ( =
What if you used the Task Queue? You could create multiple tasks, and each task could take as URL params the keys it is responsible for updating plus the property and value that should be set. That way the work is broken up into manageable chunks, and the user's request can return immediately while the work happens in the background. Would that work?
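One way to hand those chunks to the Task Queue is the deferred library; a sketch, with the chunk size and update function as illustrative choices:

from google.appengine.ext import deferred, ndb

def update_titles(keys, new_title):
    """Runs on the task queue: load a chunk, set the property, write it back."""
    entities = ndb.get_multi(keys)
    for e in entities:
        e.title = new_title
    ndb.put_multi(entities)

def enqueue_updates(keys, new_title='the new title', chunk=100):
    """Called from the request handler; returns immediately after enqueueing."""
    for i in range(0, len(keys), chunk):
        deferred.defer(update_titles, keys[i:i + chunk], new_title)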

GUID vs. auto-increment (comfort-wise)

A while ago, my sysadmin mistakenly restored my database to a much earlier point.
It took us 3 hours to notice, and during this time 80 new rows (auto-increment, with foreign-key dependencies) were created.
So at that point we had 80 different customers with the same IDs in two tables that needed to be merged.
I don't remember exactly how we resolved it, but it took a long time.
Now I am designing a new database, and my first thought is to use a GUID index, even though this use case is rare.
My question: how do you get along with such a long string as your ID?
I mean, when two programmers are talking about a customer, it is possible to say:
"Hey, we have a problem with client 874454."
But how do you keep it that simple with a GUID? This is really a problem that can cause trouble and miscommunication.
Thanks
GUIDs can create more problems than they solve if you are not using replication. First, you need to make sure they aren't the clustered index (which is the default for the PK, in SQL Server at least), because you can really slow down insert performance. Second, they are longer than ints and thus not only take up more space but also make joins slower. Every join in every query.
You are going to create a bigger problem trying to solve a rare occurrence. Instead, think of ways to set things up so that you don't take hours to recover from a mistake.
You could create an auditing solution. That way you can easily recover from all sorts of missteps, and you can write the code to do the recovering in advance. Then it is relatively easy to fix things when they go wrong. Frankly, I would never allow a database that contains company-critical data to be set up without some form of auditing. It's just too dangerous not to.
Or you could even have a script ready to go that moves records to a temporary place and then reinserts them with a new identity (and updates the identities on the child records to match). You did this once; the DBA should have created such a script (and put it in source control) so it is available the next time you need to do a similar fix. If your DBA is so incompetent that he doesn't create and save these sorts of scripts, then get rid of him and hire someone who knows what he is doing.
Just show a prefix in most views. That's what DVCSs do, since most of them identify most objects by a hex-encoded hash.
(OTOH, I know it's fashionable in many circles to use UUIDs for primary keys, but it would take a lot more than a few scary stories to convince me.)
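The prefix trick is cheap to do; a quick illustration:

import uuid

customer_id = uuid.uuid4()
print(customer_id)           # full value stored in the database
print(str(customer_id)[:8])  # short prefix shown in the UI and in conversation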

Find redundant pages, files, stored procedures in legacy applications

I have several horrors of old ASP web applications. Does anyone have any easy ways to find which scripts, pages, and stored procedures are no longer needed? (Besides the stuff in "old___code", "delete_this", etc. ;-)
Chances are, if the stored proc won't run, it isn't being used, because nobody ever bothered to update it when something else changed. Table columns that are null for every single record are probably not being used either.
If you have your procs and database objects in source control (and if you don't, why don't you?), you might be able to reach back and find what other code they were moved to production with, which should give you a clue as to what might call them. You will also be able to see who touched them last, and that person might know if they are still needed.
I generally approach this by first listing all the procs (you can get this from the system tables) and then marking off the ones I know are being used. Profiler can help you here, as you can see which ones are commonly being called. (But don't assume that because Profiler didn't show a proc it isn't being used; that just gives you a list of the ones to research.) This makes the list of procs that need to be researched much smaller. Depending on your naming convention, it might be relatively easy to see what part of the code should use them. When researching, don't forget that procs are called in places other than the application, so you will need to check through jobs, DTS or SSIS packages, SSRS reports, other applications, triggers, etc. to be sure something is not being used.
Once you have identified a list of ones you don't think you need, share it with the rest of the development staff and ask if anyone knows whether a proc is needed. You'll probably get a couple more taken off the list this way that are used for something specialized. Then, when you have the list, change the names to some convention that lets you identify them as candidates for deletion. At the same time set a deletion date (how far out that date is depends on how often something might be called; if it is called something like AnnualXYZReport, make that date a year out). If no one complains by the deletion date, delete the proc (of course, if it is in source control you can always get it back even then).
Once you have gone through the hell of identifying the bad ones, it is time to realize you need to train people that part of the development process is identifying procs that are no longer being used and getting rid of them as part of any change to a section of code. Depending on code reuse, this may mean searching the code base to see if some other part of it uses the proc, and then doing the same thing discussed above: let everyone know it will be deleted on a given date, change the name so that any code referencing it will break, and then on that date get rid of it. Or maybe you can have a metadata table where you record candidates for deletion as soon as you know you have stopped using something, and send a report around once a month or so to check whether anyone else needs them.
I can't think of any easy way to do this, it's just a matter of identifying what might not be used and slogging through.
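To get a first-pass list of candidates from the system tables, something like this works against SQL Server (a sketch using pyodbc; note that sys.dm_exec_procedure_stats only covers procedures whose plans are still cached since the last restart, so like Profiler it proves use, not disuse):

import pyodbc

# Hypothetical connection string; adjust driver/server/database for your setup.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;"
    "DATABASE=LegacyApp;Trusted_Connection=yes"
)

sql = """
SELECT p.name, s.execution_count, s.last_execution_time
FROM sys.procedures AS p
LEFT JOIN sys.dm_exec_procedure_stats AS s
    ON s.object_id = p.object_id AND s.database_id = DB_ID()
ORDER BY s.execution_count, p.name;
"""

# Procs with no cached plan sort first; they are the ones to research.
for name, count, last_run in conn.cursor().execute(sql):
    print(name, count or 0, last_run or "no cached plan")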
For SQL Server only, 3 options that I can think of:
modify the stored procs to log usage
check if code has no permissions set
run profiler
And of course, remove access or delete it and see who calls...
