What's the best way to structure a Firestore database?

For scalability, minimal reads & writes, and avoiding usage limits. This is my approach for a blog-like project; any suggestions for improvements?

It's good to hear that the answer to:
What is the correct way to structure this kind of data in Firestore?
answers your question. But regarding:
Do I get fewer reads & writes if I structure my database with more sub-collections/documents, like the image above?
No, it doesn't really matter whether you query a collection or a sub-collection; you'll always pay a number of reads equal to the number of documents that are returned.
Also, which of the two options can reach usage limits faster?
The structure doesn't matter. What really matters is the number of requests you perform.
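To make that concrete, here is a minimal sketch using the google-cloud-firestore Python client; the collection names are made up for illustration. Whether you stream a top-level collection or a sub-collection, you are billed one document read per document returned.

from google.cloud import firestore

db = firestore.Client()

# Top-level collection: one billed read per returned document.
posts = list(db.collection("posts").stream())

# Sub-collection under a user document: still one billed read per
# returned document -- the nesting itself does not change the cost.
user_posts = list(
    db.collection("users").document("some-user-id").collection("posts").stream()
)

print(len(posts), len(user_posts))  # reads billed == documents returned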

Related

What is the faster way to count relations?

I'm using Waterline, which is an amazing ORM for Node.js. I think there are two ways to count relations (associations).
The first way is to update a counter whenever a relation record is added or removed, e.g. when a comment is appended to a post, the post's comment-count field is incremented.
The second way is to use a 'count' query and count the relations whenever I need to.
My worry is that the second way is easier but seems slower than the first; it can generate too many requests. But the first way needs more dirty code.
I really don't know which is the best way to count relations.
The answers to this question have to be a little opinionated, but I will give you my point of view.
I would go with the "count query" solution because it is the most reliable way to get this information. As you said, the other solution needs more dirty code and is more easily bugged. I always try to have a single way to retrieve a piece of information.
If the query is too slow and/or too frequent and slows down your application, then you should consider caching the result. Depending on the infrastructure you are using, you could cache the result of the query in a variable or in a fast cache backend like memcached or Redis. You will have to invalidate the cache when needed, and it is up to you to decide the lifetime of the cache. You should define a global cache strategy for your application so you can reuse it for other parts of your application.
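As a rough illustration of that caching idea, here is a minimal Python sketch using Redis; the key scheme, TTL, and the count_comments query are assumptions, not part of the original question.

import redis

r = redis.Redis()
CACHE_TTL = 300  # seconds; pick a lifetime that fits your global cache strategy


def count_comments(post_id):
    # Hypothetical 'count' query against your database.
    raise NotImplementedError


def cached_comment_count(post_id):
    key = f"post:{post_id}:comment_count"
    cached = r.get(key)
    if cached is not None:
        return int(cached)
    count = count_comments(post_id)   # the potentially slow/frequent query
    r.setex(key, CACHE_TTL, count)    # cache the result for a while
    return count


def invalidate_comment_count(post_id):
    # Call this whenever a comment is added or removed.
    r.delete(f"post:{post_id}:comment_count")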

Datastore for large astrophysics simulation data

I'm a grad student in astrophysics. I run big simulations using codes mostly developed by others over a decade or so. For examples of these codes, you can check out gadget http://www.mpa-garching.mpg.de/gadget/ and enzo http://code.google.com/p/enzo/. Those are definitely the two most mature codes (they use different methods).
The outputs from these simulations are huge. Depending on your code, your data is a bit different, but it's always big data. You usually take billions of particles and cells to do anything realistic. The biggest runs are terabytes per snapshot and hundreds of snapshots per simulation.
Currently, it seems that the best way to read and write this kind of data is to use HDF5 http://www.hdfgroup.org/HDF5/, which is basically an organized way of using binary files. It's a huge improvement over unformatted binary files with a custom header block (those still give me nightmares), but I can't help but think there could be a better way to do this.
I imagine the sheer data size is the issue here, but is there some sort of datastore that can handle terabytes of binary data efficiently, or are binary files the only way at this point?
If it helps, we typically store data columnwise. That is, you have a block of all particle IDs, a block of all particle positions, a block of all particle velocities, etc. It's not the prettiest, but it is the fastest for doing something like a particle lookup in some volume.
edit: Sorry for being vague about the issues. Steve is right that this might just be an issue of data structure rather than the data storage method. I have to run now, but I will provide more details late tonight or tomorrow.
edit 2: So the more I look into this, the more I realize that this probably isn't a datastore issue anymore. The main issue with unformatted binary was all the headaches reading the data correctly (getting the block sizes and order right and being sure about it). HDF5 pretty much fixed that and there isn't going to be a faster option until the file system limitations are improved (thanks Matt Turk).
The new issues probably come down to data structure. HDF5 is as performant as we can get, even if it is not the nicest interface to query against. Being used to databases, I thought it would be really interesting/powerful to be able to query something like "give me all particles with velocity over x at any time". You can do something like that now, but you have to work at a lower level. Of course, given how big the data is and depending on what you are doing with it, it might be a good thing to work at a low level for performance's sake.
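For what it's worth, that kind of lower-level query over a columnwise HDF5 snapshot looks roughly like this with h5py; the file and dataset names are assumptions for illustration, not taken from any particular simulation code.

import h5py
import numpy as np

# Columnwise layout: one dataset per particle property.
with h5py.File("snapshot_042.hdf5", "r") as f:
    ids = f["ParticleIDs"][:]   # block of all particle IDs
    vel = f["Velocities"][:]    # (N, 3) block of all particle velocities

speed = np.linalg.norm(vel, axis=1)
fast_ids = ids[speed > 1000.0]  # "give me all particles with velocity over x"
print(fast_ids.size, "particles above the velocity cut")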
MongoDB: http://www.mongodb.org/
Netezza products: http://www.netezza.com/data-warehouse-appliance-products/skimmer.aspx
Hadoop: http://hadoop.apache.org/
Wikipedia's list of distributed file systems: http://en.wikipedia.org/wiki/List_of_file_systems#Distributed_file_systems
EDIT
Rationale for my lack of explanation / etc.:
OP says: "[HDF5]'s a huge improvement over unformatted binary files with a custom header block (still give me nightmares), but I can't help but think there could be a better way to do this."
What does "better" mean? Better structured? He seems to allude to the "unformatted binary files" as being an issue - so maybe that's what he means by better. If so, he'll need something with some structure - hence the first couple suggestions.
OP says: "I imagine the sheer data size is the issue here, but is there some sort of datastore that can handle terabytes of binary data efficiently, or are binary files the only way at this point?"
Yes, there are several. Both structured, and "unstructured" - does he want structure, or is he happy to leave them in some sort of "unformatted binary format"? We still don't know - so I suggest checking out some Distributed File Systems.
OP says: "If it helps, we typically store data columnwise. That is, you have a block of all particle id's, block of all particle positions, block of particle velocites, etc. It's not the prettiest, but it is the fastest for doing something like a particle lookup in some volume."
Again, does the OP want better structure, or doesn't he? It seems like he wants both - better structure AND faster... maybe scaling OUT will give him this. This further reinforces the first few options I listed.
OP says (in comments): "I don't know if we can take the hit on io though."
Are there IO requirements? Cost restrictions? What are they?
We can't get something for nothing here. There is no "silver-bullet" storage solution. All we have to go on here for requirements is "lots of data" and "I don't know if I like the lack of structure, but I'm not willing to increase my IO to accommodate any additional structure"... so I don't know what kind of answer he's expecting. He hasn't listed a single complaint about the current solution he has other than the lack of structure - and he's already said he's not willing to pay any overhead to do anything about that... so.... ?

More efficient to store text as file or in DB?

Imagine you're dealing with many strings of text that are about 10,000 characters long, entered by users. Would it be more efficient to write those automatically out as pages, or to insert them into a table in a database? I hope that question is clear enough...
It depends on what sort of "efficiency" you're aiming for.
Here's what I mean:
will you be changing the content of your text strings?
what sorts of searches will you be doing?
when you extract the text, what do you do with it?
My opinion is that provided you're not going to change the content much, nor perform much analysis, you're better off with the database.
10k isn't particularly large, so either is fine. I would personally use the database, though, as it will let you search easily.
It depends how you're accessing them, but normally using the FS would give better performance, for the obvious reason that the DB is another layer built on top of the FS. Using the FS directly, assuming no extra heavy processing (for example, having hundreds of named files instead of one big bloated file in a special order you need to parse), saves you the DBMS operations.
I'm wondering if SQLite would be the best of both worlds, or at least, the best database for that size of job.
The real answer here is what you're going to do with these strings.
Databases are meant to be able to quickly return specific records. If you're just going to SELECT * FROM Table and then concat it all together, there's no point in using a database.
However, if you have a relation between your data that you want to be able to search, then a database will likely be more efficient.
E.g., do you want to be able to pull up all the text records from a set of users on a set of dates? Find all records from users who match some criteria?
These kinds of queries will likely be more efficient in a database than in a naive file-based implementation, and probably still faster than a decent one, even though the file approach avoids some access layers.
There are a lot of considerations. As others said, either approach would work fine for a small number (thousands) of 10k rows.
But what's the rest of your app do? If it does everything in the database, then I'd be inclined to put this there as well; the opposite is true as well.
And how will you be selecting these? Do you need to do complex text searches? If so, a database might not be the best. Or, would you be adding new attributes, searching on those attributes - or matching them against data in other tables? In this common case a database would be better.
And if your data is really vast (many millions of 10k rows) and your performance requirements aren't terribly high - you may want to compress them and store them in the file system.
Lastly, how important is data quality? Given the features of a good database it's much easier to guarantee good data quality with a database.
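For reference, SQLite (mentioned above) makes the database route very low-ceremony. Here is a minimal sketch; the table and column names are invented for illustration.

import sqlite3

conn = sqlite3.connect("texts.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS entries ("
    " id INTEGER PRIMARY KEY,"
    " user_id INTEGER,"
    " created TEXT,"
    " body TEXT"   # ~10,000-character strings fit comfortably in a TEXT column
    ")"
)

conn.execute(
    "INSERT INTO entries (user_id, created, body) VALUES (?, ?, ?)",
    (42, "2024-01-01", "x" * 10_000),
)
conn.commit()

# The kind of query that is awkward to run against a pile of flat files:
rows = conn.execute(
    "SELECT id, body FROM entries WHERE user_id = ? AND created >= ?",
    (42, "2024-01-01"),
).fetchall()
print(len(rows), "matching entries")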

Ideal string length for DBM database?

When using a DBM database (e.g. Berkeley or GDBM), is it better to store data using fewer long strings or more short strings? I can easily structure my data either way. I'm looking for 'better' in the performance sense, but I'm interested in other implications as well.
Berkeley DB, or any other DBM implementation, will incur overhead for each key/value pair. If you're dealing with millions of k/v pairs the overhead will matter, otherwise it's noise and you should choose what is easiest for you the programmer and let the database deal with the data. Overhead and access time will also depend on access method. Hash tables and B-Trees are totally different algorithmic animals. If your data has any degree of key ordering or access patterns dependent on keys then 99% of the time B-Trees are the way to go.
I think you're asking a great design question, but for anyone to give you a perfect answer we'd all have to know a lot more about the amount of data you're dealing with, access patterns, and many other factors.
If you will be frequently searching or modifying the data, a greater number of short strings will provide better performance.
i.e. You don't want to be searching for a substring of one of those long strings, or modifying some value in the middle of a string frequently.
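A crude way to picture the two layouts with Python's built-in dbm module (the key scheme and field names are made up purely for illustration):

import dbm

# Layout A: more short strings -- one key/value pair per field.
with dbm.open("short_values", "c") as db:
    db["user:1:name"] = "Alice"
    db["user:1:email"] = "alice@example.com"
    # Updating one field touches only one small value.
    db["user:1:email"] = "alice@example.org"

# Layout B: fewer long strings -- the whole record under one key.
with dbm.open("long_values", "c") as db:
    db["user:1"] = "name=Alice;email=alice@example.com;bio=..."
    # Changing the email means reading, re-parsing, and rewriting the whole value.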
I think this question is really hard to answer in a completely generic way. There are so many variables here, that you would really need to test some common scenarios to determine the answer that is best for you.
Some factors to consider:
Will larger strings require substring searches?
What kind of searches will you perform over the data?
In the end, it's generally better to go with the approach that yields the most normalized schema. Optimization can start from there, and depending upon your DB, there are probably better alternatives than restructuring the underlying schema purely for performance.

Store static data in an array, or in a database?

In our web-based projects we always have some static data, which can be stored either in a file as an array or in a database table. So which one should be preferred?
In my opinion, arrays have some advantages:
More flexible (it can be any structure, including ones that express really complex relations)
Better performance (it will be loaded in memory, which will have better read/write performance compared with a database's I/O operations)
But my colleague argued that he preferred DB approach, since it can keep a uniform data persistence interface, and be more flexible.
So which should be preferred? Or how can we choose? Or should we prefer one in some scenarios and the other in others? What are those scenarios?
EDIT:
Let me clarify something. As Benjamin's change to the title reflects, the data we want to store in an array (file) won't change frequently, which means the code won't change the values of the array at runtime. If the data changed very frequently I would use a DB without question. That's why I made this post.
And sometimes it's hard to store some really complex relations like:
Task = {
    "1": {
        "name": "xx",
        "requirement": {
            "level": 5,
            "money": 100,
        },
        # ...
    },
}
As in the code sample above (a Python dict, though you can think of it as an array), the requirement field is hard to store in a DB (store a structure like a pickled object directly in the DB? not so good, I think). So in such conditions I would prefer arrays.
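For illustration only, this is roughly what pushing that nested requirement into a DB column looks like if it is serialized (JSON here rather than a pickled object; the table and column names are invented):

import json
import sqlite3

conn = sqlite3.connect("static_data.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS tasks (id TEXT PRIMARY KEY, name TEXT, requirement TEXT)"
)

# Serialize only the awkward nested part; the flat fields stay queryable.
requirement = {"level": 5, "money": 100}
conn.execute(
    "INSERT OR REPLACE INTO tasks VALUES (?, ?, ?)",
    ("1", "xx", json.dumps(requirement)),
)
conn.commit()

row = conn.execute("SELECT requirement FROM tasks WHERE id = ?", ("1",)).fetchone()
print(json.loads(row[0]))  # {'level': 5, 'money': 100}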
So what's your idea? In such a scenario, we should prefer arrays over a DB, right?
Regards.
Let's be pragmatic/objective:
Do you write to your data at runtime? Yes: DB, No: File
Do you update your data more than once per week? Yes: DB, No: File
Is it a pain to release an updated data file? Yes: DB, No: File
Do you read that data often? Yes: File/Cache, No: DB
Is it a pain to update that data file, or do you need extra tools for it? Yes: DB, No: File
For sure I've forgotten other points, but I guess the basics are there.
The "flexible" array in a file is fraught with a zillion issues already dealt with by using a DB. Unless you can prove that the DB is really going to be way slower than the other approach, use a DB. Move on and start solving business problems.
Edit
A comment from the OP asks what the issues with using a file might be; here are a handful (pause to take a deep breath).
Concurrency: You have to manage the situation where multiple requests may be trying to write back to the file. Not too hard but it becomes a bottleneck.
Performance: Yes, modifying an in-memory array is quicker, but how do you determine how much of the array needs to be persisted to a file, and when? Note that using a DB doesn't preclude the use of an appropriate in-memory cache. Writing the file back each time a small modification is made isn't going to perform that well.
Scalability: Really a function of the first two. In order to achieve any scalability goals you need to be able to quickly modify small bits of the persisted data. In other words, if you don't use a DB you would end up writing one. If you find you need more than one web server to support growing demand, where are you going to store the file(s)? Now you've got file I/O over a network (albeit likely a very quick one).
Structure: Your code will be responsible for managing the structure of the data, querying it, etc. if you use an array. How will you do that in a way which achieves greater "flexibility" than using a DB? All manner of choices and complexity are needed here.
Reliability: You need to ensure the integrity of your persisted data. In the event of some failure your array/file code would need to ensure that data is at least not so corrupt that the application can continue.
Your colleague is correct, BUT this is where you need to put aside the comp-sci textbook and be pragmatic. How often will you be accessing this data from your application? If it's fairly frequent, then don't incur the access overhead every time. Instead of reading from a flat file you could still gain the advantages of a DB, but use a caching strategy in your application. Depending on your development language you could look at something like memcache or jtreecache.
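A very small sketch of that idea in Python, using an in-process cache in place of memcache; the config table and key/value layout are hypothetical.

from functools import lru_cache
import sqlite3

conn = sqlite3.connect("app.db")


@lru_cache(maxsize=None)
def load_static_data(key):
    # Hits the database only on the first call per key; later calls are
    # served from the in-process cache.
    row = conn.execute("SELECT value FROM config WHERE key = ?", (key,)).fetchone()
    return row[0] if row else None


def refresh_static_data():
    # Call this after an admin changes the underlying table.
    load_static_data.cache_clear()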
It depends on what kind of data you are looking at, and whether or not it needs to be updated regularly.
I tend to keep most things (non-config data) in the database, even if the data isn't going to be repeating (e.g. thousands of rows). Databases will scale much more easily than a flat file; if your system starts to grow fast, your flat file might become a burden on your system.
If the data doesn't change very often, and you're programming in Java, why not use Spring to hold the values?
They can be injected into your bean, and changed easily.
But that's only if you're developing in Java.
Yeah, I agree with your implied assessment that databases are overused and basic flat files may work in a multitude of scenarios. If your application is read-only (and writes are done by the admin when the app restarts) I would definitely go with the file. Even if the application writes to the file, but only in append mode (vs random inserts/updates) and from one thread, I would also use a file. Anything else, and you need a real database with random updates, queries, concurrency control, etc.
