Maximum recommended children in a Firebase Database node - database

I'm starting off with using Firebase Database for the first time, and I think I've got a decent structure planned out, but I'm worried about the number of child nodes I might end up with.
Is there a recommended limit or average known-good number of child values which can be added to a node without running into noticeable performance problems? I've not had much database experience at all, and I've not been able to find any information on what an acceptable value would be, so I have no idea if my planned structure will scale well.
As a rough estimate, I'm expecting a rough maximum of 30,000 children all-in. I'll only be requesting data from around 10 of those, but as far as I know, Firebase will retrieve the entire node, before filtering out any results, which is why I'm worried about the performance impact of retrieving the entire node. Any help with this would be massively appreciated! Thanks!

As a rough estimate, I'm expecting a rough maximum of 30,000 children all-in.
That's not really a very large number of child nodes.
as far as I know, Firebase will retrieve the entire node, before filtering out any results
If you query the database using a field with an index, the nodes will be filtered on the server. You can create an index to avoid performance problems for larger numbers of child nodes.

Related

Couchbase retrieve data from vbucket

I'm new to Couchbase and wondering if there is any manner to implement a parallel read from bucket. Given that, a bucket contains 1024 vbuckets by default. So could it be possible to split a N1QL query select * from b1 into several queries? It means that one of those queries just reads data from vbucket1 to vbucket100. Because the partition key is used to decide which node the value should be persisted. I think it could be possible to read a part of data from bucket according to a range of partition key. Could someone help me out of this?
Thanks
I don't recommend proceeding down this route. If you are just starting out, you should be worrying about how to represent your data in JSON, how to write effective N1QL queries against it, and how to get a useful set of indexes that support those queries and let them run quickly. You should also make sure that your cluster is properly set up, and you have a proper mix of KV, N1QL, and indexing nodes, with none of them as an obvious bottleneck. And of course you should be measuring performance. Exotic strategies like query partitioning should come after that, if you are still unsatisfied with performance.

Caching Strategy for a Deep, Traversable Egocentric Graph

Firstly, let me explain what I'm building:
I have a D3.js Force Layout graph which is rooted at the center, and has a bunch of nodes spread around it. The center node is an Entity of some sort, the nodes around it are other Entities which are somehow related to the root. The edges are the actual relations (i.e. how the two are related).
The outer nodes can be clicked to center the target Entity and load its relations
This graph is "Egocentric" in the sense that every time a node is clicked, it becomes the center, and only relations directly involved with itself are displayed.
My Setup, in case any of it matters:
I'm serving an API through Node.js, which translates requests into queries to a CouchDB server with huge data sets.
D3.js is used for layout, and aside from jQuery and Bootstrap, I'm not using any other client-side libraries. If any would help with this caching task, I'm open to suggestions :)
My Ideas:
I could easily grab a few levels of the graph each time (recurse through the process of listing and expanding children a few times) but since clicking on any given node loads completely unrelated data, it is not guaranteed to yield a high percentage of the similar data as was loaded for the root. This seems like a complete waste, and actually a step in the opposite direction -- I'd end up doing more processing this way!
I can easily maintain a hash table of Entities that have already been retrieved, and check the list before requesting data for that entity from the server. I'll probably end up doing this regardless of the cache strategy I implement, since it's a really simple way of reducing queries.
Now, how do you suggest I cache this data?
Is there any super-effective strategy you can think of for doing this kind of caching? Both server-and-client-side options are greatly welcomed. A ton of data is involved in this process, and any reduction of querying/processing puts me miles ahead of the game.
Thanks!
On the client side I would have nodes, and have their children either be an array of children, or else a function that serves as a promise of those children. When you click on a given node, if you have data, display it immediately. Else send off an AJAX request that will fill it.
Whenever you display a node (not centered), create an asynchronous list of AJAX requests for the children of the displayed nodes and start requesting them. That way when the user clicks, there is a chance that you already have it cached. And if not, well, you tried and cost them nothing.
Once you have it working, decide how many levels deep it makes sense to go. My guess is that the magic number is likely to be 1. Beyond that the return in responsiveness falls off rapidly, while the server load rises rapidly. But having clicks come back ASAP is a pretty big UI win.
I think you need to do two things:
Reduce the number of requests you make
Reduce the cost of requests
As btilly points out, you're probably best requesting the related nodes for each visible node, so that if they are clicked the visualisation is immediately responsive -- you don't want the query plus transit time as a response lag.
However, if you have this great need to reduce the number of requests, it suggests your requests themselves are too costly, since the total load is requestCost * numRequests. Consider finding ways to pre-calculate the set of nodes related to each node, so that the request is a read request rather than a full DB query. If you're thinking that sounds hard, it's just what Google do every time you search for a new thing; they can't search the internet every time you start typing, so they do that ahead of time, and cache.
This may mean some amount of denormalisation; if you have a cache and a query, there is no guarantee they are in sync, but the question there is whether your dataset changes; is it write once, read many?
To minimise the space needed by all these nodes and their relations, see it more as a particle interaction problem; by partitioning the space you may be able to group nodes so that you only need query a group of nodes for its aggregate neighbours, and store that. That way you can do a much smaller filtration on each request rather than a full DB query. If it's O(n log n) and you make n a hundred times smaller, it's more than 100x faster.

Even partitioning of nonuniform ranged data in cassandra

I've got a rather tricky one, bear with me as I try not to stumble over my words here. I'm doing some research, and my group is transitioning to a cassandra database. Our research used MySQL before, but the data outgrew the database (192 million rows in memory # 16G -- it was the only way to query the data fast enough). The data itself is kinda-sorta static. There's a whole lot of it, but any new data is a somewhat slow trickle at this point.
The data consists of a boatload of classifier-score pairs. We formulate queries for the database which basically say, "give me the top 500 for the following classifiers". Then the database returns that many scores. For example, if we ask for the top 500 scores for 2 classifiers, we get back 1000 rows (each row consisting of a classifier ID and a score -- i.e. [4, 9100]). The scores themselves are non-uniform (the distribution tends to clump toward one end of the values -- which by the way are from -10000 to 10000)
As we transition to cassandra, there are a number of requirements. First of all, we need to be able to query for the top and bottom N scores on a per-classifier basis. Normally I can see that an ordered partitioner would be appropriate for this, however like I said the scores tends to clump at the extremes (which would put too much of a burden on one node). So my first question is, how do I evenly distribute the classifier/score pairs while still being able to query for the top or bottom N.
There is a secondary requirement which pretty much screws up the first one. Sometimes it is necessary to find all scores that are near another score. So if I see classifier 6 with a score of 400, I might ask, show me 500 scores that are the closest to that (all within classifier 6). I'm absolutely stumped about this one. I've read that cassandra supports secondary indices (yay) but only hash type (boo - no ranges). Do we create a seperate ColumnFamily for this use case?
And finally, speed is paramount. The data is being used in an interactive GUI application. Ideally, queries should only take a few seconds. And if data all gets stuck on one particular node, it will slow things down.
We've tried all kinds of clever tricks. Our best idea was to put the data into buckets, so that the top 500 went into bucket 1, the next 500 went into bucket 2, and so on. The advantage is that to get the top 500 we just ask for bucket 1. Also all of the data WOULD be evenly distributed using a random partitioner. However since MOST of our queries are interested only in bucket 1, it would put a lot of burden on just one node (remember, if N classifiers are involved, it's actually 500 * N scores per bucket). The real disadvantage of this scheme is that it falls apart when we need to query based on nearness to a score (we'd have to do some kind of weird binary search over the buckets to find our starting value).
At this point we're running low on ideas. Everything I've seen about cassandra makes me wonder if it's even appropriate for this task. We chose it mainly because of it's horizontal scalability, which is important (much easier to add a node than to shard an RDBM). So I suppose my overall question is: how would you approach this? If cassandra, please address any of the above issues. Otherwise any insight or wisdom would be appreciated. Thanks.
Why not storing the classifier as a column family row key and the score in column name. Since columns are sorted it is really fast to query the top/bottom 500 columns for a given classifier. The second type of query is also possible, when you are looking for the scores near s you can for instance select 500 columns before s and 500 columns after s and then filter the 500 columns near s.

Inspiration needed: Selecting large amounts of data for a highscore

I need some inspiration for a solution...
We are running an online game with around 80.000 active users - we are hoping to expand this and are therefore setting a target of achieving up to 1-500.000 users.
The game includes a highscore for all the users, which is based on a large set of data. This data needs to be processed in code to calculate the values for each user.
After the values are calculated we need to rank the users, and write the data to a highscore table.
My problem is that in order to generate a highscore for 500.000 users we need to load data from the database in the order of 25-30.000.000 rows totalling around 1.5-2gb of raw data. Also, in order to rank the values we need to have the total set of values.
Also we need to generate the highscore as often as possible - preferably every 30 minutes.
Now we could just use brute force - load the 30 mio records every 30 minutes, calculate the values and rank them, and write them in to the database, but I'm worried about the strain this will cause on the database, the application server and the network - and if it's even possible.
I'm thinking the solution to this might be to break up the problem some how, but I can't see how. So I'm seeking for some inspiration on possible alternative solutions based on this information:
We need a complete highscore of all ~500.000 teams - we can't (won't unless absolutely necessary) shard it.
I'm assuming that there is no way to rank users without having a list of all users values.
Calculating the value for each team has to be done in code - we can't do it in SQL alone.
Our current method loads each user's data individually (3 calls to the database) to calculate the value - it takes around 20 minutes to load data and generate the highscore 25.000 users which is too slow if this should scale to 500.000.
I'm assuming that hardware size will not an issue (within reasonable limits)
We are already using memcached to store and retrieve cached data
Any suggestions, links to good articles about similar issues are welcome.
Interesting problem. In my experience, batch processes should only be used as a last resort. You are usually better off having your software calculate values as it inserts/updates the database with the new data. For your scenario, this would mean that it should run the score calculation code every time it inserts or updates any of the data that goes into calculating the team's score. Store the calculated value in the DB with the team's record. Put an index on the calculated value field. You can then ask the database to sort on that field and it will be relatively fast. Even with millions of records, it should be able to return the top n records in O(n) time or better. I don't think you'll even need a high scores table at all, since the query will be fast enough (unless you have some other need for the high scores table other than as a cache). This solution also gives you real-time results.
Assuming that most of your 2GB of data is not changing that frequently you can calculate and cache (in db or elsewhere) the totals each day and then just add the difference based on new records provided since the last calculation.
In postgresql you could cluster the table on the column that represents when the record was inserted and create an index on that column. You can then make calculations on recent data without having to scan the entire table.
First and formost:
The computation has to take place somewhere.
User experience impact should be as low as possible.
One possible solution is:
Replicate (mirror) the database in real time.
Pull the data from the mirrored DB.
Do the analysis on the mirror or on a third, dedicated, machine.
Push the results to the main database.
Results are still going to take a while, but at least performance won't be impacted as much.
How about saving those scores in a database, and then simply query the database for the top scores (so that the computation is done on the server side, not on the client side.. and thus there is no need to move the millions of records).
It sounds pretty straight forward... unless I'm missing your point... let me know.
Calculate and store the score of each active team on a rolling basis. Once you've stored the score, you should be able to do the sorting/ordering/retrieval in the SQL. Why is this not an option?
It might prove fruitless, but I'd at least take a gander at the way sorting is done on a lower level and see if you can't manage to get some inspiration from it. You might be able to grab more manageable amounts of data for processing at a time.
Have you run tests to see whether or not your concerns with the data size are valid? On a mid-range server throwing around 2GB isn't too difficult if the software is optimized for it.
Seems to me this is clearly a job for chacheing, because you should be able to keep the half-million score records semi-local, if not in RAM. Every time you update data in the big DB, make the corresponding adjustment to the local score record.
Sorting the local score records should be trivial. (They are nearly in order to begin with.)
If you only need to know the top 100-or-so scores, then the sorting is even easier. All you have to do is scan the list and insertion-sort each element into a 100-element list. If the element is lower than the first element, which it is 99.98% of the time, you don't have to do anything.
Then run a big update from the whole DB once every day or so, just to eliminate any creeping inconsistencies.

Database scalability - performance vs. database size

I'm creating an app that will have to put at max 32 GB of data into my database. I am using B-tree indexing because the reads will have range queries (like from 0 < time < 1hr).
At the beginning (database size = 0GB), I will get 60 and 70 writes per millisecond. After say 5GB, the three databases I've tested (H2, berkeley DB, Sybase SQL Anywhere) have REALLY slowed down to like under 5 writes per millisecond.
Questions:
Is this typical?
Would I still see this scalability issue if I REMOVED indexing?
What are the causes of this problem?
Notes:
Each record consists of a few ints
Yes; indexing improves fetch times at the cost of insert times. Your numbers sound reasonable - without knowing more.
You can benchmark it. You'll need to have a reasonable amount of data stored. Consider whether or not to index based upon the queries - heavy fetch and light insert? index everywhere a where clause might use it. Light fetch, heavy inserts? Probably avoid indexes. Mixed workload; benchmark it!
When benchmarking, you want as real or realistic data as possible, both in volume and on data domain (distribution of data, not just all "henry smith" but all manner of names, for example).
It is typical for indexes to sacrifice insert speed for access speed. You can find that out from a database table (and I've seen these in the wild) that indexes every single column. There's nothing inherently wrong with that if the number of updates is small compared to the number of queries.
However, given that:
1/ You seem to be concerned that your writes slow down to 5/ms (that's still 5000/second),
2/ You're only writing a few integers per record; and
3/ You're queries are only based on time queries,
you may want to consider bypassing a regular database and rolling your own sort-of-database (my thoughts are that you're collecting real-time data such as device readings).
If you're only ever writing sequentially-timed data, you can just use a flat file and periodically write the 'index' information separately (say at the start of every minute).
This will greatly speed up your writes but still allow a relatively efficient read process - worst case is you'll have to find the start of the relevant period and do a scan from there.
This of course depends on my assumption of your storage being correct:
1/ You're writing records sequentially based on time.
2/ You only need to query on time ranges.
Yes, indexes will generally slow inserts down, while significantly speeding up selects (queries).
Do keep in mind that not all inserts into a B-tree are equal. It's a tree; if all you do is insert into it, it has to keep growing. The data structure allows for some padding, but if you keep inserting into it numbers that are growing sequentially, it has to keep adding new pages and/or shuffle things around to stay balanced. Make sure that your tests are inserting numbers that are well distributed (assuming that's how they will come in real life), and see if you can do anything to tell the B-tree how many items to expect from the beginning.
Totally agree with #Richard-t - it is quite common in offline/batch scenarios to remove indexes completely before bulk updates to a corpus, only to reapply them when update is complete.
The type of indices applied also influence insertion performance - for example with SQL Server clustered index update I/O is used for data distribution as well as index update, where as nonclustered indexes are updated in seperate (and therefore more expensive) I/O operations.
As with any engineering project - best advice is to measure with real datasets (skews page distribution, tearing etc.)
I think somewhere in the BDB docs they mention that page size greatly affects this behavior in btree's. Assuming you arent doing much in the way of concurrency and you have fixed record sizes, you should try increasing your page size

Resources