What is my mongodb database size doubled after sharding? - database

After my mongodb database grows quite a bit I decided to shard the collections.
So i created a new sharded cluster and imported my old data to the cluster using mongorestore.
sh.status() command shows that everything works fine as shown below
However my db size doubled. Seems like instead of balancing the entire db was cloned to both shards.
The images show the result of running db.stats() on both old unsharded db and the new sharded one. There were no new data inserted to the new db after the restore.
Is this a bug with mongodb balancer or am I missing something?

Statistics in databases are approximate and frequently delayed. When a chunk is moved from one shard to another, it is shown as existing on both shards for some time. This is probably because it does in fact exist in two places for a while.
To find out the optimal database size, insert your documents in such a way that no balancing is needed (each document is written into its final shard up front), then measure disk space usage.
To find out actual disk usage, look at disk usage instead of statistics.
Note that all databases have overhead when storing data on disk to attain better performance. In case of MongoDB this can be significant and there are various options for tuning various aspects between the server itself and WiredTiger.

Related

does azure search replica/Partitions impact index update speed?

We have an azure search instance(s1, 2 replica, 2 Partitions) created in 2016, and when I tried to upload 50 million rows to this instance, we found out that the old instance still has a limit of 30million records.
No problem, I created a new azure search instance (s1, 1 replica, 1 Partition), and start to upload the same data up. To my surprise, the uploading speed is much better on the new instance comparing to the old one (almost double the update speed).
I am wondering what could be the reason? The index I was uploading to is a new index, so no one will query it. These are the differences I can see between new and old search index:
No query traffic in the new search instance, old search instance
does get traffic from production environment. But it is on other search indexes.
New search instance has 1 replica, 1 Partition, old one has 2 replica, 2 partitions.
Just very curious on why I see such a speed difference. If I run a search query, actually, the performance will be very similar between old and new. Just the index update speed is much much better.
Query traffic is a factor, but it could also be the replica count. Every replica adds work to the indexing process, while every partition adds to the parallelism available for indexing. If you added a partition to your new service and indexing sped up further, that wouldn't be a surprising result.
All that said, the most likely explanation in your case is that your new service is running on faster hardware than the old one. This is how we were able to remove the document limit for new services.

How to administrate storage of ClickHouse server in a Cluster when disks get full

I'm setting up a ClickHouse server in cluster, but one of the things that doesn't appear in the documentation is how to manage very large amount of data, it says that it can handle up to petabytes of data, but you can't store that much data in single server. You usually will have a few teras in each.
So my question is, how can I handle it to store in a node of the cluster, and then when it requires more space, add another, will it handle the distribution to the new server automatically or will I have to play with the weights in the shard distribution.
When you have more than 1 disk in one server, how can it use them all to store the data?
Is there a way to store very old data in the cloud and download it if needed? For example all data older than 2 years can be stored in Amazon S3 as it will be hardly requested and in case it is, it will take a longer time to retreive the data but wouldn't be a problem.
What solution would you find to this? Handling an ever exapanding database to avoid disk space issues in the future.
Thanks
I will assume that you use standard configuration for the ClickHouse cluster: several shards consisting of 2-3 replica nodes, and on each of these nodes a ReplicatedMergeTree table containing data for its respective shard. There are also Distributed tables created on one or more nodes that are configured to query the nodes of the cluster (relevant section in the docs).
When you add a new shard, old data is not moved to it automatically. Recommended approach is indeed to "play with the weights" as you have put it, i.e. increase the weight of the new node until the volume of data is even. But if you want to rebalance the data immediately, you can use the ALTER TABLE RESHARD command. Read the docs carefully and keep in mind various limitations of this command, e.g. it is not atomic.
When you have more than 1 disk in one server, how can it use them all to store the data?
Please read the section on configuring RAID in the administration tips.
Is there a way to store very old data in the cloud and download it if needed? For example all data older than 2 years can be stored in Amazon S3 as it will be hardly requested and in case it is, it will take a longer time to retreive the data but wouldn't be a problem.
MergeTree tables in ClickHouse are partitioned by month. You can use ALTER TABLE DETACH/ATTACH PARTITION commands to manipulate partitions. You can e.g. at the start of each month detach the partition for some older month and back it up to Amazon S3. Or you can setup a cluster of cheaper machines with ample disk space and manually move old partitions there. If your queries always include a filter on date, irrelevant partitions will be skipped automatically, else you can setup two Distributed tables: table_recent and table_all (with the cluster config including the nodes with old partitions).
Version 19.15 introduced multidisk strorage configuration. 20.1 introduces time-based data rearrangements.

Eventually consistent document store database similar to cassandra

I'm looking for an open source data store that scales as easily as Cassandra but data can be queried via documents like MongoDB.
Are there currently any databases out that do this?
In this website http://nosql-database.org you can find a list of many NoSQL databases sorted by datastore types, you should check the Document stores there.
I'm not naming any specific database to avoid a biased/opinion-based answer, but if you are interested in a data store that is as scalable as Cassandra, you probably want to check those which use master-master/multi-master/masterless (you name it, the idea is the same) architecture, where both writes and reads can be split among all nodes in the cluster.
I know Cassandra is optimized towards writes rather than reads, but without further details in the question can't refine the answer with more information.
Update:
Disclaimer: I haven't used CouchDB at all, and haven't tested it's performance either.
Since you spotted CouchDB I'll add what I've found in the official documentation, in the distributed database and replication section.
CouchDB is a peer-based distributed database system. It allows users
and servers to access and update the same shared data while
disconnected. Those changes can then be replicated bi-directionally
later.
The CouchDB document storage, view and security models are designed to
work together to make true bi-directional replication efficient and
reliable. Both documents and designs can replicate, allowing full
database applications (including application design, logic and data)
to be replicated to laptops for offline use, or replicated to servers
in remote offices where slow or unreliable connections make sharing
data difficult.
The replication process is incremental. At the database level,
replication only examines documents updated since the last
replication. Then for each updated document, only fields and blobs
that have changed are replicated across the network. If replication
fails at any step, due to network problems or crash for example, the
next replication restarts at the same document where it left off.
Partial replicas can be created and maintained. Replication can be
filtered by a javascript function, so that only particular documents
or those meeting specific criteria are replicated. This can allow
users to take subsets of a large shared database application offline
for their own use, while maintaining normal interaction with the
application and that subset of data.
Which looks quite scalable to me, as it seems you can add new nodes to the cluster and then all the data gets replicated.
Also partial replicas seems an interesting option for really big data sets, which I'd configure these very carefully, in order to prevent situations where a given query to the database might not yield valid results, for example, in the case of a network partition and having only access to a partial set.

Database tables optimized for both read and write

We have a web service that pumps data into 3 database tables and a web application that reads that data in aggregated format in a SQL Server + ASP.Net environment.
There is so much data arriving to the database tables and so much data read from them and at such high velocity, that the system started to fail.
The tables have indexes on them, one of them is unique. One of the tables has billions of records and occupies a few hundred gigabytes of disk space; the other table is a smaller one, with only a few million records. It is emptied daily.
What options do I have to eliminate the obvious problem of simultaneously reading and writing from- and to multiple database tables?
I am interested in every optimization trick, although we have tried every trick we came across.
We don't have the option to install SQL Server Enterprise edition to be able to use partitions and in-memory-optimized tables.
Edit:
The system is used to collect fitness tracker data from tens of thousands of devices and to display data to thousands of them on their dashboard in real-time.
Way too broad of requirements and specifics to give a concrete answer. But a suggestion would be to setup a second database and do log shipping over to it. So the original db would be the "write" and the new db would be the "read" database.
Cons
Diskspace
Read db would be out of date by the length of time for log tranfser
Pro
- Could possible drop some of the indexes on "write" db, this would/could increase performance
- You could then summarize the table in the "read" database in order to increase query performance
https://msdn.microsoft.com/en-us/library/ms187103.aspx
Here's some ideas, some more complicated than others, their usefulness depending really heavily on the usage which isn't fully described in the question. Disclaimer: I am not a DBA, but I have worked with some great ones on my DB projects.
[Simple] More system memory always helps
[Simple] Use multiple files for tempdb (one filegroup, 1 file for each core on your system. Even if the query is being done entirely in memory, it can still block on the number of I/O threads)
[Simple] Transaction logs on SIMPLE over FULL recover
[Simple] Transaction logs written to separate spindle from the rest of data.
[Complicated] Split your data into separate tables yourself, then union them in your queries.
[Complicated] Try and put data which is not updated into a separate table so static data indices don't need to be rebuilt.
[Complicated] If possible, make sure you are doing append-only inserts (auto-incrementing PK/clustered index should already be doing this). Avoid updates if possible, obviously.
[Complicated] If queries don't need the absolute latest data, change read queries to use WITH NOLOCK on tables and remove row and page locks from indices. You won't get incomplete rows, but you might miss a few rows if they are being written at the same time you are reading.
[Complicated] Create separate filegroups for table data and index data. Place those filegroups on separate disk spindles if possible. SQL Server has separate I/O threads for each file so you can parallelize reads/writes to a certain extent.
Also, make sure all of your large tables are in separate filegroups, on different spindles as well.
[Complicated] Remove inserts with transactional locks
[Complicated] Use bulk-insert for data
[Complicated] Remove unnecessary indices
Prefer included columns over indexed columns if sorting isn't required on them
That's kind of a generic list of things I've done in the past on various DB projects I've worked on. Database optimizations tend to be highly specific to your situation...which is why DBA's have jobs. Some of the 'complicated' answers could be simple if your architecture supports it already.

Hadoop is it recommended only for distributed env?

I have a database whose size could go upto 1TB in a month. If I do a query directly, its taking a long time. So I was thinking of using Hadoop on top of the Database - most of the time my query would involve searching entire database. My database instance would be either 1 or 2, not more than that. After a while we purge the database.
So can we use hadoop framework since it helps processing large amount of data?
Hadoop is not "something you query" but you can use it to process a large amount of data and create a search index which you then load into a system you can query.
You can also look into HBase if you want a store for big data. In addition to HBase there are a number of other key-value or non-relational (NoSQL) stores that work well with large data.
A proper answer depends on the kind of query you are running. Are you always running a specific query? If so, then a key-value store works well; just choose the right keys. If your query needs to search the entire database as you say, and you only make one query every hour or two, then yes, in principle, you could write a simple "query" in Hive that will read from your HDFS store.
Note that querying in Hive only saves you time versus an RDBMS or a simple grep when you have a lot of data and access to a decent-sized cluster. If you only have one machine, it's a non-solution.
Hadoop works better on distributed system. Moreover 1TB is not big data., for this your relational database will do the job.
The real power of hadoop comes when you have to process 100 TB or more of data .. where the relational databases fail.
If look into Hbase it is fast but it is not a substitute to your MySQL or Oracle..

Resources