What is a good way to open large files across a WAN? - sql-server

I have an application deployed into multiple zones and there are some issues with opening larger documents (20-50MB) across the WAN.
Currently the documents are stored in Zone 1 (Americas) and a link stored in the database to the docs.
I have heard about storing documents as BLOBs in Oracle or as binary data in MS SQL Server 2005, and then perhaps copying the database to the other zones.
Any other suggestions or good results with one of the described options?

Your best option here may be to cache the document in the requesting zone the first time it is requested, and to check the source document's last-modified timestamp each time the cached copy is requested to determine whether it needs to be refreshed. In this case only a small piece of information (a date) crosses the WAN on most accesses. This works best for a subset of documents that are frequently requested.
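The caching idea above can be sketched roughly as follows. This is only an illustration: `ZoneCache`, `get_last_modified` and `fetch_document` are hypothetical names, and the two callables stand in for whatever cheap (timestamp) and expensive (full document) calls your application makes to Zone 1.

```python
from typing import Callable, Dict, Tuple


class ZoneCache:
    """Per-zone cache that revalidates against the source's last-modified time."""

    def __init__(
        self,
        get_last_modified: Callable[[str], float],  # cheap call across the WAN
        fetch_document: Callable[[str], bytes],     # expensive call across the WAN
    ) -> None:
        self._get_last_modified = get_last_modified
        self._fetch_document = fetch_document
        self._entries: Dict[str, Tuple[float, bytes]] = {}

    def open(self, doc_id: str) -> bytes:
        remote_mtime = self._get_last_modified(doc_id)
        cached = self._entries.get(doc_id)
        if cached is not None and cached[0] >= remote_mtime:
            return cached[1]  # fresh copy: only the timestamp crossed the WAN
        data = self._fetch_document(doc_id)  # missing or stale: full transfer
        self._entries[doc_id] = (remote_mtime, data)
        return data


# Simulated Zone 1 store (hypothetical): doc id -> (last-modified, content).
origin = {"spec.pdf": (100.0, b"version 1")}
transfers = []  # records every full-document transfer


def fetch(doc_id: str) -> bytes:
    transfers.append(doc_id)
    return origin[doc_id][1]


cache = ZoneCache(lambda doc_id: origin[doc_id][0], fetch)
first = cache.open("spec.pdf")   # not cached yet: full transfer
second = cache.open("spec.pdf")  # cached and unchanged: timestamp check only
origin["spec.pdf"] = (200.0, b"version 2")
third = cache.open("spec.pdf")   # source changed: full transfer again
```

In this sketch three opens cause only two full transfers; a real deployment would also need to persist the cache on disk and decide how to evict rarely used documents.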
If you have a large set of documents, each infrequently requested by a disparate group, then you may want to look into replicating the documents in each of your zones each time the master is updated. This may best be accomplished by storing the document as binary data in your master database and having the slaves pull from the master.

If you're running on Windows, you could look at Distributed File System (DFS).

Related

How to administrate storage of ClickHouse server in a Cluster when disks get full

I'm setting up a ClickHouse cluster, but one thing that doesn't appear in the documentation is how to manage very large amounts of data. It says that ClickHouse can handle up to petabytes of data, but you can't store that much on a single server; you will usually have a few terabytes on each.
So my question is: how can I store data on one node of the cluster and then, when more space is required, add another node? Will it handle the distribution to the new server automatically, or will I have to play with the weights in the shard distribution?
When you have more than 1 disk in one server, how can it use them all to store the data?
Is there a way to store very old data in the cloud and download it if needed? For example, all data older than 2 years could be stored in Amazon S3, as it will rarely be requested; if it is, it will take longer to retrieve the data, but that wouldn't be a problem.
What solution would you suggest for this, i.e. handling an ever-expanding database so as to avoid disk space issues in the future?
Thanks
I will assume that you use a standard configuration for the ClickHouse cluster: several shards consisting of 2-3 replica nodes, and on each of these nodes a ReplicatedMergeTree table containing data for its respective shard. There are also Distributed tables created on one or more nodes that are configured to query the nodes of the cluster (relevant section in the docs).
When you add a new shard, old data is not moved to it automatically. The recommended approach is indeed to "play with the weights", as you put it, i.e. to increase the weight of the new shard until the volume of data is even. But if you want to rebalance the data immediately, you can use the ALTER TABLE RESHARD command. Read the docs carefully and keep in mind the various limitations of this command, e.g. it is not atomic.
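To see why raising the weight of a new shard evens things out over time, here is a rough model of weighted routing as the Distributed table documentation describes it: the sharding expression is taken modulo the total weight, and each shard owns a range of remainders as wide as its weight. The function name and the integer keys are assumptions made for illustration; this is not ClickHouse code.

```python
from collections import Counter
from typing import List


def pick_shard(sharding_key: int, weights: List[int]) -> int:
    """Return the index of the shard whose remainder range contains the key."""
    remainder = sharding_key % sum(weights)
    upper = 0
    for index, weight in enumerate(weights):
        upper += weight
        if remainder < upper:
            return index
    # Not reached when every weight is a positive integer.
    raise ValueError("weights must be positive")


# Two existing shards plus a new, empty third shard with a higher weight.
weights = [1, 1, 4]
counts = Counter(pick_shard(key, weights) for key in range(6000))
```

With these weights the new shard receives four of every six newly inserted rows, so it catches up with the older shards; once the volumes are even, the weights can be set back to equal values.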
When you have more than 1 disk in one server, how can it use them all to store the data?
Please read the section on configuring RAID in the administration tips.
Is there a way to store very old data in the cloud and download it if needed? For example, all data older than 2 years could be stored in Amazon S3, as it will rarely be requested; if it is, it will take longer to retrieve the data, but that wouldn't be a problem.
MergeTree tables in ClickHouse are partitioned by month. You can use the ALTER TABLE DETACH/ATTACH PARTITION commands to manipulate partitions. For example, at the start of each month you can detach the partition for some older month and back it up to Amazon S3. Or you can set up a cluster of cheaper machines with ample disk space and manually move old partitions there. If your queries always include a filter on date, irrelevant partitions will be skipped automatically; otherwise you can set up two Distributed tables, table_recent and table_all (with the cluster config of the latter including the nodes with old partitions).
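As a sketch of the monthly housekeeping described above, the following generates the DETACH statements for every month-named (YYYYMM) partition older than a retention window. The table name `hits`, the dates and the helper names are hypothetical; the script only builds the statements and does not run them or upload anything to S3.

```python
from datetime import date
from typing import List


def old_partitions(first: date, today: date, keep_months: int) -> List[str]:
    """List YYYYMM partitions from `first` up to the retention cutoff (exclusive)."""
    start = first.year * 12 + first.month - 1
    cutoff = today.year * 12 + today.month - 1 - keep_months
    return [f"{m // 12}{m % 12 + 1:02d}" for m in range(start, cutoff)]


def detach_statements(table: str, partitions: List[str]) -> List[str]:
    """Build one ALTER TABLE ... DETACH PARTITION statement per partition."""
    return [f"ALTER TABLE {table} DETACH PARTITION {p}" for p in partitions]


# Data starts in November 2017; keep the most recent 24 months online.
partitions = old_partitions(date(2017, 11, 1), date(2020, 3, 15), keep_months=24)
statements = detach_statements("hits", partitions)
```

After a partition is detached, its files sit in the table's `detached` directory, from where they can be copied to cheaper storage and later brought back with ATTACH PARTITION.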
Version 19.15 introduced multi-disk storage configuration, and version 20.1 introduced time-based data rearrangement.

Eventually consistent document store database similar to cassandra

I'm looking for an open source data store that scales as easily as Cassandra but data can be queried via documents like MongoDB.
Are there currently any databases out that do this?
In this website http://nosql-database.org you can find a list of many NoSQL databases sorted by datastore types, you should check the Document stores there.
I'm not naming any specific database to avoid a biased/opinion-based answer, but if you are interested in a data store that is as scalable as Cassandra, you probably want to check those which use master-master/multi-master/masterless (you name it, the idea is the same) architecture, where both writes and reads can be split among all nodes in the cluster.
I know Cassandra is optimized for writes rather than reads, but without further details in the question I can't refine the answer any further.
Update:
Disclaimer: I haven't used CouchDB at all, and I haven't tested its performance either.
Since you spotted CouchDB I'll add what I've found in the official documentation, in the distributed database and replication section.
CouchDB is a peer-based distributed database system. It allows users
and servers to access and update the same shared data while
disconnected. Those changes can then be replicated bi-directionally
later.
The CouchDB document storage, view and security models are designed to
work together to make true bi-directional replication efficient and
reliable. Both documents and designs can replicate, allowing full
database applications (including application design, logic and data)
to be replicated to laptops for offline use, or replicated to servers
in remote offices where slow or unreliable connections make sharing
data difficult.
The replication process is incremental. At the database level,
replication only examines documents updated since the last
replication. Then for each updated document, only fields and blobs
that have changed are replicated across the network. If replication
fails at any step, due to network problems or crash for example, the
next replication restarts at the same document where it left off.
Partial replicas can be created and maintained. Replication can be
filtered by a javascript function, so that only particular documents
or those meeting specific criteria are replicated. This can allow
users to take subsets of a large shared database application offline
for their own use, while maintaining normal interaction with the
application and that subset of data.
This looks quite scalable to me, as it seems you can add new nodes to the cluster and all the data then gets replicated to them.
Partial replicas also seem an interesting option for really big data sets, although I'd configure them very carefully in order to prevent situations where a query might not yield valid results, for example during a network partition when only a partial set is accessible.

SQL storage sizing - How to get statistics of what data is being accessed

How can I monitor which data is being accessed, and how frequently?
I need to migrate several (very) small SQL Server instances, each with several small databases. The current configuration is based on many small servers with local storage. The new configuration is based on a single server with a single NAS.
So far, the SQL Server memory and CPU sizing is OK, as are the DB sizes and total IOPS. But there's no existing documentation of which data set is actually being accessed. So, basically, I don't have a clue what the real storage requirements are: the total IOPS may come from only a couple of tables (in which case a couple of SSDs would work like a charm), or the whole set of databases may be scanned all the time and I'll need several dozen disks.
So, back to the question: How can I "profile" and get statistics of what data is being accessed? Either at SQL or Windows level?
The best way to see how much a table or groups of tables are being used is to use SQL Server Audit. It has very little impact on SQL Server's performance and can be easily set up to monitor selects (unlike triggers) in addition to inserts/updates/deletes.

Distributed FS with deterministic multiple masters?

I'm looking for a distributed file (or other storage) system for managing a very large number of mutable documents. Each document can be rather large (1-100MB). Some reads need to be guaranteed to be working from the latest data, and some can be read from eventually-consistent replicated data. Each document could be a self-contained file (say, a SQLite database or other custom file format).
For optimal performance, the node of the distributed file system on which writes happen for each document must be different. In other words, server A is the master for document 1 and server B is replicating it, but server B is the master for document 2 and server A is replicating it. For my application, a single server is not going to be able to handle all of the write traffic for the whole system, so having a single master for all data is not acceptable.
Each document should be replicated across some number of servers (say, 3). So if I have 1000 documents and 10 servers, each server would have a copy of 300 documents, and be the master for 100 of those. Ideally, the cluster would automatically promote servers to be masters for documents whose master server had crashed, and re-balance the storage load as new servers are added to the cluster.
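The numbers in this example can be checked with a small sketch. It assumes a simple round-robin placement that is purely illustrative and not taken from any particular file system: document `i` is mastered on server `i % 10` and replicated to the next two servers.

```python
from collections import Counter
from typing import List

SERVERS = 10
DOCUMENTS = 1000
REPLICAS = 3


def placement(doc_id: int) -> List[int]:
    """Return the servers holding doc_id; the first entry is the master."""
    master = doc_id % SERVERS
    return [(master + offset) % SERVERS for offset in range(REPLICAS)]


copies = Counter()   # server -> number of document copies stored
masters = Counter()  # server -> number of documents it is master for
for doc_id in range(DOCUMENTS):
    servers = placement(doc_id)
    masters[servers[0]] += 1
    copies.update(servers)
```

Every server ends up with 300 copies and is master for 100 of them, which matches the figures above. A fixed modulo scheme like this does not handle failover or rebalancing when servers are added, which is the hard part of the requirement.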
I realize this is a pretty tall order... is there something available that meets most of my core needs?
I think HDFS would fit the criteria you listed above.

Copying data from a local database to a remote one

I'm writing a system at the moment that needs to copy data from a client's locally hosted SQL database to a hosted server database. Most of the data in the local database is copied to the live one, though optimisations are made to reduce the amount of data that actually has to be sent.
What is the best way of sending this data from one database to the other? At the moment I can see a few possible options, but none of them stands out as the prime candidate.
Replication, though this is not ideal, and we cannot expect it to be supported in the version of SQL we use on the hosted environment.
Linked server, copying data direct - a slow and somewhat insecure method
Webservices to transmit the data
Exporting the data we require as XML and transferring to the server to be imported in bulk.
The data copied goes into copies of the tables, without identity fields, so data can be inserted/updated without any violations in that respect. This data transfer does not have to be done at the database level, it can be done from .net or other facilities.
More information
The frequency of the updates will vary completely on how often records are updated. But the basic idea is that if a record is changed then the user can publish it to the live database. Alternatively we'll record the changes and send them across in a batch on a configurable frequency.
The number of records is around 4000 rows per table for the core tables (product catalog) at the moment, but this varies with the client we deploy to, as each would have their own product catalog, ranging from hundreds to thousands of products. To clarify, each client is on a separate local/hosted database combination; they are not combined into one system.
As well as the individual publishing of items, we would also require a complete re-sync of data to be done on demand.
Another aspect of the system is that some of the data being copied from the local server is stored in a secondary database, so we're effectively merging the data from two databases into the one live database.
Well, I'm biased, I have to admit. I'd like to hypnotize you into shelling out for SQL Compare to do this. I've been faced with exactly this sort of problem in all its open-ended frightfulness. I got a copy of SQL Compare and never looked back. SQL Compare is actually a silly name for a piece of software that synchronizes databases. It will also do it from the command line once you have got a working project together with all the right knobs and buttons. Of course, you can only do this for reasonably small databases, but it really is a tool I wouldn't want to be seen in public without.
My only concern with your requirements is where you are collecting product catalogs from a number of clients. If they are all in separate tables, then all is fine, whereas if they are all in the same table, then this would make things more complicated.
How much data are you talking about? How many 'client' DBs are there? And how often does it need to happen? The answers to those questions will make a big difference to the path you should take.
There is an almost infinite number of solutions for this problem. In order to narrow it down, you'd have to tell us a bit about your requirements and priorities.
Bulk operations would probably cover a wide range of scenarios, and you should add that to the top of your list.
I would recommend using Data Transformation Services (DTS) for this. You could create a DTS package for appending and one for re-creating the data.
It is possible to invoke DTS package operations from your code so you may want to create a wrapper to control the packages that you can call from your application.
In the end I opted for a set of triggers to capture data modifications to a change log table. There is then an application that polls this table and generates XML files for submission to a webservice running at the remote location.
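A minimal sketch of that trigger-and-poll pattern is shown below. It uses an in-memory SQLite database as a stand-in for the real SQL database so that it runs anywhere; the `product` and `change_log` tables, the trigger names and the XML layout are all made up for illustration, and the webservice submission is left out.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT, price REAL);
CREATE TABLE change_log (
    seq INTEGER PRIMARY KEY AUTOINCREMENT,
    table_name TEXT, row_id INTEGER, action TEXT
);
-- Triggers record every modification in the change log.
CREATE TRIGGER product_ins AFTER INSERT ON product BEGIN
    INSERT INTO change_log (table_name, row_id, action)
    VALUES ('product', NEW.id, 'insert');
END;
CREATE TRIGGER product_upd AFTER UPDATE ON product BEGIN
    INSERT INTO change_log (table_name, row_id, action)
    VALUES ('product', NEW.id, 'update');
END;
""")


def export_pending(conn: sqlite3.Connection) -> str:
    """Turn pending change-log rows into one XML batch, then clear them."""
    rows = conn.execute(
        "SELECT seq, table_name, row_id, action FROM change_log ORDER BY seq"
    ).fetchall()
    root = ET.Element("changes")
    for _seq, table_name, row_id, action in rows:
        ET.SubElement(root, "change", table=table_name, id=str(row_id), action=action)
    if rows:
        # Delete only what was exported, so later changes are not lost.
        conn.execute("DELETE FROM change_log WHERE seq <= ?", (rows[-1][0],))
    return ET.tostring(root, encoding="unicode")


conn.execute("INSERT INTO product (id, name, price) VALUES (1, 'Widget', 9.99)")
conn.execute("UPDATE product SET price = 8.99 WHERE id = 1")
batch = export_pending(conn)  # what the polling application would submit
```

A real poller would presumably include the changed column values in each `change` element and remove log rows only after the remote service has acknowledged the batch.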
