Let's say we have thousands of block storage devices in NetApp via ONTAP, a service that offers block storage (this question is not meant to focus on NetApp nuances). They were created in Kubernetes using default database/app solutions (e.g. the official MariaDB and MongoDB images), but for whatever reason they are no longer available in Kubernetes, and we are left with dangling block storage.
So, given this sheer amount of storage without any particular organization behind it: is there existing software to programmatically identify which database or application each block storage device was storing information for?
I understand the overall requirement would be to mount each block storage device programmatically and then perform the actual task of identifying which kind of database/app created the data structure we see. Reasonably, the most complicated part would be the "identification" step.
Therefore, the main question is: is there any software available that could help us identify which block storage device was once storing, for instance, a MariaDB database, or any other database? Otherwise, I guess we will have to develop software specifically for this, which would be extremely time-consuming.
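For what it's worth, here is a rough Python sketch of what the identification step could look like, assuming each dangling volume can be attached and mounted read-only at some path. The fingerprint file names are assumptions based on the default on-disk layouts of a few common engines and would need to be verified and extended for your environment.

```python
import os

# Assumed fingerprints: files/directories that the default on-disk layouts of a
# few common engines are known to create; verify and extend for your stack.
FINGERPRINTS = {
    "MariaDB/MySQL (InnoDB)": {"ibdata1", "ib_logfile0", "mysql"},
    "MongoDB (WiredTiger)":   {"WiredTiger", "WiredTiger.wt", "storage.bson"},
    "PostgreSQL":             {"PG_VERSION", "pg_wal", "postgresql.conf"},
}

def identify_workload(mount_point: str) -> list:
    """Return the engines whose marker files appear near the top of the volume."""
    entries = set()
    for root, dirs, files in os.walk(mount_point):
        entries.update(dirs)
        entries.update(files)
        # Two directory levels are usually enough to find the markers.
        if root.count(os.sep) - mount_point.count(os.sep) >= 2:
            dirs.clear()
    return [engine for engine, markers in FINGERPRINTS.items()
            if entries & markers]

if __name__ == "__main__":
    # Assumes the dangling volume is already attached and mounted read-only here.
    print(identify_workload("/mnt/volume-under-inspection"))
```

A marker-file approach like this won't catch raw or unformatted volumes or custom layouts, so treat the output as candidates rather than definitive answers.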
So I just started messing around with Google BigQuery about 10 minutes ago, and I was wondering if anyone is aware of the underlying architecture that they're using to store the data? For example, is this just the next generation of their own BigTable infrastructure?
Also, is it clear what sorts of strategies they're using for indexes, index rebuilds, etc.? I'm just trying to work out whether this is mature enough at this point that you can be 100% sure of what's going on with your data end-to-end, or whether there's a bit of a black-box area where "things just work".
There are no indexes... every query is a table scan. The query architecture is described here.
Your data is stored in a proprietary columnar format called ColumnIO on Colossus (a successor to GFS). Colossus replicates the data within a datacenter and your data is also replicated to other geographic regions to make sure it stays available even if a Google datacenter goes offline.
To answer your specific questions:
While data may be temporarily stored in Bigtable, all data is stored long-term in Colossus (for now!).
New data added to BigQuery is encrypted at rest (that is, whenever it is written out to permanent storage). It is also encrypted when sent over the network.
As mentioned, no indexes, so there are no strategies for rebuilding the index. Depending on how you add data to your table, your table may be coalesced, which means rewriting the underlying files in a more efficient manner.
Colossus underlies a massive amount of Google data across a wide range of services, and ColumnIO is a standard throughout Google. I would call both of these technologies mature.
That said, you should also consider it a black box: all of the details here may change as storage systems at Google mature or architectures change. However, it should always "just work" (within SLA caveats, of course).
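One practical consequence worth sketching (my own illustration, not official guidance): because storage is columnar and every query is a table scan, the bytes a query scans, and therefore its cost, depend mainly on which columns you reference. A dry run with the Python client reports the scan size without executing anything; the public dataset below is just an example.

```python
from google.cloud import bigquery

# Illustrative only: selecting fewer columns reduces bytes scanned because the
# storage format is columnar; a plain WHERE filter does not, since every query
# is a table scan.
client = bigquery.Client()  # assumes default project and credentials

sql = """
    SELECT title, score        -- only these columns are read from storage
    FROM `bigquery-public-data.hacker_news.full`
    WHERE score > 100
"""

dry = client.query(sql, job_config=bigquery.QueryJobConfig(dry_run=True))
print(f"Would scan ~{dry.total_bytes_processed / 1e9:.2f} GB")
```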
If you're interested in more details about how BigQuery works under the covers or how to use it effectively, here is a shameless plug for our book on the subject which is due out in June.
I'm in need of a certain web API, but the ones I have found have rate limits that won't be enough for me.
I can set up my own database and store the results that I get from the API, to reduce the amount of calls. This will eventually lead to a point where I, more or less, have copied the API database.
Is this okay? It might be morally questionable, but is there anything that actually prevents me from doing it?
Thank you.
It depends on the terms of service of the API vendor.
For example, Google Maps:
10.1 Restrictions on How You May Use the Maps API(s). Except as explicitly permitted in Section 8 (Licenses from Google to You) or the Maps APIs Documentation, you must not (nor may you permit anyone else to) do any of the following:
10.1.3 Restrictions against Data Export or Copying.
(b) No Pre-Fetching, Caching, or Storage of Content. You must not pre-fetch, cache, or store any Content, except that you may store: (i) limited amounts of Content for the purpose of improving the performance of your Maps API Implementation if you do so temporarily, securely, and in a manner that does not permit use of the Content outside of the Service; and (ii) any content identifier or key that the Maps APIs Documentation specifically permits you to store. For example, you must not use the Content to create an independent database of “places.”
Ultimately it will depend on the terms and conditions set out by the API provider. You may find, though, that you run into issues where your copy is not as up to date as the live version, and you will need to find a way to refresh your local copy without going over the call limits. It also won't necessarily solve any issues you have with call throttling for write operations, which must update the master database(s).
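If the terms do allow it, a minimal sketch of the "local copy with a refresh policy" idea might look like the following. The table layout and the 24-hour TTL are arbitrary assumptions; the TTL is the knob you would tune to stay under the provider's call limits while keeping the copy reasonably fresh.

```python
import json
import sqlite3
import time
import urllib.request

TTL_SECONDS = 24 * 3600  # assumed refresh interval; tune against your rate limit

db = sqlite3.connect("api_cache.db")
db.execute("""CREATE TABLE IF NOT EXISTS cache (
                  url TEXT PRIMARY KEY,
                  body TEXT,
                  fetched_at REAL)""")

def get(url: str) -> dict:
    """Return the cached response if it is fresh enough, otherwise call the API."""
    row = db.execute("SELECT body, fetched_at FROM cache WHERE url = ?",
                     (url,)).fetchone()
    if row and time.time() - row[1] < TTL_SECONDS:
        return json.loads(row[0])              # cache hit: no API call consumed
    with urllib.request.urlopen(url) as resp:  # cache miss: counts against the limit
        body = resp.read().decode()
    db.execute("INSERT OR REPLACE INTO cache VALUES (?, ?, ?)",
               (url, body, time.time()))
    db.commit()
    return json.loads(body)
```

Note that this only helps with reads; as mentioned above, write operations still have to go to the provider's master database.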
It occurs to me that state control in languages like C# is not well supported.
By this, I mean that it is left up to the programmer to manage the state of in-memory objects. A common use case is that instance variables in the domain model are copies of information residing in persistent storage (i.e. the database). Clearly this violates the single-point-of-authority principle, and "synchronisation" has to be managed by the developer.
I envisage a system where instead of instance variables, we have simple public access/mutator methods marked with attributes that link them to the database, and where reads and writes are mediated by a framework that decides whether to hit the database. Does such a system exist?
Am I completely missing the point, or is there some truth to this idea?
If I understand correctly what you want: any OR-mapper with lazy loading works this way. For example, I use Genome, where every entity is a pure proxy and you have quite a lot of influence over how the OR-mapper caches the fields.
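To make the idea concrete, here is a toy sketch (in Python rather than C#, and not tied to Genome or any real ORM) of an attribute/descriptor-mediated field: reads are lazily fetched from the store on first access, and writes are buffered until the framework decides to flush them. The session.fetch call and the dirty-set are invented for illustration.

```python
class DbField:
    """Descriptor: reads hit the database lazily on first access; writes are
    buffered and flushed later by the framework. Toy illustration only."""
    def __init__(self, column):
        self.column = column

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        if self.column not in obj.__dict__:
            # Lazy load via a hypothetical session that decides whether to hit the DB.
            obj.__dict__[self.column] = obj._session.fetch(obj._table, obj.id,
                                                           self.column)
        return obj.__dict__[self.column]

    def __set__(self, obj, value):
        obj.__dict__[self.column] = value
        obj._dirty.add(self.column)   # flushed to persistent storage on commit

class Customer:
    _table = "customers"
    name = DbField("name")
    balance = DbField("balance")

    def __init__(self, session, id):
        self._session, self.id, self._dirty = session, id, set()
```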
Actually there's the concept of data prevalence (as implemented by Prevayler in Java) where the in-memory objects are the single point of authority (SPA) for the data.
Also, some object databases (as db4o) blur lines a bit between the object representation and the "store" representation.
On the other hand, by bringing the SPA for the data inside the application, you need to handle transactions and/or data persistence by yourself. There is some work done on transactional memory systems such as JVSTM (currently in use by the information system of my old college) but it's not in widespread use.
On the other hand, if the data lives in a database, you can just commit the data when everything is good (or use the transaction support built into the database) and be sure that data isn't corrupted or lost. You trade the SPA principle for better data reliability and transactions (and the other advantages of using a separate data store).
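For the curious, here is a minimal, Prevayler-inspired sketch (in Python rather than Java) of the prevalence idea: the in-memory model is the SPA, and durability comes from logging every command before applying it, so the model can be rebuilt by replaying the journal after a restart. The journal format and class names are made up.

```python
import json
import os

class Accounts:
    """The in-memory model is the single point of authority for the data."""
    def __init__(self):
        self.balances = {}

    def deposit(self, who, amount):
        self.balances[who] = self.balances.get(who, 0) + amount

class Prevalence:
    """Log every mutation before applying it; replay the journal on startup."""
    def __init__(self, model, journal="journal.log"):
        self.model, self.journal = model, journal
        if os.path.exists(journal):                     # recovery by replay
            with open(journal) as f:
                for line in f:
                    cmd = json.loads(line)
                    getattr(model, cmd["op"])(*cmd["args"])

    def execute(self, op, *args):
        with open(self.journal, "a") as f:              # write-ahead: log first...
            f.write(json.dumps({"op": op, "args": list(args)}) + "\n")
        getattr(self.model, op)(*args)                  # ...then apply in memory

store = Prevalence(Accounts())
store.execute("deposit", "alice", 100)
```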
I need to be able to set up server(s) that replicate all information, acting as a master data store that holds all the data.
I also need servers that store/replicate specific subsets of the data and are available on local LANs, so that when the internet connection goes down, clients can still access their local data. Under normal circumstances, clients will access most of their data from the local LAN, and may use other servers when the local LAN server goes down.
I want this alongside the benefits of a distributed data store, such as failure resistance and speed.
Which Distributed Key-Value Data Store or other data storage method would be most suited for this?
Try out CouchDB. Your use case reads like it was built for it. Granted, CouchDB is much more than a key/value store, but on the other hand, it is no less suitable for one.
Add replication and, as a bonus, you get fault tolerance, conflict detection (and resolution), and an easy API (HTTP).
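To show how little glue this takes, here is a sketch assuming a hypothetical central node and a branch-office LAN node; it uses CouchDB's _replicate endpoint over plain HTTP (authentication omitted), and the host names and database name are placeholders.

```python
import json
import urllib.request

central = "http://central.example.com:5984"   # assumed master data store
branch = "http://lan-server.local:5984"       # assumed local LAN node

def replicate(source, target, continuous=True):
    # CouchDB replication is just an HTTP POST to the _replicate endpoint.
    body = json.dumps({"source": source, "target": target,
                       "continuous": continuous}).encode()
    req = urllib.request.Request(f"{branch}/_replicate", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Pull the data this branch cares about from the central store, and push local
# changes back, so clients keep working when the internet link is down.
replicate(f"{central}/orders", "orders")
replicate("orders", f"{central}/orders")
```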
Let me know if you have any other questions.
Of course you must remember that replication is something completely different from backup, because one system's programmatic failure in handling the data can quickly replicate to other nodes resulting in total mayhem.
Maybe using the Hadoop Distributed File System (HDFS) or OpenAFS would be a good solution here?
I haven't used any of those systems in real-life scenarios, only had interest in them during my research on peer-to-peer and distributed storage solutions, but I think they're worth a try.
Have you checked out Microsoft's new Velocity? http://msdn.microsoft.com/en-us/data/cc655792.aspx. Unlike many other cloud services, you can run Velocity on your own premises.
Question:
Should I write my application to access a database image repository directly, or should I write a middleware piece to handle document requests?
Background:
I have a custom Document Imaging and Workflow application that currently stores about 15 million documents/document images (90%+ single page, group 4 tiffs, the rest PDF, Word and Excel documents). The image repository is a commercial, 3rd party application that is very expensive and frankly has too much overhead. I just need a system to store and retrieve document images.
I'm considering moving the imaging directly into a SQL Server 2005 database. The indexing information is very limited - basically 2 index fields. It's a life insurance policy administration system so I index images with a policy number and a system wide unique id number. There are other index values, but they're stored and maintained separately from the image data. Those index values give me the ability to look-up the unique id value for individual image retrieval.
The database server is a dual-quad-core Windows 2003 box with SAN drives hosting the DB files. The current image repository size is about 650GB. I haven't done any testing to see how large the converted database will be. I'm not really asking about the database design - I'm working with our DBAs on that aspect. If that changes, I'll be back :-)
The current system to be replaced is obviously a middleware application, but it's a very heavyweight system spread across 3 Windows servers. If I go this route, it would be a single-server system.
My primary concerns are scalability and performance - heavily weighted toward performance. I have about 100 users, and usage growth will probably be slow for the next few years.
Most users are primarily read users - they don't add images to the system very often. We have a department that handles scanning and otherwise adding images to the repository. We also have a few other applications that receive documents (via FTP) and insert them into the repository automatically as they are received, either with full index information or as "batches" that a user reviews and indexes.
Most (90%+) of the documents/images are very small, < 100K, probably < 50K, so I believe that storing the images in the database file will be more efficient than getting SQL 2008 and using FILESTREAM.
Oftentimes scalability and performance are ultimately married to each other, in the sense that six months from now management comes back and says "Function Y in Application X is running unacceptably slowly; how do we speed it up?" And all too often the answer is to upgrade the back-end solution. And when it comes to upgrading back ends, it's almost always going to be less expensive to scale out than to scale up in terms of hardware.
So, long story short, I would recommend building a middleware app that specifically handles incoming requests from the user app and then routes them to the appropriate destination. This will sufficiently abstract your front-end user app from the back end storage solution so that when scalability does become an issue only the middleware app will need to be updated.
This is straightforward. Write the application to an interface, use some kind of factory mechanism to supply that interface, and implement that interface however you want.
Once you're happy with your interface, then the application is (mostly) isolated from the implementation, whether it's talking straight to a DB or to some other component.
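As a rough illustration (in Python for brevity, though the same shape applies in C# or any other language), the interface-plus-factory arrangement might look like this. The schema, the sqlite3-style connection object and the config keys are placeholders, not a recommendation for your actual design.

```python
from abc import ABC, abstractmethod

class DocumentStore(ABC):
    """The contract the application codes against; the backing store can change."""
    @abstractmethod
    def get(self, doc_id: str) -> bytes: ...

    @abstractmethod
    def put(self, doc_id: str, policy_number: str, data: bytes) -> None: ...

class SqlServerStore(DocumentStore):
    """'It works here, it works now' implementation: image bytes straight into the DB."""
    def __init__(self, connection):
        self.conn = connection   # any DB-API style connection

    def get(self, doc_id):
        row = self.conn.execute(
            "SELECT image FROM documents WHERE doc_id = ?", (doc_id,)).fetchone()
        return row[0]

    def put(self, doc_id, policy_number, data):
        self.conn.execute(
            "INSERT INTO documents (doc_id, policy_number, image) VALUES (?, ?, ?)",
            (doc_id, policy_number, data))

def make_store(config) -> DocumentStore:
    # Factory: swap implementations (file system, middleware service, ...)
    # without touching the calling code.
    if config["backend"] == "sqlserver":
        return SqlServerStore(config["connection"])
    raise ValueError(f"unknown backend {config['backend']!r}")
```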
Thinking ahead a bit on your interface design, but doing bone-stupid "it's simple, it works here, it works now" implementations, offers a good balance of future-proofing the system while not necessarily over-engineering it.
It's easy to argue you don't even need an interface at this juncture, rather just a simple class that you instantiate. But if your contract is well defined (i.e. the interface or class signature), that is what protects you from change (such as redoing the back end implementation). You can always replace the class with an interface later if you find it necessary.
As far as scalability, test it. Then you know not only if you may need to scale, but perhaps when as well. "Works great for 100 users, problematic for 200, if we hit 150 we might want to consider taking another look at the back end, but it's good for now."
That's due diligence and a responsible design tactic, IMHO.
I agree with gabriel1836. However, an added benefit would be that you could run a hybrid system for a time, since you aren't going to convert 14 million documents from your proprietary system to your home-grown system overnight.
Also, I would strongly encourage you to store the documents outside of a database. Store them on a file system (local, SAN, NAS, it doesn't matter) and store pointers to the documents in the database.
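A minimal sketch of the "files on disk, pointers in the database" layout, with an invented storage path and schema, and SQLite standing in for the real index database:

```python
import os
import sqlite3

IMAGE_ROOT = "/mnt/image_store"         # local disk, SAN or NAS mount (assumed)

db = sqlite3.connect("index.db")        # stand-in for the real SQL Server index
db.execute("""CREATE TABLE IF NOT EXISTS documents (
                  doc_id TEXT PRIMARY KEY,
                  policy_number TEXT,
                  path TEXT)""")

def store_document(doc_id: str, policy_number: str, data: bytes) -> None:
    # Fan files out into subdirectories so no single folder grows unbounded.
    subdir = os.path.join(IMAGE_ROOT, doc_id[:2])
    os.makedirs(subdir, exist_ok=True)
    path = os.path.join(subdir, f"{doc_id}.tif")
    with open(path, "wb") as f:
        f.write(data)                   # the image bytes live on the file system
    db.execute("INSERT INTO documents VALUES (?, ?, ?)",
               (doc_id, policy_number, path))   # only the pointer goes in the DB
    db.commit()
```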
I'd love to know what document management system you are using now.
Also, don't underestimate the effort of replacing the capture (scanning and importing) provided by the proprietary system.