For an upcoming project we are looking for a decentralized data storage solution. Unfortunately IPFS is not suitable for our needs, since one has to upload/download entire files even when making minor changes. We therefore need some kind of "decentralized RDBMS" where it is possible to secure updates (for example to tables, rows, columns, etc.) via blockchain/DLT.
Is something like this currently available?
In many sub-system designs for messaging applications (Twitter, Facebook, etc.) I notice duplication of where user message history is stored. On the one hand they use a tokenizing indexer like Elasticsearch or Solr, which is good for search. On the other hand they still use some sort of DB for history. Why duplicate? Why can't the same instance of ES/Solr/EarlyBird be used for history? It is in fact able to do it.
The usual problem is the following: you want to search, and ideally you also want to be able to re-index the data in a different manner (e.g. wipe the index and try a new awesome analyzer that you forgot to include initially). Separating the data source and the index from each other makes the system less coupled. You're not afraid that you will lose data in Elasticsearch/Solr.
I am usually strongly against calling Elasticsearch/Solr a database, since in fact it's not. For example, neither of them has support for transactions, which makes your life harder if you want to update multiple documents following standard relational logic.
Last, but not least: one of the hardest operations in Elasticsearch/Solr is to retrieve stored values, since it's not well optimised to do so, especially if you want to return 10k documents at once. In this case a separate data source would also help, since you can return only the matched document ids from Elasticsearch/Solr and later retrieve the needed content from the data source and return it to the user, as in the sketch below.
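To make that concrete, here is a minimal sketch of the "ids from the index, content from the source of truth" pattern. It calls Elasticsearch's REST _search API over plain HTTP and then hits a relational store; the messages index, the Messages table and the connection string are all hypothetical:

    using System;
    using System.Data.SqlClient;
    using System.Linq;
    using System.Net.Http;
    using System.Text;
    using System.Text.Json;
    using System.Threading.Tasks;

    class SearchThenFetch
    {
        static async Task Main()
        {
            // Step 1: ask Elasticsearch for matching ids only ("_source": false),
            // instead of making it return the stored documents.
            var http = new HttpClient();
            var query = @"{ ""size"": 100, ""_source"": false,
                            ""query"": { ""match"": { ""body"": ""hello"" } } }";
            var response = await http.PostAsync(
                "http://localhost:9200/messages/_search", // hypothetical index
                new StringContent(query, Encoding.UTF8, "application/json"));

            using var doc = JsonDocument.Parse(await response.Content.ReadAsStringAsync());
            var ids = doc.RootElement.GetProperty("hits").GetProperty("hits")
                         .EnumerateArray()
                         .Select(h => h.GetProperty("_id").GetString())
                         .ToList();
            if (ids.Count == 0) return;

            // Step 2: fetch the full content from the primary data source by id.
            using var conn = new SqlConnection("...connection string...");
            conn.Open();
            var inList = string.Join(",", ids.Select((_, i) => "@p" + i));
            using var cmd = new SqlCommand(
                "SELECT Id, Body FROM Messages WHERE Id IN (" + inList + ")", conn); // hypothetical table
            for (int i = 0; i < ids.Count; i++)
                cmd.Parameters.AddWithValue("@p" + i, ids[i]);
            using var reader = cmd.ExecuteReader();
            while (reader.Read())
                Console.WriteLine(reader["Id"] + ": " + reader["Body"]);
        }
    }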
The summary is simple: Elasticsearch/Solr should be thought of more as search engines, not as data storage.
True that ES is NOT a database per se and will never be. But no one says you cannot use it as such, and many people actually do. It really depends on your specific use case(s), and in the end it's all a question of the trade-offs you are ready to make to support your specific needs. As with pretty much any technology in general, there is no one-size-fits-all approach and with ES (and the like) it's no different.
A primary source of truth might not necessarily be a relational DBMS, and it is not necessarily "duplicating" the data in the sense that you meant; it can be anything that has a copy of your data and allows you to rebuild your ES indexes in case something goes wrong. I've seen many, many different "sources of truth". It could simply be:
your raw flat files containing your historical logs or business data
Kafka topics that you can replay anytime easily
a snapshot that you take from ES on a regular basis
a relational DB
you name it...
The point is that if something goes wrong for any reason (and that happens), you want to be able to recreate your ES indexes, be it from a real DB, from backups or from raw data. You should see that as a safety net. Even if all you have is a MySQL DB, you usually have a backup of it, so you're already "duplicating" the data in some way.
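As a hedged illustration of that safety net, here is a small sketch that rebuilds an index by replaying rows from a relational source of truth through Elasticsearch's newline-delimited _bulk endpoint; the messages index and Messages table are hypothetical:

    using System;
    using System.Data.SqlClient;
    using System.Net.Http;
    using System.Text;
    using System.Text.Json;
    using System.Threading.Tasks;

    class Reindexer
    {
        static async Task Main()
        {
            // Replay every row from the source of truth...
            using var conn = new SqlConnection("...connection string...");
            conn.Open();
            using var cmd = new SqlCommand("SELECT Id, Body FROM Messages", conn); // hypothetical table
            using var reader = cmd.ExecuteReader();

            var bulk = new StringBuilder();
            while (reader.Read())
            {
                string id = reader["Id"].ToString();
                string body = JsonSerializer.Serialize(reader["Body"].ToString()); // JSON-escape
                bulk.AppendLine("{ \"index\": { \"_index\": \"messages\", \"_id\": \"" + id + "\" } }");
                bulk.AppendLine("{ \"body\": " + body + " }");
            }

            // ...into the _bulk endpoint. A real job would send this in
            // batches of a few thousand lines, not all at once.
            var http = new HttpClient();
            var response = await http.PostAsync(
                "http://localhost:9200/_bulk",
                new StringContent(bulk.ToString(), Encoding.UTF8, "application/x-ndjson"));
            Console.WriteLine(await response.Content.ReadAsStringAsync());
        }
    }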
One thing that you need to think of, though, when architecting your system, is that you might not necessarily need to have the entirety of your data in ES. Since ES is a search and analytics engine, you should only store in there what is necessary to support your search and analytics needs, and you should be able to recreate that information anytime. In the end, ES is just a subsystem of your whole architecture, just like your DB, your messaging queue or your web server.
Also worth reading: Using Elasticsearch as primary source for part of my DB
Can anyone suggest a database solution for storing large documents which will have multiple branched revisions? Partial edits of content should be possible without having to update the entire document.
I was looking at XML databases and wondering about their suitability, or maybe even using a DVCS (like Mercurial).
It should preferably have Python bindings.
Try Fossil -- it has a good delta-encoding algorithm and keeps all versions. It's backed by a single SQLite database, and has both a web-based and a command-line UI.
This depends on your storage behavior and use case. If you plan to store a massive number of "document revisions" and keep historical versions, and can comply with a write-once-read-many pattern, you should look into something like Hadoop HDFS. This requires a lot of (cheap) infrastructure to run your cluster, but you will be able to keep adding revisions/data over time and will be able to quickly look it up using a MapReduce algorithm.
We've been discussing the design of a data warehouse strategy within our group for meeting testing, reproducibility, and data syncing requirements. One of the suggested ideas is to adopt a NoSQL approach using an existing tool rather than try to re-implement a whole lot of the same on a file system. I don't know if a NoSQL approach is even the best approach to what we're trying to accomplish, but perhaps if I describe what we need/want, you all can help.
Most of our files are large, 50+ GB in size, held in a proprietary, third-party format. We need to be able to access each file by a name/date/source/time/artifact combination; essentially a key-value-pair style look-up.
When we query for a file, we don't want to have to load all of it into memory. They're really too large and would swamp our server. We want to be able to somehow get a reference to the file and then use a proprietary, third-party API to ingest portions of it.
We want to easily add, remove, and export files from storage.
We'd like to set up automatic file replication between two servers (we can write a script for this.) That is, sync the contents of one server with another. We don't need a distributed system where it only appears as if we have one server. We'd like complete replication.
We also have other smaller files that have a tree-type relationship with the big files. One file's content will point to the next, and so on. It's not a "spoked wheel"; it's a full-blown tree.
We'd prefer a Python, C or C++ API to work with a system like this, but most of us are experienced with a variety of languages. We don't mind as long as it works, gets the job done, and saves us time. What do you think? Is there something out there like this?
Have you had a look at MongoDB's GridFS?
http://www.mongodb.org/display/DOCS/GridFS+Specification
You can query files by the default metadata, plus your own additional metadata. Files are broken into small chunks and you can specify which portions you want. Also, files are stored in a collection (similar to an RDBMS table) and you get Mongo's replication features to boot.
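For illustration, a short sketch of reading a slice of a large file through the current MongoDB C# driver's GridFS API (as I understand it); the database name, file name and offsets are made up:

    using System.IO;
    using MongoDB.Driver;
    using MongoDB.Driver.GridFS;

    class GridFSSlice
    {
        static void Main()
        {
            var client = new MongoClient("mongodb://localhost:27017");
            var bucket = new GridFSBucket(client.GetDatabase("filestore")); // hypothetical db

            // Seekable = true lets you jump around the file; only the
            // chunks you actually read are pulled from the server.
            var options = new GridFSDownloadByNameOptions { Seekable = true };
            using (var stream = bucket.OpenDownloadStreamByName("bigfile.dat", options))
            {
                stream.Seek(1024L * 1024 * 1024, SeekOrigin.Begin); // skip 1 GB in
                var buffer = new byte[64 * 1024];
                int read = stream.Read(buffer, 0, buffer.Length);   // one 64 KB slice
            }
        }
    }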
What's wrong with a proven cluster file system? Lustre and Ceph are good candidates.
If you're looking for an object store, Hadoop was built with this in mind. In my experience Hadoop is a pain to work with and maintain.
For me, both Lustre and Ceph have some problems that databases like Cassandra don't have. I think the core question here is what disadvantages Cassandra and other databases like it would have as an FS backend.
Performance could obviously be one. What about space usage? Consistency?
This is a design question I'm facing. I have a collection of 1,500 images to be displayed on an ASP.NET page; the images displayed differ from one page to another, and the count will grow over time.
a.) Is it a good idea to store the images in the database? The round-trip time to fetch the images from the database might be high.
b.) Is it better to keep all the images in a directory, with a virtual file system over it, so the application accesses the images from the directory?
Is there any particular design strategy in a traditional database for fetching images with the least round-trip time? Does any solution other than a traditional database exist?
Edit 1:
Each image is replaced by a new version every 12 hours, so keeping them in the database might not be a good idea as far as I can tell. But how much better would it be to use a data store and index these images?
Edit 2:
Yes, we are planning to run the application on a cluster. And if we try to use a data store (if that is a good option to go with), is it compatible with C# and ASP.NET?
PS: I use SQL Server to store these images.
All of the previous comments are really good... In the absence of very specific requirements we have to make broad generalizations to illustrate your options. Here are a few examples.
If raw speed is what you need, then flat files are a clear winner. Whether you're using Apache or IIS, both are optimized to serve static file-based content very fast. High-performance sites all know this, and will store much of their content with dynamic handling in mind but then "publish" select pieces of that dynamic content, in static versions, to their web farm on a periodic or event-driven basis. Although it requires a bit of orchestration, it can be done cheaply and can really reduce the load on your database server, backend network, etc. As a simple example, publish to a folder whose root is resolved dynamically: when you're ready to publish updates, write a new folder and then change the root path. No downtime, cheap, and easy; see the sketch below.
On a related note, pulling all this information from a backend store requires you to load it into memory. That ultimately translates to more time in garbage collection, and consequently a slower application, even if you're using multi-processor/core web gardening.
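A hedged sketch of that publish-and-repoint idea (the folder layout and pointer-file convention are mine, not an Apache/IIS feature):

    using System.IO;

    public static class StaticPublisher
    {
        // Write the new snapshot to a fresh folder, then repoint a tiny
        // "current" pointer file that the serving code reads per request.
        public static void Publish(string newReleaseDir, string pointerFile)
        {
            Directory.CreateDirectory(newReleaseDir);
            // ... copy/render the fresh static files into newReleaseDir ...

            // Swapping the pointer is a single small write, so readers flip
            // from the old content tree to the new one with no downtime.
            File.WriteAllText(pointerFile, newReleaseDir);
        }
    }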
If you need fine-grained control over how images are organized/exposed, then folders may not be the most appropriate. If, for example, you need to organize an individual user's images and you need to track a lot of metadata around the images, then a database may be a good fit. With that said, your database team will probably hate you for it, because this presents a number of challenges from a database-management perspective. Additionally, if you're using an ORM you may have some trouble making this work, and may find your memory footprint grows to unacceptable levels due to hidden proxy objects, second-level caching, etc. This can all be mitigated, so just watch out and make sure you profile your application. All in all, a structured store (like a DB) is more ideal for this use case.
Considering security... Depending on what these images represent, flat files inevitably lead to concerns about canonicalization attacks, brute-force enumeration of browsable folder structures, replayed cookies, URLs, viewstate, etc. If you're using a custom role-based or claims-based security model, you may find using flat files becomes somewhat painful, since you'll have to map filesystem security constraints to logical/contextual security constraints. Issues like these often lead me to favor a structured store.
The aforementioned cache idea is a good one and can help create some middle ground with respect to how often you actually hit your database, although it will not help with concerns related to memory consumption, GC, etc. You could employ built-in caching mechanisms, although a cache/grid that supports backing stores would be much better if you can afford it (e.g. NCache, ScaleOut, etc.). These provide nice scalability/redundancy, and can also be used to offload storage of session state, viewstate, and a lot more.
Hope this helps.
You basically have two options:
1) Store the binary in the database. A VARBINARY(MAX) field would be a good choice of datatype.
2) Store the path to the image on disk in the database. NVARCHAR(MAX) would be a good choice of datatype.
There are of course pros and cons to both solutions. Without knowing more about your requirements, it's hard to advise which is the best way.
I prefer not to store images in the database; instead just store a link (path/filename/id, etc.) to the correct image.
Then if you implement an HttpHandler to serve up the images, you can store them in whatever location you like. Here's a very basic implementation:
using System.IO;
using System.Web;

public class MyPhotoHandler : IHttpHandler
{
    public bool IsReusable
    {
        get { return true; }
    }

    public void ProcessRequest(HttpContext context)
    {
        if (context.User.Identity.IsAuthenticated)
        {
            // QueryString is indexed with [], not (), in C#
            var filename = context.Request.QueryString["f"] ?? String.Empty;

            // Path.GetFileName strips any directory component, preventing
            // "../" path traversal out of the photos folder
            filename = Path.GetFileName(filename);

            string completePath = context.Server.MapPath(
                string.Format("~/App_Data/Photos/{0}", filename));

            context.Response.ContentType = "image/jpeg";
            context.Response.WriteFile(completePath);
        }
    }
}
For a great resource on setting up a handler check out this blog post and the other related posts.
I wouldn't overcomplicate your solution. The downside to storing images in a database is database bloat and the storage requirements for backups, especially if the images are only good for 12 hours. If in doubt, keep it simple, so when requirements change you haven't invested that much time anyway.
This is how I did it on my site.
Store the images in a folder.
If you want control over the filename the user sees, or to employ conditional logic, use an HttpHandler to serve up the image; otherwise just use its full path and filename in an img tag.
If you are talking high-volume mega site, perhaps consider using a content delivery network.
Have you considered caching your images to mitigate the round-trip time to SQL server? Caching might be appropriate at the browser (via HTTP Headers) and/or the HTTP handler serving the image (via System.Web.Caching).
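To illustrate both layers, here is a hedged variant of the image handler that sets browser cache headers and keeps images in the ASP.NET server cache; LoadImageFromSqlServer is a hypothetical data-access helper you would implement:

    using System;
    using System.Web;
    using System.Web.Caching;

    public class CachedPhotoHandler : IHttpHandler
    {
        public bool IsReusable { get { return true; } }

        public void ProcessRequest(HttpContext context)
        {
            string filename = context.Request.QueryString["f"] ?? String.Empty;

            // Browser/proxy caching: safe here because each image only
            // changes every 12 hours.
            context.Response.Cache.SetCacheability(HttpCacheability.Public);
            context.Response.Cache.SetMaxAge(TimeSpan.FromHours(12));

            // Server-side caching: skip the SQL Server round trip for hot images.
            byte[] image = (byte[])context.Cache["img:" + filename];
            if (image == null)
            {
                image = LoadImageFromSqlServer(filename); // hypothetical helper
                context.Cache.Insert("img:" + filename, image, null,
                    DateTime.UtcNow.AddHours(12), Cache.NoSlidingExpiration);
            }

            context.Response.ContentType = "image/jpeg";
            context.Response.BinaryWrite(image);
        }

        private static byte[] LoadImageFromSqlServer(string filename)
        {
            // Placeholder: read the VARBINARY(MAX) column for this image.
            throw new NotImplementedException();
        }
    }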
Storing images in SQL Server can be convenient because you don't have to worry about maintaining pointers to the file system. However, the size of your database will obviously be much larger, which can make backups and maintenance more complex. You might consider using a different filegroup within your database for your image tables, or a separate database altogether, so that you can maintain row data separately from image data.
Using SQL server also means you'll have easy options for concurrency control, partitioning and replication, should they be appropriate to your application.
I'm writing a CAD (Computer-Aided Design) application. I'll need to ship a library of 3d objects with this product. These are simple objects made up of nothing more than 3d coordinates and there are going to be no more than about 300 of them.
I'm considering using a relational database for this purpose. But given my simple needs, I don't want anything complicated. So far, I'm leaning towards SQLite. It's small, runs within the client process and is claimed to be fast. Besides, I'm a poor guy and it's free.
But before I commit myself to SQLite, I just wish to ask your opinion whether it is a good choice given my requirements. Also is there any equivalent alternative that I should try as well before making a decision?
Edit:
I failed to mention earlier that the aforementioned CAD objects that I'll ship are not going to be immutable. I expect the user to edit them (change dimensions, colors, etc.) and save them back to the library. I also expect users to add their own newly created objects. Kindly consider this in your answers.
(Thanks for the answers so far.)
The real thing to consider is what your program does with the data. Relational databases are designed to handle complex relationships between sets of data. However, they're not designed to perform complex calculations.
Also, the amount of data and relative simplicity of it suggests to me that you could simply use a flat file to store the coordinates and read them into memory when needed. This way you can design your data structures to more closely reflect how you're going to be using this data, rather than how you're going to store it.
Many languages provide a mechanism to write data structures to a file and read them back in, called serialization. Python's pickle is one such library, and I'm sure you can find one for whatever language you use. Basically, just design your classes or data structures as dictated by how they're used by your program, and use one of these serialization libraries to populate the instances of that class or data structure.
edit: The requirement that the structures be mutable doesn't really change my answer. I still think that serialization and deserialization are the best solution to this problem. The fact that users need to be able to modify and save the structures necessitates a bit of planning to ensure that the files are updated completely and correctly, but ultimately I think you'll end up spending less time and effort with this approach than trying to marshal SQLite or another embedded database into doing this job for you.
The only case in which a database would be better is if you have a system where multiple users are interacting with and updating a central data repository, and for a case like that you'd be looking at a database server like MySQL, PostgreSQL, or SQL Server for both speed and concurrency.
You also commented that you're going to be using C# as your language. .NET has built-in support for serialization, so you should be good to go; a sketch follows below.
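As a hedged sketch of that approach (all type and member names here are made up, not from your code), the built-in XmlSerializer can persist a small library of editable objects. Writing to a temp file first and renaming protects the library from a crash mid-save, which matters once users save their edits back:

    using System.Collections.Generic;
    using System.IO;
    using System.Xml.Serialization;

    public class Point3D
    {
        public double X, Y, Z;
    }

    public class CadObject
    {
        public string Name;
        public string Color;
        public List<Point3D> Vertices = new List<Point3D>();
    }

    public static class ObjectLibrary
    {
        static readonly XmlSerializer serializer =
            new XmlSerializer(typeof(List<CadObject>));

        public static void Save(List<CadObject> library, string path)
        {
            // Write to a temp file, then swap it in, so a crash mid-write
            // never leaves a half-written library on disk.
            string tmp = path + ".tmp";
            using (var stream = File.Create(tmp))
                serializer.Serialize(stream, library);
            File.Delete(path);
            File.Move(tmp, path);
        }

        public static List<CadObject> Load(string path)
        {
            using (var stream = File.OpenRead(path))
                return (List<CadObject>)serializer.Deserialize(stream);
        }
    }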
I suggest you consider using H2; it's really lightweight and fast.
When you say you'll have a library of 300 3D objects, I'll assume you mean objects for your code, not models that users will create.
I've read that object databases are well suited to help with CAD problems, because they're perfect for chasing down long reference chains that are characteristic of complex models. Perhaps something like db4o would be useful in your context.
How many objects are you shipping? Can you define each of these objects and their coordinates in an XML file, so basically a distinct XML file for each object? You could place these XML files in a directory. This can be a simple structure.
I would not use a SQL database. You can easily describe every 3D object with an XML file. Pack these files in a directory and zip it all. If you need easy access to the objects' metadata, you can generate an index file (with only the name or description) so that not all objects must be parsed and loaded into memory (nice if you have something like a library manager); a sketch of this packing step follows below.
There are quick and easy SAX parsers available, and you can easily write an XML writer (or find some free code you can use for this).
Many similar applications use XML today. It's easy to parse/write, human-readable, and doesn't need much space when zipped.
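A small sketch of that pack-plus-index step using .NET's built-in ZipArchive (class and path names are illustrative):

    using System.IO;
    using System.IO.Compression;

    public static class LibraryPacker
    {
        // Pack per-object XML files into one zip, plus an index.txt listing
        // object names, so a library manager can browse the set without
        // unzipping and parsing every object.
        public static void Pack(string objectDir, string zipPath)
        {
            using (var zip = ZipFile.Open(zipPath, ZipArchiveMode.Create))
            {
                var index = new StringWriter();
                foreach (var file in Directory.GetFiles(objectDir, "*.xml"))
                {
                    zip.CreateEntryFromFile(file, Path.GetFileName(file));
                    index.WriteLine(Path.GetFileNameWithoutExtension(file));
                }

                var entry = zip.CreateEntry("index.txt");
                using (var writer = new StreamWriter(entry.Open()))
                    writer.Write(index.ToString());
            }
        }
    }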
I have used SQLite; it's easy to use and easy to integrate with your own objects. But I would prefer a SQL database like SQLite more for applications where you need good search tools over a huge number of data records.
For the specific requirement, i.e. providing a library of objects shipped with the application, a database system is probably not the right answer.
The first thing that springs to mind is that you probably want the file to be updatable, i.e. you need to be able to drop an updated file into the application without changing the rest of the application.
The second thing is that the data you're shipping is immutable: for this purpose, therefore, you don't need the capabilities of a relational DB, just the ability to access a particular model with adequate efficiency.
For simplicity (sort of), an XML file would do nicely, as you've got good structure. Using that as a basis, you can then choose to compress it, encrypt it, embed it as a resource in an assembly (if one were playing in .NET), etc.
Obviously SQLite stores its data in a single file per database, so if you have other reasons to need the capabilities of a DB in your storage system then yes; but I'd want to think about the utility of the DB to the app as a whole first.
SQL Server CE is free, has a small footprint (no service running), and is SQL Server compatible.