Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 9 years ago.
There is no doubt that a DBMS plays a vital role in a developer's life today: it is an easy way to retrieve data, particularly when we don't require JOINs. But ease of use aside, which is faster: files or databases?
It depends on the situation. One might conclude that a filesystem is faster, under the belief that the DBMS must use a filesystem to store its data and therefore only adds an extra layer. That is not strictly true, as some DBMSs (e.g. Oracle) implement and can use their own filesystem. One might also conclude that a filesystem is faster, under the belief that file I/O calls (e.g. fread() and fwrite()) have less overhead than a SQL call (e.g. SELECT *). That is not strictly true either, as reading and joining multiple files by hand may be less efficient than the DBMS's own implementation of data storage (e.g. a B-tree in a file).
The only way to know is to choose a scenario and benchmark it. As with any design, one must balance the tradeoffs: complexity of a DBMS vs ease of filesystem, ease of DBMS selection vs complexity of filesystem reads, etc.
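As a rough illustration of "choose a scenario and benchmark it", here is a minimal sketch for one scenario: point lookups of small records stored either as one file per key or as rows in a SQLite table. The scenario, the record layout, and every function name are assumptions made for this example, and the timings will vary with hardware, OS caching, and the DBMS used.

```python
import os
import sqlite3
import tempfile
import time


def build_fixtures(directory, n_records):
    """Write the same records as one file per key and as one SQLite table."""
    files_dir = os.path.join(directory, "files")
    os.makedirs(files_dir)
    conn = sqlite3.connect(os.path.join(directory, "records.db"))
    conn.execute("CREATE TABLE records (key INTEGER PRIMARY KEY, value TEXT)")
    for key in range(n_records):
        value = f"value-{key}"
        with open(os.path.join(files_dir, f"{key}.txt"), "w") as handle:
            handle.write(value)
        conn.execute("INSERT INTO records VALUES (?, ?)", (key, value))
    conn.commit()
    return files_dir, conn


def read_from_files(files_dir, keys):
    """Look up each key by opening and reading its file."""
    values = []
    for key in keys:
        with open(os.path.join(files_dir, f"{key}.txt")) as handle:
            values.append(handle.read())
    return values


def read_from_db(conn, keys):
    """Look up each key with a primary-key SELECT."""
    query = "SELECT value FROM records WHERE key = ?"
    return [conn.execute(query, (key,)).fetchone()[0] for key in keys]


def timed(func, *args):
    """Run func(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def run_benchmark(n_records=500):
    """Build both layouts in a temp directory and time the same lookups."""
    with tempfile.TemporaryDirectory() as directory:
        files_dir, conn = build_fixtures(directory, n_records)
        keys = list(range(n_records))
        file_values, file_seconds = timed(read_from_files, files_dir, keys)
        db_values, db_seconds = timed(read_from_db, conn, keys)
        conn.close()
    return file_values, db_values, file_seconds, db_seconds


if __name__ == "__main__":
    _, _, file_seconds, db_seconds = run_benchmark()
    print(f"files: {file_seconds:.4f}s  sqlite: {db_seconds:.4f}s")
```

A single run like this says little on its own; repeating it with your real access pattern (joins, range scans, concurrent writers) is what makes the comparison meaningful.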
Closed. This question is opinion-based. It is not currently accepting answers.
Closed 6 months ago.
I'm developing a web backend with two modules. One handles a relatively small amount of data that doesn't change often. The other handles real-time data that's constantly being dumped into the database and never gets changed or deleted. I'm not sure whether to have separate databases for each module or just one.
The data between the modules is interconnected quite a bit, so it's a lot more convenient to have it in a single database.
But if anything fails, I need the first database to be available for reads as soon as possible, and the second one can wait.
Also I'm not sure how much performance impact the constantly growing large database would have on the first one.
I'd like to make dumps of the data available to the public, and I don't want users downloading gigabytes that they don't need.
And if I decide to use a single one, how easy is it to separate them later? I use Postgres, btw.
Sounds like you have a website with its content being the first DB, and some kind of analytics being the second DB.
It makes sense to separate those physically (as in, on different servers), especially if one of them is required to be available as much as possible. Separating mission-critical parts from something less important is good design. Also, a smaller DB means shorter recovery times from a backup, if such a need arises.
For the data that is interconnected, if you need remote lookup from one DB into another, Foreign Data Wrappers may help.
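To make the Foreign Data Wrapper suggestion more concrete, here is a minimal sketch that builds the SQL statements one would typically run on the first database to query tables living in the second one through postgres_fdw. The server name, schema name, host, and credentials are all hypothetical placeholders, and the helper only produces the statements; it does not connect to anything.

```python
def sql_literal(value):
    """Quote a value as a SQL string literal by doubling single quotes."""
    return "'" + str(value).replace("'", "''") + "'"


def fdw_setup_statements(server, host, port, dbname, user, password,
                         remote_schema="public", local_schema="remote_data"):
    """Return the postgres_fdw setup statements as a list of strings.

    Identifiers (server and schema names) are inserted unquoted, so this
    sketch assumes they are plain lowercase names.
    """
    return [
        "CREATE EXTENSION IF NOT EXISTS postgres_fdw;",
        f"CREATE SERVER {server} FOREIGN DATA WRAPPER postgres_fdw "
        f"OPTIONS (host {sql_literal(host)}, port {sql_literal(port)}, "
        f"dbname {sql_literal(dbname)});",
        f"CREATE USER MAPPING FOR CURRENT_USER SERVER {server} "
        f"OPTIONS (user {sql_literal(user)}, "
        f"password {sql_literal(password)});",
        f"CREATE SCHEMA IF NOT EXISTS {local_schema};",
        f"IMPORT FOREIGN SCHEMA {remote_schema} FROM SERVER {server} "
        f"INTO {local_schema};",
    ]


if __name__ == "__main__":
    # All connection details below are made-up examples.
    for statement in fdw_setup_statements(
        server="realtime_srv",
        host="realtime-db.internal",
        port=5432,
        dbname="realtime",
        user="reader",
        password="change-me",
    ):
        print(statement)
```

Once the foreign schema is imported, the remote tables can be joined with local ones in ordinary queries, at the cost of a network round trip to the other server.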
Closed. This question is opinion-based. It is not currently accepting answers.
Closed 2 years ago.
I have the following scenario; others may have different ones. How should we decide between Redis as a persistent primary database and Elasticsearch?
In a microservice, the database gets far more read requests than write requests. My data will also have only 8-10 columns, or keys in JSON terms (a simple data structure).
If my database hardly gets any write requests compared with read requests, why shouldn't we use Redis as a persistent database? I went through the official Redis documentation and found an article on why to use it as a persistent database [Goodbye Cache: Redis as a Primary Database].
But I'm still not fully convinced that I should use it as a primary database.
The answer would depend on your application and what it does internally. But assuming you don't need particularly complicated queries to get the data (no complex filtering, for example) and you can fit all your information in memory, I see Redis as a completely valid alternative to a traditional database.
If you want the strongest possible guarantees Redis can offer, you'd want to enable both RDB and AOF persistence options (read https://redis.io/topics/persistence).
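For intuition about what the two persistence options protect against, here is a toy model of the two ideas behind them: a point-in-time snapshot (loosely analogous to RDB) and an append-only log of writes (loosely analogous to AOF). This is only a sketch under my own assumptions; the class, the file names, and the recovery order are invented for the example, and Redis's actual file formats and recovery procedure differ.

```python
import json
import os
import tempfile


class ToyStore:
    """In-memory dict persisted by a snapshot file plus an append-only log."""

    def __init__(self, directory):
        self.snapshot_path = os.path.join(directory, "snapshot.json")
        self.log_path = os.path.join(directory, "appendonly.log")
        self.data = {}
        self._recover()

    def _recover(self):
        # Load the last snapshot first, then replay the logged writes.
        if os.path.exists(self.snapshot_path):
            with open(self.snapshot_path) as handle:
                self.data = json.load(handle)
        if os.path.exists(self.log_path):
            with open(self.log_path) as handle:
                for line in handle:
                    key, value = json.loads(line)
                    self.data[key] = value

    def set(self, key, value):
        # Log the write and force it to disk before applying it in memory.
        with open(self.log_path, "a") as handle:
            handle.write(json.dumps([key, value]) + "\n")
            handle.flush()
            os.fsync(handle.fileno())
        self.data[key] = value

    def snapshot(self):
        # Write the snapshot to a temp file and rename it atomically,
        # then empty the log, whose writes the snapshot now contains.
        temp_path = self.snapshot_path + ".tmp"
        with open(temp_path, "w") as handle:
            json.dump(self.data, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(temp_path, self.snapshot_path)
        open(self.log_path, "w").close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        store = ToyStore(directory)
        store.set("a", 1)
        store.snapshot()
        store.set("b", 2)  # Only in the log, not in the snapshot.
        recovered = ToyStore(directory)  # Simulates a restart.
        print(recovered.data)
```

The snapshot alone would lose every write made since it was taken, while the log alone grows without bound; using both is what bounds the data loss and the recovery time in this toy model.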
The big advantage of a set-up like this is you can trust Redis to improve the throughput of the application, and maintain a very good level of performance over time, even with a growing dataset.
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 3 years ago.
We have a web service that needs a somewhat POSIX-compatible shared filesystem for the application servers (multiple redundant systems running in parallel behind redundant load balancers). We're currently running GlusterFS as the shared filesystem for the application servers, but I'm not happy with the performance of the system. Compared to the actual raw performance of the storage servers running GlusterFS, it starts to look more sensible to run DRBD and a single NFS server, with all the other GlusterFS servers (currently 3 servers) waiting in a hot-standby role.
Our workload is highly read-oriented and usually deals with small files, and I'd be happy to use an "eventually consistent" system as long as a client can request a sync for a single file if needed (that is, the client is prepared to wait until the file has been successfully stored in the backend storage). I'd even accept a system where such a "sync" requires querying the state of the file in some way other than POSIX fdatasync(). File metadata such as modification times is not important, only the filename and the contents.
I'm currently aware of possible candidates and the problems each one currently has:
GlusterFS: overall performance is pretty poor in practice, performance goes down while adding new servers/bricks.
Ceph: highly complex to configure/administrate, POSIX compatibility sacrifices performance a lot as far as I know.
MooseFS: partially obfuscated open source (huge dumps of internally written code published seldom, with intentionally lost patch history), documentation leaves a lot to be desired.
SeaweedFS: pretty simple design and supposedly high performance, future of this project is unclear because pretty much all code is written and maintained by Chris Lu - what happens if he no longer writes any code? Unclear if the "Filer" component supports no single point of failure.
I know that the CAP theorem prevents a system from ever being both truly consistent and always available. Is there any good distributed file system where writes must be durable, but read performance is really good and the system has no single point of failure?
I am Chris Lu working on SeaweedFS. There are plans to commercialize it. (By adding more advanced features.)
The filer does not have a single point of failure; you can have multiple filer instances. The filer store can be any key-value store. If you need no SPOF, you can use Cassandra, Redis Cluster, CockroachDB, TiDB, or etcd. Or you can add your own key-value store option, which is pretty easy.
Closed. This question is opinion-based. It is not currently accepting answers.
Closed 8 years ago.
I'm just a little bit confused about this concept.
I hear the words "distributed system" a lot, but I'm not really sure whether my setup is a kind of distributed system.
Basically, we have a master server (a very big one) as the front-line production server.
Then, in order to reduce the load on the master server (so it isn't crushed by a ton of tasks), we put all kinds of jobs onto different small servers.
These small servers communicate with the master server, pulling and pushing processed data between each other.
But whenever I hear "distributed system" I get really frightened. It feels so big to me, and I don't really know whether my job is related to it or not.
From your short description it sounds like you have a bona fide distributed system.
From our good friend Wikipedia:
A distributed system is a software system in which components located on networked computers communicate and coordinate their actions by passing messages.[1] The components interact with each other in order to achieve a common goal. Three significant characteristics of distributed systems are: concurrency of components, lack of a global clock, and independent failure of components.[1] Examples of distributed systems vary from SOA-based systems to massively multiplayer online games to peer-to-peer applications.
I think you fit the description because:
1) A common goal is being done by different servers.
2) The servers are communicating with each other by passing messages.
That second one is pretty important for multiple reasons. Besides the benefits you get from having these servers communicating with each other, it also means that as an engineer you are tackling the traditional problems that people in the distributed systems field handle. It exposes you to these problems and while you might not feel like you are in the field or you might not use the same jargon, you will be presented with the same problems and solutions.
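The master/worker arrangement described in the question can be sketched in a few lines. This is a single-process simulation under my own assumptions: threads stand in for the small servers, in-memory queues stand in for the network, and the job (squaring a number) is a placeholder. All names are hypothetical.

```python
import queue
import threading


def worker(tasks, results):
    """Pull task messages until a None sentinel arrives, push back results."""
    while True:
        message = tasks.get()
        if message is None:
            break
        task_id, number = message
        results.put((task_id, number * number))


def run_master(numbers, n_workers=3):
    """Send one task message per number and collect the result messages."""
    tasks = queue.Queue()
    results = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(tasks, results))
        for _ in range(n_workers)
    ]
    for thread in threads:
        thread.start()
    for task_id, number in enumerate(numbers):
        tasks.put((task_id, number))
    for _ in threads:
        tasks.put(None)  # One shutdown sentinel per worker.
    for thread in threads:
        thread.join()
    # Results arrive in no fixed order, so reassemble them by task id.
    collected = dict(results.get() for _ in numbers)
    return [collected[task_id] for task_id in range(len(numbers))]


if __name__ == "__main__":
    print(run_master([1, 2, 3, 4]))
```

Even this toy version shows two of the classic concerns: results come back out of order, and the master has to decide how to tell workers to stop. On real networked servers you would also have to handle a worker failing independently of the master.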
Closed. This question is opinion-based. It is not currently accepting answers.
Closed 9 years ago.
Let's use the example of a Human Resources database. The transactional database that the HR personnel use on a day-to-day basis handles all of the hiring and firing that takes place on a daily basis. There is also a Dimensional Data Warehouse that pulls from that transactional database.
Assuming that latency is sufficiently low, which of the following arguments would be considered "best practice"?
1) The transactional database should only have to keep track of how the data currently is. It shouldn't have to keep track of slowly changing data (for example, the history of which managers a specific employee has had, how their salary has evolved over time, etc.). The ETL process should detect changes in the transactional database and update slowly changing dimensions in the data warehouse.
2) The transactional database is more than capable of tracking its own historical information. If something were ever to change twice between ETL sessions, you would lose the first change forever. The main purpose of the dimensional database is efficient query performance in reports, so it is still doing its job. This also allows the ETL process to be faster and simpler.
I feel like both arguments have merits, and if they are both valid arguments, I am happy to simply choose between them.
Am I missing something that isn't being taken into consideration?
Is one of the arguments flat-out wrong?
I think what @marek-grzenkowicz said is correct. If the business requirements of the HR transactional/operational system state that a history of changes is required, then that history belongs in the transactional/operational system. Likewise, if the business requirements state that this history (or perhaps a history at a different level of granularity) is required for reporting, the warehouse would store it as well. It is possible that the histories may be stored in both systems.
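The concern raised in the question's second argument (a value changing twice between ETL sessions) is easy to reproduce in a small sketch. It assumes the ETL process detects changes by comparing snapshots of the current state, which is only one possible approach; the class, the function, and the employee names are all made up for the example.

```python
def diff_snapshots(previous, current):
    """Return {employee: (old, new)} for values that differ between snapshots."""
    return {
        employee: (previous.get(employee), value)
        for employee, value in current.items()
        if previous.get(employee) != value
    }


class HrDatabase:
    """Toy transactional store that also keeps its own change history."""

    def __init__(self):
        self.current = {}
        self.history = []

    def set_manager(self, employee, manager):
        # Record (employee, old manager, new manager) before overwriting.
        self.history.append((employee, self.current.get(employee), manager))
        self.current[employee] = manager


hr = HrDatabase()
hr.set_manager("alice", "bob")
snapshot_at_first_etl = dict(hr.current)

# Two changes happen before the next ETL run.
hr.set_manager("alice", "carol")
hr.set_manager("alice", "dave")
snapshot_at_second_etl = dict(hr.current)

# The ETL sees only bob -> dave; the time under carol is invisible to it.
seen_by_etl = diff_snapshots(snapshot_at_first_etl, snapshot_at_second_etl)

if __name__ == "__main__":
    print("ETL diff:", seen_by_etl)
    print("Source history:", hr.history)
```

If the warehouse must show the intermediate manager, the history has to be captured at the source in some form, whether as a history table like the one above or through another change-capture mechanism.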
I too recommend "The Data Warehouse Toolkit". I'm reading it now and it seems to have a lot of time- and field-tested design patterns for modeling your data. The 3rd edition of this book just came out a couple weeks ago.