How to store information? Database vs. data structure vs. log files [closed]

Recently I came across a scenario in a question:
There are n websites with n pages each and n users visiting the sites. Each user's visit has to be saved, along with the pages he/she visited (it is not specified whether in a database or in log files, so it's up to the developer).
I decided to approach it with data structures, but when I discussed it with a friend of mine, he said we could save it in a database, and that sounded logically correct too.
So, in general, we have three ways of storing anything: log files, data structures, and databases.
Now I am really confused: when should one go with data structures, databases, or simply log files, not only for this particular scenario but in general?
What's the real difference?
I understand that this question is primarily opinion-based, but I couldn't find a concrete answer while browsing!

Log files are often write-only from the application's point of view: these files will rarely, if ever, get read back, and possibly only manually. Some file formats allow random access, letting you find a given record fairly efficiently by a single index (through binary search), but you can't (easily) have multiple indices on the data in a single file, which is a trivial task for a database. If you just want to log something for manual processing later, a log file can work fine (even if a database would work too).
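To make the "binary search by a single index" point concrete, here is a minimal sketch, assuming fixed-width records written in timestamp order to a file opened in binary mode; the record layout is made up for illustration:

```python
import struct

# 8-byte little-endian timestamp followed by a 32-byte payload per record
RECORD = struct.Struct("<q32s")

def first_at_or_after(f, ts):
    """Return the index of the first record with timestamp >= ts."""
    f.seek(0, 2)                        # jump to end of file
    n = f.tell() // RECORD.size         # total number of records
    lo, hi = 0, n
    while lo < hi:                      # classic binary search over record indices
        mid = (lo + hi) // 2
        f.seek(mid * RECORD.size)
        t, _payload = RECORD.unpack(f.read(RECORD.size))
        if t < ts:
            lo = mid + 1
        else:
            hi = mid
    return lo
```

This only works because there is exactly one sort order; a second index over the same file is where the approach falls apart and a database becomes the obvious tool.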
Databases are the industry standard, in that they provide you with persistence, efficient reading and writing, a standard interface, and redundancy (but of course they need to be set up correctly).
A pure data structure solution typically doesn't consider persistent storage, as in making sure your data is kept when the program stops running for some reason. If you do want to write to and read from persistent storage, doing so efficiently and reliably often comes with a fair bit of complexity. And multiple / complex indices are a bit of a hassle to cater for. That's not to say data structures can't be used with persistent storage: databases are built from data structures, and some data structures are specifically designed for disk reads and writes. But you don't want to be figuring this out at a low level; it's best to just let a database take care of it if you need persistence.
You could also combine data structures and databases, using the database as persistent storage and the data structure to cache the results, so you only need to do (slower) writes to the database and can do (faster) reads from the data structure. This is not uncommon in large systems with external databases. Although anything more complex than a standard map data structure is probably overcomplicating your cache and may indicate a bigger problem with your design.
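As a rough illustration of that combination, here is a sketch using Python's built-in sqlite3 as the persistent store and a plain dict as the cache; the visits table and column names are assumptions based on the website-visits scenario from the question:

```python
import sqlite3

db = sqlite3.connect("visits.db")
db.execute("CREATE TABLE IF NOT EXISTS visits (user_id TEXT, page TEXT)")
cache = {}  # user_id -> list of visited pages

def record_visit(user_id, page):
    db.execute("INSERT INTO visits (user_id, page) VALUES (?, ?)", (user_id, page))
    db.commit()                                  # slower path: persistent write
    cache.setdefault(user_id, []).append(page)   # keep the cache in sync

def pages_for(user_id):
    if user_id not in cache:                     # miss: fall back to the database
        rows = db.execute("SELECT page FROM visits WHERE user_id = ?", (user_id,))
        cache[user_id] = [page for (page,) in rows]
    return cache[user_id]
```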
What you have there sounds like an interview question, for which they may be expecting a data structure solution; simply saying "use a database" may be frowned upon. However, if it's a system design question, you'd almost certainly need to include some sort of database in your design instead of concerning yourself with data structures.

Related

Should databases be separated based on size and load? [closed]

I'm developing a web backend with two modules. One handles a relatively small amount of data that doesn't change often. The other handles real-time data that's constantly being dumped into the database and never gets changed or deleted. I'm not sure whether to have separate databases for each module or just one.
The data between the modules is interconnected quite a bit, so it's a lot more convenient to have it in a single database.
But if anything fails, I need the first database to be available for reads as soon as possible; the second one can wait.
Also, I'm not sure how much of a performance impact the constantly growing large database would have on the first one.
I'd like to make dumps of the data available to the public, and I don't want users downloading gigabytes they don't need.
And if I decide to use a single one, how easy would it be to separate them later? I use Postgres, by the way.
Sounds like you have a website with its content being the first DB, and some kind of analytics being the second DB.
It makes sense to separate those physically (as in, on different servers), especially if one of them is required to be available as much as possible. Separating mission-critical parts from parts that are not as important is good design. Also, a smaller DB means shorter recovery times from a backup, should the need arise.
For the data that is interconnected, if you need remote lookups from one DB into the other, Foreign Data Wrappers may help.
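A hypothetical sketch of what that wiring could look like, driven from Python with psycopg2; the server name, schema, and credentials are all placeholders, not values from the question:

```python
import psycopg2

# connect to the "content" database that needs to look into the analytics DB
conn = psycopg2.connect("dbname=content user=app")
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS postgres_fdw")
    cur.execute("""
        CREATE SERVER analytics_srv FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'analytics-host', dbname 'analytics')""")
    cur.execute("""
        CREATE USER MAPPING FOR CURRENT_USER SERVER analytics_srv
        OPTIONS (user 'reader', password 'secret')""")
    cur.execute("CREATE SCHEMA IF NOT EXISTS analytics")
    cur.execute("""
        IMPORT FOREIGN SCHEMA public
        FROM SERVER analytics_srv INTO analytics""")
# the content DB can now query the remote tables as analytics.<table>
```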

How do you handle collection and storage of new data in an existing system? [closed]

I am new to system design and have been asked to solve a problem.
Given a car rental service website, I need to work on a new feature.
The company has come up with some more data that they would like to capture and analyze along with the data that they already have.
This new data can be something like time and cost to assemble a car.
I need to understand the following:
1: How should I approach the problem from an API design perspective?
2: Is changing the schema of your tables going to do any good, if that is an option?
3: Which databases can be used?
The values, once stored, can be changed. For example, the time to assemble can increase or decrease, so users should be able to update the values.
To answer your question, let's divide it into two parts: ideal architecture and Q&A.
Architecture:
A typical system consists of many technologies working together to solve a practical problem. Problems can be solved in many ways and may have more than one solution. We are not talking about the efficiency and effectiveness of any architecture here, as that's a whole new subject to explore. But it's always wise to choose what's best for your use case.
Since you already have existing software built, it's always helpful to follow its existing design patterns, which will help you understand the existing code in detail and allow you to create logical blocks that fit in nicely and actually help integrate the functionality instead of working against it.
With the pre-planning phase out of the way, let's discuss what solution is, in my opinion, ideal for your use case.
Q&A
1. How should I approach the problem from an API design perspective?
There will be a lot of assumptions here. Any non-trivial system exposing an API should have basic authentication and authorization wherever needed. Apart from that, try to stick to the full REST specification, which allows API consumers to follow standard paths and keeps the integration impact minimal when deciding what endpoints should look like and what they expect from the consumer.
That said, not all systems are ideal for such a use case, so it's up to the system designer to decide how much of the system follows standard practices.
Naming conventions matter: a newer API version gets api/v2 paths while the old one keeps api/v1, which is good practice for routing new functionality and allows the system to expand seamlessly.
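For instance, a minimal sketch of such versioned routing, here with Flask (the framework and endpoint shapes are assumptions, not something given in the question):

```python
from flask import Flask, Blueprint, jsonify

v1 = Blueprint("v1", __name__)
v2 = Blueprint("v2", __name__)

@v1.route("/cars/<int:car_id>")
def car_v1(car_id):
    return jsonify({"id": car_id})                  # original response shape

@v2.route("/cars/<int:car_id>")
def car_v2(car_id):
    return jsonify({"id": car_id, "assembly": {}})  # adds the new assembly data

app = Flask(__name__)
app.register_blueprint(v1, url_prefix="/api/v1")    # old consumers keep working
app.register_blueprint(v2, url_prefix="/api/v2")    # new functionality lives here
```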
2: Is changing the schema of your tables going to do any good, if that is an option?
In the short term, when you do not have much data, it's relatively easy to migrate. When the data becomes huge, migration is much more painful and resource-intensive.
Good practices can prevent scenarios where you would need to migrate data at all.
Database normalization becomes crucial when the data structure is expected to grow rapidly.
Regardless of whether you use a SQL or NoSQL solution, a good data structure will always be helpful in both data management and the programming implementation.
In my opinion, getting the data structures near perfect is always a good idea, because it reduces the future cost of migration and the frustration it brings. Still, some use cases require additional columns, and it's okay to add them as long as they don't have much impact on existing code. Otherwise, the extra fields can always be decoupled into a separate table.
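A small sketch of both options, using SQLite for brevity; the car rental table and column names are invented for illustration:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE cars (id INTEGER PRIMARY KEY, model TEXT)")

# Option 1: widen the existing table (cheap while the data is small,
# but existing code must tolerate the new column).
db.execute("ALTER TABLE cars ADD COLUMN assembly_minutes INTEGER")

# Option 2: decouple the new fields into a side table keyed by car id,
# leaving existing tables and queries untouched.
db.execute("""
    CREATE TABLE car_assembly (
        car_id INTEGER PRIMARY KEY REFERENCES cars(id),
        assembly_minutes INTEGER,
        assembly_cost REAL
    )""")
```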
3: Which databases can be used?
Typically, any RDBMS is enough for this kind of task. You might be surprised when you see case studies of large data producers still using MySQL in clusters.
So the answer is: as long as you have a normal scenario, go ahead and pick any database of your choice, until you hit its single-instance scalability limits. And those limits are pretty high for small- to mid-scale apps.
How should I approach the problem from an API design perspective?
Design a good data model which is appropriate for the data it needs to store. The API design will follow from the data model.
Is changing the schema of your tables going to do any good, if that is an option?
Does the new data belong in the existing tables? Then maybe you should store it there. Except: can you add new columns without breaking any existing applications? Maybe you can, but the regression testing you'll need to undertake to prove it may be ruinous for your timelines. Separate tables are probably the safer option.
Which databases can be used?
You're rather vague about the nature of the data you're working with, but it seems structured (numbers?). That suggests a SQL database with strong datatypes would be the best fit. Beyond that, use whatever data platform is currently in use; any perceived benefits from a different product will be swept away by the complexities and hassle of deploying it.
Last word. Talk this over with your boss (or whoever set you this task). Don't rely on the opinions of some random stranger on the interwebs.

Strategies to building a database of 30m images [closed]

Summary
I am facing the task of building a searchable database of about 30 million images (of different sizes) associated with their metadata. I have no real experience with databases so far.
Requirements
There will be only a few users and the database will be almost read-only (if things get written, it will be by a controlled automatic process); downtime for maintenance should be no big issue. We will probably perform more or less complex queries on the metadata.
My Thoughts
My current idea is to save the images in a folder structure and build a relational database on the side that contains the metadata as well as links to the images themselves. I have read about document-based databases. I am sure they are reliable, but presumably the images would then only be accessible through a database query, is that true? In that case I am worried that future users of the data might have to learn how to query the database before actually getting things done.
Question
What database could/should I use?
Storing big fields that are not used in queries outside the "lookup table" is recommended for certain database systems, so storing the 30M images in the file system does not seem unusual.
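For example, a minimal sketch of the metadata side, with SQLite standing in for whatever RDBMS you pick; the columns are examples only:

```python
import sqlite3

db = sqlite3.connect("images.db")
db.execute("""
    CREATE TABLE IF NOT EXISTS images (
        id INTEGER PRIMARY KEY,
        path TEXT NOT NULL,        -- relative path into the image folder tree
        width INTEGER,
        height INTEGER,
        taken_at TEXT,
        camera TEXT
    )""")
# index the metadata columns you actually filter on
db.execute("CREATE INDEX IF NOT EXISTS idx_images_taken ON images(taken_at)")
```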
As to "which database", that depends on the frameworks you intend to work with, how complicated your queries usually are, and what resources you have available.
I had some complicated queries run for minutes on MySQL that were done in seconds on PostgreSQL and vice versa. Didn't do the tests with SQL Server, which is the third RDBMS that I have readily available.
One thing I can tell you: whatever you can do in the DB, do it in the DB. You won't get anywhere near the same performance if you pull all the data from the database and then do the matching in the framework code.
A second thing I can tell you: Indexes, indexes, indexes!
It doesn't sound like the data is very relational, so a non-relational DBMS like MongoDB might be the way to go. With any DBMS you will have to use queries to get information from it. However, if you're worried about future users, you could put a software layer between the user and the DB that makes querying easier.
Storing images in the filesystem and metadata in the DB is a much better idea than storing large BLOBs in the DB (IMHO). I would also note that filesystem performance will be better if you spread the images over many folders and subfolders rather than putting 30M images in one big folder (citation needed).
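One common way to get such a folder structure is to bucket files by a prefix of their content hash. A sketch, assuming roughly uniform hashes (with 30M images, two hex levels give about 460 files per leaf folder):

```python
import hashlib
import os

def image_path(root, image_bytes):
    h = hashlib.sha256(image_bytes).hexdigest()
    # e.g. root/ab/cd/abcd....jpg -> 256*256 buckets spread the load
    return os.path.join(root, h[:2], h[2:4], h + ".jpg")

def store(root, image_bytes):
    path = image_path(root, image_bytes)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(image_bytes)
    return path  # this path is what goes into the metadata table
```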

What type of database is the best for storing array or object like data [closed]

I'm just curious what the best method would be for having a bot running on my Node server that I could play Blackjack against.
It's for multiple clients connected via sockets: each connected socket will have its own bot to play against, and I need some way to keep track of the bot's available cards each time a client sends a POST request with whatever card they pull out of their deck.
I figured MySQL would get messy really quickly because I can't just store an array or an object and splice out each card as it gets used, but I'm not really familiar with which database specializes in this kind of use.
If I didn't make any sense, basically:
I need to store cards for the bot, one deck per connected user's session: not just one deck for one person but multiple decks for multiple people.
I'm not asking you to write any code for me, just to point me in the direction of the database that would be ideal for this kind of setup.
I was thinking maybe Redis or MongoDB?
Redis would probably be fastest, especially if you don't need a durability guarantee: most of the game can be played out in Redis' in-memory datastore, which is probably going to be faster than writing to any disk in the world. Perhaps periodically, you can write the "entire game" to disk. If the project is not meant for commercial purposes, i.e. computer errors aren't going to cause players to lose money, this is definitely an enticing choice.
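As a sketch of that idea, assuming the redis-py client and made-up key names, each session's shuffled deck can live in a Redis list:

```python
import random
import redis

r = redis.Redis()

def new_deck(session_id):
    deck = [rank + suit for rank in "23456789TJQKA" for suit in "SHDC"]
    random.shuffle(deck)
    key = f"deck:{session_id}"
    r.delete(key)
    r.rpush(key, *deck)   # store the shuffled deck as a Redis list
    r.expire(key, 3600)   # abandoned games clean themselves up

def draw(session_id):
    card = r.lpop(f"deck:{session_id}")  # O(1) pop from the in-memory list
    return card.decode() if card else None
```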
MongoDB is popular, especially easy to get started with from Node, and is definitely faster than most relational SQL solutions, but transactions may be a problem. For a prototype or proof-of-concept project, it should do fine. But you may also want to look at other "NoSQL" solutions as well.
Cassandra is another popular NoSQL DB (a wide-column store rather than a document store), and many people prefer it over MongoDB for various reasons, most notably better scalability.
The choice really depends heavily on how you model your data. In your current scenario, I know you want to simply store an object/array, which sounds like you are basically going the way of the aggregated document (MongoDB). You are, in effect, "denormalizing" the entire DB into an aggregate and performing reads/writes on the entire object every single time in order to achieve consistency. This is a prevalent technique in MongoDB and other document-oriented DBs. But do note that this solution only works because you are not operating across partitions. Think about what happens when you have multiple servers serving the application writing to a separate DB cluster.
You've really got to analyze and decide for yourself what the best way to model the data is, if scalability is a concern. Would it be a better model to NOT continually write to this array? For example, generate the sequence of cards once, store it in the DB as a Game, and only do reads on it to draw cards? Then each player's move can be stored as a very succinct data structure, a Hit, referencing a card from the Game. The data becomes very relational (back to old-school SQL), but the writes are much smaller, and your server never gets into a lock state waiting for players to release the Game object. It may or may not work for your use case, but think about how to model the data for maximum reads and minimum independent writes.
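A rough sketch of that Game/Hit model in SQLite; the table shapes are illustrative assumptions, not a prescription:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE games (
        id INTEGER PRIMARY KEY,
        cards TEXT NOT NULL          -- full shuffled sequence, written exactly once
    );
    CREATE TABLE hits (
        game_id INTEGER REFERENCES games(id),
        player TEXT,
        card_index INTEGER           -- succinct reference into games.cards
    );
""")

def record_hit(game_id, player, card_index):
    # a tiny independent write; the Game row itself is never updated
    db.execute("INSERT INTO hits VALUES (?, ?, ?)", (game_id, player, card_index))
    db.commit()
```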
Personally (IMO), if this project is for fun, I'd go with Redis as an in-memory cache layer where most reads/writes happen, and write the game logs into Cassandra. But if this is serious business and I need some real consistency guarantees, I'd probably go back to relational DBs, with a Redis cache layer to speed up reads.
Because there is no one correct answer, the only advice anyone can give is to weigh your application's persistence needs against the strengths/weaknesses of each DB solution, and do a hell of a lot of research before making a decision as important as "which technology to use for persistence". For example, there may be long-term problems with MongoDB that you overlooked; just Google "MongoDB problems" or "MongoDB sucks". Hell, there may even be long-term problems with all current NoSQL offerings with regards to transactions or consistency.

Log file not stored [closed]

I need to do research about log files that are not stored in the database. I do not know much about database systems, so I need someone to give me at least some ideas about it. What I was told is that some of the log files were not written to a bank's database. The log files come from various sources such as ATMs, the website, etc. For example, the reason could be a high rate of data flow causing some data to be left out.
The question is: what are the reasons behind this, and what could the solutions be?
I would really appreciate it if you could share some articles about it.
Sorry if I could not explain it well. Thanks in advance.
Edit: what I meant was not that there is a system intentionally not writing some of the log files to the database. What I tried to say is that some of the log files are not written to the database, the reason is not known, and my intention is to identify the possible reasons and solutions. The database belongs to a bank and, as you can imagine, lots of data flows into it every second.
Well, the question is not very clear, so let me rephrase it:
What are the reasons why application logs are not stored in a database?
It depends on the context, and there are different reasons.
First question: why might you store logs in a database at all? Usually you do it because they contain data that is relevant to you and that you want to manipulate.
So why not always store this data:
you are not interested in the logs, except when something goes wrong, but then it's more about debugging than storing logs.
you don't want to mix business data (users, transactions, etc.) with less important / less relevant data.
the volume of logs is too high for your current system, and putting them in a database might crash it completely.
you might want to use another system to dig into the logs, with a different type of storage (Hadoop, big data, NoSQL).
when you do database backups, you usually back up the whole database. Logs are not 'as important' as other critical data, are bigger, and would take up too much space.
there is no need to always put logs in a database. Plain text and some other tools (web server logs, for instance) are usually more than enough.
So it's for these reasons that logs are in general not stored in the same database as the application.
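As an example of the "plain text and some other tools" approach, here is a sketch using Python's standard library to rotate log files on disk instead of inserting rows into a database; the file name and sizes are arbitrary:

```python
import logging
from logging.handlers import RotatingFileHandler

# keep at most 10 files of ~50 MB each; old ones are rotated away
handler = RotatingFileHandler("atm.log", maxBytes=50_000_000, backupCount=10)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))

log = logging.getLogger("atm")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("withdrawal atm=42 amount=100.00")  # appended to disk, never touches the DB
```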
