I am preparing for the system design interview, and since I have little experience with this topic, I bought the "Grokking the system design interview" course from educative.io, which was recommended by several websites.
However, having read it, I still did not manage to understand several things, so if someone could answer my questions, that would be helpful.
Since I have no experience with NoSQL, I find it difficult to choose the proper db system. Several times the course just does not give any reasoning for why it chose one db over another. For example, in the chapter "Designing Youtube or Netflix" the editors chose MySQL for the db role with no explanation. In the same chapter we have the following non-functional requirements:
"The system should be highly available. Consistency can take a hit (in
the interest of availability); if a user doesn’t see a video for a
while, it should be fine."
Following the above hint, taking into account the size of the system, and applying the material in the "CAP theorem" chapter, it seems to me that Cassandra or CouchDB would be a better choice. What am I missing here?
The same question goes for "Designing Facebook's Newsfeed".
Is the CAP theorem still applicable?
What I mean is: HBase is, according to the "CAP theorem" chapter, good at consistency and partition tolerance, but according to the HBase documentation it also supports High Availability since version 2.X. So it seems to me that it is a one-size-fits-all / universal solution for db storage, which goes against the CAP theorem, unless they sacrificed something for HA. What am I missing here?
The numbers around the course are kind of inconsistent about how much RAM/storage/bandwidth a computer can handle; I guess they are outdated. What are the current numbers for (a) regular computers and (b) modern servers?
Almost every chapter has a part called "Capacity Estimation and Constraints", but what is calculated there changes from chapter to chapter. Sometimes only storage is calculated, often bandwidth too, sometimes QPS is added, and sometimes there are task-specific metrics. How do I know what I should calculate for a specific task?
Thanks in advance!
Each database is different and fulfills different requirements. I recommend you read the Dynamo paper, familiarize yourself with the rest of the terminology used in it (two-phase locking, leader/follower, multi-leader, async/sync replication, quorums), and learn what guarantees the different databases provide. Now to the questions:
MySQL can be configured to prioritize Availability at the cost of Consistency with its asynchronous replication model (the leader doesn't wait for acknowledgement from its followers before committing a write; if a leader crashes before the data gets propagated to the followers, the data is lost), so it can be one of the suitable solutions here.
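As a toy illustration of that trade-off (plain Python, not real MySQL internals; all names are made up):

```python
# Toy illustration only -- not real MySQL internals. With asynchronous
# replication the leader acknowledges a write before any follower has
# applied it, so a leader crash can lose acknowledged data.

class Node:
    def __init__(self):
        self.log = []

leader = Node()
followers = [Node(), Node()]

def write_async(value):
    # Commit locally and acknowledge immediately; replication to the
    # followers happens later, in the background.
    leader.log.append(value)
    return "ack"

def replicate():
    # Background catch-up: followers eventually converge on the leader.
    for follower in followers:
        follower.log = list(leader.log)

write_async("video_42")  # the client sees success at this point
# If the leader crashes here, before replicate() runs, "video_42" is
# gone: availability and latency were prioritized over consistency.
```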
From the documentation of HBase, HBase guarantees strong consistency, even at the cost of availability.
The promise of high availability is for reads, not for writes, i.e. you can keep reading (possibly stale) data while the rest of the system recovers from failure to the point where it can accept writes again.
Because of this single homing of the reads to a single location, if the server becomes unavailable, the regions of the table that were hosted in the region server become unavailable for some time.
Since all writes still have to go through the primary region, the writes are not highly-available (meaning they might block for some time if the region becomes unavailable).
The numbers used are estimates by the candidate, i.e. you decide what the specs of a single hypothetical server are, and how many servers you would need in order to scale and accommodate the storage/throughput requirement.
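For example, a back-of-envelope estimate for a hypothetical video service might look like this; every input is an assumption you would state explicitly and adjust with the interviewer:

```python
# Back-of-envelope estimation for a hypothetical video service.
# Every input below is an assumption to state explicitly and agree
# on with the interviewer.

daily_active_users = 100_000_000
upload_ratio = 1 / 100              # 1 in 100 users uploads per day
avg_video_size_mb = 50
views_per_upload = 200              # read:write ratio

uploads_per_day = daily_active_users * upload_ratio
storage_per_day_tb = uploads_per_day * avg_video_size_mb / 1_000_000
write_qps = uploads_per_day / 86_400
read_qps = write_qps * views_per_upload
egress_gbit_s = read_qps * avg_video_size_mb * 8 / 1_000  # rough average

print(f"uploads/day: {uploads_per_day:,.0f}")           # 1,000,000
print(f"new storage: {storage_per_day_tb:.0f} TB/day")  # 50 TB/day
print(f"write QPS:   {write_qps:.0f}")                  # ~12
print(f"read QPS:    {read_qps:.0f}")                   # ~2,300
print(f"egress:      {egress_gbit_s:.0f} Gbit/s")       # ~900
```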
You don't know in advance (although you can make a guess based on the requirements, e.g. whether it's a data storage system, a streaming service, etc., I still wouldn't recommend it). Instead, you should ask the interviewer what area they are interested in, and make estimates for it. The interview, especially the system design part, is a discussion; don't follow a template to the letter. Recognize the different areas of the system you can tackle, and approach them based on the interviewer's interest.
I'm currently receiving 2000 prices per second from a stock exchange and need to save them in an appropriate database. My current choice is PostgreSQL, which is way too slow. I need those prices (ticks) in an aggregated form like OHLC. So if I wanted to save D1 data, for instance, I would need to first get the previous D1 record for the stock from the database, check whether the high or low price has changed, set a new close price, and then save it to the database again. This takes forever and is not feasible with Postgres. I don't want to store the OHLC data; I prefer querying (aggregating) it in real time.
So my requirements are:
persistence
fast writes (currently 2k per second, up to 10k)
queries, e.g. aggregating OHLC data in real-time (50-100 per second)
adaptable to any modern programming language without writing raw queries (an SDK for Python or JS for that database)
deployable on AWS or GCP without hassle
I was thinking about Apache Cassandra. I'm not familiar with Cassandra; are powerful queries like the OHLC one possible? Are there any alternatives to Cassandra?
Thanks in advance!
Given what I've understood from your question, I believe Cassandra should easily fit your use-case.
Regarding your requirements:
persistence : Cassandra will not only persist your data but also cover redundancy with minimal configuration;
fast writes : this is what Cassandra is most optimized for, and while the exact throughput depends on a lot of factors, in general Cassandra will manage writes measured in the thousands/sec/core; also, the eventual number of writes is not really relevant, as Cassandra can scale linearly with no real penalty, so 5k, 10k, 100k or more are all doable;
adaptability : Cassandra has official drivers for the most common languages (Python, C family, NodeJS, Java, Ruby, PHP, Scala) as well as community-developed ones for more languages (list of drivers);
deployable : It's very easy to deploy in the cloud. You can choose to deploy it manually on independent instances, or use a managed Cassandra cluster (AWS has one called 'Amazon Keyspaces'; DataStax, the company driving most of the development behind Cassandra, has one called 'Astra'; and there are even more possible solutions). Given that Cassandra is one of the major players when it comes to big-data storage, finding a place for your DB in the cloud should be easy.
I have only mentioned 4 of the 5 requirements. That is because when talking about reading, things get more complex and a larger discussion is needed.
50-100 reads/s against 2k+ writes per second is in line with the general idea of Cassandra being optimized for write-intensive tasks. In Cassandra, the way you model your tables dictates how well things work. For a task like the one you have described, my first thoughts are:
1. You bucket each stock per day => you get a partition with around 30k rows (1 update/s for 8 trading hours) and a size of under 0.2MB (30k * 4B). This would be well within the recommended values and clearly under the worst-case-scenario ones;
2. When you need the aggregated data, you have 2 options:
2a. You read the partition as is and aggregate it application-side (what I would recommend; see the sketch after this list);
2b. You implement a "User-Defined Aggregate" function on your database that does the work (docs). This should be doable, although I won't guarantee it. Apart from being harder to implement, the problem is that putting this kind of extra workload on the DB might not be what you want, given your apparent use case. Let me explain: I'd expect your reading load to be most active during certain times (before, during and after trading hours), with times when the load is lighter. Depending on your architecture, you could have multiple application instances up during peak times and scale them back during off-peak hours in order to lower costs. While applications can easily be scaled up and down on cloud providers like AWS and GCP, Cassandra cannot be scaled like this (5 nodes in the morning, 3 at night and so on); well, it could, but it's not designed for that and it would be a terrible decision. So moving as much of the non-constant workload as possible to the application seems the best idea;
3. (Optional) Have a worker that at the end of the day/trading day aggregates the values for each stock and saves them to another table, so that looking at historic data is easier. This data could even be bucketed by week, month or even year, depending on how much space the aggregated data takes.
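A minimal sketch of option 2a with the Python driver (the keyspace, table and column names are assumptions, not a prescribed model; it needs a reachable Cassandra node and the cassandra-driver package):

```python
# Sketch of option 2a: ticks bucketed per (symbol, day); read one
# partition and aggregate OHLC application-side.
#
# Assumed schema:
#   CREATE TABLE market_data.ticks (
#       symbol text, day date, ts timestamp, price decimal,
#       PRIMARY KEY ((symbol, day), ts)
#   ) WITH CLUSTERING ORDER BY (ts ASC);
from datetime import date
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("market_data")

def ohlc(symbol, day):
    rows = session.execute(
        "SELECT price FROM ticks WHERE symbol = %s AND day = %s",
        (symbol, day),
    )
    prices = [row.price for row in rows]  # clustering order => chronological
    if not prices:
        return None
    return {"open": prices[0], "high": max(prices),
            "low": min(prices), "close": prices[-1]}

print(ohlc("AAPL", date(2021, 6, 1)))
```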
You could also add Spark and Kafka in front of Cassandra for a more powerful approach to the real-time aggregation, but we shouldn't deviate that much from the question at hand.
Cassandra is very powerful with the right modeling and the right architecture. At first glance, what you need seems to be a good fit for Cassandra; however, as powerful as it can be, it can be just as bad if you use it in ways it wasn't designed for. I hope this answer puts you on a path to making the right decision.
Cheers.
I am new to system design and have been asked to solve a problem.
Given a car rental service website, I need to work on a new feature.
The company has come up with some more data that they would like to capture and analyze along with the data that they already have.
This new data can be something like time and cost to assemble a car.
I need to understand the following:
1: How should I approach the problem, from API design perspective?
2: Is changing the schema of your tables going to do any good, if that is an option?
3: Which databases can be used?
The values, once stored, can change. For example, the time to assemble can decrease or increase, hence users should be able to update the values.
To answer your question, let's divide it into two parts: ideal architecture, and Q&As.
Architecture:
A typical system consists of many technologies working together to solve a practical problem. Problems can be solved in many ways and may have more than one solution. We are not talking about the efficiency and effectiveness of any particular architecture here, as that is a whole new subject to explore. But it's always wise to choose what's best for your use case.
Since you already have existing software built, it's always helpful to follow its existing design patterns, which will help you understand the existing code in detail and allow you to create logical blocks that fit in nicely and actually help in integrating functionality instead of working against it.
With the pre-planning phase covered, let's discuss what solution is ideal for your use case, in my opinion.
Q&As
1. How should I approach the problem, from API design perspective?
There will be a lot of assumptions here, but any system exposing an API should have basic authentication and authorization functionality wherever needed. Apart from that, try to stick to the full REST specification, which allows API consumers to follow standard paths and keeps the integration impact minimal when deciding what endpoints will look like and what they expect from the consumer.
That said, not all systems are ideal for such a use case, so it is up to the system designer to decide how much of the system follows standard practices.
Naming conventions matter: a newer API version gets api/v2 paths while the old one keeps api/v1, which is good practice for routing new functionality and allows the system to expand seamlessly.
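For instance, a minimal Flask sketch of that convention (the resource, fields and values are invented for illustration):

```python
# Minimal sketch of the api/v1 vs api/v2 convention using Flask.
from flask import Flask, jsonify

app = Flask(__name__)

@app.get("/api/v1/cars/<int:car_id>")
def get_car_v1(car_id):
    # Old contract: existing consumers keep working unchanged.
    return jsonify({"id": car_id, "model": "sedan"})

@app.get("/api/v2/cars/<int:car_id>")
def get_car_v2(car_id):
    # New contract: exposes the new assembly data without breaking v1.
    return jsonify({"id": car_id, "model": "sedan",
                    "assembly": {"time_hours": 18.5, "cost_usd": 4200}})

if __name__ == "__main__":
    app.run()
```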
2: Is changing the schema of your tables going to do any good, if that is an option?
In the short term, when you do not have much data, migration is relatively easy. When the data becomes huge, it is much more painful and resource-intensive.
Good practices help you avoid scenarios where you would need to migrate data at all.
Database normalization becomes crucial when the data structure is likely to grow rapidly; it requires attention up front.
Regardless of whether you use a SQL or NoSQL solution, a good data structure will always help with both data management and the programming implementation.
In my opinion, getting the data structures near perfect is always a good idea, because it reduces the future costs of migration and the frustration it brings. Still, some use cases require adding columns, and it's okay to add them as long as doing so does not have much impact on existing code. Otherwise, the extra fields can always be decoupled into a separate table.
3: Which databases can be used?
Typically any RDBMS is enough for this kind of task. You might be surprised to see case studies of companies that generate huge amounts of data still using MySQL in clusters.
So the answer is: as long as you have a normal scenario, go ahead and pick any database of your choice, until you hit its single-instance scalability limits. And those limits are pretty generous for small- to mid-scale apps.
How should I approach the problem, from API design perspective?
Design a good data model which is appropriate for the data it needs to store. The API design will follow from the data model.
Is changing the schema of your tables going to do any good, if that is an option?
Does the new data belong in the existing tables? Then maybe you should store it there. Except: can you add new columns without breaking any existing applications? Maybe you can, but the regression testing you'll need to undertake to prove it may be ruinous for your timelines. Separate tables are probably the safer option.
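As a minimal sketch of that separate-table option (sqlite3 here purely for brevity; all table and column names are hypothetical):

```python
# Sketch of the separate-table option: new metrics live in their own
# table keyed to the existing one, so existing queries stay untouched.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE cars (                 -- stands in for the existing table
        id INTEGER PRIMARY KEY,
        model TEXT NOT NULL
    );
    CREATE TABLE assembly_metrics (     -- new, additive, updatable
        car_id INTEGER PRIMARY KEY REFERENCES cars(id),
        time_hours REAL,
        cost_usd REAL
    );
""")
db.execute("INSERT INTO cars (id, model) VALUES (1, 'sedan')")
db.execute("INSERT INTO assembly_metrics VALUES (1, 18.5, 4200.0)")
# The captured values can change later, per the requirement:
db.execute("UPDATE assembly_metrics SET time_hours = 17.0 WHERE car_id = 1")
print(db.execute(
    "SELECT model, time_hours, cost_usd FROM cars "
    "JOIN assembly_metrics ON cars.id = assembly_metrics.car_id"
).fetchone())  # ('sedan', 17.0, 4200.0)
```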
Which databases can be used?
You're rather vague about the nature of the data you're working with, but it seems structured (numbers?). That suggests a SQL database with strong datatypes would be the best fit. Beyond that, use whatever data platform is currently in use; any perceived benefits from a different product will be swept away by the complexities and hassle of deploying it.
Last word. Talk this over with your boss (or whoever set you this task). Don't rely on the opinions of some random stranger on the interwebs.
I'm just curious what the best method would be if I'm trying to have a bot running on my Node server that I could play Blackjack against.
There are multiple clients connected via sockets, and each connected socket will have its own bot to play against, but I need some way to keep the bot's available cards for each time a client sends a POST request with whatever card they pull out of their deck.
I figured MySQL would get messy really quickly because I cannot just store an array or an object and splice out each card as it gets used, but I'm not really familiar with which database would specialize in this kind of use.
If I didn't make any sense, basically:
I need to store cards for the bot (but for each connected user's session): not just 1 deck for 1 person, but multiple decks for multiple people.
I'm not asking you to write any code for me, just point me in the direction of which database would be ideal for this kind of setup.
I was thinking maybe Redis or MongoDB?
Redis would probably be fastest, especially if you don't need a durability guarantee - most of the game can be played out using Redis' in-memory datastore, which is probably gonna be faster than writing to any disk in the world. Perhaps periodically, you can write the "entire game" to disk. If the project is not meant for commercial purposes, i.e. computer errors aren't gonna cause players to lose money, this is definitely an enticing choice.
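If you go the Redis route, a minimal redis-py sketch of per-session decks might look like this (key names, the card encoding and the 1-hour TTL are my assumptions; it needs a running Redis instance):

```python
# Sketch of per-session decks in Redis: each connected client gets its
# own shuffled deck stored as a list, and drawing a card is one LPOP.
import random
import redis

r = redis.Redis()

RANKS = "23456789TJQKA"
SUITS = "shdc"

def new_deck(session_id):
    deck = [rank + suit for rank in RANKS for suit in SUITS]
    random.shuffle(deck)
    key = f"deck:{session_id}"
    r.delete(key)
    r.rpush(key, *deck)    # whole deck as a single Redis list
    r.expire(key, 3600)    # clean up abandoned sessions automatically

def draw(session_id):
    card = r.lpop(f"deck:{session_id}")
    return card.decode() if card else None  # None => deck exhausted

new_deck("socket-abc123")
print(draw("socket-abc123"))  # e.g. '7h'
```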
MongoDB is popular, especially easy to get started with from Node, and is definitely faster than most relational SQL solutions, but transactions may be a problem. For a prototype or proof-of-concept project, it should do fine. But you may also want to look at other "NoSQL" solutions as well.
Cassandra is another popular NoSQL DB (wide-column rather than document-oriented), and many people prefer it over MongoDB for various reasons, most notably for better scalability.
The choice really highly depends on how you model your data. In your current scenario, I know you want to simply store an object/array, which sounds like you are basically going the way of the aggregated document (MongoDB). You are, in effect, "denormalizing" the entire DB into an aggregate, and performing reads/writes on the entire object every single time in order to achieve consistency. This is a prevalent technique in MongoDB and other document-oriented DBs. But do note that this solution only works because you are not operating across partitions. Think about what happens when you have multiple servers serving the application writing to a separate DB cluster.
You've really got to analyze and decide for yourself what the best way to model the data is, if scalability is a concern. Would it be a better model to NOT continually write to this array? For example, generate the sequence of cards once, store it in the DB as a Game, and only do reads on it to draw cards. Then each player's move can be stored as a very succinct data structure, Hit, referencing a card from the Game. The data becomes very relational (back to old-school SQL), but the writes are much smaller, and your server never gets into a lock state waiting for players to release the Game object. It may or may not work for your use case, but think about how to model the data for maximum reads and minimum independent writes.
Personally (IMO), if this project is for fun, I'd go with Redis as an in-memory cache layer where most reads/writes happens, and write the game logs into Cassandra. But if this is serious business and I need some real consistency guarantees, I'd probably go back to relational DBs, with a Redis cache layer to speed up reads.
Because there is no one correct answer, the only advice anyone can give is to weigh your application's persistence needs against the strengths/weaknesses of each DB solution, and do a hell of a lot of research before making an important decision like "which technology to use for persistence". For example, there may be long-term problems with MongoDB that you overlooked; just Google "MongoDB problems" or "MongoDB sucks". Hell, there may even be long-term problems with all current NoSQL offerings with regard to transactions or consistency.
We're (re)designing a corporate information system. For the database design, we are exploring these options:
[Option 1->] a single CompanyBigDatabase that has everything,
[Option 2->] several databases across the company (say, HRD_DB, FinanceDB, MarketingDB), which are then synchronized through an application layer. EmployeeTable is owned by HRD; if Finance wants to refer to employees, it queries EmployeeTable from HRD_DB via a web service.
What is the best practice? What are the pros and cons? We want it to have high availability and to be reasonably reliable. Does Option 1 necessitate clustering and all that for this? Do big companies and universities (like Toyota, Samsung, Stanford, MIT, ...) always opt for Option 1?
I was looking in many DB textbooks but I could not find a sufficient explanation on this topic.
Any thoughts, tips, links, or advice is welcome. Thanks.
I've done this type of work for 20 years; Enterprise Architecture is one term used to describe it. If you are asking this question in a real enterprise scenario, I'm going to recommend you get professional advice. Whether it's a real project or a uni question, there are so many things to consider:
budget
politics
timeframes
legacy systems or green field,
Scope of Build
In house or Hosted
Complete Outsource of some or all of the functionality (SaaS)
....
Entire Methodologies are written to support projects that do this.
Depending on these variables, you can end up with many different answers.
Even agreeing on how to weight features and outcomes is tough.
This is a HUGE question; you could write a book on it.
It's a two-paragraph question, yet I have seen 10 people spend a month putting a business case together to do X. That's just costing and planning the various options, without even selecting the final approach.
So I have not directly answered your question... that, my friend, is a serious research project, not really a StackOverflow question.
There is no single answer. It depends on many other factors, such as database load, application architecture, scalability, etc. My suggestion: start the simplest way possible (a single database) and change it based on your needs.
A single database has its advantages: simpler joins, referential integrity, a single backup. Only separate pieces of data when you have a valid reason/need.
In my opinion, it would be more appropriate to have the database normalized and to have several databases across the company based on departments. This would allow you to manage data more effectively in terms of storing, retrieving and updating information, and to provide access to users based on department type or user type. You can also provide different views of the database. It will be a lot easier to manage the data.
There is a general principle of databases in particular, and computing in general, that there should be a single authoritative source for every data item.
Extending this to sets of data, as soon as you have multiple customer lists, multiple lists of items, multiple email addresses, you are soon into a quagmire of uncertainty that will then call for a business intelligence solution to resolve them all.
Now I'm a business intelligence chap by historical leaning, but I'd be the first to say that this is not a path that you want to go down simply because Marketing and Accounts cannot decide the definition of "customer". You do it because your well-normalised OLTP systems do not make it easy to count how many customers there were yesterday, last week, and last year.
Nor should they either, because then they would be in danger of sacrificing their true purpose -- to maintain a high performance, high-integrity persistent store of the "data universe" that your company exists in.
So, in other words, the single-database approach has data integrity on its side, and you really do not want to work in a company that does not have data integrity. As a Business Intelligence practitioner, I can tell you that it is a horrible place.
On the other hand, you are going to have practical situations in which you simply must have separate systems due to application vendor constraints etc, and in that case the problem becomes one of keeping the data as tightly coupled as possible, and of Metadata Management (ugh) in which the company agrees what data in what location has what meaning.
Either will work, and other decisions will mostly affect your specification. To some extent your question could be described as "Should I go down the ERP path or the SAAS path?" I think it is telling that right now most systems are tending towards SAAS.
How will you be managing the applications? If they will be updated at different times, separate DBs make more sense (the SAAS path). On the other hand, having one DB to connect to, one authorization system, one place to look for details, one place to back up, etc. appears to decrease complexity in the technical space. But then it does not allow decisions affecting one part of the business to be considered separately from other parts of the business.
Once the business is involved, trying to find a single time at which every department agrees to an upgrade can be hell. Having a degree of abstraction, so that you only have to get one department to align before updating its part of the stack, has real advantages in the coming years. And if your web services are robust and don't change with each release, this can be a much easier path.
Don't forget you can have views of data in other DBs.
And as for your question of how most big companies work: generally by a mish-mash of big and little systems that sometimes talk to each other, sometimes don't, and often repeat data. Having said that, repeating data is a real problem; always have an authoritative source and copies (or, even better, only one copy). A method I have seen work well in a number of enterprises is to have one place where details can be CRUDed (Created, Retrieved, Updated and Deleted) and numerous places where they can be read.
And really, this design decision has little or nothing to do with availability and reliability. Those tend to come from good design (simplicity, knowing where things live, etc.), good practices (good release practices, admin practices, backups, intelligent redundancy, etc.) and spending money, not from having one or multiple systems.
I'm an embedded guy, not a database guy. I've been asked to redesign an existing system which has bottlenecks in several places.
The embedded device is based around an ARM9 processor running at 220 MHz.
There should be a database of 50k entries (may increase to 250k), each with 1k of data (max 8 fields). That's approximate; I can try to get more precise figures if necessary.
They are currently using SQLite 2 and planning to move to SQLite 3.
Without starting a flame war - I am a complete d/b newbie just seeking advice - is that the "best" decision? I realize that this might be a "how long is a piece of string?" question, but any pointers would be greatly welcomed. I don't mind doing a lot of reading & research, but just hoped that you could get me off to a flying start. Thanks.
P.S. Again, it's a total rewrite; we might not even stick with embedded Linux but switch to eCos, so don't worry too much about a one-time conversion between d/b formats. Oh, and accesses should be infrequent, at most one every few seconds.
Edit: OK, it seems they have 30k entries (which may reach 100k or more) of only 5 or 6 fields each, but at least 3 of them can be a search key for a record. They are toying with "having no d/b at all, since the data are so simple", but it seems to me that with multiple keys we couldn't use fancy stuff like a quicksort()-type search (recursive, binary search). Any thoughts on "no d/b", just data structures?
Btw, one key is 800k - not sure how well SQLite handles that (maybe with "no d/b" I have to hash that 800k down to something smaller?)
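To make the "no d/b" idea concrete, here is a minimal sketch (field names invented for illustration): records live in a plain list with one dict index per search key, and the oversized key is hashed down to a short digest before indexing.

```python
# Sketch of the "no d/b" idea: records in a plain list plus one dict
# per search key, so every lookup is a hash lookup rather than any
# sort/binary search. Field names are invented; the oversized ~800k
# key is hashed down to a short digest before being indexed.
import hashlib

def digest(big_key: bytes) -> str:
    return hashlib.sha256(big_key).hexdigest()

records = [
    {"serial": "A100", "site": "fab1", "blob": digest(b"...800k bytes...")},
    {"serial": "B200", "site": "fab2", "blob": digest(b"...other bytes...")},
]

# One index per searchable field; values are lists since keys may repeat.
indexes = {field: {} for field in ("serial", "site", "blob")}
for rec in records:
    for field, index in indexes.items():
        index.setdefault(rec[field], []).append(rec)

def lookup(field, value):
    return indexes[field].get(value, [])

print(lookup("site", "fab1")[0]["serial"])                       # -> A100
print(lookup("blob", digest(b"...800k bytes..."))[0]["serial"])  # -> A100
```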
Also, SQLite is the database chosen by virtually all mobile operating systems. Android, iPhone OS and Symbian ship with SQLite, which makes me think that manpower was spent optimizing it for the processors in those phones (nearly always ARM).
I would stick with SQLite, it's widely supported and pretty rich in features.
Firebird (previously Interbase) claims to work well embedded.
HypersonicQL (HQL) is small and fast and also claims to be suitable for embedded use.
Alas, I have no personal experience to back up either claim.
SQLite is probably a pretty safe bet. However, if performance is really important for your application and you do not need a relational database, I would suggest you take a look at Berkeley DB. Berkeley DB is not a relational database, though. In other words, if your data is grouped in different tables and you constantly need to query result sets that require relating data from more than one table, you probably need a relational database. Berkeley DB is better suited for something like lookup tables (i.e., the data is organized in a few tables and you don't need to query data from more than one of them in order to produce the result sets you want). Berkeley DB is very fast, but it will require more work on your end in order to get the most out of it.
If you want an alternative, then Berkeley DB is worth looking at. It used to be owned by Sleepycat Software but is now available from Oracle. It's a barebones database engine that is directly programmable (rather than driven through a SQL frontend). It's used as part of the core engine in many major databases and as the database in many embedded devices; it used to be particularly popular for managing routing tables in routers.
It tends to get overlooked these days for more fashionable setups, but I've found it to be decent and solid, and for the numbers you are talking about it can be lightning fast.
I will suggest SQLite 3 too. It is used by many famous applications.
SQLite is OK, but don't plan to use it if you expect to insert, update and delete data involving more than 6 million rows (all at the same time, or any partial part). The thing is that VACUUM has to be run every now and then, and it becomes a very severe performance bottleneck, even when it's automatic.
8 years late, but as an update: I've had pretty good experience using Raima Database Manager. If you are looking for a small-footprint DB, it can get down to 40k. One of the reasons I like RDM is its platform independence: it is portable across 32-bit and 64-bit machines and between big-endian and little-endian architectures, and it supports most operating systems, meaning you can use it on embedded Linux and eCos as mentioned in the first post. And its performance gets better as you add better hardware and users, as opposed to SQLite.
I am not familiar with embedded systems, but the iPhone uses ARM9, with SQLite as the DB.
The 01-11-10 Embedded.com Newsletter does a nice job of covering this topic. The newsletter can be found at Embedded.com: Embedded.com Tech Focus Newsletter (1-11-10): Embedding Databases.