Maximum concurrent operations on a single Couchbase bucket - database

I'm building a server for customers, where each customer needs access to a database to serve his or her own clients.
My plan was to assign each customer a dedicated bucket, but I have just found out that a single Couchbase cluster is recommended to serve a maximum of 10 buckets. Now I don't know whether sharing a single bucket across my customers, using the customer ID combined with the name of the collection they are creating as a namespace in the document type, will hurt performance for all customers when each customer's clients run heavy operations on that one bucket.
I would also appreciate suggestions for any database platform that can handle this kind of project at scale, so that the performance of one customer does not affect the others.

If you expect the system to be heavily loaded, the activities of one user will affect the activities of another user whether they are sharing a single bucket or operating in separate buckets. There are only so many cycles to go around, so if one user is placing a heavy load on the system, the other users will definitely feel it. If you absolutely want the users completely isolated, you need to set up separate clusters for each of them.
If you are OK with the load from one user affecting the load from another, your plan for having users share a bucket by adding user IDs to each document sounds workable. Just make sure you are using a separator that cannot be part of the user ID, so you can unambiguously separate the user ID from the document ID.
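The separator rule can be sketched as follows. This is only an illustration in plain Python, not Couchbase SDK code; the "::" separator and the function names make_key and split_key are assumptions made for the example.

```python
from typing import Tuple

# Assumed separator; the only requirement is that it can never occur in a user ID.
SEPARATOR = "::"


def make_key(user_id: str, doc_id: str) -> str:
    """Build a namespaced document key of the form <user_id>::<doc_id>."""
    if not user_id or SEPARATOR in user_id:
        raise ValueError("user ID must be non-empty and must not contain the separator")
    return f"{user_id}{SEPARATOR}{doc_id}"


def split_key(key: str) -> Tuple[str, str]:
    """Recover (user_id, doc_id) by splitting on the first separator only."""
    user_id, found, doc_id = key.partition(SEPARATOR)
    if not found:
        raise ValueError("key has no separator")
    return user_id, doc_id
```

Because the user ID can never contain the separator, the first occurrence always marks the boundary, so the document ID itself is free to contain it.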
Also be aware that while Couchbase supports multiple buckets, it tends to run best with just one. Buckets are distinctly heavyweight structures.

Related

How big tech companies share databases across multiple teams?

How do multiple teams (which own different system components or microservices) in a big tech company share their databases?
I can think of multiple use cases where this would be required. For example, in an e-commerce firm the same product is shared among multiple teams: it is first part of the product onboarding service, then maybe the catalog service (which stores all products and categories), then the search service, cart service, order placing service, recommendation service, cancellation and return service, and so on.
If they don't share any DB, then:
Do they all keep a redundant copy of the products with the same product ID, and
wouldn't it be a challenge to achieve consistency among multiple teams?
I have multiple related doubts in both cases, whether they share a DB or not.
I have been through multiple tech blogs and videos on software design and still haven't found a satisfying answer. Please share some resources that give a complete workflow of how things work end-to-end in a big tech firm.
Thank you
In a microservice architecture, each microservice exposes endpoints through which other microservices can access the information shared between the services. So one service stores only minimal information about a record that is managed by another microservice.
For example, if a user service wants to fetch the orders for a particular user in an e-commerce case, the order service would expose an endpoint that, given a user ID, returns all orders related to that user ID, and so on. So essentially the only user-related field the order service needs to store is the user ID; the rest of the user details are irrelevant to it.
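A minimal in-process sketch of that ownership split is shown below. The class and method names (OrderService, UserService, orders_for_user, profile_with_orders) are hypothetical, and a plain method call stands in for what would be an HTTP or RPC endpoint in a real deployment.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Order:
    order_id: str
    user_id: str  # the only user-related field the order service stores
    total_cents: int


class OrderService:
    """Owns order data; knows users only by their ID."""

    def __init__(self) -> None:
        self._orders: List[Order] = []

    def place_order(self, order: Order) -> None:
        self._orders.append(order)

    def orders_for_user(self, user_id: str) -> List[Order]:
        # Stand-in for an endpoint that takes a user ID and returns that user's orders.
        return [o for o in self._orders if o.user_id == user_id]


class UserService:
    """Owns user details; asks the order service for orders instead of storing them."""

    def __init__(self, order_service: OrderService) -> None:
        self._names: Dict[str, str] = {}
        self._order_service = order_service

    def add_user(self, user_id: str, name: str) -> None:
        self._names[user_id] = name

    def profile_with_orders(self, user_id: str) -> dict:
        return {
            "name": self._names[user_id],
            "orders": self._order_service.orders_for_user(user_id),
        }
```

Neither service reads the other's storage directly, so each team can change its own schema as long as the endpoint contract stays stable.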
To further improve cohesion and understanding between teams, data discovery APIs and documentation are also built to share database metadata with other teams, explaining what each table and field means so that a microservice can be planned efficiently. You can read more about how such companies build data discovery tools here.
If I understand you correctly, you are unsure how different departments receive data in a company?
The idea is that you create reusable and effective APIs to solve this problem.
Let's generically say the company we're looking at is Walmart. Walmart has millions of items in one or more databases. Each item has a unique ID, and so on.
If Walmart is selling items online via walmart.com, they need a way to get those items, so they create APIs and use them to fetch items based on certain query conditions.
Now, let's say Walmart has decided to build an app. They need those exact same items. Good thing those APIs already exist: the app uses the exact same ones to fetch the data.
Now, how does Walmart manage which items are available at which store, and at what price? They would usually link this metadata through additional database tables, tying them all together with primary and foreign keys.
This essentially allows Walmart to fetch from its CORE database ONLY the details that are intrinsic to the item (e.g. name, size, color, SKU, details), and link them to another database, say that of YOUR local Walmart, which contains information relevant only to that location for that item (e.g. price, stock, aisle number).
So yes, in a sense, they use multiple databases.
Perhaps this may drive you down some more roads: https://learnsql.com/blog/why-use-primary-key-foreign-key/
https://towardsdatascience.com/designing-a-relational-database-and-creating-an-entity-relationship-diagram-89c1c19320b2
There's a substantial diversity of approaches used between and even within big tech companies, driven by different company/org cultures and different requirements around consistency and availability.
Any time you have an explicit "query another service/another DB" dependency, you have a coupling which tends to turn a problem in one service into a problem in both services. This isn't necessarily a one-way thing: it's quite possible for the querying service to encounter a problem which cascades into a problem in the queried service. This is especially possible when a cache becomes load-bearing, which has led to major outages at at least one FANMAG in the not-that-distant past.
This has led some companies that could be fairly called big tech to eschew that approach in their service design, typically by having services publish events describing what has changed to a durable log (append-only storage). Other services subscribe to that log and use the events to construct their own eventually consistent view of the data owned by the other service (i.e. there's some level of data duplication, with services storing exactly the data they need to function).
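The publish/subscribe pattern described above can be sketched roughly like this. It is a toy model, assuming an in-memory list in place of a durable log; EventLog, SearchView, and the event field names are all made up for illustration.

```python
from typing import Dict, List


class EventLog:
    """Append-only log; an in-memory stand-in for durable storage."""

    def __init__(self) -> None:
        self._events: List[dict] = []

    def append(self, event: dict) -> int:
        self._events.append(event)
        return len(self._events) - 1  # offset of the new event

    def read_from(self, offset: int) -> List[dict]:
        return self._events[offset:]


class SearchView:
    """A subscriber that keeps only the product data it needs (here, names)."""

    def __init__(self, log: EventLog) -> None:
        self._log = log
        self._offset = 0  # next log position this subscriber has not yet applied
        self.products: Dict[str, str] = {}

    def catch_up(self) -> None:
        for event in self._log.read_from(self._offset):
            if event["type"] == "product_upserted":
                self.products[event["product_id"]] = event["name"]
            elif event["type"] == "product_removed":
                self.products.pop(event["product_id"], None)
            self._offset += 1
```

Until catch_up runs, the subscriber's view lags behind the log, which is the eventual consistency the answer refers to. The subscriber never queries the owning service, so an outage there does not block reads from the local view.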

Database structure when implementing a Slack style workspace/instance architecture

I'm working on an app that has a Slack style workspace architecture where the user can access the same function of the application under multiple "instances" (workspaces).
I'm going to continue with using Slack as an example to explain my issue.
When any action is taken in my application I need to validate that the user has the rights to perform an action on the specified resource and that the resource is within the same workspace as the user.
The first tables I create such as Users have a simple database relationship to the workspace. Using a WorkspaceId field in the Users table for example.
My issue is that as I create more tables which are "further" away, such as UserSettings, which might have a one-to-one relationship to the Users table, I now have to join to the Users record to get the workspace that the UserSettings record belongs to.
So now I am wondering whether it is worth adding a WorkspaceId value to all tables, since otherwise I will end up doing a lot of JOINs in my database to keep verifying that the user has permissions to that resource.
Looking for advice/architecture patterns which may help with the scenario.
I'm assuming your main concern with multiple JOIN statements is that query performance will suffer. Multiple JOIN statements don't always mean a query will be slow. Query performance depends on many factors: how large the dataset is, how well indexed it is, which database engine you use, and ultimately what the query plan is. You'll only end up with lots of JOIN statements if you decide to normalize the database that way. Using full third normal form is rarely the right choice for a schema because of the potential performance impact. Some duplication of data is generally okay; the trade-off you are making is storage cost versus query performance. To decide how to normalize the database there are many questions you should be asking. Here are some that come to mind:
What type of queries do you expect to make?
How often will each type of query be made?
How often will the data change and can a cache be used?
Does a different storage technology better suit the use case?
Is some of the data small enough that it can be all in one table?
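To make the storage versus query trade-off concrete, here is a small sketch using Python's built-in sqlite3 module. The schema, sample rows, and function names are assumptions made up for the example; it compares looking up a settings row's workspace through a JOIN with reading a duplicated workspace_id column.

```python
import sqlite3
from typing import Optional


def build_db() -> sqlite3.Connection:
    """Create an in-memory database with a duplicated workspace_id on user_settings."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE users (
            id INTEGER PRIMARY KEY,
            workspace_id INTEGER NOT NULL
        );
        CREATE TABLE user_settings (
            id INTEGER PRIMARY KEY,
            user_id INTEGER NOT NULL REFERENCES users(id),
            workspace_id INTEGER NOT NULL,  -- duplicated from users
            theme TEXT
        );
        INSERT INTO users VALUES (1, 10), (2, 20);
        INSERT INTO user_settings VALUES (100, 1, 10, 'dark'), (200, 2, 20, 'light');
        """
    )
    return conn


def workspace_via_join(conn: sqlite3.Connection, settings_id: int) -> Optional[int]:
    """Normalized lookup: follow user_settings -> users to find the workspace."""
    row = conn.execute(
        "SELECT u.workspace_id FROM user_settings s "
        "JOIN users u ON u.id = s.user_id WHERE s.id = ?",
        (settings_id,),
    ).fetchone()
    return row[0] if row else None


def workspace_denormalized(conn: sqlite3.Connection, settings_id: int) -> Optional[int]:
    """Denormalized lookup: read the duplicated column directly."""
    row = conn.execute(
        "SELECT workspace_id FROM user_settings WHERE id = ?", (settings_id,)
    ).fetchone()
    return row[0] if row else None
```

The denormalized form avoids the JOIN but means every write that moves a user between workspaces must also update the copies, which is the cost side of the trade-off.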
In my experience, designing a user management system usually ends up with a cache or similar mechanism for fast access to a given user's permissions, with an acceptable expiry window. This means you only query the database for a given user when the expiry window has passed, and you use the cache the majority of the time. This is why many security systems and user systems don't apply changed settings immediately. The more granular and flexible the permissions you want to grant users, the more expensive the query is going to be because of its complexity. At that point you can decide to denormalize the data or use a caching mechanism.
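A cache with an expiry window of the kind described could look roughly like the sketch below. PermissionCache and its loader callback are hypothetical names; the loader stands in for whatever (possibly JOIN-heavy) database query resolves a user's permissions.

```python
import time
from typing import Callable, Dict, FrozenSet, Tuple


class PermissionCache:
    """Caches each user's permissions for ttl_seconds before reloading them."""

    def __init__(
        self,
        loader: Callable[[str], FrozenSet[str]],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader  # assumed to run the expensive permission query
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, FrozenSet[str]]] = {}

    def get(self, user_id: str) -> FrozenSet[str]:
        now = self._clock()
        entry = self._entries.get(user_id)
        if entry is not None and now < entry[0]:
            return entry[1]  # still inside the expiry window
        permissions = self._loader(user_id)
        self._entries[user_id] = (now + self._ttl, permissions)
        return permissions
```

With this design a permission change can take up to ttl_seconds to become visible, so the window should be chosen as the longest delay the application can tolerate.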

Best approach for caching lists of objects in memcache

Our Google AppEngine Java app involves caching recent users that have requested information from the server.
The current working solution is that we store the users' information in a list, which is then cached.
When we need a recent user we simply grab one from this list.
The list of recent users is not vital to our app working, and if it's dropped from the cache it's simply rebuilt as users continue to request from the server.
What I want to know is: Can I do this a better way?
With the current approach there is only a certain number of users we can store before the list gets too large for memcache (we are currently limiting the list to 1000 and dropping the oldest when we insert a new one). Also, the list is going to need updating very frequently, which involves retrieving the full list from memcache just to add a single user.
Having each user stored in cache separately would be beneficial to us as we require the recent user to expire after 30 minutes. At the moment this is a manual task we do to make sure the list does not include expired users.
What is the best approach for this scenario? If it's storing the users separately in cache, what's the best approach to keeping track of the user so we can retrieve it?
You could keep in the memcache list just "pointers" that you can use to build individual memcache keys to access user entities separately stored in memcache. This makes the list's memcache size footprint a lot smaller and easy to deal with.
If the user entities have parents then pointers would have to be their keys, which are unique so they can be used as memcache keys as well (well, their urlsafe versions if needed).
But if the user entities don't have parents (i.e. they're root entities in their entity groups) then you can use their datastore key IDs as pointers, which are typically shorter than the keys. Better yet, if the IDs are numerical you can even store them as numbers, not as strings. The IDs are unique for those entities, but they might not be unique enough to serve as memcache keys, so you may need to add a prefix or suffix to make the respective memcache keys unique (to your app).
When you need a user entity's data you first obtain the "pointer" from the list, build the user entity's memcache key, and retrieve the entity with that key.
This, of course, assumes you do have reasons to keep that list in place. If the list itself is not mandatory all you need is just the recipe to obtain the (unique) memcache keys for each of your entities.
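The pointer-list idea can be sketched as below. This does not use the App Engine memcache API: InMemoryCache is a dict-backed stand-in written for the example, and the key prefix, list key, and function names are assumptions. The 30-minute expiry and the 1000-entry limit come from the question.

```python
import time
from typing import Any, Callable, Dict, List, Optional, Tuple


class InMemoryCache:
    """Dict-backed stand-in for memcache with optional per-key expiry."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._data: Dict[str, Tuple[Optional[float], Any]] = {}
        self._clock = clock

    def set(self, key: str, value: Any, ttl: Optional[float] = None) -> None:
        expires = None if ttl is None else self._clock() + ttl
        self._data[key] = (expires, value)

    def get(self, key: str) -> Any:
        entry = self._data.get(key)
        if entry is None:
            return None
        expires, value = entry
        if expires is not None and self._clock() >= expires:
            del self._data[key]  # expired entries behave like cache misses
            return None
        return value


USER_TTL = 30 * 60                   # recent users expire after 30 minutes
RECENT_LIST_KEY = "recent_user_ids"  # assumed key for the pointer list
MAX_RECENT = 1000


def user_key(user_id: int) -> str:
    # The prefix keeps numeric IDs from colliding with other keys in the app.
    return f"recent_user:{user_id}"


def remember_user(cache: InMemoryCache, user_id: int, user: Any) -> None:
    cache.set(user_key(user_id), user, ttl=USER_TTL)
    ids: List[int] = cache.get(RECENT_LIST_KEY) or []
    ids = [i for i in ids if i != user_id] + [user_id]
    cache.set(RECENT_LIST_KEY, ids[-MAX_RECENT:])


def recent_users(cache: InMemoryCache) -> List[Any]:
    ids: List[int] = cache.get(RECENT_LIST_KEY) or []
    users = [cache.get(user_key(i)) for i in ids]
    return [u for u in users if u is not None]  # skip users that have expired
```

The list now holds only small IDs, and per-user expiry is handled by the cache itself. Note that the read-modify-write on the list is not atomic in this sketch; with a real shared cache, concurrent updates would need a compare-and-set style operation.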
If you use NDB caching, it will take care of memcache for you. Simply request the users with the key using ndb.Key(Model, id).get() or Model.get_by_id(id), where id is the User id.

Separating data from different sites

We are creating a web solution that contains a large number of users along with their events, calendars, and content to be managed. This solution can be white-labeled and sold to other vendors as a service, i.e. although the hosting is on our SINGLE server, they will have their own administrator, their own users, and separate content, completely disconnected from the other vendors. For example, we are going to host the solution as
www.example.com/company1
www.example.com/company2
www.example.com/company3
The question is: should we use a separate database for each company, or a single database for managing all the companies?
Thanks
You should use separate databases for each company, unless you are offering some sort of service where the companies know that data is being pooled.
This is a question of data protection. No matter how much you swear that one company can only see their data in the table, you may not be able to convince prospective clients of this fact.
In addition, you need to keep the options open of running the databases on different servers. You don't want peak performance at one company to affect another company. Or, you don't want a special change for one company -- which might require bringing down the application with their knowledge -- to affect other clients.
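A database-per-company setup needs some way to route each request to the right database. The sketch below is one possible shape, assuming the company name is the first path segment as in the URLs in the question; the TenantDatabases class, the events table, and the in-memory SQLite databases are all stand-ins chosen for the example.

```python
import sqlite3
from typing import Dict


class TenantDatabases:
    """Keeps one separate database connection per company."""

    def __init__(self) -> None:
        self._connections: Dict[str, sqlite3.Connection] = {}

    def register(self, company: str) -> None:
        # An in-memory database stands in for a per-company database,
        # which could later live on a different server.
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE events (title TEXT NOT NULL)")
        self._connections[company] = conn

    def for_path(self, path: str) -> sqlite3.Connection:
        """Pick the database from the first path segment, e.g. /company1/calendar."""
        company = path.strip("/").split("/", 1)[0]
        try:
            return self._connections[company]
        except KeyError:
            raise LookupError(f"unknown company: {company!r}") from None
```

Since no query ever runs against a shared table, a bug in a WHERE clause cannot expose one company's rows to another, and a single company's database can be moved or taken down without touching the rest.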

What issues to check for when going from single to multiple users

I have inherited a (rather large) database, which at the moment is only ever accessed by a single user. In the future I want it to be accessible to multiple users at the same time (which can be done using FileMaker Network).
I am concerned that multiple user access may break much of the functionality (for example searches, which change records in tables). What other things should I look out for which could cause multi-user problems?
Searches should just be queries - those shouldn't impact other users above/beyond overall performance.
Updates should be reflected across all users; that's the benefit of using FileMaker for multiple users. If you need to keep separate record sets for distinct users then you will need to look at making significant changes.

Resources