Social network with video chat feature using neo4j - database

I am using neo4j to build a social network web app where users that are friends can communicate with each other through video calls. Each participating user will also be able to submit a review at the end of each call. I structured the graph such that two (:User) nodes can have a [:FRIEND] relationship between each other. For a particular video call, I am planning on creating a (:VideoCall) node (which contains properties such as roomId) and a [:PARTICIPANT] relationship from the (:VideoCall) node to each participating (:User) node. The [:PARTICIPANT] relationship will have a rating property containing the user's review for that video call. Would this model be performant if there are a large number of user and video call nodes? Is there a better way to design the database for this type of feature?

Yes, it should perform well. Just make sure the properties you want to look up by are indexed and that the appropriate constraints are in place.
What kinds of use cases would you want to cover besides the regular ones?

It is a good model if the video calls involve multiple users AND you want to use roomId as a query condition, because that way you can easily find all users that have participated in a specific video call.
However, you mentioned it is a social networking web app, so chances are the video calls are just between TWO users. If that's the case, there's an alternative to your current model: model each video call as a relationship between the two users, (:User)-[:VIDEO_CALL]->(:User). Properties such as roomId can be stored on the relationship. This model saves memory because you have fewer nodes.
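For illustration, here is a minimal sketch of both models using the official neo4j Python driver. The connection details, property names, and the idea of setting ratings at creation time are assumptions, not taken from the question:

```python
# Sketch of the two modelling options discussed above (illustrative only).
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Option 1: a (:VideoCall) node with [:PARTICIPANT] relationships.
# Works for any number of participants; ratings live on the relationships.
CREATE_CALL_NODE = """
MATCH (a:User {id: $callerId}), (b:User {id: $calleeId})
CREATE (c:VideoCall {roomId: $roomId, startedAt: datetime()})
CREATE (c)-[:PARTICIPANT {rating: $callerRating}]->(a)
CREATE (c)-[:PARTICIPANT {rating: $calleeRating}]->(b)
"""

# Option 2: a single relationship between the two users.
# Only works for 1:1 calls; all call properties sit on the relationship.
CREATE_CALL_EDGE = """
MATCH (a:User {id: $callerId}), (b:User {id: $calleeId})
CREATE (a)-[:VIDEO_CALL {roomId: $roomId,
                         callerRating: $callerRating,
                         calleeRating: $calleeRating}]->(b)
"""

with driver.session() as session:
    session.run(CREATE_CALL_NODE, callerId=1, calleeId=2, roomId="abc",
                callerRating=5, calleeRating=4)
    session.run(CREATE_CALL_EDGE, callerId=1, calleeId=2, roomId="abc",
                callerRating=5, calleeRating=4)
```

In practice the ratings would be set after the call ends rather than at creation time; the queries above just show where each piece of data lives in the two models.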

Related

How do big tech companies share databases across multiple teams?

How do multiple teams (which own different system components/microservices) in a big tech company share their databases?
I can think of multiple use cases where this would be required. For example, in an e-commerce firm, the same product will be shared among multiple teams: the product will at first be part of the product onboarding service, then perhaps the catalog service (which stores all products and categories), then the search service, cart service, order placing service, recommendation service, cancellation & return service, and so on.
If they don't share any DB, then:
Do they all have a redundant copy of the products with the same product ID, and
Wouldn't there be a challenge to achieve consistency among multiple teams?
I have multiple related doubts in both cases, whether they share a DB or not.
I have been through multiple tech blogs and videos on software design and still didn't get a satisfying answer. Please share some resources that give a complete workflow of how things work end-to-end in a big tech firm.
Thank you
In the microservice architecture, each microservice exposes endpoints through which other microservices can access the information shared between the services. So one service stores only minimal information about a record that is managed by another microservice.
For example, if a user service would like to fetch orders for a particular user in an e-commerce case, the order service would expose an endpoint that, given a user ID, returns all orders related to that user, and so on. So essentially the only user-related field the order service needs to store is the user ID; the rest of the user details are irrelevant to it.
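As a rough illustration of that idea (the endpoint path, port, and field names are made up), the order service stores and filters only by the user ID:

```python
# Minimal sketch of an order service that references users solely by user_id.
from flask import Flask, jsonify

app = Flask(__name__)

# The order service persists orders with just a user_id reference; the rest of
# the user profile lives in (and is served by) the user service.
ORDERS = [
    {"order_id": 101, "user_id": 7, "total": 49.99},
    {"order_id": 102, "user_id": 7, "total": 15.00},
    {"order_id": 103, "user_id": 8, "total": 120.00},
]

@app.route("/users/<int:user_id>/orders")
def orders_for_user(user_id):
    # Endpoint exposed to other services: all orders for the supplied user ID.
    return jsonify([o for o in ORDERS if o["user_id"] == user_id])

if __name__ == "__main__":
    app.run(port=5001)
```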
To further improve cohesion and understanding between teams, data discovery APIs/documentation are also built to share database metadata with other teams, explaining what each table/field means so that a microservice can be planned out efficiently. You can read more about how such companies build data discovery tools here.
If I understand you correctly, you are unsure how different departments receive data in a company?
The idea is that you create reusable and effective APIs to solve this problem.
Let's generically say the company we're looking at is Walmart. Walmart has millions of items in a database (or several). Each item has a unique ID, etc.
If Walmart is selling items online via walmart.com, they have to have a way to get those items, so they create APIs and use them to grab items based on certain query conditions.
Now, let's say Walmart has decided to build an app... well, they need those exact same items! Good thing we already created those APIs; we will use the exact same ones to grab the data.
Now, how does Walmart manage which items are available at which store, and at what price? They would usually link this metadata through additional database schema tables, tying them all together with primary and foreign keys.
^^ This essentially allows Walmart to grab ONLY the item out of their CORE database, which holds just the details intrinsic to the item (e.g. name, size, color, SKU, details, etc.), and link it to another database, say, YOUR local Walmart's, that contains information relevant only to your Walmart location in regard to that item (e.g. price, stock, aisle number, etc.).
So, using multiple databases, yes, in a sense.
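A toy sketch of that core-items-plus-store-inventory split, using made-up tables and SQLite purely for illustration:

```python
# Core item table joined to per-store inventory via a foreign key (illustrative schema).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE item (
    item_id INTEGER PRIMARY KEY,   -- unique item ID used everywhere
    sku     TEXT UNIQUE,
    name    TEXT,
    size    TEXT,
    color   TEXT
);

CREATE TABLE store_inventory (
    store_id INTEGER,
    item_id  INTEGER REFERENCES item(item_id),  -- foreign key to the core item
    price    REAL,
    stock    INTEGER,
    aisle    TEXT,
    PRIMARY KEY (store_id, item_id)
);
""")

conn.execute("INSERT INTO item VALUES (1, 'SKU-123', 'Toaster', 'M', 'red')")
conn.execute("INSERT INTO store_inventory VALUES (42, 1, 19.99, 7, 'A3')")

# Join core item details with the data for one specific store.
row = conn.execute("""
    SELECT i.name, s.price, s.stock, s.aisle
    FROM item i JOIN store_inventory s ON s.item_id = i.item_id
    WHERE s.store_id = 42
""").fetchone()
print(row)  # ('Toaster', 19.99, 7, 'A3')
```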
Perhaps this may drive you down some more roads: https://learnsql.com/blog/why-use-primary-key-foreign-key/
https://towardsdatascience.com/designing-a-relational-database-and-creating-an-entity-relationship-diagram-89c1c19320b2
There's a substantial diversity of approaches used between and even within big tech companies, driven by different company/org cultures and different requirements around consistency and availability.
Any time you have an explicit "query another service/another DB" dependency, you have a coupling which tends to turn a problem in one service into a problem in both services. And this isn't necessarily a one-way thing: it's quite possible for the querying service to encounter a problem which cascades into a problem in the queried service (this is especially likely when a cache becomes load-bearing, which has led to major outages at at least one FANMAG in the not-that-distant past).
This has led some companies that could be fairly called big tech to eschew that approach in their service design, typically by having services publish events describing what has changed to a durable log (append-only storage). Other services subscribe to that log and use the events to construct their own eventually consistent view of the data owned by the other service (i.e. there's some level of data duplication, with services storing exactly the data they need to function).
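A deliberately simplified, in-memory sketch of that pattern (a real system would use Kafka, a change-data-capture feed, or similar rather than a Python list):

```python
# One service publishes change events to an append-only log; another service
# consumes them to maintain its own eventually consistent view.
import json

event_log = []  # stand-in for a durable, append-only log

def publish_product_changed(product_id, name, price):
    """Product service: publish an event describing what changed."""
    event_log.append(json.dumps({
        "type": "product_changed",
        "product_id": product_id,
        "name": name,
        "price": price,
    }))

class SearchServiceView:
    """Search service: subscribes to the log and keeps only the fields it needs."""
    def __init__(self):
        self.products = {}
        self.offset = 0  # position in the log we've consumed up to

    def consume(self):
        for raw in event_log[self.offset:]:
            event = json.loads(raw)
            if event["type"] == "product_changed":
                # Store exactly the data this service needs, nothing more.
                self.products[event["product_id"]] = {"name": event["name"]}
            self.offset += 1

publish_product_changed(1, "Toaster", 19.99)
view = SearchServiceView()
view.consume()
print(view.products)  # {1: {'name': 'Toaster'}} -- eventually consistent copy
```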

UI - Consuming different microservice APIs, consolidating the results into a grid view, and allowing users to sort the data

Currently in our front-end project (AngularJS), we need to consume different endpoints built in a microservices architecture and show the data in a list view. Then we need to allow users to sort the data based on the columns they select. For example, we list 10 columns, of which 6 are rendered from Service A and the other 4 are pulled from Service B. The two services have no direct relation mapping; instead, Service B returns its data based on the object ID.
Now we have consolidated the list, shown the columns, and allowed users to choose the columns they want. As a next step, we need to allow users to sort any column's data seamlessly. Is there any best practice in the microservices paradigm for retrieving the data from both services, sorting it, and showing the result?
We have a few options:
1. List all the data at once from both services and sort it in the frontend. The problem with this approach is that with a larger dataset the user might perceive slowness, and at times the browser can hang. We are using AngularJS in our project and already see slowness when the data set grows.
2. Introduce an intermediate API service (a lightweight Node.js server) that coordinates the requests; it internally handles requesting data from the different services and sends the result back.
3. Create an intermediate API service that caches the data, orchestrates the requests, and responds with the data from the multiple services.
Can anyone share other practices that can be followed for the above use case? In the current microservices trend, all API services are exposed as separate services, which makes the frontend a bit complex: it has to coordinate the different APIs and show the data to users in the UI.
Any suggestions, approaches or hints will be helpful.
Thanks in advance.
Srini
Like you said, there are a few ways to handle your scenario. In my opinion the best approach would be option two. It is similar to the Gateway Aggregation pattern, where you introduce a gateway layer to handle the aggregation of your service APIs. An added benefit is that you may be able to park some common functionality in this gateway layer if required.
Of course, the obvious drawback is that you now have another layer that needs to be highly available and managed. So do consider the pros and cons carefully before deciding on your approach. For example, if this is the only aggregation you will ever need, then option 3 may be a better choice.
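A rough sketch of what such a gateway endpoint might look like (service URLs, field names, and the sort behaviour are assumptions, not from the question):

```python
# Gateway aggregation sketch: fetch rows from Service A and Service B in
# parallel, merge them by object id, sort server-side, and return one list.
from concurrent.futures import ThreadPoolExecutor

import requests
from flask import Flask, jsonify, request

app = Flask(__name__)

SERVICE_A = "http://service-a.internal/items"    # assumed endpoint
SERVICE_B = "http://service-b.internal/details"  # assumed endpoint

@app.route("/grid")
def grid():
    sort_by = request.args.get("sort", "name")
    # Fetch both services concurrently so the gateway adds little latency.
    with ThreadPoolExecutor() as pool:
        a_future = pool.submit(lambda: requests.get(SERVICE_A).json())
        b_future = pool.submit(lambda: requests.get(SERVICE_B).json())
    rows_a, rows_b = a_future.result(), b_future.result()

    # Merge Service B's columns into Service A's rows by object id.
    b_by_id = {row["objectId"]: row for row in rows_b}
    merged = [{**row, **b_by_id.get(row["objectId"], {})} for row in rows_a]

    # Sort once, server-side, on whichever column the user picked
    # (assumes the chosen column holds comparable values across rows).
    merged.sort(key=lambda row: row.get(sort_by, ""))
    return jsonify(merged)
```

This keeps the heavy merging and sorting out of the AngularJS client; paging could be added on the same endpoint so the browser never holds the full dataset.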

Multiple microservices and database associations

I have a question concerning microservices and databases. I am developing an application: a user sees a list of countries and can click through it to see a list of attractions in that country. I created a country-service, an auth-service (which contains users for OAuth2) and an attraction-service. Each service has its own database. I mapped the association between an attraction and its country by the ISO code (for example: BE = Belgium): /api/attraction/be.
The approach above seems to work, but I am a bit stuck with the following: a user must be able to add an attraction to his/her list of favorites, but I do not see how that's possible since I have so many different databases.
Do I create a favorite-service? Do I pass IDs (I don't think I should do this)? What kind of business key can I create? How do I associate the data in a correct way...?
Thanks in advance!
From the information you have provided, a standalone favourite-service sounds like the right option.
A simpler and quicker secondary option might be to handle this in your user service, which looks after the persistence of your users' data, since favourites are exclusive to a user entity.
As for IDs, I haven't seen many reasons why they would be a bad idea. Your individual services are going to need to store some identifying value for related data, and the main issue here, I feel, is just keeping this ID field consistent across your different services. Whatever you choose just needs to be reliable and predictable, to keep things easy and simple as your system grows.
If you are using RESTful HTTP, you already have persistent, bookmarkable identifiers for resources: URLs (URIs, or IRIs if you want to be pedantic). Those are the IDs that you can use to refer to an entity in another microservice.
There is no need to introduce another layer of IDs, be it country codes, or database ids. Those things are internal to your microservice anyway and should be transparent for all clients, including other microservices.
To be clear, I'm saying, you can store the URI to the country in the attractions service. That URI should not change anyway (although you might want to prepare to change it if you receive permanent redirects), and you have to recall that URI anyway, to be able to include it in the attraction representation.
You don't really need any "business key" for favorites either, other than the URI of the attraction. You can bookmark that URI, just as you would in a browser.
I would imagine if there is an auth-service, there are URIs also for identifying individual users. So in a "favorites" service, you could simply link the User URI with Attraction URIs.
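A minimal sketch of a favourites service along those lines, where the only identifiers stored are URIs (all routes and example URIs are made up):

```python
# Favourites service that links a user's URI to attraction URIs; no extra
# business key is introduced.
from collections import defaultdict

from flask import Flask, jsonify, request

app = Flask(__name__)

# user URI -> set of attraction URIs
favourites = defaultdict(set)

@app.route("/favourites", methods=["POST"])
def add_favourite():
    body = request.get_json()
    # e.g. {"user": "https://auth.example.com/users/42",
    #       "attraction": "https://attractions.example.com/api/attraction/be/123"}
    favourites[body["user"]].add(body["attraction"])
    return "", 204

@app.route("/favourites")
def list_favourites():
    user_uri = request.args.get("user")
    return jsonify(sorted(favourites[user_uri]))
```

A client (or another service) then dereferences the returned attraction URIs directly against the attraction-service to render the user's list.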

App Engine entity groups: grouping all of a user's data vs avoiding them as long as possible

I'm working on a web application that allows users to create simple websites and publish them on a static web host, and I have problems deciding if and how I should use ancestors in the gray area between necessary and avoidable.
The model is rather simple: currently every User has one or more Website entities. The Website entities store all the basic information about a website, plus its nested navigation menu that refers to Page entities (the navigation tree is stored as a JSON property). The Page entity types are based on a PolyModel, and there are several page types that behave differently (there's a GalleryPage, for example).
There are no entity groups (or rather, no entities with ancestors) as of yet, and I'll only need a couple of transactions. When updating a Page's name, for example, I have to update it in the Page entity itself as well as in the navigation tree on the Website entity.
I think I understand how entity groups work and the basic implications of using them, but I have trouble deciding on the "best" way to structure my data in the absence of strong reasons for either approach. I could:
Go entirely without ancestors on my entities. As far as I understand I can still use cross-group transactions as long as I get the entities by key and don't need more than 5 within the transaction. The downside is that I'd depend on the XG transactions and there might come a point where I can't ninja my way around using ancestor queries anymore (and then it might be too late).
Make the user object the parent of all of his Websites, Pages, and other data (see the sketch below). This would give the user a strongly consistent view of all of his data and allow me to use transactions whenever I add a feature that needs them, but it limits sustained writes to 1-5/sec. But, as a user will only ever be updating his own data, this might actually work and behave just the same for 1000 users as it will for 1.
Try to use even smaller entity groups (like separating the navigation from the Website and making that the parent of the Website's Pages). But I'm not quite sure if there's much benefit to this, because most of the editing happens on Pages anyway.
So I guess the real question is: how do you decide when to use ancestor relationships on App Engine when there's no obvious reason for or against them? Would you go for the convenience of strongly consistent queries and being able to use transactions freely while adding features later, or would you avoid them at all costs until there's a very obvious reason for them, even if it might limit my ability to do transactions later?
I read the related documentation, read the chapter on transactions in "Programming App Engine", looked at quite a few of the Google I/O videos, but I still find it hard to make that decision.
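For concreteness, a minimal sketch of option 2, with the user as the ancestor of all of their entities, using legacy ndb; all model and property names are illustrative rather than taken from the project:

```python
# Option 2 sketch: User is the ancestor of Website and Page entities, so
# per-user updates can run in single-group transactions.
from google.appengine.ext import ndb
from google.appengine.ext.ndb import polymodel

class Website(ndb.Model):
    title = ndb.StringProperty()
    navigation = ndb.JsonProperty()   # nested navigation tree as JSON

class Page(polymodel.PolyModel):
    name = ndb.StringProperty()

class GalleryPage(Page):
    image_keys = ndb.KeyProperty(repeated=True)

user_key = ndb.Key('User', 'some-user-id')           # hypothetical user id
website_key = Website(parent=user_key, title='My site').put()
page_key = Page(parent=user_key, name='About').put()

@ndb.transactional
def rename_page(page_key, website_key, new_name):
    # Both entities share the User ancestor, so one entity group suffices.
    page, website = page_key.get(), website_key.get()
    page.name = new_name
    # ... update the page's entry inside website.navigation here ...
    ndb.put_multi([page, website])
```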

Graph Database to Count Direct Relations

I'm trying to graph the linking structure of a web site so I can model how pages on a given domain link to each other. Note I'm not graphing links to sites not on the root domain.
Obviously this graph could be considerable in size. One of the main queries I want to perform is to count how many pages directly link into a given URL. I want to run this against the whole graph (shudder), such that I end up with a list of URLs and the count of incoming links to each one.
I know one popular way of doing this would be via some kind of map reduce - and I may still end up going that way - however I have a requirement to be able to view this report in (near) realtime which isn't generally map reduce friendly.
I've had a quick look at Neo4j and OrientDB. While both of these could model the relationship I want, it's not clear whether I could query them to generate the report I want. At this point I'm not committed to any particular technology.
Any help would be greatly appreciated.
Thanks,
Paul
Both OrientDB and Neo4j support Blueprints as a common API for graph operations like traversal, counting, etc.
If I've understood your use case correctly, your graph seems pretty simple: you have "URL" vertices that link to each other with one type of edge, "Links".
To execute operations against the graph, take a look at Gremlin.
You might have a look at structr. It is an open source CMS running on top of Neo4j and has exactly those types of inter-page links.
To get the number of links pointing to a page, you just have to iterate over the incoming LINKS_TO relationships for the current page node.
What is the use case for your query? A popular-pages list? So it would just contain the top-n pages? You might then try to start at random places in the graph, traverse incoming LINKS_TO relationships to your current node(s) in parallel, and put them into a sorting structure, so that you always start/continue with the 20 or so page nodes that already have the highest number of incoming links (until they're finished).
Marko Rodriguez has some similar "page-rank" examples in the Gremlin documentation. He's also got several blog posts where he talks about this.
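For example, assuming a Page label and a LINKS_TO relationship type (both are assumptions, not from the question), the per-URL incoming-link counts can be computed with a single Cypher query, shown here via the neo4j Python driver:

```python
# Count incoming links per URL across the whole graph.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

INCOMING_COUNTS = """
MATCH (p:Page)
OPTIONAL MATCH (src:Page)-[:LINKS_TO]->(p)
RETURN p.url AS url, count(src) AS incoming
ORDER BY incoming DESC
"""

with driver.session() as session:
    for record in session.run(INCOMING_COUNTS):
        print(record["url"], record["incoming"])
```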
Well, with Neo4j you won't be able to split the graph across servers to distribute the load. You could replicate the database to distribute the computation, but then updating will be slow (as you have to replicate the updates). I would attack the problem by maintaining a count of inbound links as a property on each node, updating it whenever a new relationship is added. Neo4j has excellent write performance. Of course, you don't strictly need to persist this information, because direct relationships are cheap to retrieve (you don't get a collection of all related nodes, just an iterator).
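A sketch of that counter-on-write idea (labels, property names, and connection details are assumed): each time a link is recorded, an inboundCount property on the target page is bumped, so the near-realtime report becomes a simple property read:

```python
# Maintain an inboundCount property on each page as links are ingested.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

ADD_LINK = """
MERGE (src:Page {url: $src})
MERGE (dst:Page {url: $dst})
MERGE (src)-[r:LINKS_TO]->(dst)
ON CREATE SET dst.inboundCount = coalesce(dst.inboundCount, 0) + 1
"""

def add_link(src_url, dst_url):
    with driver.session() as session:
        session.run(ADD_LINK, src=src_url, dst=dst_url)

add_link("http://example.com/a", "http://example.com/b")
# Report query: MATCH (p:Page) RETURN p.url, p.inboundCount ORDER BY p.inboundCount DESC
```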
You should also take a look at a highly scalable graph database product, such as InfiniteGraph. If you email their technical support I think they will be able to point you at some sample code that does a large part of what you've described here.
