Graph Data Structures with millions of nodes (Social network)

Graph Data Structures with millions of nodes (Social network) - c

In the context of design of a social network using Graphs data structure, where you can perform a BFS to find a connection from one person to another, I have some questions pertaining to it.
If there are million users, the topology would indeed be much more complicated and interconnected than the graphs we normally design and I am trying to comprehend how you could solve these problems.
In the real world, servers fail. How does this affect you?
How could you take advantage of caching?
Do you search until the end of the graph (infinite)? How do you decide when to give up?
In real life, some people have more friends of friends than others, and are therefore more likely
to make a path between you and someone else. How could you use this data to pick where you
start traverse?

Your question seems interesting and curious :)
1) Well... of course, data is stored in disks, not in ram.
Disks have systems that avoid failure, in particular, RAID-5 for example.
Redundancy is the key: if one system fail there is another system ready to take his place.
There is also redundancy and workload sharing together... there are two computers that work in parallel and share their jobs but if one stops only one works and take the full workload.
In places like google or facebook redundancy is not 2, is 1200000000 :)
And consider also that data is not in a single server farm, in google there are several datacenters connected together, so if one building explodes, another one will take his place for example.
2) Not an easy question at all, but usually these systems have big cache for disk arrays too, so reading and writing data on disk is faster than on our laptops :)
Data can be processed in parallel by several concurrent systems and this is the key of the speed of services like facebook.
3) The end of the graph is not infinite.
So it is possible with actual technology indeed.
The computational complexity of exploring all connections and all nodes on a graph is O(n + m) where n is the number of vertices and m the number of edges.
This means, it is linear to the number of registered user and to the number of connection between users. And RAM these days is very cheap.
Being a linear growth is easy to add resources when needed.
Add more computers the more you get rich :)
Consider also that no-one will perform a real search for every node, everything in facebook is quite "local", you can view the direct friend of one person, not the friend of friend of friend .... it would be not useful.
Getting the number of vertices directly connected to a vertex, if the data structure is well done, is very easy and fast. In SQL it would be a simple select and if tables are well indexed it will be very fast and also not very dependant on the total number of users (see the concept of hash tables).

Related

Alternative to scanning AWS DynamoDB?

I understand that scanning DynamoDB is not reccomended and is bad practice.
Let's say I have a food ordering website and I want to do a daily scan of all users to find out who hasn't ordered food in the last week so I can send them an email (just an example).
This would put some very spikey demand on the database, especially with a large user base.
Is there an alternative to these scheduled scans that I'm missing? Or in this scenario is a scan the best tool for the job?

There are a lot of different possible answers to this question. As so often all of this begins with the simple truth that the best way to do something like this, depends on the actual specifics and what you are trying to optimise for (cost, latency, duration etc.).
Since this appears to be a "once a week" thing I guess latency and "job" duration are not high on the priority list, but cost might be.
The next important thing to consider is implementation complexity. For example: if your service only has 100 users, I would not bother with any of the more complex solutions and just do a scan. But if your service has millions of users, this is probably not a great idea anymore.
For the purpose of this answer I am going to assume that your user base has become too large to just do a scan. In this scenario I can think of two possible solutions:
Add a separate index that allows you to "query" for the last order date easily.
Use a S3 backup
The first should be fairly self explanatory. As often described in DynamoDB articles, you are supposed to define your "access patterns" and build indexes around them. The pro here is that you are still operating within DynamoDB, the con is the added cost.
My preferred solution is probably to just do scheduled backups of the table to S3 and then process the backup somewhere else. Maybe a custom tool you write or some AWS service that allows processing large amounts of data. This is probably the cheapest solution but processing time might not be "super fast".
I am looking forward to other solutions to this interesting question.

Architecture Setup and Database

I am trying to create a social app in which users can follow their friends and their personalised feed is real time as well.
Question : Is Graph Database the best option to cater to such problem. What is the experience when the data reaches millions. Also, what is the right way to proceed for feeds, do we keep Kafka stream for each user? How do I start with the whole setup with respect to over engineering, a start point and the flow.

As usual, it depends 100% on how you use these technologies.
Neo4j (a graph database) can store fairly large amounts of data:
A graph database with a quadrillion nodes? Such a monstrous entity is
beyond the scope of what technologist are trying to do now. But with
the latest release of the Neo4j database from Neo Technology, such a
graph is theoretically possible.
There is effectively no limit to the sizes of graphs that people can
run with Neo4j 3.0, which was announced today, says Neo Vice President
of Products Philip Rathle.
“Before Neo4j 3.0, graph sizes were limited to tens of billions of
records,” Rathle says. “Even though they may not have tens of billions
of data items to actually store in a graph, just having a ceiling made
them nervous.”
By adopting dynamically sized pointers, Neo4j can now scale up to run
the biggest graph workloads that customers can throw at it. The
company expects some of its customers will begin to put that extra
capacity to use, for things such as crunching IoT data, identifying
fraud, and generating product recommendations.
Source: https://www.datanami.com/2016/04/26/neo4j-pushes-graph-db-limits-past-quadrillion-nodes/
Start with something simple, Neo4j sounds like a good starting point. Once you start hitting bottlenecks or scaling issues, then you can start looking at other solutions. It's very hard to predict where your bottlenecks will be without real-world data.
Realtime feeds at scale is hard to build, first define how realtime do you want it to be. Is 1 minute still considered realtime? Maybe 5 minutes?
The number you choose here will directly affect your technology choices.
Either way, more information is needed to give a more detailed answer.

How to implement sharding?

First world problems: We've got a production system that is growing rapidly, and we are aiming to grow our user base even more. At peak times our DB is flatlining at 100% CPU, which I take as an indication that it's pretty much stretched to the limit. Being an AWS instance, we could always throw some more hardware at it, but long term, it seems we will need to implement sharding.
I've Googled all over and found lots of explanations of what sharding is, why it is a good idea under certain circumstances, what design considerations, etc... but not a word on the practicality of how to do it.
What are the practical steps to shard a database? How do you redirect queries to the appropriate shard? And how do you run reports that require data from all shards?

The first thing you'll want to decide is whether or not you want to take on the complexity of routing queries in your application. If you decide to roll your own implementation, there are a number of complexities that you'll need to deal with over time.
You'll need a scheme to distribute data and queries evenly across the cluster. You'll need to ensure that this scheme is forward-compatible with a larger cluster, as if your data is already big enough to require a sharded architecture, it's likely that you'll need to add more servers.
The problem with sharding schemes is that they force you to make tradeoffs that you wouldn't have to make with a single-server database. For example, if you are sharding by user_id, any query which spans multiple users will need to be sent to all servers (or a subset of servers) and the results must be accumulated in your client application. This is especially complex if you are using aggregate queries that rely on the ordering of the data, such as MAX(), or any histogram computation.
All of this complexity isn't meant to scare you, but it's something you'll need to pay attention to. There are tools out there that can help you (disclosure: my company makes a tool called dbShards) but you can definitely put together your own solution, especially if your application is mature and the query patterns are quite predictable.

Why Database is not used with 3D and Drawing softwares (to store points and other primitive shapes)?

My question is why Databases are not used with Drawing, 3D Modelling, 3D Design, Game Engines and architecture etc. software to save the current state of the images or the stuff that is present on the screen or is the part of a project in a Database.
One obvious answer is the speed, the speed of retrieving or saving all the millions of triangles or points forming the geometry is very low, as there would be hundreds or thousands of queries per second, but is it really the cause? Considering the apparent advantages of using databases can allow sharing the design live over a network when it is being saved at a common location, and more than one people can work on it at a time or can use can give live feedback when something is being designed when it is being shared, specially when time based update is used, such as update after every 5 or 10 seconds, which is not as good as live synchronization, but should be quick enough. What is the basic problem in the use of Databases in this type of software that caused them not to be used this way, or new algorithms or techniques not being developed or studied for optimizing the benefits of using them this way? And why this idea wasn't researched a lot.

The short answer is essentially speed. The speed of writing information to a disk drive is an order of magnitude slower than writing it to ram. The speed of network access is in turn an order of magnitude slower than writing or reading a hard disk. Live sharing apps like the one you describe are indeed possible though, but wouldn't necessarily require what you would call a "database", nor would using a database be such a great idea. The reason more don't exist is that they are actually fiendishly difficult to program. Programming by itself is difficult enough, even just thinking in a straight line, with a single narrative. But to program something like that requires you to be able to accurately visualise multiple parallel dimensions acting on the same data simultaneously, without breaking anything. This is actually difficult.

Your obvious answer is correct; I'm not an expert in that particular field but I'm at a point that even from a distance you can see that's (probably) the main reason.
This doesn't negate the possibility that such systems will use a database at some point.
Considering the apparent advantages of using databases can allow
sharing the design live over a network when it is being saved at a
common location...
True, but just because the information is being passed around doesn't mean that you have to store it in a database in order to be able to do that. Once something is in memory you can do anything with it - the issue with not persisting stuff is that you will lose data if the server fails / stops / etc.
I'll be interested to see if anyone has a more enlightened answer.

Interesting discussion... I have a biased view towards avoiding adding too much "structure" (as in in the indexes, tables, etc. in a DBMS solution) to spacial problems or challenges beyond those that cannot be readily recreated from a much smaller subset of data. I think if the original poster would look at the problem from what is truly needed to answer the need/develop solutions ... he may not need to query a DBMS nearly as often... so the DBMS in question might relate to a 3D space but in fact be a small subset of total values in that space... in a similar way that digital displays do not concern themselves with every chnage in pixel (or voxel) value but only those that have changed. Since most 3D problems would have a much lower "refresh" rate vs video displays the hits on the DBMS would be infrequent enough to possibly make a 3D DB look "live".

How to share data across an organization

What are some good ways for an organization to share key data across many deparments and applications?
To give an example, let's say there is one primary application and database to manage customer data. There are ten other applications and databases in the organization that read that data and relate it to their own data. Currently this data sharing is done through a mixture of database (DB) links, materialized views, triggers, staging tables, re-keying information, web services, etc.
Are there any other good approaches for sharing data? And, how do your approaches compare to the ones above with respect to concerns like:
duplicate data
error prone data synchronization processes
tight vs. loose coupling (reducing dependencies/fragility/test coordination)
architectural simplification
security
performance
well-defined interfaces
other relevant concerns?
Keep in mind that the shared customer data is used in many ways, from simple, single record queries to complex, multi-predicate, multi-sort, joins with other organization data stored in different databases.
Thanks for your suggestions and advice...

I'm sure you saw this coming, "It Depends".
It depends on everything. And the solution to sharing Customer data for department A may be completely different for sharing Customer data with department B.
My favorite concept that has risen up over the years is the concept of "Eventual Consistency". The term came from Amazon talking about distributed systems.
The premise is that while the state of data across a distributed enterprise may not be perfectly consistent now, it "eventually" will be.
For example, when a customer record gets updated on system A, system B's customer data is now stale and not matching. But, "eventually", the record from A will be sent to B through some process. So, eventually, the two instances will match.
When you work with a single system, you don't have "EC", rather you have instant updates, a single "source of truth", and, typically, a locking mechanism to handle race conditions and conflicts.
The more able your operations are able to work with "EC" data, the easier it is to separate these systems. A simple example is a Data Warehouse used by sales. They use the DW to run their daily reports, but they don't run their reports until the early morning, and they always look at "yesterdays" (or earlier) data. So there's no real time need for the DW to be perfectly consistent with the daily operations system. It's perfectly acceptable for a process to run at, say, close of business and move over the days transactions and activities en masse in a large, single update operation.
You can see how this requirement can solve a lot of issues. There's no contention for the transactional data, no worries that some reports data is going to change in the middle of accumulating the statistic because the report made two separate queries to the live database. No need to for the high detail chatter to suck up network and cpu processing, etc. during the day.
Now, that's an extreme, simplified, and very coarse example of EC.
But consider a large system like Google. As a consumer of Search, we have no idea when or how long it takes for a search result that Google harvests to how up on a search page. 1ms? 1s? 10s? 10hrs? It's easy to imaging how if you're hitting Googles West Coast servers, you may very well get a different search result than if you hit their East Coast servers. At no point are these two instances completely consistent. But by large measure, they are mostly consistent. And for their use case, their consumers aren't really affected by the lag and delay.
Consider email. A wants to send message to B, but in the process the message is routed through system C, D, and E. Each system accepts the message, assume complete responsibility for it, and then hands it off to another. The sender sees the email go on its way. The receiver doesn't really miss it because they don't necessarily know its coming. So, there is a big window of time that it can take for that message to move through the system without anyone concerned knowing or caring about how fast it is.
On the other hand, A could have been on the phone with B. "I just sent it, did you get it yet? Now? Now? Get it now?"
Thus, there is some kind of underlying, implied level of performance and response. In the end, "eventually", A's outbox matches B inbox.
These delays, the acceptance of stale data, whether its a day old or 1-5s old, are what control the ultimate coupling of your systems. The looser this requirement, the looser the coupling, and the more flexibility you have at your disposal in terms of design.
This is true down to the cores in your CPU. Modern, multi core, multi-threaded applications running on the same system, can have different views of the "same" data, only microseconds out of date. If your code can work correctly with data potentially inconsistent with each other, then happy day, it zips along. If not you need to pay special attention to ensure your data is completely consistent, using techniques like volatile memory qualifies, or locking constructs, etc. All of which, in their way, cost performance.
So, this is the base consideration. All of the other decisions start here. Answering this can tell you how to partition applications across machines, what resources are shared, and how they are shared. What protocols and techniques are available to move the data, and how much it will cost in terms of processing to perform the transfer. Replication, load balancing, data shares, etc. etc. All based on this concept.
Edit, in response to first comment.
Correct, exactly. The game here, for example, if B can't change customer data, then what is the harm with changed customer data? Can you "risk" it being out of date for a short time? Perhaps your customer data comes in slowly enough that you can replicate it from A to B immediately. Say the change is put on a queue that, because of low volume, gets picked up readily (< 1s), but even still it would be "out of transaction" with the original change, and so there's a small window where A would have data that B does not.
Now the mind really starts spinning. What happens during that 1s of "lag", whats the worst possible scenario. And can you engineer around it? If you can engineer around a 1s lag, you may be able to engineer around a 5s, 1m, or even longer lag. How much of the customer data do you actually use on B? Maybe B is a system designed to facilitate order picking from inventory. Hard to imagine anything more being necessary than simply a Customer ID and perhaps a name. Just something to grossly identify who the order is for while it's being assembled.
The picking system doesn't necessarily need to print out all of the customer information until the very end of the picking process, and by then the order may have moved on to another system that perhaps is more current with, especially, shipping information, so in the end the picking system doesn't need hardly any customer data at all. In fact, you could EMBED and denormalize the customer information within the picking order, so there's no need or expectation of synchronizing later. As long as the Customer ID is correct (which will never change anyway) and the name (which changes so rarely it's not worth discussing), that's the only real reference you need, and all of your pick slips are perfectly accurate at the time of creation.
The trick is the mindset, of breaking the systems up and focusing on the essential data that's necessary for the task. Data you don't need doesn't need to be replicated or synchronized. Folks chafe at things like denormalization and data reduction, especially when they're from the relational data modeling world. And with good reason, it should be considered with caution. But once you go distributed, you have implicitly denormalized. Heck, you're copying it wholesale now. So, you may as well be smarter about it.
All this can mitigated through solid procedures and thorough understanding of workflow. Identify the risks and work up policy and procedures to handle them.
But the hard part is breaking the chain to the central DB at the beginning, and instructing folks that they can't "have it all" like they may expect when you have a single, central, perfect store of information.

This is definitely not a comprehensive reply. Sorry, for my long post and I hope it adds to thoughts that would be presented here.
I have a few observations on some of the aspect that you mentioned.
duplicate data
It has been my experience that this is usually a side effect of departmentalization or specialization. A department pioneers collection of certain data that is seen as useful by other specialized groups. Since they don't have unique access to this data as it is intermingled with other data collection, in order to utilize it, they too start collecting / storing the data, inherently making it duplicate. This issue never goes away and just like there is a continuos effort in refactoring code and removing duplication, there is a need to continuously bring duplicate data for centralized access, storage and modification.
well-defined interfaces
Most interfaces are defined with good intention keeping other constraints in mind. However, we simply have a habit of growing out of the constraints placed by previously defined interfaces. Again a case for continuos refactoring.
tight coupling vs loose coupling
If any thing, most software is plagued by this issue. The tight coupling is usually a result of expedient solution given the constraint of time we face. Loose coupling incurs a certain degree of complexity which we dislike when we want to get things done. The web services mantra has been going rounds for a number of years and I am yet to see a good example of solution that completely alleviates the point
architectural simplification
To me this is the key to fighting all the issues you have mentioned in your question. SIP vs H.323 VoIP story comes into my mind. SIP is very simplified, easy to build while H.323 like a typical telecom standard tried to envisage every issue on the planet about VoIP and provide a solution for it. End result, SIP grew much more quickly. It is a pain to be H.323 compliant solution. In fact, H.323 compliance is a mega buck industry.
On a few architectural fads that I have grown up to.
Over years, I have started to like REST architecture for it's simplicity. It provides a simple unique access to data and easy to build applications around it. I have seen enterprise solution suffer more from duplication, isolation and access of data than any other issue like performance etc. REST to me provides a panacea to some of those ills.

To solve a number of those issues, I like the concept of central "Data Hubs". A Data Hub represents a "single source of truth" for a particular entity, but only stores IDs, no information like names etc. In fact, it only stores ID maps - for example, these map the Customer ID in system A, to the Client Number from system B, and to the Customer Number in system C. Interfaces between the systems use the hub to know how to relate information in one system to the other.
It's like a central translation; instead of having to write specific code for mapping from A->B, A->C, and B->C, with its attendance exponential increase as you add more systems, you only need to convert to/from the hub: A->Hub, B->Hub, C->Hub, D->Hub, etc.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight