I've got a use case where I wouldn't need the full-blown feature set of a graph DB. I would already receive a network (connected nodes and edges) from a data source, and all I'd have to do is apply a few graph operations to it (e.g. shortest path between two nodes, discovering relationships, etc.). The network size isn't huge (max 200 nodes). I was wondering: can I use an embedded TinkerPop for this? Anything I should watch out for? This is a production application and will be used in a highly multithreaded context.
When you say "embedded TinkerPop" I assume that you mean the in-memory graph database it provides, called "TinkerGraph". If that is correct, then I would say it can be used for the kind of use case you describe. If you have a small graph that fits in memory and you don't need any bells and whistles, it should be a good fit and likely the best choice of any TinkerPop-enabled graph.
The only thing you should be aware of is that TinkerGraph is not thread-safe for mutations. You don't want to have multiple threads mutating the graph at the same time. For the graph size you describe, it sounds as though it should be cheap to clone copies of the graph to share among threads if you need to.
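For a sense of what that looks like in practice, here is a minimal sketch of a shortest-path query against TinkerGraph, following the repeat()/until()/simplePath() recipe from the TinkerPop docs. The "link" edge label and "name" property are made up for illustration:

```java
import org.apache.tinkerpop.gremlin.process.traversal.Path;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.structure.Vertex;
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph;

import static org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__.hasId;
import static org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__.out;

public class ShortestPathExample {
    public static void main(String[] args) throws Exception {
        // Build a tiny in-memory graph: a -> b -> c
        TinkerGraph graph = TinkerGraph.open();
        Vertex a = graph.addVertex("name", "a");
        Vertex b = graph.addVertex("name", "b");
        Vertex c = graph.addVertex("name", "c");
        a.addEdge("link", b);
        b.addEdge("link", c);

        GraphTraversalSource g = graph.traversal();

        // Walk outward from a, pruning cyclic routes with simplePath(),
        // until c is reached; the first path found is a shortest one.
        Path shortest = g.V(a)
                .repeat(out("link").simplePath())
                .until(hasId(c.id()))
                .path()
                .limit(1)
                .next();
        System.out.println(shortest); // prints the path from a to c

        graph.close();
    }
}
```

Since TinkerGraph is not thread-safe for writes, you would build a graph like this once (or one copy per thread) and only run read traversals like the above concurrently.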
I'd like to measure how much data (in MB) is transferred between 2 Access databases during specific operations.
One of the databases is on a local disk; the other is on the network. I'd like to work out a design solution that minimises data transfer during specific operations (like loading data, appending data, checking data quality) while keeping all the needed data in the 'mother' DB, e.g. where to put data marts and how to synchronise them. Being able to measure data transfer in different scenarios will help me find the design with the best performance.
Is there a tool that provides this?
Access doesn't really include built-in instrumentation for the information you want. Look into the ISAMStats Method to see whether you can use the information it gives you as a reasonable approximation.
That sample chapter contains a wealth of other information regarding application optimization, so since your goal is to maximise performance, you may find it useful even if ISAMStats is not.
I would think that Wireshark would be able to capture the information you need. As described in Wikipedia:
Wireshark is a free and open-source packet analyzer. It is used for network troubleshooting, analysis, software and communications protocol development, and education.
I have used Wireshark (and its predecessor, Ethereal) for many years. It's one of those tools that I don't need to use often, but it can be a godsend when I do need to use it.
I want to start a big CakePHP project where performance will be an issue. I will have a users table with the acts-as-tree behavior and a lot of financial data related to the users. This application will produce many dynamic reports aggregating data for different tree nodes, etc.
Since there is an easy-to-use library on GitHub that sets a model's data source to Redis, I was wondering if it's a good idea to use it for the entire app. Does anyone have experience with it, and what could the potential problems be if I decide to depend on Redis as the main/only data storage?
EDIT: I have installed Redis and tried to use RedisModel for two models with a simple HasMany/BelongsTo relation. When I tried to use those models like standard AppModels, it simply wouldn't work (Redis Error: Missing key). Apparently you can't use Model->find, Model->save, etc. in the standard way; you have to use Redis methods instead (setKeyValue etc.). This means that pagination and other CakePHP features will also not work. So maybe it is not the best idea to use RedisModel for all my models...
I cannot speak for CakePHP specifically, but I'll talk about Redis in general and the points of your question in particular; it should be applicable to your framework of choice in the end. Let's see:
You mention you want to start an application where performance will be an issue. I just wanted to note that you should be careful with the assumption that you will need a NoSQL solution, because this is hard to assess beforehand. Redis is hella fast, but MySQL, for instance, has been proven capable of handling millions of records and operations just fine, provided it's properly configured and used, and it's much simpler if you need lots of relational structures.
Concerning Redis as the main and only data store:
Redis is perfectly stable for the job. Instagram reportedly stored 300 million key-value pairs pseudo-sharded using hashes to great effect, and while it's not the only data storage system they use, it goes to show Redis is pretty reliable. This very site (Stack Overflow) also uses Redis extensively for caching purposes.
Redis is also reported to have excellent continuous uptime on average (which shouldn't be surprising, considering the point above).
Options exist to mitigate downtime issues: replication is supported to some extent, and Redis Cluster is coming soon to support proper distributed approaches.
The main problem you could face is not properly understanding how its persistence works. You should absolutely read this and this article before you get started, because this point is important. In a nutshell, Redis does not write changes to disk immediately, which means that depending on your configuration, a crash can cause data loss ranging from a few seconds to several minutes since the last disk write. This might or might not be a problem depending on your use case; if the data is extremely sensitive (i.e. financial records) you might want to think twice before jumping to Redis, or build a system where Redis is not used exclusively but rather combined with another storage system.
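For reference, persistence is controlled by a handful of redis.conf directives; the values below are just illustrative defaults, not recommendations:

```
# RDB snapshotting: write a snapshot if at least 1 key changed in 900 s,
# 10 keys in 300 s, or 10000 keys in 60 s. Writes made after the last
# snapshot are lost if the server crashes.
save 900 1
save 300 10
save 60 10000

# AOF: additionally log every write command and fsync once per second,
# which bounds the loss window to roughly one second.
appendonly yes
appendfsync everysec
```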
Relational structures in a non-relational data store like Redis mean doing more work and often duplicating/denormalizing data. It can be done, but it's something to consider. In your question you mention you'll need to aggregate data to generate dynamic reports; are you sure you want to use Redis for this? It sounds like a relational database would give you far more flexibility at a very small performance cost. If you know in advance you'll need to run complex queries over your data, it could be a good idea not to reinvent the wheel unless you absolutely need to.
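To make the duplication point concrete, here is a rough Java sketch using the Jedis client. The key names and the per-node "report" rollup are invented for illustration; the point is that with no JOINs or GROUP BY, aggregates must be pre-computed and duplicated at write time:

```java
import redis.clients.jedis.Jedis;

public class DenormalizedReportSketch {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // A user record stored as a hash (no foreign keys available).
            jedis.hset("user:42", "name", "alice");
            jedis.hset("user:42", "parent", "user:7");

            // Because Redis cannot aggregate across keys at query time,
            // each financial entry must also update a pre-computed rollup
            // for every ancestor tree node (the same number, duplicated).
            jedis.rpush("entries:user:42", "amount=100.00");
            jedis.incrByFloat("report:node:42:total", 100.00);
            jedis.incrByFloat("report:node:7:total", 100.00);
        }
    }
}
```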
My advice here would be to first get a better feel for what Redis is and how it works, and potentially build your own models instead of relying on others' to better understand what can and cannot be done, and from there assess where you want to take it. Redis is reliable enough to be used standalone, but at the end of the day the smart thing is to use the right tool for the right job, and you might find some parts of your app work well with Redis while others are better left to a more traditional storage system.
In the context of designing a social network using a graph data structure, where you can perform a BFS to find a connection from one person to another, I have some questions about it.
If there are a million users, the topology would be much more complicated and interconnected than the graphs we normally design, and I am trying to understand how you could solve these problems.
In the real world, servers fail. How does this affect you?
How could you take advantage of caching?
Do you search until the end of the graph (infinite)? How do you decide when to give up?
In real life, some people have more friends of friends than others, and are therefore more likely to make a path between you and someone else. How could you use this data to pick where you start the traversal?
Your question seems interesting and curious :)
1) Well... of course, data is stored on disks, not in RAM.
Disks have mechanisms to survive failure, RAID-5 for example.
Redundancy is the key: if one system fails, there is another system ready to take its place.
There is also redundancy combined with workload sharing: two computers work in parallel and share their jobs, but if one stops, the other takes the full workload.
At places like Google or Facebook the redundancy is not 2, it's 1200000000 :)
Consider also that the data is not in a single server farm; Google has several datacenters connected together, so if one building explodes, another one takes its place, for example.
2) Not an easy question at all, but usually these systems have big caches in front of the disk arrays too, so reading and writing data on disk is faster than on our laptops :)
Data can be processed in parallel by several concurrent systems, and this is key to the speed of services like Facebook.
3) The graph is not infinite, so exploring it fully is indeed possible with current technology.
The computational complexity of exploring all connections and all nodes in a graph is O(n + m), where n is the number of vertices and m the number of edges.
This means it is linear in the number of registered users and in the number of connections between users. And RAM these days is very cheap.
Since growth is linear, it is easy to add resources when needed.
Add more computers as you get richer :)
Consider also that no one performs a real search across every node; everything on Facebook is quite "local". You can view the direct friends of one person, not the friends of friends of friends... it would not be useful.
Getting the number of vertices directly connected to a vertex, if the data structure is well designed, is very easy and fast. In SQL it would be a simple SELECT, and if the tables are well indexed it will be very fast and also not very dependent on the total number of users (see the concept of hash tables).
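On the "when do you give up?" question specifically, the usual answer is to bound the search depth rather than exhaust the graph. A minimal sketch, assuming the graph is held as an in-memory adjacency list keyed by user id (the data layout is invented for illustration):

```java
import java.util.*;

public class LimitedBfs {
    // Returns a path from start to target of length <= maxDepth,
    // or null if none exists within that bound. O(n + m) worst case.
    static List<Integer> findPath(Map<Integer, List<Integer>> adj,
                                  int start, int target, int maxDepth) {
        Map<Integer, Integer> parent = new HashMap<>();
        Map<Integer, Integer> depth = new HashMap<>();
        Deque<Integer> queue = new ArrayDeque<>();
        queue.add(start);
        depth.put(start, 0);

        while (!queue.isEmpty()) {
            int v = queue.poll();
            if (v == target) {
                // Rebuild the path by walking parent pointers back to start.
                LinkedList<Integer> path = new LinkedList<>();
                for (Integer cur = v; cur != null; cur = parent.get(cur))
                    path.addFirst(cur);
                return path;
            }
            if (depth.get(v) == maxDepth) continue; // give up past the bound
            for (int w : adj.getOrDefault(v, List.of())) {
                if (!depth.containsKey(w)) {       // not visited yet
                    depth.put(w, depth.get(v) + 1);
                    parent.put(w, v);
                    queue.add(w);
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> adj = Map.of(
                1, List.of(2, 3),
                2, List.of(4),
                3, List.of(4),
                4, List.of());
        System.out.println(findPath(adj, 1, 4, 6)); // prints [1, 2, 4]
    }
}
```

A depth of 6 is a natural cutoff in a social graph ("six degrees of separation"); past that, a path is rarely worth reporting anyway.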
I am working to develop an application that needs data distributed across countries. Content will be supplied "per region", but needs to be easy to copy to another region. On top of this, I have general information that needs to be shared and synchronized across the databases.
The organisation I work for is considering implementing this system themselves, but it feels like there should be some good solutions out there already (I am open to cloud solutions; the less my company needs to manage, the better).
This might be a vague question, but I think it is possible to answer it well.
What are my options when developing this kind of distributed data system?
Update:
Should have elaborated (but I'm not sure how much I can say given the NDA). Suffice it to say, I have "content" which I need stored on some space (files). I need metadata about the content stored and distributed over several nodes (which might be hosted by us or someone else) to allow fast-paced communication and regionalized differences in the data. I need to control HOW data is replicated between nodes, but preferably in a standards-compliant way (and preferably not written by us).
You can try CouchDB. Its offline replication model sounds like a good fit for a geo-distributed system.
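As an illustration of how little ceremony that replication takes: it's one HTTP POST to CouchDB's _replicate endpoint. A sketch using Java's built-in HTTP client, where the host name and database name are placeholders:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CouchReplicateSketch {
    public static void main(String[] args) throws Exception {
        // Ask the local CouchDB to continuously push the "content" database
        // to a (hypothetical) EU node; replication resumes after outages.
        String body = "{\"source\": \"content\", "
                    + "\"target\": \"http://eu-node.example.com:5984/content\", "
                    + "\"continuous\": true}";
        HttpRequest req = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:5984/_replicate"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> resp = HttpClient.newHttpClient()
                .send(req, HttpResponse.BodyHandlers.ofString());
        System.out.println(resp.body());
    }
}
```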
Interesting question - but it would really help to get more context.
You talk about "data", which usually means something with a fairly well-defined structure, often implemented in a relational database.
You also talk about "content", which usually means something with a (much) less well-defined structure, often implemented as a document of some type. Many solutions exist for structuring "documents", e.g. file systems or web sites.
Assuming we are talking about structured data, the simplest thing to do is have a single repository, accessible everywhere. Have a look at "cloud" offerings; Amazon's a good bet. Creating your own global data repository is a significant undertaking, but if you're dealing with highly confidential data, or have specific performance requirements, it may be the way to go.
If neither of those options work, you're in the world of "enterprise service bus". Google it, but be careful - it's a complex field, and you really want to find someone who knows what they're doing.
Having said that, using an off-the-shelf ESB is many times less painful than building your own distributed data structure.
I know it's years after the question was asked, but I was looking up the answer to the same question and it looks like Cassandra may fit the bill. Once set up, it looks and acts much like other database solutions (tables, materialized views, a SQL-like query language, etc.), but it can also be entirely decentralized. Each instance acts as a node in a cluster of other Cassandra nodes. They synchronize behind the scenes, and if one goes down, the others pick up the slack. This makes Cassandra both highly scalable and highly fault tolerant.
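To illustrate the geo-distribution angle: in Cassandra, replica placement is declared per keyspace. A sketch using the DataStax Java driver, where the keyspace name, datacenter names, and replica counts are made up and would have to match your actual cluster topology:

```java
import com.datastax.oss.driver.api.core.CqlSession;

public class GeoKeyspaceSketch {
    public static void main(String[] args) {
        // With no contact points given, the driver connects to localhost:9042.
        try (CqlSession session = CqlSession.builder().build()) {
            // Keep 3 replicas of the content metadata in each region;
            // "eu_west" and "us_east" are hypothetical datacenter names.
            session.execute(
                "CREATE KEYSPACE IF NOT EXISTS content_meta WITH replication = "
              + "{'class': 'NetworkTopologyStrategy', 'eu_west': 3, 'us_east': 3}");
        }
    }
}
```

With NetworkTopologyStrategy, writes in one region propagate to the others automatically, which covers the "synchronized across databases" requirement without custom replication code.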
I am writing an application which parses a large file, generates a large amount of data, and does some complex visualization with it. Since all this data can't be kept in memory, I did some research, and I'm starting to consider embedded databases as a temporary container for this data.
My question is: is this a traditional way of solving this problem? And is an embedded database (other than structuring data) supposed to manage data by keeping in memory only a subset (like a cache), while the rest is kept on disk? Thank you.
Edit: to clarify: I am writing a desktop application. The application will be given a file hundreds of MB in size. After reading the file, the application will generate a large number of graphs to be visualized. Since the graphs may have a very large number of nodes, they may not fit into memory. Should I save them to an embedded database that will take care of keeping only the relevant data in memory? (Do embedded databases do that?) Or should I write my own sophisticated module that does that?
Tough question - but I'll share my experience and let you decide if it helps.
If you need to retain the output from processing the source file, and you use that to produce multiple views of the derived data, then you might consider using an embedded database. The reasons to use an embedded database (IMHO):
To take advantage of RDBMS features (ACID, relationships, foreign keys, constraints, triggers, aggregation...)
To make it easier to export the data in a flexible manner
To enable access to your processed data to external clients (known format)
To allow more flexible transformation of the data when preparing for viewing
Factors which you should consider when making the decision:
What is the target platform(s) (Windows, Linux, Android, iPhone, PDA)?
What technology base? (Java, .Net, C, C++, ...)
What resource constraints are expected or need to be designed for? (RAM, CPU, HD space)
What operational behaviours do you need to take into account (connected to network, disconnected)?
On the typical modern desktop there is enough spare capacity to handle most operations. On eeePCs, PDAs, and other portable devices, maybe not. On embedded devices, very likely not. The language you use may have built-in features to help with memory management; maybe you can take advantage of those. The connectivity aspect (stateful / stateless / etc.) may impact how much you really need to keep in memory at any given point.
If you are dealing with really big files, then you might consider a streaming approach, so that you only have a small portion of the overall data in memory at a time. That doesn't really mean you should (or shouldn't) use an embedded database, though; straight text or binary files could work just as well (record-based, column-based, line-based... whatever).
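A tiny sketch of that streaming idea in Java, assuming a line-oriented input format (the parse and aggregate steps are placeholders for whatever your file actually contains):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class StreamingParseSketch {
    public static void main(String[] args) throws IOException {
        // Files.lines reads lazily, so only the current line (plus a small
        // read buffer) is in memory, not the whole multi-hundred-MB file.
        try (Stream<String> lines = Files.lines(Path.of("input.dat"))) {
            lines.filter(line -> !line.isBlank())
                 .map(StreamingParseSketch::parseRecord)
                 .forEach(StreamingParseSketch::updateAggregates);
        }
    }

    // Placeholders: substitute your real record parsing and accumulation.
    static String parseRecord(String line) { return line; }
    static void updateAggregates(String record) { }
}
```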
Some databases will allow you more effective ways to interact with the data once it is stored - it depends on the engine. I find that if you have a lot of aggregation required in your base files (by which I mean the files you generate initially from the original source) then an RDBMS engine can be very helpful to simplify your logic. Other options include building your base transform and then adding additional steps to process that into other temporary stores for each specific view, which are then in turn processed for rendering to the target (report?) format.
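And if you do reach for an embedded engine, this is roughly what it looks like, using H2 as one example (the h2 jar on the classpath is the only dependency; the node table layout is invented). The engine keeps its own page cache in RAM and leaves the rest on disk, which is exactly the "only a subset in memory" behavior the question asks about:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class EmbeddedDbSketch {
    public static void main(String[] args) throws Exception {
        // H2 in embedded mode: data lives in ./graphdata.mv.db on disk,
        // and the engine decides which pages stay cached in memory.
        try (Connection conn = DriverManager.getConnection("jdbc:h2:./graphdata");
             Statement st = conn.createStatement()) {
            st.execute("CREATE TABLE IF NOT EXISTS node("
                     + "id BIGINT PRIMARY KEY, graph_id INT, payload VARCHAR)");
            st.execute("INSERT INTO node VALUES (1, 7, 'a'), (2, 7, 'b')");

            // Pull back only the slice needed for the current view.
            try (ResultSet rs = st.executeQuery(
                    "SELECT id, payload FROM node WHERE graph_id = 7")) {
                while (rs.next()) {
                    System.out.println(rs.getLong("id") + " "
                                     + rs.getString("payload"));
                }
            }
        }
    }
}
```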
Just a stream-of-consciousness response - hope that helps a little.
Edit:
Per your further clarification, I'm not sure an embedded database is the direction you want to take. You either need to make some simplifying assumptions for rendering your graphs, or investigate methods like segmentation (render sections of the graph, then cache the output before rendering the next section).