Node Hash in ArangoDB? - graph-databases

I'm using ArangoDB for graph versioning and am looking for a faster method to evaluate whether a node is the same in two different collections.
Apart from hashing each node before I write it, does ArangoDB have any mechanism that lets me read a hash of the node?
I usually access the database with python-arango.
If hashing it myself is the only viable option, what would be a reasonable hash function for these kinds of documents in a graph database? _id should not be included, as the same node in two different collections would still differ. _rev would not really matter, and I am not sure whether _key is in fact required, since the node is identified by it anyway.

You need to make your own hash algo to do this.
The issue is that the values of a document that make it unique (and thus build the hash) are user-specific, so you need to compute that hash value externally and save it with every document.
To confirm uniqueness, you can do that via a Foxx Microservice or in your AQL query, throwing an error if multiple nodes are ever found with the same hash.
If you want to enforce uniqueness on inserts, then you'll need to build that logic externally.
You then have the option of trusting your uniqueness or setting up a Foxx Microservice that would scour the collections in scope to ensure no other document had the same hash value.
The performance of querying many other collections would be poor, so an alternative is to set up a Foxx Queue that accepts document updates, with a Foxx service performing the INSERT/UPDATE commands from the queue. That way you don't slow down your client application, and the data is eventually written to ArangoDB as quickly as possible.
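As a rough illustration of computing such a hash externally with python-arango (the database name, collection name, and field set below are placeholders, and SHA-256 over canonical JSON is just one reasonable choice):

```python
import hashlib
import json

from arango import ArangoClient  # python-arango

def node_hash(doc):
    # Hash only the user-defined fields, ignoring _id, _key and _rev.
    payload = {k: v for k, v in doc.items() if not k.startswith("_")}
    # Canonical JSON (sorted keys) so equal content always yields the same hash.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Connection details and the collection name are made up for the example.
db = ArangoClient().db("mydb", username="root", password="secret")
nodes = db.collection("nodes")

doc = {"name": "example", "weight": 42}
doc["hash"] = node_hash(doc)
nodes.insert(doc)

# Duplicate check in AQL: count documents sharing the same hash.
cursor = db.aql.execute(
    "FOR n IN nodes FILTER n.hash == @h COLLECT WITH COUNT INTO c RETURN c",
    bind_vars={"h": doc["hash"]},
)
if next(cursor) > 1:
    raise ValueError("duplicate hash found")
```

An index on the hash field would keep that lookup cheap; cross-collection uniqueness still has to be enforced by your own logic or a Foxx service, as described above.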

Related

Best approach for caching lists of objects in memcache

Our Google AppEngine Java app involves caching recent users that have requested information from the server.
The current working solution is that we store the users information in a list, which is then cached.
When we need a recent user we simply grab one from this list.
The list of recent users is not vital to our app working, and if it's dropped from the cache it's simply rebuilt as users continue to request from the server.
What I want to know is: Can I do this a better way?
With the current approach there is only a certain number of users we can store before the list gets too large for memcache (we are currently limiting the list to 1000 and dropping the oldest when we insert a new one). Also, the list needs updating very quickly, which involves retrieving the full list from memcache just to add a single user.
Having each user stored in cache separately would be beneficial to us as we require the recent user to expire after 30 minutes. At the moment this is a manual task we do to make sure the list does not include expired users.
What is the best approach for this scenario? If it's storing the users separately in cache, what's the best approach to keeping track of the user so we can retrieve it?
You could keep in the memcache list just "pointers" that you can use to build individual memcache keys to access user entities separately stored in memcache. This makes the list's memcache size footprint a lot smaller and easy to deal with.
If the user entities have parents then pointers would have to be their keys, which are unique so they can be used as memcache keys as well (well, their urlsafe versions if needed).
But if the user entities don't have parents (i.e. they're root entities in their entity groups) then you can use their datastore key IDs as pointers - typically shorter than the keys. Better yet, if the IDs are numerical you can even store them as numbers, not as strings. The IDs are unique for those entities, but they might not be unique enough to serve as memcache keys, so you may need to add a prefix/suffix to make the respective memcache keys unique (to your app).
When you need a user entity's data, you first obtain the "pointer" from the list, build the user entity's memcache key and retrieve the entity with that key.
This, of course, assumes you do have reasons to keep that list in place. If the list itself is not mandatory all you need is just the recipe to obtain the (unique) memcache keys for each of your entities.
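A minimal sketch of that pointer-list approach on App Engine (shown in Python for brevity, though the question mentions Java; RecentUser, the key prefix, and the 30-minute expiry are assumptions):

```python
from google.appengine.api import memcache
from google.appengine.ext import ndb

class RecentUser(ndb.Model):  # hypothetical root entity
    name = ndb.StringProperty()

LIST_KEY = "recent_user_ids"      # the list now holds only small pointers
USER_KEY_PREFIX = "recent_user:"  # prefix makes the per-user memcache keys unique

def remember_user(user):
    # Cache each entity separately with its own 30-minute expiry.
    memcache.set(USER_KEY_PREFIX + str(user.key.id()), user, time=30 * 60)
    ids = memcache.get(LIST_KEY) or []
    ids = ([user.key.id()] + [i for i in ids if i != user.key.id()])[:1000]
    memcache.set(LIST_KEY, ids)

def get_recent_users():
    ids = memcache.get(LIST_KEY) or []
    found = memcache.get_multi([USER_KEY_PREFIX + str(i) for i in ids])
    # Users whose individual entries have expired simply drop out of the result.
    return list(found.values())
```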
If you use NDB caching, it will take care of memcache for you. Simply request the users with the key using ndb.Key(Model, id).get() or Model.get_by_id(id), where id is the User id.

Riak backend choice: bitcask vs leveldb

I'm planning to use Riak as a backend for a service that stores user session data. The main key used to retrieve data (binary blob) is named UUID and actually is a uuid, but sometimes the data might be retrieved using one or two other keys (e.g. user's email).
The natural option would be to pick the leveldb backend, with the possibility of using secondary indexes for such a scenario, but since secondary-index searches are not very common (around 10% - 20% of lookups), I was wondering if it wouldn't be better to have a separate "indexes" bucket where such a mapping email->uuid would be stored.
In that scenario, when looking something up via a "secondary" index, I would first look up the uuid in the "indexes" bucket and then read the data normally using the primary key.
Knowing that bitcask is much more predictable when it comes to latency, and possibly faster, would you recommend such a design, or shall I stick to leveldb and secondary indexes?
I think both scenarios would work. One way to choose between them is whether you need expiration. I guess you'll want to have expiration for user sessions. If that's the case, then I would go with the second scenario, as bitcask offers a very good, fully customizable expiration feature.
If you go that path, you'll have to clean up the metadata bucket (in eleveldb) that you use for secondary indexes. That can be done easily by also having an index on the last modification time of the metadata keys. Then you run a batch that does a 2i query to fetch old metadata and delete it. Make sure you use the latest Riak, which supports aggressive deletion and reclaiming of disk space in leveldb.
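Roughly, that cleanup batch could look like this with the Python riak client (the bucket name, index name and retention period are illustrative):

```python
import time

import riak  # Basho's Python client

client = riak.RiakClient()
meta = client.bucket("session_meta")  # leveldb-backed metadata bucket (name made up)

def save_metadata(email, uuid):
    obj = meta.new(email, data={"uuid": uuid})
    # Secondary index on the last modification time, used later for cleanup.
    obj.add_index("lastmod_int", int(time.time()))
    obj.store()

def cleanup(older_than=30 * 24 * 3600):
    cutoff = int(time.time()) - older_than
    # 2i range query: every metadata key modified before the cutoff.
    for key in meta.get_index("lastmod_int", 0, cutoff):
        meta.delete(key)
```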
That said, maybe you can have everything in bitcask and avoid secondary indexes altogether. Consider this data design (sketched in code at the end of this answer):
one "data" bucket: keys are uuid, value is the session
one "mapping_email" bucket: keys are email, values are uuid
one "mapping_otherstuff" bucket: same for other properties
This works fine if:
most of the time you let your data expire. That means you have no bookkeeping to do
you don't have too many mappings, as it's cumbersome to add more
you are ready to properly implement a client library that manages the 3 buckets, for instance when creating / updating / deleting values
You could start with that, because it's easier on the administration, bookkeeping, batch-creation (none), and performance (secondary index queries can be expensive).
Then later on if you need, you can add the leveldb route. Make sure you use multi_backend from the start.
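A possible shape of that three-bucket, bitcask-only layout with the Python riak client (bucket names and the session payload are placeholders):

```python
import riak

client = riak.RiakClient()
data = client.bucket("data")               # keys: uuid -> session blob
by_email = client.bucket("mapping_email")  # keys: email -> uuid

def store_session(uuid, email, session):
    data.new(uuid, data=session).store()
    by_email.new(email, data={"uuid": uuid}).store()

def get_by_uuid(uuid):
    return data.get(uuid).data

def get_by_email(email):
    mapping = by_email.get(email).data
    return get_by_uuid(mapping["uuid"]) if mapping else None
```

The client library mentioned above would essentially wrap these buckets so that every create/update/delete keeps the mappings consistent.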

What NoSQL database (categories) support versioning?

I thought that regardless of whether a NoSQL aggregate store is a key-value, column-family or document database, it would support versioning of values. After a bit of Googling, I'm concluding that this assumption is wrong and that it just depends on the DBMS implementation. Is this true?
I know that Cassandra and BigTable support it (both column-family stores). It SEEMS that HBase (column-family) and Riak (key-value) do, but Redis and Hadoop (key-value) do not. Couchbase does but MongoDB does not (document stores). I don't see any pattern here. Is there a rule of thumb? (For example, "key-value stores generally do not have versioning, while column-family and document databases do.")
What I'm trying to do: I want to create a database of website screenshots, mapping URL to PNG image. I'd rather use a key-value store since, versioning aside, it is the simplest solution that satisfies the problem. But when a website changes or is decommissioned and I update my database, I don't want to lose the old images. Even if I select a key-value database that has versioning, I want the luxury of switching to a different key-value database without running into the constraint that many key-value DBs do not support versioning. So I'm trying to understand at what level of sophistication in the continuum of aggregate NoSQL databases versioning becomes a feature implicit to the data model.
You don't really need versioning support from the Key-Value store.
The only thing you really need from the data store is an efficient scanning/range-query feature.
This means the datastore can retrieve entries in lexicographical order.
Most KV-stores do, so this is easy.
This is how you do it:
Create versioned keys.
If you can't hash the original name to a fixed length, prepend the length of the original key. Then put in the hash of the key (or the original key itself), and end with a fixed-length encoded version number, inverted against the maximum version so that keys are lexicographically ordered from high version to low.
Query
Do a range query from the maximum possible version up to version 0, but only retrieving exactly one key.
Done
If you don't need explicit versions, you can also use a timestamp, so you can insert without having to look up the last version first.
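A toy illustration of this key layout and the one-result scan, using a sorted Python list to stand in for a KV store's ordered range query (the hash choice and maximum version are arbitrary):

```python
import bisect
import hashlib

MAX_VERSION = 10 ** 9  # assumed upper bound so versions can be inverted

def versioned_key(name, version):
    # Fixed-length hash of the original key, then the inverted version number,
    # zero-padded so keys sort from newest to oldest lexicographically.
    digest = hashlib.sha1(name.encode("utf-8")).hexdigest()
    return "%s:%010d" % (digest, MAX_VERSION - version)

store = []  # sorted list of (key, value) pairs, standing in for the KV store

def put(name, version, value):
    bisect.insort(store, (versioned_key(name, version), value))

def get_latest(name):
    # Range scan starting at the highest possible version; take the first hit.
    start = versioned_key(name, MAX_VERSION)
    i = bisect.bisect_left(store, (start, ""))
    if i < len(store) and store[i][0].startswith(start[:40]):
        return store[i][1]
    return None

put("page.html", 1, "v1 content")
put("page.html", 2, "v2 content")
print(get_latest("page.html"))  # -> "v2 content"
```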
A really interesting approach to this is the Datomic database. Rather than storing versions, in Datomic there are no updates, only inserts. The entire database is immutable, meaning you can specify on connect the moment in time you want to see the database as of, and the database will appear to contain only the changes made up to that point. Or, to think of it another way, anything inserted into the database can be queried for its history looking backward. You can also branch the database and create data in one branch that isn't in the other (in programming terms, it is like a database based on git, where multiple histories can be created).

Symfony2 Doctrine Use Cache Tables

For the project I'm working on, we have a fully normalized database where no information is redundant.
I'd like to keep this method, but also add "cache" tables, which are essentially tables which have pre-computed information. I'd love to be able to have this information in separate tables (which could then be blown away and regenerated as needed).
For example, part of this involves a forum. One "cached" value would be the number of posts a user has made. There is no need to keep this in any of the normalized tables, because it can be calculated based on a count of posts linked with that user. However, this is a (relatively) expensive call, so the cache table would keep track of this value for me and I can pull from it as needed.
I'm also strongly considering using a NoSQL database like MongoDB for this, because the cached tables would essentially have no joins or foreign keys (making it perfect for MongoDB).
Any ideas how I should approach this using Doctrine in Symfony2? Anyone done this before?
Thanks a ton!
Update
As greg0ire comments, it looks like Doctrine has some built-in caching functionality: http://docs.doctrine-project.org/projects/doctrine-orm/en/latest/reference/caching.html
Does anyone know if I can employ this to cache my values without storing them in the database?
For example, if I had an unmapped property $postCount, can I use Doctrine to cache that value (or I guess, the object with that value populated)?
The only problem with this approach (caching to memory instead of a database) is that we're working in a clustered environment, so I'd either have to build the cache multiple times (once per server the user hits) or get a shared caching server set up (which is a bit tricky).
I'll continue to investigate this route, but does anyone know of any database stored methods?
Thanks.
I think you may be looking for Doctrine's result cache.
Here is the related part of the sf2 configuration.

Regularly updated data and the Search API

I have an application which requires very flexible searching functionality. As part of this, users will need to have the ability to do full-text searching of a number of text fields, but also to filter by a number of numeric fields holding data that is updated on a regular basis (at times more than once or twice a minute). This data is stored in an NDB datastore.
I am currently using the Search API to create document objects and indexes to search the text-data and I am aware that I can also add numeric values to these documents for indexing. However, with the dynamic nature of these numeric fields I would be constantly updating (deleting and recreating) the documents for the search API index. Even if I allowed the search API to use the older data for a period it would still need to be updated a few times a day. To me, this doesn't seem like an efficient way to store this data for searching, particularly given the number of search queries will be considerably less than the number of updates to the data.
Is there an effective way I can deal with this dynamic data that is more efficient than having to be constantly revising the search documents?
My only thought so far is to implement a two-step process in which the results of a full-text search are either used in a query against the NDB datastore or filtered manually in Python. Neither seems ideal, but I'm out of ideas. Thanks in advance for any assistance.
It is true that the Search API's documents can include numeric data, and can easily be updated, but as you say, if you're doing a lot of updates, it could be non-optimal to be modifying the documents so frequently.
One design you might consider would store the numeric data in Datastore entities, but make heavy use of a cache as well, either memcache or a backend in-memory cache. Cross-reference the docs and their associated entities (that is, design the entities to include a field with the associated doc id, and the docs to include a field with the associated entity key). If your application domain is such that the doc id and the datastore entity key name can be the same string, then this is even more straightforward.
Then, in the cache, index the numeric field information by doc id. This would let you efficiently fetch the associated numeric information for the docs retrieved by your queries. You'd of course need to manage the cache on updates to the datastore entities.
This could work well as long as the size of your cache does not need to be prohibitively large.
If your doc id and associated entity key name can be the same string, then I think you may be able to leverage ndb's caching support to do much of this.
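One possible shape of that cross-referencing on the Python runtime (the Item model, index name, and "price" field are invented for the example):

```python
from google.appengine.api import memcache, search
from google.appengine.ext import ndb

INDEX = search.Index(name="items")

class Item(ndb.Model):  # hypothetical entity; its string key name doubles as the doc id
    title = ndb.StringProperty()
    price = ndb.FloatProperty()  # the frequently updated numeric field

def index_item(item):
    # The full-text document holds only the slow-changing text fields.
    INDEX.put(search.Document(
        doc_id=item.key.string_id(),
        fields=[search.TextField(name="title", value=item.title)],
    ))

def update_price(item, price):
    item.price = price
    item.put()
    # Keep the numeric data in the cache, keyed by doc id, instead of
    # rewriting the search document on every change.
    memcache.set("price:" + item.key.string_id(), price)

def search_items(query):
    docs = list(INDEX.search(query))
    prices = memcache.get_multi(["price:" + d.doc_id for d in docs])
    # A cache miss would fall back to fetching the entity from the datastore.
    return [(d.doc_id, prices.get("price:" + d.doc_id)) for d in docs]
```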
