How to cache batches of IDs "locally" in a serverless environment? - database

Traditionally, in a non-serverless environment, I would have the following system. Say I have a custom ID generation protocol for all my models, and 20 servers scattered around. I give each server a slice of IDs to work with from the whole pool. When a server is done, or goes down, it returns its IDs to the system so they don't get wasted. The reason for sending each server a batch of IDs is that it doesn't need to fetch from a central ID server every time a new record is created; instead it has a local set it can work with freely.
How would you do this sort of thing in a serverless system? I am deploying to Vercel and wondering what the appropriate architecture might be for such an ID batching system. There are other use cases for needing a persistent copy of data on a local server, so if you don't like the ID example just imagine another sort of system. How do you solve this optimization problem in a serverless environment?

Serverless is an approach. Like all such things (solutions), it should be matched to the problem - not the other way around. Is this simply a case where serverless is a good solution choice for dealing with 80% of your problem, and all you need to do is choose something appropriate to deal with the other 20%?
Assuming you have the freedom to do this, can't you just have the serverless parts of the solution consume non-serverless services - e.g. an ID Service?
Separately to this, caching comes to mind - just the general idea of having some data close by which might be mastered somewhere else. Caching patterns like Write Behind would allow you to work with local copies (i.e. immediate consumption) whilst farming out the cache-master communication.
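For the ID case specifically, the usual serverless shape of this is block ("hi/lo") allocation: each warm function instance leases a block of IDs from a central counter and hands them out from module scope, so most invocations never call out at all. A minimal sketch, assuming a Node/TypeScript runtime on Vercel, the ioredis client, and a Redis counter key named id:counter (the client choice and key name are my assumptions, not part of the question):

import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");
const BLOCK_SIZE = 100;

// Module-scope state survives across invocations while this instance stays warm,
// so most calls never touch the central counter.
let next = 0;
let max = 0;

export async function nextId(): Promise<number> {
  if (next >= max) {
    // Atomically lease a fresh block of IDs from the shared counter.
    max = await redis.incrby("id:counter", BLOCK_SIZE);
    next = max - BLOCK_SIZE;
  }
  return next++;
}

One difference from the 20-server design: a recycled instance simply abandons the rest of its block rather than returning it, so this trades a few wasted numbers for not needing a "hand the IDs back" step. Pick a block size small enough that the waste doesn't matter.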

Related

Redis database snapshot diffs or other suggested DB for network/resource monitoring

I have a monitoring service that polls a REST API for information about the latest resources (a list of hosts / a list of licenses). The monitoring service caches all this data in a Redis database. Everything works great for discovering new resources.
However, the problem I am facing is when a host drops off the network: I have no way of knowing that the host has disappeared from the list of hosts. The REST API only gives me a way of querying a list of hosts.
One way I can come up with (theoretically) is taking a diff of the RDB at different time intervals. However, this does not seem efficient to me, and honestly I am not sure how I would do it with Redis.
The suggestions I am looking for are frameworks best suited for this kind of operation or, if need be, a different database that is as efficient as Redis yet gives me the functionality I need to take diffs. Time-series databases spring to mind, but I have no experience with them and am not sure how they could be used to solve this problem precisely.
There's no need to resort to anything besides Redis itself - it is robust enough to continue serving your requirements as long as you tell it what to do (like any other software ;)).
The following is an example, but as you didn't specify how you're caching your data, I'll assume for simplicity's sake that you have a key for every host/license in your list where you store some string/binary value, like:
SET acme.org "some cached value"
You have a lot of such keys because the monitoring REST API returns a list, so a common way to keep everything in order is to use another key to store that list for each request returned by the API. You can achieve that with a Set:
SADD request:<timestamp> acme.org foo.bar ...
Sets are particularly useful here because you can perform Set operations (SDIFF and SINTER, or their STORE variants in your case) to keep track of the currently online and dropped hosts. For example:
MULTI
SINTERSTORE online:<timestamp> request:<timestamp> request:<previous-timestamp>
SDIFFSTORE dropped:<timestamp> request:<previous-timestamp> request:<timestamp>
EXEC
Note: as you're caching things, it is good practice to set expiry values (TTL) on all relevant keys and use an appropriate eviction policy.
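If it helps to see that flow from application code, here is a rough sketch using the ioredis Node client (my choice; any client exposing SADD/SDIFF works the same way), with key names mirroring the ones above:

import Redis from "ioredis";

const redis = new Redis();

// Record one polling cycle and report which hosts dropped since the previous one.
async function recordPoll(hosts: string[], ts: number, prevTs: number): Promise<string[]> {
  const current = `request:${ts}`;
  const previous = `request:${prevTs}`;

  // Store the current snapshot as a Set (skip if the API returned nothing).
  if (hosts.length > 0) await redis.sadd(current, ...hosts);
  await redis.expire(current, 24 * 60 * 60); // TTL so old snapshots age out

  // In the previous snapshot but not the current one = dropped off the network.
  return redis.sdiff(previous, current);
}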

Transient filesystem key/value store from Ruby on Heroku?

I need to run a process that deals with a large volume of data. Too large to reasonably work with in RAM. The data does not need to be preserved between process instances or shared between process instances though, so I was hoping to use some kind of disk-based database for the storage.
My first thought was SQLite, but Heroku explicitly does not support that. The second thing I tried was Ruby's PStore, but that turned out to be way too flaky for the job. The next thing I tried was DBM, which Heroku does not explicitly say they don't support. When I tried to run the code after deploying it, however, I got "LoadError: cannot load such file -- dbm".
Given that I do not need data to be persisted between process instances, is there any way to work around Heroku's bias against support for file-based relational or key/value data stores?
I'm not sure what the structure of the data is, but have you considered using Redis? There is an effective free tier on Heroku. The issue with file-based solutions is that Heroku uses an ephemeral filesystem; it's not designed to scale disk space across computing units.
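For what it's worth, a minimal sketch of the Redis route (written in TypeScript to match the other snippets on this page; the Ruby redis gem has equivalent set/get calls). REDIS_URL is the config var the Heroku Redis add-on provides; the key prefix is made up:

import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");

// Park a chunk of working data outside of RAM, with a TTL since it is throwaway.
async function putChunk(id: string, payload: string): Promise<void> {
  await redis.set(`work:${id}`, payload, "EX", 60 * 60);
}

async function getChunk(id: string): Promise<string | null> {
  return redis.get(`work:${id}`);
}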

Is it possible to transfer result set between web services?

I would like to know if it's possible.
Step 1: call web service A, which retrieves data from the database and stores it in a result set (dynamically).
Step 2: web service A calls web service B to process the data stored in the result set.
Is it possible to share result sets, which could be large or small, between web services? If it's not possible, what are the best options?
/SR
There might be more elegant solutions depending on which web services you are specifically referring to, but most of them have built-in environment variables, functions, etc. for parsing the URL string, so you can pass data that way. Or with cookies. Or with XML or SOAP.
I don't see any problem here. As long as one web service can handle data "of large or small size", you can insert another web service in the middle.
However, there might be performance penalties in converting the data to and from XML/JSON/SOAP several times.
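To make the chaining idea concrete, here is a rough sketch of service A handing its result set to service B as JSON, paged so a large result set never travels in one request body. It assumes a runtime with a global fetch (Node 18+ or a browser); the URL and row shape are placeholders:

type Row = Record<string, unknown>;

// Hand the result set from service A to service B a page at a time.
async function forwardResultSet(rows: Row[]): Promise<void> {
  const PAGE = 1000;
  for (let i = 0; i < rows.length; i += PAGE) {
    const res = await fetch("https://service-b.example.com/process", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ offset: i, rows: rows.slice(i, i + PAGE) }),
    });
    if (!res.ok) throw new Error(`Service B rejected the page at offset ${i}: ${res.status}`);
  }
}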
You don't say what your execution environment is -- please do, btw -- but in any environment I've heard of or worked in, there are several ways to do what you want to do. Leaving aside standalone programs, interprocess communication is (I claim, without thinking too much) the most common requirement on any OS.
-- b

Distributed Key-Value Data Store with Offline Access (Static Partitioning)

I need to be able to set up server(s) that replicate all information, as a master data store that has all the data.
I also need servers that store/replicate only certain data, available on local LANs, so that when the internet connection goes down, clients can still access their local data. Under normal circumstances, the clients will access most of their data from the local LAN server, and may use others when the local LAN server goes down.
I want this alongside the benefits of a distributed data store, such as failure resistance and speed.
Which Distributed Key-Value Data Store or other data storage method would be most suited for this?
Try out CouchDB. Your use case reads like it was built for it. Granted, CouchDB is much more than a key/value store, but it is no less suitable for one.
Add replication and, as a bonus, you get fault tolerance, conflict detection (and resolution), and an easy API (HTTP).
Let me know if you have any other questions.
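To make the replication suggestion concrete, this is roughly what kicking off continuous replication to a LAN node looks like against CouchDB's _replicate endpoint; the hostnames and database name are placeholders:

// Ask the LAN node to continuously pull the master database.
async function replicateToLan(): Promise<void> {
  const res = await fetch("http://lan-node.local:5984/_replicate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      source: "http://master.example.com:5984/appdata",
      target: "appdata",   // the local copy clients on the LAN read from
      continuous: true,    // keep pulling changes while the link is up
      create_target: true,
    }),
  });
  if (!res.ok) throw new Error(`Replication request failed: ${res.status}`);
}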
Of course you must remember that replication is something completely different from backup, because one system's programmatic failure in handling the data can quickly replicate to other nodes resulting in total mayhem.
Maybe the Hadoop Distributed File System (HDFS) or OpenAFS would be a good solution here?
I haven't used either of those systems in real-life scenarios; I only looked into them during my research on peer-to-peer and distributed storage solutions, but I think they're worth a try.
Have you checked out Microsoft's new Velocity? http://msdn.microsoft.com/en-us/data/cc655792.aspx. Unlike many other cloud services, you can run Velocity on your own premises.

Document/Image Database Repository Design Question

Question:
Should I write my application to directly access a database image repository, or write a middleware piece to handle document requests?
Background:
I have a custom Document Imaging and Workflow application that currently stores about 15 million documents/document images (90%+ single page, group 4 tiffs, the rest PDF, Word and Excel documents). The image repository is a commercial, 3rd party application that is very expensive and frankly has too much overhead. I just need a system to store and retrieve document images.
I'm considering moving the imaging directly into a SQL Server 2005 database. The indexing information is very limited - basically 2 index fields. It's a life insurance policy administration system, so I index images with a policy number and a system-wide unique ID number. There are other index values, but they're stored and maintained separately from the image data. Those index values give me the ability to look up the unique ID value for individual image retrieval.
The database server is a dual quad-core Windows 2003 box with SAN drives hosting the DB files. The current image repository size is about 650GB. I haven't done any testing to see how large the converted database will be. I'm not really asking about the database design - I'm working with our DBAs on that aspect. If that changes, I'll be back :-)
The current system to be replaced is obviously a middleware application, but it's a very heavyweight system spread across 3 windows servers. If I go this route, it would be a single server system.
My primary concerns are scalability and performance - heavily weighted toward performance. I have about 100 users, and usage growth will probably be slow for the next few years.
Most users are primarily read users - they don't add images to the system very often. We have a department that handles scanning and otherwise adding images to the repository. We also have a few other applications that receive documents (via FTP) and insert them into the repository automatically as they are received, either with full index information or as "batches" that a user reviews and indexes.
Most (90%+) of the documents/images are very small, < 100K, probably < 50K, so I believe that storage of the images in the database file will be the most efficient rather than getting SQL 2008 and using a filestream.
Oftentimes scalability and performance are ultimately married to each other, in the sense that six months from now management comes back and says "Function Y in Application X is running unacceptably slow, how do we speed it up?" And all too often the answer is to upgrade the back-end solution. And when it comes to upgrading back ends, it's almost always going to be less expensive to scale out than to scale up in terms of hardware.
So, long story short, I would recommend building a middleware app that specifically handles incoming requests from the user app and then routes them to the appropriate destination. This will sufficiently abstract your front-end user app from the back end storage solution so that when scalability does become an issue only the middleware app will need to be updated.
This is straightforward. Write the application to an interface, use some kind of factory mechanism to supply that interface, and implement that interface however you want.
Once you're happy with your interface, then the application is (mostly) isolated from the implementation, whether it's talking straight to a DB or to some other component.
Thinking ahead a bit on your interface design, while doing bone-stupid, "it's simple, it works here, it works now" implementations, offers a good balance of future-proofing the system without necessarily over-engineering it.
It's easy to argue you don't even need an interface at this juncture, rather just a simple class that you instantiate. But if your contract is well defined (i.e. the interface or class signature), that is what protects you from change (such as redoing the back end implementation). You can always replace the class with an interface later if you find it necessary.
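As a rough sketch of what that looks like (the names are illustrative, not a prescription):

// The contract the rest of the application codes against.
interface DocumentStore {
  get(documentId: string): Promise<Uint8Array>;
  put(policyNumber: string, documentId: string, image: Uint8Array): Promise<void>;
}

// One implementation talks straight to the database; a later one could talk to
// a middleware service instead without the callers noticing.
class SqlDocumentStore implements DocumentStore {
  async get(documentId: string): Promise<Uint8Array> {
    throw new Error("not implemented: SELECT the image row by its unique id");
  }
  async put(policyNumber: string, documentId: string, image: Uint8Array): Promise<void> {
    throw new Error("not implemented: INSERT the image with its two index fields");
  }
}

// The factory is the only place that knows which implementation is in play.
function createDocumentStore(kind: "sql" | "middleware" = "sql"): DocumentStore {
  if (kind === "sql") return new SqlDocumentStore();
  throw new Error(`no store registered for "${kind}"`);
}

Swapping the back end later means adding another class and one branch in the factory; the callers never change.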
As far as scalability, test it. Then you know not only if you may need to scale, but perhaps when as well. "Works great for 100 users, problematic for 200, if we hit 150 we might want to consider taking another look at the back end, but it's good for now."
That's due diligence and a responsible design tactic, IMHO.
I agree with gabriel1836. An added benefit would be that you could run a hybrid system for a time, since you aren't going to convert 14 million documents from your proprietary system to your home-grown system overnight.
Also, I would strongly encourage you to store the documents outside of the database. Store them on a file system (local, SAN, NAS, it doesn't matter) and store pointers to the documents in the database.
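A minimal sketch of that layout, assuming a Node/TypeScript environment (the directory scheme and record shape are made up for illustration); the returned record is what you would insert into the database:

import { mkdir, writeFile } from "node:fs/promises";
import { join } from "node:path";

// The row that goes into the database instead of the image itself.
interface ImagePointer {
  policyNumber: string;
  documentId: string;
  path: string;
  bytes: number;
}

async function storeImage(root: string, policyNumber: string, documentId: string, image: Uint8Array): Promise<ImagePointer> {
  // Shard by the leading characters of the id so no single folder ends up with millions of files.
  const dir = join(root, documentId.slice(0, 2), documentId.slice(2, 4));
  await mkdir(dir, { recursive: true });
  const path = join(dir, `${documentId}.tif`);
  await writeFile(path, image);
  return { policyNumber, documentId, path, bytes: image.length };
}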
I'd love to know what document management system you are using now.
Also, don't underestimate the effort of replacing the capture (scanning and importing) provided by the proprietary system.
