I would like to know a few things about Camel headers:
What are the pros and cons of setting too many headers on a Camel Exchange?
What are the pros and cons of setting headers that are too large on a Camel Exchange?
If you have "too many" or "too large" headers, then by definition there will be problems. So let's instead consider simply many headers and large headers.
Many Headers
The headers in Camel are held in a java.util.TreeMap, so there are some performance characteristics of that data structure to consider. There may be an issue if many headers are added all at once in their natural order, as the tree would need to rebalance several times. Also, keep in mind that looking up a specific header is an O(log n) operation, so there could be efficiency issues if specific headers are queried repeatedly.
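As a rough illustration of those costs, here is a minimal sketch using a plain java.util.TreeMap (not Camel's actual header map, just the same data structure):

    import java.util.Map;
    import java.util.TreeMap;

    public class HeaderMapSketch {
        public static void main(String[] args) {
            // A TreeMap keeps its keys sorted; each put/get walks a red-black tree in O(log n).
            Map<String, Object> headers = new TreeMap<>();
            for (int i = 0; i < 10_000; i++) {
                headers.put("header-" + i, i);       // inserts in sorted order trigger rebalancing
            }
            Object value = headers.get("header-42"); // O(log n) lookup, not O(1) as in a HashMap
            System.out.println(value);
        }
    }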
Large Headers
By themselves, large headers don't necessarily cause any problems. The issues arise in systems where there are many exchanges, each with its own large object that needs to be manipulated. Holding all of these objects in memory is taxing on the system, but not because of any deficiency in Camel.
That said, truly large headers would be atypical. If you need to process large objects in Camel, it is usually better to carry them as streams in the body of the message.
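For example, a Processor might keep the headers down to small metadata and carry the heavy payload as a stream in the body. This is only a minimal sketch; the file path and header names are made up:

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    import org.apache.camel.Exchange;
    import org.apache.camel.Processor;

    public class LargePayloadProcessor implements Processor {
        @Override
        public void process(Exchange exchange) throws Exception {
            // Small, cheap-to-copy metadata goes into headers.
            exchange.getIn().setHeader("reportId", "12345");        // hypothetical header
            exchange.getIn().setHeader("contentType", "text/csv");  // hypothetical header

            // The large payload is carried as a stream in the body,
            // so it is not held fully in memory the way a large header value would be.
            InputStream largePayload = Files.newInputStream(Paths.get("/tmp/big-report.csv"));
            exchange.getIn().setBody(largePayload);
        }
    }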
Related
Rolling with a similar example: if you have Companies, and companies have Divisions, and divisions have Employees, when you make a GET request for a company how do you decide what data to embed? For example, you could return a Company with nested Divisions and Employees. Or, you could make three separate calls.
I've been using the GitHub API a bit, and they seem to embed some data provided it's not an array. For example, when you request a repo it may have the owner embedded but not issues and pull requests. How did they decide on this?
Also, it seems like mileage may vary here depending on your data store (SQL vs. NoSQL).
Found this example, but it's not quite the same.
when you make a GET request for a company how do you decide what data to embed?
I think you're asking something like "when designing a resource, how do you decide what information belongs in the representation, and what information is linked?"
An answer to that is to pay attention to the fact that caching happens at the resource level -- the target-uri is the cache key. When we invalidate a resource, we invalidate all of its representations.
An implication here is that we want the caching policy of the resource to be favorable for all of the information included in the representation.
Mixing data that changes minute-by-minute with data that has an expected half life of a year makes it difficult to craft a sensible caching policy. So it might make more sense to carve the information into piles with similar life cycles, and have a separate resource for each.
Consider a website like stack overflow: the branding (logo, style sheets, images) doesn't change very often; the page contents (questions, answers, comments) change at a higher cadence. The total network bandwidth consumed is probably considerably lower if you link to the big, slowly changing design elements.
(Also, once you do use links, you have the ability to move the different resources independently - moving content to a CDN, routing high and low priority requests differently, and so on).
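As a hedged illustration of that split, here is a sketch using the plain Servlet API with made-up resource paths: the slowly changing resource gets a long max-age, the fast-changing one a short, revalidated one:

    import java.io.IOException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class CompanyServlet extends HttpServlet {
        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
            if (req.getRequestURI().startsWith("/companies/logo")) {
                // Branding changes rarely: cache aggressively.
                resp.setHeader("Cache-Control", "public, max-age=31536000");
            } else {
                // Company data changes often: cache briefly and revalidate.
                resp.setHeader("Cache-Control", "max-age=60, must-revalidate");
            }
            resp.setContentType("application/json");
            resp.getWriter().write("{}"); // representation body omitted
        }
    }

Splitting the two kinds of data into separate resources is what makes two different Cache-Control policies possible in the first place.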
https://cloud.google.com/appengine/docs/python/taskqueue/overview-pull
Is there a limit on taskqueue pull task tag names, or can there be millions of arbitrary task tags?
There's no documented hard limit, and I wouldn't expect there to be one; a couple of reasons come to mind:
Internally, tasks are stored in Bigtable like everything else, so one could imagine tags are indexed the same way we're used to for our own data, and there's no limit there.
The database is designed to find indexed data very efficiently, and they purposely denied us methods to group tags, so we can't use them to fan in data; that means no merge-joins are needed, and thus performance is guaranteed to scale indefinitely :)
In this thread people are talking about how reliable the queue is when testing the limits, and this quote is interesting:
We were using many different tags (basically regrouping events per user with several million users).
So at least this one guy just went with it and used millions of tags, with no issues directly related to the practice.
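For reference, a hedged sketch of that per-user-tag pattern using the (now legacy) App Engine Java Task Queue API; the queue name and tag format are assumptions:

    import java.util.List;
    import java.util.concurrent.TimeUnit;

    import com.google.appengine.api.taskqueue.Queue;
    import com.google.appengine.api.taskqueue.QueueFactory;
    import com.google.appengine.api.taskqueue.TaskHandle;
    import com.google.appengine.api.taskqueue.TaskOptions;

    public class UserEventQueue {
        private final Queue pullQueue = QueueFactory.getQueue("user-events"); // hypothetical pull queue

        /** Enqueue one event, tagged with the user it belongs to. */
        public void enqueue(String userId, String eventJson) {
            pullQueue.add(TaskOptions.Builder
                    .withMethod(TaskOptions.Method.PULL)
                    .tag("user-" + userId)   // one tag per user: potentially millions of distinct tags
                    .payload(eventJson));
        }

        /** Lease up to 100 tasks carrying a single user's tag. */
        public List<TaskHandle> leaseForUser(String userId) {
            return pullQueue.leaseTasksByTag(60, TimeUnit.SECONDS, 100, "user-" + userId);
        }
    }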
I have a basic question with regard to filesystem usage.
I want to use an embedded, persistent key-value store that is very write-oriented. Say my value size is:
a) 10 KB
b) 1 MB
and reads and updates are equal in number.
Can't I simply create files containing the values, with their names acting as keys?
Won't that be as fast as using a key-value store such as LevelDB or RocksDB?
Can anybody please help me understand?
In principle, yes, a filesystem can be used as a key-value store. The differences only come in when you look at individual use cases and limitations in the implementations.
Without going into too much detail here, some things are likely to be very different:
A filesystem splits data into fixed-size blocks. Two files can't typically occupy parts of the same block. Common block sizes are 4-16 KiB; you can calculate how much overhead your 10 KiB example would cause (with 16 KiB blocks, about 6 KiB per value, over a third of the space, is wasted). Key/value stores tend to account for smaller-sized pieces of data.
Directory indexes in filesystems are often not capable of efficiently iterating over the filenames/keys in sort order. You can efficiently look up a specific key, but you can't retrieve ranges without reading pretty much all of the directory entries. Some key/value stores, including LevelDB, support efficient ordered iterating.
Some key/value stores, including LevelDB, are transactional. This means you can bundle several updates together, and LevelDB will make sure that either all of these updates make it through, or none of them do. This is very important to prevent your data getting inconsistent. Filesystems make this much harder to implement, especially when multiple files are involved.
Key/value stores usually try to keep data contiguous on disk (so data can be retrieved with less seeking), whereas modern filesystems deliberately do not do this across files. This can impact performance rather severely when reading many records. It's not an issue on solid-state disks, though.
While some filesystems do offer compression features, they are usually either per-file or per-block. As far as I can see, LevelDB compresses entire chunks of records, potentially yielding better compression (though they biased their compression strategy towards performance over compression efficiency).
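To make the comparison concrete, here is a hedged sketch of the "file per key" approach the question describes, using java.nio; the directory layout is made up, and the comments note what LevelDB/RocksDB would add on top:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class FileKeyValueStore {
        private final Path root;

        public FileKeyValueStore(Path root) throws IOException {
            this.root = Files.createDirectories(root);
        }

        /** One file per key; each value occupies whole filesystem blocks (typically 4-16 KiB each). */
        public void put(String key, byte[] value) throws IOException {
            Files.write(root.resolve(key), value);
        }

        public byte[] get(String key) throws IOException {
            return Files.readAllBytes(root.resolve(key));
        }

        // What this does NOT give you, compared to LevelDB/RocksDB:
        //  - ordered range scans over keys (directory listings are effectively unordered),
        //  - atomic multi-key batches (two puts can be torn apart by a crash),
        //  - contiguous on-disk layout and block-level compression of many small records.
    }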
Let's try to build a minimal NoSQL DB server using Linux and a modern filesystem in 2022, just for fun, not for a serious environment.
DO NOT TRY THIS IN PRODUCTION
—————————————————————————————————————————————
POSIX file API for reads and writes.
POSIX ACLs for native user account and group permission management.
POSIX filename as key: (root db folder)/(table folder)/(partition folder)/(64-bit key). Per DB and table we can define read/write permissions using POSIX ACLs. The (64-bit key) is generated in the compute function (see the sketch after this list).
Mount Btrfs/OpenZFS/F2FS as the filesystem to get compression (LZ4/zstd) and encryption (fscrypt) as native support. F2FS is especially suitable as it is log-structured, similar to the LSM design many NoSQL DBs use in their low-level architecture.
Metadata is handled by the filesystem, so there is no need to implement it.
Use Linux and/or the filesystem to configure the page, file, or disk block cache according to the read/write patterns of the business logic written in the compute function or DB procedure.
Use RAID and sshfs for remote replication to create master/slave high availability and/or backups.
The compute function or DB procedure holding the write logic could be a Node.js file, a Go binary, or whatever, along with a standard HTTP/TCP/WebSocket server module that reads and writes contents to the DB.
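A minimal sketch of that write path in Java, assuming a POSIX filesystem; the root folder, table layout, and key function are all made up for illustration:

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.attribute.PosixFilePermissions;

    public class FilesystemNoSqlSketch {
        private static final Path ROOT = Paths.get("/srv/minidb"); // hypothetical root db folder

        /** "Compute function": derive a 64-bit key (FNV-1a hash here, as a stand-in). */
        static String computeKey(String logicalKey) {
            long hash = 0xcbf29ce484222325L;
            for (byte b : logicalKey.getBytes(StandardCharsets.UTF_8)) {
                hash ^= (b & 0xff);
                hash *= 0x100000001b3L;
            }
            return Long.toHexString(hash);
        }

        static void put(String table, String partition, String logicalKey, byte[] value) throws IOException {
            Path dir = Files.createDirectories(ROOT.resolve(table).resolve(partition));
            Path file = dir.resolve(computeKey(logicalKey));
            Files.write(file, value);
            // Restrict read/write per table via POSIX permissions (full ACLs would be set with setfacl).
            Files.setPosixFilePermissions(file, PosixFilePermissions.fromString("rw-r-----"));
        }
    }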
This week I read an interesting article which explains how the authors implemented an activity feed. Basically, they're using two approaches to handle activities, which I'm adapting to my scenario. So, supposing we have a user foo who has a certain number (x) of followers:
if x < 500, the activity will be copied to every follower's feed
this means slow writes, fast reads
if x > 500, only a link will be made between foo and his followers
in theory, fast writes but slow reads
So when a user accesses their activity feed, the server will fetch and merge all the data: fast lookups in their own copied activities, and then queries across the links. If a timeline has a limit of 20, then I fetch 10 of each and merge them.
I'm trying to do it with Riak and its Linking feature, so this is my question: is linking faster than copying? Is my idea of the architecture good enough? Are there other solutions and/or technologies I should look at?
P.S.: I'm not implementing an activity feed for production; it's just for learning how to implement one that performs well, and to use Riak a bit.
Two thoughts.
1) No, Linking (in the sense of Riak Link Walking) is very likely not the right way to implement this. For one, each link is stored as a separate HTTP header, and there is a recommended limit in the HTTP spec on how many header fields you should send. (Although, to be fair, in tests you can use upwards of 1000 links in the header with Riak and it seems to work fine. But it's not recommended.) More importantly, querying those links via the Link Walking API actually uses MapReduce on the backend and is fairly slow for the kind of usage you're intending.
This is not to say that you can't store JSON objects that are lists of links, sure, that's a valid approach. I'm just recommending against using Riak links for this.
2) As for how to properly implement it, that's a harder question, and it depends on your traffic and use case. But your general approach is valid: copy the feed below some threshold X (whether X is 500 or much smaller should be determined in testing), and link when the count is greater than X.
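Here is a hedged sketch of that hybrid fan-out logic in plain Java, not tied to the Riak client API; the store interface is made up:

    import java.util.List;

    /** Hybrid fan-out: copy the activity to each follower's feed below a threshold, otherwise just link. */
    public class ActivityFanOut {
        private static final int THRESHOLD = 500; // tune in testing, as noted above

        interface FeedStore {                               // hypothetical storage abstraction
            void appendToFeed(String userId, String activityId);   // "copy" path
            void linkActivity(String authorId, String activityId); // "link" path, resolved at read time
            List<String> followersOf(String userId);
        }

        private final FeedStore store;

        public ActivityFanOut(FeedStore store) {
            this.store = store;
        }

        public void publish(String authorId, String activityId) {
            List<String> followers = store.followersOf(authorId);
            if (followers.size() < THRESHOLD) {
                // Slow writes, fast reads: materialize the activity into every follower's feed.
                for (String follower : followers) {
                    store.appendToFeed(follower, activityId);
                }
            } else {
                // Fast writes, slower reads: store one link and merge it in when the feed is read.
                store.linkActivity(authorId, activityId);
            }
        }
    }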
How should you link? You have 3 choices, all with tradeoffs. 1) Use Secondary Indices (2i), 2) Use Search, or 3) Use links "manually", meaning, store JSON documents with URLs that you dereference manually (versus using link walking queries).
I highly recommend watching this video: http://vimeo.com/album/2258285/page:2/sort:preset/format:thumbnail (Building a Social Application on Riak), by the Clipboard engineers, to see how they solved this problem. (They used Search for linking, basically).
I am writing an application which parses a large file, generates a large amount of data, and does some complex visualization with it. Since all this data can't be kept in memory, I did some research and I'm starting to consider embedded databases as a temporary container for it.
My question is: is this a traditional way of solving this problem? And is an embedded database (other than structuring data) supposed to manage data by keeping in memory only a subset (like a cache), while the rest is kept on disk? Thank you.
Edit: to clarify: I am writing a desktop application. The application will be given an input file that is hundreds of MB in size. After reading the file, the application will generate a large number of graphs to be visualized. Since the graphs may have such a large number of nodes, they may not fit into memory. Should I save them into an embedded database which will take care of keeping only the relevant data in memory (do embedded databases do that?), or should I write my own sophisticated module which does that?
Tough question - but I'll share my experience and let you decide if it helps.
If you need to retain the output from processing the source file, and you use that to produce multiple views of the derived data, then you might consider using an embedded database. The reasons to use an embedded database (IMHO):
To take advantage of RDBMS features (ACID, relationships, foreign keys, constraints, triggers, aggregation...)
To make it easier to export the data in a flexible manner
To enable access to your processed data to external clients (known format)
To allow more flexible transformation of the data when preparing for viewing
Factors which you should consider when making the decision:
What is the target platform(s) (Windows, Linux, Android, iPhone, PDA)?
What technology base? (Java, .Net, C, C++, ...)
What resource constraints are expected or need to be designed for? (RAM, CPU, HD space)
What operational behaviours do you need to take into account (connected to network, disconnected)?
On the typical modern desktop there is enough spare capacity to handle most operations. On eeePCs, PDAs, and other portable devices, maybe not. On embedded devices, very likely not. The language you use may have built-in features to help with memory management; maybe you can take advantage of those. The connectivity aspect (stateful / stateless / etc.) may impact how much you really need to keep in memory at any given point.
If you are dealing with really big files, then you might consider a streaming process approach so you only have in memory a small portion of the overall data at a time - but that doesn't really mean you should (or shouldn't) use an embedded database. Straight text or binary files could work just as well (record based, column based, line based... whatever).
Some databases will allow you more effective ways to interact with the data once it is stored - it depends on the engine. I find that if you have a lot of aggregation required in your base files (by which I mean the files you generate initially from the original source) then an RDBMS engine can be very helpful to simplify your logic. Other options include building your base transform and then adding additional steps to process that into other temporary stores for each specific view, which are then in turn processed for rendering to the target (report?) format.
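As a hedged example of that base-transform-then-aggregate flow, here is a sketch using an embedded H2 database over plain JDBC (the JDBC URL and schema are assumptions, and the H2 driver is assumed to be on the classpath):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class DerivedDataStore {
        public static void main(String[] args) throws Exception {
            // Embedded, file-backed database: the engine pages data in and out of memory for us.
            try (Connection conn = DriverManager.getConnection("jdbc:h2:./derived-data")) {
                try (Statement st = conn.createStatement()) {
                    st.execute("CREATE TABLE IF NOT EXISTS nodes(graph_id INT, node_id INT, weight DOUBLE)");
                }

                // Base transform: insert records parsed from the source file (values are placeholders).
                try (PreparedStatement ins = conn.prepareStatement(
                        "INSERT INTO nodes(graph_id, node_id, weight) VALUES (?, ?, ?)")) {
                    ins.setInt(1, 1);
                    ins.setInt(2, 42);
                    ins.setDouble(3, 0.7);
                    ins.executeUpdate();
                }

                // Aggregation for one view, pushed down to the engine instead of done in application memory.
                try (Statement st = conn.createStatement();
                     ResultSet rs = st.executeQuery(
                             "SELECT graph_id, COUNT(*), AVG(weight) FROM nodes GROUP BY graph_id")) {
                    while (rs.next()) {
                        System.out.printf("graph %d: %d nodes, avg weight %.2f%n",
                                rs.getInt(1), rs.getLong(2), rs.getDouble(3));
                    }
                }
            }
        }
    }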
Just a stream-of-consciousness response - hope that helps a little.
Edit:
Per your further clarification, I'm not sure an embedded database is the direction you want to take. You either need to make some sort of simplifying assumptions for rendering your graphs or investigate methods like segmentation (render sections of the graph and then cache the output before rendering the next section).