Trunk-Based Deployment: How do you avoid Feature Flag Clutter? - continuous-deployment

For developers that use Trunk-based development, how have you approached dealing with an ever-growing collection of feature flags in your codebase?
My concern is if you are heavily leveraging feature flags for every release and every new feature, wouldn't the amount of feature flag code start to make the code less readable and possibly harder to maintain? For the sake of this question, assume feature flags are being handled by an external FFaaS.
Based on my own reasoning I can see a few options:
Never delete feature flags. Keep them around in case you need them later (e.g. for sunsetting a feature you are phasing out at some later date).
Periodically remove old feature flags that have remained on for X amount of time. This solves the code readability issue, but this breaks the trunk-based deploy paradigm as you lose out on the fallback measure of turning on/off a flag since the flag itself is being removed. You may also lose out on the above case where you'd have to manually track down functionality to phase out, or maybe re-introduce feature flags back in to facilitate some similar transition.
How have people handled the logistics around using this development system?

Related

Versioning with SemVer

I need a bit of help/advice with versioning with SemVer.
I'm working on a website for a client who has sent several related amends, both large and small, in a Word document (like they always do).
I have a branch based off my master branch for these new amends, and have created commits for each completed amend I have done so far.
The idea was that I would complete all of the amends and then release them in the next release (v2.0.0) because I think all of these changes are related and all of them combined are significant enough to warrant a bump in version number.
The issue I have is that the client wants a few of these amends to be made live immediately, before the release of 2.0.0, so what would the best way of handling this be - would I upload these few completed amends into the existing version and increment the minor number, or would I bump it up to 2.0.0 even though all of the amends aren't complete?
I am a bit of a noob when it comes to versioning, but I am trying to learn as best I can by reading and trying to make sense of the Semantic Versioning site.
You should always consider these two things:
What are the real changes? If there are no visible changes, and/or no major changes underneath, it may be better to avoid incrementing the major version number.
What should the customer perceive of your changes? Calling it version 1.1 or version 2.0 may make a difference in how the changes are perceived.
So if the modifications are limited and/or nothing visible has changed, it may make sense to increment only the minor number now, and then wait until all of the amends are complete to bump it to 2.0.0.
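To make the two options concrete, here is a minimal sketch of the bump rules in Python. The `Version` class and the starting version 1.4.2 are made up for illustration; they are not part of any SemVer tooling.

```python
from typing import NamedTuple


class Version(NamedTuple):
    """A MAJOR.MINOR.PATCH version number (hypothetical helper, not a library API)."""
    major: int
    minor: int
    patch: int

    @classmethod
    def parse(cls, text: str) -> "Version":
        # Accept both "1.4.2" and "v1.4.2".
        major, minor, patch = (int(part) for part in text.lstrip("v").split("."))
        return cls(major, minor, patch)

    def bump(self, part: str) -> "Version":
        # Bumping a number resets every number to its right to zero.
        if part == "major":
            return Version(self.major + 1, 0, 0)
        if part == "minor":
            return Version(self.major, self.minor + 1, 0)
        if part == "patch":
            return Version(self.major, self.minor, self.patch + 1)
        raise ValueError(f"unknown part: {part}")

    def __str__(self) -> str:
        return f"{self.major}.{self.minor}.{self.patch}"


current = Version.parse("1.4.2")        # assumed current release
urgent_release = current.bump("minor")  # the few amends the client wants live now
full_release = urgent_release.bump("major")  # later, once all amends are complete
```

With these assumed numbers, the urgent amends would ship as a minor release and the complete set would still become 2.0.0 afterwards.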

Techniques to improve transaction rate

Lighttpd, nginx and others use a range of techniques to provide maximum application performance such as AIO, sendfile, MMIO, caching and epoll and lock free data structures.
My colleague and I have written a little application server which uses many of these techniques and can also serve static files. We tested it with ApacheBench, compared it with lighttpd and nginx, and have at least matched their performance for static content for files from 100 bytes to 1K.
However, when we compare the transaction rate over the same static files to that of G-WAN, G-WAN is miles ahead.
I know this question may be a little subjective, but what techniques apart from the obvious ones I've mentioned might Pierre Gauthier be using in G-WAN that would enable him to achieve such astounding performance?
Following G-WAN server for years, I have read the (many) talks covering this question on the old G-WAN forum.
From what I can remember, the points repeatedly addressed were the program's:
architecture (specific comparisons were made with nginx, lighty and cherokee)
implementation (how overall branching, request parsing and response building were made)
lean common path (the path followed by all types of requests: dynamic, static, handlers)
Pierre often mentioned other servers to explain what in their specific architecture and implementation was slowing them down.
As time goes by, since G-WAN keeps stacking up more and more features (C# script support, a reverse proxy and a load balancer are expected with the next version), those three points above seem to become more and more important.
This is probably why each new release of G-WAN seems to aim to be faster than the previous one: the more work you do, the more extra fat must be eliminated, because its cost gets higher. And as with a race car or a plane, this is an incremental process, one calling for more of the other.
If you are looking for the 'secret' of G-WAN's speed then I guess that here is the key point. But if you want more details then you should rather talk directly to the G-WAN author.
Check out G-WAN's timeline. An update on August 8, 2011 might give you an idea of what he is using.
G-WAN Timeline
Pierre mentioned that G-WAN uses its wait-free key-value store a lot in G-WAN's core functions, which gives it more speed since no locks are used.
He also uses a Lorenz Waterwheel inspired technique to handle threads. I am not sure how it works but he said that it allows G-WAN to run faster in every possible case.

Combining cache methods - memcache/disk based

Here's the deal. We would have taken the complete static html road to solve performance issues, but since the site will be partially dynamic, this won't work out for us.
What we have thought of instead is using memcache + eAccelerator to speed up PHP and take care of caching for the most used data.
Here are the two approaches we have thought of so far:
Using memcache on >>all<< major queries and leaving it alone to do what it does best.
Using memcache for the most commonly retrieved data, and combining it with a standard hard-drive-stored cache for everything else.
The major advantage of only using memcache is of course the performance, but as the number of users increases, the memory usage gets heavy. Combining the two sounds like a more natural approach to us, despite the theoretical compromise in performance.
Memcached appears to have some replication features available as well, which may come in handy when it's time to increase the number of nodes.
What approach should we use?
- Is it stupid to compromise and combine the two methods? Should we instead focus on using memcache alone and upgrade the memory as the load increases with the number of users?
Thanks a lot!
Compromising and combining these two methods is a very clever approach, I think.
The most obvious cache management rule is the latency vs. size rule, which is also used in CPU caches. In a multi-level cache, each successive level should be larger to compensate for its higher latency: you pay more latency but get a higher cache hit ratio. So I would not recommend placing a disk-based cache in front of memcache. On the contrary, it should be placed behind memcache. The only exception is if your cache directory is mounted in memory (tmpfs). In that case the file-based cache could absorb some of the load on memcache, and could also have latency benefits (because of data locality).
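As a rough illustration of that ordering, here is a sketch in Python where a small in-memory dict stands in for memcache and a directory of JSON files stands in for the larger, slower disk cache behind it. The `TwoLevelCache` class and the `load` callback are hypothetical names, not part of any memcache client.

```python
import json
import os
import tempfile
from hashlib import sha256


class TwoLevelCache:
    """Small fast level in front, larger slow level behind (illustrative only)."""

    def __init__(self, directory, memory_capacity=1000):
        self.memory = {}                      # stand-in for memcache
        self.memory_capacity = memory_capacity
        self.directory = directory            # stand-in for the disk cache

    def _path(self, key):
        # Hash the key so that any string is a safe file name.
        return os.path.join(self.directory, sha256(key.encode()).hexdigest())

    def get(self, key, load):
        # Level 1: memory.
        if key in self.memory:
            return self.memory[key]
        # Level 2: disk, consulted only after a memory miss.
        path = self._path(key)
        if os.path.exists(path):
            with open(path) as handle:
                value = json.load(handle)
        else:
            # Miss on both levels: go to the backend and fill the disk level.
            value = load(key)
            with open(path, "w") as handle:
                json.dump(value, handle)
        self._remember(key, value)
        return value

    def _remember(self, key, value):
        # Evict the oldest inserted entry when the fast level is full.
        if len(self.memory) >= self.memory_capacity:
            self.memory.pop(next(iter(self.memory)))
        self.memory[key] = value


cache = TwoLevelCache(tempfile.mkdtemp(), memory_capacity=2)
page = cache.get("front-page", lambda key: {"title": key})
```

An entry evicted from the small level is still served from the larger one, so the backend is only hit when both levels miss. Expiry and invalidation are left out to keep the sketch short.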
These two stores (file-based and memcache) are not the only ones suitable for caching. You could also use almost any key-value database, as they are very good at concurrency control.
Cache invalidation is a separate question that deserves your attention. There are several tricks you can use to update the cache more gracefully on cache misses. One of them is dog-pile effect prevention. If several concurrent threads get a cache miss simultaneously, all of them go to the backend (database). The application should allow only one of them to proceed, and the rest should wait on the cache. The second is background cache update. It's better to update the cache in the background rather than in the web request thread. In the background you can control the concurrency level and update timeouts more gracefully.
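The first trick might look roughly like this within a single process, using one lock per key so that only the first thread to miss goes to the backend. This is a sketch with made-up names (`SingleFlightCache`, `load`); across several web servers you would need a shared lock instead, for example one built on memcached's atomic `add`.

```python
import threading


class SingleFlightCache:
    """On a miss, let one thread load the value while the others wait for it."""

    def __init__(self):
        self._values = {}
        self._locks = {}
        self._guard = threading.Lock()   # protects the dictionary of per-key locks

    def get(self, key, load):
        # Fast path: the value is already cached.
        if key in self._values:
            return self._values[key]
        # Find or create the lock that belongs to this key.
        with self._guard:
            lock = self._locks.setdefault(key, threading.Lock())
        with lock:
            # Re-check: another thread may have filled the cache while we waited.
            if key not in self._values:
                self._values[key] = load(key)
            return self._values[key]
```

Threads that miss on different keys do not block each other, because each key has its own lock.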
There is also one cool method which allows you to do tag-based cache tracking (memcached-tag, for example). It's very simple under the hood. With every cache entry you save a vector of the versions of the tags it belongs to (for example: {directory#5: 1, user#8: 2}). When you read a cache entry you also read the current version of each of those tags from memcached (this can be done efficiently with multiget). If at least one current tag version is greater than the tag version saved in the cache entry, then the entry is invalid. And when you change an object (for example a directory), the corresponding tag version should be incremented. It's a very simple and powerful method, but it has its own disadvantages. In this scheme you can't purge invalidated entries efficiently: memcached could easily evict live entries and keep stale ones.
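Here is a minimal sketch of that version-vector idea, with a plain dict standing in for memcached. The `TaggedCache` class and its key prefixes are invented for illustration and are not the memcached-tag API.

```python
class TaggedCache:
    """Tag-based invalidation via per-tag version counters (illustrative only)."""

    def __init__(self):
        self.store = {}   # stand-in for memcached

    def _current_versions(self, tags):
        # With real memcached this would be a single multiget.
        return {tag: self.store.get("tag:" + tag, 0) for tag in tags}

    def set(self, key, value, tags):
        # Save the value together with the tag versions seen at write time.
        self.store["entry:" + key] = (value, self._current_versions(tags))

    def get(self, key):
        entry = self.store.get("entry:" + key)
        if entry is None:
            return None
        value, saved_versions = entry
        current = self._current_versions(saved_versions)
        # The entry is stale if any of its tags has moved on since it was written.
        if any(current[tag] > version for tag, version in saved_versions.items()):
            return None
        return value

    def invalidate_tag(self, tag):
        # Changing an object bumps the version of its tag.
        self.store["tag:" + tag] = self.store.get("tag:" + tag, 0) + 1


cache = TaggedCache()
cache.set("listing", ["a.txt"], tags=["directory#5", "user#8"])
cache.invalidate_tag("directory#5")   # the directory changed
stale = cache.get("listing")          # the old listing is no longer returned
```

Note that the stale entry is never deleted here; it is only ignored on read, which is the disadvantage mentioned above.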
And of course you should remember: "There are only two hard things in Computer Science: cache invalidation and naming things" - Phil Karlton.
Memcached is quite a scalable system. For instance, you can replicate the cache to decrease access time for certain key buckets, or implement the Ketama algorithm, which lets you add or remove Memcached instances from the pool without remapping all keys. In this way, you can easily add new machines dedicated to Memcached when you happen to have extra memory. Furthermore, as instances can be run with different sizes, you can grow a single instance by adding more RAM to an old machine. Generally, this approach is more economical and to some extent not inferior to the first one, especially for multiget() requests. Regarding a performance drop with data growth, the runtime of the algorithms used in Memcached does not vary with the size of the data, and therefore the access time depends only on the number of simultaneous requests. Finally, if you want to tune your memory/performance priorities, you can set the expiry time and available-memory configuration values, which will restrict RAM usage or increase cache hits.
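To show why adding an instance does not remap every key, here is a small consistent-hashing ring in Python. It is written in the spirit of Ketama but is not the exact Ketama algorithm; the `HashRing` class and the node names are assumptions made for the example.

```python
import bisect
import hashlib


class HashRing:
    """Consistent hashing: each node owns many points on a ring of hash values."""

    def __init__(self, nodes, replicas=100):
        self.replicas = replicas
        self._points = []   # sorted hash points on the ring
        self._owner = {}    # hash point -> node name
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(text):
        return int(hashlib.md5(text.encode()).hexdigest(), 16)

    def add(self, node):
        # Place several points per node so keys spread evenly.
        for i in range(self.replicas):
            point = self._hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def node_for(self, key):
        # The key belongs to the first point clockwise from its own hash.
        index = bisect.bisect(self._points, self._hash(key)) % len(self._points)
        return self._owner[self._points[index]]


ring = HashRing(["cache-1", "cache-2", "cache-3"])   # hypothetical instance names
keys = [f"user:{n}" for n in range(1000)]
before = {key: ring.node_for(key) for key in keys}
ring.add("cache-4")
moved = [key for key in keys if ring.node_for(key) != before[key]]
```

Only the keys that now fall on the new instance's points move; every other key keeps its old instance, so most of the cache stays warm.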
At the same time, when you use a hard disk, the file system can become a bottleneck of your application. Besides general I/O latency, such things as fragmentation and huge directories can noticeably affect your overall request speed. Also, beware that default Linux hard disk settings are tuned more for compatibility than for speed, so it is advisable to configure the disk properly before use (for instance, you can try the hdparm utility).
Thus, before adding one more integration point, I think you should tune the existing system. Usually a properly designed database, well-configured PHP and Memcached, and proper handling of static data should be enough even for a high-load web site.
I would suggest that you first use memcache for all major queries. Then, test to find queries that are least used or data that is rarely changed and then provide a cache for this.
If you can isolate common data from rarely used data, then you can focus on improving performance on the more commonly used data.
Memcached is something that you use when you're sure you need to. You don't worry about it being heavy on memory, because when you evaluate it, you include the cost of the dedicated boxes that you're going to deploy it on.
In most cases putting memcached on a shared machine is a waste of time, as its memory would be better used caching whatever else it does instead.
The benefit of memcached is that you can use it as a shared cache between many machines, which increases the hit rate. Moreover, you can have the cache size and performance higher than a single box can give, as you can (and normally would) deploy several boxes (per geographical location).
Also the way memcached is normally used is dependent on a low latency link from your app servers; so you wouldn't normally use the same memcached cluster in different geographical locations within your infrastructure (each DC would have its own cluster)
The process is:
Identify performance problems
Decide how much performance improvement is enough
Reproduce problems in your test lab, on production-grade hardware with necessary driver machines - this is nontrivial and you may need a lot of dedicated (even specialised) hardware to drive your app hard enough.
Test a proposed solution
If it works, release it to production; if not, try more options and start again.
You should not
Cache "everything"
Do things without measuring their actual impact.
As your performance test environment will never be perfect, you should have sufficient instrumentation / monitoring that you can measure performance and profile your app IN PRODUCTION.
This also means that every single thing that you cache should have a cache hit/miss counter on it. You can use this to determine when the cache is being wasted. If a cache has a low hit rate (< 90%, say), then it is probably not worthwhile.
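A hit/miss counter can be as small as the following sketch. The `CountingCache` class and the `load` callback are illustrative names; in practice you would export the counters to whatever monitoring you already have.

```python
class CountingCache:
    """A cache that records how often it actually helps."""

    def __init__(self, name):
        self.name = name
        self.values = {}
        self.hits = 0
        self.misses = 0

    def get(self, key, load):
        if key in self.values:
            self.hits += 1
        else:
            self.misses += 1
            self.values[key] = load(key)
        return self.values[key]

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


user_cache = CountingCache("users")
for _ in range(10):
    user_cache.get("user:8", lambda key: {"id": key})   # same key every time
worthwhile = user_cache.hit_rate() >= 0.9               # threshold from the text above
```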
It may also be worth having the individual caches switchable in production.
Remember: OPTIMISATIONS INTRODUCE FUNCTIONAL BUGS. Do as few optimisations as possible, and be sure that they are necessary AND effective.
You can delegate the combination of disk/memory cache to the OS (if your OS is smart enough).
For Solaris, you can actually even add SSD layer in the middle; this technology is called L2ARC.
I'd recommend reading this for a start: http://blogs.oracle.com/brendan/entry/test.

Should you test an external system prior to using it?

Note: This is not for unit testing or integration testing. This is for when the application is running.
I am working on a system which communicates to multiple back end systems, which can be grouped into three types
Relational database
SOAP or WCF service
File system (network share)
Due to the environment this will run in, there are no guarantees that any of those will be available at run time. In fact some of them seem pretty brittle and go down multiple times a day :(
The thinking is to have a small bit of test code which runs before the actual code. If there is a problem, persist the request and poll the target system until it is available. Tests could possibly be rerun within the code to check it is still available at logical points. The ultimate goal is to have a very stable system, regardless of the stability (or lack thereof) of the systems it communicates with.
My questions around this design are:
Are there major issues with it? (small things like the fact it may fail between the test completing and the code running are understandable)
Are there better ways to implement this sort of design?
Would using traditional exception handling and/or transactions be better?
Updates
The system needs to talk to the back end systems in a coordinated way.
The system is very async in nature so using things like queuing technologies is fine.
The system must run even if one or more backend systems are down as others may be up and processing of some information is possible.
You will be needing that traditional exception handling no matter what, since as you point out there's always the chance that things'll fail between your last check and the actual request. So I really think any solution you find should try to interact smoothly with this.
You are not stating if these flaky resources need to interact in some kind of coordinated manner, which would indicate that you should probably be using a transaction manager of some sort to do this. I do not believe you want to get into the footwork of transaction management in application code for most needs.
Sometimes I have also seen people use AOP to encapsulate retry logic to back-end systems that fail (for instance due to time-out issues). Used sparingly this may be a decent solution.
In some cases you can also use message queuing technology to alleviate unstable back-ends. You could for instance commit to a message queue as part of a transaction, and only pop off the queue when successful. But this design is normally only possible when you're able to live with an asynchronous process.
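The "only pop off the queue when successful" idea can be sketched as follows. An in-memory `deque` stands in for a durable message queue here, and `drain` and `send` are hypothetical names; a real implementation would use a persistent queue and its transaction support.

```python
from collections import deque


def drain(queue, send):
    """Deliver queued requests in order; stop at the first failure.

    A request is removed from the queue only after `send` succeeded,
    so nothing is lost while the backend is down.
    """
    delivered = 0
    while queue:
        request = queue[0]        # peek, do not remove yet
        try:
            send(request)
        except ConnectionError:
            break                 # backend unavailable: keep the request queued
        queue.popleft()           # success: now it is safe to remove it
        delivered += 1
    return delivered
```

A caller would run `drain` again later (on a timer, say) and processing resumes from the request that failed. This also copes with a failure between an availability check and the real call, because the call itself is what decides whether the request leaves the queue.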
And as always, real stability can only be achieved by attacking the root cause of the problem. I had a 25-year-old bug in a mainframe TCP/IP stack fixed because we were overrunning it, so it is possible.
The Microsoft Smartclient framework provides a ConnectionMonitor class. Should be easy to use or duplicate.
Our approach to this kind of issue was to run a really basic 'sanity tester' prior to bringing up our main application. This was thick client so we could run the test every time the app started. This sanity test would go out and check things like database availability, and external network (extranet) access, and it could have been extended to do webservices as well.
If there was a failure, the user was informed, and crucially an email was also sent to the support/dev team. These emails soon became unwieldy as so many were being created, but we then set up filters, so we knew when something really bad was happening. Overall the approach worked pretty well. Our biggest win was being able to tell users that the system was down before they had entered data and got part way through a long-winded process. They absolutely loved it.
At a technical level the sanity tester was written in C#, it used exception handling in a conventional way not to find the problems it was looking for. The sanity program became a mini app in its own right, and it was standalone from the main app. If I were doing it again I'd use a logging framework to capture issues, which is more flexible than our hard-coded approach.

What advice can you give me for writing a meaningful benchmark?

I have developed a framework that is used by several teams in our organisation. The "modules" developed on top of this framework can behave quite differently, but they are all pretty resource-intensive, some more than others. They all receive data as input, analyse and/or transform it, and send it on.
We planned to buy new hardware and my boss asked me to define and implement a benchmark based on the modules in order to compare the different offers we have got.
My idea is to simply start sequentially each module with a well chosen bunch of data as input.
Do you have any advice? Any remarks on this simple procedure?
Your question is pretty broad, so unfortunately my answer will not be very specific either.
First, benchmarking is hard. Do not underestimate the effort necessary to produce meaningful, repeatable, high-confidence results.
Second, what is your performance goal? Is it throughput (transactions or operations per second)? Is it latency (the time it takes to execute a transaction)? Do you care about average performance? Do you care about worst-case performance? Do you care about the absolute worst case, or that 90%, 95% or some other percentile get adequate performance?
Depending on which goal you have, then you should design your benchmark to measure against that goal. So, if you are interested in throughput, you probably want to send messages / transactions / input into your system at a prescribed rate and see if the system is keeping up.
If you are interested in latency, you would send messages / transactions / input and measure how long it takes to process each one.
If you are interested in worst-case performance, you will add load to the system up to whatever you consider "realistic" (or whatever the system design says it should support).
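As a sketch of how these goals turn into numbers, the following Python measures per-item latency and derives throughput, a percentile and the worst case from the same samples. The function names are made up, the throughput figure assumes items are processed one after another, and `sorted` is only a placeholder for a real module.

```python
import math
import time


def measure_latencies(operation, inputs):
    """Run `operation` on each input and record how long each call takes."""
    latencies = []
    for item in inputs:
        start = time.perf_counter()
        operation(item)
        latencies.append(time.perf_counter() - start)
    return latencies


def percentile(samples, fraction):
    """Nearest-rank percentile: the smallest sample that is >= `fraction` of all samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(fraction * len(ordered)))
    return ordered[rank - 1]


def summarize(latencies):
    total = sum(latencies)
    return {
        "throughput_per_s": len(latencies) / total if total else float("inf"),
        "mean_s": total / len(latencies),
        "p95_s": percentile(latencies, 0.95),
        "worst_s": max(latencies),
    }


# Placeholder workload: sorting small lists stands in for a real module.
report = summarize(measure_latencies(sorted, [[3, 1, 2]] * 200))
```

Which of these figures matters depends on the goal you picked above; an offer can win on mean latency and still lose on the 95th percentile.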
Third, you do not say if these modules are going to be CPU bound or I/O bound, whether they can take advantage of multiple CPUs/cores, etc. As you are trying to evaluate different hardware solutions, you may find that your application benefits more from a great I/O subsystem than from a huge number of CPUs.
Fourth, the best benchmark (and the hardest) is to put realistic load into the system. That means recording data from a production environment and putting the new hardware solution through this data. Getting this done is harder than it sounds. Often it means adding all kinds of measurement points to the system to see how it behaves (if you do not have them already), modifying the existing system to add record/playback capabilities, modifying the playback to run at different rates, and getting a realistic (i.e., similar to production) environment for testing.
The most meaningful benchmark is to measure how your code performs under everyday usage. That will obviously provide you with the most realistic numbers.
Choose several real-life data sets and put them through the same processes your org uses every day. For extra credit, talk with the people that use your framework and ask them to provide some "best-case", "normal", and "worst-case" data. Anonymize the data if there are privacy concerns, but try not to change anything that could affect performance.
Remember that you are benchmarking and comparing two sets of hardware, not your framework. Treat all of the software as a black box and simply measure the hardware performance.
Lastly, consider saving the data sets and using them to similarly evaluate any later changes you make to the software.
If your system is supposed to be able to handle multiple clients all calling at the same time, then your benchmark should reflect this. Note that some calls will not play well together. For example, having 25 threads post the same bit of information at the same time could lead to locks on the server end, thus skewing your results.
From a nuts-and-bolts point of view, I've used Perl and its Benchmark module to gather the information I care about.
If you're comparing differing hardware, then measuring the cost per transaction will give you a good comparison of the trade offs of hardware for performance. One configuration may give you the best performance, but costs too much. A less expensive configuration may give you adequate performance.
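The comparison itself is simple arithmetic, sketched below. The `Offer` class, the prices and the throughput figures are all invented for the example; plug in your own quotes and measured results.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    name: str
    price: float                      # purchase price of the configuration
    transactions_per_second: float    # throughput measured by your benchmark

    @property
    def price_per_tps(self) -> float:
        return self.price / self.transactions_per_second


def best_value(offers, required_tps):
    """Among offers that meet the required throughput, pick the cheapest per unit of throughput."""
    adequate = [offer for offer in offers if offer.transactions_per_second >= required_tps]
    return min(adequate, key=lambda offer: offer.price_per_tps, default=None)


# Hypothetical figures for three vendor offers.
offers = [
    Offer("big-iron", price=20000, transactions_per_second=4000),
    Offer("mid-range", price=8000, transactions_per_second=2000),
    Offer("budget", price=3000, transactions_per_second=500),
]
choice = best_value(offers, required_tps=1500)
```

With these made-up numbers the fastest box is not the pick: the mid-range offer meets the requirement at a lower price per unit of throughput, and the budget offer is ruled out because it is not fast enough.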
It's important to emulate the "worst case" or "peak hour" of load. It's also important to test with "typical" volumes. It's a balancing act to get good server utilization, that doesn't cost too much, that gives the required performance.
Testing across hardware configurations quickly becomes expensive. Another viable option is to first measure on the configuration you have, then simulate that behavior across virtual systems using a model.
If you can, try to record some operations users (or processes) are doing with your framework, ideally using a clone of the real system. That gives you the most realistic data. Things to consider:
Which functions are most often used?
How much data is transferred?
Do not assume anything. If you think "that is going to be fast/slow", don't bet on it. In 9 out of 10 cases, you're wrong.
Create a top ten for the first two questions and work from that.
That said: If you replace old hardware with new hardware, you can expect roughly 10% faster execution for each year that has passed since you bought the first set (if the systems are otherwise pretty equal).
If you have a specialized system, the numbers may be completely different, but usually new hardware doesn't change much. For example, adding a useful index to a database can reduce the runtime of a query from two hours to two seconds. Hardware will never give you that.
As I see it, there are two kinds of benchmarks when it comes to benchmarking software. First, there are microbenchmarks, where you try to evaluate a piece of code in isolation, or how a system deals with a narrowly defined workload. Compare two sorting algorithms written in Java. Compare how fast two web browsers can each perform some DOM manipulation operation. Second, there are system benchmarks (I just made the name up), where you try to evaluate a software system under a realistic workload. Compare my Python-based backend running on Google Compute Engine and on Amazon AWS.
When dealing with Java and such like, keep in mind that the VM needs to warm up before it can give you realistic performance. If you measure time with the time command, the JVM startup time will be included. You almost always want to either ignore start-up time or keep track of it separately.
Microbenchmarking
During the first run, CPU caches are getting filled with the necessary data. The same goes for disk caches. During the next few runs the VM continues to warm up, meaning the JIT compiles what it deems helpful to compile. You want to ignore these runs and start measuring afterwards.
Make a lot of measurements and compute some statistics: mean, median, standard deviation. Plot a chart, look at it and see how much the results vary. Things that can influence the result include GC pauses in the VM, frequency scaling on the CPU, another process starting a background task (like a virus scan), and the OS deciding to move the process to a different CPU core. If you have a NUMA architecture, the effects would be even more marked.
In the case of microbenchmarks, all of this is a problem. Kill what processes you can before you begin. Use a benchmarking library that can do some of it for you, such as Caliper (https://github.com/google/caliper).
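If you end up rolling your own harness anyway, the warm-up and statistics steps above have roughly this shape. The sketch is in Python rather than Java, the `benchmark` function and its default run counts are assumptions, and a dedicated library will handle far more of the pitfalls than this does.

```python
import statistics
import time


def benchmark(function, warmup_runs=5, measured_runs=30):
    """Discard the first runs, then time the rest and summarize the samples."""
    # Warm-up: fills caches and lets any JIT settle; timings are thrown away.
    for _ in range(warmup_runs):
        function()

    samples = []
    for _ in range(measured_runs):
        start = time.perf_counter()
        function()
        samples.append(time.perf_counter() - start)

    return {
        "mean_s": statistics.mean(samples),
        "median_s": statistics.median(samples),
        "stdev_s": statistics.stdev(samples),   # needs at least two measured runs
        "samples": samples,                      # keep the raw data for plotting
    }


# Placeholder workload for the example.
result = benchmark(lambda: sorted(range(10000, 0, -1)))
```

A large gap between mean and median, or a large standard deviation, is a hint that something like a GC pause or a background task disturbed some of the runs.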
System benchmarking
In case of benchmarking a system under a realistic workload, these details do not really interest you and your problem is "only" to know what a realistic workload is, how to generate it and what data to collect. It is always best if you can instrument a production system and collect data there. You can usually do that, because you are measuring end-user characteristics (how long did a web page render) and these are I/O bound so the code gathering data does not slow down the system. (The page needs to be shipped to the user over the network, it does not matter if we also log a few numbers in the process).
Be mindful of the difference between profiling and benchmarking. Benchmarking can give you absolute time spent doing something, profiling gives you relative time spent doing something compared to everything else that needed doing. This is because profilers run heavily instrumented programs (common technique is to stop-the-world every few hundred ms and save a stack trace) and the instrumentation slows everything down significantly.
