How to deploy several critical bug fixes simultaneously, as soon as possible, with the Gitflow process? - continuous-deployment

So we use the Gitflow process for a safe and sound CI/CD
(https://www.atlassian.com/git/tutorials/comparing-workflows/gitflow-workflow)
But something's wrong.
Let's say we have 10 critical bugs on prod - they must be fixed asap.
So 10 bugs are assigned to 10 devs. Dev 1 creates hotfix 0.0.1, commits the fix to it, and the fix gets uploaded to the staging environment and handed to QA for verification. Meanwhile, dev 2 wants to create hotfix 0.0.2 for his bug, but he can't, as Gitflow forbids creating a new hotfix branch while one already exists.
So he either has to wait for hotfix/0.0.1 to be closed or commit to the same hotfix. Of course the normal choice is to commit to hotfix/0.0.1, because his bug is critical and can't wait for bug 1's fix.
And so do all the other devs with their critical bugs.
QA verifies bug 1 and confirms it for deployment. But the hotfix can't be closed and deployed, because the other 9 bugs are either untested or simply not fixed yet.
For the hotfix to be closed, every bug in it must be working and confirmed.
In our case this takes time, a lot of time - one to two weeks if we're lucky - as more and more things need fixing ASAP on production. So devs commit to that branch more and more, putting the first bugs on hold even longer.
So eventually, after all those bugs are fixed and confirmed, it is time to finally close the hotfix and deploy those critical bugs to prod... with quite a big delay. But there is also another big problem - THE MERGE with the develop branch, where other devs do their work (like minor bug fixes for the next release). A horrible merge that lasts several hours.
So we are obviously doing something wrong - what could be a solution for getting our hotfixes out on time without terrible merges back to develop?
One solution we considered is to limit the number of bugs a hotfix may contain - for example, 5 bugs max; other bugs must wait until those are fixed. This, though, means that only the lead dev should commit to the hotfix, as he must enforce the limit (it is hard to control otherwise). But this way bugs are still put on hold and fast deployment is not achieved.
What would be the normal procedure? Please share your ideas.

If you are spending several hours on a merge, then you are probably doing it wrong. One approach is to get a version control expert in for a day to train your team. That is a relatively small cost compared to salaries. I should think you could find a very good person for £500 ($600 US) per day, and you might only need them for a day or two.
I wonder also whether your testing (with a QA team) is too manual. Can bug fixes be accompanied by unit tests to prove they are an improvement? If so, the developer should be able to merge master into their bugfix branch, spin up a new instance for some simple QA checking, get QA to check the system, get a team lead to review the PR with the fix and unit tests, then merge straight into master and redeploy that fix on its own.
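A minimal sketch of that per-fix flow in git commands (branch, tag, and version numbers are illustrative):

    # one short-lived branch per critical bug, cut from master
    git checkout -b hotfix/bug-1 master
    # ...commit the fix together with a regression test...

    # keep the branch current, then hand it to QA on its own instance
    git merge master

    # once QA signs off, merge and release just this one fix
    git checkout master
    git merge --no-ff hotfix/bug-1
    git tag -a v1.0.1 -m "Hotfix: bug 1"
    git push origin master v1.0.1
    git branch -d hotfix/bug-1

Each bug ships independently, so one slow fix never blocks the other nine.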
Also, make sure your deployment to live is fully automated. Doing several live (small) releases per day is a good target to have.
Updates
The advice above still stands, but since you have now said this is an old project:
Get some automated tests in place. Try to set up functional/browser tests for the bulk of the project, so you can have confidence in changes.
Don't reject unit tests out of hand. Perhaps they could be just for new code or new changes, at least to start with? Yes, the project is old, but don't let yourself off the hook easily. Unit tests will pay dividends in the long run. If you don't have dependency injection in the project, get that added, even if it is not used for all instantiation immediately.
Repeated from above: make your releases smaller.

Related

Is it normal to need to restart Solr/Apache regularly?

It seems like about once every 2 to 3 months, something about our Solr implementation breaks down. Most recently, the process we use to reindex our Solr cores broke. It's a console application that just does two things: 1) clear the indexes, 2) rebuild the indexes. That's it, and it does so by issuing HTTP requests to the server.
I was getting a 500 response when it tried to clear the 1st core. None of the other cores had a problem, though (except for the fact that they came after the 1st one, and since it is a synchronous process, nothing got reindexed). I spent a little time troubleshooting, but ultimately I just restarted Apache and it worked.
That seems to be the solution for everything. I wish I could remember the previous issues I've run into, but something a little different seems to happen each time, and it's always just much easier to restart Apache than to spend the time troubleshooting (plus it always happens in production, and spending hours troubleshooting isn't really a good option when I can fix it in seconds).
I wish these issues happened in staging or development, where I could take time to investigate further, but it's always in production. So I'm starting to wonder if I should just create a task to restart the Apache server every night.
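If I did go the scheduled-restart route (treating the symptom rather than the cause), a single cron entry would be enough - a sketch, assuming Apache runs as a system service named apache2:

    # restart Apache at 03:00 every night (symptom relief, not a root-cause fix)
    0 3 * * * /usr/sbin/service apache2 restart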
When I restart it and it just suddenly works normally without me having to make a single change, I really have to wonder about the stability of Solr. Is this normal for others who use Solr?

Versioning with SemVer

I need a bit of help/advice with versioning with SemVer.
I'm working on a client's website; the client has sent several related amends, both large and small, in a Word document (like they always do).
I have a branch based off my master branch for these new amends, and have created commits for each completed amend I have done so far.
The idea was that I would complete all of the amends and then release them in the next release (v2.0.0) because I think all of these changes are related and all of them combined are significant enough to warrant a bump in version number.
The issue I have is that the client wants a few of these amends to be made live immediately, before the release of 2.0.0. What would be the best way of handling this - should I release the few completed amends on top of the existing version and increment the minor number, or bump it up to 2.0.0 even though all of the amends aren't complete?
I am a bit of a noob when it comes to versioning, but am trying to learn as best I can by reading and trying to make sense of the Semantic Versioning site.
You should always consider these two things:
What are the real changes? If there are no visible changes and no major changes underneath, it may be better not to increment the major version number.
What should the customer perceive of your changes? Saying "version 1.1" or "version 2.0" may make some difference in how the changes are perceived.
So if the modifications are limited and/or nothing visible has changed, it may make sense to increment only the minor number, and then wait for all of the amends to be complete before bumping to 2.0.0.
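A minimal sketch of that with git tags (the branch name and the current version number are illustrative):

    # ship the urgent amends now as a minor release
    git checkout master
    git merge --no-ff amends        # hypothetical branch holding the finished amends
    git tag -a v1.1.0 -m "First batch of client amends"

    # later, once the full set of amends is merged, cut the major release
    git tag -a v2.0.0 -m "Complete set of client amends"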

Web-Application Deployment Process, Scrum Sprints, Git-flow, Versioning [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 5 years ago.
We have a small scrum team which develops a website with a lot of users. We would like to improve our development and release/deployment process, but there are a lot of concepts out there which, in our eyes, don't fit together perfectly. Since we're not the only company developing a website, I thought it might be a good idea to ask here ;)
First of all, the software is not in good shape, so things like continuous delivery or continuous deployment are not possible because of bad test coverage, and that's nothing we can change soon.
We have two-week sprints, so we develop new features or resolve bugs in that period. After the final sprint meeting we merge the feature branches into master (we use feature branches in git and pull requests for review and merging), do some testing, and deploy master to a public beta environment. There we usually find some bugs, resolve them, and after that we deploy master, including the beta fixes, to the production environment. After that we start with our next sprint.
But that process is far from perfect. First of all, it's difficult to show features in our sprint review meeting with all those branches, and because we only have one master branch which gets deployed, we can't easily deploy hotfixes to different environments. Sometimes we need more time for testing in our beta environment, so we can't merge new features to master; the same goes for production - once deployed, we can't deploy hotfixes if we are already testing new features on beta. If we expect bugs in beta or production due to larger changes, development has to keep feature branches around for a longer period, so merging them later becomes more and more painful.
First we thought about long-running branches, like master for development, plus production and beta. That way we could merge whatever feature we want into one of the three branches. But we really like working with pull requests (reviews, feedback, and deleting the feature branch after merging); it's really nice to work with, but we can only apply a pull-request branch to one other branch. So here, we could merge to master without deleting the branch, and then have to switch to another tool to merge a feature to beta or production, or create new pull requests for beta and production. It works that way, but it's not as nice a workflow as merging to a single master branch.
We also thought about git-flow (Vincent Driessen's branching model), which looks nice, but it seems better suited to software with traditional release cycles and versioning, not 100% to web applications, which have no real versions but deploy everything that is ready after a sprint. It solves the hotfix problem and has an extra develop branch, but it requires release versions. So we could create a release branch, get problems sorted out, release it to production, and delete the release branch. We could open a pull request for merging the release branch into master, but problems start if we want to release it to beta (we use Capistrano for deployment), because the branch name changes every sprint. And what if we want to test features in our beta environment? We could use the release branch for beta, but then a release to production has to wait until all features in the release/beta branch are ready. If testing of large features takes long, we can't push small updates to production.
The next problem is versioning. We use Jira, and Jira likes to use versions for releases. We could use versions like 1, 2, 3..., but should one sprint be one version? It doesn't feel right, because sprint planning is not the same as release planning - or should it be the same when developing a web application? Sometimes we develop features in a sprint which take longer and are released later, after being completed in the next 1-2 sprints. With git-flow, these changes could not be merged to develop until they are ready, because every release is branched from the develop branch. So some feature branches are not merged for a long time, and merging them becomes more and more difficult. It's the opposite of continuous integration.
And last but not least, the deployment and QA process doesn't fit Scrum very well. We don't have our own web operations team; when a sprint is done, the product owner has to review the stories, we have to test them, deploy them to beta/production, and fix any problems in them immediately, which interrupts our next sprint. We are also not sure when the right time is to merge the feature branches. Before the review meeting? That way we could end up with features the product owner rejects sitting in the branch which should be released soon; but if we don't merge them, we have to demonstrate each feature in its own branch and then merge every accepted feature branch, so integration and testing can only start after the sprint review meeting. Integration and testing can take days and need development resources, so when should the next sprint start? After the release to production? Then we could not start a sprint every two weeks, and not every developer is needed for integration and QA, so what should they work on? Currently we start the planning meeting for the next sprint immediately after the review meeting of the last sprint, but this way we are never sure how much time we need for the release, which draws resources from the sprint...
So how do you release web applications?
What workflow do you use?
What's the best way if we want to integrate pull requests in the workflow?
How do you integrate QA and deployment into the scrum workflow?
I try to make a backlog and prioritize it:
Prio 1: Refit your Definition of Done.
I think you are deceiving yourselves about how many features you can develop in a sprint.
Scrum defines the result of a sprint as a "useful increment of software".
What you actually produce is a "testable increment of software".
You have to include testing in the sprint. Any story that has not been tested and given a "production ready" stamp is simply not DONE.
Prio 1: Do your retrospectives
Get the team together and talk about this release mess. You need a lightweight way to test stories. You and your team know best what's available and what is hindering you.
If you are already doing those retrospectives, this one is done. ;)
Prio 1: Automated Tests
How can you get any story done without unit tests?
Start writing them. I would propose that you find the most delicate/important component in the system and bring its test coverage (EclEmma/JaCoCo can help measure it) up to at least 60%.
Commit a whole sprint to this. I don't know your DoD, but this is work that has been procrastinated and now has to be caught up on.
If a manager tells you it's impossible, remind him of the good old waterfall days, when 2 weeks of development would not have raised his eyebrow - why should it now...
Prio 2: Integration testing
You have unit tests in place, and you do some additional manual testing in the sprints.
What you want is to include the integration test (merging with master) in the sprint.
This has to happen, because you have to(!) solve bugs in the sprint they were introduced in. So an early integration has to happen.
To get there, it is important that during a sprint you work by priority: the highest-prio task/story has to be solved by the whole team first. The goal is to have it coded and unit tested, then hand it over to your tester for smoke testing. Bugs found inherit the prio of the story. When it works well enough, this part is deployed to the integration environment and tested there.
(Of course, most of the time not all team members can work effectively on one story, so another story is processed in parallel. But whenever something blocks your prio-1 story, the whole team has to help there.)
Prio 2: Reduce new features, increase quality
To include more quality, you have to lower expectations regarding the amount of new features. "Work harder" is not what Scrum tells you. It tells you to deliver more quality, because this will speed up your development in the mid term.
Regarding QA and deployment, I suggest the following:
Commit your code early to the main branch, and commit often. I am totally unfamiliar with git, so I can't comment on how you're using it, but I've always worked off the main trunk (using Subversion), with all the developers committing new code to the trunk. When we create a release, we branch the trunk to create a release branch.
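For readers on git, a rough translation of that trunk-plus-release-branch layout might look like this (branch names are illustrative):

    # everyone integrates into master (the trunk) early and often
    git checkout master
    git pull --rebase
    git merge --no-ff feature/my-short-lived-branch

    # at release time, cut a release branch; only stabilization fixes land there
    git checkout -b release/sprint-12 master

    # port each fix made on the release branch back to master
    git checkout master
    git cherry-pick <fix-commit-sha>

Because master keeps moving, the release branch stays small and short-lived.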
If you haven't done so already then set up a continuous integration system (we use TeamCity, but it'll depend to some degree on your stack). Get what unit tests you do have running at each commit.
Create a definition of 'done' and use it. This will stop you from being in the situation of testing features implemented during the previous sprint. Make sure your definition of done includes the passing of acceptance tests that have been co-written by your Product Owner; in my opinion, the PO should not be rejecting features during the Sprint Review.

What is the recommended way to build functionality similar to Stackoverflow's "Inbox"?

I have an asp.net-mvc website where people manage a list of projects. Based on some algorithm, I can tell if a project is out of date. When a user logs in, I want to show the number of stale projects (similar to when I see a number of updates in the inbox).
The algorithm to calculate stale projects is kind of slow, so if every time a user logs in I have to:
Run a query for all projects where they are the owner
Run the IsStale() algorithm
Display the count where IsStale = true
My guess is that this will be really slow. Also, on every project write, I would have to recalculate the above to see if the count changed.
Another idea I had was to create a table and run a job every few minutes to calculate stale projects and store the latest count in this metrics table, then just query that when users log in. The issue there is that I still have to keep that table in sync, and if it only recalculates once a minute, then when people update projects, the value won't change until up to a minute later.
Any idea for a fast, scalable way to support this inbox concept to alert users of the number of items to review?
The first step is always proper requirements analysis. Let's assume I'm a project manager. I log in to the system and it displays my only project as on time. A developer comes to my office and tells me there is a delay in his activity. I select the developer's activity and change its duration. The system still displays my project as on time, so I happily leave work.
How do you think I would feel if I receive a phone call at 3:00 AM from the client asking me for an explanation of why the project is no longer on time? Obviously, quite surprised, because the system didn't warn me in any way. Why did that happen? Because I had to wait 30 seconds (why not only 1 second?) for the next run of a scheduled job to update the project status.
That just can't be a solution. A warning must be sent immediately to the user, even if it takes 30 seconds to run the IsStale() process. Show the user a loading... image or anything else, but make sure the user has accurate data.
Now, regarding the implementation, nothing can be done to get around the previous issue: you will have to run that process when something that affects a due date changes. However, what you can do is avoid running the process unnecessarily. For example, you mentioned that you could run it whenever the user logs in. What if 2 or more users log in, see the same project, and don't change anything? It would be unnecessary to run the process twice.
What's more, if you make sure the process runs when the user updates the project, you won't need to run it at any other time. In conclusion, this scheme has the following advantages and disadvantages compared to the "polling" solution:
Advantages
No scheduled job
No unneeded process runs (this is arguable because you could set a dirty flag on the project and only run it if it is true)
No unneeded queries of the dirty value
The user will always be informed of the current and real state of the project (which is by far, the most important item to address in any solution provided)
Disadvantages
If a user updates a project and then updates it again within a matter of seconds, the process will run twice (in the polling scheme, the process might not run even once in that period, depending on the frequency with which it has been scheduled)
The user who updates the project will have to wait for the process to finish
Moving on to how you would implement the notification system in a way similar to StackOverflow's - that's quite a different question. I guess you have a many-to-many relationship between users and projects. The simplest solution would be adding a single attribute to the relationship between those entities (the middle table):
Cardinalities: A user has many projects. A project has many users
That way, when you run the process, you should update each user's Has_pending_notifications with the new result. For example, if a user updates a project and it is no longer on time, then you should set every affected user's Has_pending_notifications field to true so that they're aware of the situation. Similarly, set it to false when the project is on time (I understand you just want to make sure the notifications are displayed when the project is no longer on time).
Taking StackOverflow's example, when a user reads a notification you should set the flag to false. Make sure you don't use timestamps to guess if a user has read a notification: logging in doesn't mean reading notifications.
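A minimal sketch of that middle table in SQL (all table and column names are illustrative):

    -- join table between users and projects, carrying the notification flag
    CREATE TABLE user_project (
        user_id    INT NOT NULL REFERENCES users(id),
        project_id INT NOT NULL REFERENCES projects(id),
        has_pending_notifications BOOLEAN NOT NULL DEFAULT FALSE,
        PRIMARY KEY (user_id, project_id)
    );

    -- when the process finds a project is no longer on time, flag everyone on it
    UPDATE user_project SET has_pending_notifications = TRUE WHERE project_id = 42;

    -- when a user reads the notification, clear only that user's flag
    UPDATE user_project SET has_pending_notifications = FALSE
    WHERE user_id = 7 AND project_id = 42;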
Finally, if the notification itself is complex enough, you can move it away from the relationship between users and projects and go for something like this:
Cardinalities: A user has many projects. A project has many users. A user has many notifications. A notification has one user. A project has many notifications. A notification has one project.
I hope something I've said has made sense, or given you some other, better idea :)
You can do the following:
To each user record, add a datetime field saying the last time the slow computation was done. Call it LastDate.
To each project, add a boolean to say whether it has to be listed. Call it Selected.
When you run the slow procedure, you update the Selected fields.
Now, when the user logs in, if LastDate is close enough to now, you use the results of the last slow computation and just take all projects with Selected = true. Otherwise, you run the slow computation again.
The above procedure is optimal because it re-runs the slow computation ONLY IF ACTUALLY NEEDED, while running a procedure at fixed intervals of time risks wasting work, because the user may never use the result of a computation.
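In SQL terms, the login check might look like this (column names and the freshness window are illustrative):

    -- 1) check how fresh the cached result is
    SELECT last_date FROM users WHERE id = 7;

    -- 2a) if fresh enough (say, within the last 10 minutes),
    --     just read the cached flags
    SELECT COUNT(*) FROM projects
    WHERE owner_id = 7 AND selected = TRUE;

    -- 2b) otherwise re-run the slow computation, update projects.selected
    --     and users.last_date, then run the query from 2a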
Make a field "stale".
Run a SQL statement that sets stale=1 on all records where stale=0 AND (the algorithm returns true).
Then run a SQL statement that selects all records where stale=1.
The reason this will run fast is that SQL engines, like PHP, typically short-circuit an AND: the second half isn't evaluated if the first half is already false. So the update makes a fast pass over the whole list, trying to mark records stale only if they are NOT already stale. If a record is already stale, the algorithm won't be executed, saving you time. If it's not, the algorithm will run to see if it has become stale, and if so, stale will be set to 1.
The second query then just returns all the stale records where stale=1.
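Those two statements might look like this (is_stale() is a hypothetical database function wrapping the slow algorithm - an assumption, since the original algorithm lives in application code):

    -- flag newly stale projects; the slow check is skipped for already-stale rows
    UPDATE projects
    SET stale = 1
    WHERE stale = 0
      AND is_stale(id);

    -- the cheap read used at login
    SELECT * FROM projects WHERE stale = 1;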
You can do this:
In the database, update a timestamp every time a project is accessed by the user.
When the user logs in, pull all their projects. Check the timestamp and compare it with today's date; if it's older than n days, add it to the stale list. I don't believe that comparing dates will result in any slow logic.
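As a query, that comparison could be as simple as this (the last_accessed column and the 30-day cutoff are illustrative):

    SELECT *
    FROM projects
    WHERE owner_id = 7
      AND last_accessed < NOW() - INTERVAL '30 days';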
I think the fundamental questions need to be resolved before you think about databases and code. The primary one is: "Why is IsStale() slow?"
From comments elsewhere, it is clear that the premise that this computation is slow is non-negotiable. Is the computation out of your hands? Are the results resistant to caching? What level of change triggers the re-computation?
Having written scheduling systems in the past, there are two types of changes: those that can happen within the slack and those that cause cascading schedule changes. Likewise, there are two types of rebuilds: total and local. Total rebuilds are obvious; local rebuilds try to minimize "damage" to other scheduled resources.
Here is the crux of the matter: if you have total rebuild on every update, you could be looking at 30 minute lags from the time of the change to the time that the schedule is stable. (I'm basing this on my experience with an ERP system's rebuild time with a very complex workload).
If the reality of your system is that such tasks take 30 minutes, having a design goal of instant gratification for your users is contrary to the ground truth of the matter. However, you may be able to detect schedule inconsistency far faster than the rebuild. In that case you could show the user "schedule has been overrun, recomputing new end times" or something similar... but I suspect that if you have a lot of schedule changes being entered by different users at the same time the system would degrade into one continuous display of that notice. However, you at least gain the advantage that you could batch changes happening over a period of time for the next rebuild.
It is for this reason that most of the scheduling problems I have seen don't actually do real time re-computations. In the context of the ERP situation there is a schedule master who is responsible for the scheduling of the shop floor and any changes get funneled through them. The "master" schedule was regenerated prior to each shift (shifts were 12 hours, so twice a day) and during the shift delays were worked in via "local" modifications that did not shuffle the master schedule until the next 12 hour block.
In a much simpler situation (software design) the schedule was updated once a day in response to the day's progress reporting. Bad news was delivered during the next morning's scrum, along with the updated schedule.
Making a long story short, I'm thinking that perhaps this is an "unask the question" moment, where the assumption needs to be challenged. If the re-computation is large enough that continuous updates are impractical, then aligning expectations with reality is in order. Either the algorithm needs work (optimizing for local changes), the hardware farm needs expansion or the timing of expectations of "truth" needs to be recalibrated.
A more refined answer would frankly require more details than "just assume an expensive process" because the proper points of attack on that process are impossible to know.

Relying on db transaction rollback in sunshine scenario

In a financial system I am currently maintaining, we are relying on the rollback mechanism of our database to simulate the results of running some large batch jobs - rolling back the transaction or committing it at the end, depending on whether we were doing a test run.
I really cannot decide what my opinion is. In a way I think this is great, because then there is no difference between the simulation and a live run - on the other hand it just feels kind of icky, like e.g. depending on code throwing exceptions to carry out your business logic.
What is your opinion on relying on database transactions to carry out your business logic?
EXAMPLE
Consider an administration system holding 1000 mortgage deeds in a database. Now the user wants to run a batch job that creates the next term invoice on each deed, using a couple of advanced search criteria that decide which deeds are to be invoiced.
Before she actually does this, she does a test run (implemented by doing the actual run but ending in a transaction rollback), creating a report on which deeds will be invoiced. If it looks satisfactory, she can choose to do the actual run, which will end in a transaction commit.
Each invoice will be stamped with a batch number, allowing us to revert the changes later if needed, so it's not "dangerous" to do the batch run. The users just feel that it's better UX to be able to simulate the results first.
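A sketch of the pattern in SQL terms (the procedure name, columns, and batch number are illustrative, not the system's actual ones):

    BEGIN TRANSACTION;

    -- run the batch exactly as a live run would
    CALL create_term_invoices(1234);   -- hypothetical invoicing procedure, batch 1234

    -- report on what would be invoiced, reading the uncommitted data
    SELECT deed_id, amount FROM invoices WHERE batch_number = 1234;

    -- test run ends here; the real run issues COMMIT instead
    ROLLBACK;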
CLARIFICATION
It is NOT about testing. We do have test and staging environments for that. It's about a regular user of our system wanting to simulate the results of a large operation that may seem "uncontrollable" or "irreversible" even though it isn't.
CONCLUSION
It doesn't seem like anyone has any really good arguments against our solution. As always, context means everything, so in the context of complex functional requirements exceeding performance requirements, using DB rollback to implement batch-job simulations seems a viable solution.
As there is no real answer to this question, I am not choosing an answer - instead I upvoted those who actually put forth an argument.
I think it's an acceptable approach, as long as it doesn't interfere with regular processing.
The alternative would be to build a query that displays the consequences for review, but we all have had the experience of taking such an approach and not quite getting it right; or finding that the context changed between query and execution.
At the scale of 1000 rows, it's unlikely the system load is burdensome.
Before she actually does this, she does a test run (implemented by doing the actual run but ending in a transaction rollback), creating a report on which deeds will be invoiced. If it looks satisfactory, she can choose to do the actual run, which will end in a transaction commit.
That's wrong, prone to failure, and must be hell on your database logs. Unless you wrap your simulation and the actual run in a single transaction (which, judging by the time needed to inspect 1000 deeds, would lead to a lot of blocked users), there's no guaranteed consistency between the test run and the real run. If somebody changed data, added rows, etc., then you could end up with a different result - defeating the entire purpose of the test run.
A better pattern to do this would be for the test run to tag the records, and the real run to pick up the tagged records and process them. Or, if you have a thick client app, you can pull down the records to the client, show the report, and - if approved - push them back up.
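A sketch of that tagging pattern (table, column, and batch number are illustrative):

    -- test run: tag the candidate deeds, touching nothing else
    UPDATE deeds
    SET pending_batch = 1234
    WHERE /* the user's search criteria go here */ TRUE;

    -- the review report comes straight from the tags
    SELECT deed_id FROM deeds WHERE pending_batch = 1234;

    -- real run: invoice exactly the tagged records, then clear the tag
    INSERT INTO invoices (deed_id, batch_number)
    SELECT deed_id, 1234 FROM deeds WHERE pending_batch = 1234;
    UPDATE deeds SET pending_batch = NULL WHERE pending_batch = 1234;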
We can see what the user needs to do, and it's quite a reasonable thing. I mean, how often do we get a regexp right the first time? Refining a query until it does exactly what you want is not unusual.
The business consequences of not catching errors may be quite high, so doing a trial run makes sense.
Given a blank sheet of paper, I'm sure we could devise a clean implementation expressed in formal behaviours of the system rather than this somewhat back-door approach.
How much effort would I put into fixing that now? It depends on whether the current approach is actually hurting. We can imagine that in a heavily used system it could lead to contention in the database.
What I wrote about the PRO FORMA environment in that bank I worked in was also entirely a user thing.
I'm not sure exactly what you're asking here. Taking you literally:
What is your opinion on relying on database transactions to carry out your business logic?
Well, that's why we have transactions. We do rely on them. We hit an error, abort a transaction, and rely on the work done in that transaction's scope being rolled back. So exploiting the transactional behaviours of our systems is a jolly good thing, and we'd need to hand-roll the same thing ourselves if we didn't.
But I think your question is about testing in a live system and relying on not committing in order to do no damage. In an ideal world we have a live system and a test system, and we don't mess with live systems. Such ideals are rarely seen. Far more common is "patch the live system. Testing? What do you mean, testing?" So in fact you're ahead of the game compared with some.
An alternative is to have dummy data in the live system, so that some actions can actually run all the way through. Again, error prone.
A surprisingly high proportion of system outages are due to finger trouble; it's the humans who foul up.
It works, as you say. I'd worry about the concurrency of the system, since the transaction will hold locks, possibly many locks. It means that your tests will hold up any live action on the system (and any live operations will hold up your tests). It is generally better to use a test system for testing. I don't like it much, but if the risks of running a test and forgetting to roll it back are not a problem, and neither is the interference, then it is an effective way of attaining a 'what if' type calculation. But it's still not very clean.
When I was working in a company that was part of the "financial system", there was a project team that had decided to use the production environment to test their conversion procedure (and just roll back instead of committing at the end).
They almost got shot for it. In hindsight, it's a pity they weren't.
Your test environments, which you claim you have, are for the IT people's use. Set up a similar "PRO-FORMA" environment which you can guarantee your users is for THEIR use exclusively.
When I worked in that bank, creating such a PRO-FORMA environment was standard procedure at every year-end closure.
"But it's not about testing, it's about a user wanting to simulate the results of a large job."
Paraphrased: "it's not about testing, it's about simulation".
Sometimes I wish Fabian Pascal was still in business.
(Oh, and in case anyone doesn't understand : both are about "not perpetuating the results".)
