Disclaimer: Apologies if some of these details are incorrect; I am working from my notes and my recollection of what happened at the time. I started to blog about this and made a couple of attempts to confirm the details, but couldn’t reproduce the issue. If you’ve worked with CloudFormation you will understand how slow the attempts to reproduce were.
I had an issue that left a CloudFormation stack stuck. It failed to update and then failed to roll back the failed update. Attempting the rollback a second time produced the same result. This is how the events appeared in the AWS console.
I am not sure exactly which of our changes caused it to get stuck. I think I was trying to update the AMI, and the new AMI was not set up to work correctly in OpsWorks.
I couldn’t see a way to resolve this issue in the AWS console, so I did some searching and found a way to get out of the rollback loop.
The AWS command line tools often have more functionality than the web console. I had a look at the commands available for CloudFormation and found continue-update-rollback, which has a “--resources-to-skip” parameter. Using that command with the logical ID of the stuck instance, ProcessorOpsworksInstance1, got the rollback to complete. However, I wasn’t quite done.
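As a rough sketch of what that call looks like: the stack name my-stack below is a hypothetical placeholder, and the DRY_RUN wrapper is my own addition so the command is printed rather than executed by default. Only the continue-update-rollback command, its two options and the logical ID come from the description above.

```shell
# Sketch only: "my-stack" is a hypothetical stack name, substitute your own.
# By default the command is printed; set DRY_RUN=0 to actually run it.
continue_rollback() {
  stack_name="$1"      # name of the stuck stack
  skip_resource="$2"   # logical ID of the resource that cannot be rolled back
  if [ "${DRY_RUN:-1}" = "0" ]; then
    aws cloudformation continue-update-rollback \
      --stack-name "$stack_name" \
      --resources-to-skip "$skip_resource"
  else
    echo "aws cloudformation continue-update-rollback --stack-name $stack_name --resources-to-skip $skip_resource"
  fi
}

continue_rollback "my-stack" "ProcessorOpsworksInstance1"
```

The command only applies to a stack in the UPDATE_ROLLBACK_FAILED state, so it is worth checking the stack status before running it for real.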
The following warning is provided with the “--resources-to-skip” parameter:
Specify this property to skip rolling back resources that AWS CloudFormation can’t successfully roll back. We recommend that you troubleshoot resources before skipping them. AWS CloudFormation sets the status of the specified resources to UPDATE_COMPLETE and continues to roll back the stack. After the rollback is complete, the state of the skipped resources will be inconsistent with the state of the resources in the stack template. Before performing another stack update, you must update the stack or resources to be consistent with each other. If you don’t, subsequent stack updates might fail, and the stack will become unrecoverable.
What you need to do here depends on what changes the failed update was making and what state your stack was left in.
For me the AMI failed to update and a new instance was not started. At the end of the rollback CloudFormation would be expecting a ProcessorOpsworksInstance1 in OpsWorks running with the original AMI, but I had an instance that would not start. I decided the safest thing to do would be to recreate the instance so that I would be working from a clean slate in the future. To do this I deleted the OpsWorks instance by removing it from the template and updating the stack. I then added it back into the template and updated the stack again. This got me back to where I started.
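If you wanted to do the same two-step recreate from the command line, it might look something like the sketch below. This is an assumption about how it could be scripted, not a record of what I ran. The stack name and both template file names are hypothetical, and depending on your template you may also need options such as --parameters or --capabilities on update-stack.

```shell
# Sketch only: stack name and template file names are hypothetical.
# Commands are printed by default; set DRY_RUN=0 to actually run them.
run() {
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  else
    echo "$*"
  fi
}

recreate_resource() {
  stack_name="$1"
  template_without="$2"  # template with the instance removed
  template_full="$3"     # original template, instance included

  # Step 1: update with the instance removed so CloudFormation deletes it.
  run aws cloudformation update-stack --stack-name "$stack_name" \
    --template-body "file://$template_without" || return 1
  run aws cloudformation wait stack-update-complete --stack-name "$stack_name" || return 1

  # Step 2: update with the original template so the instance is created again.
  run aws cloudformation update-stack --stack-name "$stack_name" \
    --template-body "file://$template_full" || return 1
  run aws cloudformation wait stack-update-complete --stack-name "$stack_name"
}

recreate_resource "my-stack" "template-without-instance.json" "template.json"
```

Waiting for each update to finish before starting the next matters here, because CloudFormation will reject a second update while the first is still in progress.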