Tuesday, March 18, 2014

How we applied security updates to 16,304 running daemons.



Last week was exceptionally busy for our operations team. If you hadn't noticed, that's a sign of a job well done.


At VM Farms, we stand apart from other providers by offering our customers a truly fully managed service. This entails constant proactive management and maintenance in addition to our reactive role as a support group.

When your site has endured zero downtime, your application has remained performant, and your IT backend has stayed secure, it is because our team of system engineers has been hard at work. Monitoring, patching, and managing resources and technologies: we work continuously to ensure that our customers' applications and environments are always performing at their very best.

The last couple of weeks proved to be a truly exhausting exercise in proactive maintenance and management. So much so that we decided to write a blog post to give our customers additional insight into some of the operations activities we undertake on their behalf.

It all started on March 4th, when our monitoring systems notified us of a new CVE entry for the libgnutls library, a library that a large number of our customers rely on heavily.

For the uninformed, CVE stands for Common Vulnerabilities and Exposures, a public database of vulnerabilities and security bugs in common software such as Linux, Windows, OS X, and other operating systems. The entire industry pays close attention to these announcements for actionable security concerns.

Each new vulnerability or incident is assigned an identifier, and its entry lists the affected package names and versions. Upon receiving these alerts, our system automatically compares the reports against each of our customers' profiles of installed software. This allows us to quickly identify which customers are affected.
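
To give a rough idea of what that matching step looks like, here is a minimal sketch in Python. The advisory format, the host inventory, and every name in it are hypothetical stand-ins for illustration, not our actual tooling.

    # Minimal sketch: match a security advisory against per-host package
    # inventories. All data shapes, versions, and names are hypothetical.

    advisory = {
        "id": "CESA-2014:0246",
        "package": "gnutls",
        "fixed_version": (2, 8, 5),  # versions below this are vulnerable
    }

    # Per-host inventory of installed packages (name -> version tuple).
    hosts = {
        "web-01": {"gnutls": (2, 8, 4), "openssl": (1, 0, 1)},
        "db-01":  {"gnutls": (2, 8, 5)},
        "app-01": {"openssl": (1, 0, 1)},  # gnutls not installed here
    }

    def affected_hosts(advisory, hosts):
        """Yield (host, version) for hosts running a vulnerable package."""
        pkg, fixed = advisory["package"], advisory["fixed_version"]
        for host, installed in hosts.items():
            version = installed.get(pkg)
            if version is not None and version < fixed:
                yield host, version

    for host, version in affected_hosts(advisory, hosts):
        print(host, "needs the", advisory["id"], "update")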

For example, let's investigate CESA-2014:0246, which says:

lib/x509/verify.c in GnuTLS before 3.1.22 and 3.2.x before 3.2.12 does not properly handle unspecified errors when verifying X.509 certificates from SSL servers, which allows man-in-the-middle attackers to spoof servers via a crafted certificate.

Most Linux professionals will recognize GnuTLS as a widely used software library, and this vulnerability immediately raised eyebrows around the world. Right away, we were able to determine the scope of the problem from the actual software patch.
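
In broad strokes, the flaw was an error-handling bug: under certain failures, the X.509 verification code could report success instead of rejecting the certificate. The sketch below illustrates that general class of bug in Python; it is deliberately simplified, and none of these names come from the actual GnuTLS source.

    # Illustration of the bug class: a C-style negative error code leaks
    # into a spot where any non-zero value reads as "verified".
    # Hypothetical names; not the actual GnuTLS code.

    E_PARSE_ERROR = -1  # C-style negative error code

    def check_issuer(cert):
        """Meant to return 1 (ok) or 0 (not ok) -- but an error path
        leaks a negative error code instead."""
        if cert.get("malformed"):
            return E_PARSE_ERROR  # BUG: caller expects 0 or 1
        return 1 if cert.get("issuer_ok") else 0

    def verify(cert):
        # BUG: -1 is truthy, so a parse error on a crafted certificate
        # is treated as a successful verification.
        if check_issuer(cert):
            return "accepted"
        return "rejected"

    print(verify({"issuer_ok": True}))   # accepted (correct)
    print(verify({"issuer_ok": False}))  # rejected (correct)
    print(verify({"malformed": True}))   # accepted (the vulnerability)

The upstream patch tightened those error paths so that a failure could no longer masquerade as success.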



The trouble with upgrading important libraries is that they often require service restarts. Uptime is our utmost priority, so we notified our customers that same week. This gave them reasonable time to respond in case they had operations-scheduling conflicts that would prevent us from applying these patches.

The other problem with upgrading a widely used library is that every service linking against it must be restarted as well: in the right order, and only once.

The lsof command proved invaluable as we quickly made an inventory of exactly which daemons would need to be restarted.  We fixed non-production environments first, and left the most complex and exceptional configurations on production VMs for last.
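
For the curious, the core of that inventory step can be reproduced with surprisingly little code. The sketch below takes the /proc route rather than parsing lsof output: on Linux, a process still holding the old, pre-upgrade copy of a library shows it as a deleted mapping in /proc/<pid>/maps. This is a simplified stand-in for what we actually ran, not the tool itself.

    # Sketch: find daemons still mapping a deleted copy of libgnutls,
    # i.e. processes that must be restarted to pick up the patched
    # library. Reads /proc/<pid>/maps on Linux; a rough equivalent of
    # `lsof | grep 'libgnutls.*deleted'`.

    import os

    LIBRARY = "libgnutls"

    def stale_processes(library=LIBRARY):
        """Yield (pid, command) for processes mapping a deleted copy."""
        for pid in filter(str.isdigit, os.listdir("/proc")):
            try:
                with open(f"/proc/{pid}/maps") as maps:
                    stale = any(
                        library in line and "(deleted)" in line
                        for line in maps
                    )
                if stale:
                    with open(f"/proc/{pid}/comm") as comm:
                        yield int(pid), comm.read().strip()
            except OSError:
                continue  # process exited, or we lack permission

    for pid, command in stale_processes():
        print(f"{command} (pid {pid}) still maps the old {LIBRARY}")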

Because we wanted to be sure we considered every service and dependency that needed to be restarted, we used a suite of custom automation tools that augmented our SSH sessions. This helped Kris Kostecky and me perform the upgrades and verify each of them, one by one and as a team.
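
We won't reproduce those tools here, but the basic shape of the workflow is easy to sketch: fan a post-upgrade check out over SSH, one host at a time, and let a human confirm each result before moving on. The host names and the remote check below are placeholders, not our real fleet.

    # Minimal sketch of fanning a post-upgrade check out over SSH.
    # Host names and the remote command are placeholders.

    import subprocess

    HOSTS = ["web-01.example.com", "db-01.example.com"]
    CHECK = "rpm -q gnutls"  # report the installed gnutls version

    for host in HOSTS:
        result = subprocess.run(
            ["ssh", host, CHECK],
            capture_output=True, text=True,
        )
        status = "ok" if result.returncode == 0 else "FAILED"
        print(f"{host}: {status} {result.stdout.strip()}")
        # Pause so an operator can eyeball each host before continuing.
        input("Press Enter for the next host...")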

When dealing with so many different customers using diverse stacks, you are bound to encounter edge cases while maintaining certain infrastructures. Looking back, we are glad we took the time to care for each VM during this upgrade. Each time we encountered a gotcha, we could stop the process and take the time to address it properly.

We are also glad that we allocated a team of two to work through this entire process. This allowed us to complete the upgrades twice as fast, and provided additional support on the few occasions when a daemon did not restart cleanly or a package download ran too slowly.

In total we restarted 16,304 running daemons for our customers in the span of one week, scrutinizing each and every one.

Security is a never-ending, burdensome, and critical concern for all of our operations staff. At VM Farms we have the tools and the team needed to remain constantly vigilant. If you'd like to know more about how we can take a load off your plate as a Systems Administrator or Developer, check out our website, or give us a call at 1-866-278-0021.

Follow @vmfarms and @ian_vmfarms on Twitter.
