As we pass the middle of the Essex development
cycle, questions about
the solidity of this release start to pop up. After all, the previous
releases were far from stellar, and with more people betting their
business on OpenStack we can't really afford another half-baked release.
Common thinking (mostly coming from years of traditional software
development experience) is that we shouldn't release until it's ready,
or at least good enough, which leads to early calls for pushing back the
release dates. This
assumes the issue is incidental: that we underestimated the time it
would take our finite team of internal developers working on bugs to
reach a sufficient level of quality.
OpenStack, being an open source project produced by a large community,
works differently. We have a near-infinite supply of developers. The
issue is, unfortunately, more structural than incidental. The lack of
solidity for a release comes from:
- Lack of focus on generic bugfixes. Developers should work on
fixing bugs. Not just the ones they filed or the ones blocking them
in their feature-adding frenzy. Fixing identified, targeted, known
issues. The bugtracker is full of them, but they don't get
attention.
- Not enough automated testing to efficiently catch regressions.
Even if everyone was working on bug fixes, if half your fixes end up
creating a set of regressions, then there is no end to it.
- Lack of bug triaging resources. Only a few people work on
confirming, triaging and prioritizing the flow of incoming bugs. So
the bugs that need the most attention are lost in the noise.
For the Diablo cycle, we had fewer than a handful of people focused on
generic bugfixing. The rest of our 150+ authors were busy working on
something else. Pushing back the release for a week, a month or a year
won't improve OpenStack's solidity if the focus doesn't switch. And if
our focus switches, then there will be no need for a costly release delay.
Acting now to make Essex a success
During the Essex cycle, our Project Technical Leads have done their
share of the work by using a very early milestone for their feature
freeze. Keystone, Glance and Nova will freeze at Essex-3, giving us 10
weeks for bugfixing work (compared to the 4 weeks we had for Diablo).
Now we need to take advantage of that long period and really switch our
mindset away from feature development and towards generic bug fixing.
Next week we'll hit feature freeze, so now is the time to switch.
If we could:
- have some more developers working on increasing our integration and
unit test coverage
- have the rest of the developers really working on generic bug fixing
- have very active core reviewers that get more anal-retentive as we
get closer to release, to avoid introducing regressions that would
not be caught by our automated tests
...then I bet that it will lead to a stronger release than any delay
of the release could give you. Note that we'll also have a bug
squashing day on
February 2 that will hopefully help us get on top of old, deprecated
and easy fixes, and give us a clear set of targets for the rest of the
cycle.
The quality of future OpenStack releases hinges on our ability to
switch our focus. That's what we'll be judged on. The world awaits, and
the time is now.
The Free and Open source Software Developers' European Meeting, or
FOSDEM, is an institution that happens every
year in Brussels. A busy, free and open event that gets a lot of
developers together for two days of presentations and cross-pollination.
There are typically the FOSDEM main tracks (a set of presentations
chosen by the FOSDEM organization) and a set of devrooms, which are
topic-oriented or project-oriented and can organize their own schedule
freely.
This year, FOSDEM will host an unusual devroom, the Virtualization and
Cloud devroom. It will happen in the Chavanne room, a 550-seat
auditorium that was traditionally used for main tracks. And it will last
for two whole days, while other devrooms typically last for a day or a
half-day.
The Virtualization and Cloud devroom is the result of the merging of
three separate devroom requests: Virtualization, Xen and OpenStack
devrooms. It gives us a larger space and a lot of potential for
cross-pollination across projects! We had a lot of talks proposed, and
here is an overview of what you'll be able to see there.
Saturday, February 4
Saturday will be the "cloud" day. We will start with a set of talks
about OpenStack, past, present and future. I will do an
introduction and
retrospective of
what happened last year in the project, Soren Hansen will guide new
developers to
Nova, and Debo
Dutta will look into future work on application scheduling and
Donabe. Next
we'll have a session on various cloud-related technologies:
libguestfs,
pacemaker-cloud
and OpenNebula. The
afternoon will start with a nice session on cloud interoperability,
including presentations on the
Aeolus,
CompatibleOne and
Deltacloud
efforts. We'll
continue with a session on cloud deployment, with a strong OpenStack
focus: Ryan Lane will talk about how Wikimedia maintains infrastructure
like an open source
project, Mike
McClurg will look into
Ubuntu+XCP+OpenStack
deployments, and Dave Walker will introduce the Orchestra
project. The
day will end with a town hall
meeting for all
OpenStack developers, including a panel of distribution packagers: I
will blog more about that one in the coming weeks.
Sunday, February 5
Sunday is more "virtualization" day ! The day will start early with two
presentations by Hans de Goede about
Spice and USB
redirection over the
network.
Then we'll have a session on virtualization management, with Guido
Trotter giving more Ganeti
news and
three
talks
about oVirt. In the
afternoon we'll have a more technical session around virtualization in
development: Antti Kantee will introduce ultralightweight kernel
service virtualization with rump
kernels, Renzo
Davoli will lead a workshop on tracing and
virtualization,
and Dan Berrange will show how to build application sandboxes on top of
LXC and KVM with
libvirt.
The day will end with another developers meeting, this time the Xen
developers will meet around Ian Campbell and his Xen deployment
troubleshooting workshop.
All in all, that's two days packed with very interesting presentations,
in a devroom large enough to accommodate a good crowd, so we hope to see
you there!
2011 is almost finished, and what a year it has been. We started it with
two core projects and one release behind us. During 2011, we got three
releases out of the door, grew from 60 code contributors to about 200,
added three new core projects, and met for two design summits.
The Essex-2 milestone was released last week. Here is our now-regular
overview of the work that made it to OpenStack core projects since the
previous milestone.
Nova was the busiest project. Apart from my work on a new secure root
wrapper, we
added a pair of OpenStack API extensions to support the
creation of snapshots and backups of
volumes,
the metadata
service
can now run separately from the API node, network limits can now be set
using a per-network base and a per-flavor
multiplier,
and a small usability feature lets you retrieve the last
error that
occurred using nova-manage. But Essex is not about new features, it's
more about consistency and stability. On the consistency front, the HA
network mode was extended to support
XenServer,
KVM compute nodes now report
capabilities
to zones, as Xen ones already did, and the Quantum network manager now
supports NAT.
Under the hood, VM state
transitions
have been strengthened, the network data model
has
been
overhauled, internal interfaces now support UUID instance
references,
and unused callbacks have been
removed
from the virt driver.
The other projects were all busy starting larger transitions (Keystone's
RBAC, Horizon's new user experience, and the Glance 2.0 API), leaving
less room for Essex-2 features. Glance still saw the addition of a custom
directory for data
buffering.
Keystone introduced global endpoints
templates
and swauth-like ACL
enforcement.
Horizon added UI support for downloading RC
files,
while migrating under the hood from jquery-ui to
bootstrap,
and adding a versioning
scheme
for environment/dependencies.
The next milestone is in a bit more than a month: January 26th, 2012.
Happy new year and holidays to all!
In the previous two posts of this series, we explored the deficiencies
of the current
model
and the features of an alternative
implementation.
In this last post, we'll discuss the advantages of a Python
implementation and open the discussion on how to secure it properly.
Python implementation
It's quite easy to implement the features that were mentioned in the
previous post in Python. The main advantage of doing so is that the code
can happily live inside Nova code, in particular the filters definition
files can be implemented as Python modules that are loaded if present.
That solves the issue of shipping definitions within Nova and also the
separation of allowed commands based on locally-deployed nodes. The code
is simple and easy to review. The trick is to make sure that no
malicious code can be injected into the elevated-rights process. This is
why I'd like to present a model and open it for comments in the
community.
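A minimal sketch of that conditional loading, assuming filter
definitions are plain Python modules exposing a `filters` list (module
names here are hypothetical, not the final Nova layout):

```python
import importlib

# Hypothetical filter definition modules; only the ones actually
# deployed on this node will be importable.
FILTER_MODULES = [
    "nova.rootwrap.compute",
    "nova.rootwrap.network",
    "nova.rootwrap.volume",
]

def load_filters(module_names=FILTER_MODULES):
    """Collect the 'filters' lists from whichever modules are present,
    silently skipping the ones that are not installed."""
    filters = []
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            continue  # module absent on this node type: ignore it
        filters.extend(getattr(module, "filters", []))
    return filters
```

A node that only ships the compute definitions would thus only ever
allow compute-related commands.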
Proposed security model
The idea would be to have Nova code optionally use "sudo nova-rootwrap"
instead of "sudo" as the root_helper. A generic sudoers file would
allow the nova user to run /usr/bin/nova-rootwrap as root, while
stripping environment variables like PYTHONPATH. To load its filters
definitions, nova-rootwrap would try to import a set of predefined
modules (like nova.rootwrap.compute), but if those aren't present, it
should ignore them. Can this model be abused?
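The sudoers side of this could be as small as the following (paths
illustrative, and the exact syntax should be double-checked against your
distribution's sudo):

```
# Reset the environment (drops PYTHONPATH etc.) for the nova user
Defaults:nova env_reset
# nova may run only the wrapper as root, nothing else
nova ALL = (root) NOPASSWD: /usr/bin/nova-rootwrap
```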
The obvious issue is to make sure sys.path (the set of directories
from which Python imports its modules) is secure, so that nobody can
insert their own modules into the process. I've given some thought to
various checks, but actually there is no way around trusting the default
sys.path you're given when you start Python as root from a cleaned
environment. If that's compromised, you're toast the moment you "import
sys" anyway. So using sudo to only allow /usr/bin/nova-rootwrap and
cleaning the environment should be enough. Or am I missing something?
Insecure mode?
One thing we could do is check that every sys.path entry belongs to
root and refuse to run if it doesn't. That would tell the user that his
setup is insecure (potentially allowing him to bypass the check by
running "sudo nova-rootwrap --insecure" as the root_helper). But that's a
convenience to detect insecure setups, not a security addition (the fact
that it doesn't complain doesn't mean you're safe; it could mean you're
already compromised).
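As a sketch, such a detection could walk sys.path and reject entries
that are missing root ownership or are group/world-writable (a
heuristic only: passing this check does not prove the setup is safe):

```python
import os
import sys

def paths_look_safe(paths=None):
    """Heuristically check that every sys.path entry is root-owned
    and not writable by group or others."""
    if paths is None:
        paths = sys.path
    for path in paths:
        if not path:
            return False  # empty entry means the current directory
        if not os.path.exists(path):
            continue  # nothing to hijack there (yet)
        st = os.stat(path)
        if st.st_uid != 0 or st.st_mode & 0o022:
            return False
    return True
```

nova-rootwrap could then refuse to start when this returns False,
unless --insecure was passed.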
Test mode?
For tests, it's convenient to allow running code from branches. To allow
this (unsafe) mode, you would tweak sudoers to allow running
$BRANCH/bin/nova-rootwrap as root, and prepend ".." to sys.path
in order to allow modules to be loaded from $BRANCH (maybe requiring
--insecure mode for good measure). It sounds harmless, since if you
run from /usr/bin/nova-rootwrap you can assume that /usr is safe...
Or should that idea be abandoned altogether?
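One way to implement that prepending, resolving $BRANCH from the
location of the executable rather than relying on the current working
directory (an assumption on my part, not settled design):

```python
import os
import sys

def enable_test_mode():
    """Unsafe development mode: allow filter modules to be loaded
    from the branch this executable runs from ($BRANCH/bin/..)."""
    branch = os.path.dirname(os.path.dirname(os.path.abspath(sys.argv[0])))
    sys.path.insert(0, branch)
```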
Audit
Nothing beats peer review when it comes to secure design. I call on all
Python module-loading experts and security white-hats out there: would
this work? Are those safe assumptions? How much do you like the insecure
and test modes? Would you suggest something else? If you're one of
those who can't think in words but require code, you can get a glimpse
of the work in progress
here. It
will all be optional (and not used by default), so it can be added to
Nova without much damage, but I'd rather do it right from the beginning
:) Please comment!
In the previous post in this
series
we explored the current privilege escalation model used in OpenStack
Compute (Nova), and discussed its limitations. Now that we are able to
plug an alternative model (thanks to the root_helper option), we'll
discuss in this post what features this one should have. If you think we
need more, please comment!
Command filters
The most significant issue with the current model is that sudoers
filters the executable used, but not the arguments. To fix that, our
alternative model should allow precise argument filtering so that only
very specific commands are allowed. It should use lists of filters: if
one matches, the command is executed.
The basic CommandFilter would just check that the executable name
matches (which is what sudoers does). A more advanced RegexpFilter
would check that the number of arguments is right and that they all
match provided regular expressions.
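A sketch of those two filters, assuming each command arrives as an
argument list and the first matching filter wins (class names are
illustrative, not the final API):

```python
import re

class CommandFilter:
    """Accepts a command if its executable path matches; this is
    roughly what sudoers can express today."""

    def __init__(self, exec_path):
        self.exec_path = exec_path

    def match(self, userargs):
        return bool(userargs) and userargs[0] == self.exec_path


class RegexpFilter(CommandFilter):
    """Additionally requires the right number of arguments, each
    fully matching its regular expression."""

    def __init__(self, exec_path, *arg_patterns):
        super().__init__(exec_path)
        self.arg_patterns = arg_patterns

    def match(self, userargs):
        if not super().match(userargs):
            return False
        args = userargs[1:]
        if len(args) != len(self.arg_patterns):
            return False
        return all(re.fullmatch(p, a) for p, a in zip(self.arg_patterns, args))


def first_match(filters, userargs):
    """Walk the filter list: if one filter matches, the command runs."""
    return next((f for f in filters if f.match(userargs)), None)
```

For example, `RegexpFilter("/bin/kill", r"-9", r"\d+")` would only let
through `kill -9 <pid>`, not arbitrary kill invocations.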
Taking that concept a step further, you should be able to plug any type
of advanced filter. You may want to check that the argument to the
command is an existing directory. Or one that is owned by a specific
user. The framework should allow developers to define their own
CommandFilter subclasses, to be as precise as they want when filtering
the most destructive commands.
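For instance, a hypothetical subclass could accept a command only when
its single argument is an existing directory owned by a given user:

```python
import os
import pwd

class OwnedDirFilter:
    """Hypothetical filter: matches '<exec_path> <directory>' only if
    the directory exists and belongs to the expected user."""

    def __init__(self, exec_path, owner):
        self.exec_path = exec_path
        self.owner = owner

    def match(self, userargs):
        if len(userargs) != 2 or userargs[0] != self.exec_path:
            return False
        path = userargs[1]
        if not os.path.isdir(path):
            return False
        try:
            expected_uid = pwd.getpwnam(self.owner).pw_uid
        except KeyError:
            return False  # unknown user: never match
        return os.stat(path).st_uid == expected_uid
```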
Running as
In some cases, Nova runs, as root, commands that it should just run as
a different user. For example, it runs kill with root rights to
interact with dnsmasq processes (owned by the nobody user). It
doesn't really need to run kill with root rights at all. Filters
should therefore also allow specifying a lower-privileged user that a
specific matching command should run as.
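A sketch of how a filter could carry that information, with a preexec
hook dropping privileges before the command is executed (names are
illustrative):

```python
import os
import pwd

class RunAsFilter:
    """Hypothetical filter matching on the executable path, but also
    recording the lower-privileged user to run the command as."""

    def __init__(self, exec_path, run_as="root"):
        self.exec_path = exec_path
        self.run_as = run_as

    def match(self, userargs):
        return bool(userargs) and userargs[0] == self.exec_path

    def demote_fn(self):
        """Return a preexec_fn for subprocess.Popen that drops to
        self.run_as (only possible when the wrapper starts as root)."""
        record = pwd.getpwnam(self.run_as)

        def demote():
            os.setgid(record.pw_gid)
            os.setuid(record.pw_uid)

        return demote
```

With something like `RunAsFilter("/bin/kill", run_as="nobody")`, Nova
could signal dnsmasq processes without the command ever running as root.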
Shipping filters in Nova code
Filter lists should live within Nova code and be deployed by packaging,
rather than being maintained in the packaging itself. That allows people
adding a new escalated command to add the corresponding filter in the
same commit.
Limiting commands based on deployed nodes
As mentioned in the previous
post,
nova-api nodes don't actually need to run any command as root, but
in the current model their nova user is still allowed to run plenty of
them. The solution for that is to separate the command filters based on
the type of node that is allowed to run them, in different files. Then
deploy the nova-compute filters file only on nova-compute nodes, the
nova-volume filters file only on nova-volume nodes... A pure
nova-api node will end up with no filters being deployed at all,
effectively not being allowed any command as root. So this can be solved
by smart packaging of filter files.
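The resulting split could look like this (package and module names are
purely illustrative):

```
nova-compute package ships nova/rootwrap/compute.py  -> compute filters
nova-volume package  ships nova/rootwrap/volume.py   -> volume filters
nova-network package ships nova/rootwrap/network.py  -> network filters
nova-api package     ships no filter file            -> no root commands
```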
Missing features?
Those are the features that I found useful for our alternative privilege
escalation model. If you see others, please comment here! I'd like to
make sure all the useful features are included. In the next post, we'll
discuss a proposed Python implementation of this framework, and the
challenges around securing it.