Refactor caching of chef-client #300

cheeseplus · 2017-03-21T13:44:26Z

The current implementation suffers from a few issues that make the experience less than delightful, especially for new users.

* Caching is on by default: this changed the default and breaks existing users with unexpected behaviour.
* The cache itself isn't structured, so we have issues like "Cache directory should be structured" #293.
* We have to enumerate every platform this doesn't work for instead of only enumerating the platforms it does work for (there are more of the former); see "Caching should use whitelist of supported platforms" #296.
* Even if we have a list of boxes it works against, we only match on the box name, so this breaks the second anyone chooses a name that doesn't include the OS.
* If the virt tools aren't properly installed on the base box, or the host version of the tooling is newer/older than that of the box Vagrant has, this breaks.
* We never clean/prune the cache, and that can get out of hand pretty quickly.
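
To make the pruning point concrete, here is a minimal sketch of age-based cleanup. The helper name `prune_cache`, the 30-day default, and the idea of keying on modification time are all assumptions for illustration; nothing like this exists in kitchen-vagrant today.

```ruby
require "fileutils"

# Hypothetical helper: remove cached files that have not been modified
# within the last `max_age_days` days. Returns the paths it removed.
def prune_cache(cache_dir, max_age_days: 30, now: Time.now)
  cutoff = now - (max_age_days * 24 * 60 * 60)
  removed = []
  Dir.glob(File.join(cache_dir, "**", "*")).each do |path|
    next unless File.file?(path)          # skip directories
    next if File.mtime(path) >= cutoff    # still fresh, keep it
    FileUtils.rm_f(path)
    removed << path
  end
  removed
end
```

Something like this could run at the start of a converge, so the cache never grows without bound.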

I personally feel like caching by default breaks the semver contract but I do understand why we added caching (downloads cost money) so that ship may have sailed.

I propose that we:

* Fix the cache structuring.
* Only enable caching on known platform names or, if we can, only for bento boxes (we have nominal control over those).
* Investigate whether it's possible to catch the error from Vagrant and retry without shared folders on platforms that otherwise should be supported.
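
The retry idea in the last item could look roughly like the sketch below. `SharedFolderError` and `create_with_cache_fallback` are hypothetical names; whether Vagrant's failure can actually be detected this cleanly is exactly what needs investigating.

```ruby
# Hypothetical error type standing in for whatever Vagrant failure
# signals that the shared (cache) folder could not be mounted.
class SharedFolderError < StandardError; end

# Try once with the cache folder enabled; if mounting fails, retry
# exactly once without it. The block receives `cache: true` or `cache: false`.
def create_with_cache_fallback
  yield(cache: true)
rescue SharedFolderError => e
  warn "Cache folder unavailable (#{e.message}); retrying without caching"
  yield(cache: false)
end
```

A failure on the second attempt is not rescued, so genuine errors still surface to the user.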

charlesjohnson · 2017-03-21T16:24:27Z

What we can't do is increase the number of repeat downloads served of chef-client, because that comes out of Chef's pocket.

What ideas do people have to fix the use cases that are broken, without increasing the number of repeat downloads we serve of chef-client?

tas50 · 2017-03-21T16:29:49Z

So there are two major changes that absolutely need to be made, otherwise we've just entirely broken Test Kitchen for users:

* Caching should only occur for bento boxes. As Seth mentioned, we don't know the state of other boxes. A lot / most out there don't have the tooling necessary to set up the NFS mount for the cache, and Test Kitchen just fails there.
* We need to whitelist operating systems instead of blacklisting. As I found out the other week trying to test chef on OmniOS, we entirely forgot to blacklist Solaris variants, so those no longer work with Test Kitchen. There are probably a few other operating systems we've also forgotten about, so let's just assume we don't know what should be in a blacklist.
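
A rough sketch of what those two rules might look like combined. The constant `CACHE_PLATFORMS`, its contents, and the helper `cache_enabled?` are illustrative assumptions, not an authoritative list or existing API; it also assumes bento box names follow the `bento/<platform>-<version>` pattern.

```ruby
# Hypothetical allowlist of platforms where the shared-folder cache is
# assumed to work. The entries are placeholders, not a vetted list.
CACHE_PLATFORMS = %w[centos debian fedora ubuntu].freeze

# Enable caching only for bento boxes whose name starts with an
# allowlisted platform; anything unknown defaults to no caching.
def cache_enabled?(box_name)
  org, name = box_name.to_s.split("/", 2)
  return false unless org == "bento" && name
  CACHE_PLATFORMS.any? { |platform| name.start_with?("#{platform}-") }
end
```

The point of the design is the default: a box we know nothing about gets `false`, so a forgotten platform costs a download rather than a broken run.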

cheeseplus · 2017-03-21T17:10:51Z

Core to this is that we absolutely must know that the end user has the correct/working combination of hypervisor and base box with compatible vmtools for the shared folder method to be viable. The problem is that we can't introspect this in a meaningful way without either building this matrix into kitchen-vagrant or having to reach out to some external endpoint that has a compatibility list. Neither of these is a viable solution to maintain.

At best we should only enable caching by default for base boxes that we know should at least have vmtools installed which would be things in chef/bento (excluding platforms known not to work like FreeBSD). Even then, it's still a gamble as the local hypervisor version may be fundamentally incompatible with the base box that is local to the system.

Shared folders via hypervisors are inherently brittle and the caching solution relies on them to be a lot more predictable and useable than they are in practice.

charlesjohnson · 2017-03-21T17:45:34Z

We can't ask Chef to pay to serve up repeat downloads of chef-client to people running Kitchen, unless we know that those people can't get chef-client onto their instances in any automated way other than by re-downloading chef-client from Chef.

How can we ensure that we don't waste Chef's resources, but still provide a great user experience?

coderanger · 2017-03-21T18:26:33Z

@charlesjohnson Do we collect data on how many installs as a proportion of the whole are coming from kitchen+kitchen-vagrant? Because that seems like a good thing to have numbers on before we cite it as a reason to do or not do things.

charlesjohnson · 2017-03-21T19:27:10Z

Wouldn't that be great? But no, don't have a way to get that data afaik, paging @schisamo to confirm.

coderanger · 2017-03-21T19:30:16Z

We could, at a minimum, have the install.sh/ps1 scripts record $TEST_KITCHEN and forward that along with the download somehow (header?). We would have to make TK expose the driver name in a similar way to be able to tell which driver is in use but that is very doable.

coderanger · 2017-03-21T19:36:13Z

Or, now that it's all handled in mixlib-install, we could directly pass it the driver name and test_kitchen: true :)

schisamo · 2017-03-21T19:52:16Z

So mixlib-install does have the ability to set User-Agent headers now:
https://github.com/chef/mixlib-install#user-agent-request-headers

We'll need to tweak test-kitchen to set this option properly.

There is also some work in flight to properly extract the User-Agent header (and a few others) in our Fastly logs. Once that is done we'll begin indexing the data in our ES cluster and begin to explore this facet of our packages.chef.io data.
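
As a sketch of the Test Kitchen side of that tweak, the helper below builds the extra User-Agent entries from the environment and the driver name. `kitchen_user_agent_headers` and the token format are assumptions; it also assumes `TEST_KITCHEN` is set in the environment, as suggested earlier in the thread.

```ruby
# Hypothetical helper: build extra User-Agent entries identifying
# Test Kitchen and the driver in use. The token format is an assumption.
def kitchen_user_agent_headers(driver_name, env: ENV)
  return [] unless env["TEST_KITCHEN"]    # not running under Test Kitchen
  headers = ["test_kitchen"]
  headers << "driver/#{driver_name}" unless driver_name.to_s.empty?
  headers
end

# The result would then be handed to mixlib-install through the
# user-agent option described in the README section linked above
# (check the option name there before relying on it).
```

For example, `kitchen_user_agent_headers("vagrant", env: { "TEST_KITCHEN" => "1" })` yields `["test_kitchen", "driver/vagrant"]`.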

coderanger · 2017-03-21T20:06:22Z

@schisamo That doesn't seem to be supported by Mixlib::Install::ScriptGenerator, which is how almost everyone installs things in TK (I'm not sure anyone outside of Chef Software but me is even aware of the other mode).

schisamo · 2017-03-22T01:14:24Z

@coderanger Yeah, Mixlib::Install::ScriptGenerator is more or less deprecated; we need to update Test Kitchen to use the newer idioms. I'll let @wrightp weigh in with the specifics.

cheeseplus · 2017-03-22T17:23:10Z

I've outlined the short term fixes here: https://gist.github.com/cheeseplus/8a1871b837a31cd6c0113a382fe8e03d

cheeseplus · 2017-03-22T21:02:57Z

I've co-opted two existing issues to track this:

* Implement whitelist / isolate to bento: "Caching should use whitelist of supported platforms" #296
* Structure cache directory / fix checksumming: "Cache directory should be structured" #293
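
For the second item, a minimal sketch of what a structured layout plus checksum check might look like. The helper names and the `product/version/platform/arch` ordering are assumptions for illustration, not what #293 specifies.

```ruby
require "digest"

# Hypothetical layout: <root>/<product>/<version>/<platform>/<arch>/<file>,
# so packages for different platforms or versions can never collide.
def cache_path(cache_root, product:, version:, platform:, arch:, filename:)
  File.join(cache_root, product, version, platform, arch, filename)
end

# A cached file is only reused when it exists and its SHA-256 matches
# the expected checksum; otherwise the caller should download again.
def cached_file_valid?(path, expected_sha256)
  File.file?(path) && Digest::SHA256.file(path).hexdigest == expected_sha256
end
```

Keying the path on platform and architecture addresses the collisions, while the checksum guards against truncated or stale downloads.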

afiune · 2017-03-24T14:10:47Z

I'll take care of #293 👍

cheeseplus · 2017-07-24T22:55:57Z

We've already addressed #296, and #293 is sufficient for the remaining work, so closing this one.

cheeseplus closed this as completed Jul 24, 2017

mirogta mentioned this issue Jun 1, 2018

* winrm.ssl_peer_verification must be a boolean. #327

Open
