Guidance on how to make your environment easier to onboard for Web Ops Engineers, SRE's and DevOps Practitioners

actionjack/so-you-want-to-onboard-a-devops-engineer

So you want to Onboard a DevOps Practitioner

Author: Martin Jackson - @actionjack

Introduction

Currently everyone seems to be very interested in recruiting DevOps practitioners, but I feel the process of onboarding them and giving them a supportive environment in which to succeed and thrive is still a bit of a hit-and-miss affair, especially in busy organisations.

Nobody (at least nobody I know…) wants to work in a difficult environment:

  • Bad environments (and broken cultures) neither attract nor retain top talent; they do the exact opposite.

“Suffering increases in proportion to knowledge of a better way.”

Jim Hickstein

Making it easy to get work done from day one

Simplify, simplify and after that simplify some more

“Simplicity is the ultimate sophistication.”

Leonardo da Vinci

Reduce the time spent learning environments by building them to be easy to understand, with a focus on making it possible for every developer (new or old) to become effective in the shortest possible amount of time.

Here is some guidance on how to make your environment easier to onboard into and how to keep the people working in it happy.

The Basics

The raw basics

"Without a solid foundation of raw basics, any structure built upon it is liable to crumble and fall."

Unknown

  • Have internet access sorted out for new starts or let them know if there isn't any.
  • Locker access (if you supply lockers for hot-desk environments).
  • Let security know that they are coming.
  • Let people know whether they are required to use their own equipment or will be supplied with specified equipment, and which operating system it runs.
  • If you haven't already done so, adopt some group chat software like Slack, Microsoft Teams or Rocket.Chat. This kind of software benefits everyone and reduces pressure on key individuals, because questions go out to a group of people rather than targeting specific individuals who may be busy and under constant interruption.
    • If you do the above, try to implement some communications etiquette. For example, when you answer someone, reply in a thread so the question, context, conversation and possible solution are kept in the same place rather than being strewn throughout the chat history.
  • Provide a high-level environment overview so new starts know what they are working on and what technologies they need to get up to speed on.

Culture

Aim to create a culture of empathy and psychological safety

“It's possible for good people, in perversely designed systems, to casually perpetrate acts of great harm on strangers, sometimes without ever realising it.”

Lawrence Lessig

  • Embrace the standard of The Humble Learner, who accepts the limits of human capacity while seeking to grow their technical and empathetic skills.
  • Do not create or foster a Blame, Shame and Train culture, where mistakes are handled by openly blaming and shaming the employee (and sometimes terminating their employment) and then training other employees using the incident as an example.
    • Instead, recognise each failure for what it is: a lesson. Identify what went wrong and how to ensure it does not go wrong again (and no, this is not an excuse to produce lots more documentation 😜).
  • Try to foster a culture of improvement: benchmark your organisation against some form of maturity model to identify the gaps and attempt to close them.
  • Introduce the new engineer(s) to the relevant people within the organisation
  • Remember that not everyone may be as smart as you are; they may be missing:
    • Context / situational awareness (how did we get from here to there?)
    • Tribal knowledge (this is where our ancestors' bodies are buried)
    • Cultural awareness (how we do things around here)
    • Technical expertise in that specific problem domain
    • The local taxonomy: concepts and language vary from workplace to workplace, e.g. pre-approved changes and standard changes may not necessarily mean the same thing from job to job.
  • What are the Preferred practices or "Design Principles"?
  • Listen to their point of view. Bringing in a new person is a prime opportunity to find out where the code or process needs improvement.
  • Test your mentoring and onboarding process to flush out any shortfalls by getting the last person who joined to mentor the new joiner.
  • Make your documentation inclusive, e.g. this document is checked with alex to catch insensitive and inconsiderate writing.
  • Be wary of overloading new starts with too much information. There is often a lot to learn (often more than you think), so provide a set of useful links instead so people can research at their own pace.
  • Write code that takes into account how future maintainers will feel reading it; let your code be empathetic.

Documentation

Make it easy to understand and do the things

“Stale documentation is not only misleading, it is positively harmful.”

Riona MacNamara (@rionam)

It's important to either have or do the following:

  • Regularly tidy your documentation: remove old documents, update outdated ones, and if you touch it then update it.

    • Consolidate your documentation. Nothing is as disheartening as searching your wiki for "Password Management Policy" and getting 40+ search results 👎
  • Have a high-level logical architecture, ideally written in a Git-friendly format.

  • An overview of the company’s infrastructure.

  • Systems integration points and their third party dependencies

  • An intranet/wiki or enterprise social network where people can learn about the different teams and their key members, with pictures. On day one, it is easy to get overwhelmed by lots of new names and faces.

  • Have documentation for your alerts. If something is important enough to disturb the on-call person about, it's important enough to have a runbook entry about it. If you alert because foo queue is too long, there should be a runbook entry describing how to fix it.

    • At one client I worked with we configured the monitoring system so the alerts themselves actually had a link to the relevant runbook entry 👍 👏
  • Create a Glossary of Terms [e.g. a Minipedia] for describing any organisation specific acronyms or terms

  • Write your documentation as if it's going to be open to public scrutiny someday.

  • Have an easy-to-use, easy-to-set-up collection of shared resources, e.g. a bookmark file of URL links or .ssh/config files.

  • If possible, keep your documentation as close to the code as possible (for example as Markdown, or built with a static site generator) rather than in external resources like wikis. This way you are more likely to have up-to-date documentation, since you get immediate feedback when you review code changes rather than having to review a PR and a wiki page separately.

  • If there are problems that you have to work around in your code, then link in the comments to some sort of permanent record (e.g. the URL of a Jira story or an ADR) explaining why. The following code comment caused me a lot of running around (the `git blame` gave me a commit that led to a PR with zero details in it, authored by someone who could not remember why they put that in the code):

    instance_type: m4.4xlarge # Larger than this currently causes issues on our AMIs…
  • What would have been more helpful is:

    instance_type: m4.4xlarge # Larger than this type causes issues see REF-2019
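
The workaround-comment advice above can be checked automatically. The following is a minimal sketch, not a definitive linter: it assumes that references look like `REF-2019` (an uppercase key, a hyphen and a number) or a URL, and that a handful of trigger phrases mark a workaround. Both patterns and the function name `unreferenced_workarounds` are hypothetical and would need adapting to your own tracker and house style.

```python
import re

# Assumption: tracker keys look like "REF-2019"; plain URLs also count as a reference.
REFERENCE = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b|https?://\S+")
# Assumption: these phrases suggest that a comment describes a workaround.
WORKAROUND = re.compile(r"causes issues|workaround|hack", re.IGNORECASE)


def unreferenced_workarounds(text):
    """Return (line_number, comment) pairs for workaround comments with no reference.

    Only '#' style comments are handled, and a '#' inside a quoted string
    would be misread as a comment; this is a sketch, not a parser.
    """
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        _, marker, comment = line.partition("#")
        if not marker:
            continue  # no comment on this line
        if WORKAROUND.search(comment) and not REFERENCE.search(comment):
            findings.append((number, comment.strip()))
    return findings
```

Run against the two comments above, a check like this would flag the first and accept the second, which makes it a candidate for a pre-commit hook or CI step.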

Operations

Make it easy to get stuff done

“Complexity exacts a staggering tax on your humans. Good Ops engineers attempt to pay down that tax.”

Charity Majors

  • Have all relevant user accounts and access set up and ready

  • Create Operations Checklists for your key processes

  • Have your work structured so people can see what needs to be done, e.g. a Kanban board backlog or to-do lists

  • Provide information regarding the applications that are maintained by the team and how to do the operations for those applications

  • Have sample dummy applications that can be deployed safely to your infrastructure, so new starts can learn how the deployment process works without fear of impacting key applications

  • Make it difficult to make mistakes.

  • If you have policies on how to handle certain tasks (e.g. doing spikes), document them and link to them in your stories.

  • Ensure your naming conventions are consistent and make sense:

    • If something is called build_X and it actually deploys_X then change the name to deploys_X if possible to reduce confusion and prevent information hiding.
    • If your environment structure is env-productgroup-application then make sure the naming is consistent across all environments, e.g.:
      • Development-Acme-Bomb
      • Test-Acme-Bomb
      • PreProduction-Acme-Bomb
      • Production-Acme-Bomb
  • Nobody should be able to do something catastrophic to an environment unless they are determined to do so, i.e.:

    • Make doing the right thing easy by creating safety harnesses with build or scripting tools, so the most common tasks can be done safely without the worry of screwing up.
    • If you use configuration management tools, run them repeatedly and/or test them. Try to avoid one-shot configuration management, i.e. an operation that is run only once to configure a resource, even a resource you do not expect to change, because it will change, it will break, and you will be rushing around trying to figure out what happened.
    • Use the Guard Rail Pattern by putting safe conditionals in your configuration management so that you can do test runs without the worry of screwing up, e.g. in Ansible tasks:
    - name: "Do something really Dangerous"
      command: /sbin/something --could --be --dangerous --if --run --it --in --prod
      when: testmode == "Off"
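
The same guard rail idea can be carried into scripting tools. This is a minimal sketch under stated assumptions: `run_guarded`, `PROTECTED_ENVIRONMENTS` and the confirmation convention are all hypothetical names invented for illustration, and the environment name is borrowed from the naming example above. The point is that the safe path (a dry run) is the default, and the dangerous path has to be asked for deliberately.

```python
# Assumption: a hypothetical set of environments that need explicit confirmation.
PROTECTED_ENVIRONMENTS = {"Production-Acme-Bomb"}


def run_guarded(description, action, environment, dry_run=True, confirm=None):
    """Run action() only when it is safe to do so.

    - dry_run=True (the default) never calls action(); it only reports.
    - Protected environments additionally require confirm == environment,
      so a destructive run has to be typed out deliberately.
    """
    if dry_run:
        return f"DRY RUN: would {description} in {environment}"
    if environment in PROTECTED_ENVIRONMENTS and confirm != environment:
        raise PermissionError(
            f"Refusing to {description} in {environment} without confirmation"
        )
    action()
    return f"DONE: {description} in {environment}"
```

A new starter can call `run_guarded("drop the cache", drop_cache, "Production-Acme-Bomb")` (where `drop_cache` is whatever function does the real work) and only ever get a report back; doing real damage takes both `dry_run=False` and the environment name repeated in `confirm`.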

Processes

How should we be doing the stuff

“If you can't describe what you are doing as a process, you don't know what you're doing.”

W. Edwards Deming

  • Everyone seems to have their own particular spin on Agile Scrum or Kanban, so explain up front what the process is and refine when and if necessary.
  • Have Shovel Ready work for new starters: create a backlog of work that can be easily done by a new starter:
    • Ideally work that:
      • is well defined,
      • is easily explained,
      • requires some research,
      • adds value and;
      • is not grunt work e.g. documentation.
  • Assign your new starter an onboarding buddy/mentor
    • Ensure that this "Buddy" has enough free cycles to be there for the new start if needed
  • Pair with the new start as soon and as often as possible. Depending on the complexity of the environment this could go on for weeks (if not months). Don't be afraid to pick up the pairing again at a later date if the engineer has never touched a particular block of code before.
  • When [and if] you do a Retro, base it against a known good baseline, i.e.:
    • If you are doing production deploys in the early hours of the morning and they succeed, remember that this does not necessarily reflect a good deployment.
  • Put as much detail into tasks / stories as possible including:
    • Assumptions,
    • Reference information and existing implementations,
    • Ensuring to narrow down the acceptance criteria in order to prevent unnecessary research or rework,
    • Diagrams.
  • Ideally, make your tasks/stories as small and atomic as possible. There are a number of reasons for this, including:
    • It makes them easier to handle and get your head around
    • You are less likely to have to context switch within a story if it has a narrow problem domain
    • You are more likely to actually finish that particular story and not have to pick up a new one and have to go back to the original story, since the smaller it is the less likely it is to run into some sort of unpredicted blockage.
  • Avoid [if possible] onboarding during crunch times (important or critical planned releases)
  • Ideally have your accounts linked with some central or shared directory, e.g. GitHub/Google/LDAP, so your new starters don’t have to create and remember 101 user/password combinations or request access to multiple applications separately.
  • Use configuration management that has a dry run feature e.g. --testing_mode on
  • Add or invite the individual to any relevant Slack, IRC or Microsoft Teams channels or mailing lists.
  • Provide information regarding relevant processes e.g.
    • Incident, problem and change management
    • Deploying changes / releases to the different environments
    • Ordering infrastructure / tools
    • Authorization for tools & applications
    • Use of test environments and creating and using test data
  • Have clean code. It really helps if your code is good, sensibly organized and well structured. If the code base is large, it should be broken down into smaller, understandable segments.
  • Create a Papercuts.md in your repos. This is a log of things that have hurt us in the current environment; they may not be actual technical debt, but they could be things for us to discuss and possibly fix in the future.
  • If you have adopted a particular coding style guideline on your project then document or reference it for new joiners to easily reference and adopt
  • Story kickoffs can be extremely useful to new starters: they help them get into the mindset of the team, identify areas that aren't immediately visible in the code base and generally reduce rework caused by poor or missing acceptance criteria.
  • Embed your processes in your code. If your process requires you to hand off to another team to get something done (e.g. after issuing a Pull Request you need to notify another team to run a Jenkins pipeline), then put the team and its contact information (e.g. Slack channel) in the documentation.
  • Use code formatters to standardize the structure of your code, e.g. terraform fmt. This can make reading diffs a lot easier since you don't have to deal with things like differing indentation.
  • Encourage Mobbing or Swarming on difficult issues or development blockages (e.g. blocked pipelines) where the entire team works together on a single task

Version control management

“Those who cannot remember the past are condemned to repeat it.”

George Santayana

How do we safely change the things
  • Document your coding standards and strategies in the open.
  • Have up-to-date README documentation in all repos.
  • Make Pull Requests a first-class citizen. Nothing is more demoralising than having a Pull Request sitting around without feedback or a chance of being merged, especially if it needs to be continually rebased.
  • Good Pull Requests can also be an excellent teaching tool for new starts and old hands alike. A good PR tells you what was implemented, why and how, so if you (or anyone else) need to do something similar in the future it will be a lot easier than relying on your memory or tribal knowledge. You can also prompt for good Pull Requests by using Pull Request templates that suggest your best-practice format.
  • If you use Slack or something similar, consider adding a notification bot for pull request and push activity (e.g. for Bitbucket or GitHub) to notify your colleagues that a Pull Request is ready for review.
  • Keep your pull request list short and tidy: merge good requests quickly and close poor ones or those that are never going to be merged.
  • Integrate your git history with your external issue tracker so that it can automatically reference the changes related to a story. Put in place some automated branch naming protection to ensure that branch names match the issue tracker's reference format. This way you enforce the best practice of a branch matching a historical record (for example, in Jira) of why something was created, changed or deleted.
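
The branch naming protection mentioned above could be sketched as follows. The pattern is an assumption made for illustration: branches of the form `<type>/<ISSUE-KEY>-<short-description>`, with a Jira-style key such as `REF-2019`. The prefixes `feature`, `bugfix` and `hotfix` and the function name `issue_key` are hypothetical; substitute whatever convention your team has documented.

```python
import re

# Assumption: branches look like "feature/REF-2019-short-description".
BRANCH_PATTERN = re.compile(
    r"^(feature|bugfix|hotfix)/([A-Z][A-Z0-9]+-\d+)(-[a-z0-9-]+)?$"
)


def issue_key(branch):
    """Return the issue key embedded in a branch name, or None if it does not conform."""
    match = BRANCH_PATTERN.match(branch)
    return match.group(2) if match else None
```

A server-side hook or CI job could reject any push where `issue_key(branch)` is `None`, and use the returned key to link the change to its story.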

Development environments

How do we safely change things

“Measure twice, cut once”

Proverb

  • Make it easy to set up your local development environment. You should not have to do the following just to start work:
    • Log multiple service requests
    • Read through multiple wiki pages
    • Hunt down multiple individuals
    • Get multiple emails with multiple links
    • Ask multiple people how their personal environment is configured
  • Have at least a minimally functioning Continuous Integration setup
  • Make your tooling easy to set up and easy to use across platforms, or run it in a local environment that does not mess up what’s currently there, e.g. in a virtual machine
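
One small way to make local setup less of a scavenger hunt is a preflight check that tells a new starter exactly what is missing. The sketch below is illustrative only: the tool list is a hypothetical placeholder and the function names are invented here. It relies on `shutil.which` from the Python standard library, which returns `None` when a command is not on the PATH.

```python
import shutil

# Assumption: a hypothetical tool list; replace with what your project really needs.
REQUIRED_TOOLS = ["git", "ansible", "terraform"]


def missing_tools(required=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools from `required` that cannot be found on PATH."""
    return [tool for tool in required if which(tool) is None]


def preflight_report(missing):
    """Turn the list of missing tools into a single human-readable line."""
    if not missing:
        return "All required tools found."
    return "Missing tools: " + ", ".join(missing)
```

Calling `print(preflight_report(missing_tools()))` from a setup script gives one line of feedback instead of a trail of wiki pages. The `which` parameter exists only so the check can be tested without installing anything.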

Useful links

Would you like to know more?

See a problem here

See a problem? Need something clarified? Raise an Issue and I'll try to fix it.

Contributing

I'm open to well-structured Pull Requests

  1. Fork it!
  2. Create your feature branch: git checkout -b my-new-feature
  3. Commit your changes: git commit -am 'Add some feature'
  4. Push to the branch: git push origin my-new-feature
  5. Submit a pull request :D

License

This work is licensed under a Creative Commons Attribution 4.0 International License.

CC BY 4.0

CC Attribution 4.0 International © Martin Jackson