Unix is an operating system that was originally developed at Bell Labs in the 1970s. It is based around a "modular design" where tools do very distinct and narrow tasks. To complete more complex tasks multiple modules are then combined through the use of "pipes" - more about those later. If you are interested in learning more about UNIX then you can have a look at the wikipedia article for UNIX, it's quite thorough and well written.
While there is no straightforward answer to this question there are some things that are often brought up. UNIX (and Unix-like systems) in its design is quite simple and is nowadays very portable. This has lead to it being used to run anything from massive high-performance computer clusters to tiny single-board computers such as Arduino & raspberry pis. This ubiquity and popularity is probably one reason why it is still so popular. Since the year 2000 Mac computers are also running on Apple's own Unix system, another popular Unix-like system is the Android mobile operating system. While the original UNIX operating system is a commercial system there are many Unix-like operating systems such as Linux which are free and open-source (these are generally based on the Linux kernel). The ecosystem of open source and free distribution suits the academic world very well. It is not science if you aren't sharing your findings and how you came to your conclusions - that generally includes your code.
Interaction with UNIX-style systems is typically done through a command-line interface (CLI) - a terminal
of some sort. There is generally no GUI (graphical user interface), though there exist protocols to display graphics through the terminal such as X11
.
To communicate with the system there needs something to interpret the user's command. In a UNIX-like system, this is called a shell
, one of the most common ones - and the one found on Uppmax
is called Bourn-Again shell or simply bash
.
It's an interpreter and it's own (basic) programming language.
Enough exposition, let's get going. Open up the terminal and proceed with the exercise.
Terminal display for version 7 of Unix
(By Huihermit - Own work, CC0, https://commons.wikimedia.org/w/index.php?curid=30560188)
The macOS operating system is a Unix-system so modern Macs can natively interact with other Unix systems. macOS comes with an inbuilt terminal application called simply Terminal
.
It is functional and gets the job down, we, however, recommend that you instead install iTerm2
, which comes with several quality of life improvements.
Download it from https://iterm2.com/
MacOS comes with a very bare-bones version of Unix which lacks many tools that most Unix installations come with. Thus we recommend that you install Homebrew which is a package manager for Mac which makes it super easy to install missing programs and tools.
To installl Homebrew
; open the terminal and run:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
Note that this will take around 5 min to complete and you will be prompted for your user password. The installation requires around 10GB of space
Then you can install some useful tools:
brew install wget
brew install w3m
brew install tree
If you are on a Windows machine then the easiest option for you is to download MomaXterm
.
https://mobaxterm.mobatek.net/
Another option is to use the Ubuntu app from the Windows app store or to set up a dedicated Linux partition. If you know how to do that then go for it, if not stick to MobaXterm.
Open a local terminal and run the following command to install w3m
which we will use in today's lab:
apt-get install -y w3m-img
If you have problem with missing software on your installation of Unix/Linux you can choose to work on Uppmax directly. You can read the instructions under Working remotely then continue for here.
When you connect to a system you usually end up in your home directory. To see the path to where you stand use the pwd
command. Try it now:
pwd
it should return something like:
/home/user_name
As you move around in the systems directories this will of course change.
To make a directory (called folders on other systems) use the command mkdir
, try the following out:
mkdir directory1
To figure out what happened use the ls
command which lists the content of the current working directory:
ls
As you can see mkdir
makes directories, all (or at least most) UNIX commands are named after their function.
To go to your new directory use cd
(change directory);
cd directory1
Now try to create another directory called directory 2
mkdir directory 2
Take a look at what you did using the ls
command.
Looks like you created two directories, one called 2
and another called directory
. Since the basic syntax for UNIX commands is:
command option1 option2 option3
Each extra option to the command is separated by a space, which means that spaces and other white characters, such as tabs, are not allowed in file or directory names!
To fix this issue use rm
for, you guessed it - remove.
We can remove both at the same time:
rm directory 2
Ouch, rm
by default only removes files, not directories. You need to tell rm
that you want it to work recursively, using what's typically called a flag, in this case -r
.
rm -r directory 2
Check that is worked using ls
.
Ok, let's try and make a second directory again:
mkdir directory2
change into it using cd
cd directory2
So now your current directory should be home/your_usename/directory1/directory2
, to find out use pwd
. If you want to go one directory up, in our case from directory2 to directory1 you can use ..
notation:
cd ..
Check that it took you to the right place. Here are two need things to know about cd
cd -
takes you back to the last place you were standing.
Just cd
with no options takes you to your home directory.
w3m -dump https://en.wikipedia.org/wiki/Principal_component_analysis > PCA.txt
The above command reads the Wikipedia page for Principal Component Analysis and extracts the body text and saves it to the file PCA.txt
. The >
is used to redirect output to a file.
Now that we have some text to work here are some tools for inspecting files, try them out on PCA.txt
.
cat - concatenates the file contents to standard out (the screen)
less - a nice and easy file viewer, press q to quit!
head - look at the head of a file, by default the first 10 lines.
tail - looks at the tail of a file, by default the 10 last lines.
It turns out that the PCA article is quite big, how big?
Counting is hard and slow for humans but easy for machines. Use the word count command wc
on the file to figure out how many lines and words it contains.
Question 1: What did you get? How many lines and words?
Hmm, it's not that clear, is it? Have a look at the manual for wc
to try and figure it out.
All core UNIX commands have an inbuilt manual you can access it through the man
command:
man wc
Now that you figured that out, use the manual for head
to figure out how to save the first 100 lines of PCA.txt
and save them into short_pca.txt
.
You can use wc
to figure out if you did it correctly.
grep
is a useful tool that prints lines matching a provided pattern.
In our example we can use it to figure out how many lines contain the word PCA
:
grep PCA short_pca.txt
One of the fundamental concepts behind UNIX from the beginning was an emphasis on small task-specific programs. These programs could then be chained together into pipelines to perform more complex tasks.
Imagine that you have a machine that cuts down trees, de-barks them, and cuts them into planks. If you have trees and want planks then this is fine, but what happens if you want to cut down the tree and keep the whole log? Your big fancy machine only makes planks. In the UNIX world, your one machine would consist of smaller machines chained together. One that cuts down the tree, one that removed the bark, and one that cuts into planks. If you get a pile of logs from your neighbor then you can use one of your machines to make planks, etc.
This type of chaining together is called piping in UNIX and is done by the pipe character |
. e.g.
command 1 | command 2 | command3 > output_file
By using grep
, pipe
, and wc
we can now easily figure out how many lines of the Wikipedia article about Principle Component Analysis contains the word PCA
:
grep PCA short_pca.txt | wc -l
Try it for the full article as well!
Question 2: Write down how many times the term"PCA"
appears in both the full PCA.txt
and the short_pca.txt
files. Remember that there can be more than one hit per line so you need to account for that!
--
Hidden word exercise
Question 3:
Now that you have some basic UNIX tools at your disposal go and do the hidden word_exercise. (Note: You will need to download a file for this exercise. When working remotely, you can download it to your local computer first and then use scp
(Description at the bottom of this page) to copy it to the remote computer.)
Submit the hidden word.
Do not continue with the next part until you are done with the hidden word exercise.
The next part of this exercise will take place on the computational cluster Rackham on UPPMAX (Uppsala Multidisciplinary Center for Advanced Computational Science).
To connect to a remote Unix server the protocol ssh
is typically used.
Given that you have an account with Uppmax do the following to connect to Rackham:
After entering that command you will be prompted to enter your password.
Our teaching project is called uppmax2022-2-2
Navigate to the RESULTS folder (using cd
):
to:
/proj/uppmax2022-2-2/private/RESULTS
There you need to make a directory for you to work in. Call it your first name underscore your last name:
FirstName_LastName
For the rest of your exercises, you should be working in that directory. I.e copy data there and work on it.
Ok time for something perhaps a bit more fun. Some genetic data!
The SAM/BAM (Sequence Alignment/Map & Binary Alignment Map) format is a very popular format for storing nucleotide data that is aligned to a reference.
If you want to read more about the file format(s) you can have a look at the official documentation: https://samtools.github.io/hts-specs/SAMv1.pdf
If you want a simpler (and easier to read) text to give you an overview have a look at the Wikipedia article: https://en.wikipedia.org/wiki/SAM_(file_format)
The main tool for working with SAM/BAM files is called samtools
and it's installed through the module system on Uppmax. To get access to it:
module load bioinfo-tools
module load samtools/1.9
The BAM file and corresponding SAM file can be found here:
/proj/uppmax2022-2-2/private/DATA/BAM
Before using it have a look at the file sizes of the two different formats.
ls -lh
With that information, you can probably see why it's a generally good idea to store data in binary formats as much as possible. The original full file was 117 GB for reference.
Question 4: use samtools view
and head
and tail
to figure out the first and last basepair position in the file. Also include the exact command you used!
First, figure out which fields it is that you want and then investigate man cut
to figure out how to access them.
Question 5: Write down the command you used to extract the name and nucleotide sequence! This should be done in one command.
Regular expressions(regex) are used to catch and match certain words or phrases. E.g
^
^P[0-9]+
will match at the beginning of a line(^
) the letter P literally, any number ([0-9]
) repeating (+
)
it will thus catch the top line but not the second:
P674353
454646464 P
These types of expression can be very useful and powerful
sed
is a powerful tool for editing streams of files. It is a common way of using rexes in Unix. It's often used to replace one thing with another:
if My_file.txt
contains:
dog dog dog
Then:
sed 's/dog/cat/' My_file.txt. # s is for substitute
cat dog dog
the first instance of dog
is replaced with cat
. We can also replace all instances using the g
global flag:
sed 's/dog/cat/g' My_file.txt. # sg - substitute globally
cat cat cat
You can use sed
on piped output from another program or straight on a single file. For a summary of some things that you can do with it have a look at this link.
Question 6 :
You have been given a that has been exported from excel in an odd format (something that is all too common in the life of a bioinformatician). Your task is to transform the file orange.csv into a normally formatted .csv
-file. That is the decimal point should be a .
and the delimiter (what separates one column from another) should be ,
. It also looks like someone has accidentally inserted some letters among the numbers, they also need to be removed.
Submit what sed
command(s) you used to clean the file. (Make sure that it looks correct, i.e. all columns are separated with a comma)
You will probably have to look up more information on how to do this. You can use man sed
or info sed
for more information, or google your way to it. As long as you know what your command does
Note that if you want to upload the file to your Uppmax results you can use the scp
command, you can find an example of how to use it at the bottom of the page.
Bash is a programming language in itself so it is possible to set up quite advanced workflows with it. The most simple bash script is just a normal command you would type on the command line saved to a file. Or more realistically you might want to run a couple of things that take a few minutes or hours after each other. This is something that you definitely will do in your future bioinformatics career.
An example of something like that:
echo “Wait for 5 seconds”
sleep 5
echo “Completed”
Add the above text to a file called sleep.sh
and execute it with:
bash sleep.sh
You can see that the code is executed sequentially, it does not progress to the next line until the previous one has finished.
An example of a use case could be that you have data on a local server that you want to transfer to a remote server where you want to perform some kind of analysis and then transfer the results back to your local server. That would look something like this:
### Standing at the remote server copy the local files there
scp my_user@remote_server:local_file .
### Do the analysis
run_analysis.sh local_file > output_file
### Copy results back
scp output_file my_user@remote_server:
You can chain any typ of program/script that you can run on the commandline like this, even those you have written yuorself like the perl
programs you ran in the hidden word excercise or the scripts you will write in Lab 2:
python my_python_script.py inputfile.txt
This is the end of the lab, please make sure that you did and wrote down the answers to all of the questions. Also, make sure to delete any files that you no longer need - you can copy them somewhere else if you want to keep them.