Skip to content

Latest commit

 

History

History
216 lines (133 loc) · 13.9 KB

README.org

File metadata and controls

216 lines (133 loc) · 13.9 KB

Notes on writing Linux kernel modules

In these notes, we follow The Linux Kernel Module Programming Guide closely. Each directory in this repository contains one or more kernel module examples, most are taken from LKMPG. We attempt to tidy up the examples, modernize them, or comment on them further. In any confusion, the ultimate arbiter are the kernel sources. Use the source, Luke.

Other resources:

The build system

The build system is called kbuild.

Building kernel modules on Debian

We need the kernel headers, which we can install with:

sudo apt install linux-headers-$(uname -r)

Operations on modules

  • modinfo, inspect a .ko file.
  • lsmod, view which modules are loaded and how many processes use a module.
    • Modules are also listed under /sys/modules; additionally, the kernel built-in modules will be also listed there.
  • insmod, load a module.
    • Arguments can be passed via insmod mymodule.ko variable=value using module_param(). See hello-5.c. E.g.
      insmod hello-5.ko mystring="foo" myintarray=-1,3
              
    • Can’t load a module if one with the same name is already loaded. This includes built-in kernel modules.
  • rmmod, remove a module.
  • dmesg | tail, view diagnostic messages printed with pr_info() and other similar functions.

Kernel modules are object files whose symbols are resolved by insmod (or modprobe for already-installed modules.) Exported kernel symbols are in /proc/kallsyms.

Licensing

The MODULE_LICENSE() macro exists primarily for three reasons:

  1. So modinfo can show license info for users wanting to vet their setup is free
  2. So the community can ignore bug reports including proprietary modules
  3. So vendors can do likewise based on their own policies

Information on MODULE_LICENSE() and other MODULE_* macros can be found by reading the source of the linux/module.h header.

Device drivers

Device drivers are a class of kernel modules providing functionality for hardware.

Device files are under /dev; they provide means of communicating with hardware. This provides a general method of communicating with drivers: /dev/sound may be connected to by the es1370.ko driver, or some other. Device files can be created by e.g. =mknod /dev/coffee c 12 2=.

Major and minor numbers (Assigned major/minor number listing) are listed by ls -l in the form MAJOR, MINOR. The major number is the corresponding device driver controlling it, and the minor number is to differentiate different (potentially abstract) hardware. When a device file is accessed, the kernel determines the module controlling it by the major number. The minor number is for the module itself to consume. Major numbers and drivers currently online are under /proc/devices.

Device files are either “character” or “block”. Block devices have IO in blocks of bytes and can buffer requests, while character devices work with bytes.

Character devices

The struct file_operations holds operation callbacks such as read() and write(). Unused operations are set to NULL. The variable is commonly called fops. struc proc_ops replaces struct file_operations for merely registering /proc handlers.

The struct file represents an abstract open file.

To register a major number for a character device, use either register_chrdev_region() or alloc_chrdev_region(). The former is with a fixed major number and the latter dynamically allocates one that is available. Then the functions cdev_alloc(), cdev_init(), cdev_add(), etc, are used. For an example of the cdev interface, see ioctl/ioctl.c.

In <linux/module.h>, the following functions are available to view or modify the use counter:

try_module_get()  /* Increment the reference count of current module. */
module_put()      /* Decrement the reference count of current module. */
module_refcount() /* Return the value of reference count of current module. */

This can all be accomplished better by the .owner = THIS_MODULE member of struct file_operations. See SA/a/6079839 and an examplanation of the VFS as well as lwn.net/Articles/22197/.

Conditional compilation for different kernel versions

This is an advanced situation where multiple incompatible kernel versions are wished to be supported.

/* Conditionally compile for kernel 2.6.16 or less */
#if LINUX_VERSION_CODE <= KERNEL_VERSION(2,6,16)
  /* ... */
#endif

Examples

Each directory in this repository contains one or more kernel module examples. Here we describe them and comment on the particularities of their source code.

chardev

This kernel module is a character device. Userland processes can interact with the device by treating it as a file (with filename /dev/chardev.)

We define four functions, device_{open,release,read,write}, which we populate a struct file_operations with. The file_operations structure controls the behavior of the character device. For example, when an attempt from a process to read from the character device is made, the function registered under the structure member .read is called.

There are two functions attributed with __init and __exit which are the entry point and exit point of the kernel module (analogous to main in a C userspace program.) Any functions attributed with __init and __exit allow the kernel to free up the memory their code used after initialization, so it is an optional optimization. The actual lines that tell the kernel which functions will be the entry and exit point are module_init() and module_exit().

In our init function, we register a character device with register_chrdev so that the kernel dynamically assigns a major number (scroll to the 234-254 range) for us. This looks like:

major = register_chrdev(0, DEVICE_NAME, &chardev_fops);
/* ... */
cls = class_create(THIS_MODULE, "chardev");
device_create(cls, NULL, MKDEV(major, 0), NULL, "chardev");

The class_create call creates a class structure. These classes have multiple uses, a notable one is for exporting device numbers under /sys/class/$name where $name is the second parameter of class_create(). The device numbers are used by by udev(7), e.g. with tools like udevadm(8) for device discovery (for example: mount filesystem when USB stick is plugged in.) Note that cls must be deallocated with class_destroy(); THIS_MODULE is a macro to a struct and MKDEV() combines a major and a minor number.

Our driver has a global buffer called msg which we wish to synchronize between multiple processes; only one process can use the buffer at a time. For this purpose, we use a binary semaphore with atomic updates: we use ATOMIC_INIT(val), atomic_cmpxchg(&x, comp, newval), and atomic_set(&x, val).

We keep track of the number of processes currently using the kernel module with try_module_get(THIS_MODULE) and module_put(THIS_MODULE) to let the kernel know not to make the module exit module prematurily. Note that try_module_get() presents an issue, and there is a superior alternative. See SA/a/6079839.

Writing to the device fails with -EINVAL.

Reading from the device essentially calls put_user(*msg++, *buf++) over and over until the whole message is written, and returns the number of bytes. The function put_user() copies from kernel memory to user memory: when a userland program attempts to read from the character device, a userland buffer is provided to kernel space for filling; note that it is attributed with __user, as in char __user *buf.

We can invoke trigger.sh every time chardev is loaded by writing the following udev rule in /etc/udev/rules.d/80-chardev.rules:

SUBSYSTEM=="chardev", ACTION=="add", RUN+="/path/to/chardev/trigger.sh"

Assuming the path is correctly modified to point to trigger.sh, and that we then run udevadm control --reload, the script will be invoked whenever insmod chardev.ko is performed. We can check that it has indeed ran by inspecting its output, on /tmp/chardev_trigger.log.

chardev2

Another mechanism of communication with character devices is demonstrated: ioctl(2) calls. The function that deals with the ioctl call is device_ioctl(), stored under the .unlocked_ioctl member of the fops structure. To define our own ioctls, we use the _IO* macros in chardev2.h. This public header is also used by userland programs, as they also need to be able to use the ioctl macros. One important difference with the old chardev is that we no longer dynamically register a major number; instead we provide a fixed number MAJOR_NUM to register_chrdev(). This is important because the device number is used in the ioctl macros.

In chardev we used the .release fops, but now we use a worse alternative, try_module_get() and module_put(). This shouldn’t be used, but we demonstarate it regardless.

procfs

The init and exit functions use proc_create() and proc_remove() to create/remove the proc file. The return value is a struct proc_dir_entry *

To them the file permissions, e.g. =0644= are passed, and a proc_ops struct with .proc_read = procfile_read. See linux/proc_fs.h for kernels v5.6+.

The function procfile_read uses copy_to_user(buffer, s, len) and adds *offset + len=.

ioctl

After loading the module, use journalctl | tail to find out the major number, and use

mknod mydevfile c <MAJOR> 0

to create a device file corresponding to this driver. This char file will continuously output the configured byte value non-stop.

syscalls

When calling a syscall, a process jumps to a location in the kernel named system_call. They are indexed on sys_call_table by the syscall number.

We wish to modify sys_call_table to wrap our code around a particular syscall.

The control register cr0 modifies the x86 processor behavior. Once the write protection WP flag is set, the processor disallows write attempts to read-only sections. Thus to modify the table, we must disable WP.

We will replace open() with what is conceptually

new_open():
  if proc_id() == MAGIC:
    pr_info(report which file is being opened)
  continue with normal open()

pid_experiments

We show various things that a kernel module can do with userspace processes.

pid_info

Print information about a process.

The Virtual File System

The VFS is the layer between a call to write() and the specific code responsible for dealing e.g. with ext4, btrfs, and so on.

VFS translates pathnames into directory entries (dentries). A dentry points to an inode, a filesystem object. The inode contains information about the file, for example the file’s permissions, together with a pointer to the disk location or locations where the file’s data can be found.

To open an inode, a file structure is allocated (kernel-side file descriptor). The file structure points to the dentry and operation callbacks taken from the inode; in particular, open() is then called so that the particular filesystem can do its work.

Filesystems are (un)registered with

int (un)register_filesystem(struct file_system_type *);

The registered filesystems are under /proc/filesystems. To mount a filesystem, VFS calls mount0() and a new vfsmount is attached to the mountpoint; when pathname resolution reaches the mountpoint, it jumps into the root of the vfsmount.

A superblock object representes a mounted filesystem.

Things to explain

  • [X] What is the loff_t* parameter in the .read operations of struct file_operations and struct proc_ops?

    The offset is the current position in the file. The read operation gets called again and again until a 0 is returned. Notice it is us who advance the offset via a simple +=.

  • [X] How does the sysfs example work? I don’t understand kobject_create_and_add(), especially the second argument. How is an attribute a kobject?

    The kernel_kobj file makes it a parent and so the kobject lies under /sys/kernel.

  • [X] What does class_create() do?

    Creates entries with major/minor under /sys/class, useful for device discovery by udev(7).