CFQ_SUBSYS: BLOCK I/O SCHEDULING CGROUPS SUBSYSTEM BASED ON CFQ
---------------------------------------------------------------
Technical Report
---------------------------------------------------------------
Vasily Tarasov
vtaras@openvz.org

ABSTRACT

The latest Linux kernels provide a mechanism for aggregating tasks in groups with specialized behavior. This functionality is called Control Groups, or CGroups. On its own, CGroups does not implement any specialized behavior for groups of tasks. The intention is that other subsystems hook into the generic CGroups support to provide new attributes for groups, such as accounting for, limiting, or scheduling the resources that the tasks in a group can access. At the moment several subsystems are implemented for CGroups. CPUSET is a subsystem that allows dynamic partitioning of a system into sets of CPUs and memory nodes and then restricts tasks to run only within those sets. The CPU subsystem lets the CPU scheduler recognize task groups and control CPU bandwidth allocation to such groups. Other subsystems are in development; for example, a lot of effort is being put into creating reasonable subsystems that deal with memory. However, practically nothing has been done towards a CGroups subsystem that manages the I/O bandwidth of block devices. The author proposes the CFQ_SUBSYS subsystem, which adds to the CFQ I/O scheduler the capability to recognize task groups. This paper describes the abilities and drawbacks of CFQ_SUBSYS, discusses its design in detail, and presents some overhead evaluations.

1. INTRODUCTION

This work was motivated by the rapid development of virtualization technologies. Virtualization is a marked trend nowadays. In simple words, it is a technology that allows running several virtual computers on one physical computer. Since many teams and individual programmers work in this field, there is a great diversity of virtualization approaches. The Operating System (OS) level virtualization approach is based on the observation that in certain scenarios the user does not need to run different OS kernels. In this case the kernel can be modified in such a way that it gives groups of processes the illusion that each group is the only one running on the computer. Of course, such an implementation does not allow running completely different OSes, like FreeBSD or Microsoft Windows on Linux. But, as noted above, it is often enough to have several instances of the same OS, Linux in our case. In exchange for this limitation one gets very low overhead and, consequently, the ability to run many more virtual machines on a single physical node. Section 2.1 highlights some basic aspects of OS-level virtualization and the current stakeholders in this area. It also introduces the CGroups infrastructure, which was recently included in the mainstream Linux kernel.

An important part of any virtualization solution is proper resource isolation. CPU time, memory, filesystem and network are the usual examples of resources to isolate. However, a resource like block device bandwidth is often forgotten. The CFQ scheduler is used to implement CFQ_SUBSYS, hence Section 2.2 explains how the CFQ I/O scheduler works and covers the basics of the Linux block layer design. Sections 3 and 4 form the crux of the paper. The former explains the high-level design of the CFQ_SUBSYS subsystem.
The latter points out the hurdles encountered during the implementation and explains in detail the meaning of each patch in the patchset. So far several basic tests have been performed; their results are discussed in Section 5. However, more extensive testing will be required for any such modifications. Section 6 discusses the future direction of the project.

2. BACKGROUND

CFQ_SUBSYS lies at the intersection of Linux CGroups and the Linux block layer. This section gives the minimal knowledge required about these components.

2.1 PROCESS AGGREGATION AND VIRTUALIZATION

Over the course of Linux history, there have been and continue to be multiple efforts to provide forms of modified behavior across sets of processes. There are several major applications of this. It can be used for more sophisticated resource scheduling mechanisms that increase overall system performance. An example is the Linux Cpusets functionality [11], which allows binding groups of processes to groups of CPUs and memory banks; this helps improve performance on NUMA architectures. The second application is more flexible Quality of Service (QoS). For example, a user can aggregate low-priority processes in a group and limit the resources allocated to this group, so that high-priority processes are more likely to have enough of the critical resources. The CKRM project [09] and UserBeancounters [03] are well-known works in this area. OS-level virtualization has become one of the most important applications of process aggregation; in fact, enhanced process aggregation essentially is OS-level virtualization. Notable examples here are the OpenVZ [04] and Linux VServer [12] projects.

Given the growing interest in this field, Paul Menage from Google and others developed the CGroups functionality for the Linux kernel [01][02]. CGroups is a basic framework that allows aggregating tasks in groups. In this paper "CGroups" refers to the functionality, while "cgroup" refers to a group of processes within the CGroups framework.

Let's review the basic concepts of CGroups. CGroups provides the cgroup filesystem, which allows the user to create cgroups and assign processes to them. For example, creating a new cgroup and assigning the currently running shell to this group is done with the following simple commands:

# mount -t cgroup cgroup /mnt/cgroup
# mkdir /mnt/cgroup/group_name/
# echo $$ > /mnt/cgroup/group_name/tasks

Each new process launched from this shell will be in the same cgroup, since the CGroups framework preserves the cgroup of any child process. CGroups allows forming a hierarchy of cgroups by simply creating a tree in the cgroup filesystem:

# mkdir /mnt/cgroup/group_name/
# mkdir /mnt/cgroup/group_name/lower_group_name

A hierarchical structure of cgroups is useful in certain cases. For example, an administrator can create a cgroup for the processes of professors at a university and then separate the professors by department, giving CS professors higher priority in accessing resources.

So far there has been no discussion of how CGroups changes the behavior of processes in cgroups. In fact, on its own, CGroups does not implement any specialized behavior for cgroups. The intention is that other subsystems hook into the generic CGroups support; CGroups provides the interface for registering new subsystems. It is important to note that "subsystem" is part of the CGroups terminology, and CFQ_SUBSYS forms a subsystem under CGroups.
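The same steps can also be performed programmatically. The following is only an illustrative sketch in C: it assumes the cgroup filesystem is already mounted at /mnt/cgroup, reuses the group_name cgroup from the shell example, and writes to the "ioprio" control file that CFQ_SUBSYS adds (described below); error handling is minimal.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Write a short string to a cgroup control file. */
static int write_str(const char *path, const char *val)
{
    int fd = open(path, O_WRONLY);
    ssize_t n;

    if (fd < 0)
        return -1;
    n = write(fd, val, strlen(val));
    close(fd);
    return n < 0 ? -1 : 0;
}

int main(void)
{
    char pid[32];

    /* mkdir /mnt/cgroup/group_name/ */
    if (mkdir("/mnt/cgroup/group_name", 0755) != 0 && errno != EEXIST) {
        perror("mkdir");
        return 1;
    }

    /* echo $$ > /mnt/cgroup/group_name/tasks */
    snprintf(pid, sizeof(pid), "%d\n", (int)getpid());
    if (write_str("/mnt/cgroup/group_name/tasks", pid) != 0) {
        perror("tasks");
        return 1;
    }

    /* echo 7 > /mnt/cgroup/group_name/ioprio -- the CFQ_SUBSYS priority file */
    if (write_str("/mnt/cgroup/group_name/ioprio", "7\n") != 0) {
        perror("ioprio");
        return 1;
    }
    return 0;
}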
Registration of a new subsystem requires a cgroup_subsys structure filled with methods that are called by CGroups. For example, the cgroup_subsys->create() method is called on the creation of a new cgroup. CGroups also allows a subsystem to create files in the cgroup filesystem that enable the administrator to control the subsystem-specific behavior of cgroups. For example, the user can change the I/O priority of a cgroup by writing to the "ioprio" file.

2.2 LINUX BLOCK LAYER AND CFQ SCHEDULER

The Linux block layer consists of several components: device drivers, I/O schedulers, APIs for drivers and filesystems, helper functions and, finally, the core, which maintains all the required data structures in a consistent state. For this paper, I/O schedulers and their position in the block layer are the most important; these are discussed below.

For each block device in the system Linux allocates a request_queue structure. This huge structure is used across all components of the block layer, from the filesystem API to device drivers. The second major structure is the request structure, an abstraction of a single request to the block device. Filesystems also use the bio structure, which essentially represents a request to the device as well. When a filesystem is ready to make a request to a block device, it forms an appropriate bio structure and calls the submit_bio() function. This function forms a request structure and invokes the I/O scheduler. The I/O scheduler places the newly arrived request in its internal structures and wakes up the device driver. The device driver checks with the scheduler whether there are any requests for the device; if there are, the scheduler puts some requests on the request_queue. After all the requests have been serviced the process repeats. Note that the I/O scheduler acts as a dispatcher: it stores arriving requests in its internal structures and then sends them to the driver, possibly in a different order. The change in order helps increase performance and fairness dramatically.

The current Linux kernel includes several I/O schedulers. The Completely Fair Queuing (CFQ) scheduler is the default at the moment and is apparently the best on varied workloads. Moreover, it is the only one that supports I/O priorities between processes. Hence CFQ is a good choice for implementing a block layer scheduling subsystem for CGroups.

The basic idea of CFQ is as follows. Each process that produces I/O activity has its own queue of requests (struct cfq_queue); this structure will henceforth be referred to as "cfqq". CFQ gives each cfqq an interval of time to work with a block device, the time slice. The length of the time slice depends on the priority of the task. When a cfqq's time slice expires, the next cfqq is serviced in round-robin fashion. The design appears simple, but the devil is in the details; these are described below.

CFQ allocates a cfqq per process, but only synchronous requests go to this queue. Asynchronous cfqqs are shared between all processes and are created on a per-priority basis. The reason is that for asynchronous requests it is not known which process initially requested the I/O operation. Note that writes are usually asynchronous operations. CFQ merges multiple adjacent requests into a single request. In addition, CFQ inherits the concept of anticipating requests from the anticipatory (AS) scheduler: after a synchronous request from a process, in certain situations it is worth waiting for that process's next request rather than switching to serving another request immediately. Hence the disk driver idles even though there are requests to be processed. In order to minimize the effects of idling, CFQ logs each process's I/O activity and tries to predict from this history whether idling is worthwhile.
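To make the time-sliced round-robin idea described above more concrete, here is a small user-space toy model. It only illustrates the policy: the "slice" is expressed as a request budget per turn rather than as time, the numbers are arbitrary, and the real CFQ code in the kernel differs.

#include <stdio.h>

struct toy_cfqq {
    const char *owner;
    int prio;      /* 0 (lowest) .. 7 (highest) within a class */
    int pending;   /* queued requests */
};

/* Higher priority => longer slice; here a slice is a per-turn request budget. */
static int slice_budget(const struct toy_cfqq *q)
{
    return 1 + q->prio;
}

int main(void)
{
    struct toy_cfqq qs[] = {
        { "task-A", 7, 20 },
        { "task-B", 4, 20 },
        { "task-C", 0, 20 },
    };
    int n = sizeof(qs) / sizeof(qs[0]);
    int nonempty = n;

    /* Round-robin over the queues until every queue is drained. */
    for (int cur = 0; nonempty > 0; cur = (cur + 1) % n) {
        struct toy_cfqq *q = &qs[cur];
        int dispatch;

        if (q->pending == 0)
            continue;   /* empty queue: move on to the next one */

        dispatch = q->pending < slice_budget(q) ? q->pending : slice_budget(q);
        q->pending -= dispatch;
        printf("%s (prio %d): dispatched %d, %d left\n",
               q->owner, q->prio, dispatch, q->pending);
        if (q->pending == 0)
            nonempty--;
    }
    return 0;
}

Running the model shows the higher-priority queue draining in fewer rounds, which is the qualitative behavior CFQ's per-task priorities are meant to produce.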
CFQ also allows preempting cfqqs in certain scenarios. The elevator mechanism is implemented in CFQ as well, i.e., CFQ services a process's requests in the order of the disk head trajectory. Moreover, CFQ adds the ability to move backwards on the disk if the backward request is near enough. As already mentioned, CFQ supports per-process priorities. Three priority classes are supported. Real Time (RT) is the highest priority class: if there are any non-empty cfqqs from this class, they are serviced first, and all other cfqqs are serviced only when no RT cfqqs remain. Best Effort (BE), the most common class, is serviced next; its cfqqs are scheduled in round-robin fashion. Idle class queues are serviced only if there are no cfqqs in the other classes. Each class has 8 priorities.

Another major data structure in CFQ is the cfq_data structure, an internal representation of the per-device queue. cfq_data contains all the information about the device required to perform scheduling. For example, it contains the tree of all cfqqs belonging to the device: cfq_data->service_tree.

3. DESIGN

This section explains the high-level design of the CFQ_SUBSYS CGroups subsystem. The design was initially developed for the UserBeancounters [03] project in the context of the OpenVZ project [04]. In addition to the main aim of implementing the functionality, the following aspects are important: first, there is a desire to include the patches in the mainstream kernel, so minimal changes to the CFQ code should be introduced; second, only minimal overhead is acceptable.

As already mentioned, CGroups supports hierarchical cgroups. In the case of CFQ_SUBSYS it does not matter where in the hierarchy a cgroup resides; only the priority of the cgroup in question matters. The current design implements 8 priorities. No real-time or idle classes are supported; this is done to simplify the design in its first steps, and such functionality is not required in common cases. The default priority is 4.

CFQ_SUBSYS implements something like a two-level CFQ scheduler. Each cgroup has a time slice during which only cfqqs from this cgroup are serviced. If there are no active cfqqs in the cgroup, the scheduler switches to the next cgroup immediately. Here lies a drawback: we do not anticipate requests from a cgroup, as the original CFQ does for processes. Again, this is done for simplicity and should be fixed later. A cgroup's time slice is scaled according to its priority, from HZ/2 to HZ: the higher the priority, the longer the time slice the cgroup obtains. Inside a cgroup all queues are scheduled as in the original CFQ.

As in the original CFQ design, asynchronous requests are not prioritized. To overcome this shortcoming a "dirty pages tracker" is required, which remembers for each dirty page the cgroup that dirtied it. There are some experiments in this field [03], and later it would be great to incorporate their results into CFQ_SUBSYS. The next section explains the design in detail.
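Before moving on to the implementation, the time-slice scaling just described can be illustrated with a short sketch. The linear formula below is only an assumption that matches the stated endpoints (HZ/2 for the lowest priority, HZ for the highest); the exact expression used in the patchset may differ.

#include <stdio.h>

#define HZ         1000   /* example value; the real HZ is a kernel build option */
#define CGRP_PRIOS 8      /* CFQ_SUBSYS priorities 0..7, default 4 */

/* Assumed linear mapping from cgroup priority to time slice: HZ/2 .. HZ jiffies. */
static unsigned int cgroup_slice(unsigned int ioprio)
{
    return HZ / 2 + (HZ / 2) * ioprio / (CGRP_PRIOS - 1);
}

int main(void)
{
    for (unsigned int p = 0; p < CGRP_PRIOS; p++)
        printf("ioprio %u -> slice of %u jiffies\n", p, cgroup_slice(p));
    return 0;
}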
4. IMPLEMENTATION

This section explains the meaning of each patch, which seems an appropriate way to present the low-level design. Notice that each patch is constructed so that the kernel can be compiled and run after applying it. The patches are against the linux-2.6.24-rc3-mm2 kernel, because they need to be incorporated into the -mm tree before being accepted into vanilla Linux.

diff-00-cfq-cgroup-move
This patch just moves the internal CFQ structures to a separate header file, since the cfq_queue and cfq_data structures are needed in CFQ_SUBSYS.

diff-01-cfq-cgroup-kconfig
Simply adds a description to the Kconfig file. Notice that IOSCHED_CFQ is required to be set to "yes". The point is that CGroups subsystems can be compiled into the kernel but not as modules, while I/O schedulers can be compiled as modules, so many questions arise about what should happen when the CFQ module is loaded or unloaded. Moreover, it is possible to switch I/O schedulers on the fly. This means the system has a high degree of freedom; the limitation is imposed to decrease this degree of freedom and avoid having to handle some of these scenarios.

diff-02-cfq-cgroup-subsys
This patch registers the CFQ_SUBSYS subsystem in the CGroups framework. It introduces the "ioprio" file in the cgroup filesystem, which is used to get and set the priority of a cgroup. The cfq_ss_css structure appears in this patch: a per-cgroup data structure that for now holds only the ioprio of the cgroup; subsequent patches will extend it.

diff-03-cfq-cgroup-mainstruct
The main CFQ data structures (cfq_data and cfq_queue) are extended and the fundamental CFQ_SUBSYS data structure is introduced: cfq_cgroup_data. This structure is allocated per block device (per cfq_data, speaking in CFQ terms) for each cgroup. It contains the data from the cfq_data structure that should now be per cgroup (for example, the cfqq service tree). The cfq_ss_css structure is extended to hold the list of cfq_cgroup_data structures used by the cgroup. Each time a process issues its first request to a device, we look in this list for the appropriate cfq_cgroup_data; if there is no appropriate structure in the list, a new cfq_cgroup_data is allocated and placed in it. This is performed by the cfq_cgrp_findcreate() function. Notice that each cfqq has a link to its cfq_cgroup_data (so the owner of a cfqq can easily be found) and that the cfq_data structure contains a list of active cfq_cgroup_data as well as the current active cgroup. The current active cgroup is the cgroup that works with the device at the moment; an active cgroup is a cgroup that has pending requests. At any given moment there is only one current active cgroup, but there can be multiple active cgroups. The list of active cfq_cgroup_data will be used for cgroup scheduling in later patches. This list is always kept sorted in dispatch order, so the next cgroup to service is found in O(1) time. The two lists described above form a lattice (an illustrative sketch of this lattice is given after the diff-05 description below). In the case of two devices and three active cgroups the following figure depicts the lattice:

               cfq_data 1           cfq_data 2
                   |                    |
cgroup 1 --- cfq_cgroup_data ---- cfq_cgroup_data
                   |                    |
cgroup 2 --- cfq_cgroup_data ---- cfq_cgroup_data
                   |                    |
cgroup 3 --- cfq_cgroup_data ---- cfq_cgroup_data

This patch also introduces some helper functions that will be used in later patches. For example, the cfq_ss_exit_queue() function will be called when a block device is deregistered. Note that compilation with CGROUP_CFQ=n is taken care of by adding appropriate defines to the cfq-cgroup.h header file.

diff-04-cfq-cgroup-creat-elim
This patch adds a hook inside the CFQ scheduler that forms the lattice described above: on each new cfqq allocation the cfq_cgrp_findcreate() function is called.

diff-05-cfq-cgroup-switch
As mentioned above, the cfqq service tree (a member of cfq_data) is moved to cfq_cgroup_data, i.e., it becomes per cgroup. Consequently, every place where the per-device service tree was used has to be switched to the per-cgroup, per-device tree. Functions that operate on cfqqs already know which cfq_cgroup_data to use, because each cfqq holds a link to the appropriate cfq_cgroup_data; other functions use the current active cgroup.
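To make the diff-03 lattice more concrete, here is a rough user-space model of the structure. The types below are simplified, hypothetical stand-ins for cfq_ss_css, cfq_data and cfq_cgroup_data, and findcreate() is a stand-in for cfq_cgrp_findcreate(); the actual patch code differs (in particular, it keeps the per-device list sorted in dispatch order and stores the per-cgroup service tree in the structure).

#include <stdio.h>
#include <stdlib.h>

struct toy_cgroup_data;

struct toy_cgroup {                        /* stands in for cfq_ss_css */
    int ioprio;
    struct toy_cgroup_data *per_dev;       /* per-device data of this cgroup */
};

struct toy_device {                        /* stands in for cfq_data */
    const char *name;
    struct toy_cgroup_data *per_grp;       /* per-cgroup data of this device */
};

struct toy_cgroup_data {                   /* stands in for cfq_cgroup_data */
    struct toy_cgroup *grp;
    struct toy_device *dev;
    struct toy_cgroup_data *next_in_grp;   /* link in the cgroup's list */
    struct toy_cgroup_data *next_in_dev;   /* link in the device's list */
    /* the kernel structure also holds the per-cgroup cfqq service tree */
};

/* Stand-in for cfq_cgrp_findcreate(): look up, allocate on first request. */
static struct toy_cgroup_data *findcreate(struct toy_cgroup *g, struct toy_device *d)
{
    struct toy_cgroup_data *cd;

    for (cd = g->per_dev; cd; cd = cd->next_in_grp)
        if (cd->dev == d)
            return cd;

    cd = calloc(1, sizeof(*cd));
    if (!cd)
        return NULL;
    cd->grp = g;
    cd->dev = d;
    cd->next_in_grp = g->per_dev;
    g->per_dev = cd;
    cd->next_in_dev = d->per_grp;
    d->per_grp = cd;
    return cd;
}

int main(void)
{
    struct toy_cgroup g1 = { .ioprio = 7 }, g2 = { .ioprio = 4 };
    struct toy_device sda = { .name = "sda" }, sdb = { .name = "sdb" };

    findcreate(&g1, &sda);
    findcreate(&g1, &sdb);
    findcreate(&g2, &sda);

    /* Walk one column of the lattice: all cgroups active on sda. */
    for (struct toy_cgroup_data *cd = sda.per_grp; cd; cd = cd->next_in_dev)
        printf("%s: cgroup with ioprio %d\n", cd->dev->name, cd->grp->ioprio);
    return 0;
}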
diff-06-cfq-cgroup-force
In the "force dispatch" case the I/O scheduler must flush all pending requests in the system to the disk. We handle this case by iterating over all cfq_cgroup_data and flushing their requests.

diff-07-cfq-cgroup-srvtreerm
The patches above completely moved the service tree from cfq_data to cfq_cgroup_data. This patch just removes the leftovers of the original tree; they were preserved until now only to keep the kernel compiling and working.

diff-08-cfq-cgroup-cgrpsched
Until this patch only one cgroup was working on a device, the top cgroup; in other words, the current active cgroup was constant. But the current active cgroup needs to be switched when its time slice expires, and this patch does precisely that. Several hooks are added to the CFQ code. First, while selecting the next cfqq to service (cfq_get_next_queue()) we also select the current active cgroup by calling the cfq_cgrp_schedule_active() function. Second, on adding a new request we increase a per-cgroup request counter; this counter is used to switch the current active cgroup away when no requests originate from it. The time-slice scaling code is also introduced here: the minimal cgroup time slice is HZ/2, and it increases linearly to HZ depending on the priority.

diff-09-cfq-cgroup-rqindrv
Requests already in the driver should also be taken into account. This patch handles them by introducing an additional counter.

diff-10-cfq-cgroup-fixes
This patch will be partially merged into the patches above, since it just introduces some bug fixes. The remaining part of the patch will disappear, as it is needed only for internal rough debugging.

5. EVALUATION

Several simple tests were written to check basic CFQ_SUBSYS functionality. This section describes two tests and their results. A separate hard drive was used for the tests: 1GB files were written to the drive and readers worked on them. After each test the filesystem was remounted to drop the caches.

In the first test 8 cgroups were created with two readers in each. Each reader sequentially reads a separate file for 60 seconds and reports the number of bytes read. Each cgroup was assigned a different priority, from 0 to 7. Here are the results of two runs; reader{i}-{j} means reader {j} from the cgroup with priority {i}.

RUN 1
READER       BYTES READ    CGROUP SUM
reader0-0      23314432
reader0-1      17940480      41254912
reader1-0      22265856
reader1-1      22528000      44793856
reader2-0      27508736
reader2-1      28950528      56459264
reader3-0      29605888
reader3-1      32096256      61702144
reader4-0      30654464
reader4-1      31834112      62488576
reader5-0      35373056
reader5-1      38780928      74153984
reader6-0      28033024
reader6-1      30654464      58687488
reader7-0      48865280
reader7-1      39305216      88170496

RUN 2
READER       BYTES READ    CGROUP SUM
reader0-0      21340160
reader0-1      20561920      41902080
reader1-0      18857984
reader1-1      17547264      36405248
reader2-0      24100864
reader2-1      31703040      55803904
reader3-0      27508736
reader3-1      28557312      56066048
reader4-0      26984448
reader4-1      28557312      55541760
reader5-0      32882688
reader5-1      34979840      67862528
reader6-0      42450944
reader6-1      37994496      80445440
reader7-0      50184192
reader7-1      43892736      94076928

The results demonstrate that the system works, although there are some deviations from ideal behavior. These can be explained by disk geometry, filesystem layout and other factors. The good news is that running the same test on unmodified CFQ (with the priorities applied per process) produces the same kind of deviations [06].
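The reader programs used for these measurements are not included in the paper; the following is only a hypothetical sketch of a reader along the lines described above (sequentially read one file for 60 seconds and report the byte count; the file path is made up).

#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "/mnt/test/file0";  /* hypothetical path */
    char buf[64 * 1024];
    long long total = 0;
    time_t start = time(NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        perror(path);
        return 1;
    }

    /* Read sequentially for 60 seconds and count the bytes. */
    while (time(NULL) - start < 60) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n <= 0)
            break;   /* end of file or read error */
        total += n;
    }

    printf("%s: %lld bytes read\n", path, total);
    close(fd);
    return 0;
}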
The next test demonstrates that CFQ_SUBSYS provides isolation. There is one high-priority cgroup with only two readers; the second cgroup is low priority and has an increasing number of readers. Note that the number of bytes read by each low-priority reader decreases as the number of readers grows, so that their sum remains approximately the same:

Two readers in the low-priority cgroup:
reader7-0: 193970176
reader7-1: 191873024
reader0-0:  88449024
reader0-1:  81248256

Three readers:
reader7-0: 169328640
reader7-1: 162504704
reader0-0:  61587456
reader0-1:  57393152
reader0-2:  62373888

Four readers:
reader7-0: 176799744
reader7-1: 158056448
reader0-0:  55681024
reader0-1:  47693824
reader0-2:  39436288
reader0-3:  42188800

Five readers:
reader7-0: 154910720
reader7-1: 158973952
reader0-0:  41140224
reader0-1:  47161344
reader0-2:  39960576
reader0-3:  36814848
reader0-4:  41533440

Six readers:
reader7-0: 160284672
reader7-1: 154648576
reader0-0:  31703040
reader0-1:  33275904
reader0-2:  29868032
reader0-3:  31965184
reader0-4:  31571968
reader0-5:  30916608

Seven readers:
reader7-0: 160931840
reader7-1: 152027136
reader0-0:  35373056
reader0-1:  31703040
reader0-2:  28033024
reader0-3:  24887296
reader0-4:  28033024
reader0-5:  26853376
reader0-6:  24231936

Eight readers:
reader7-0: 162512896
reader7-1: 164872192
reader0-0:  25018368
reader0-1:  24100864
reader0-2:  17547264
reader0-3:  19775488
reader0-4:  25935872
reader0-5:  25542656
reader0-6:  24756224
reader0-7:  22003712

A comparison of bandwidth between the unmodified kernel and the kernel with the CFQ_SUBSYS feature was also performed. For a small number of devices no overhead was observed.

6. FUTURE WORK

A lot of future work is expected on this project. The first step is to write a comprehensive test suite. It should be based on the fio tool from Jens Axboe, because fio is widely used in the mainstream community and its results will be a trusted proof of the reliability of the system. The basic time slice should be configurable, and its default value needs to be determined by running tests. Support for priority classes could be added. An important improvement is to use a binary tree instead of the list, to avoid overhead on nodes with a large number of block devices. Also, most cgroups are bound to a particular block device, so the pointer to cfq_cgroup_data can be cached instead of traversing the list every time. The patches that change the CFQ code should be made more transparent; for example, macros could be used to access the service tree. Write support is a big issue to think about.

7. ACKNOWLEDGEMENTS

Ashish Misra helped me a lot by reviewing this text and fixing my English; he gave several pieces of invaluable advice that improved the readability of the text. Jens Axboe, the maintainer of the Linux block layer and the author of CFQ, graciously answered many of my questions about the CFQ design; it is worth mentioning that in conversations with him several bugs in CFQ were found and some improvements were added. Pavel Emelianov, one of the major contributors to CGroups development, was kind enough to answer my questions about CGroups internals. I also cannot but thank Erez Zadok, the Operating Systems class (CSE 506) professor at Stony Brook University, for his great lectures.
Menage, "Adding Generic Process Containers to the Linux Kernel", Proceedings of the Linux Symposium, June 2007 [02] Linux sources: Documentation/cgroups.txt [03] Pavel Emelianov, Denis Lunev, Kirill Korotaev, "Resource Management: Beancounters", Proceedings of the Linux Symposium, June 2007 [04] OpenVZ project, http://openvz.org/ [05] Robert Love, "Linux Kernel Development", 2nd edition, Novell Press, June 2005 [06] Jens Axboe, "CFQ IO Scheduler", Talk at LINUX.CONF.AU, January 2007, http://www.linux.org.au/conf/2007/talk/123.html (video record) [07] Daniel P. Bovet, Marco Cesati, "Understanding the Linux Kernel", 3d edition, O'REILLY, November 2005 [08] Linux VServer project, http://www.linux-vserver.org/ [09] Class-based Kernel Resource Management (CKRM) project, http://ckrm.sourceforge.net/ [10] Balbir Singh, "Resource Management in Linux", MS Dissertation, March 2007 [11] Linux sources: Documentation/cpusets.txt [12] Linux Vserver Project, http://linux-vserver.org