| .. SPDX-License-Identifier: GPL-2.0 |
| |
| ==================== |
| Utilization Clamping |
| ==================== |
| |
| 1. Introduction |
| =============== |
| |
| Utilization clamping, also known as util clamp or uclamp, is a scheduler |
| feature that allows user space to help in managing the performance requirement |
| of tasks. It was introduced in v5.3 release. The CGroup support was merged in |
| v5.4. |
| |
| Uclamp is a hinting mechanism that allows the scheduler to understand the |
| performance requirements and restrictions of the tasks, thus it helps the |
| scheduler to make a better decision. And when schedutil cpufreq governor is |
| used, util clamp will influence the CPU frequency selection as well. |
| |
| Since the scheduler and schedutil are both driven by PELT (util_avg) signals, |
| util clamp acts on that to achieve its goal by clamping the signal to a certain |
| point; hence the name. That is, by clamping utilization we are making the |
| system run at a certain performance point. |
| |
| The right way to view util clamp is as a mechanism to make request or hint on |
| performance constraints. It consists of two tunables: |
| |
| * UCLAMP_MIN, which sets the lower bound. |
| * UCLAMP_MAX, which sets the upper bound. |
| |
| These two bounds will ensure a task will operate within this performance range |
| of the system. UCLAMP_MIN implies boosting a task, while UCLAMP_MAX implies |
| capping a task. |
| |
| One can tell the system (scheduler) that some tasks require a minimum |
| performance point to operate at to deliver the desired user experience. Or one |
| can tell the system that some tasks should be restricted from consuming too |
| much resources and should not go above a specific performance point. Viewing |
| the uclamp values as performance points rather than utilization is a better |
| abstraction from user space point of view. |
| |
| As an example, a game can use util clamp to form a feedback loop with its |
| perceived Frames Per Second (FPS). It can dynamically increase the minimum |
| performance point required by its display pipeline to ensure no frame is |
| dropped. It can also dynamically 'prime' up these tasks if it knows in the |
| coming few hundred milliseconds a computationally intensive scene is about to |
| happen. |
| |
| On mobile hardware where the capability of the devices varies a lot, this |
| dynamic feedback loop offers a great flexibility to ensure best user experience |
| given the capabilities of any system. |
| |
| Of course a static configuration is possible too. The exact usage will depend |
| on the system, application and the desired outcome. |
| |
| Another example is in Android where tasks are classified as background, |
| foreground, top-app, etc. Util clamp can be used to constrain how much |
| resources background tasks are consuming by capping the performance point they |
| can run at. This constraint helps reserve resources for important tasks, like |
| the ones belonging to the currently active app (top-app group). Beside this |
| helps in limiting how much power they consume. This can be more obvious in |
| heterogeneous systems (e.g. Arm big.LITTLE); the constraint will help bias the |
| background tasks to stay on the little cores which will ensure that: |
| |
| 1. The big cores are free to run top-app tasks immediately. top-app |
| tasks are the tasks the user is currently interacting with, hence |
| the most important tasks in the system. |
| 2. They don't run on a power hungry core and drain battery even if they |
| are CPU intensive tasks. |
| |
| .. note:: |
| **little cores**: |
| CPUs with capacity < 1024 |
| |
| **big cores**: |
| CPUs with capacity = 1024 |
| |
| By making these uclamp performance requests, or rather hints, user space can |
| ensure system resources are used optimally to deliver the best possible user |
| experience. |
| |
| Another use case is to help with **overcoming the ramp up latency inherit in |
| how scheduler utilization signal is calculated**. |
| |
| On the other hand, a busy task for instance that requires to run at maximum |
| performance point will suffer a delay of ~200ms (PELT HALFIFE = 32ms) for the |
| scheduler to realize that. This is known to affect workloads like gaming on |
| mobile devices where frames will drop due to slow response time to select the |
| higher frequency required for the tasks to finish their work in time. Setting |
| UCLAMP_MIN=1024 will ensure such tasks will always see the highest performance |
| level when they start running. |
| |
| The overall visible effect goes beyond better perceived user |
| experience/performance and stretches to help achieve a better overall |
| performance/watt if used effectively. |
| |
| User space can form a feedback loop with the thermal subsystem too to ensure |
| the device doesn't heat up to the point where it will throttle. |
| |
| Both SCHED_NORMAL/OTHER and SCHED_FIFO/RR honour uclamp requests/hints. |
| |
| In the SCHED_FIFO/RR case, uclamp gives the option to run RT tasks at any |
| performance point rather than being tied to MAX frequency all the time. Which |
| can be useful on general purpose systems that run on battery powered devices. |
| |
| Note that by design RT tasks don't have per-task PELT signal and must always |
| run at a constant frequency to combat undeterministic DVFS rampup delays. |
| |
| Note that using schedutil always implies a single delay to modify the frequency |
| when an RT task wakes up. This cost is unchanged by using uclamp. Uclamp only |
| helps picking what frequency to request instead of schedutil always requesting |
| MAX for all RT tasks. |
| |
| See :ref:`section 3.4 <uclamp-default-values>` for default values and |
| :ref:`3.4.1 <sched-util-clamp-min-rt-default>` on how to change RT tasks |
| default value. |
| |
| 2. Design |
| ========= |
| |
| Util clamp is a property of every task in the system. It sets the boundaries of |
| its utilization signal; acting as a bias mechanism that influences certain |
| decisions within the scheduler. |
| |
| The actual utilization signal of a task is never clamped in reality. If you |
| inspect PELT signals at any point of time you should continue to see them as |
| they are intact. Clamping happens only when needed, e.g: when a task wakes up |
| and the scheduler needs to select a suitable CPU for it to run on. |
| |
| Since the goal of util clamp is to allow requesting a minimum and maximum |
| performance point for a task to run on, it must be able to influence the |
| frequency selection as well as task placement to be most effective. Both of |
| which have implications on the utilization value at CPU runqueue (rq for short) |
| level, which brings us to the main design challenge. |
| |
| When a task wakes up on an rq, the utilization signal of the rq will be |
| affected by the uclamp settings of all the tasks enqueued on it. For example if |
| a task requests to run at UTIL_MIN = 512, then the util signal of the rq needs |
| to respect to this request as well as all other requests from all of the |
| enqueued tasks. |
| |
| To be able to aggregate the util clamp value of all the tasks attached to the |
| rq, uclamp must do some housekeeping at every enqueue/dequeue, which is the |
| scheduler hot path. Hence care must be taken since any slow down will have |
| significant impact on a lot of use cases and could hinder its usability in |
| practice. |
| |
| The way this is handled is by dividing the utilization range into buckets |
| (struct uclamp_bucket) which allows us to reduce the search space from every |
| task on the rq to only a subset of tasks on the top-most bucket. |
| |
| When a task is enqueued, the counter in the matching bucket is incremented, |
| and on dequeue it is decremented. This makes keeping track of the effective |
| uclamp value at rq level a lot easier. |
| |
| As tasks are enqueued and dequeued, we keep track of the current effective |
| uclamp value of the rq. See :ref:`section 2.1 <uclamp-buckets>` for details on |
| how this works. |
| |
| Later at any path that wants to identify the effective uclamp value of the rq, |
| it will simply need to read this effective uclamp value of the rq at that exact |
| moment of time it needs to take a decision. |
| |
| For task placement case, only Energy Aware and Capacity Aware Scheduling |
| (EAS/CAS) make use of uclamp for now, which implies that it is applied on |
| heterogeneous systems only. |
| When a task wakes up, the scheduler will look at the current effective uclamp |
| value of every rq and compare it with the potential new value if the task were |
| to be enqueued there. Favoring the rq that will end up with the most energy |
| efficient combination. |
| |
| Similarly in schedutil, when it needs to make a frequency update it will look |
| at the current effective uclamp value of the rq which is influenced by the set |
| of tasks currently enqueued there and select the appropriate frequency that |
| will satisfy constraints from requests. |
| |
| Other paths like setting overutilization state (which effectively disables EAS) |
| make use of uclamp as well. Such cases are considered necessary housekeeping to |
| allow the 2 main use cases above and will not be covered in detail here as they |
| could change with implementation details. |
| |
| .. _uclamp-buckets: |
| |
| 2.1. Buckets |
| ------------ |
| |
| :: |
| |
| [struct rq] |
| |
| (bottom) (top) |
| |
| 0 1024 |
| | | |
| +-----------+-----------+-----------+---- ----+-----------+ |
| | Bucket 0 | Bucket 1 | Bucket 2 | ... | Bucket N | |
| +-----------+-----------+-----------+---- ----+-----------+ |
| : : : |
| +- p0 +- p3 +- p4 |
| : : |
| +- p1 +- p5 |
| : |
| +- p2 |
| |
| |
| .. note:: |
| The diagram above is an illustration rather than a true depiction of the |
| internal data structure. |
| |
| To reduce the search space when trying to decide the effective uclamp value of |
| an rq as tasks are enqueued/dequeued, the whole utilization range is divided |
| into N buckets where N is configured at compile time by setting |
| CONFIG_UCLAMP_BUCKETS_COUNT. By default it is set to 5. |
| |
| The rq has a bucket for each uclamp_id tunables: [UCLAMP_MIN, UCLAMP_MAX]. |
| |
| The range of each bucket is 1024/N. For example, for the default value of |
| 5 there will be 5 buckets, each of which will cover the following range: |
| |
| :: |
| |
| DELTA = round_closest(1024/5) = 204.8 = 205 |
| |
| Bucket 0: [0:204] |
| Bucket 1: [205:409] |
| Bucket 2: [410:614] |
| Bucket 3: [615:819] |
| Bucket 4: [820:1024] |
| |
| When a task p with following tunable parameters |
| |
| :: |
| |
| p->uclamp[UCLAMP_MIN] = 300 |
| p->uclamp[UCLAMP_MAX] = 1024 |
| |
| is enqueued into the rq, bucket 1 will be incremented for UCLAMP_MIN and bucket |
| 4 will be incremented for UCLAMP_MAX to reflect the fact the rq has a task in |
| this range. |
| |
| The rq then keeps track of its current effective uclamp value for each |
| uclamp_id. |
| |
| When a task p is enqueued, the rq value changes to: |
| |
| :: |
| |
| // update bucket logic goes here |
| rq->uclamp[UCLAMP_MIN] = max(rq->uclamp[UCLAMP_MIN], p->uclamp[UCLAMP_MIN]) |
| // repeat for UCLAMP_MAX |
| |
| Similarly, when p is dequeued the rq value changes to: |
| |
| :: |
| |
| // update bucket logic goes here |
| rq->uclamp[UCLAMP_MIN] = search_top_bucket_for_highest_value() |
| // repeat for UCLAMP_MAX |
| |
| When all buckets are empty, the rq uclamp values are reset to system defaults. |
| See :ref:`section 3.4 <uclamp-default-values>` for details on default values. |
| |
| |
| 2.2. Max aggregation |
| -------------------- |
| |
| Util clamp is tuned to honour the request for the task that requires the |
| highest performance point. |
| |
| When multiple tasks are attached to the same rq, then util clamp must make sure |
| the task that needs the highest performance point gets it even if there's |
| another task that doesn't need it or is disallowed from reaching this point. |
| |
| For example, if there are multiple tasks attached to an rq with the following |
| values: |
| |
| :: |
| |
| p0->uclamp[UCLAMP_MIN] = 300 |
| p0->uclamp[UCLAMP_MAX] = 900 |
| |
| p1->uclamp[UCLAMP_MIN] = 500 |
| p1->uclamp[UCLAMP_MAX] = 500 |
| |
| then assuming both p0 and p1 are enqueued to the same rq, both UCLAMP_MIN |
| and UCLAMP_MAX become: |
| |
| :: |
| |
| rq->uclamp[UCLAMP_MIN] = max(300, 500) = 500 |
| rq->uclamp[UCLAMP_MAX] = max(900, 500) = 900 |
| |
| As we shall see in :ref:`section 5.1 <uclamp-capping-fail>`, this max |
| aggregation is the cause of one of limitations when using util clamp, in |
| particular for UCLAMP_MAX hint when user space would like to save power. |
| |
| 2.3. Hierarchical aggregation |
| ----------------------------- |
| |
| As stated earlier, util clamp is a property of every task in the system. But |
| the actual applied (effective) value can be influenced by more than just the |
| request made by the task or another actor on its behalf (middleware library). |
| |
| The effective util clamp value of any task is restricted as follows: |
| |
| 1. By the uclamp settings defined by the cgroup CPU controller it is attached |
| to, if any. |
| 2. The restricted value in (1) is then further restricted by the system wide |
| uclamp settings. |
| |
| :ref:`Section 3 <uclamp-interfaces>` discusses the interfaces and will expand |
| further on that. |
| |
| For now suffice to say that if a task makes a request, its actual effective |
| value will have to adhere to some restrictions imposed by cgroup and system |
| wide settings. |
| |
| The system will still accept the request even if effectively will be beyond the |
| constraints, but as soon as the task moves to a different cgroup or a sysadmin |
| modifies the system settings, the request will be satisfied only if it is |
| within new constraints. |
| |
| In other words, this aggregation will not cause an error when a task changes |
| its uclamp values, but rather the system may not be able to satisfy requests |
| based on those factors. |
| |
| 2.4. Range |
| ---------- |
| |
| Uclamp performance request has the range of 0 to 1024 inclusive. |
| |
| For cgroup interface percentage is used (that is 0 to 100 inclusive). |
| Just like other cgroup interfaces, you can use 'max' instead of 100. |
| |
| .. _uclamp-interfaces: |
| |
| 3. Interfaces |
| ============= |
| |
| 3.1. Per task interface |
| ----------------------- |
| |
| sched_setattr() syscall was extended to accept two new fields: |
| |
| * sched_util_min: requests the minimum performance point the system should run |
| at when this task is running. Or lower performance bound. |
| * sched_util_max: requests the maximum performance point the system should run |
| at when this task is running. Or upper performance bound. |
| |
| For example, the following scenario have 40% to 80% utilization constraints: |
| |
| :: |
| |
| attr->sched_util_min = 40% * 1024; |
| attr->sched_util_max = 80% * 1024; |
| |
| When task @p is running, **the scheduler should try its best to ensure it |
| starts at 40% performance level**. If the task runs for a long enough time so |
| that its actual utilization goes above 80%, the utilization, or performance |
| level, will be capped. |
| |
| The special value -1 is used to reset the uclamp settings to the system |
| default. |
| |
| Note that resetting the uclamp value to system default using -1 is not the same |
| as manually setting uclamp value to system default. This distinction is |
| important because as we shall see in system interfaces, the default value for |
| RT could be changed. SCHED_NORMAL/OTHER might gain similar knobs too in the |
| future. |
| |
| 3.2. cgroup interface |
| --------------------- |
| |
| There are two uclamp related values in the CPU cgroup controller: |
| |
| * cpu.uclamp.min |
| * cpu.uclamp.max |
| |
| When a task is attached to a CPU controller, its uclamp values will be impacted |
| as follows: |
| |
| * cpu.uclamp.min is a protection as described in :ref:`section 3-3 of cgroup |
| v2 documentation <cgroupv2-protections-distributor>`. |
| |
| If a task uclamp_min value is lower than cpu.uclamp.min, then the task will |
| inherit the cgroup cpu.uclamp.min value. |
| |
| In a cgroup hierarchy, effective cpu.uclamp.min is the max of (child, |
| parent). |
| |
| * cpu.uclamp.max is a limit as described in :ref:`section 3-2 of cgroup v2 |
| documentation <cgroupv2-limits-distributor>`. |
| |
| If a task uclamp_max value is higher than cpu.uclamp.max, then the task will |
| inherit the cgroup cpu.uclamp.max value. |
| |
| In a cgroup hierarchy, effective cpu.uclamp.max is the min of (child, |
| parent). |
| |
| For example, given following parameters: |
| |
| :: |
| |
| p0->uclamp[UCLAMP_MIN] = // system default; |
| p0->uclamp[UCLAMP_MAX] = // system default; |
| |
| p1->uclamp[UCLAMP_MIN] = 40% * 1024; |
| p1->uclamp[UCLAMP_MAX] = 50% * 1024; |
| |
| cgroup0->cpu.uclamp.min = 20% * 1024; |
| cgroup0->cpu.uclamp.max = 60% * 1024; |
| |
| cgroup1->cpu.uclamp.min = 60% * 1024; |
| cgroup1->cpu.uclamp.max = 100% * 1024; |
| |
| when p0 and p1 are attached to cgroup0, the values become: |
| |
| :: |
| |
| p0->uclamp[UCLAMP_MIN] = cgroup0->cpu.uclamp.min = 20% * 1024; |
| p0->uclamp[UCLAMP_MAX] = cgroup0->cpu.uclamp.max = 60% * 1024; |
| |
| p1->uclamp[UCLAMP_MIN] = 40% * 1024; // intact |
| p1->uclamp[UCLAMP_MAX] = 50% * 1024; // intact |
| |
| when p0 and p1 are attached to cgroup1, these instead become: |
| |
| :: |
| |
| p0->uclamp[UCLAMP_MIN] = cgroup1->cpu.uclamp.min = 60% * 1024; |
| p0->uclamp[UCLAMP_MAX] = cgroup1->cpu.uclamp.max = 100% * 1024; |
| |
| p1->uclamp[UCLAMP_MIN] = cgroup1->cpu.uclamp.min = 60% * 1024; |
| p1->uclamp[UCLAMP_MAX] = 50% * 1024; // intact |
| |
| Note that cgroup interfaces allows cpu.uclamp.max value to be lower than |
| cpu.uclamp.min. Other interfaces don't allow that. |
| |
| 3.3. System interface |
| --------------------- |
| |
| 3.3.1 sched_util_clamp_min |
| -------------------------- |
| |
| System wide limit of allowed UCLAMP_MIN range. By default it is set to 1024, |
| which means that permitted effective UCLAMP_MIN range for tasks is [0:1024]. |
| By changing it to 512 for example the range reduces to [0:512]. This is useful |
| to restrict how much boosting tasks are allowed to acquire. |
| |
| Requests from tasks to go above this knob value will still succeed, but |
| they won't be satisfied until it is more than p->uclamp[UCLAMP_MIN]. |
| |
| The value must be smaller than or equal to sched_util_clamp_max. |
| |
| 3.3.2 sched_util_clamp_max |
| -------------------------- |
| |
| System wide limit of allowed UCLAMP_MAX range. By default it is set to 1024, |
| which means that permitted effective UCLAMP_MAX range for tasks is [0:1024]. |
| |
| By changing it to 512 for example the effective allowed range reduces to |
| [0:512]. This means is that no task can run above 512, which implies that all |
| rqs are restricted too. IOW, the whole system is capped to half its performance |
| capacity. |
| |
| This is useful to restrict the overall maximum performance point of the system. |
| For example, it can be handy to limit performance when running low on battery |
| or when the system wants to limit access to more energy hungry performance |
| levels when it's in idle state or screen is off. |
| |
| Requests from tasks to go above this knob value will still succeed, but they |
| won't be satisfied until it is more than p->uclamp[UCLAMP_MAX]. |
| |
| The value must be greater than or equal to sched_util_clamp_min. |
| |
| .. _uclamp-default-values: |
| |
| 3.4. Default values |
| ------------------- |
| |
| By default all SCHED_NORMAL/SCHED_OTHER tasks are initialized to: |
| |
| :: |
| |
| p_fair->uclamp[UCLAMP_MIN] = 0 |
| p_fair->uclamp[UCLAMP_MAX] = 1024 |
| |
| That is, by default they're boosted to run at the maximum performance point of |
| changed at boot or runtime. No argument was made yet as to why we should |
| provide this, but can be added in the future. |
| |
| For SCHED_FIFO/SCHED_RR tasks: |
| |
| :: |
| |
| p_rt->uclamp[UCLAMP_MIN] = 1024 |
| p_rt->uclamp[UCLAMP_MAX] = 1024 |
| |
| That is by default they're boosted to run at the maximum performance point of |
| the system which retains the historical behavior of the RT tasks. |
| |
| RT tasks default uclamp_min value can be modified at boot or runtime via |
| sysctl. See below section. |
| |
| .. _sched-util-clamp-min-rt-default: |
| |
| 3.4.1 sched_util_clamp_min_rt_default |
| ------------------------------------- |
| |
| Running RT tasks at maximum performance point is expensive on battery powered |
| devices and not necessary. To allow system developer to offer good performance |
| guarantees for these tasks without pushing it all the way to maximum |
| performance point, this sysctl knob allows tuning the best boost value to |
| address the system requirement without burning power running at maximum |
| performance point all the time. |
| |
| Application developer are encouraged to use the per task util clamp interface |
| to ensure they are performance and power aware. Ideally this knob should be set |
| to 0 by system designers and leave the task of managing performance |
| requirements to the apps. |
| |
| 4. How to use util clamp |
| ======================== |
| |
| Util clamp promotes the concept of user space assisted power and performance |
| management. At the scheduler level there is no info required to make the best |
| decision. However, with util clamp user space can hint to the scheduler to make |
| better decision about task placement and frequency selection. |
| |
| Best results are achieved by not making any assumptions about the system the |
| application is running on and to use it in conjunction with a feedback loop to |
| dynamically monitor and adjust. Ultimately this will allow for a better user |
| experience at a better perf/watt. |
| |
| For some systems and use cases, static setup will help to achieve good results. |
| Portability will be a problem in this case. How much work one can do at 100, |
| 200 or 1024 is different for each system. Unless there's a specific target |
| system, static setup should be avoided. |
| |
| There are enough possibilities to create a whole framework based on util clamp |
| or self contained app that makes use of it directly. |
| |
| 4.1. Boost important and DVFS-latency-sensitive tasks |
| ----------------------------------------------------- |
| |
| A GUI task might not be busy to warrant driving the frequency high when it |
| wakes up. However, it requires to finish its work within a specific time window |
| to deliver the desired user experience. The right frequency it requires at |
| wakeup will be system dependent. On some underpowered systems it will be high, |
| on other overpowered ones it will be low or 0. |
| |
| This task can increase its UCLAMP_MIN value every time it misses the deadline |
| to ensure on next wake up it runs at a higher performance point. It should try |
| to approach the lowest UCLAMP_MIN value that allows to meet its deadline on any |
| particular system to achieve the best possible perf/watt for that system. |
| |
| On heterogeneous systems, it might be important for this task to run on |
| a faster CPU. |
| |
| **Generally it is advised to perceive the input as performance level or point |
| which will imply both task placement and frequency selection**. |
| |
| 4.2. Cap background tasks |
| ------------------------- |
| |
| Like explained for Android case in the introduction. Any app can lower |
| UCLAMP_MAX for some background tasks that don't care about performance but |
| could end up being busy and consume unnecessary system resources on the system. |
| |
| 4.3. Powersave mode |
| ------------------- |
| |
| sched_util_clamp_max system wide interface can be used to limit all tasks from |
| operating at the higher performance points which are usually energy |
| inefficient. |
| |
| This is not unique to uclamp as one can achieve the same by reducing max |
| frequency of the cpufreq governor. It can be considered a more convenient |
| alternative interface. |
| |
| 4.4. Per-app performance restriction |
| ------------------------------------ |
| |
| Middleware/Utility can provide the user an option to set UCLAMP_MIN/MAX for an |
| app every time it is executed to guarantee a minimum performance point and/or |
| limit it from draining system power at the cost of reduced performance for |
| these apps. |
| |
| If you want to prevent your laptop from heating up while on the go from |
| compiling the kernel and happy to sacrifice performance to save power, but |
| still would like to keep your browser performance intact, uclamp makes it |
| possible. |
| |
| 5. Limitations |
| ============== |
| |
| .. _uclamp-capping-fail: |
| |
| 5.1. Capping frequency with uclamp_max fails under certain conditions |
| --------------------------------------------------------------------- |
| |
| If task p0 is capped to run at 512: |
| |
| :: |
| |
| p0->uclamp[UCLAMP_MAX] = 512 |
| |
| and it shares the rq with p1 which is free to run at any performance point: |
| |
| :: |
| |
| p1->uclamp[UCLAMP_MAX] = 1024 |
| |
| then due to max aggregation the rq will be allowed to reach max performance |
| point: |
| |
| :: |
| |
| rq->uclamp[UCLAMP_MAX] = max(512, 1024) = 1024 |
| |
| Assuming both p0 and p1 have UCLAMP_MIN = 0, then the frequency selection for |
| the rq will depend on the actual utilization value of the tasks. |
| |
| If p1 is a small task but p0 is a CPU intensive task, then due to the fact that |
| both are running at the same rq, p1 will cause the frequency capping to be left |
| from the rq although p1, which is allowed to run at any performance point, |
| doesn't actually need to run at that frequency. |
| |
| 5.2. UCLAMP_MAX can break PELT (util_avg) signal |
| ------------------------------------------------ |
| |
| PELT assumes that frequency will always increase as the signals grow to ensure |
| there's always some idle time on the CPU. But with UCLAMP_MAX, this frequency |
| increase will be prevented which can lead to no idle time in some |
| circumstances. When there's no idle time, a task will stuck in a busy loop, |
| which would result in util_avg being 1024. |
| |
| Combing with issue described below, this can lead to unwanted frequency spikes |
| when severely capped tasks share the rq with a small non capped task. |
| |
| As an example if task p, which have: |
| |
| :: |
| |
| p0->util_avg = 300 |
| p0->uclamp[UCLAMP_MAX] = 0 |
| |
| wakes up on an idle CPU, then it will run at min frequency (Fmin) this |
| CPU is capable of. The max CPU frequency (Fmax) matters here as well, |
| since it designates the shortest computational time to finish the task's |
| work on this CPU. |
| |
| :: |
| |
| rq->uclamp[UCLAMP_MAX] = 0 |
| |
| If the ratio of Fmax/Fmin is 3, then maximum value will be: |
| |
| :: |
| |
| 300 * (Fmax/Fmin) = 900 |
| |
| which indicates the CPU will still see idle time since 900 is < 1024. The |
| _actual_ util_avg will not be 900 though, but somewhere between 300 and 900. As |
| long as there's idle time, p->util_avg updates will be off by a some margin, |
| but not proportional to Fmax/Fmin. |
| |
| :: |
| |
| p0->util_avg = 300 + small_error |
| |
| Now if the ratio of Fmax/Fmin is 4, the maximum value becomes: |
| |
| :: |
| |
| 300 * (Fmax/Fmin) = 1200 |
| |
| which is higher than 1024 and indicates that the CPU has no idle time. When |
| this happens, then the _actual_ util_avg will become: |
| |
| :: |
| |
| p0->util_avg = 1024 |
| |
| If task p1 wakes up on this CPU, which have: |
| |
| :: |
| |
| p1->util_avg = 200 |
| p1->uclamp[UCLAMP_MAX] = 1024 |
| |
| then the effective UCLAMP_MAX for the CPU will be 1024 according to max |
| aggregation rule. But since the capped p0 task was running and throttled |
| severely, then the rq->util_avg will be: |
| |
| :: |
| |
| p0->util_avg = 1024 |
| p1->util_avg = 200 |
| |
| rq->util_avg = 1024 |
| rq->uclamp[UCLAMP_MAX] = 1024 |
| |
| Hence lead to a frequency spike since if p0 wasn't throttled we should get: |
| |
| :: |
| |
| p0->util_avg = 300 |
| p1->util_avg = 200 |
| |
| rq->util_avg = 500 |
| |
| and run somewhere near mid performance point of that CPU, not the Fmax we get. |
| |
| 5.3. Schedutil response time issues |
| ----------------------------------- |
| |
| schedutil has three limitations: |
| |
| 1. Hardware takes non-zero time to respond to any frequency change |
| request. On some platforms can be in the order of few ms. |
| 2. Non fast-switch systems require a worker deadline thread to wake up |
| and perform the frequency change, which adds measurable overhead. |
| 3. schedutil rate_limit_us drops any requests during this rate_limit_us |
| window. |
| |
| If a relatively small task is doing critical job and requires a certain |
| performance point when it wakes up and starts running, then all these |
| limitations will prevent it from getting what it wants in the time scale it |
| expects. |
| |
| This limitation is not only impactful when using uclamp, but will be more |
| prevalent as we no longer gradually ramp up or down. We could easily be |
| jumping between frequencies depending on the order tasks wake up, and their |
| respective uclamp values. |
| |
| We regard that as a limitation of the capabilities of the underlying system |
| itself. |
| |
| There is room to improve the behavior of schedutil rate_limit_us, but not much |
| to be done for 1 or 2. They are considered hard limitations of the system. |