
From: Martin Pieuchot <mpi@openbsd.org>
Subject: Re: powersave CPU policy
To: tech@openbsd.org
Date: Tue, 18 Jun 2024 17:28:26 +0200

On 18/06/24(Tue) 13:05, Kirill A. Korinsky wrote:
> On Tue, 18 Jun 2024 10:11:32 +0100,
> Martin Pieuchot <mpi@openbsd.org> wrote:
> [...]
> 
> > > With two changes:
> > > 
> > > 1. Unused CPU cores are halted and moved to deepest C-state to save some
> > >    power.
> > 
> > The hard question here is how we define "unused".  I know that your
> > diff uses `CP_IDLE' but I'd like to be sure there isn't a better
> > alternative; in 2024 I believe there is already prior work we can
> > learn from.
> >
> > See, the problem with CP_IDLE is that it is yet another subsystem
> > "guessing" what the CPU usage is.  So yes, your joint work with solene@
> > might be an improvement over what is currently in tree; sadly it is
> > based on the same bias.  I believe we can do better.
> 
> And the next question is what to do with proc that is restricted to a
> specific CPU core?

That's an easy one.  If a CPU has a pegged process in a runnable state
it should execute it.
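In the terms of the loop I sketch further down, with
cpu_has_pegged_runnable() being a made-up helper that checks the
per-CPU run queue for threads pinned there (e.g. via
sched_peg_curproc()):

	/*
	 * Sketch only: a pegged runnable thread overrides any "not
	 * needed" decision.  cpu_is_not_needed() and
	 * cpu_has_pegged_runnable() are hypothetical;
	 * cpu_idle_cycle() is what the idle loop uses today to halt.
	 */
	while (cpu_is_not_needed(ci) && !cpu_has_pegged_runnable(ci))
		cpu_idle_cycle();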

> Frankly, this question kills the part about halting CPU cores. Handling it
> makes things much more complicated.

Indeed. 

> So I've abandoned this approach as too complicated and dangerous.

I would not abandon it completely.  It might make sense to be able to
add/remove CPUs to/from the scheduler at some point, as sketched below.
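We already do something close to this at suspend time with
sched_stop_secondary_cpus().  A rough sketch of a runtime variant:

	/*
	 * Rough sketch: stop placing threads on a CPU, drain it,
	 * then let it sleep.  sched_all_cpus and cpuset_del() exist
	 * in kern_sched.c; sched_evict() and cpu_enter_cstate()
	 * are hypothetical.
	 */
	void
	sched_offline_cpu(struct cpu_info *ci)
	{
		cpuset_del(&sched_all_cpus, ci); /* no new threads land here */
		sched_evict(ci);		 /* migrate its run queue away */
		cpu_enter_cstate(ci);		 /* park it in a deep C-state */
	}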

> > This is the second hard question which depends on the machine (laptop vs
> > desktop), its usage, the generation of CPU, the kind of CPU (big vs
> > LITTLE), its architecture...  We cannot have a subsystem "designed" only
> > for modernish x86 laptops with turbo mode CPUs.
> > 
> > Plus we cannot add another sysctl like that.  I understand why you did it.
> > Sadly this is a workaround for the old hw.setperf interface.  I doubt this
> > interface is still relevant in 2024 when it comes to improving power
> > consumption.
> >
> 
> Just to put the issue with the current auto mode on the table: it switches
> to 100% CPU speed when connected to the AC adapter.  The machine I am using
> right now has an interesting bug: at 100% CPU speed it needs more power than
> it can get from the AC adapter, so it drains the battery a bit as a result.
> Yes, it doesn't charge on the AC adapter with the default auto policy.  The
> machine has an Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz.

That's an interesting bug that we should probably fix.  I'm not sure we
want an MI change for it.  If running the CPU at 100% performance is too
much to charge the battery, then I'd suggest that acpicpu(4), or whatever
BIOS-related driver your machine uses, needs to be adapted.

> > I don't understand why it seems safer.  My point is that the IDLE thread
> > is designed like the following loop:
> > 
> > 	while (cpu_has_nothing_enqueued(ci))
> > 		cpu_do_nothing_and_save_energy(ci);
> > 
> > This is bad for what you want to achieve and for what I want to achieve.
> > Instead I'd rather see a loop like:
> > 
> > 	while (cpu_is_not_needed(ci))
> > 		cpu_do_nothing_and_save_energy(ci);
> > 
> > So the new question is when do we need a CPU and at which frequency do
> > we need it?
> > 
> > To me this is something that has to be designed from scratch based on
> > the existing state of the art.  Building something on top of the
> > hw.setperf mechanism is not going to work.
> > 
> 
> I agree, and would like to share a naive implementation of a Linux
> ondemand-like policy.  I'm not suggesting committing this diff; it's more
> like another proof of concept which I'd like to share.

IMHO such an algorithm is much better suited for a general purpose OS.
The threshold of 80% is interesting and the dynamic change makes sense
to me.

How does this measure compare to the existing one?  What are the
differences for a power-saving and for a CPU-bound use case?
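For reference, my reading of the core ondemand rule is something like
this (a sketch only, with UP_THRESHOLD matching your 80%, not the
actual Linux code):

	/*
	 * Above the threshold jump straight to full speed; below
	 * it pick a performance level proportional to the load.
	 */
	#define UP_THRESHOLD	80

	int
	ondemand_level(int load)	/* load in percent, 0-100 */
	{
		if (load > UP_THRESHOLD)
			return (100);
		return (load);
	}

Your diff instead walks perflevel towards the measured load in steps
of 5, which smooths the transitions; both jump to 100% past the
threshold.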

> I used the same spc_cp_time because it's already there and allows
> implementing such a PoC with minimal changes.

Do you know what alternative we could use?  What do other OSes use?  At
which rate do they sample idle time?

> Interestingly, on my machine the battery lasts as long as with the
> previous PoC, but this time without any impact on system responsiveness.

That's very interesting.  I'd suggest starting another thread asking for
tests.

> The next logical step to this PoC is to change the frequency per CPU, which
> will require some code rework.

This isn't trivial to do because you also want the scheduler to know
which CPU is running at which frequency in order to make decisions.
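For example, thread placement could then prefer a CPU that is already
clocked high instead of waking another one up.  A sketch, where
spc_perflevel is a hypothetical per-CPU field that does not exist
today:

	struct cpu_info *ci, *best = NULL;
	CPU_INFO_ITERATOR cii;

	/* Pick the online CPU running at the highest perf level. */
	CPU_INFO_FOREACH(cii, ci) {
		if (!cpu_is_online(ci))
			continue;
		if (best == NULL ||
		    ci->ci_schedstate.spc_perflevel >
		    best->ci_schedstate.spc_perflevel)
			best = ci;
	}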

That said, I believe there's a step to take before that, which
corresponds to the load estimation.  Could every CPU keep track of how
much time it spends executing threads?  This doesn't have to be since
boot time.  It doesn't have to be too precise either, otherwise it will
be too costly.  What we want is a better granularity than a tick and
enough knowledge of the recent past to make decisions: which thread
goes next, which CPU goes idle, which performance level we want to
pick, etc.  Something like the sketch below might be enough.
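	/*
	 * Sketch only: account busy time per CPU with sub-tick
	 * precision at the points where a CPU leaves and enters
	 * its idle loop.  spc_nbusy and spc_stamp are made-up
	 * fields; nsecuptime() returns nanoseconds since boot.
	 */
	void
	sched_busy_enter(struct schedstate_percpu *spc)	/* leaving idle */
	{
		spc->spc_stamp = nsecuptime();
	}

	void
	sched_busy_leave(struct schedstate_percpu *spc)	/* entering idle */
	{
		spc->spc_nbusy += nsecuptime() - spc->spc_stamp;
	}

A periodic task could then read and reset spc_nbusy, giving a load
figure over exactly the last interval instead of ticks accumulated
since boot.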
 
> diff --git sys/kern/sched_bsd.c sys/kern/sched_bsd.c
> index 54ce3173e4e..88edece0e1f 100644
> --- sys/kern/sched_bsd.c
> +++ sys/kern/sched_bsd.c
> @@ -580,12 +580,11 @@ void
>  setperf_auto(void *v)
>  {
>  	static uint64_t *idleticks, *totalticks;
> -	static int downbeats;
>  	int i, j = 0;
> -	int speedup = 0;
> +	int boost = 0, load = 0, speed = 0;
>  	CPU_INFO_ITERATOR cii;
>  	struct cpu_info *ci;
> -	uint64_t idle, total, allidle = 0, alltotal = 0;
> +	uint64_t idle, total;
>  
>  	if (perfpolicy != PERFPOL_AUTO)
>  		return;
> @@ -593,11 +592,6 @@ setperf_auto(void *v)
>  	if (cpu_setperf == NULL)
>  		return;
>  
> -	if (hw_power) {
> -		speedup = 1;
> -		goto faster;
> -	}
> -
>  	if (!idleticks)
>  		if (!(idleticks = mallocarray(ncpusfound, sizeof(*idleticks),
>  		    M_DEVBUF, M_NOWAIT | M_ZERO)))
> @@ -607,36 +601,48 @@ setperf_auto(void *v)
>  		    M_DEVBUF, M_NOWAIT | M_ZERO))) {
>  			free(idleticks, M_DEVBUF,
>  			    sizeof(*idleticks) * ncpusfound);
> +			idleticks = NULL;
>  			return;
>  		}
> +
>  	CPU_INFO_FOREACH(cii, ci) {
>  		if (!cpu_is_online(ci))
>  			continue;
> +
>  		total = 0;
> -		for (i = 0; i < CPUSTATES; i++) {
> +		for (i = 0; i < CPUSTATES; i++)
>  			total += ci->ci_schedstate.spc_cp_time[i];
> -		}
>  		total -= totalticks[j];
> +		if (total <= 0)
> +			continue;
> +
>  		idle = ci->ci_schedstate.spc_cp_time[CP_IDLE] - idleticks[j];
> -		if (idle < total / 3)
> -			speedup = 1;
> -		alltotal += total;
> -		allidle += idle;
> +
>  		idleticks[j] += idle;
>  		totalticks[j] += total;
> +
> +		load = (100 * (total - idle)) / total;
> +
> +		if (load > 80)
> +			boost = 1;
> +		else if (load > speed)
> +			speed = load;
> +
>  		j++;
>  	}
> -	if (allidle < alltotal / 2)
> -		speedup = 1;
> -	if (speedup && downbeats < 5)
> -		downbeats++;
>  
> -	if (speedup && perflevel != 100) {
> -faster:
> +	if (boost && perflevel != 100) {
>  		perflevel = 100;
>  		cpu_setperf(perflevel);
> -	} else if (!speedup && perflevel != 0 && --downbeats <= 0) {
> -		perflevel = 0;
> +	} else if (!boost && speed > perflevel && perflevel != 100) {
> +		perflevel += 5;
> +		if (perflevel > 100)
> +		    perflevel = 100;
> +		cpu_setperf(perflevel);
> +	} else if (!boost && speed < perflevel && perflevel != 0) {
> +		perflevel -= 5;
> +		if (perflevel < 0)
> +		    perflevel = 0;
>  		cpu_setperf(perflevel);
>  	}
>  
> 
> 
> -- 
> wbr, Kirill
>