*A putative new idea for AI control; index here. **The idea here is somewhat underdeveloped.*

I've been suggesting AI designs that are not exactly expected-utility maximisers, while wondering if such designs could be stable under self improvement/subagent creation/general stuff. And also wondering what designs such agents would converge to if they did self-improve.

This post is just a short note warning people to make sure they are consistent when they use non-expected utility maximalisation, and to be clear which parts are Platonic and idealised and which parts are not.

Let's take the reduced impact AI design as an example. In this, the agent has a kind of utility:

v = u - R,

where u is a standard utility and R is a penalty term. The penalty term involved a probability estimate for what a putative "super AI" would estimate, based on some observations it could make. Thus it is better described as R(P), where P is the probability estimator of that "Super AI".

When calculating expected v, the agent will use its own probability estimator P'. Initially, P' is just a less powerful version of P, one that can, moreover, estimate properties about P.

How could this possibly go wrong? I'm glad you asked ^_^ The obvious risk is that the agent acts to replace P with Q, such that R(Q) is as low as possible.

This could come about if P referred to some actual physical algorithm, possibly one written down somewhere the AI could get at it. Then re-writing the algorithm P could increase v by reducing R(P).

It is somewhat more difficult if P and P' are defined in the same place (maybe P is just P' run for a lot longer). Then modifying P could run the risk of degrading the AI's overall accuracy in predictions. However, this may be a price the AI is willing to pay, if the R(P) penalty is severe enough. More likely, the AI could degrade the quality of P but not P' by introducing pathologies that would only affect the first.

Even if P itself is safe from modification, the AI can still hack the system. Recall that it does not calculate P, it estimates it, using P' - so effectively using P'(P). Thus modifications of P' can affect the AI's estimate of R(P). Even if P and P'(P) cannot be changed, the AI can still improve P' for predictions of u, while leaving P'(P) unchanged. Thus the AI may become super-super-intelligent at predicting facts about the world, while remaining (deliberately) only super-intelligent at predicting the penalty R(P). This could allow it to use its great intelligence to hack or trick the R(P) term.

The main requirement to avoid all this is that there is no "first-step divergence". That there never comes a time where the agent modifies itself in such a way that the initial relation between P and P' comes apart.

To avoid this, it might be best to treat both P and P' as different approximations to the same idealised platonic process (Kolmogorov complexity priors calculated by hypercomputation, possibly?). Then the two algorithms are actually strongly related (you can't really change them "independently"), and the AI is trying to minimise the true value of R(P), not the value it records. Self-modifications to P and P' will only work if this allows a *better *estimate of R(P), no necessarily a *lower *estimate.

Being inconsistent about what is platonic (in theory or in practice - some processes are essentially platonic because we can't affect them) and what isn't can lead to problems and unstable motivation systems.