
From Plausible Risk Logs to Portfolio Assurance: What We Learned Building AI Risk Review for SPM Teams

Strategic portfolio management (SPM) teams do not need another way to fill a risk log. They need a better way to decide which initiative risks deserve steering committee (steerco) attention.

AI makes that problem more visible. A model can read initiative data and produce a draft risk register in seconds. The output often looks credible: familiar language, familiar categories, plausible consequences.

The problem appears on review. Many entries are risk-shaped rather than useful. They are not anchored to specific evidence, cannot be tested against the initiative record, and do not point to a decision a sponsor can make.

We saw this while building an AI-assisted risk-review capability using real enterprise portfolio data. The first version gave the model initiative data and asked it to act like an expert PMO reviewer. The output was fluent, but too much of it was hollow.

That forced a better question. Not "how do we get AI to suggest risks?" but "what is a risk log actually for?"

At steerco level, a risk log is not a list of things that might go wrong. It is a decision-support tool. A useful risk identifies a material future consequence, grounded in a present signal, where a sponsor or executive can make a judgment about mitigation, funding, scope, timing, or ownership.

That changed the work. We analyzed real risk logs produced by experienced humans and asked what evidence would have helped identify those risks from the underlying initiative data.

The answer was not better wording. It was a review method.

The model did not become useful when it was asked to think like an expert. It became useful when the expert review method was broken down into checks the model could perform.

What makes a risk useful

This statement appears in risk registers everywhere:

"There is a risk that stakeholder misalignment delays delivery."

It fits the textbook definition of a risk: an uncertain future event that carries a material consequence. In many registers it would pass without comment.

An experienced reviewer would not accept it.

There is no evidence behind it. The consequence is named only in the most generic terms. There is nothing a steerco can test, fund, escalate, or reject. It could apply to almost any program.

Compare this:

"There is a risk that regulated customer data is crossing legal jurisdictions without the controls the regulator expects. The same cloud provider appears twice in the cost lines, billed through its EU entity and its US entity, on a workload that processes regulated customer data. Under the regulator's outsourcing rules this is a material arrangement, and no transfer assessment or regulator engagement is visible in the record."

That risk names a specific consequence, grounds it in the initiative record, and tests it against a named obligation. The committee can act: commission the transfer assessment, engage the regulator early, or constrain the workload until the controls are evidenced.

Notice what produced it. No single signal is alarming. Two cloud line items are among the most ordinary things in a cost report, and a busy reviewer would scroll past both. The risk only appears when the review connects them, and then tests the combination against the regulator's rules for material outsourcing.

The difference is evidence, read through organizational context.

A useful AI risk review has to distinguish between a signal and a risk. A missing date, an overdue milestone, a supplier line item, or a finance note saying "to be reversed" is not automatically a risk. It becomes a risk when it points to a material future consequence: missed go-live, unstable cost baseline, unsupported production platform, unresolved ownership, or regulatory exposure.

That distinction is obvious to experienced PMO and assurance teams. It is easy for a model to miss.

Weak risk entries pass because they have the shape of real risks. What they lack is the review judgment: whether the concern is evidenced, material, timely, and worth an executive's attention.

Why generic prompting fails

Our first version concentrated on framing the model: who it was and what good output looked like.

In shape, it came down to this:

Early version
Act as a seasoned assurance reviewer. Here is the initiative data. Identify the delivery risks a steering committee should see.

That produced plausible output. It did not produce consistently useful output.

The problem was not access to data. The problem was that the review logic had not been made explicit.

Experienced reviewers do not scan a program and free-associate risks. They apply tests, even if they rarely write those tests down.

They check whether outcomes are overdue against today's date. They look for planned dates that do not fit inside the initiative window. They read finance notes for instability, not just the numbers. They notice lines called "Reversal", "Placeholder", "Contingency", or "Top-up".

They know that a supplier appearing in the data is not automatically a third-party risk. It becomes material when the supplier is processing regulated data, running a production service, supporting an outsourced operation, touching financial reporting, or sitting inside the operational resilience perimeter.

They also know when absence is evidence.

If an initiative is putting a production platform live and the record contains no backup, disaster recovery, non-production environment, monitoring, security tooling, or named support model, the issue is not simply that the data is incomplete. The issue is that the route to production does not evidence the capabilities a sponsor would expect before go-live.

This is the part experienced PMO and assurance teams usually do in their heads. They know when a missing date is poor hygiene and when it means the plan no longer fits inside the approved window. They know when a finance note is harmless and when it points to weak cost control. They know when a RAG (red/amber/green) status tells the truth and when it lags the evidence.

The model only gets that judgment if the review method is written down.

Writing down the review logic

The useful version of the prompt was much longer, but length was not the breakthrough. The breakthrough was specificity.

Each review area had:

  • a thing to inspect;
  • a threshold for when the signal became material;
  • a rule for how to frame the underlying risk;
  • a legitimate reason to produce no risk.
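The four elements above can be captured as a small data structure. This is a minimal sketch, not the structure used in the working prompt: the `ReviewArea` class, its field names, and the example wording are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewArea:
    """One review area, holding the four elements listed above (hypothetical names)."""
    name: str
    inspect: str    # the thing to inspect
    threshold: str  # when the signal becomes material
    framing: str    # how to frame the underlying risk
    waiver: str     # a legitimate reason to produce no risk

    def to_prompt(self) -> str:
        """Render the area as one instruction block for a prompt."""
        return (
            f"{self.name}\n"
            f"  Inspect: {self.inspect}\n"
            f"  Raise when: {self.threshold}\n"
            f"  Frame as: {self.framing}\n"
            f"  Waive when: {self.waiver}"
        )


# Illustrative wording only; it is not taken from the working prompt.
SCHEDULE = ReviewArea(
    name="Schedule",
    inspect="due dates of in-progress outcomes",
    threshold="an open outcome is past its due date",
    framing="whether the program can recover in time",
    waiver="every open outcome is due in the future and inside the delivery window",
)

print(SCHEDULE.to_prompt())
```

Keeping the waiver condition as a required field is one way to make sure every area has a written, legitimate reason to return nothing.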

In simplified form, the working method looked like this:

Working methodology
For each area below, either raise a risk grounded in specific evidence in the data, or waive it with a stated reason. Do not invent a risk to fill a step.
  • Schedule: check outcomes overdue against today's date, and dates that fall outside the delivery window.
  • Financials: read every line-item note and name for instability, and test actuals against forecast and budget.
  • Technology lifecycle: for any migration, the destination must be live before the source loses vendor support.
  • Architecture: treat the absence of backup, recovery, non-production, monitoring, security tooling, or a named support model as a signal where a production platform is being introduced.
  • Regulatory and third-party: only raise a risk where a named arrangement crosses a specific obligation, such as regulated data, outsourcing, operational resilience, consumer impact, financial reporting, or cross-border transfer.
  • Operating model and resourcing: look for ownership after go-live and single points of failure in named roles.
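Some of these checks reduce to simple comparisons once the dates are in the record. As a hedged sketch of the technology lifecycle line, assuming the record carries a planned go-live date for the destination and a vendor end-of-support date for the source (the function name and parameters are hypothetical):

```python
from datetime import date
from typing import Optional


def lifecycle_gap_days(destination_live: Optional[date],
                       source_support_ends: date) -> Optional[int]:
    """Days the source would run without vendor support before the destination is live.

    Returns 0 when the destination is live in time, and None when no go-live
    date is recorded, which is a signal to review rather than a pass.
    """
    if destination_live is None:
        return None
    return max(0, (destination_live - source_support_ends).days)


# Destination goes live 1 March 2026; source support ends 31 December 2025.
print(lifecycle_gap_days(date(2026, 3, 1), date(2025, 12, 31)))  # 60
```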

The full working prompt carried more than method. It carried organizational context: who regulates the customer, which frameworks apply, what counts as material outsourcing, where data is allowed to live. The cross-border risk earlier in this article is unfindable without that context. Two cloud line items only become a regulatory exposure when the review knows the regulator and the threshold for materiality. What separated the working version from the first version was that it wrote down both: what an experienced reviewer checks, and what the organization's own reviewers know.

Take the schedule check. That one line expands into two different tests.

The first compares every in-progress outcome's due date against today. Where an outcome is past its date and still open, the delay has already happened. The live risk is no longer "delivery may be delayed". The live risk is whether the program can recover in time.

The second reads the plan for forward anomalies: an outcome with no start date, a delivery date beyond the initiative end date, or a milestone that does not fit inside the approved delivery window. That is a different risk. It is not evidence of an actual slip, but it may show that the plan is no longer credible.
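The two schedule tests can be sketched in a few lines. This is an illustration under assumptions, not the production logic: the `Outcome` fields, the status values, and the sample data are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass
class Outcome:
    name: str
    status: str            # hypothetical values: "in_progress", "done"
    start: Optional[date]
    due: Optional[date]


def overdue(outcomes: List[Outcome], today: date) -> List[str]:
    """Test 1: in-progress outcomes whose due date has already passed."""
    return [o.name for o in outcomes
            if o.status == "in_progress" and o.due is not None and o.due < today]


def plan_anomalies(outcomes: List[Outcome], window_end: date) -> List[Tuple[str, str]]:
    """Test 2: forward anomalies in outcomes that are not yet done."""
    findings = []
    for o in outcomes:
        if o.status == "done":
            continue
        if o.start is None:
            findings.append((o.name, "no start date"))
        if o.due is not None and o.due > window_end:
            findings.append((o.name, "due date beyond initiative end date"))
    return findings


outcomes = [
    Outcome("Data migration", "in_progress", date(2025, 1, 6), date(2025, 3, 31)),
    Outcome("Cutover", "in_progress", None, date(2026, 2, 15)),
]
today = date(2025, 6, 1)
window_end = date(2025, 12, 31)

print(overdue(outcomes, today))              # ['Data migration']
print(plan_anomalies(outcomes, window_end))  # both anomalies are on 'Cutover'
```

The two functions are kept separate on purpose: the first reports a slip that has already happened, the second reports a plan that may no longer be credible.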

The financial review also has two sides.

The model needs to do the arithmetic: actuals against forecast, unusual spikes, large reversals, and cumulative spend approaching budget. It also needs to read the words.

The note on a line item, or the name of the line item itself, is often where trouble appears first. An experienced reviewer slows down at "placeholder", "to be reversed", "awaiting invoice", "needs investigation", or a line simply called "Reversal". Those phrases may point to unresolved cost instability, weak financial control, scope movement, or dependency on another team's budget.
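Both sides of the financial review, the arithmetic and the words, can be sketched together. The phrase list comes from the paragraph above; the `CostLine` shape, the 10% tolerance, and the sample figures are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List

# Phrases an experienced reviewer slows down at, matched case-insensitively.
WATCH_PHRASES = ("placeholder", "to be reversed", "awaiting invoice",
                 "needs investigation", "reversal")


@dataclass
class CostLine:
    name: str
    note: str
    forecast: float
    actual: float


def text_signals(line: CostLine) -> List[str]:
    """Watch phrases found in the line name or its note."""
    text = f"{line.name} {line.note}".lower()
    return [p for p in WATCH_PHRASES if p in text]


def over_forecast(line: CostLine, tolerance: float = 0.10) -> bool:
    """True when actuals exceed forecast by more than the tolerance (assumed 10%)."""
    if line.forecast <= 0:
        return False
    return (line.actual - line.forecast) / line.forecast > tolerance


def budget_used(lines: List[CostLine], budget: float) -> float:
    """Cumulative actual spend as a fraction of budget."""
    return sum(l.actual for l in lines) / budget


lines = [
    CostLine("Cloud hosting", "", 100_000, 104_000),
    CostLine("Integration partner", "Placeholder, awaiting invoice", 50_000, 80_000),
    CostLine("Reversal", "", 0, -20_000),
]

for line in lines:
    print(line.name, text_signals(line), over_forecast(line))
print(round(budget_used(lines, 200_000), 2))  # 0.82
```

A phrase match is only a signal. Whether it becomes a risk still depends on the consequence it points to, such as an unstable cost baseline.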

The regulatory check is different again. Generic AI often raises "third-party risk" because suppliers appear in the data. That is filler. A supplier line item is only material when it points to a specific arrangement: regulated data processing, production service operation, outsourcing scope, cross-border data movement, identity or security infrastructure, financial reporting, or a customer-facing service within the operational resilience perimeter.
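The materiality test for suppliers can be expressed as a set intersection. This sketch assumes each supplier line has already been tagged with the arrangements it touches; the tag names, the `SupplierLine` class, and the supplier names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Arrangements that make a supplier line material (tag names are illustrative).
MATERIAL_ARRANGEMENTS = frozenset({
    "regulated_data_processing",
    "production_service_operation",
    "outsourcing_scope",
    "cross_border_data_movement",
    "identity_or_security_infrastructure",
    "financial_reporting",
    "customer_facing_resilience_service",
})


@dataclass
class SupplierLine:
    supplier: str
    arrangements: frozenset = frozenset()


def third_party_finding(line: SupplierLine) -> Optional[str]:
    """Return a finding only when the supplier touches a material arrangement."""
    hits = sorted(line.arrangements & MATERIAL_ARRANGEMENTS)
    if not hits:
        return None  # a supplier merely appearing in the data is not a risk
    return f"{line.supplier}: " + ", ".join(hits)


lines = [
    SupplierLine("Acme Cloud", frozenset({"regulated_data_processing",
                                          "cross_border_data_movement"})),
    SupplierLine("Stationery Co"),
]

for line in lines:
    print(third_party_finding(line))
```

Returning `None` for an untagged supplier is the point of the check: the default is no finding, and a specific arrangement is needed to raise one.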

Each category needs its own test. Without those tests, the model collapses different assurance questions into generic concern.

How expert interpretation changes the same signal

The same fact can carry more than one meaning. The decision rules determine which reading applies.

| Program signal | A literal reading | An experienced reading |
|---|---|---|
| Outcome has no end date | The data is incomplete | There may be no credible path to completion before the approved initiative end date |
| Outcome is already overdue | Delivery may be delayed | The delay has already happened; the live risk is whether the program can close or recover in time |
| A supplier appears in the line items | Third-party risk exists | Material only where the supplier touches regulated data, a production service, outsourcing scope, operational resilience, or financial reporting |
| No support model for a production platform | Support information is missing | The platform may enter production with no clear ownership once it goes live |
| A note reads "to be reversed" | The budget figures may be wrong | There may be unresolved cost instability or weak financial control |
| RAG status is stable despite distressed financials | The program is stable | RAG may not be a useful signal; the reviewer should look harder at the underlying evidence |

This is where risk review differs from a data-quality report.

The point is not to tell executives that fields are missing or notes are untidy. The point is to identify the underlying delivery, financial, regulatory, architectural, or operating-model risk suggested by those signals.

Why "no risk" is sometimes the right answer

When building a structured review, the natural move is to require an output for every category. That is a mistake.

A model held to that requirement will manufacture risks to satisfy it. A manufactured risk is worse than no risk: an executive who receives a padded register full of plausible entries with no real grounding learns to discount the whole document.

The revised method gave each review area two legitimate responses: raise a risk grounded in specific evidence, or waive the area because the data positively shows the concern does not apply.
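One way to hold the output to those two responses is to make them the only types a review area can return, and to reject a raised risk with no evidence or a waiver with no reason. This is a sketch under assumptions; the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Raised:
    area: str
    risk: str
    evidence: Tuple[str, ...]  # specific items from the initiative record

    def __post_init__(self):
        if not self.evidence:
            raise ValueError(f"{self.area}: a raised risk needs specific evidence")


@dataclass(frozen=True)
class Waived:
    area: str
    reason: str

    def __post_init__(self):
        if not self.reason.strip():
            raise ValueError(f"{self.area}: a waiver needs a stated reason")


AreaResult = Union[Raised, Waived]


def register(results: List[AreaResult]) -> List[Raised]:
    """Only raised risks reach the register; waivers stay in the audit trail."""
    return [r for r in results if isinstance(r, Raised)]


results = [
    Raised("Schedule", "Program may not recover in time",
           ("Outcome 'Data migration' was due 2025-03-31 and is still open",)),
    Waived("Technology lifecycle", "No migration is in scope for this initiative"),
]

print(len(register(results)))  # 1
```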

The threshold for a waiver matters.

Uncertainty is not grounds for a waiver. In some review areas, silence in the data is itself the signal. If a production platform is being introduced and no backup, disaster recovery, monitoring, non-production environment, or support model is visible, the absence is material.
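The absence test can be sketched as a check for expected capabilities that never appear in the record. The capability list follows the paragraph above; the naive substring matching and the sample record text are assumptions, and a real review would need something more robust than keyword search.

```python
from typing import List

# Capabilities a sponsor would expect to see evidenced before go-live.
EXPECTED_CAPABILITIES = (
    "backup",
    "disaster recovery",
    "monitoring",
    "non-production environment",
    "support model",
)


def missing_capabilities(introduces_production_platform: bool,
                         record_text: str) -> List[str]:
    """Capabilities with no mention in the record, for production platforms only."""
    if not introduces_production_platform:
        return []  # silence is only a signal where a platform is going live
    text = record_text.lower()
    return [c for c in EXPECTED_CAPABILITIES if c not in text]


record = ("Landing zone build. Monitoring via existing tooling. "
          "Backup included in platform fee.")

print(missing_capabilities(True, record))
# ['disaster recovery', 'non-production environment', 'support model']
print(missing_capabilities(False, record))  # []
```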

Other categories should be waived unless there is a specific signal. A supplier existing in the financials is not enough to raise a regulatory risk. A renewal risk should not be invented unless there is a time-bounded cost, such as "Year 1", "subscription", "AMC", or "renewal", without visible future funding.
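The renewal example follows the opposite default: waive unless a specific signal is present. A minimal sketch, assuming the line text is available as a string and that "visible future funding" has already been determined elsewhere (both the function name and that boolean are hypothetical):

```python
import re
from typing import Tuple

# Time-bounded cost terms named above, matched as whole words, case-insensitively.
TIME_BOUNDED = re.compile(r"\b(year 1|subscription|amc|renewal)\b", re.IGNORECASE)


def renewal_check(line_text: str, future_funding_visible: bool) -> Tuple[str, str]:
    """Return ('raise', evidence) or ('waive', reason) for the renewal category."""
    match = TIME_BOUNDED.search(line_text)
    if match is None:
        return ("waive", "no time-bounded cost term in the line")
    if future_funding_visible:
        return ("waive", "future funding is visible")
    return ("raise", f"'{match.group(0)}' with no visible future funding")


print(renewal_check("Platform licence - Year 1", False))
print(renewal_check("Platform licence - Year 1", True))
print(renewal_check("Consulting days", False))
```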

That honest off-ramp is essential.

A shorter output carries more weight. A review that returns four well-grounded risks, each with specific evidence and a clear delivery consequence, serves a steerco better than one that returns eight diluted risks.

Where this changes SPM assurance

Some risks are outside the realistic scope of portfolio data.

A model cannot know that a sponsor has lost political capital, that a vendor relationship is deteriorating, that a dependency is fragile because of an informal agreement, or that a delivery team has quietly lost confidence. Those risks still require human context.

The practical use case is narrower and more valuable.

In SPM, the constraint is not the number of risks a team can write down. The constraint is whether the same assurance standard is applied across every initiative in the portfolio.

Once the review logic is explicit, AI can apply that standard across outcomes, RAG history, financials, notes, and existing risks. The output is not an auto-generated risk register. It is a first-pass assurance review for human sign-off.

The sponsor still owns the risk decision. The system changes the preparation: less manual trawling through initiative records, more evidence-backed review before steerco.

That matters at portfolio scale. A human reviewer can read a handful of initiatives deeply. A structured review method can be applied across the full portfolio, consistently, using the same evidence thresholds every time.

The lesson

AI risk review only works when it stops generating risks and starts applying an assurance method.

That means writing down what to inspect, how to interpret the signal, when to raise a risk, when to waive the category, and how to avoid filler.

The model's fluency is not the scarce ingredient. The scarce ingredients are the review logic and the organizational context it tests against.

The next operating state for SPM is not a larger risk register. It is a portfolio assurance layer that reads initiative data continuously, applies a consistent review method, and gives steerco a smaller set of risks worth acting on.

That is where AI-assisted risk review becomes useful: not as a substitute for expert judgment, but as a way to make more of that judgment explicit, repeatable, and available at portfolio scale.
