Methodology — LLM-DX

Origin

Practitioner observation, codified.

LLM-DX did not start as a framework. It started as a notebook. Across two years of daily AI practice (Claude as the primary substrate, Gemini and Perplexity for triangulation, NotebookLM for synthesis), the same patterns kept surfacing across very different production workflows. The same kinds of sessions degraded. The same kinds of prompts produced slop. The same kinds of practitioners blamed the model when the model had been given nothing to work with.

The author runs cybersecurity and GRC programmes in his day job: domains where structured inputs, traceable evidence, and disciplined process are the difference between a passing audit and an avoidable incident. The same instincts kept catching workflow failures that the AI community was reflexively attributing to model limits. The seven dimensions are the distillation of those caught failures across dozens of real workflows — not a literature review, not a survey instrument, not a productisation of someone else's research. Practitioner observation, codified.

Why these seven

Each dimension isolates a distinct failure mode.

The dimensions are not a hierarchy and they are not interchangeable.

Project Setup — whether the working environment briefs the model before the conversation starts. Detects the practitioner who re-explains the same project five times a week.
Knowledge Quality — whether the supplied context is current, scoped, and structured, or whether it is a dump. Detects garbage-in workflows.
On-Demand Context — whether the practitioner can supply targeted context mid-session, or whether they restart the chat every time something new is needed. Detects context starvation and context flood.
Prompt Quality — whether requests state the goal, the format, and the constraint, or whether they are conversational nudges. Detects the cost of vague briefs.
Session Discipline — whether sessions are bounded, scoped, and reset before degradation, or whether they are run until they break. Detects the context-fill degradation that most practitioners do not realise they are causing.
Efficiency — whether tokens are spent on signal or on rework. Tokens are the receipt for everything upstream; this dimension is the observable consequence of the other six being tight or loose.
Output Discernment — whether the practitioner can tell when the model is wrong, hedging, or confabulating. Detects the gap between "the model said it" and "the answer is correct."

Each dimension is separable. A practitioner can be Optimised in Prompt Quality and Foundational in Session Discipline. The per-dimension scores surface those imbalances rather than averaging them away.

The scoring rubric

What 1, 2, 3, 4 mean for any question.

1 — Foundational. The behaviour is absent, accidental, or actively counter-productive. The practitioner does not yet recognise the failure mode.
2 — Developing. The behaviour appears in some sessions but is not systematic. Outcomes vary widely depending on day, project, or mood.
3 — Proficient. The behaviour is consistent in normal conditions. It breaks down under deadline pressure, novelty, or fatigue.
4 — Optimised. The behaviour is systematic and survives stress. The practitioner has explicit habits, templates, or guardrails that make the behaviour the default rather than the choice.

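Dimension percentage is (sum of answered question scores / (answered × 4)) × 100, where "answered" is the number of questions answered in that dimension. Because the lowest possible score is 1, a dimension percentage never falls below 25. Overall percentage is the mean of dimension percentages, not of question scores; this prevents dimensions with more answered questions from disproportionately swinging the result.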
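The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not the assessment's actual implementation: the function names, the use of None for an unanswered question, and the choice to leave a dimension with no answers unscored are all assumptions made for the example.

```python
from statistics import mean

MAX_SCORE = 4  # top of the 1-4 rubric


def dimension_percentage(scores):
    """Percentage for one dimension. `scores` holds 1-4 values; None marks
    an unanswered question (an assumed convention, not from the source)."""
    answered = [s for s in scores if s is not None]
    if not answered:
        return None  # assumption: a dimension with no answers stays unscored
    if any(s not in (1, 2, 3, 4) for s in answered):
        raise ValueError("scores must be integers from 1 to 4")
    return sum(answered) / (len(answered) * MAX_SCORE) * 100


def overall_percentage(responses):
    """Mean of the dimension percentages, not of the raw question scores."""
    percentages = [
        p
        for p in (dimension_percentage(s) for s in responses.values())
        if p is not None
    ]
    return mean(percentages) if percentages else None


# Hypothetical answers for two of the seven dimensions.
responses = {
    "Prompt Quality": [4, 4, 3, 4],      # 15 / 16 -> 93.75
    "Session Discipline": [1, 2, None],  # 3 / 8   -> 37.5
}
print(overall_percentage(responses))     # 65.625
```

In this example, pooling all six answered questions would give 18 / 24 = 75, because Prompt Quality has more answers. Averaging the two dimension percentages gives 65.625, so each dimension carries equal weight regardless of how many questions were answered.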

Tier thresholds

Why 40, 65, 85 — and not other cutoffs.

0–39 Foundational. At this score the practitioner is doing the behaviour fewer than two times in five. The model is mostly performing in spite of the workflow rather than because of it.
40–64 Developing. The practitioner has the behaviour in some conditions. The 40 cutoff marks the point where workflow failure stops being the dominant cause of bad output.
65–84 Proficient. The practitioner has the behaviour in most conditions. The 65 cutoff marks the point where the practitioner is reliably extracting value the model can give.
85–100 Optimised. The practitioner has explicit, repeatable structure. The 85 cutoff marks the point where additional gains come from the model getting better, not from the practitioner getting tighter.

The cutoffs are spaced so that movement between tiers corresponds to a visible behavioural change, not a rounding artifact.
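As a rough sketch, the tier bands can be expressed as a lookup on the lower bound of each band. The names below are hypothetical. The source states the bands in whole numbers and does not say how fractional percentages are handled, so this example assumes the unrounded percentage is compared against each cutoff (64.9 stays Developing).

```python
# Lower bound of each tier, highest first, taken from the bands above.
TIER_CUTOFFS = [
    (85, "Optimised"),
    (65, "Proficient"),
    (40, "Developing"),
    (0, "Foundational"),
]


def tier_for(percentage):
    """Return the tier name for a percentage between 0 and 100.

    Assumption: no rounding is applied before the comparison.
    """
    if not 0 <= percentage <= 100:
        raise ValueError("percentage must be between 0 and 100")
    for cutoff, name in TIER_CUTOFFS:
        if percentage >= cutoff:
            return name


for value in (37.5, 40, 64.9, 65, 93.75):
    print(value, tier_for(value))
```

Keeping the cutoffs in one ordered table means a change to a threshold touches a single line, and the same function can label both a dimension percentage and the overall percentage.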

Limitations

What this framework does not do.

Self-report bias. The assessment is self-administered. Practitioners who score themselves harshly will look worse than practitioners who score themselves generously, even when the underlying practice is the same. The AI-assessed mode mitigates this but does not eliminate it.
No inter-rater reliability data. The framework has not been put through a formal IRR study. Two practitioners scoring the same workflow may produce different results. Treat the score as a self-diagnostic, not a certification.
Cohort benchmarking requires n≥10. The role-and-experience cohort comparison is suppressed when fewer than ten matching practitioners exist, to avoid identifying any single user and to avoid drawing inferences from noise.
Claude-first, model-agnostic in principle. The framework was built primarily on Claude. The dimensions are intended to hold across models, but the scoring language is calibrated to current-generation LLM behaviour and may need re-anchoring as model capabilities shift.
Workflow scope, not output scope. LLM-DX scores how you work with AI. It does not score the quality of any particular artifact the AI produced. Output quality is downstream; this is upstream measurement.
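The n≥10 suppression rule from the cohort benchmarking limitation above could look something like the following. This is only a sketch: the function name is hypothetical, and the statistics reported for a cohort (mean and median here) are an assumption, since the source does not say what the comparison shows.

```python
from statistics import mean, median

MIN_COHORT_SIZE = 10  # the n>=10 threshold stated above


def cohort_benchmark(cohort_percentages):
    """Summarise a role-and-experience cohort, or return None when the
    cohort is too small to report without exposing an individual."""
    n = len(cohort_percentages)
    if n < MIN_COHORT_SIZE:
        return None  # suppressed: fewer than ten matching practitioners
    return {
        "n": n,
        "mean": mean(cohort_percentages),      # assumed summary statistic
        "median": median(cohort_percentages),  # assumed summary statistic
    }


print(cohort_benchmark([55.0, 62.5, 70.0]))  # None: only three practitioners
```

Returning None rather than a partial summary keeps the caller from accidentally displaying a benchmark built on too few people.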

Related work

How this framework relates to the 4D AI Fluency Framework.

The 4D AI Fluency Framework, developed by Professors Rick Dakan and Joseph Feller in collaboration with Anthropic, defines 24 behaviors across four dimensions of AI fluency. Eleven of these are directly observable in conversation; thirteen happen outside the chat interface. Anthropic's May 2026 AI Fluency Report used this framework to measure fluency behaviors across 9,830 conversations at population scale.

LLM-DX is a different instrument built for a different purpose. The 4D Framework describes what fluent AI use looks like. LLM-DX diagnoses which of seven specific workflow dimensions is underperforming for an individual practitioner and provides targeted corrections.

The two frameworks overlap in domain (both address the quality of human-AI collaboration) but differ in three ways that matter. Unit of analysis: population (4D) versus individual practitioner (LLM-DX). Method: behavioral observation of conversations (4D) versus scored self-assessment with an AI-assisted mode (LLM-DX). Output: research index and baseline measurement (4D) versus dimension scores, correction prompts, and longitudinal tracking (LLM-DX).

LLM-DX does not claim validation from the 4D Framework. The directional alignment is acknowledged. The instruments are independent.

Sources: Anthropic (2026). Education Report: AI Fluency. anthropic.com · Dakan, R. & Feller, J. (2025). The 4D AI Fluency Framework, in collaboration with Anthropic.

Start here

The framework is meant to be used, not read.

Take the Assessment (15 min) · Quick Check (5 min)
