Operator reference documentation for all seven agents
The Survey Instrument Designer generates structured, psychometrically defensible survey instruments for organizational diagnostic engagements. Given a set of constructs the practitioner wants to measure — such as leadership effectiveness, psychological safety, or organizational culture — this agent produces a complete item set, organizes it into a deployable instrument, and outputs a format ready for upload to Qualtrics or equivalent platforms.
This agent eliminates the drafting burden that consumes significant time in the early stages of an OD engagement. It applies construct-level item generation, reverse-scoring conventions, and response-format standards drawn from validated instrument development practice.
The agent validates that all required inputs are present and internally consistent: construct names are unique, demographic filters are recognized, and the delivery platform is supported. This stage runs without calling an LLM.
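
A minimal sketch of what this deterministic gate could look like, assuming the inputs arrive as a single object; the field names, supported-platform list, and recognized-filter set below are illustrative assumptions, not the agent's actual configuration:

```typescript
// Hypothetical input shape and constants; the real field names live in the agent config.
interface DesignerInputs {
  diagnosticDimensions: string[];
  demographicFilters: string; // e.g. "pay_grade, directorate"
  platform: string;
}

const SUPPORTED_PLATFORMS = ['Qualtrics']; // assumed
const KNOWN_FILTERS = new Set(['pay_grade', 'directorate', 'tenure']); // assumed

function validateInputs(inputs: DesignerInputs): string[] {
  const errors: string[] = [];
  const names = inputs.diagnosticDimensions.map(d => d.trim().toLowerCase());
  if (new Set(names).size !== names.length) {
    errors.push('Construct names must be unique.');
  }
  for (const f of inputs.demographicFilters.split(',').map(s => s.trim()).filter(Boolean)) {
    if (!KNOWN_FILTERS.has(f)) errors.push(`Unrecognized demographic filter: ${f}`);
  }
  if (!SUPPORTED_PLATFORMS.includes(inputs.platform)) {
    errors.push(`Unsupported platform: ${inputs.platform}`);
  }
  return errors; // an empty array means the inputs pass and generation can proceed
}
```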
Using Claude Haiku, the agent generates 4–6 Likert-scale items per construct, with one reverse-scored item per construct where applicable. The prompt includes the engagement context (industry, organization size, anonymity mode), reference instruments, and any practitioner-defined sensitivity flags. Output includes a full item set with construct assignments, a recommended response scale, demographic filter questions, and deployment-ready formatting guidance.
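
The resulting item set plausibly deserializes to a shape like the following; the field names here are assumptions for illustration, not the agent's published schema:

```typescript
// Hypothetical shape of one generated item.
interface SurveyItem {
  construct: string;      // construct assignment, e.g. "Psychological Safety"
  text: string;           // item wording
  reverseScored: boolean; // at most one reverse-scored item per construct
  scale: string;          // recommended response scale, e.g. "5-point Likert"
}

const example: SurveyItem = {
  construct: 'Psychological Safety',
  text: 'People on my team are comfortable raising problems and tough issues.',
  reverseScored: false,
  scale: '5-point Likert',
};
```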

| Field | Type | Description |
|---|---|---|
| clientName | Text | Name of the client organization — used for instrument header and branding context |
| engagementName | Text | Engagement or project name — used for internal tracking and document headers |
| industryContext | Select | Sector and workforce description, e.g. "Federal Defense — GS-12 to SES" |
| organizationSize | Number | Approximate headcount — informs item language and sampling guidance |
| diagnosticDimensions | Multiselect | Constructs to measure, e.g. "Leadership Effectiveness," "Team Cohesion" |
| referenceInstruments | Checkboxes | Known validated instruments to align with (Denison, Gallup Q12, OCAI) |
| anonymityMode | Select | "Anonymous" or "Attributed" — affects item sensitivity and disclosure language |
| demographicFilters | Text | Demographic breakdowns desired, e.g. "pay_grade, directorate" |
| platform | Text | Survey delivery platform, e.g. "Qualtrics" — affects formatting conventions |

| Output | Format | Description |
|---|---|---|
| Survey Instrument | Structured doc | Complete item set organized by construct with response scale and instructions |
| Item Mapping | Table | Maps each item to its construct, reverse-scoring flag, and sensitivity classification |
| Deployment Notes | Text | Platform-specific guidance for upload, branching logic, and completion estimates |
| Demographic Block | Item list | Recommended demographic questions aligned to the specified filters |

| Theory / Framework | Source | Application |
|---|---|---|
| Classical Test Theory | Spearman (1904); Lord & Novick (1968) | Underpins item reliability logic, internal consistency targets, and reverse-scoring conventions |
| Construct Validity | Cronbach & Meehl (1955) | Guides item-construct alignment — items must represent the latent construct they purport to measure |
| Likert Scaling | Likert (1932) | Establishes the 5-point agree/disagree response format and balanced anchor phrasing |
| Denison Organizational Culture Survey | Denison & Mishra (1995) | Reference framework for culture-domain items covering involvement, consistency, adaptability, and mission |
| Gallup Q12 Employee Engagement | Buckingham & Coffman (1999) | Reference framework for engagement items — informs construct coverage for team-level dynamics |
| Survey Design Best Practices | Dillman, Smyth & Christian (2014) | Governs item phrasing rules: avoiding double-barreled items, anchoring questions in behavior, and deliberate use of the neutral midpoint |
The Interview Protocol Builder develops structured and semi-structured interview protocols for qualitative data collection in OD engagements. Given the assessment dimensions, role hierarchy, and session logistics, this agent generates complete, role-differentiated interview guides — including warm-up questions, main probes, follow-up prompts, and closing sequences — calibrated to the seniority and organizational knowledge of each participant group.
This agent addresses a consistent bottleneck in mixed-methods OD work: protocol development is time-intensive, and off-the-shelf templates rarely reflect the specific constructs under investigation or the power dynamics of the audience.
The agent builds a protocol structure for each role group specified in the inputs, calculating time allocations, sequencing question types, and mapping each probe to the assessment dimensions it covers.
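
A sketch of how the time-allocation step might work, assuming fixed fractions for the warm-up and closing blocks; the percentages and helper names are illustrative, not the agent's actual constants:

```typescript
// Hypothetical allocation: 10% warm-up, 10% close, remainder split
// evenly across the assessment dimensions.
interface ProtocolSection {
  name: string;
  minutes: number;
}

function allocateSections(totalMinutes: number, dimensions: string[]): ProtocolSection[] {
  const warmUp = Math.round(totalMinutes * 0.1);
  const close = Math.round(totalMinutes * 0.1);
  const perDimension = Math.floor((totalMinutes - warmUp - close) / dimensions.length);
  return [
    { name: 'Warm-up', minutes: warmUp },
    ...dimensions.map(d => ({ name: `Core probes: ${d}`, minutes: perDimension })),
    { name: 'Close', minutes: close },
  ];
}
```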
Using Claude Haiku, the agent generates role-appropriate questions for each session type (executive interview, focus group, listening session, stakeholder interview). Senior leaders receive strategic, forward-looking prompts. Front-line participants receive experience-level, operationally grounded questions. The LLM is constrained to produce questions that are behaviorally anchored, open-ended, and non-leading.

| Field | Type | Description |
|---|---|---|
| clientName / engagementName | Text | Engagement context for document headers and LLM framing |
| organizationContext | Text | Narrative description of the organization — workforce composition, culture notes, sensitivities |
| assessmentDimensions | Multiselect | Constructs to probe qualitatively — must align with survey dimensions where applicable |
| roleHierarchy | Text | Roles from highest to lowest — used to calibrate language register and question depth |
| interviewTypes | Checkboxes | Session formats: executive interview, focus group, stakeholder interview, listening session |
| sessionCount | Number | Total number of sessions planned — informs sampling guidance in the protocol |
| targetRoles | Dynamic table | Role name, seniority level, and expected organizational knowledge — one row per role group |
| constraints | Options | Max protocol length in minutes; recording and transcription guidance flags |

| Output | Format | Description |
|---|---|---|
| Interview Guides | Per-role docs | Complete facilitator script for each role group — warm-up, core probes, follow-ups, close |
| Question Bank | Table | All generated questions mapped to assessment dimension and session type |
| Sampling Guidance | Text | Recommended participant counts and selection criteria per role group |
| Consent & Documentation | Text | Consent language, recording protocol, and note-taking guidance |
| Time Allocation Map | Table | Breakdown of minutes per protocol section for each session type |

| Theory / Framework | Source | Application |
|---|---|---|
| Semi-Structured Interviewing | Kvale & Brinkmann (2009) | Guides question sequencing: open-ended openers, focused probes, hypothetical follow-ups, and member-checking prompts |
| Appreciative Inquiry | Cooperrider & Srivastva (1987) | Informs the affirmative question frame — protocols include strength-based probes alongside deficit-oriented ones |
| Organizational Sense-Making | Weick (1995) | Shapes questions that surface how participants interpret ambiguous or changing situations |
| Phenomenological Interviewing | Moustakas (1994) | Grounds the lived-experience question structure — asking participants to describe specific events rather than general opinions |
| Power-Aware Facilitation | Freire (1970); Schein (2009) | Drives role differentiation — questions for executives differ structurally from those for front-line staff to account for positional dynamics |
| Grounded Theory Sampling | Glaser & Strauss (1967) | Informs the sampling guidance — theoretical saturation logic drives recommended participant counts per group |
The Content Curation Engine selects and packages learning content from the OD system's internal content library to support leadership development programs. Given a set of learning objectives, participant characteristics, and delivery constraints, the agent scores each available content item against the program requirements and returns a curated, relevance-ranked set of materials organized by module.
This agent addresses the content sourcing problem in program design: practitioners typically spend hours searching for materials, evaluating fit, and organizing them into a coherent sequence.
The agent iterates through all items in the content library index and computes a relevance score for each against the specified learning objectives. Scoring considers: objective keyword alignment, content type appropriateness for the delivery modality, participant seniority calibration, and industry context relevance. Items scoring below the minimum relevance threshold are excluded.
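
A sketch of what this scoring pass could look like; the component weights and the keyword heuristic are assumptions for illustration, not the engine's actual tuning:

```typescript
// Hypothetical library item and program spec shapes.
interface ContentItem {
  id: string;
  type: string;        // article, case study, video, ...
  keywords: string[];
  targetLevel: string; // e.g. "Mid Manager"
  industries: string[];
}

interface ProgramSpec {
  objectives: string[];
  contentTypes: string[];
  participantLevel: string;
  industryContext: string;
  minimumRelevanceScore: number;
}

function relevanceScore(item: ContentItem, spec: ProgramSpec): number {
  // Objective keyword alignment: fraction of objectives sharing a keyword with the item.
  const hits = spec.objectives.filter(obj =>
    item.keywords.some(kw => obj.toLowerCase().includes(kw.toLowerCase()))
  ).length;
  const keywordAlignment = hits / Math.max(spec.objectives.length, 1);
  const typeFit = spec.contentTypes.includes(item.type) ? 1 : 0;
  const levelFit = item.targetLevel === spec.participantLevel ? 1 : 0.5;
  const industryFit = item.industries.includes(spec.industryContext) ? 1 : 0.5;
  // Weighted blend; the weights are assumptions.
  return 0.4 * keywordAlignment + 0.2 * typeFit + 0.2 * levelFit + 0.2 * industryFit;
}

// Score every item, drop those below the threshold, rank the rest.
const shortlist = (library: ContentItem[], spec: ProgramSpec) =>
  library
    .map(item => ({ item, score: relevanceScore(item, spec) }))
    .filter(({ score }) => score >= spec.minimumRelevanceScore)
    .sort((a, b) => b.score - a.score);
```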
Using Claude Haiku, the agent selects the highest-scoring items up to the requested module count, writes a brief curatorial rationale for each selected item, and drafts facilitator notes explaining how to introduce each piece of content in the context of the program.
A degraded: true flag in the metadata is expected and is not a failure condition.

| Field | Type | Description |
|---|---|---|
| programName | Text | Name of the development program — used for document headers and LLM context |
| learningObjectives | Textarea | What participants should know, do, or value upon completion — one objective per line |
| industryContext | Select | Sector and audience description — used to weight content relevance for the specific context |
| participantLevel | Select | Emerging Leader, Mid Manager, Senior Leader, or Executive — calibrates content sophistication |
| contentTypes | Checkboxes | Acceptable content formats: article, case study, video, framework, exercise, assessment |
| deliveryMode | Select | In Person, Virtual, or Hybrid — affects content type weighting |
| moduleCount | Number | Number of program modules — the agent selects content to populate each module |
| sessionDuration | Number | Total session length in minutes — used to estimate content volume appropriateness |
| minimumRelevanceScore | Number | 0.0–1.0 threshold below which items are excluded. Default: 0.6 |

| Output | Format | Description |
|---|---|---|
| Curated Content Package | Structured list | Selected items with relevance scores, rationale, and module assignments |
| Facilitator Notes | Per-item text | How to introduce and debrief each content item in the context of the program objectives |
| Coverage Report | Table | Which learning objectives are covered by which content items — identifies gaps |
| Curation Metadata | Summary | Total items evaluated, items selected, average relevance score, coverage completeness |

| Theory / Framework | Source | Application |
|---|---|---|
| Bloom's Taxonomy (Revised) | Anderson & Krathwohl (2001) | Content is matched to the cognitive level required: remember, understand, apply, analyze, evaluate, create |
| Adult Learning Theory (Andragogy) | Knowles (1980) | Content selection favors materials that are experience-based, problem-centered, and immediately applicable |
| 70-20-10 Development Model | McCall, Lombardo & Morrison (1988) | Experiential exercises weighted above passive reading; reflection prompts included to activate social learning |
| Situated Learning Theory | Lave & Wenger (1991) | Case studies and context-specific examples are weighted higher than abstract frameworks when industry context is specified |
| Cognitive Load Theory | Sweller (1988) | Content volume recommendations are calibrated to session duration to avoid overloading participants |
| Transfer of Training | Baldwin & Ford (1988) | Content selection includes application exercises that bridge learning context to job context |
The Quantitative Analysis agent ingests raw survey export data from Supabase Storage, computes construct-level scores and reliability statistics, identifies statistically significant subgroup differences, and produces a narrative interpretation of the findings. It transforms a raw CSV file into a structured, analyst-ready quantitative findings package.
This agent removes the analytical bottleneck that occurs between data collection and synthesis. Practitioners no longer need to run manual SPSS or Excel calculations for standard psychometric outputs.
The agent retrieves the survey export from the specified Supabase Storage path, parses the CSV, validates row counts and header integrity, and checks that all declared construct item IDs exist in the data. Errors at this stage halt execution with a specific, actionable error message.
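
A minimal sketch of this stage with supabase-js, assuming a comma-delimited export whose header row carries the item IDs; the function name and error wording are illustrative:

```typescript
import { createClient } from '@supabase/supabase-js';

const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_KEY!);

// Download the export from the survey-exports bucket and fail fast with
// actionable messages, mirroring the validation rules described above.
async function loadExport(path: string, declaredItemIds: string[]): Promise<string[][]> {
  const { data, error } = await supabase.storage.from('survey-exports').download(path);
  if (error || !data) {
    throw new Error(`Export not found at ${path}: ${error?.message ?? 'empty file'}`);
  }
  // Naive split for illustration; a real parser must handle quoted fields.
  const rows = (await data.text()).trim().split('\n').map(line => line.split(','));
  if (rows.length < 2) throw new Error('Export has a header row but no respondent rows.');
  const headers = rows[0];
  const missing = declaredItemIds.filter(id => !headers.includes(id));
  if (missing.length > 0) {
    throw new Error(`Declared construct items missing from export: ${missing.join(', ')}`);
  }
  return rows;
}
```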
For each construct, the agent: (1) applies reverse scoring to flagged items, (2) computes respondent-level mean or sum scores per the specified scoring formula, (3) computes construct-level mean, median, standard deviation, min, and max, (4) estimates Cronbach's alpha as a reliability indicator, and (5) computes benchmark deltas where a benchmark set is specified.
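
Steps (1) through (4) reduce to a few lines of arithmetic. A compact sketch for a single construct, assuming rows of numeric responses keyed by item ID on a 1-5 Likert scale; helper and parameter names are illustrative:

```typescript
type Row = Record<string, number>;

const reverse = (v: number, scaleMax = 5) => scaleMax + 1 - v; // (1)

function variance(xs: number[]): number {
  const mean = xs.reduce((a, b) => a + b, 0) / xs.length;
  return xs.reduce((a, x) => a + (x - mean) ** 2, 0) / (xs.length - 1);
}

function scoreConstruct(rows: Row[], itemIds: string[], reversed: Set<string>) {
  // (1) reverse-score flagged items
  const adjusted = rows.map(r => {
    const out: Row = {};
    for (const id of itemIds) out[id] = reversed.has(id) ? reverse(r[id]) : r[id];
    return out;
  });
  // (2) respondent-level mean scores
  const respondentMeans = adjusted.map(
    r => itemIds.reduce((a, id) => a + r[id], 0) / itemIds.length
  );
  // (3) construct-level mean (median, SD, min, max follow the same pattern)
  const constructMean =
    respondentMeans.reduce((a, b) => a + b, 0) / respondentMeans.length;
  // (4) Cronbach's alpha = (k / (k - 1)) * (1 - sum(item variances) / variance(totals))
  const k = itemIds.length;
  const itemVars = itemIds.map(id => variance(adjusted.map(r => r[id])));
  const totals = adjusted.map(r => itemIds.reduce((a, id) => a + r[id], 0));
  const alpha =
    (k / (k - 1)) * (1 - itemVars.reduce((a, v) => a + v, 0) / variance(totals));
  return { respondentMeans, constructMean, alpha };
}
```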
Using Claude Haiku, the agent produces a 2–3 sentence narrative for each construct summarizing the statistical pattern, notable subgroup differences, and any reliability concerns. These narratives feed directly into the Synthesis Report agent.

| Field | Type | Description |
|---|---|---|
| surveyExportStoragePath | Text | Path to the CSV file within the survey-exports Supabase Storage bucket |
| exportFormat | Select | Currently: csv only. XLSX support is planned. |
| constructDefinitions | Dynamic table | One row per construct: name, ID, item IDs (CSV column headers), reverse-scored items |
| demographicFilters | Text | Demographic fields to use for subgroup comparison, e.g. "pay_grade, directorate" |
| significanceThreshold | Number | p-value threshold for reporting subgroup differences. Default: 0.05 |
| organizationalContext | Select | Federal, Corporate, Nonprofit, Healthcare — calibrates narrative interpretation language |
| engagementId | UUID | Links this analysis run to the engagement record in the database |

| Output | Format | Description |
|---|---|---|
| Construct Scores | Table | Mean, median, SD, min, max, Cronbach's alpha, and benchmark delta per construct |
| Subgroup Differences | Table | Statistically significant demographic differences with p-value, effect size, and practical significance flag |
| Response Quality Flags | List | Respondents flagged for straight-lining, all-extreme responding, or low completion |
| Narrative Summaries | Per-construct text | LLM-generated interpretation for each construct — feeds into Synthesis Report |
| Analysis Metadata | Summary | Total respondents, valid respondents, data quality flags, and analysis parameters used |

| Theory / Framework | Source | Application |
|---|---|---|
| Classical Test Theory | Lord & Novick (1968) | Foundation for item scoring, construct mean computation, and reliability estimation via Cronbach's alpha |
| Cronbach's Coefficient Alpha | Cronbach (1951) | Internal consistency reliability estimate — constructs below 0.70 are flagged as potentially unreliable |
| Cohen's Effect Size Conventions | Cohen (1988) | Interprets magnitude of subgroup differences: small (d=0.2), medium (d=0.5), large (d=0.8) |
| Nonparametric Significance Testing | Mann & Whitney (1947); Kruskal & Wallis (1952) | Applied for small-n subgroups where normality assumptions fail — the agent selects tests automatically |
| Straight-Lining Detection | Meade & Craig (2012) | Response quality flagging algorithm identifies respondents who selected the same response for every item |
The Facilitation Guide Generator produces complete, practitioner-ready facilitation guides for leadership development and organizational learning sessions. Given the program objectives, content modules, session logistics, and facilitator experience level, the agent generates a structured guide covering: session timeline, detailed activity instructions, facilitator scripts, debrief questions, contingency plans, and modality-specific notes.
This agent resolves the last-mile problem in program delivery: even when content is designed and approved, practitioners without deep facilitation experience often lack the scaffolding to run high-stakes sessions confidently.
The agent computes a minute-by-minute session timeline from the module time allocations, inserting standard buffer time, breaks, and orientation blocks. The timeline is validated for mathematical completeness before LLM generation begins.
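
A sketch of the timeline computation, assuming a fixed orientation block and a short buffer between modules; those constants are illustrative, not the agent's actual rules:

```typescript
interface Module { title: string; minutes: number; }
interface TimelineEntry { start: number; end: number; title: string; }

// Assumed constants for illustration.
const ORIENTATION_MIN = 10;
const BUFFER_MIN = 5;
const CLOSE_MIN = 10;

function buildTimeline(modules: Module[], sessionMinutes: number): TimelineEntry[] {
  const entries: TimelineEntry[] = [];
  let t = 0;
  const push = (title: string, minutes: number) => {
    entries.push({ start: t, end: t + minutes, title });
    t += minutes;
  };
  push('Orientation', ORIENTATION_MIN);
  modules.forEach((m, i) => {
    push(m.title, m.minutes);
    if (i < modules.length - 1) push('Buffer / break', BUFFER_MIN);
  });
  push('Close & commitments', CLOSE_MIN);
  // Mathematical completeness check before any LLM generation begins.
  if (t > sessionMinutes) {
    throw new Error(`Timeline needs ${t} min but the session is ${sessionMinutes} min.`);
  }
  return entries;
}
```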
Using Claude Sonnet with a 32,000-token output budget (streaming required), the agent generates: overview with session goals and success criteria, preparation checklist, materials list, full activity detail blocks (purpose, setup, process steps, debrief questions, contingency options), facilitator tips, and appendices.
A deterministic validator checks the generated guide for structural completeness: all timeline entries must have corresponding activity detail blocks, all learning objectives must be covered, and all debrief question sets must meet the minimum count. If validation fails, a correction prompt is issued to the LLM before the guide is finalized.
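
One way to structure that correction pass is a bounded validate-then-retry loop; the issue type, callbacks, and retry budget below are hypothetical stand-ins for the agent's internals:

```typescript
interface ValidationIssue { kind: string; detail: string; }

async function finalizeGuide(
  draft: string,
  validate: (guide: string) => ValidationIssue[],
  correct: (guide: string, issues: ValidationIssue[]) => Promise<string>, // correction prompt to the LLM
  maxAttempts = 2, // assumed retry budget
): Promise<string> {
  let guide = draft;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const issues = validate(guide);
    if (issues.length === 0) return guide;
    guide = await correct(guide, issues);
  }
  return guide; // any remaining issues are recorded in the validation report
}
```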
The finalized guide is exported as a KH-branded DOCX and a Markdown file, both uploaded to the guides/ Supabase Storage bucket.

| Field | Type | Description |
|---|---|---|
| programName | Text | Name of the development program — appears in guide header and all exported documents |
| programObjectives | Textarea | Overall program learning objectives — one per line. Drive debrief question generation. |
| sessionDuration | Number | Total session length in minutes (30–480). Determines timeline architecture. |
| participantCount | Number | Number of participants — affects activity instructions, room setup notes, and group sizing guidance |
| deliveryMode | Select | In Person, Virtual, or Hybrid — generates modality-specific facilitation notes for each activity |
| facilitatorExperience | Select | Novice, Intermediate, or Expert — calibrates script depth and contingency guidance |
| organizationalContext | Select | Federal, Corporate, Nonprofit, Healthcare — calibrates language register and example selection |
| contentModules | Dynamic list | One card per module: title, learning objectives, time allocation, activities, summary, takeaways, discussion prompts, application exercise |

| Output | Format | Description |
|---|---|---|
| Facilitator Guide (DOCX) | KH-branded file | Complete practitioner guide with all sections — uploaded to guides/ bucket |
| Facilitator Guide (Markdown) | Text file | Plain-text version for digital sharing or LMS upload — uploaded to guides/ bucket |
| Session Timeline | Structured data | Minute-by-minute schedule with activity titles and objective mappings |
| Activity→Objective Map | Structured data | Feeds directly into the Evaluation Package Builder as chained input |
| Validation Report | Metadata | Records whether structural validation passed and any issues identified and corrected |

| Theory / Framework | Source | Application |
|---|---|---|
| Kolb's Experiential Learning Cycle | Kolb (1984) | Each activity block follows Concrete Experience → Reflective Observation → Abstract Conceptualization → Active Experimentation sequence |
| Transformative Learning Theory | Mezirow (1991) | Debrief questions are designed to surface and challenge assumptions — the "disorienting dilemma" is deliberately built into higher-stakes activities |
| Psychological Safety | Edmondson (1999) | Facilitator scripts include explicit psychological safety framing at session open and after high-disclosure activities |
| Action Learning | Revans (1982) | Application exercise at end of each module operationalizes Revans' principle: learning requires real problems and reflective questioning |
| Scaffolded Instruction | Wood, Bruner & Ross (1976) | The facilitatorExperience parameter controls scaffolding depth — novice guides include more prescriptive scripts |
| Kirkpatrick Level 3 (Behavior) | Kirkpatrick (1959) | Discussion prompts and application exercises are forward-facing — they ask participants to commit to specific behavioral changes |
The Synthesis Report agent integrates quantitative survey findings and qualitative interview findings into a unified organizational diagnostic report. It identifies convergent patterns (where both data sources agree), complementary patterns (where each source adds distinct information), and divergent patterns (where the sources are in tension), then generates validated recommendations with urgency, impact, and feasibility ratings.
This agent produces the primary client deliverable: the integrated OD assessment report. It replaces the synthesis step that practitioners typically spend the most time on — manually comparing two data sets, resolving discrepancies, and drafting a coherent narrative that holds both sources of evidence simultaneously.
Both the quantitative findings (from Agent 04) and qualitative findings must have approvalStatus: "approved" before synthesis begins. This gate prevents synthesis on unapproved or potentially flawed upstream data.
The agent builds a triangulation map: for each assessed dimension, it classifies the relationship between quantitative and qualitative evidence as convergent, complementary, or divergent. Divergent findings trigger an interpretive note generation step.
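
A sketch of the classification rule, assuming each dimension's evidence has first been reduced to a coarse directional signal per source; the signal vocabulary is an assumption for illustration:

```typescript
type Signal = 'positive' | 'negative' | 'mixed' | 'absent';
type Pattern = 'convergent' | 'complementary' | 'divergent';

function classifyDimension(quant: Signal, qual: Signal): Pattern {
  // One source silent: the other adds distinct information.
  if (quant === 'absent' || qual === 'absent') return 'complementary';
  if (quant === qual) return 'convergent';
  // Directly opposed signals are in tension and trigger an interpretive note.
  const opposed =
    (quant === 'positive' && qual === 'negative') ||
    (quant === 'negative' && qual === 'positive');
  return opposed ? 'divergent' : 'complementary';
}
```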
Using Claude Sonnet, the agent generates three LLM outputs in sequence: (1) the integrated report narrative, (2) recommendations with ratings, and (3) the executive summary package. Separating these calls ensures each section receives adequate token budget.
A claims validator checks that all recommendations are grounded in stated findings and that all divergent findings have been addressed.

| Field | Type | Description |
|---|---|---|
| clientName / industry | Text | Client identity and sector — used in report header and LLM context framing |
| assessmentScope | Select | Scope description for the report methodology section |
| clientPriorities | Textarea | Strategic priorities the client has communicated — recommendations are ranked partly on alignment with these |
| reportFormat | Select | "Comprehensive" (full sections) or "Executive" (condensed) — controls report depth |
| quantitativeFindings | Structured data | Approved output from Agent 04 — contains construct scores and narratives by dimension |
| qualitativeFindings | Structured data | Approved qualitative data — contains dimension-level narrative summaries from interview analysis |
| engagementId | UUID | Links the synthesis to the engagement record |

| Output | Format | Description |
|---|---|---|
| Triangulation Map | Structured data | Convergent, complementary, and divergent finding classifications per dimension |
| Findings by Dimension | Report sections | Integrated narrative for each assessed dimension — with caveats where data quality requires them |
| Cross-Cutting Themes | Report sections | Patterns that appeared consistently across multiple dimensions |
| Recommendations | Rated table | Actionable recommendations with urgency, impact, and feasibility ratings — plus rationale and implementation considerations |
| Executive Summary | Doc section | Priority actions and leadership implications — designed for C-suite or SES-level audience |
| Report Preview (UI) | Web view | Operator preview accessible from the Reports tab |
| Report DOCX | Download | Fully formatted Word document — download from the Reports tab |

| Theory / Framework | Source | Application |
|---|---|---|
| Mixed Methods Research Design | Creswell & Plano Clark (2011) | Convergent parallel design: quantitative and qualitative strands collected independently, merged at interpretation |
| Triangulation | Denzin (1978) | Data triangulation, methodological triangulation, and investigator triangulation applied — agent explicitly codes convergence, complementarity, and divergence |
| Organizational Diagnosis | Nadler & Tushman (1980) | The Congruence Model informs recommendation framing: findings are interpreted as misalignments between inputs, strategy, work, people, and structure |
| Force Field Analysis | Lewin (1951) | Recommendations are structured as driving forces to amplify and restraining forces to reduce |
| Evidence-Based OD | Rousseau (2006) | The claims validator enforces that every recommendation is explicitly grounded in stated findings |
The Evaluation Package Builder generates a complete set of evaluation instruments for leadership development programs. Anchored to the program's learning objectives and facilitated activities, the agent produces pre-session baseline instruments, post-session learning gain instruments, session reaction surveys, and facilitator observation checklists — all aligned to the specific objectives of the program rather than generic course evaluation templates.
This agent solves a persistent gap in program evaluation practice: most organizations deploy generic "smile sheets" that measure satisfaction rather than learning. The agent generates instruments that directly trace back to program objectives, enabling practitioners to demonstrate learning gain and support Kirkpatrick Level 2 and Level 3 evaluation claims.
The agent validates that the facilitation guide input has approvalStatus: "approved" and builds an objective map from the activityObjectiveMap structure — linking each program objective to the activities designed to develop it. Every instrument item is traceable to a specific objective.
Using Claude Sonnet, the agent generates four instruments simultaneously: (1) pre-session baseline, (2) post-session outcome instrument, (3) session reaction survey, and (4) facilitator observation checklist.
A coverage validator checks that every program objective has at least one item in both the pre and post instruments. A structure validator checks item formatting, response type consistency, and instruction completeness.
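
The coverage rule reduces to set membership. A minimal sketch, assuming each generated item records the objective it traces to; the field names are hypothetical:

```typescript
interface EvalItem { objectiveId: string; text: string; }

// Returns objectives lacking at least one item in both instruments;
// an empty array means coverage validation passes.
function uncoveredObjectives(objectives: string[], pre: EvalItem[], post: EvalItem[]): string[] {
  const ids = (items: EvalItem[]) => new Set(items.map(i => i.objectiveId));
  const preIds = ids(pre);
  const postIds = ids(post);
  return objectives.filter(o => !preIds.has(o) || !postIds.has(o));
}
```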
The evaluation package is exported as a KH-branded DOCX (all four instruments in one document) and an XLSX file (all instruments as separate tabs), both uploaded to the reports/ Supabase Storage bucket.

| Field | Type | Description |
|---|---|---|
| facilitationGuide | Structured data | Approved output from Agent 05 — provides program name, objectives, delivery mode, participant count, and activityObjectiveMap |
| evaluationUse | Checkboxes | Intended evaluation purposes: Learning Gain, Participant Reaction, Application Intent |
| organizationalContext | Select | Federal, Corporate, Nonprofit, Healthcare — calibrates language register and item framing |
| engagementId | UUID | Links the evaluation package to the engagement record |

| Output | Format | Description |
|---|---|---|
| Pre-Session Instrument | Survey doc | Baseline knowledge and attitude items — administered before the session begins |
| Post-Session Instrument | Survey doc | Items parallel to the pre-session instrument — learning gain is computed as the difference between pre- and post-session responses |
| Session Reaction Survey | Survey doc | Participant experience items: relevance, facilitator effectiveness, environment, and overall value |
| Facilitator Observation Checklist | Checklist doc | Behavioral indicators the facilitator or observer monitors during delivery |
| Evaluation Package DOCX | KH-branded file | All four instruments in a single formatted document — uploaded to reports/ bucket |
| Evaluation Package XLSX | Spreadsheet | All four instruments as separate tabs — ready for data collection and analysis |
| Strategy Summary | Text | Narrative describing the evaluation approach, instrument rationale, and scoring guidance |

| Theory / Framework | Source | Application |
|---|---|---|
| Kirkpatrick's Four Levels | Kirkpatrick (1959); Kirkpatrick & Kirkpatrick (2016) | Package measures Level 1 (Reaction), Level 2 (Learning), and lays groundwork for Level 3 (Behavior) |
| Pre-Post Quasi-Experimental Design | Campbell & Stanley (1963) | Pre and post instruments are structurally parallel to enable learning gain calculation. Design limitation (no control group) noted in strategy summary. |
| Transfer of Training | Baldwin & Ford (1988) | Application intent items operationalize the motivation-to-transfer construct: "I intend to use X within Y weeks" format |
| Objective-Referenced Assessment | Popham (1978) | Every item in the pre/post instruments maps directly to a stated learning objective |
| Brinkerhoff's Success Case Method | Brinkerhoff (2003) | The facilitator observation checklist surfaces best-case and worst-case behavioral indicators during delivery |