This is part two of a three-part series on how Tabiya evaluates Compass, our AI-powered skills discovery tool for young jobseekers.

In Part 1 of this blog series, we showed how Compass’s modular architecture and evaluation suite ensure model reliability – that the AI produces accurate, safe outputs. But none of that guarantees user value. In this post, we describe the messy but essential next step: taking Compass out of the lab and into actual users’ hands through a pilot with Ajira Digital in Kenya. In the language of the Agency Fund’s framework for evaluating AI in the social sector, this covers Level 2 (Product) and Level 3 (User) evaluations.

Why Product Evaluation Matters

Model performance alone doesn’t guarantee user value. A model that produces accurate outputs can still fail if users can’t complete the experience, don’t understand what Compass is asking, or abandon the conversation due to poor design. Product evaluation (Level 2) bridges model accuracy and real-world usability – it tests whether the infrastructure, authentication, conversation flow and user interface actually support meaningful interactions at scale.  

This distinction matters because product failures look different from model failures. When Compass identifies the wrong skill, that’s a Level 1 (Model) problem – something we addressed in our earlier post. When users drop off because they don’t know how long the conversation will take, that’s a Level 2 (Product) problem. The Ajira pilot surfaced both, and forced us to fix product issues before we could meaningfully assess user outcomes and impact.

Testing with Ajira Digital in Kenya

In April 2025 we partnered with Ajira Digital, Kenya’s government-backed digital jobs initiative, for a large-scale product test. Ajira connects roughly 600,000 young people to digital work opportunities. This gave us access to a large, diverse user base to stress-test two things: first, whether our infrastructure (authentication, session handling, concurrent users) could handle real-world conditions; and second, whether the user experience (onboarding, conversation flow, cognitive load) actually worked for Kenyan youth. This was deliberately a Level 2 (Product) test with embedded Level 3 (User) checks.
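
To give a sense of what “stress-testing the infrastructure” means in practice, here is a minimal sketch of the kind of concurrency smoke test involved; the endpoint URL, payload, and user count are illustrative assumptions, not Compass’s actual API.

```python
# Minimal concurrency smoke test: start many sessions at once and count how
# many succeed. The endpoint URL and payload are placeholders, not Compass's
# real API.
import asyncio
import httpx

LOGIN_URL = "https://example.org/api/session"  # hypothetical endpoint
CONCURRENT_USERS = 200

async def start_session(client: httpx.AsyncClient, user_id: int) -> bool:
    """Return True if the session starts within the timeout, else False."""
    try:
        resp = await client.post(
            LOGIN_URL, json={"user": f"load-test-{user_id}"}, timeout=10.0
        )
        return resp.status_code == 200
    except httpx.HTTPError:
        return False

async def main() -> None:
    async with httpx.AsyncClient() as client:
        results = await asyncio.gather(
            *(start_session(client, i) for i in range(CONCURRENT_USERS))
        )
    print(f"{sum(results)}/{CONCURRENT_USERS} sessions started successfully")

if __name__ == "__main__":
    asyncio.run(main())
```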

What We Found

The Ajira pilot exposed problems that internal testing had missed. In very early deployment phases, more than 90 percent of users dropped off. A stark result, but exactly why we tested the system before moving to impact evaluation. By the final deployment phase, targeted fixes had reduced user dropout to 44 percent.  

Our analysis of user data revealed that multiple factors likely contributed to early dropout, including connectivity issues. But it also pointed to several product-level problems: 

  • Authentication and session failures: Our login flow and session management couldn’t handle concurrent user spikes. Users timed out or lost progress. 
  • No progress indicator: Users didn’t know how long the conversation would take, contributing to early abandonment. 
  • Unclear onboarding: Users stalled because they didn’t know what Compass was asking or why. 
  • Cognitive overload: Skill-discovery sequences felt repetitive for some users. The more experiences users explored, the more likely they were to abandon the conversation. 

These are classic Level 2 problems: the model might be producing accurate outputs, but the product is not supporting users through the experience. 

What We Changed

We treated each deployment phase as an experiment, applying fixes and measuring again. Key changes included: 

  • Simplified authentication: Fixed session timeouts and added reconnect logic so users wouldn’t lose progress. 
  • Lowered the entry barrier: Allowed users to start without registration codes, so curiosity wouldn’t be throttled by bureaucracy.  
  • Added progress indicators: Clear progress bars and time estimates let users know how much remained. 
  • Structured onboarding and signposting: Short onboarding messages explaining what Compass would ask and why – reducing anxiety and setting expectations. 
  • Refined prompts: Shortened repetitive questions, broke long sequences into smaller checkpoints, and made the language more conversational. 
  • Defined fallbacks: Where appropriate, we replaced ambiguous LLM fallbacks with rule-based prompts. (Technically: Deterministic fallbacks supplement generative fallbacks with predefined, rule-based prompts that explicitly control state transitions once user intent is established.) For example, instead of letting the model decide when to move from experience gathering to skill exploration, we implemented a hard requirement: “Collect at least one complete experience before proceeding.” (See the sketch after this list.) 
  • Built for iteration: These rapid cycles of testing and fixing pushed us to improve our engineering practices. We refactored hardcoded prompts and rigid CV templates into versioned, testable components – infrastructure that now supports A/B testing and will make our upcoming evaluations more reliable. We also added features users asked for directly: the ability to upload an existing CV, see more top skill matches, and groundwork for future language support.
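
To make the deterministic-fallback idea concrete, here is a minimal sketch of a rule-based gate on the transition from experience gathering to skill exploration; the state fields, phase names, and prompt text are simplified assumptions, not Compass’s actual implementation.

```python
# Sketch of a deterministic fallback: a rule-based check gates the transition
# from experience gathering to skill exploration instead of letting the LLM
# decide. Field names and prompt text are illustrative, not Compass's code.
from dataclasses import dataclass, field

@dataclass
class Experience:
    title: str
    description: str

    def is_complete(self) -> bool:
        # "Complete" here simply means both fields were captured.
        return bool(self.title.strip() and self.description.strip())

@dataclass
class ConversationState:
    experiences: list[Experience] = field(default_factory=list)
    phase: str = "experience_gathering"

def next_phase(state: ConversationState, llm_wants_to_advance: bool) -> str:
    """Only advance to skill exploration once at least one complete
    experience has been collected, regardless of what the model suggests."""
    has_complete = any(e.is_complete() for e in state.experiences)
    if llm_wants_to_advance and has_complete:
        state.phase = "skill_exploration"
    elif llm_wants_to_advance and not has_complete:
        # Deterministic fallback: stay in the current phase and use a
        # predefined prompt rather than a generated one.
        print("Before we look at skills, could you tell me about one thing "
              "you have done, paid or unpaid, in a bit more detail?")
    return state.phase
```

The design choice is that the phase transition is enforced by an explicit check rather than left to the model’s judgment.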

One example illustrates why product-level decisions can be delicate for AI systems.

Originally, Compass asked about experiences in this order: Employed → Self-employed → Unpaid internship → Volunteering (last). Many Ajira users’ most relevant experiences were informal or voluntary; they frequently dropped off before reaching “volunteering.” The intuitive fix was to move volunteering earlier. 

But moving “volunteering” first broke a hidden logic in the model: the system had been designed to treat volunteering as a fallback. If a user reached that step with no prior experience, Compass would gently prompt reflection on subtler forms of work. When volunteering was first, the bot jumped into encouragement mode prematurely, misinterpreting the user’s context and prompting less relevant follow-ups. 

Fixing it required reconfiguring how Compass understood experiences: we removed fallback behaviors tied to position, adjusted prompt templates so the bot would not assume “no experience” simply because volunteering appeared early, and added state checks to ensure the bot’s tone and suggestions matched what the user had already shared.  
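
As a rough illustration of what such a state check looks like (the categories and function below are hypothetical, not Compass’s actual code), the follow-up tone depends on what the user has already shared, not on where volunteering sits in the sequence:

```python
# Sketch of a state check: follow-up tone depends on what the user has
# already shared, not on the position of "volunteering" in the sequence.
# Names and categories are illustrative, not Compass's actual implementation.

def choose_followup_tone(collected_experiences: list[dict]) -> str:
    """Pick a follow-up style based on the experiences recorded so far."""
    if not collected_experiences:
        # Genuinely nothing shared yet: gently prompt reflection on
        # subtler forms of work (care work, helping neighbours, etc.).
        return "encourage_reflection"
    if all(e.get("category") in {"volunteering", "unpaid"} for e in collected_experiences):
        # Only informal or unpaid work so far: treat it as real experience,
        # not as a signal that the user has "nothing" to report.
        return "explore_informal_work"
    return "standard_followup"

# Example: a user who starts with volunteering is no longer treated as
# having no experience.
print(choose_followup_tone([{"category": "volunteering", "title": "Community clean-up organiser"}]))
# -> "explore_informal_work"
```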

The deeper lesson: product-level decisions (Level 2) can inadvertently introduce bias into model behavior (Level 1). By assuming “volunteering last” meant “no experience,” our conversational logic penalized users whose primary work was informal or unpaid – exactly the users Compass was designed to serve. Catching this during product testing, before impact evaluation, was critical.

What We Learned

Product tweaks improved metrics, but qualitative and behavioral signals explained why things moved. Four insights stood out:  

  • Users need structured guidance, not open-ended conversation. A flexible conversational agent that “chats” broadly often confuses users who want a clear purpose. Users preferred structured guidance that explained what the tool would produce (a skills summary and a CV). 
  • Cognitive load matters. Long, open-ended prompts produced fatigue. Shorter, scaffolded prompts led to more complete skill capture. 
  • Localization is cultural, not just linguistic. How we described unpaid or care work affected whether users saw it as “work.” Framing examples around household management, informal sales, or community leadership made those experiences easier to report. 
  • Users value explanations. When Compass showed how a skill was inferred from a specific user line, users were more likely to accept and keep the skill. Traceability matters for both adoption and perceived fairness. 
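
Here is a minimal sketch of what such a traceable skill record can look like; the field names are illustrative assumptions rather than Compass’s actual schema.

```python
# Sketch of a traceable skill record: each inferred skill keeps a pointer to
# the user utterance it was derived from, so the interface can show "we
# suggested X because you said Y". Field names are illustrative only.
from dataclasses import dataclass

@dataclass
class InferredSkill:
    label: str            # the inferred skill label
    evidence_quote: str   # the user's own words that triggered the inference
    turn_index: int       # where in the conversation the quote appeared

    def explanation(self) -> str:
        return f'Suggested "{self.label}" because you said: "{self.evidence_quote}"'

skill = InferredSkill(
    label="customer service",
    evidence_quote="I handled complaints at my aunt's shop on weekends",
    turn_index=7,
)
print(skill.explanation())
```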

These insights came from combining platform telemetry (session logs, completion funnels), short in-flow surveys, and targeted follow-up conversations with users at different stages.
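
For readers curious about the completion-funnel side of that telemetry, here is a minimal sketch of how a funnel can be computed from session logs; the stage names and log format are illustrative assumptions, not our actual schema.

```python
# Sketch of a completion funnel: given the furthest stage each session
# reached, count how many sessions survived to each stage. Stage names and
# the log format are illustrative, not Compass's actual telemetry schema.
from collections import Counter

STAGES = ["onboarding", "experience_gathering", "skill_exploration", "cv_generated"]

def completion_funnel(session_last_stage: dict[str, str]) -> dict[str, int]:
    """Map each stage to the number of sessions that reached it or further."""
    reached_index = Counter()
    for last_stage in session_last_stage.values():
        reached_index[STAGES.index(last_stage)] += 1
    funnel = {}
    remaining = 0
    # Walk the stages from last to first so each count includes later stages.
    for i in range(len(STAGES) - 1, -1, -1):
        remaining += reached_index[i]
        funnel[STAGES[i]] = remaining
    return dict(reversed(list(funnel.items())))

logs = {"s1": "onboarding", "s2": "cv_generated", "s3": "experience_gathering"}
print(completion_funnel(logs))
# {'onboarding': 3, 'experience_gathering': 2, 'skill_exploration': 1, 'cv_generated': 1}
```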

What’s Next

The Ajira pilot turned early failure into practical insight. Small design choices – a progress bar, a simplified login, the order of experience prompts – determined whether jobseekers completed Compass and trusted its outputs. 

These product and user improvements were prerequisites for a credible impact evaluation. With a stable, usable product and evidence that Compass supports reflection and confidence, we’re now running an RCT with Harambee Youth Employment Accelerator and SAYouth.mobi in South Africa. That study will test whether helping young people articulate their skills leads to better job search behaviors and labor market outcomes. 

In our next post, we’ll walk through the RCT design, pre-registered outcomes, and early results.

Join the conversation: Compass is open source on GitHub, with documentation here. We built it to be adaptable for anyone working on skills discovery or career guidance. We’d love to hear from you – whether you have questions about our product or evaluation metrics, or want to collaborate on localization or measurement. Drop us a line at hi@tabiya.org.