This is the final post in our three-part series on how Tabiya evaluates Compass, our AI-powered skills discovery tool for young jobseekers. In Part 1 of this series, we assessed how Compass’s modular architecture ensures model reliability. In Part 2, we described what it took to make the product actually work for users. This post tackles the hardest question: how can we know whether Compass actually improves employment outcomes for young jobseekers? In the language of the Agency Fund’s framework for evaluating AI in the social sector, this is Level 4 — Impact Evaluation.

The South African Context

For a young woman in Cape Town who has spent years managing her family’s household and caring for younger siblings, those skills rarely appear on a CV and rarely count in a job interview. In 2025, South Africa’s unemployment rate among 15-24 year olds stood at 60.2 percent. And for every man staying out of work for care responsibilities, more than eight women do. (You can learn more about the economic power of gender inclusion in South Africa in Harambee’s ‘Breaking Barriers’ reports.)

During their time outside the formal labor force, many people gain relevant skills – e.g., through caregiving, budgeting, organizing, and other forms of informal work. However, when they try to re-enter the formal job market, these skills are rarely recognized as marketable. This is the challenge Compass was built to address.

Compass helps jobseekers discover and articulate their skills through natural conversation. It surfaces their prior experiences (including informal work) and translates them into clearly marketable skills. As users recognize skills that they didn’t know were valuable, they update their beliefs about labor market prospects and feel more confident describing their value to employers, ultimately changing how they search and apply for jobs. This is the “Theory of Change” we want to test.   

Why Impact Evaluation Matters

A reliable model and a usable product don’t guarantee real-world impact. A tool that works smoothly and leaves users feeling confident can still fail to change anything that matters. Impact evaluations address questions that model and product evaluation levels cannot answer: does the intervention actually cause its intended improvements in people’s lives?

This distinction matters because impact failures look different from product failures. When a user abandons Compass because the login is broken, that’s a Level 2 (Product) problem. But if users complete Compass and feel more confident yet still don’t find work, that’s a Level 4 problem, and detecting it requires the kind of evidence that only rigorous impact evaluation can provide.

Establishing causation requires more than tracking outcomes. It requires comparing what happened to a treatment group against what would have happened without the intervention. That means random assignment, pre-registered outcomes, and independent data collection. Most importantly, randomization enables us to isolate the intervention’s effect from other factors like economic conditions or personal circumstances. (Read more in the Agency Fund’s “AI-specific considerations”.) 

Our partnership with Harambee Youth Employment Accelerator and SAYouth.mobi – South Africa’s largest public employment platform serving 4.5 million users – provides the scale needed for rigorous impact evaluation. We’re taking a phased approach that combines user evaluation with impact measurement. This includes a pilot randomized controlled trial (RCT) ahead of the full-scale RCT later this year.

Our RCT, developed by our colleagues at the University of Oxford and implemented with support from J-PAL Africa at the University of Cape Town, addresses a fundamental question:  

If young people better understand and articulate their skills, will they be more motivated, confident, and successful in the job market?

To our knowledge, this is among the first rigorous tests of AI-enabled skills discovery in a lower-resource context. By combining user evaluation with impact measurement, we’re learning not just whether Compass works, but how and for whom. If the RCT confirms even the intermediate effects seen in the pilot, it would contribute one of the first causal estimates of AI-enabled skills discovery on labor market confidence and search behavior in a low-resource setting — evidence that can directly inform how employment services, policymakers, and funders approach AI-assisted skills recognition. 

Impact Evaluation Design

To test these questions rigorously, we’re running an RCT with approximately 4,000 young people registered on SAYouth.mobi. Participants are randomly assigned to one of three groups:

  • Compass chatbot group (1,900): Completes conversational skills discovery through our AI chatbot 
  • Control group (1,900): Completes an equivalent online task where there is no skills elicitation 
  • Hybrid (static-form) group (200): Completes a traditional skills-picking form rather than the AI conversation 
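The three-arm assignment above can be sketched in a few lines of Python. This is an illustrative sketch only: the arm sizes follow the post, but the use of simple (unstratified) randomization with a fixed seed is our assumption for clarity — the actual randomization protocol is the one specified in the Pre-Analysis Plan.

```python
import random

def assign_groups(participant_ids, seed=42):
    """Randomly assign participants to the three trial arms.

    Arm sizes (1,900 / 1,900 / 200) follow the post; simple
    randomization without stratification is an assumption here.
    """
    rng = random.Random(seed)  # fixed seed keeps the assignment reproducible and auditable
    ids = list(participant_ids)
    rng.shuffle(ids)
    return {
        "compass": ids[:1900],      # AI chatbot skills discovery
        "control": ids[1900:3800],  # equivalent task, no skills elicitation
        "hybrid": ids[3800:4000],   # static skills-picking form
    }

groups = assign_groups(range(4000))
```

Because assignment is random, any systematic difference in outcomes between arms can be attributed to the intervention rather than to who chose to use it.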

Following best practice in empirical research, we have published a Pre-Analysis Plan (PAP) on the AEA’s RCT registry. A PAP is a public document, published before data collection starts, specifying exactly which outcomes we will measure and how we will analyze them. Pre-registration is increasingly standard practice in economics as a commitment device against p-hacking and publication bias.

What We’re Measuring

We’re measuring not just final employment outcomes, but also intermediate steps along the Theory of Change: whether people search more effectively, update their beliefs, and feel more confident. Measuring each link allows us to understand not just whether Compass works, but where in the chain it has the most effect. We’re also monitoring potential unintended effects, such as whether receiving information about low-demand skills discourages rather than redirects job search. 

Platform metrics (during the intervention) 

  • Engagement patterns: User completion rates, time spent, and conversation flows 
  • Skill articulation: Quality and comprehensiveness of skills identified 
  • Platform behavior: Changes in how users engage with SAYouth.mobi, including which jobs they click on and apply to 

Intermediate Behavioral Outcomes  

  • Job search behavior: Application frequency, types of jobs targeted, search strategies 
  • User confidence: Self-reported ability to describe their value to employers 
  • Belief updating: Whether participants’ expectations about their job prospects shift after identifying their skills 
  • Skills alignment: Match between identified skills and job requirements 

We prioritize these intermediate outcomes because in labor markets where formal jobs are scarce, demonstrating that Compass shifts confidence and search behavior is a meaningful result in itself — even before employment follows. 

Final Outcomes 

  • Employment status: Job placement and retention 
  • Earnings: Monthly income from employment 
  • Job offers and callbacks: Interview invitations and job offers received 

Secondary Outcomes 

  • Employment quality: Job satisfaction and working conditions 
  • Self-efficacy: Confidence in navigating the job market 
  • Time use: Allocation to paid work, unpaid work, and other activities 
  • Well-being: Mental health and overall life satisfaction 

How We’re Measuring

Our primary measures come from phone surveys, conducted together with J-PAL Africa two months after the intervention, to gather reliable data on employment status, income, and subjective outcomes like confidence and job satisfaction. This approach is essential in informal labor markets where administrative employment data can be patchy. In addition to the survey data, we’re also collecting information to substantiate our findings from several sources:

  • Compass metadata: Conversation flows, skill extraction patterns 
  • Harambee platform data: Application behavior, job matches from SAYouth.mobi 
  • Qualitative feedback: User experiences and stories, as well as any barriers to AI engagement 

To ensure we’re capturing reliable data, participants receive incentives. Participants in all groups receive R40 (approximately $2.58 USD) for completing the online task, to offset their internet costs. Those who complete the phone survey receive an additional R60 (approximately $3.70 USD). These amounts are not tied to treatment assignment and help ensure we can follow up with participants regardless of their employment outcomes.

What the pilot revealed

It was fun and also worked as a confidence booster to interact with the tool about qualifications and work experience. It makes one realize how capable they are even though they haven’t found permanent job as yet. 

– Participant in the Compass pilot

This kind of feedback, from our pilot launched in September 2025, reflects what the quantitative signals also suggest: Compass users found the experience useful and confidence-building. Our pilot sample was too small to draw firm conclusions on outcomes — we’ll confirm these patterns with the full RCT — but two technical findings from the pilot shaped how we approached the main trial. 

  1. The initial system for matching participants’ identified skills to job listings on SAYouth.mobi revealed very low overlap between identified skills and available jobs (less than 2%). This wasn’t because the skills were wrongly identified. It was a consequence of using ESCO’s highly granular skill categories: the taxonomy is precise enough to distinguish between, e.g., “operating a forklift” and “operating warehouse equipment”. While this is useful for skill elicitation, it creates a sparse matching problem when job listings use broader language.
  2. Our integration into SAYouth.mobi’s platform showed clear potential: participants actively engaged with job suggestions generated from their Compass skills profile, and the tool surfaced experiences that users had not previously considered relevant to formal work. 25% of participants reported feeling more listened to by chatbots than by humans when discussing career-related topics. Many participants left positive feedback, mentioning that Compass “has helped me to create an informative resume” and that “It was a great experience. I learnt more than I had expected”.
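The sparse-matching problem in the first finding can be illustrated with a toy example: exact string matches between fine-grained skill labels and the broader language of job listings are rare, while matching at a coarser taxonomy level recovers overlap. The skill labels and the parent-category mapping below are hypothetical, chosen only to mirror the forklift example above — they are not actual ESCO data or Compass code.

```python
# Hypothetical roll-up from granular skill labels to broader parent categories
# (illustrative only; real ESCO relations are richer than a flat dict).
PARENT = {
    "operating a forklift": "operating warehouse equipment",
    "operating a pallet jack": "operating warehouse equipment",
    "preparing household budgets": "budgeting",
}

user_skills = {"operating a forklift", "preparing household budgets"}
job_requirements = {"operating warehouse equipment", "budgeting"}

# Exact matching: granular labels never equal the broader job-listing language,
# so the overlap is empty -- the sparse matching problem.
exact_overlap = user_skills & job_requirements

# Coarser matching: roll each skill up to its parent category first,
# then intersect with the job requirements.
broad_overlap = {PARENT.get(s, s) for s in user_skills} & job_requirements
```

In this toy setup `exact_overlap` is empty while `broad_overlap` matches both requirements, which is why aggregating to a coarser level of the taxonomy (or using semantic rather than exact matching) is one natural response to the low-overlap finding.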

Looking Ahead

As we gather data throughout 2026, we’re committed to sharing results openly — including what doesn’t work. This series has tracked Compass from model validation through product testing to a full randomized controlled trial. Whether the RCT confirms impact or surfaces limitations, the evidence will inform how we — and others building AI for development — approach what comes next. 

Beyond South Africa, we’re continuing to expand Compass’s reach: We’re deploying Compass in Kenya with the Swahilipot Hub Foundation & NCCK as well as in Argentina with Fundación EMPUJAR. These deployments will each be accompanied by similar impact evaluations, which will strengthen our evidence base further: if similar patterns emerge across different contexts and user populations, we can be more confident that the findings reflect something real about Compass rather than local conditions.

Join the conversation: Full details on the trial design and outcome measures are available in the Pre-Analysis Plan (PAP) on the AEA’s RCT registry. Compass is open source on GitHub, with documentation here. We built it to be adaptable for anyone working on skills discovery or career guidance. We’d love to hear from you—whether you have questions about our product, evaluation metrics, or want to collaborate on localization or measurement. Drop us a line at hi@tabiya.org.