Founding AI Engineer at Sonia (W24)
$120K - $180K  •  0.25% - 0.75%
AI Therapy
San Francisco, CA, US
Full-time
US citizen/visa only
Any (new grads ok)
About Sonia

Our sole focus is to build a safe AI therapist clinically as effective as the 99th percentile human therapist and to make mental health support accessible to anyone. After 12 months of exploring all sorts of architectures, form factors, and approaches, we have recently had a major breakthrough. As of a few weeks ago, we now have a version with which we are starting to see very significant clinical outcomes in the form of GAD-7 (anxiety score) reductions even just after 2 weeks of using Sonia. We are receiving messages daily of how it’s changing people’s lives. We’re just at the start and there is still so much to do and build. Now it’s time to move fast and scale the team.

About the role
Skills: Prompt Engineering, Python

We are currently a 5 person team based in San Francisco and have raised $3.5m from top investors including Y Combinator, Moonfire, and the founders of Verkada, Reddit, and Instacart. We have a broad clinical advisory board, including Oren Frank, the founder of Talkspace.

The Role

To explain what you will do, it might help to first give a bit of background.

Sonia is a voice-based AI therapist that lives inside a mobile app. Our system is fully built on top of OpenAI (a “wrapper”). As simple as that sounds, you need to do more than tell GPT-5 to respond empathetically if you want to build an AI therapist as good as a top human. We have >50 prompts that run simultaneously throughout a conversation and need to interact with each other in realtime. Here are some examples:

  • We have prompts that run in parallel to the session and perform various semantic checks (e.g. should a certain intervention be applied, is the client engaged, has the AI therapist missed something, is there an open topic that should be circled back to, and so on). Some are latency critical, some can take more time to reason. Some need a lot of context from previous sessions, some only need the ongoing transcript. Some generate an output that gets injected as “advice” into the response prompt; some change the value of a backend parameter that is used to sample other prompt inputs probabilistically
  • When the client sends a message, we run several safety checks in parallel to generating a response, such as a risk detection prompt that looks out for 5 categories of risk (suicidal ideation, medical emergencies, etc.). Upon trigger, the original response call gets aborted; instead, the response model receives an injection with special instructions on how to address the potential crisis and what information it needs to collect to decide whether the conversation can continue or should be escalated
  • Whenever we receive a transcription from the client, we need to evaluate semantically whether they are done speaking or might continue (in order to know when the AI should generate a response). This is very domain specific, and in therapy especially, you shouldn’t always respond directly when there is a pause. To take it a step further, there might be client-specific patterns that can be learned. For example, someone might always end an utterance with “So”. A generic semantic classifier would label this “not finished” and wait to respond. But with a client-specific memory of linguistic patterns, the system can learn that for this client specifically, a transcribed “So” doesn’t necessarily mean something else will follow, and latency can be reduced by hitting the LLM immediately. We really obsess about the details. Our users are often people who are really struggling, and every bit we can do to make the experience smoother is worth it and can have an impact on their wellbeing.

Clearly we won’t publicly share the exact details of what we do, especially when it comes to the semantic content of the responses themselves (which is the main focus), but I hope this paints a picture of the types of workflows. It all rests on a few pretty simple observations:

  • A great therapist thinks of 20 different things at the same time, some impacting their very next response, some the direction they want to take the conversation over the next few turns, some their entire conceptualization of the client and approach to working with them
  • GPT-5 level intelligence is more than enough to do each of them pretty well in isolation if you define the task well
  • No foundation model over the next 5 years will be able to do all of them implicitly at the same time, predominantly because no dataset in the world contains the thinking process of a therapist. (The way we are building the architecture, we are generating one.)

Artificial Simplified Example: In a transcript with 200 client messages, if message 201 contradicts message 36, the LLM will not pick up on that, as it will be focused on completely different things - predominantly the content of the most recent few messages and the instructions it is given. However, if you run a separate model that gets all 201 messages and simply checks whether the most recent one contradicts any of the previous ones, the LLM will do very well and detect message 36 (because the task gets reduced to a simple needle-in-a-haystack problem). And if you then tell the main response model that the last client message contradicted message 36, it is easy to get the response model to touch on that or otherwise take it into account therapeutically and respond accordingly.

To summarize, what we do is deconstruct what is happening in the mind of an exceptional human therapist and build a highly complex realtime system that orchestrates all of these individual processes. Your role is to help with exactly that :).

While I hope the above tangent was helpful, in more concrete terms what you would be doing is:

  • Analyze data, read anonymized transcripts and talk to clinicians in order to understand where Sonia needs to improve semantically
  • Figure out why the AI is doing what it currently is doing and where certain unintended behavior comes from (a misleading instruction that can be fixed quickly? missing context that needs an entirely new memory component?)
  • Build datasets of relevant LLM traces that show this behavior and can be used to test once we have made changes (we have built our own internal eval tooling for that)
  • Get creative - find ways to fix the problem and improve the system by modifying prompts or coming up with a fully new component that should be integrated
  • Iterate 10+ times without giving up until something finally works; be very detail oriented and perfectionist. Balance being incredibly analytical with trusting your intuition
  • Test thoroughly to make sure there are no negative side effects before deploying
  • Monitor production and repeat the entire process

Of course, at a startup of our size and the speed at which we move, every week will be different and there will be other responsibilities and tasks in which you will be involved. For example, we have several LLM-based features outside of the voice AI sessions themselves (personalized meditations, reflection questions, session reviews, etc.) and are also looking for support iterating on our internal tooling to better evaluate and measure Sonia’s performance.

Culture

We care about three things: kindness, hard work, and intelligence. We only want people who are genuinely interested in (ideally obsessed with) making the world a happier place and are ready to work (very) hard to make people smile at scale. For the most part, we work 6-7 days a week in person in SF and are at the office 8-8 on weekdays. It will be intense, but we have a lot of fun and are basically all best friends. And trust me: being on a user interview where someone is crying while telling you how your software changed their life and is the reason they have a job again is the most rewarding thing ever and makes it all worth it. Additionally, you can count on me personally doing everything I possibly can to make you happy, successful, and fulfilled at Sonia.

What We Are Looking For

  • Background in computer science and strong engineering skills (especially Python)
  • Previous experience building AI products end to end and writing code that was used by other people (new grad ok - can be side projects or internships)
  • Highly creative and strong systems thinker
  • Exceptional writing skills - this is VERY important (please send me something you wrote)
  • Previous experience in psychology is strongly preferred - if not, you need to have a genuine interest in learning quickly and be the type of person who will read a clinical psychology textbook on a Sunday afternoon

A Few Client Stories from the past Weeks

  • A person from Alabama whose OCD and anxiety were so strong that they always needed to turn the car around 3-4 times to check whether they had actually turned off the stove. Sonia helped them get it down to once.
  • A woman from the South who lives in a rural area with no in-network therapist nearby under her Blue Shield insurance. She has been unemployed for the past 15 years and told us that thanks to her chats with Sonia she gained the courage to apply for jobs again. A few days later, just before her user interview, she was offered a job and is now employed again.
  • A lady who had spammed my personal phone number about how good Sonia is and how she is trying to get her fiancé to use the app because she thinks it will save their marriage. Turns out he doesn’t have an iPhone so we bought him one to try it out.
  • The parent of 2 sons with criminal records, who has built a mantra with Sonia and is now able to go through her days with more acceptance and peace, something that, according to her, the therapist she has been seeing every Monday morning for a long time hasn’t managed to achieve.

Interview Process
  • 15min chat with me :)
  • 2-4h remote assessment
  • 30min chat with all 3 co-founders
  • 2 day onsite trial in SF (compensated)

Other jobs at Sonia

Frontend  •  Full-time  •  San Francisco, CA, US  •  $120K - $180K  •  0.25% - 0.75%  •  Any (new grads ok)

Machine Learning  •  Full-time  •  San Francisco, CA, US  •  $120K - $180K  •  0.25% - 0.75%  •  Any (new grads ok)

Research  •  Full-time  •  San Francisco, CA, US  •  $100K - $250K  •  0.25% - 1.00%  •  1+ years
