---
title: SGO — Semantic Gradient Optimization
emoji: 📊
colorFrom: indigo
colorTo: purple
sdk: docker
app_port: 7860
---
# SGO — Semantic Gradient Optimization
You're launching a product. You think the landing page is good. But **who have you actually asked?**
You could run a survey — but that takes weeks and you'd need to find the right people. You could ask an LLM — but one LLM opinion isn't a market. You could A/B test — but you need traffic first, and you don't know *what* to test.
**SGO lets you ask 50 realistic people what they think — in 3 minutes, for $0.10.**
It builds a representative panel from census-grounded synthetic personas, has each one score your thing from their perspective, then asks *"what would change your mind?"* — producing a priority-ranked list of what to fix first.
```
You: "Here's my landing page. Here's my target market."
SGO: "47 evaluators scored you. Avg 5.3/10.
Solo devs love it (7.2). Enterprise is blocked (3.1).
#1 concern: no SOC2. #2: no free tier.
Gradient:
+2.1 Add self-hosted option
+1.8 Add free tier ← biggest universal win
+1.4 Get SOC2 certified
+0.6 Drop price ← not actually the blocker"
```
---
## What Can You Use It For?
Anything someone else evaluates.
| What you're optimizing | Who evaluates it | What you learn |
|----------------------|-----------------|---------------|
| **Product** — landing page, pricing | Buyer personas by company size, role, budget | Which segments convert, which are blocked, and why |
| **Resume** — CV + cover letter | Hiring managers at startups vs. enterprises | What stands out, what's a red flag, what to lead with |
| **Pitch** — investor deck | VCs and angels at different stages | Whether the story lands, what questions they'd ask |
| **Policy** — proposed regulation | Stakeholders by role, income, geography | Who supports it, who opposes, what compromise works |
| **Content** — blog post, video | Readers at different expertise levels | Whether it hits the right level, what's confusing |
| **Profile** — professional bio, personal brand | Population sample by age, education, occupation | How different demographics perceive you |
SGO ships with a 1M-person census-grounded dataset ([Nemotron-Personas-USA](https://huggingface.co/datasets/nvidia/Nemotron-Personas-USA)) with structured demographics (age, sex, education, occupation, marital status, US geography) plus rich narrative fields — professional persona, skills and expertise, career goals, hobbies, cultural background, and personality. The narratives naturally encode things like seniority, industry, technical depth, and decision-making style, even though those aren't separate columns.
This means most domains work out of the box — the LLM evaluates from the persona's full context, not just the demographic fields. For highly specialized panels (e.g., Series B VCs, enterprise procurement officers), SGO can generate personas via LLM with explicit stratification constraints. See [limitations](#limitations) on generated vs. census-grounded panels.
In each case, SGO tells you **where you stand**, **what's working**, **what's not**, and **what specific change would help the most** — broken down by audience segment.
---
## Quick Start
```bash
git clone https://github.com/xuy/sgo.git && cd sgo
cp .env.example .env # Add your LLM API key (any OpenAI-compatible provider)
uv sync
uv run --extra web python web/app.py
# Opens at http://localhost:8000
```
The web interface walks you through the full pipeline: describe your entity, build a panel, evaluate, find the highest-impact changes, and audit your panel for cognitive biases.
<details>
<summary>Alternative: use as a Claude Code skill</summary>
```bash
git clone https://github.com/xuy/sgo.git ~/.claude/skills/sgo
cd ~/.claude/skills/sgo && cp .env.example .env && uv sync
```
Then run:
```
/sgo # Interactive — it asks what you're optimizing
/sgo entities/my_product.md # Start with an existing entity
/sgo "optimize my landing page" # Start from a description
```
</details>
<details>
<summary>CLI-only usage (no web interface)</summary>
```bash
uv run python scripts/setup_data.py # Download Nemotron personas (once, ~2GB)
# Then use scripts directly: evaluate.py, counterfactual.py, bias_audit.py, compare.py
# See AGENT.md for the full pipeline reference
```
</details>
---
## How It Works
You describe what you're optimizing and what your goal is. SGO builds a diverse panel, has each evaluator react, then focuses on the **persuadable middle** — the people who are *almost* convinced — to find what would tip them toward your goal.
SGO does **not** try to please everyone. People who scored 1–3 are not your audience — their feedback is informational, not actionable. The system focuses on moving the people who are close to yes.
**Five steps:**
1. **Describe your entity and goal** — what an evaluator would see, and what outcome you're optimizing for
2. **Build a panel** — 30–80 evaluators, stratified to cover the segments that matter
3. **Evaluate** — each evaluator scores 1–10. Results are segmented: champions (8+), persuadable (4–7), not-for-them (1–3)
4. **Find directions for your goal** — the persuadable middle re-evaluates hypothetical changes. With a goal, evaluators are weighted by relevance (VJP)
5. **Act and re-run** — make the top change, re-evaluate against the same panel, track improvement over time
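The segmentation in step 3 is simple to sketch. A minimal illustrative version, assuming evaluation results arrive as dicts with a `score` key (the data shape and function name are assumptions, not SGO's actual API):

```python
# Illustrative sketch of step 3's segmentation — not SGO's actual code.
# Splits results into the three bands SGO reports: champions (8+),
# persuadable (4-7), not-for-them (1-3).

def segment(results):
    bands = {"champions": [], "persuadable": [], "not_for_them": []}
    for r in results:
        if r["score"] >= 8:
            bands["champions"].append(r)
        elif r["score"] >= 4:
            bands["persuadable"].append(r)
        else:
            bands["not_for_them"].append(r)
    return bands

results = [
    {"persona": "Solo dev", "score": 7},
    {"persona": "Startup EM", "score": 5},
    {"persona": "Enterprise CTO", "score": 3},
    {"persona": "Data analyst", "score": 8},
]
bands = segment(results)
# Only the persuadable band feeds step 4's counterfactual probe.
```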
The key insight is step 4. The probe produces a ranked list of changes sorted by how much they'd move the persuadable middle toward your goal. SGO calls this the **semantic gradient** — technically a vector-Jacobian product when a goal is specified.
<details>
<summary>Example: what the gradient looks like</summary>
Each row is an evaluator. Each column is a hypothetical change. Each cell is the score delta.
| | Add free tier | Get SOC2 | Self-hosted | Open-core | Case studies |
|---|:---:|:---:|:---:|:---:|:---:|
| Solo dev | +2 | +1 | 0 | +1 | +3 |
| Startup EM | +1 | +3 | -1 | +2 | +4 |
| Enterprise CTO | 0 | +1 | +2 | +1 | +2 |
| Data analyst | +1 | +2 | 0 | 0 | +3 |
| **Average** | **+1.0** | **+1.8** | **+0.3** | **+1.0** | **+3.0** |
The column averages tell you what to fix first. "Case studies" has the highest average impact. "Self-hosted" helps enterprise but slightly hurts startups — a tradeoff, not a pure win.
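In code, the table above reduces to column means over a delta matrix (numpy sketch; the numbers are copied from the table, with the averages shown unrounded):

```python
import numpy as np

# Rows: evaluators (solo dev, startup EM, enterprise CTO, data analyst)
# Cols: changes (free tier, SOC2, self-hosted, open-core, case studies)
J = np.array([
    [2, 1,  0, 1, 3],
    [1, 3, -1, 2, 4],
    [0, 1,  2, 1, 2],
    [1, 2,  0, 0, 3],
])

col_avg = J.mean(axis=0)  # -> [1.0, 1.75, 0.25, 1.0, 3.0]
best = int(col_avg.argmax())  # index 4: "case studies"
```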
</details>
### What makes the panel realistic?
SGO uses [NVIDIA Nemotron-Personas-USA](https://huggingface.co/datasets/nvidia/Nemotron-Personas-USA) — 1 million synthetic Americans whose demographics match real US census distributions. Each persona includes detailed narratives: professional background, skills, career goals, hobbies, cultural background, and personality.
This matters because when you ask an LLM to "generate 50 diverse personas," you get 5–6 archetypes with surface variation — mostly coastal, college-educated, and tech-adjacent. You can't audit what's missing. Census-grounded personas give you the construction worker in suburban Illinois and the quilter in rural Texas, because census data says those people exist.
The principle: **define the population before the measurement, not after.**
### From general population to any domain
Nemotron covers age, sex, education, occupation, geography, and marital status as structured fields — plus rich narratives about each person's career, skills, values, and lifestyle. That's enough to directly evaluate anything consumer-facing: products, profiles, content, policy.
But what about domains the dataset doesn't explicitly cover — like "enterprise CTOs" or "Series B investors"? There are four ways to get there, from most grounded to most flexible:
**1. Filter by what's already there.** A Nemotron persona with `occupation: software_developer`, `education: graduate`, `age: 38` and a professional narrative describing team leadership *is* a plausible engineering manager evaluating your developer tool. You just filter and let the narrative do the work.
**2. Reframe the evaluation prompt.** Same persona, different lens. Instead of *"would you buy this?"*, ask *"you're evaluating this tool for your team — would you champion it internally?"* The persona's professional context, skills, and decision-making style naturally shape the answer.
**3. Enrich with a situational overlay.** Add context that the persona doesn't have: *"You are [full Nemotron persona]. You work at a 50-person Series A startup. Your team's tooling budget is $2k/month. You've been burned by vendor lock-in before."* The demographic grounding stays real; the professional situation is augmented.
**4. Generate from scratch, using Nemotron as a quality bar.** For truly specialized roles (VC partners, procurement officers, regulatory lawyers), generate personas via LLM — but use Nemotron personas as few-shot examples so the output matches the depth and internal consistency of the dataset. SGO's `generate_cohort.py` does this with an explicit warning about the quality tradeoff.
Each step trades some census grounding for more domain specificity. For most use cases, steps 1–2 are enough.
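Approaches 1–3 are essentially prompt construction. A minimal sketch of the situational overlay (approach 3) — the field names, prompt wording, and helper are illustrative assumptions, not SGO's actual templates:

```python
# Hypothetical sketch of a situational-overlay prompt builder.
# Field names ("professional_persona", "occupation", ...) are assumed
# for illustration; they are not necessarily SGO's schema.

def overlay_prompt(persona: dict, overlay: str, question: str) -> str:
    """Wrap a census-grounded persona with an added professional situation."""
    return (
        f"You are {persona['professional_persona']} "
        f"({persona['age']}, {persona['occupation']}, {persona['state']}).\n"
        f"Situation: {overlay}\n"
        f"{question}\n"
        "Answer in character, then give a 1-10 score."
    )

prompt = overlay_prompt(
    {"professional_persona": "a pragmatic backend engineer",
     "age": 38, "occupation": "software developer", "state": "Illinois"},
    "You lead tooling decisions at a 50-person Series A startup "
    "with a $2k/month budget; you've been burned by vendor lock-in.",
    "You're evaluating this data pipeline tool for your team — "
    "would you champion it internally?",
)
```

The demographic grounding stays in the persona dict; only the situational context is layered on top.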
---
## Worked Example
<details>
<summary>SaaS product launch — full walkthrough</summary>
### Setup
A seed-stage startup launching "Acme API," a managed data pipeline tool. The landing page says: 200+ connectors, pay-as-you-go at $0.01/sync, SOC2 pending, $99/mo starter, 3-person team.
### Panel
40 buyer personas stratified by company size (solo → enterprise), role (IC engineer → CTO → data analyst), budget, and tech stack.
### Results
```
Solo devs: avg 7.2 ← love it
Startups: avg 5.8 ← cautious
Enterprise: avg 3.1 ← blocked
Non-technical: avg 4.5 ← confused
```
### Gradient
```
Rank avg Δ Change
1 +2.1 Add self-hosted / VPC option
2 +1.8 Add free tier (1,000 syncs/mo)
3 +1.4 SOC2 certified (not pending)
4 +1.2 Open-core positioning
5 +1.0 Add 3 named customer case studies
6 +0.6 Drop price to $49/mo
```
**Insight**: Price isn't the blocker. Trust and deployment model are.
### Iterate
Ship the free tier. Re-evaluate. Score moves from 5.3 → 6.1. Then get SOC2. Score moves to 7.0. Each step verified against the same panel.
```
v1 baseline 5.3 avg 0% positive concerns: price, trust
v2 + free tier 6.1 avg 12% positive concerns: trust
v3 + SOC2 7.0 avg 28% positive concerns: (none)
```
</details>
---
## Bias Auditing & Calibration
LLM evaluators don't exhibit cognitive biases at human-realistic levels — they may be too rational (under-biased) or show biases in the wrong patterns (mis-biased). Since real expert panels *are* biased, matching their behavior means matching their bias profile, not eliminating bias.
SGO includes a bias audit inspired by [CoBRA](https://arxiv.org/abs/2509.13588) (Liu et al., CHI'26 Best Paper), which uses validated social science experiments to measure and control cognitive biases in LLM agents.
### Measuring bias
`bias_audit.py` runs three probes through the same LLM + persona pipeline SGO uses for evaluation:
| Probe | What it tests | Human baseline |
|-------|--------------|----------------|
| **Framing** | Same entity, gain-framed vs. loss-framed — do evaluators shift scores based on rhetoric vs. substance? | ~30% shift (Tversky & Kahneman, 1981) |
| **Authority** | Entity with/without credibility signals (SOC2, press, logos) — how much do credentials move the needle? | ~20% sensitivity in evaluation contexts |
| **Order** | Same entity, sections reordered — does information order anchor scores? | Should be ~0% |
```bash
uv run python scripts/bias_audit.py \
--entity entities/my_product.md \
--cohort data/cohort.json \
--probes framing authority order \
--sample 10
```
Output: `results/bias_audit/report.md` — per-probe shift %, gap vs. human baselines, and whether the panel is over-biased, under-biased, or well-calibrated.
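The per-probe shift the report summarizes can be sketched as a mean absolute score delta between the two conditions, expressed as a percentage of the 1–10 range. This is an illustrative metric, not necessarily the exact formula `bias_audit.py` uses:

```python
# Illustrative framing-shift metric (assumed, not SGO's exact formula).
# A 1-10 scale spans 9 points, so shifts are normalized by 9.

def shift_pct(scores_a, scores_b, scale=9.0):
    """Mean absolute per-evaluator shift between two framings, as % of range."""
    deltas = [abs(a - b) for a, b in zip(scores_a, scores_b)]
    return 100.0 * (sum(deltas) / len(deltas)) / scale

gain = [7, 6, 8, 5]  # gain-framed scores (made-up data)
loss = [5, 6, 6, 4]  # loss-framed scores (made-up data)
pct = shift_pct(gain, loss)  # (2+0+2+1)/4 / 9 * 100 ≈ 13.9%
```

A result like 13.9% against the ~30% human framing baseline would flag the panel as under-biased.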
### Calibrating evaluation
If the audit reveals bias gaps, add `--bias-calibration` to your evaluation run:
```bash
uv run python scripts/evaluate.py \
--entity entities/my_product.md \
--cohort data/cohort.json \
--tag calibrated \
--bias-calibration
```
This appends bias-aware instructions to the evaluation prompt — reducing framing, authority, and order artifacts while preserving realistic human-level biases. The goal is not to eliminate bias but to match the type and magnitude of biases that real expert panels exhibit.
### The expert panel gap
The gap between SGO and real expert panels has three components:
| Gap | What it means | How SGO addresses it |
|-----|--------------|---------------------|
| **Knowledge** | Does the LLM know what an expert knows? | Persona enrichment, narrative context |
| **Preference** | Does it weight factors correctly? | Stratification, prompt design |
| **Bias** | Does it exhibit human-realistic cognitive biases? | Bias audit + calibration (CoBRA-inspired) |
---
## Limitations
- **Directional, not definitive** — this is synthetic research. Treat results as strong hypotheses, not proof. Validate important decisions with real users.
- **LLM biases** — evaluators inherit the model's cultural blind spots. Results skew toward what the LLM thinks people think. Use `bias_audit.py` to measure and `--bias-calibration` to mitigate.
- **Independent evaluators** — each persona scores in isolation. Real-world opinions are social — people influence each other. SGO doesn't capture herd effects.
- **Not all changes add up** — two changes that each score +1.5 might not give +3.0 together. Test combinations explicitly.
---
<details>
<summary>Technical details</summary>
## The Semantic Gradient
SGO computes a Jacobian matrix of score deltas — how each evaluator's score would shift for each hypothetical change:
$$J_{ij} = f(\theta + \Delta_j, \; x_i) - f(\theta, \; x_i)$$
### Goal-weighted gradient (VJP)
The key insight: not all evaluators matter equally. A luxury brand shouldn't optimize for budget shoppers. A dating profile shouldn't optimize for incompatible matches.
SGO uses a **goal vector** `v` that weights each evaluator by their relevance to your objective. The gradient is a vector-Jacobian product:
$$\nabla_j = \sum_{i} v_i \cdot J_{ij}$$
Where `v_i` is the goal-relevance weight for evaluator `i` (0 = irrelevant, 1 = ideal target).
Without a goal, `v = [1/n, ...]` — uniform weights, optimizing for universal appeal. With a goal like *"close enterprise deals"*, enterprise CTOs get `v ≈ 1` and solo hobbyists get `v ≈ 0`.
The LLM assigns goal-relevance weights automatically by evaluating each persona against your stated objective. This means the gradient tells you *"what changes move you toward your goal"*, not *"what changes make everyone like you more"*.
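Both formulas fit in a few lines of numpy. The matrix entries and goal weights below are made up for illustration:

```python
import numpy as np

# J[i, j]: score delta for evaluator i under hypothetical change j
# Cols: free tier, SOC2, self-hosted
J = np.array([
    [2, 1,  0],   # solo dev
    [1, 3, -1],   # startup EM
    [0, 1,  2],   # enterprise CTO
])

# No goal: uniform weights v = [1/n, ...] -> plain column averages
v_uniform = np.full(3, 1 / 3)
grad_uniform = v_uniform @ J   # [1.0, 1.67, 0.33]: self-hosted ranks last

# Goal "close enterprise deals": CTO dominates, solo dev barely counts
v_goal = np.array([0.1, 0.5, 1.0])
grad_goal = v_goal @ J         # [0.7, 2.6, 1.5]: self-hosted jumps to second
```

Note how the goal vector reorders the ranking: self-hosted moves from last place under uniform weights to second place once enterprise evaluators dominate.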
### What to probe
Only probe changes you'd actually make:
| Category | Examples | Probe? |
|----------|---------|--------|
| **Presentation** — framing, tone, emphasis | Rewrite headline, reorder features | Yes |
| **Actionable** — real changes with real cost | Add free tier, get SOC2 | Yes |
| **Fixed** — can't change | History, sunk costs | No |
| **Boundary** — won't change | Values, ethics, mission | No |
### Notation
| Symbol | Meaning |
|--------|---------|
| θ | Entity you control |
| x | Evaluator persona |
| g | Goal — what you're optimizing for |
| f(θ, x) | LLM evaluation → score + reasoning |
| v_i | Goal-relevance weight for evaluator *i* |
| Δⱼ | Hypothetical change |
| Jᵢⱼ | Score delta: evaluator *i*, change *j* |
| ∇ⱼ | Goal-weighted gradient (VJP): impact of change *j* toward goal *g* |
## Project Structure
```
├── README.md # This file
├── AGENT.md # Execution guide for AI agents
├── SKILL.md # Claude Code skill definition
├── pyproject.toml # Dependencies
├── .env.example # API key template
├── scripts/
│ ├── setup_data.py # Download Nemotron personas (once)
│ ├── persona_loader.py # Load + filter
│ ├── stratified_sampler.py
│ ├── generate_cohort.py # LLM-generate personas (fallback)
│ ├── evaluate.py # Scorer (supports --bias-calibration)
│ ├── counterfactual.py # Semantic gradient probe
│ ├── bias_audit.py # CoBRA-inspired cognitive bias measurement
│ └── compare.py # Cross-run diff
├── web/
│ ├── app.py # FastAPI backend (primary entry point)
│ └── static/index.html # Single-page frontend
├── templates/ # Entity + changes templates
├── entities/ # Your documents (gitignored)
├── data/ # Cohorts (gitignored)
└── results/ # Run outputs (gitignored)
```
</details>
## License
MIT