Gemini Website Generation — Split Test Dashboard

105+ tests across 14 rounds • 10 test websites • 3 models • Updated March 11, 2026

Test Websites

| Site | Business | Niche (location) |
|---|---|---|
| S1 | Professional Lawyer | Law Firm (Dubai) |
| S2 | The Groomer | Pet Grooming (Dubai) |
| S3 | Now Consultant | Accounting (Dubai) |
| S4 | PAL Auto Garage | Auto Repair (Dubai) |
| S5 | Jazz Lounge Spa | Spa (Dubai) |
| S6 | La Clé du Barbier | Barbershop (France) |
| S7 | Avocat Fiscaliste Semon | Tax Attorney (Paris) |
| S8 | Tygroo | Auto Service (France) |
| S9 | Lucas Sebban | Criminal Law (Paris) |
| S10 | Cabinet Benane | Family Law (Paris) |

Phase 1: Finding the Optimal Config (R1–R7)
R1: Instruction Level
full vs medium vs lean vs zero (20 tests)
Winner: lean / full (tied 8.6)

| Level | S1 | S2 | S3 | S4 | S5 | Avg |
|---|---|---|---|---|---|---|
| Full | 8 | 9 | 9 | 8 | 9 | 8.6 |
| Medium | 8 | 7 | 9 | 9 | 9 | 8.4 |
| Lean | 9 | 9 | 8 | 8 | 9 | 8.6 |
| Zero | 8 | 8 | 8 | 8 | 9 | 8.2 |

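Insight: Less is more. Lean instructions match full instructions (both 8.6), so the extra guidance buys nothing. Removing instructions entirely costs only a little (8.2).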
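
The averages and the tied winner in R1 can be reproduced directly from the per-site scores. The following is a minimal sketch, not the dashboard's own scoring code; the `summarize` helper and the dictionary layout are assumptions made for illustration.

```python
from statistics import mean

# Per-site scores (S1..S5) copied from the R1 table.
r1_scores = {
    "full":   [8, 9, 9, 8, 9],
    "medium": [8, 7, 9, 9, 9],
    "lean":   [9, 9, 8, 8, 9],
    "zero":   [8, 8, 8, 8, 9],
}

def summarize(scores):
    """Return (averages, winners); winners holds every variant tied for the top average."""
    averages = {name: round(mean(vals), 1) for name, vals in scores.items()}
    best = max(averages.values())
    winners = sorted(name for name, avg in averages.items() if avg == best)
    return averages, winners

averages, winners = summarize(r1_scores)
print(averages)
print(winners)
```

The same helper applies unchanged to the score rows of the later rounds.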
R2: Image Source
real vs stock vs mix vs none (20 tests)
Winner: real / mix / none (tied 8.6)

| Source | S1 | S2 | S3 | S4 | S5 | Avg |
|---|---|---|---|---|---|---|
| Real | 8 | 9 | 8 | 9 | 9 | 8.6 |
| Stock | 7 | 5 | 7 | 7 | 4 | 6.0 |
| Mix | 8 | 9 | 8 | 9 | 9 | 8.6 |
| None | 9 | 9 | 8 | 8 | 9 | 8.6 |

Insight: Stock tanked at 6.0. Gemini finds its own images when given none.
R3: Design Control
specified vs vibe vs free (15 tests)
Winner: specified / free (tied 9.0)

| Control | S1 | S2 | S3 | S4 | S5 | Avg |
|---|---|---|---|---|---|---|
| Specified | 9 | 9 | 9 | 9 | 9 | 9.0 |
| Vibe | 9 | 9 | 8 | 9 | 4 | 7.8 |
| Free | 9 | 9 | 9 | 9 | 9 | 9.0 |

Insight: Free design = same quality as specified, zero prompt cost. Vibe had a bad outlier.
R4: Model Comparison
Pro vs Flash vs Lite (15 tests)
Winner: Pro (9.0)

| Model | S1 | S2 | S3 | S4 | S5 | Avg | $/M input tokens | $/M output tokens |
|---|---|---|---|---|---|---|---|---|
| Pro | 9 | 9 | 9 | 9 | 9 | 9.0 | $1.25 | $10.00 |
| Flash | 9 | 9 | 8 | 9 | 9 | 8.8 | $0.10 | $0.40 |
| Lite | 7 | 4 | 8 | 6 | 8 | 6.6 | $0.02 | $0.10 |

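
To see how the per-token prices in R4 translate into a per-site cost, a rough estimate can be computed as below. This is a sketch under stated assumptions: the dashboard does not report token counts, so `INPUT_TOKENS` and `OUTPUT_TOKENS` are hypothetical values, picked so that the Pro figure lands near the ~$0.30 per site reported in the summary.

```python
# USD per million tokens as (input, output), copied from the R4 table.
PRICES = {
    "pro":   (1.25, 10.00),
    "flash": (0.10, 0.40),
    "lite":  (0.02, 0.10),
}

def cost_per_site(model, input_tokens, output_tokens):
    """Estimate the USD cost of one generation from token counts."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

# Hypothetical token counts; not measured values from the tests.
INPUT_TOKENS = 8_000
OUTPUT_TOKENS = 29_000

for model in PRICES:
    print(f"{model}: ${cost_per_site(model, INPUT_TOKENS, OUTPUT_TOKENS):.4f}")
```

Because output tokens dominate a full-page HTML generation, the output price is the figure that matters most when comparing models.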
R5: Temperature
0.3 vs 0.5 vs 0.7 (15 tests)
Winner: 0.5 (9.0)

| Temp | S1 | S2 | S3 | S4 | S5 | Avg |
|---|---|---|---|---|---|---|
| 0.3 | 9 | 9 | 8 | 9 | 9 | 8.8 |
| 0.5 | 9 | 9 | 9 | 9 | 9 | 9.0 |
| 0.7 | 9 | 8 | 9 | 9 | 8 | 8.6 |

Insight: 0.5 = perfect 9s. 0.3 slightly lower (8.8). 0.7 inconsistent (8.6).
R6: Best Combo Validation
lean + none + free + pro + 0.5 (5 tests)
9.0 on all 5 niches

| Site | Score |
|---|---|
| S1 Professional Lawyer | 9 |
| S2 The Groomer | 9 |
| S3 Now Consultant | 9 |
| S4 PAL Auto Garage | 9 |
| S5 Jazz Lounge Spa | 9 |

Config locked: lean + none + free + pro + temp 0.5 = 9.0/10.
R7: Section Completeness
lean vs checklist vs skeleton (15 tests)
Winner: lean / checklist (tied 8.6)

| Method | S1 | S2 | S3 | S4 | S5 | Avg |
|---|---|---|---|---|---|---|
| Lean | 9 | 9 | 8 | 8 | 9 | 8.6 |
| Checklist | 9 | 8 | 8 | 9 | 9 | 8.6 |
| Skeleton | 9 | 8 | 9 | 7 | 9 | 8.4 |

Key finding: More sections = more image placeholders = lower scores. Image quality is the bottleneck, not sections.
Phase 2: Feedback Loop (R9)
R9: 2-Pass Generate → Score → Fix
All 10 sites. Generate, screenshot, AI score, critique → regenerate.
Avg +0.4 overall (7.9 → 8.3), +1.1 on images (6.1 → 7.2)
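
| Site | Niche | Pass 1 | Pass 2 | Δ | Img P1→P2 |
|---|---|---|---|---|---|
| S1 | Law Firm | 8 | 9 | +1 | 7→8 |
| S2 | Pet Grooming | 7 | 7 | 0 | 5→4 |
| S3 | Accounting | 8 | 9 | +1 | 6→9 |
| S4 | Auto Repair | 9 | 9 | 0 | 9→9 |
| S5 | Spa | 9 | 8 | -1 | 9→5 |
| S6 | Barbershop | 9 | 9 | 0 | 9→9 |
| S7 | Tax Attorney | 8 | 9 | +1 | 7→9 |
| S8 | Auto Service | 7 | 7 | 0 | 3→5 |
| S9 | Criminal Law | 7 | 9 | +2 | 2→9 |
| S10 | Family Law | 7 | 7 | 0 | 4→5 |
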
Rules: Pass 1 = 9 → skip Pass 2 (S5 regressed 9 → 8). Pass 1 = 8 → run Pass 2 (all three sites at 8 reached 9). Pass 1 = 7 → run Pass 2 but expect little (only 1 of 4 sites improved). Images = #1 bottleneck.
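
The generate → score → fix loop and the skip rule can be sketched as follows. This is an illustrative outline, not the pipeline used in the tests: `generate` and `score` are hypothetical callables (the real ones would call Gemini and run the screenshot-based AI scorer), and the stubs at the bottom exist only to make the example runnable.

```python
from typing import Callable, Tuple

def should_run_pass2(pass1_score: int, target: int = 9) -> bool:
    """Rule from R9: skip Pass 2 once Pass 1 reaches the target score."""
    return pass1_score < target

def two_pass(generate: Callable[[str], str],
             score: Callable[[str], Tuple[int, str]],
             brief: str,
             target: int = 9) -> Tuple[str, int]:
    """Generate, score, and regenerate once with the critique; keep the better page."""
    html1 = generate(brief)
    score1, critique = score(html1)
    if not should_run_pass2(score1, target):
        return html1, score1
    html2 = generate(f"{brief}\n\nFix these issues:\n{critique}")
    score2, _ = score(html2)
    # A second pass can regress (S5 went 9 -> 8 in R9), so never return a lower score.
    return (html2, score2) if score2 >= score1 else (html1, score1)

# Hypothetical stand-ins for the model call and the AI scorer.
def fake_generate(prompt: str) -> str:
    return "<html>v2</html>" if "Fix these issues" in prompt else "<html>v1</html>"

def fake_score(html: str) -> Tuple[int, str]:
    return (9, "") if "v2" in html else (8, "hero image missing")

print(two_pass(fake_generate, fake_score, "Tax attorney site"))
```

Keeping the higher-scoring page is a design choice added here as a guard; the R9 table only records both scores.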
Phase 3: Prompt Engineering & Image Bank (R10–R14)
R10: Anti-Slop Design Rules
Editorial typography + distinctive palettes + layout variety
Better aesthetics, same scores

| Site | P1 | P2 | Img P1 |
|---|---|---|---|
| S7 Tax Attorney | 9 | 9 | 8 |
| S9 Criminal Law | 7 | 8 | 3 |
| S10 Family Law | 8 | 8 | 5 |

R11: Layout-Only Rules
4 layout rules only (~60 words). No font/color constraints.
9/10 all sites on Pass 1

| Site | P1 | P2 | Img P1 |
|---|---|---|---|
| S7 Tax Attorney | 9 | 9 | 8 |
| S9 Criminal Law | 9 | 9 | 8 |
| S10 Family Law | 9 | 9 | 8 |

Breakthrough: 4 layout rules = all 9s on Pass 1. No Pass 2 needed. Font/color rules in R10 caused breakage.
R12: CRO + Content Rules
R11 + 10 CRO/content rules (~200 words)
REGRESSION

| Site | P1 | P2 | Img P1 |
|---|---|---|---|
| S7 Tax Attorney | 9 | 8 | 7 |
| S9 Criminal Law | 7 | 9 | 2 |
| S10 Family Law | 7 | 8 | 3 |

Lesson: More rules = worse output. 14 rules diluted Gemini's attention.
R13: Niche Rules + Image Bank ⭐
R11 layout rules + law-firm design direction + copywriting + 38 Nano Banana Pro images. Markdown. Temp 0.5. Single pass.
NEW BEST: 9/10, images 9/10
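
| Site | Visual | Sections | Images | Copy | Mobile | Overall |
|---|---|---|---|---|---|---|
| S7 Tax Attorney | 9 | 10 | 9 | 9 | 8 | 9 |
| S9 Criminal Law | 9 | 10 | 9 | 9 | 8 | 9 |
| S10 Family Law | 9 | 10 | 9 | 9 | 8 | 9 |
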
Image bank solved the #1 bottleneck. Images: avg 4-8 → consistent 9. Niche direction adds polish without rule-count bloat.
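
A single-pass Markdown prompt of the kind R13 describes could be assembled roughly as shown below. This is a sketch only: the dashboard does not publish the actual prompt, so the section headings, the instruction to use only the listed URLs, and the example image entries are all assumptions. The four rule names, the palette, and the fonts are taken from the Winning Configuration table.

```python
# Rule names from the Winning Configuration table; the real rule wording is not published.
LAYOUT_RULES = ["Asymmetry", "Scale contrast", "Variety", "Atmosphere"]

def build_prompt(business, niche, palette, fonts, images):
    """Assemble a Markdown prompt from layout rules, niche direction, and an image bank."""
    lines = [
        f"# Website brief: {business}",
        "",
        f"Niche: {niche}",
        "",
        "## Layout rules",
        *[f"- {rule}" for rule in LAYOUT_RULES],
        "",
        "## Design direction",
        f"- Palette: {palette}",
        f"- Fonts: {fonts}",
        "",
        "## Image bank (use only these URLs)",
        *[f"- {url}: {description}" for url, description in images],
    ]
    return "\n".join(lines)

# Hypothetical image-bank entries; the real bank holds 38 generated images.
example_images = [
    ("https://example.com/bank/office-01.jpg", "law office interior"),
    ("https://example.com/bank/scales-02.jpg", "scales of justice, close-up"),
]

prompt = build_prompt(
    business="Avocat Fiscaliste Semon",
    niche="Tax Attorney (Paris)",
    palette="Navy/Gold",
    fonts="DM Serif Display / Inter",
    images=example_images,
)
print(prompt)
```

Passing the image bank as explicit URLs is what removes the need for the model to find or invent image sources.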
R14: XML + Temp 1.0 + Verbosity
Same as R13 but XML tags, temp 1.0 (Google guide), explicit verbosity. Single pass.
Also 9/10 but slower
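
| Site | Visual | Sections | Images | Copy | Mobile | Overall |
|---|---|---|---|---|---|---|
| S7 Tax Attorney | 9 | 9 | 9 | 9 | 8 | 9 |
| S9 Criminal Law | 9 | 10 | 9 | 9 | 8 | 9 |
| S10 Family Law | 9 | 10 | 8 | 9 | 8 | 9 |
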
Verdict: XML + temp 1.0 = no improvement. R14 slightly worse images (8 vs 9 on S10), 25% slower. R13 wins.
Phase 3 Evolution (3 French Law Sites)
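
| Site | R9 | R10 | R11 | R12 | R13 ⭐ | R14 |
|---|---|---|---|---|---|---|
| S7 Tax | 8→9 | 9→9 | 9→9 | 9→8 | 9 (img: 9) | 9 (img: 9) |
| S9 Criminal | 7→9 | 7→8 | 9→9 | 7→9 | 9 (img: 9) | 9 (img: 9) |
| S10 Family | 7→7 | 8→8 | 9→9 | 7→8 | 9 (img: 9) | 9 (img: 8) |
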
Summary: Final Results
Best Score: 9/10
Tests Run: 105+
Rounds: 14
Cost/Site (Pro): ~$0.30
Gen Time: ~2 min
Images in Bank: 38
Winning Configuration (R13)

| Parameter | Value | Why |
|---|---|---|
| Instructions | Skeleton + niche rules | HTML skeleton ensures all sections; niche rules add polish |
| Images | Curated image bank (38) | Solved the #1 bottleneck: avg 4-8 → consistent 9 |
| Design | Niche direction | Navy/Gold + DM Serif Display/Inter |
| Layout | 4 creativity rules | Asymmetry, scale contrast, variety, atmosphere |
| Copy | Niche-specific | Outcome headlines, proper CTAs, trust signals |
| Prompt | Markdown | XML didn't improve results. Markdown is simpler and faster. |
| Model | gemini-3.1-pro | 9.0/10. Flash scores 8.8 at roughly 12x (input) to 25x (output) lower token prices |
| Temp | 0.5 | Beats 0.3 and 0.7 |
| Passes | Single | 9/10 on the first try. Pass 2 only when the score is below 9 |

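
For anyone wiring these settings into a generation script, the winning configuration can be captured as a small immutable object. The sketch below is one possible shape, not the project's actual code; the class and field names are hypothetical, while the values come from the table and from the R14 description.

```python
from dataclasses import asdict, dataclass, replace

@dataclass(frozen=True)
class GenerationConfig:
    """Hypothetical container for the settings listed in the Winning Configuration table."""
    model: str = "gemini-3.1-pro"
    temperature: float = 0.5
    prompt_format: str = "markdown"
    layout_rule_count: int = 4
    image_bank_size: int = 38
    target_score: int = 9  # Pass 2 runs only when Pass 1 scores below this

R13 = GenerationConfig()

# R14 changed only the prompt format and temperature relative to R13.
R14 = replace(R13, temperature=1.0, prompt_format="xml")

print(asdict(R13))
print(asdict(R14))
```

Freezing the dataclass keeps a locked configuration from being mutated mid-run, and `replace` makes each experimental variant an explicit diff against the baseline.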
Key Learnings
1. Images are everything. AI image bank (Nano Banana Pro, $0.01/image) eliminated broken/irrelevant images.
2. Fewer rules beat more. R11 (4 rules, 60 words) = 9/10. R12 (14 rules, 200 words) = 7-9/10.
3. Niche direction > generic freedom. Palette + fonts + copy style adds polish without penalty.
4. XML and temp 1.0 are hype. R14 matched or underperformed R13.
5. Feedback loop has diminishing returns. Great for 8→9, risky at 9 (can regress).
6. Skeleton templates work. Pre-defined sections ensure completeness without limiting creativity.