AI Geolocation Benchmark

GeoGuessBench

Can AI models tell where in the world they are?

We gave 15 vision models street-level imagery from 100 locations around the world (four views per location) and asked them to pinpoint each one. Here's how they did.

100 locations · 15 models · 1,360 predictions · avg score 4,196/5000

Leaderboard

Model rankings across all 100 locations, scored on the GeoGuessr scale (0–5000 points). Country is the share of correct country guesses, Confidence is the average stated confidence (0–100), ECE is Expected Calibration Error (lower is better), Cost is the estimated total API spend, and Speed is the average response time.

| # | Model | Tier | Avg Score | Avg Dist | Country | Confidence | ECE | Cost | Speed |
|---|-------|------|-----------|----------|---------|------------|-----|------|-------|
| 1 | Gemini 3 Pro | Flagship | 4,771 | 134 km | 95% | 98 | 0.012 | $0.44 | 49s |
| 2 | Gemini 3 Flash | Budget | 4,757 | 121 km | 94% | 94 | 0.043 | $0.07 | 10.9s |
| 3 | o3 | Reasoning | 4,671 | 143 km | 93% | 69 | 0.315 | $0.29 | 10.4s |
| 4 | o4-mini | Reasoning | 4,659 | 166 km | 85% | 75 | 0.253 | $0.17 | 5.1s |
| 5 | Gemini 2.5 Pro | Flagship | 4,514 | 371 km | 90% | 100 | 0.075 | $0.18 | 16s |
| 6 | Gemini 2.5 Flash | Budget | 4,497 | 446 km | 88% | 97 | 0.069 | $0.02 | 12.9s |
| 7 | Claude Opus 4.6 | Flagship | 4,461 | 420 km | 84% | 69 | 0.354 | $3.66 | 7.7s |
| 8 | GPT-4o | Mid-tier | 4,275 | 698 km | 81% | 87 | 0.220 | $0.84 | 5.4s |
| 9 | GPT-4.1 | Flagship | 4,192 | 687 km | 78% | 90 | 0.184 | $0.68 | 5.8s |
| 10 | Grok 4 | Flagship | 4,040 | 781 km | 74% | 86 | 0.222 | $1.21 | 145.1s |
| 11 | GPT-5.2 | Flagship | 4,036 | 945 km | 72% | 62 | 0.396 | $0.54 | 3s |
| 12 | Grok 2 Vision | Mid-tier | 3,861 | 952 km | 72% | 86 | 0.198 | $0.68 | 2s |
| 13 | GPT-4o Mini | Budget | 3,843 | 989 km | 66% | 86 | 0.224 | $1.53 | 3.6s |
| 14 | Claude Sonnet 4.5 | Mid-tier | 3,739 | 1,238 km | 67% | 80 | 0.272 | $0.80 | 11.7s |
| 15 | Claude Haiku 4.5 | Budget | 3,304 | 1,675 km | 53% | 74 | 0.296 | $0.20 | 4.1s |

Example Predictions

Select a location to see where each model placed its pin. Hover over a prediction to read the model's reasoning.

[Street View panel: four cardinal views (North, East, South, West) near the Eiffel Tower, Paris, with the true location marked]

Near the Eiffel Tower, Paris

France · easy

48.8584, 2.2945

| Model | Score | Distance | Country | Confidence |
|-------|-------|----------|---------|------------|
| Claude Haiku 4.5 | 5000 | 0 km | France | 92 |
| Claude Sonnet 4.5 | 5000 | 0 km | France | 100 |
| Claude Opus 4.6 | 5000 | 0 km | France | 99 |
| GPT-4o Mini | 5000 | 0 km | France | 95 |
| Gemini 2.5 Flash | 5000 | 0 km | France | 100 |
| GPT-4o | 5000 | 0 km | France | 100 |
| GPT-4.1 | 5000 | 0 km | France | 100 |
| o4-mini | 5000 | 0 km | France | 95 |
| Grok 4 | 5000 | 0 km | France | 100 |
| GPT-5.2 | 5000 | 0 km | France | 96 |
| o3 | 5000 | 0 km | France | 99 |
| Gemini 3 Flash | 5000 | 0 km | France | 100 |
| Gemini 3 Pro | 5000 | 0 km | France | 100 |
| Grok 2 Vision | 5000 | 0 km | France | 95 |
| Gemini 2.5 Pro | 5000 | 0 km | France | 100 |

Can You Beat the AI?

Study the street-level imagery and place your pin on the map. See how your guess compares against each model.

[Interactive panel: four street-level views (North, East, South, West); click the map to place your guess]

How to Play

  1. Study the 4 street-level images above
  2. Click on the map to place your pin
  3. Click “Submit Guess” to see results
  4. Compare your score with the AI models!
Scoring: 5000 = perfect, ~4500 at 150 km, ~2500 at 1,000 km. How close can you get?

Performance Over Time

How geolocation ability has improved as models evolve. Each point represents a model plotted by its release date.


Cost vs. Accuracy

Is paying 25x more for a flagship model worth it? We compare total cost against performance across all locations.

Total Cost vs. Average Score


Speed vs. Accuracy

Cost Efficiency: Score Per Dollar

Which model gives you the most geolocation bang for your buck? Taking average score over total cost, Gemini 2.5 Flash (4,497 points for $0.02) and Gemini 3 Flash (4,757 for $0.07) dominate at roughly 225,000 and 68,000 points per dollar, versus about 1,200 for Claude Opus 4.6 (4,461 for $3.66).

Confidence Calibration

Do models know what they don't know? We analyze whether stated confidence matches actual performance.

“Models claimed 84% average confidence and were accurate 87% of the time.”

Overall, models are reasonably well-calibrated — their confidence roughly matches reality (84% claimed vs 87% actual). But individual models vary: some are overconfident, others underconfident. The ECE metric below quantifies this gap.

Confidence vs. Accuracy

Points above the diagonal are underconfident. Points below are overconfident.

Calibration Ranking (ECE)

Expected Calibration Error — lower is better. Measures how well confidence matches performance.
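
As a concrete reference, here is a minimal sketch of how an ECE of this form is typically computed. The bin count (10 equal-width bins) and the definition of a "correct" prediction (e.g., right country, or score above some threshold) are assumptions; this page doesn't spell out the exact choices used.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE over equal-width confidence bins: the gap between stated
    confidence and empirical accuracy, weighted by bin population.

    confidences: stated confidences scaled to [0, 1]
    correct: 1.0 if the prediction counted as accurate, else 0.0
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(confidences[in_bin].mean() - correct[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by fraction of samples in bin
    return ece
```

Each bin's confidence-accuracy gap is weighted by the fraction of predictions that fall in it, so well-populated bins dominate the final number.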

Calibration Curves (Top 4 Models)

Detailed reliability diagrams showing stated confidence (x) vs empirical accuracy (y) across confidence bins. The diagonal represents perfect calibration.

[Reliability diagrams for Gemini 3 Pro, Gemini 3 Flash, o3, and o4-mini, each shown against the perfect-calibration diagonal]

Difficulty and Scaling

How much does model size matter? We analyze performance across difficulty tiers and model families.

Score by Difficulty Tier

Model × Location Heatmap

Every model's score on every location. Hover for details. Columns sorted by difficulty (green = easy, amber = medium, red = hard).

[Heatmap: 15 models × 100 locations, colored by score from 0 to 5000; blank cells indicate no data]

Within-Family Scaling

Does moving from budget to flagship improve scores linearly?

Anthropic: Claude Haiku 4.5 (3,304) → Claude Sonnet 4.5 (3,739) → Claude Opus 4.6 (4,461). Claude Haiku 4.5 → Claude Opus 4.6: +1,157 pts.

OpenAI: GPT-4o Mini (3,843) → GPT-4o (4,275) → GPT-4.1 (4,192) → GPT-5.2 (4,036) → o4-mini (4,659) → o3 (4,671). GPT-4o Mini → o3: +828 pts.

Google: Gemini 2.5 Flash (4,497) → Gemini 3 Flash (4,757) → Gemini 2.5 Pro (4,514) → Gemini 3 Pro (4,771). Gemini 2.5 Flash → Gemini 3 Pro: +274 pts.

xAI: Grok 2 Vision (3,861) → Grok 4 (4,040). Grok 2 Vision → Grok 4: +179 pts.

Performance by Continent

Average score across all models by continent.

Methodology

How the benchmark works, what we measured, and the limitations to keep in mind.

Image Source

All images are pulled from the Google Street View Static API at 800×600 resolution. For each location, we capture 4 views at 90° intervals (North, East, South, West), giving models a full panoramic perspective comparable to what a human player would see in GeoGuessr.
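
A minimal sketch of what such a capture might look like. The endpoint and the size/location/heading/key parameters are the public Street View Static API's; the file naming is an invention for illustration, and fov/pitch are left at API defaults since the pipeline's actual settings aren't stated.

```python
import requests

STREETVIEW_URL = "https://maps.googleapis.com/maps/api/streetview"
HEADINGS = {"north": 0, "east": 90, "south": 180, "west": 270}

def fetch_views(lat, lng, api_key, out_prefix="view"):
    """Download the four 800x600 cardinal views for one location."""
    for name, heading in HEADINGS.items():
        resp = requests.get(
            STREETVIEW_URL,
            params={
                "size": "800x600",
                "location": f"{lat},{lng}",
                "heading": heading,  # 0/90/180/270 = N/E/S/W
                "key": api_key,
            },
            timeout=30,
        )
        resp.raise_for_status()
        with open(f"{out_prefix}_{name}.jpg", "wb") as f:
            f.write(resp.content)
```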

Location Curation

We curated locations spanning 6 continents with a deliberate difficulty distribution: ~15% easy (iconic landmarks, distinctive scripts), ~35% medium (readable signage, recognizable architecture), and ~50% hard (unmarked rural roads, ambiguous towns, Southern Hemisphere lookalikes). Hard locations are designed to challenge even expert human players.

Evaluation Protocol

Each model receives the same system prompt asking for a JSON response with latitude, longitude, confidence (0–100), a country guess, and chain-of-thought reasoning. We use temperature 0 for reproducibility (greedy decoding, though provider APIs are not perfectly deterministic in practice), following standard benchmark convention (MMLU, HumanEval, etc.). OpenAI o-series reasoning models require temperature 1 per their API constraints. For reference, GeoBench uses temperature 0.4. Each location is evaluated once per model; a more rigorous protocol would average several trials at a small positive temperature, but three trials would triple the cost for marginal improvement.
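
For illustration, a reply in the shape described above might be validated like this. The exact field names and range checks are assumptions, not the benchmark's published schema; a raised exception corresponds to the parse failures discussed under Limitations.

```python
import json

# Illustrative reply with the fields described above (field names assumed).
EXAMPLE_REPLY = """
{
  "latitude": 48.8584,
  "longitude": 2.2945,
  "confidence": 95,
  "country": "France",
  "reasoning": "Haussmann facades, French signage, Eiffel Tower visible."
}
"""

def parse_prediction(raw: str) -> dict:
    """Parse a model's JSON reply, raising on malformed output so the
    run can be logged as a parse failure rather than scored."""
    pred = json.loads(raw)
    if not (-90 <= pred["latitude"] <= 90 and -180 <= pred["longitude"] <= 180):
        raise ValueError("coordinates out of range")
    if not (0 <= pred["confidence"] <= 100):
        raise ValueError("confidence out of range")
    return pred
```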

Scoring

We use the GeoGuessr formula: 5000 × e^(−d/1492.7), where d is the distance in km. This gives 5000 for a perfect guess, ~4500 at 150 km, ~3000 at 750 km, and ~950 at 2,500 km. We also compute Expected Calibration Error (ECE) to assess confidence calibration.
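
In code, scoring is a one-liner once you have a distance. The scoring function below transcribes the formula above; the haversine great-circle distance is an assumption about how d is measured, since the page doesn't name the distance metric.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between guess and true location."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def geoguessr_score(distance_km):
    """5000 * e^(-d / 1492.7), the scale described above."""
    return 5000 * math.exp(-distance_km / 1492.7)

# geoguessr_score(0) == 5000; geoguessr_score(150) ≈ 4522;
# geoguessr_score(750) ≈ 3024; geoguessr_score(2500) ≈ 940
```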

Models Tested

15 vision models from 4 providers: Anthropic (Claude Haiku 4.5, Sonnet 4.5, Opus 4.6), OpenAI (GPT-4o Mini, GPT-4o, GPT-4.1, GPT-5.2, o3, o4-mini), Google (Gemini 2.5 Flash, 2.5 Pro, 3 Flash, 3 Pro), and xAI (Grok 2 Vision, Grok 4). A key dimension is within-family scaling: budget vs. mid vs. flagship from each provider.

What Sets This Apart

Compared to related benchmarks like GeoBench, GeoGuessBench adds: (1) full panoramic mode with 4 cardinal views, (2) confidence calibration analysis with ECE scoring, (3) within-family scaling analysis across model tiers, (4) cost efficiency benchmarking, and (5) an interactive human baseline where visitors can test themselves.

Limitations

Google Street View coverage is biased toward developed countries and urban areas. Our 100-location sample is illustrative but not exhaustive. Model performance may vary across runs due to API non-determinism, particularly for reasoning models at temperature 1. Costs are approximate based on token counts. Reasoning models (o3, o4-mini) and thinking models (Gemini 3 Pro) had higher response parse failure rates — failed responses are excluded from scoring, which means their averages are computed over fewer locations. Some models may have seen Street View imagery in training data.

Built with Next.js, Tailwind CSS, and Recharts.
Evaluation pipeline in Python with support for Anthropic, OpenAI, Google, and xAI APIs.

View on GitHub →