I’ve been experimenting with GPT-4V, Claude, and Gemini and noticed something strange:

They can describe art. Solve riddles. Explain GPTs.
But ask: “How many pencils are on the table?”
Or “Which object is left of the cup?”
And they fall apart.

So I built a benchmark to test that specifically:

What is VisQuant?

  • 100 synthetic images
  • 40+ everyday object types
  • Labeled object counts and spatial layout
  • 2 reasoning Q&A pairs per image
  • Grounded annotations in JSON and CSV
  • Baseline tested on GPT-4V
  • Entirely open-source
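
If you just want to poke at the annotations, the dataset loads straight from the Hub. A minimal sketch, assuming a `train` split; the real split and column names live on the dataset card, so inspect a record before relying on them:

```python
# Minimal loading sketch. The split name ("train") is an assumption --
# check the dataset card for the actual splits and column names.
from datasets import load_dataset

ds = load_dataset("Anas-Mohiuddin-Syed/VisQuant", split="train")
print(ds)            # features and row count
print(ds[0].keys())  # inspect the actual columns before writing eval code
```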

What It Tests
VisQuant isolates the visual reasoning primitives that models tend to get wrong:

  • Counting
  • Spatial relationships
  • Left/right/stacked inference
  • Multi-hop VQA from structured scenes
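
Because the scenes are structured, you can score by question type instead of reporting one blended accuracy number. A rough sketch of that scoring; the field names (`question_type`, `answer`, `prediction`) are illustrative placeholders, not the released schema:

```python
from collections import defaultdict

def score_by_type(records):
    """Exact-match accuracy per question type (e.g. counting vs. spatial).

    Each record is assumed to look like:
        {"question_type": "counting", "answer": "3", "prediction": "4"}
    These field names are placeholders -- adapt them to the real schema.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        qtype = r["question_type"]
        total[qtype] += 1
        if r["prediction"].strip().lower() == r["answer"].strip().lower():
            correct[qtype] += 1
    return {t: correct[t] / total[t] for t in total}

# Example: one counting miss, one spatial hit
print(score_by_type([
    {"question_type": "counting", "answer": "3", "prediction": "4"},
    {"question_type": "spatial",  "answer": "left", "prediction": "left"},
]))
# {'counting': 0.0, 'spatial': 1.0}
```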

Why?
Because big benchmarks like VQAv2 and GQA are noisy and blend so many skills together that counting and spatial failures get averaged away.
VisQuant is small, clean, focused — and it exposes real gaps in model reasoning.

Get It:
🗃️ Dataset (HuggingFace): https://huggingface.co/datasets/Anas-Mohiuddin-Syed/VisQuant

📜 Paper: arXiv preprint incoming

📂 License: CC BY 4.0 — free for research + fine-tuning

Would love:

  • Feedback
  • Collabs
  • Benchmarks from others (Claude, Gemini, etc.) using the query sketch below
  • Ideas for v2
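
For anyone who wants to contribute Claude or Gemini numbers (or re-run GPT-4V), the query side is small. A hedged sketch using the OpenAI Python SDK; the model name is a placeholder, and Claude/Gemini need their own SDKs but take the same image-plus-question payload:

```python
import base64
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def ask_about_image(image_path: str, question: str, model: str = "gpt-4o") -> str:
    """Send one image + one benchmark question to a vision-capable model.

    The model name is an assumption -- swap in whichever vision model you
    have access to.
    """
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

# Example (hypothetical file name):
# print(ask_about_image("scene_001.png", "How many pencils are on the table?"))
```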