I’ve been experimenting with GPT-4V, Claude, and Gemini and realized something strange:
They can describe art. Solve riddles. Explain GPTs.
But ask: “How many pencils are on the table?”
Or “Which object is left of the cup?”
And they fall apart.
So I built a benchmark to test that specifically:
What is VisQuant?
- 100 synthetic images
- 40+ everyday object types
- Labeled object counts and spatial layout
- 2 reasoning Q&A pairs per image
- Grounded annotations in JSON and CSV (loading sketch after this list)
- Baseline tested on GPT-4V
- Entirely open-source
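Since the dataset lives on the Hub, the Hugging Face `datasets` library should pick it up directly. A minimal loading sketch, assuming the default configuration; the split and column names printed here are not guaranteed, so check the dataset card rather than trusting any field names:

```python
# Minimal loading sketch. Inspect the printed splits/columns yourself;
# the access pattern in the trailing comments is a hypothetical schema,
# not the dataset's confirmed one.
from datasets import load_dataset

ds = load_dataset("Anas-Mohiuddin-Syed/VisQuant")
print(ds)  # shows the available splits and column names

# Hypothetical access pattern once you know the schema:
# example = ds["train"][0]
# print(example["objects"])    # e.g. [{"label": "pencil", "count": 3}]
# print(example["questions"])  # the two reasoning Q&A pairs per image
```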
What It Tests
VisQuant isolates the visual intelligence primitives that models often skip over (a minimal scoring sketch follows this list):
- Counting
- Spatial relationships
- Left/right/stacked inference
- Multi-hop VQA from structured scenes
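If you want to run your own model against these primitives, scoring can stay very simple. The sketch below assumes gold answers are stored as short strings (e.g. "3", "spoon"), which may not match the final annotation schema exactly, and the normalization rules are a design choice you can tighten:

```python
# Minimal scoring sketch for the counting / spatial primitives. The answer
# format (plain strings like "3" or "spoon") is an assumption, not a
# confirmed schema.
import re

def parse_count(answer: str):
    """Pull the first integer out of a free-form model answer."""
    match = re.search(r"\d+", answer)
    return int(match.group()) if match else None

def score_counting(pred: str, gold: str) -> bool:
    """Counting scored as exact match on the extracted integer."""
    return parse_count(pred) == parse_count(gold)

def _normalize(text: str) -> str:
    """Lowercase, trim whitespace, and drop a leading article."""
    return re.sub(r"^(the|a|an)\s+", "", text.strip().lower())

def score_spatial(pred: str, gold: str) -> bool:
    """Spatial answers scored as normalized string match."""
    return _normalize(pred) == _normalize(gold)

# Hypothetical model outputs vs. gold labels:
print(score_counting("There are 3 pencils on the table.", "3"))  # True
print(score_spatial("the spoon", "spoon"))                       # True
```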
Why?
Because current benchmarks like VQAv2 and GQA are large and noisy, these specific weaknesses get buried in aggregate accuracy.
VisQuant is small, clean, focused — and it exposes real gaps in model reasoning.
Get It:
🗃️ Dataset (HuggingFace): https://huggingface.co/datasets/Anas-Mohiuddin-Syed/VisQuant
📜 Paper: arXiv preprint incoming
📂 License: CC BY 4.0 — free for research + fine-tuning
Would love:
- Feedback
- Collabs
- Benchmarks from others (Claude, Gemini, etc.)
- Ideas for v2