This is a simplified guide to an AI model called Grounding-Dino, maintained by Adirik. If you like this kind of analysis, you should join AImodels.fyi or follow us on Twitter.
Model overview
grounding-dino is an AI model that can detect arbitrary objects in images from human text inputs such as category names or referring expressions. It combines a Transformer-based detector called DINO with grounded pre-training to achieve open-vocabulary, text-guided object detection. The model was developed by IDEA Research and is available as a Cog model on Replicate.
Similar models include GroundingDINO, another implementation of the Grounding DINO approach, as well as other vision models on Replicate such as stable-diffusion (text-to-image generation) and text-extract-ocr (optical character recognition).
Model inputs and outputs
grounding-dino takes an image and a comma-separated list of text queries describing the objects you want to detect, and outputs the detected objects with bounding boxes and predicted labels. You can also adjust separate confidence thresholds for the box and label predictions.
Inputs
- image: The input image to query
- query: Comma-separated text queries describing the objects to detect
- box_threshold: Confidence threshold for keeping a detected bounding box
- text_threshold: Confidence threshold for assigning a predicted label to a box
- show_visualisation: Whether to draw the detected bounding boxes on the image
Outputs
- Detected objects with bounding boxes and predicted labels
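
As a rough illustration, here is how you might call the model through the Replicate Python client. The model reference, example file name, and threshold values are assumptions for the sketch, not confirmed defaults:

```python
# A minimal sketch using the Replicate Python client (pip install replicate).
# Assumes REPLICATE_API_TOKEN is set in your environment; the model reference
# and threshold values below are illustrative, not confirmed defaults.
import replicate

output = replicate.run(
    "adirik/grounding-dino",  # assumed model reference on Replicate
    input={
        "image": open("street.jpg", "rb"),      # the input image to query
        "query": "car, person, traffic light",  # comma-separated object queries
        "box_threshold": 0.25,                  # assumed cutoff for detection boxes
        "text_threshold": 0.25,                 # assumed cutoff for predicted labels
        "show_visualisation": True,             # draw boxes on the returned image
    },
)

# The exact output structure may vary; printing it shows the detected
# objects with their bounding boxes and predicted labels.
print(output)
```

In general, lowering box_threshold and text_threshold returns more (but noisier) detections, while raising them keeps only high-confidence matches.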
Capabilities
grounding-dino can detect a wide variety of objects described in free-form text, rather than being limited to a fixed set of training categories.