I'm working on a personal project where I want to generate custom emoji-style images from text prompts — like turning this:
Flying pig
→ 🐖 with wings
(see cover image!)
I'm using black-forest-labs/FLUX.1-dev as the base model. It's a rectified-flow transformer text-to-image model, similar in spirit to Stable Diffusion but much larger (~12B parameters), so VRAM is already a concern for me.
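Plain inference already works for me with diffusers, roughly like this (the prompt and sampler settings are just what I've been playing with, nothing official):

```python
import torch
from diffusers import FluxPipeline

# Load the base model in bf16; CPU offload keeps peak VRAM manageable on one consumer GPU
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()

# The kind of prompt I eventually want the fine-tune to specialize in
image = pipe(
    "a cute flying pig emoji with small white wings, flat vector style, plain background",
    height=512,
    width=512,
    guidance_scale=3.5,
    num_inference_steps=28,
).images[0]
image.save("flying_pig.png")
```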
What I have:
- ~25k 512x512 emoji-style images
- Captions for each (in .txt files)
- A train.json mapping image to caption
dataset/
├── images/image_001.png,...
├── captions/caption_001.txt,...
└── train.json # [{ "image": "images/image_001.png", "caption": "captions/caption_001.txt" }, ...]
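For reference, this is roughly how I load the pairs right now (a minimal sketch; EmojiDataset and the normalization choice are my own, not from any library):

```python
import json
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms


class EmojiDataset(Dataset):
    """Reads train.json and returns image tensors plus their caption strings."""

    def __init__(self, root="dataset"):
        self.root = Path(root)
        with open(self.root / "train.json") as f:
            self.entries = json.load(f)
        # Images are already 512x512, so just convert to tensors scaled to [-1, 1]
        self.transform = transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize([0.5], [0.5]),
        ])

    def __len__(self):
        return len(self.entries)

    def __getitem__(self, idx):
        entry = self.entries[idx]
        image = Image.open(self.root / entry["image"]).convert("RGB")
        caption = (self.root / entry["caption"]).read_text().strip()
        return {"pixel_values": self.transform(image), "caption": caption}
```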
What I need help with:
- How many images is “enough”? Is 25k too much or just fine?
- Any working training script for FLUX.1?
- I tried one (PyTorch + diffusers), but the outputs look like pure noise (skeleton of my setup is below this list).
- Best training config?
- Should I freeze VAE/text encoder?
- Recommended batch size, LR, etc.?
- How do I export the model to ONNX or TFLite?
Planning to use it in a Flutter app later.
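To make the script and freeze questions concrete, here is the skeleton of the split I set up before getting the noisy results, using diffusers + peft. The LoRA rank, target modules, and learning rate are guesses on my part, and the actual flow-matching loss / training loop is omitted because that's exactly the part I'm unsure about:

```python
import torch
from diffusers import AutoencoderKL, FluxTransformer2DModel
from peft import LoraConfig
from transformers import CLIPTextModel, T5EncoderModel

model_id = "black-forest-labs/FLUX.1-dev"

# Frozen pieces: the VAE and both text encoders stay fixed during training
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.bfloat16)
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder", torch_dtype=torch.bfloat16)
text_encoder_2 = T5EncoderModel.from_pretrained(model_id, subfolder="text_encoder_2", torch_dtype=torch.bfloat16)
for m in (vae, text_encoder, text_encoder_2):
    m.requires_grad_(False)

# The transformer itself is also frozen; only the injected LoRA weights train
transformer = FluxTransformer2DModel.from_pretrained(
    model_id, subfolder="transformer", torch_dtype=torch.bfloat16
)
transformer.requires_grad_(False)
lora_config = LoraConfig(
    r=16,                      # rank is a guess, no idea what works for emoji style
    lora_alpha=16,
    init_lora_weights="gaussian",
    target_modules=["to_k", "to_q", "to_v", "to_out.0"],  # attention projections
)
transformer.add_adapter(lora_config)

trainable_params = [p for p in transformer.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable_params, lr=1e-4)  # batch size / LR are what I'm asking about
```

Everything after this (encoding captions with both tokenizers, VAE-encoding the images, sampling timesteps, and computing the flow-matching loss on the transformer output) is where I suspect my noise problem comes from, so a known-good script or config would really help.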
A sample setup or any beginner-friendly advice would be much appreciated, since I'm just getting started.