I'm working on a personal project where I want to generate custom emoji-style images from text prompts — like turning this:

Flying pig → 🐖 with wings
(see cover image!)

I'm using black-forest-labs/FLUX.1-dev as the base model. It's a text-to-image model in the same spirit as Stable Diffusion, but it's a ~12B-parameter rectified flow transformer, so it's actually heavier on VRAM than SD, not lighter.
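For context, plain inference with the base model works fine for me. This is roughly how I'm running it (recent diffusers with FluxPipeline; CPU offload because it doesn't fit my VRAM otherwise, and the prompt/steps are just what I've been testing with):

```python
# Sanity-check inference with the base model before any training.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # offload submodules to CPU when idle

image = pipe(
    "a cute emoji-style flying pig with small white wings, flat colors",
    height=512,
    width=512,
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("flying_pig.png")
```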


What I have:

  • ~25k 512x512 emoji-style images
  • Captions for each (in .txt files)
  • A train.json mapping image to caption
dataset/
├── images/      # image_001.png, image_002.png, ...
├── captions/    # caption_001.txt, caption_002.txt, ...
└── train.json   # [{"image": "images/image_001.png", "caption": "captions/caption_001.txt"}, ...]
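For reference, this is the minimal PyTorch Dataset I'm using to read that layout. It's just a sketch: it assumes each train.json entry has an "image" path and a "caption" path pointing at a .txt file, and that images should be normalized to [-1, 1] before going through the VAE.

```python
import json
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms


class EmojiDataset(Dataset):
    """Reads train.json and returns image tensors plus their caption strings."""

    def __init__(self, root="dataset", index_file="train.json", size=512):
        self.root = Path(root)
        self.entries = json.loads((self.root / index_file).read_text())
        self.transform = transforms.Compose([
            transforms.Resize(size),
            transforms.CenterCrop(size),
            transforms.ToTensor(),
            transforms.Normalize([0.5], [0.5]),  # scale to [-1, 1] for the VAE
        ])

    def __len__(self):
        return len(self.entries)

    def __getitem__(self, idx):
        entry = self.entries[idx]
        image = Image.open(self.root / entry["image"]).convert("RGB")
        caption = (self.root / entry["caption"]).read_text().strip()
        return {"pixel_values": self.transform(image), "caption": caption}
```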

What I need help with:

  1. How many images is “enough”? Is 25k too much or just fine?
  2. Any working training script for FLUX.1?
    • I tried one (PyTorch + diffusers), but the outputs look like noise (my rough LoRA setup is sketched after this list).
  3. Best training config?
    • Should I freeze VAE/text encoder?
    • Recommended batch size, learning rate, etc.?
  4. How do I export the model to ONNX or TFLite? (my partial export attempt is further down)
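For #2 and #3, my current understanding is that most FLUX fine-tunes don't do full fine-tuning at all: they freeze the VAE and both text encoders and train LoRA adapters on the transformer only, with the actual flow-matching loop taken from the diffusers train_dreambooth_lora_flux.py example. Is this the right direction? The sketch below is what I have so far (diffusers + peft); the rank, learning rate, and batch size are just starting-point guesses, not tuned values.

```python
# LoRA setup sketch: freeze VAE + both text encoders, adapters on the transformer.
import torch
from diffusers import FluxPipeline
from peft import LoraConfig

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)

# Freeze everything that shouldn't be trained.
pipe.vae.requires_grad_(False)
pipe.text_encoder.requires_grad_(False)    # CLIP
pipe.text_encoder_2.requires_grad_(False)  # T5
pipe.transformer.requires_grad_(False)

# LoRA adapters on the attention projections of the transformer.
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    init_lora_weights="gaussian",
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
)
pipe.transformer.add_adapter(lora_config)

trainable_params = [p for p in pipe.transformer.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable_params, lr=1e-4, weight_decay=1e-2)
# Plan: effective batch size ~8 via gradient accumulation, a few thousand steps.
```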

Planning to use it in a Flutter app later.
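On #4, the furthest I've gotten is exporting a single component (the FLUX VAE decoder) with torch.onnx.export, as below. I doubt the full ~12B transformer can realistically run on-device via ONNX/TFLite, so maybe the answer is to serve generation from a backend and only call it from Flutter, but I'd like to hear from people who've tried.

```python
# Partial export attempt: just the VAE decoder to ONNX.
import torch
from diffusers import AutoencoderKL


class VaeDecoder(torch.nn.Module):
    def __init__(self, vae):
        super().__init__()
        self.vae = vae

    def forward(self, latents):
        return self.vae.decode(latents).sample


vae = AutoencoderKL.from_pretrained(
    "black-forest-labs/FLUX.1-dev", subfolder="vae", torch_dtype=torch.float32
)
vae.eval()

# FLUX VAE latents for a 512x512 image: 16 channels, 64x64 spatial.
dummy_latents = torch.randn(1, 16, 64, 64)

torch.onnx.export(
    VaeDecoder(vae),
    dummy_latents,
    "flux_vae_decoder.onnx",
    input_names=["latents"],
    output_names=["image"],
    opset_version=17,
)
```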
A sample setup or any beginner-friendly advice on getting started would be really appreciated.