ai: Best Local Image Models July 2025 and Comfy Missing Manual Addendum

Well, I’ve been keeping track of the latest models in Ollama. This is pretty easy because they publish a list sorted by newest, and we just rotate off old models, trying to keep the total on the machine to 600GB (wow, that’s a lot!). The newest-first sorting makes it convenient to churn things off. For example, right now the latest models are:

  1. mistral-small3.2:24b
  2. gemma3n:e2b and e4b, which are models focused on laptops, tablets, and phones
  3. magistral, the reasoning model from Mistral
  4. devstral, the coding agent model from Mistral
  5. qwen2.5vl, their latest vision model
  6. phi4-reasoning and phi4-mini-reasoning
  7. qwen3, which is winning lots of benchmarks and does tool calling and thinking

But right now, the ComfyUI models alone are nearly 300GB, and I’ve not churned off a single one yet. So here is what we have, plus a list of the newer ones to try. Their README at least tells you what model support each release added:

  1. v0.3.43: Flux Kontext and Omnigen 2
  2. v0.3.42: Cosmos Predict 2
  3. v0.3.39: HiDream

They have a blog that talks about new models, and these posts include workflow files, which is nice, but we are in a strange world where models that are 60 days old are ancient. I was trying to get all this done with a script, but it is so complicated that it’s easier to just use template loading instead, since it does automatic model downloads.

ComfyUI Manual Update

This is an addendum to the base Comfy Missing Manual, with updates for a few changes:

  1. They have a cool feature where the EXIF data in a video or image carries the workflow, so if you drag an image into ComfyUI, it will pop up the JSON workflow that created it. That is awesome because you know exactly how it was made.
  2. Their template feature has arrived, so most of the time it is Workflow > Template and you can get all the models, and it should install plugins too. It doesn’t work all the time though.
  3. There are no standards at all, so different templates work differently, and the loader doesn’t know how to relate the various files together. I wrote a script to download split files from Hugging Face (with their CLI) so you can manage them, but the template load doesn’t do any of that.
  4. So if you want something that works well, it is a bit of a painful translation to “huggingface-cli” and noting where the files go (see the sketch after this list).
  5. The net of all the testing below is that Google Imagen and Flux.1 Kontext are really great models, whether hosted and very fast (10 seconds) or local and private (700 seconds on an M4 Max).
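As a concrete example of item 4, here is a minimal sketch of the kind of download step the template loader skips, using the huggingface_hub Python API instead of the CLI. The repo ids and most file names are placeholders for illustration (t5xxl_fp16.safetensors is one that comes up below); substitute whatever the model’s tutorial lists and point local_dir at the matching ComfyUI models subfolder (folder names vary a bit between ComfyUI versions):

# Minimal sketch: pull individual model files from Hugging Face into the
# folders a ComfyUI template expects. Repo ids and most file names below are
# placeholders -- use the ones from the model's tutorial page.
from huggingface_hub import hf_hub_download

downloads = [
    # (repo_id,                       filename,                      ComfyUI models subfolder)
    ("some-org/some-diffusion-model", "diffusion_model.safetensors", "ComfyUI/models/diffusion_models"),
    ("some-org/some-text-encoder",    "t5xxl_fp16.safetensors",      "ComfyUI/models/text_encoders"),
    ("some-org/some-vae",             "vae.safetensors",             "ComfyUI/models/vae"),
]

for repo_id, filename, target_dir in downloads:
    path = hf_hub_download(repo_id=repo_id, filename=filename, local_dir=target_dir)
    print(f"downloaded {filename} -> {path}")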

ComfyUI integration with OpenWebUI via MCP Server

One thing that is a pain is that the base Open WebUI service has a fixed workflow, so you can’t switch models, but with the new ComfyUI MCP Server you have much more flexibility because the client can ask for a specific model. You can install it nicely with `uvx comfy-ui-mcp-server` (the MCP configuration is shown further down).

Comfy: Generate with Sonnet and then test Prompts on LMArena.ai

If you are trying to figure out the best model overall for your needs, then this is a handy way to test the best ones. Here was a test prompt, and most models got the spelling wrong (the usual problem). Note that this prompt needed some tuning, but the main thing about prompts, as usual, is to pick words that are going to get the “attention” of the system, so fewer words are good. Tuning to get exact words out is hard, but a good approach is to write down the words you want and then put them through an LLM, which has seen millions of prompts, to help you:

Write prompt for AI text-to-image model. Make sure words are clear and correct:

Medieval castle on top of the hill
A long road leads to the gate
Three towers inside the castle
A flag on each tower

Each flag has a different word on it:
- Data
- Scale
- Brand

Style: dark, ominous, futuristic, Raining

So I fed this to Sonnet 4 and some other LLMs, and I got:

A dark, ominous medieval castle perched dramatically on top of a steep hill, rendered in a futuristic gothic style. Heavy rain falls from stormy black clouds, creating a moody atmospheric scene. A long, winding cobblestone road snakes up the hillside leading to the imposing castle gate. 

The castle features three prominent towers rising from within the fortress walls. Each tower is topped with a large, clearly visible flag billowing in the storm wind. 

**CRITICAL TEXT REQUIREMENTS:**
- Left tower flag: exactly "DATA" in bold capital letters
- Center tower flag: exactly "SCALE" in bold capital letters  
- Right tower flag: exactly "BRAND" in bold capital letters

Text must be:
- Perfectly spelled and legible
- High contrast against flag background
- Large, bold sans-serif font
- Each word centered on its respective flag
- No additional text or symbols

Art style: Dark cyberpunk-medieval fusion, ominous atmosphere, dramatic lighting with occasional lightning illuminating the scene. Color palette dominated by deep blacks, dark grays, and muted blues. Rain effects with visible droplets and mist. Cinematic composition with the castle as the focal point against the turbulent sky.

Technical details: High contrast, detailed architecture blending medieval stonework with subtle futuristic elements, moody lighting, photorealistic rain effects, 4K quality, dramatic perspective emphasizing the castle's imposing presence.
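For what it is worth, this expansion step is just one API call if you want to script it. Here is a rough sketch with the Anthropic Python SDK; the model id is an assumption, so swap in whichever Sonnet version you have access to:

# Rough sketch: expand a bare-bones image prompt with an LLM before sending it
# to a text-to-image model. The model id is an assumption -- use your own.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

bare_prompt = """Write prompt for AI text-to-image model. Make sure words are clear and correct:

Medieval castle on top of the hill
A long road leads to the gate
Three towers inside the castle
A flag on each tower

Each flag has a different word on it:
- Data
- Scale
- Brand

Style: dark, ominous, futuristic, Raining"""

message = client.messages.create(
    model="claude-sonnet-4-20250514",  # assumption: pick the Sonnet model you use
    max_tokens=1024,
    messages=[{"role": "user", "content": bare_prompt}],
)
print(message.content[0].text)  # the expanded prompt to paste into LMArena or ComfyUI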

When I just hit the arena with my original prompt over and over, most closed-source models were not accurate at all. But even with my bare-bones prompt, these models got it right, so they are good ones to try:

  1. Imagen-4.0-ultra-generate-preview-06-06
  2. Flux-1-Kontext-Max
  3. GPT-Image-1
  4. Photon
  5. Imagen-4.0-generate-preview-06-06
  6. Flux.1 Kontext Dev worked on my machine (the only model that did)

With the original prompt augmented by Sonnet, these models also worked:

  1. Ideogram-v3-quality (spelled right but didn’t get the tower flags right)
  2. Seedream-3
  3. Cosmos-Predict2 2B did this perfectly locally in 204 seconds! (Sadly, OmniGen didn’t work well)

As another experiment, I tried making a slide:

Write prompt for AI text-to-image model. Make sure every word are spelled correctly:

Presentation slide, 16x9 size

Title: Overall Mission
Left Side bullets:
- Enterprise AI
- Safe AI
- Reliable AI

Right Side image:
- Castle on hill with three towers
- Each tower has a flag
- Flags have a single word
- Words are: "Safe", "Reliable", "Fast"

You will soon see some of the tricks, but this is what Claude Sonnet produced. The Gemini result wasn’t as good or as detailed, nor was the one from Claude Code, which was interesting; the direct API call gave a more detailed prompt:

A clean, professional presentation slide in 16:9 aspect ratio with a white or light background. Corporate business style layout.

**EXACT TEXT REQUIREMENTS - MUST BE SPELLED PERFECTLY:**

Title at top: "Overall Mission" in large, bold, dark text

Left side bullet points in a clear, readable font:
• Enterprise AI
• Safe AI  
• Reliable AI

The right side contains a photo-real image of a medieval castle positioned on a hill. Dark, rainy, and stormy night, futuristic. The castle has three distinct towers. Each tower displays a flag with a single word in bold, legible text:
- First tower flag: "Safe"
- Second tower flag: "Reliable" 
- Third tower flag: "Fast"

**CRITICAL TEXT SPECIFICATIONS:**
- All text must be perfectly spelled with no errors
- High contrast text (dark on light background)
- Professional sans-serif font
- Flag text should be large enough to read clearly
- Each word centered on its respective flag
- No additional text, symbols, or decorative elements on flags

Style: Clean corporate presentation design, professional business aesthetic, simple and clear layout, high readability, suitable for boardroom presentation.

Technical requirements: Sharp text rendering, high contrast, 16:9 aspect ratio, presentation slide format, clean typography throughout.

If you say “illustration”, then you get a clip-art-looking slide with:

  1. imagen-4.0-generate-preview-06-06
  2. recraft-v3
  3. Flux-1.1-pro
  4. Ideogram-v3-quality (the most perfect as an icon-looking thing)
  5. Flux.1 Kontext. This worked with the longer prompt perfectly in 700 seconds!
  6. Cosmos-Predict2 2B. Works great with this enhanced prompt but is super slow at 2400 seconds

If you say “photo real”, then you get a bunch more models with more detailed photography:

Flux-1-kontext-pro (cost: $0.04/image). You can also call it from Comfy or mulmocast, or as an MCP server via fal.ai, or with `uvx comfy-ui-mcp-server` in package.json.

Flux-1-kontext-max (cost: $0.08/image; the same $0.08 via fal.ai) is also available the same way:

{
  "mcpServers": {
    "fal-flux-kontext-max": {
      "command": "npx",
      "args": [
        "-y",
        "https://github.com/PierrunoYT/fal-flux-kontext-max-mcp-server.git"
      ],
      "env": {
        "FAL_KEY": "your-fal-api-key-here"
      }
    }
  }
}

For local models, here is how to connect to a local ComfyUI:

"mcpServers": {
  "comfy-ui-mcp-server": {
    "command": "uvx",
    "args": [
      "comfy-ui-mcp-server"
    ]
  }
}

Recraft-v3 (this now appears correctly and looks really good! $10/month gets 1,000 credits at 2 credits per image, so roughly $0.02/image), but via fal.ai.

imagen-4.0-ultra-generate-preview-06-06 (looks very good, $0.06/image). Note that it only accepts 480 tokens as input. You can do a bunch for free with Google AI Studio, or call it as an MCP server via fal.ai (but this currently doesn’t work), or via Replicate through an MCP server:

{
  "mcpServers": {
    "replicate-imagen4": {
      "command": "npx",
      "args": ["-y", "https://github.com/PierrunoYT/replicate-imagen4-mcp-server.git"],
      "env": {
        "REPLICATE_API_TOKEN": "your_token_here"
      }
    }
  }
}

OmniGen2: No template yet, but tutorial does text decently sometimes

OmniGen2. A 7B multimodal model from VectorSpaceLab. This does text-to-image, instructed editing, multi-image composition, and text in images. The tutorial is nice for this 7B parameter model. It is based on Qwen 2.5 VL 3B plus a 4B diffusion transformer. It is so new that there is nothing in the template files, so you have to do the manual downloads, or you can copy the image in the tutorial and you will get the template, which is pretty nice.

The main claim to fame here is that it should generate clear text content in images (which, when you are doing slides, is the biggest issue), but I found that it didn’t do a great job. It generates a 1Kx1K image in 433 seconds, so about the same as nVidia Cosmos-Predict2. A simple sign works and takes 370 seconds.

With the right prompt it does text well, but it definitely misspells. Sigh.

Flux.1 Kontext: float8_e4m3fn fails on Mac but full FP16 flux1-kontext-dev.safetensors is perfect!

Flux.1 Kontext. I’ve loved the Flux.1 Dev model, and this is supposed to be even better. The easiest way to install it is to go to Workflow > Browse Templates > Flux > Flux.1 Kontext Dev, but then you have to make sure you delete the models. The main difference compared with most models is that it takes a bunch of editing commands like “transform to 1960s pop art style”, “add ascii single word ‘in’ no additional letters”, “remove all humans”, “remove an apple from face”, “use an elegant style to make a portrait of a swan”.

It is also very good for slide presentations, with prompts like “create cartoon illustration with ‘WEEKEND’ and ‘YOGART TIME’ on yellow background with happy characters”, “rotate camera 360”, or “put two cute character images together”. And you can do this multi-round, so the output can be fed back in with something like “change lighting to sunrise”. There are a few workflows. The grouped flow doesn’t seem to work on a Mac; there is an “MPS” error. The main problem is that the new FP8 type, specifically float8_e4m3fn, is not supported by the MPS back end, so it’s a little stuck. But the new edit button is pretty cool, so you don’t need to set up workflows anymore.

However, this error is tied to the FP8 model: the tutorial uses flux1-dev-kontext_fp8_scaled.safetensors, but you can bypass the error by using t5xxl_fp16.safetensors as the text encoder and replacing the FP8 diffusion model with the full flux1-kontext-dev.safetensors. This takes 701 seconds for a 1Kx1K image since the model is twice as big. Note that I made a terrible mistake here because the older flux1-dev.safetensors looks similar as a file name, so I wasted half a day wondering why the images were crap.

The main problem is that the tutorial workflow is for the iterative version where you keep creating from the last output, but there is no documentation on how to bootstrap with the first image. There is an Empty SD3 Latent Image node, but no note on how to use it.

This model is pretty incredible. I didn’t even need to use the more flowery prompt that was generated; my short one worked.

nVidia Cosmos-Predict2 Image (2B does text well if simple, 14B is perfect)

Cosmos-Predict2. From nVidia. The tutorial shows there is a template for it, and doing the template download does cause the text encoder and the diffusion model to get loaded. This is a 2B parameter model, and there is a 14B model as well. It supports text-to-image and image-to-video. Text to image works fine from the template load and automatic download.

It takes 179 seconds to make a very high-resolution 1Kx1K image. In another run it took 412 seconds, so it’s not clear what made the difference. If you ask it to just make a sign, it is great, but mixing text into more complex scenes is hard. I think I’ve not mastered text prompting yet.

It does not do a good job on complex slides with 2B (lots of misspellings, even with prompts where OmniGen nearly works).

The 14B model did the slides perfectly, including the castle, although it is slower at 2461 seconds for a 1Kx1K image, which is a nearly unusable 41 minutes. But if it is this good, you can let it run overnight.

nVidia Cosmos-Predict2 Video (2B and 14B)

There is a separate workflow for Image to Video that generates 480p 15fps videos from a start and end image. They call this Video2World, and it is 98 minutes for 5 seconds of video, so expensive but not crazy out of bounds compared to, say, Hunyuan, which takes over a day to generate the same. In another test, a 480p video generated five frames in 446 seconds, or about 89 seconds per frame.

Which is why the typical workflow is hierarchical (a rough code sketch follows the list):

  1. Take a story and cut it into a set of scene prompts
  2. Generate a first key frame with the prompt of the first scene
  3. Generate a “low-resolution” video that matches that frame, say 480p or 720p
  4. Take the last frame and repeat the above with the instructions for the next scene
  5. Make sure the whole thing holds together
  6. Use outpainting to generate 4K per frame
  7. Stitch the whole thing together
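Here is the rough sketch promised above. Every helper function (generate_keyframe, generate_clip, last_frame, upscale_frames, stitch) is a hypothetical stand-in for whatever tool you use at that step (Cosmos Predict2 or WAN VACE for the clips, an outpainting or upscale pass, ffmpeg for stitching, and so on); this only shows the shape of the loop:

# Shape of the hierarchical video workflow. All helpers are hypothetical
# placeholders for your actual image/video pipelines.
def make_video(scene_prompts, low_res="480p"):
    clips = []
    # Step 2: the first key frame comes from the first scene prompt
    frame = generate_keyframe(scene_prompts[0])
    for prompt in scene_prompts:
        # Step 3: generate a low-resolution clip that starts from the current frame
        clip = generate_clip(start_frame=frame, prompt=prompt, resolution=low_res)
        clips.append(clip)
        # Step 4: the last frame of this clip seeds the next scene
        frame = last_frame(clip)
    # Steps 5-6: check coherence, then outpaint/upscale each clip to 4K
    upscaled = [upscale_frames(clip, target="4K") for clip in clips]
    # Step 7: stitch everything into the final video
    return stitch(upscaled)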

Note that this model does not handle text well, so if you, say, have three flags that say “Data”, “Scale” and “Cache”, it will get the letters and names wrong.

WAN2.1 VACE (video works from tutorial but not the template unless LORA set to 0.5)

WAN2.1-VACE. Alibaba Tongyi’s multimodal VACE model: 14B for 720p, but you only need 1.3B for 480p. It is in Workflow > Template > Video > WAN VACE [Text to Video | Reference to Video | Control Video | Outpainting | First-Last Frame | Inpainting], which includes the model’s Text to Video (this uses a LORA so that on an RTX 4090 it is only 4 minutes to make 81 frames vs 40 minutes), Reference to Video (that is, match the style of the reference image), Control Video (control generation with input videos and reference images), Outpainting (extend video by expanding the frame size), First-to-Last Frame (do smooth transitions), and Inpainting (edit video to remove objects).

In actual usage, I found that the template model doesn’t seem to work; I just get fuzz. Even using the model that is part of the tutorial, it doesn’t work with LORA strength set to 0.7, and the note says output can be fuzzy, so you have to adjust the LORA adapter strength between 0.3 and 0.7 (kind of a pain that the actual example given doesn’t work). The difference seems to be the injection of a LORA adapter in the template; the note says to vary the model strength between 0.3 and 0.7, and I found 0.5 works. It produces 5 frames in 389 seconds at 720p, which is not bad: 78 seconds per frame.

For the record, the vace-t2v flow works out of the box on a Mac, but not the vide_wan_vace_14B_t2v one. The non-LORA version generates 640x640x49 frames in 2815 seconds, so about a minute a frame. Not bad.

Ace Step (Audio Works Well!)

Ace-Step (works well!). The first audio model from ACE Studio and StepFun, which is so cool to run locally. Look in Workflow > Template > Audio > ACE-Step v1 [Text to Instrumental Music | Text to Song | M2M Editing], where you can generate instrumentals, songs with vocals, and finally change existing songs in style and lyrics with Music to Music Editing. This uses a relatively small 3.5B parameter model. It takes 102 seconds on an M4 Max to generate a 3-minute song. And the Music to Music editing is very cool at 92 seconds to transfer to a new song style and lyrics!

These are the models I last loaded in May and haven’t played with enough:

Wan 2.1 (Video blurry, obsolete because of VACE?)

Wan2.1. (This comes out unusably blurry when LORA is set to 0.5, which sometimes happens and you have to reboot, but it is fast at 21s/frame for 480p.) It does look like VACE is the later model, as this is only 1.3B.

From the Alibaba WAN team, a 1.3B model. Choose from Workflow > Template > Video > WAN 2.1 [Text to Video | Image to Video | Inpainting | ControlNet | FLF2V 720p F16]. FLF2V generates a 720p video where you give it the first and last frame to create coherent videos.

WAN 2.1 Fun is the other variant, with Fun Camera in 1.3B and 14B sizes.

HiDream-I1 Full (Image model only without text, FP16 is fast)

HiDream-I1 (not working well; images come out blurry, though sometimes you just need to reboot). From HiDream-ai, with Full at 50 steps and Dev distilled to 28 steps, these are text-to-image models (which hopefully work better than Flux.1 Kontext with its FP8 problem). At 1200 seconds for a 1000×1000 image, it doesn’t have that great a quality or speed compared with Cosmos-Predict2. The images have a dream-like quality, which could be because of using the FP8 model, but it is confusing as the examples look great.

A test between a Q5 GGUF model and the full FP8 shows that FP8 is slower but has noticeably better image quality. Also, you have to hand-download things from Hugging Face, so our install-comfyui.sh is a better way to load them.

In the end though, the full FP16 model is fast and produces great results. It takes 193 seconds for a 1Kx1K image, but it doesn’t handle text at all.

Hunyuan 3D 2.0 (3D from image, haven’t tried yet)

Hunyuan3D 2.0 makes 3D models (a niche thing), which I’ve not used.

Hunyuan Image2Video (GGUF Loaders back to standard and hard to prompt)

I had used these extensively with GGUF a few months ago and got some good results, but this month the GGUF loaders are no longer there. I don’t quite know what happened, but the automatic loading didn’t seem to work, so I tried to manually install the GGUF Loader and that failed. I then tried deleting it, and after a few tries it worked!

Hunyuan Image2Video (works, but not fast, and on one run I got garbage). March 2025. This one works very well; it is limited to 120 frames and, while reasonably quick per frame, it does take hours and hours to generate 720p x 121 frames. I’ve had some great experiences with it.

I did learn a lot about the GGUF loaders: basically, the sync doesn’t work well, and it puts them into ./ComfyUI/custom_nodes. This used to be ./ComfyUI/models/custom_nodes, but I copied them over. I did see that I was using a hacked version of the GGUF loader, so I had to delete all of them, and then the install worked (this is really a repo clone).

So basically, with this strategy you only keep the latest model from each vendor and deprecate the rest. But here are the older models that I didn’t try:

  1. LTXV-Video. This does both text-to-video and image-to-video in a 2B parameter model
  2. Lumina Image 2.0. Feb 2025
  3. LTXV. Lightricks video generation. November 2024

Flux.1 Dev (Works nicely, but hard to control)

Flux.1 Dev (works well). From Black Forest Labs, November 2024. This is my go-to for image generation; the Dev version works very well. I do find the Schnell version is pretty inferior, so it is time to clean some old models out! The main problem is that the workflows I’ve been using are broken because Schedule selection is no longer a part of Image Saver.

So I switched to “Flux.1 Dev Full Text to Image” instead, which is the fat 16-bit model. The GGUF is the smaller one.

Backing up: What are all these Nodes?

Well, it’s time to learn what all these components do, so now on to the fundamentals of image and video models.

CLIP

This was done around the time of dinosaurs by OpenAI and stands for Contrastive Language-Image Pretraining. The idea is pretty simple: you take two models, one for text and the other for images, and train them so their output vectors match when the two inputs describe the same concept. If you have a picture of a “frog”, then its output vector should match the output vector of the text “frog”. And in this world, the vision model is typically a vision transformer and the text model is a plain old LLM.
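As a minimal sketch of that idea, here is the Hugging Face transformers version of CLIP scoring one local image against a few candidate captions (the image path is a placeholder):

# Minimal CLIP sketch: embed one image and several captions, then see which
# caption lands closest to the image in the shared embedding space.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

captions = ["a photo of a frog", "a photo of a castle", "a photo of a dog"]
image = Image.open("frog.png")  # placeholder: any local image

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax makes them probabilities
probs = outputs.logits_per_image.softmax(dim=1)
for caption, p in zip(captions, probs[0].tolist()):
    print(f"{p:.2f}  {caption}")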
