ai: ComfyUI Missing Mac Manual v1.3

OK, my previous post on Open WebUI was getting a little long, and while it covers ComfyUI integration, the bigger topic of actually using ComfyUI is scattered to the winds. So if you just want to get ComfyUI to work, here is the missing manual for that. Some of this duplicates the base setup, but it then drills down deeper.

Installation

As I said before, the easiest thing to do is to just use ComfyUI Desktop. This is a DMG that you download, and it updates itself too!

Connection to Open WebUI

As previously documented, right now you can only import a single workflow (exported as an API workflow), and you have to fill in which nodes and widgets hold specific variables like height and width. This means carefully paging through the JSON looking for the “inputs” sections that contain these.
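If you are hunting for those, a little jq can save some scrolling. This is a hedged sketch that assumes the usual API-export layout (an object mapping node id to class_type and inputs) and a file named workflow_api.json:

# list each node's id, class type, and the names of its inputs so you can spot width and height
jq -r 'to_entries[] | "\(.key)\t\(.value.class_type)\t\(.value.inputs | keys | join(","))"' workflow_api.json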

Where does ComfyUI live?

Like Open WebUI, this is a bit of a mystery, but basically the default is ~/Documents/ComfyUI, and that is where the models, the user configuration, and so on live.
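If you want to sanity-check your install, the layout looks roughly like this (the exact folders vary a bit by version):

ls ~/Documents/ComfyUI
# custom_nodes  models  output  user  ...
ls ~/Documents/ComfyUI/models
# checkpoints  clip  clip_vision  loras  text_encoders  unet  vae  ...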

Downloading with Huggingface-cli

This is the first model that we tried, and there is much talk online about how to make it work.

One thing you want to make sure to do is brew install huggingface-cli and use it to download things. It also caches downloads, which is very nice so you don't fetch the same file multiple times. The Hugging Face site is really good about letting you copy the org/repo name and the relative paths within the repos, since they use Git and Git LFS to store and version everything:

huggingface-cli download <org>/<repo> <path in repo> --local-dir ~/Documents/ComfyUI/
# this preserves the file's relative path within the repo under --local-dir
huggingface-cli download calcuis/hunyuan-gguf hunyuan-video-t2v-720p-q4_0.gguf --local-dir ~/Documents/ComfyUI/models/unet
# some repos nest files under a split_files/ prefix, which gets preserved too and is inconvenient, so pay attention to where things land

Hunyuan Text to Video: GGUF and FP16

This is the hottest open-source model from Tencent, and the trick is that for GGUF installations you need:

  1. A workflow.json that says how the pieces are tied together. This is normally on Hugging Face, so huggingface-cli download <org>/<repo> workflow.json --local-dir ~/Documents/ComfyUI/user/default/workflows will put it with the default workflows (see the sketch after this list)
  2. Then you go to ComfyUI and open that workflow
  3. At the upper right, click the Manager button and then “Install Missing Custom Nodes”. If things go right, this will load the right nodes for you, and then you restart
  4. Now you download the various models into the right places, such as ./models/unet, ./models/vae, and so on, as the workflow expects
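Putting those steps together, the downloads look something like this. The angle brackets are placeholders for whatever repo and file names the model card actually lists, and the destination folders are the ones the workflow expects:

# the workflow JSON goes with the default workflows so ComfyUI can open it
huggingface-cli download <org>/<repo> <workflow>.json --local-dir ~/Documents/ComfyUI/user/default/workflows
# the GGUF unet, the VAE, and the text encoders each go into their own models/ folder
huggingface-cli download <org>/<repo> <unet>.gguf --local-dir ~/Documents/ComfyUI/models/unet
huggingface-cli download <org>/<repo> <vae>.safetensors --local-dir ~/Documents/ComfyUI/models/vae
huggingface-cli download <org>/<repo> <text_encoder>.safetensors --local-dir ~/Documents/ComfyUI/models/clip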

You can also get this directly from ComfyUI, and the only pain here is that they use the directory split_files, which doesn't work well with --local-dir, and it has a different workflow.json, so these files need to end up under ./models.
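One hedged workaround for that: stage the download in a scratch directory and move the files into the real model folders yourself (the angle brackets are placeholders again):

# --local-dir preserves the split_files/ prefix, so download somewhere temporary and move it
huggingface-cli download <org>/<repo> split_files/vae/<vae>.safetensors --local-dir /tmp/hunyuan
mv /tmp/hunyuan/split_files/vae/<vae>.safetensors ~/Documents/ComfyUI/models/vae/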

How to write a Text to Video Prompt

This is a bit of a mystery to me, and many recommend just using a model like DeepSeek to generate the prompt (it has been trained on lots of prompts). This explanation of what you need is good and this video is a little helpful, but the basic idea is that you want things in a specific order (which makes sense; that is probably how the training videos were tagged):

  1. You want 100-300 words. Most of these descriptions are very wordy, and I have to experiment with cutting them down. I don't think it needs noise words, and a simple YAML-like format is probably better, in this order…
  2. Subject. “A sleek electric car”; note that you should include size, color, and other adjectives to guide the system.
  3. Scene. Describes the general environment, like “minimalist modern apartment”.
  4. Action or Motion. What you want the subject to do, like “gracefully dancing through falling leaves”. Again, the more adjectives, the more guidance.
  5. Camera Work. That is, how the shot moves, like “dramatic circular pan”.
  6. Mood. What the tone is, like “energetic vibrant mood”.
  7. Lighting. Define things like “soft, warm sunlight filtering through trees”.
  8. Shot composition. This tells it what kind of shot, like “wide landscape shot emphasizing scale”.

Using Ollama Node to Generate prompts

I’m not sure how this node is prompted, but you can download a node that uses Ollama to generate the prompt. I’m not clear on what model it uses, what the system prompt is, or even how to print out the text, since it just flows into the next node.

Getting small models with GGUF

The default instructions use their FP8 model, but there is a Q4_0 model which works great. Note that the Q4_K_M does not work at all. The Q4 model halves the memory needed, but it is also slower, so it's a tradeoff. If you have enough memory, Q8 seems quite good, and BF16 is something you can try.
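Swapping quantizations is just a matter of downloading a different GGUF into models/unet. The q8_0 file name below is my guess at the repo's naming, following the q4_0 file above, so check the repo listing before copying it:

# assumed naming pattern based on the q4_0 file; verify the exact name on the repo page
huggingface-cli download calcuis/hunyuan-gguf hunyuan-video-t2v-720p-q8_0.gguf --local-dir ~/Documents/ComfyUI/models/unet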

Best Video Height and Width

At least with Flux.1, it likes to create images of the same size it was trained on, in this case 512×512, but Hunyuan is trained on 720p video (1280×720). So, just checking speeds, we need to find a natural size.

Performance M1 Max vs M4 Max

So I tried some samples of just generation. I did just 2 frames to get a sense, at ¼ resolution and at full resolution, and also with the original fp16 model (there is an fp8 available as well).

The Prompt used followed the format above:

happy family with mother father and kids in summer dress

a beautiful sand beach with clouds waves, seagulls, shells and drift wood

walking along the carefree with wind blowing through their hair

The camera shows faces in closeup then zooms out to see the entire beach and pans from their front to behind

Beautiful sun lit with sun high in the air

The memory used includes the system and everything else, not just the model. This was done on the second run of a single frame to warm the caches. The most interesting thing is that the Q4 model runs slower but uses only marginally less memory. Note that these differences are not a big deal for the Mac, but for PCs with only 24GB of VRAM they make a big difference, since the gap between q4 and bf16 is 13GB.

The overall M4 Max RAM numbers are higher by about 10GB; I suspect this is because we are measuring whole-system usage (the Brave browser was running too), but the trends are about right. It's interesting to see that the q8 model is very fast; it could be because a byte is a very natural size for the machine, I'm not sure. But if you don't care too much about memory, as on a Mac, q8 is a good way to go. We then tried ¼ and 1/16 size, and you can see that for thumbnails it is amazingly fast at 15s per frame, though of course this is more like a cartoon than a realistic image, and you can see why a common trick is to render a small video and then upscale it.

Note this is at 20 diffusion steps; moving to 40 steps essentially doubled the compute time and didn't add much quality, so 20 steps is not a bad place to start. It took 37 seconds for 20 steps at q8 for a 640×360 image but 69 seconds with 40 steps.

Then the test of doing a 5-second video (120 frames) shows how long the batch process can last, and that storing the additional frames really pushes memory, which is why image-to-video is often used, so you can create longer videos in batches. Also note that the 64GB machine's memory use doesn't increase as much, which probably indicates we are hitting limits there, so there is less RAM available for caching.

Note that with the M4 Max I was getting hangs at various frame counts, so I'm starting from 12 frames and seeing what's going on. You can look at the logs at ~/Documents/ComfyUI/comfyui_8000.log, but they show nothing unusual; activity just drops to zero. There are some bugs: I've tried restarting ComfyUI, and next up is a full reboot. I also got rid of some nodes that I was playing with but don't use, which didn't help. I'm not quite sure what is going on, but I do get an “aten::upsample_nearest3d.vec” warning, which means that op falls back to the CPU.
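To watch what the server is doing while a long run grinds along, you can follow that log from a terminal (adjust the path if your install differs):

# follow the ComfyUI server log during a render
tail -f ~/Documents/ComfyUI/comfyui_8000.log
# or just pull the warnings and errors out of the last run
grep -iE "error|warn|fallback" ~/Documents/ComfyUI/comfyui_8000.log | tail -20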

On the second run on the M4 Max, I got a “typecheck: failed to fetch” error, which means out of memory (this seems like a bug).

Time (sec)/Memory                        | M1 Max 64GB                         | M4 Max 128GB
q4, 640×360, 20 steps                    | 137s/35GB                           | 58s/48GB
q4_k_m, 640×360, 20 steps                | No image                            | No image
q8, 640×360, 20 steps                    | 96s/41GB                            | 37s/54GB
bf16, 640×360, 20 steps                  | 121s/48GB                           | 62s/58GB
q8, 320×180, 20 steps                    | 34s/39GB                            | 15s/56GB
q8, 1280×720, 20 steps                   | 419s/41GB                           | 201s/59GB
q8, 640×360, 40 steps                    | 193s/39GB                           | 69s/57GB
q8, 640×360, 20 steps, 12 frames         | N/A                                 | 180s/55GB (15s/frame)
q8, 640×360, 20 steps, 25 frames         | N/A                                 | Hang ×2 (55GB, 61GB), then 731s/43GB (25s/frame)
q8, 640×360, 20 steps, 121 frames (5s)   | 8348s or 2:19 hrs/52GB (66s/frame)  | Hang/57GB, then 3300s or 55 min/57GB (29s/frame)
q8, 640×360, 20 steps, 361 frames (15s)  | N/A                                 | Overnight/81GB

M4 Max is 2x M1 Max

So the performance difference in four years is kind of incredible. Some of it might be due to the larger RAM available, but even for smaller configurations, this holds. If memory is a real issue, then q4 with its slower speed is the way to go. Basically, without much optimization you get roughly 120 frames/hour (or about 3 seconds of video per hour), so a long movie won't work, and even a short 30-second clip is going to need 10 hours of grinding.
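To get a feel for these numbers, here is a trivial back-of-the-envelope calculator. The 29 seconds/frame figure comes from the q8 640×360 runs on the M4 Max above, and the 24 fps is implied by the 121 frames = 5 seconds row, so swap in your own measurements:

#!/bin/sh
# rough render-time estimate: seconds of video in, hours of grinding out
CLIP_SECONDS=${1:-15}  # clip length you want
FPS=24                 # implied by 121 frames for 5 seconds
SEC_PER_FRAME=29       # measured q8 640x360 rate on the M4 Max
FRAMES=$((CLIP_SECONDS * FPS + 1))
TOTAL=$((FRAMES * SEC_PER_FRAME))
echo "$FRAMES frames, roughly $((TOTAL / 3600)) hours $(((TOTAL % 3600) / 60)) minutes"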

This is why most folks use image-to-video or a cloud service, but it is great for short clips. And it's practical to do locally because single-image generation times are so fast.

Conclusion: Q8 at 640×360 is fastest at 15s/frame, but it can hang or fail, so for long runs restart everything

A good compromise is q8 at 20 steps, because it's faster and the quality is decent, and 640×360 seems to generate decent images. I am seeing that for longer runs ComfyUI crashes; looking at the logs in ~/Documents/ComfyUI/comfyui_8000.log, I see it load the model and then hang.

Stability problems at 360 frames

I had a bunch of these, and a 360-frame run takes 4 hours on an M4 Max and 8 hours on an M1 Max. Since most videos are actually a series of quick cuts, it makes more sense to build something as a series of cuts. So you might want to learn how to do image-to-video so that you can match the clips up one by one.
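Once you have a handful of short cuts converted to regular video files (see the WebP section below), stitching them together is easy with ffmpeg's concat demuxer. A sketch, assuming the clips share the same codec, resolution, and frame rate:

# list the clips in order, then concatenate them without re-encoding
cat > cuts.txt <<EOF
file 'cut1.mp4'
file 'cut2.mp4'
file 'cut3.mp4'
EOF
ffmpeg -f concat -safe 0 -i cuts.txt -c copy full_ad.mp4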

What the heck is WebP? GIF Remixed

This is the format that the default Hunyuan workflow outputs. It's basically like an animated GIF: a sequence of images compressed with VP8/9. But it is not accepted by YouTube or other systems, so you need to convert it to a normal video format to use it.
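One way to convert is with ffmpeg, assuming your build can decode animated WebP (support is spotty in older builds; if you only get the first frame, update ffmpeg or switch the workflow to a video output node). The input name is just whatever landed in your output folder:

# convert the animated WebP into an H.264 MP4 that YouTube and friends accept
ffmpeg -i ~/Documents/ComfyUI/output/<clip>.webp -c:v libx264 -pix_fmt yuv420p output.mp4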

Real Usage: Generate 1 frame, then 5 seconds and then 15 seconds

Now, the main thing is how long a video you really want. Most commercials on YouTube today are 15 seconds, but the most critical part is the first 5 seconds (voices.com). So you can easily generate an ad-length clip by first generating a single frame quickly to make sure it looks OK, then restarting your machine and spending about an hour on the critical 5 seconds. Once that is done, you run it for three hours (say, an afternoon) to get the full 15 seconds.

If you really want the traditional 30-second ad, then you can leave it overnight (10 hours) to generate, but I would focus on the 15-second version to be more impactful. So here is the first image:

Single Image generated in 37 seconds for proofing

Then you can generate a few more frames and adjust the prompts.

Optimizing even more with LoRA and other tunings

There are lots of other pipelines to try, and I'm a little confused about what works where, but there is more research to be done. Basically, images are easy and video is still hard. You just have to learn some terms and try it.

Image to Video, Video + Prompt to Video

This same model supports more than text-to-video, and image-to-video is next.

Trying Janus Pro

This is the latest hotness from DeepSeek, and there is already a Janus Pro node.
