OK, my previous post on Open WebUI was getting a little long, and while it covers ComfyUI integration, the bigger topic of actually using ComfyUI is scattered to the winds. So if you just want to get ComfyUI working, here’s the missing manual for that. Some of this duplicates the base setup, but it’s a deeper drill-down.
Installation
As I said before, the easiest thing to do is just use Comfy Desktop. This is a DMG download, and it updates itself automatically too!
Connection to Open WebUI
Also as previously documented, right now you can only import a single workflow (exported as an API workflow), and you have to fill in which nodes and widgets hold specific variables like height and width. This means carefully paging through the JSON looking for the “inputs” sections that contain these.
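For reference, an API-format export is just a big JSON object of nodes, and the bits you are hunting for look roughly like this (the node id and values are illustrative; EmptyLatentImage is the node that typically owns width and height):
"5": {
  "class_type": "EmptyLatentImage",
  "inputs": { "width": 512, "height": 512, "batch_size": 1 }
}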
Where does ComfyUI live?
Like Open WebUI, this is a bit of a mystery, but basically the default is ~/Documents/ComfyUI, and that is where the models, the user configuration and so on live.
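A quick tour of what lives there (these are the pieces the rest of this post refers to; the exact contents vary a bit by version):
ls ~/Documents/ComfyUI
# models/                   unet, vae, clip, etc. go under here
# custom_nodes/             extra nodes, installed by the Manager or symlinked by hand
# user/default/workflows/   saved workflow JSON files
# comfyui_8000.log          the server log referenced later in this post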
Downloading with Huggingface-cli or with Comfy Manager
This is the first model that we tried, and there is much talk about how to make it work. One thing you want to make sure to do is use the brew-installed huggingface-cli to download things. It also caches downloads, which is very nice so you don’t fetch the same files multiple times. The Hugging Face site is really good about letting you copy the org/repo and also the relative paths within the repos, since they use Git and Git LFS to store and version everything.
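For completeness, the install and the cache behavior look like this (the pip line is just an alternative if you don’t use Homebrew):
brew install huggingface-cli
# or: pip install -U "huggingface_hub[cli]"
# downloads are cached under ~/.cache/huggingface, so pulling the same file twice is free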
Downloading then looks like this:
huggingface-cli download <org>/<repo> <path in repo> --local-dir ~/Documents/ComfyUI/
# files keep their relative path from the repo under --local-dir
huggingface-cli download calcuis/hunyuan-gguf hunyuan-video-t2v-720p-q4_0.gguf --local-dir ~/Documents/ComfyUI/models/unet
# watch out: some repos keep their files under split_files/, which is inconvenient, so pay attention
The other, more automatic way to do this is that when you load a new workflow, you click on the Manager toolbar icon and select “Install Missing Custom Nodes”.
Hunyuan Text to Video: GGUF and FP16
This is the hottest open-source video model right now, from Tencent, and the trick is that for a GGUF installation you need:
- A workflow.json that says how the pieces are tied together. This is normally on Hugging Face, so
huggingface-cli download <org>/<repo> workflow.json --local-dir ~/Documents/ComfyUI/user/default/workflows
will put it straight into the default workflows folder.
- Then you go to ComfyUI and open that workflow.
- At the upper right, click on the Manager button and then “Install Missing Custom Nodes”. If things are right, this will load the right blocks for you, and then you restart.
- Now you download the various models into the right places, such as ./models/vae, ./models/unet and so on.
You can also get all of this directly from ComfyUI’s own repackaged repos. The only pain there is that they keep everything under a split_files directory, which doesn’t work well with --local-dir, and they have a different workflow.json, so you need to move the files into ./models yourself; a sketch of that is below.
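A minimal sketch of that dance, assuming one of the Comfy-Org repackaged repos (the repo and file names here are illustrative, so check the actual file listing on Hugging Face):
huggingface-cli download Comfy-Org/HunyuanVideo_repackaged split_files/vae/hunyuan_video_vae_bf16.safetensors --local-dir ~/Documents/ComfyUI/models
# --local-dir keeps the repo-relative path, so the file lands under models/split_files/vae/
mv ~/Documents/ComfyUI/models/split_files/vae/* ~/Documents/ComfyUI/models/vae/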
How to write a Text to Video Prompt
This is a bit of a mystery to me, and many recommend just using a model like DeepSeek to generate the prompt (it has been trained on lots of prompts), but this explanation of what you need is good and this video is a little helpful. The basic idea is that you want things in a specific order (which makes sense; that is probably how the training videos were tagged):
- You want 100-300 words. Most of these descriptions are very wordy and I have to experiment with cutting them down. I don’t think it needs noise words, and a simple YAML-ish format in this order is probably better (see the sketch after this list)…
- Subject. “A sleek electric car”; note that you should include size, color and other adjectives to guide the system.
- Scene. Describes the general environment, like “minimalist modern apartment”.
- Action or Motion. What you want the subject to do, like “gracefully dancing through falling leaves”. Again, the more adjectives, the more guidance.
- Camera Work. That is, how the shot moves, “dramatic circular pan”.
- Mood. What the tone is, “energetic vibrant mood”.
- Lighting. Define things like “soft, warm sunlight filtering through trees”.
- Shot composition. This tells it what kind of shot, “wide landscape shot emphasizing scale”.
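Putting that together, a prompt skeleton in roughly that order might look like this (purely an illustration of the structure, not a tested recipe):
Subject: a sleek silver electric car
Scene: a rain-soaked neon city street at night
Action: gliding slowly past reflective storefronts
Camera: dramatic circular pan ending in a low front shot
Mood: energetic, vibrant
Lighting: soft neon glow with wet reflections
Shot: wide landscape shot emphasizing scale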
Using Ollama Node to Generate prompts
I’m not sure how this node is prompted, but you can download a node which uses Ollama to generate the query. I’m not clear on what model it’s using or what the system prompt is, or even how to print out the text, as it just flows on to the next node.
Getting small models with GGUF
The default instructions use their FP8 model, but there is a Q4_0 model which works great. Note that Q4_K_M does not work at all. The Q4 model halves the memory needed but it is also slower, so it’s a tradeoff. If you have enough memory, Q8 seems quite good, and BF16 is something you can try.
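If you want to pull a different quantization from the GGUF repo used in the example above without hunting for exact file names, huggingface-cli can filter by pattern; a sketch, assuming the repo uses the usual q8_0 suffix in its file names:
huggingface-cli download calcuis/hunyuan-gguf --include "*q8_0*" --local-dir ~/Documents/ComfyUI/models/unet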
Best Video Height and Width for Hunyuan Video and Flux.1
At least with Flux.1, it likes to create images of the same size it was trained on, in this case 512×512, but Hunyuan is trained on 720p (1280×720). So beyond just checking speeds, we need to find its natural size. I’m still confused about this overall.
How Long Do I Have to Wait? Check the Terminal Logs
While some of the nodes like KSampler have a title bar that shows movement, the foolproof but very unobvious way to do this is to click on the tiny Terminal icon at the upper right of the toolbar. Then look at Logs; you should see something that looks like a progress bar. It will say something like 75% 15/20 [4:37:19<1:04:04, 764s/it]. This means you are 75% of the way there, on step 15 of 20 (this is normally set in the steps portion of the workflow). Then the cryptic part says the elapsed time is 4 hours, 37 minutes and 19 seconds, and after the less-than sign it says there is 1 hour, 4 minutes and 4 seconds to go. Finally, you are running at 764 seconds per iteration (that is, per step).
This is really handy if you are wondering how, say, a video generation job is going. Note that during generation it can look like it is failing, with no GPU usage at all, but this happens pretty often and is normal. Finally, click on the Terminal tab and you can see any error messages. For instance, this is where you figure out if Python has crashed.
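If you’d rather watch from a shell than click through the UI, you can also follow the log file directly (the file name here is the one referenced later in this post; adjust the path if your install differs):
tail -f ~/Documents/ComfyUI/comfyui_8000.log
# the same percent / steps / seconds-per-iteration progress lines should scroll by here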
Performance M1 Max vs M4 Max
So I tried some samples of just generation. I did just 2 frames to get a sense of things, at ¼ resolution and full resolution, and also with the original fp16 model (there is an fp8 available as well).
The Prompt used followed the format above:
happy family with mother father and kids in summer dress
a beautiful sand beach with clouds waves, seagulls, shells and drift wood
walking along the carefree with wind blowing through their hair
The camera shows faces in closeup then zooms out to see the entire beach and pans from their front to behind
Beautiful sun lit with sun high in the air
The memory used includes the system and everything else, not just the model. This was done on the second run of a single frame to warm the caches. The most interesting thing is that the Q4 model runs slower but uses only marginally less memory. Note that these differences are not a big deal for the Mac, but for PCs with only 24GB of VRAM they make a big difference, as the gap between q4 and bf16 is 13GB.
The overall M4 Max numbers in RAM use are higher by about 10GB; I suspect this is because we are measuring system usage with Brave and everything else running, but the trends are about right. It’s interesting to see that the q8 models are very fast, perhaps because a byte is a very natural size for the machine, I’m not sure; but if you don’t care too much about memory, like on a Mac, q8 is a good way to go. We also tried ¼ and 1/16 size, and you can see that for thumbnails it is amazingly fast at 15s per frame, though of course the result is more like a cartoon than a realistic image, and you can see why a common trick is to do a small video and then upscale it.
Note this is at 20 steps of diffusion; moving to 40 steps essentially doubled the compute time and didn’t add much quality, so 20 steps is not a bad place to start. It took 37 seconds for 20 steps at q8 for 640×360 images, but 69 seconds with 40 steps.
And then the test of doing a 5-second video (120 frames) shows how long a batch process can last, and that storing the additional frames really pushes the memory, which is why image-to-video is often used, so you can create longer videos in batches. Also note that the 64GB machine’s usage doesn’t increase as much, which probably indicates that we are reaching its limits, so there is less RAM left for caching.
Note that with the M4 Max I was getting hangs at various settings, so I’m starting from 12 frames and seeing what’s going on. You can look at the logs at ~/Documents/Comfy/comfy_8000.log but they show nothing unusual; activity just drops to zero. There are some bugs; I’ve tried a ComfyUI restart and next up is a full reboot. I also got rid of some nodes that I wasn’t using but had been playing with, but this didn’t help. I’m not quite sure what is going on, but I do get “aten::upsample_nearest3d.vec”, which runs on the CPU only.
On the second run on the M4 Max I got a “typecheck: failed to fetch”, which means out of memory (seems like a bug). The main thing is that if you pick a more aggressive quantization then you have more memory headroom. It’s pretty clear that I’m running out of memory, which is why long runs of 120 frames on the M1 Max 64GB fail a lot, so you might accept heavier quantization and slower speed just so you can get the video at all. If you are not memory constrained then q8 seems very good. And there is now a blessed fp8 model that I haven’t tried that might be even better.
Time (sec) / Memory | M1 Max 64GB | M4 Max 128GB |
q4, 640×360, 20 steps | 137s / 35GB | 58s / 48GB |
q4_k_m, 640×360, 20 steps | No image, broken model | No image |
q8, 640×360, 20 steps | 96s / 41GB | 37s / 54GB |
bf16, 640×360, 20 steps | 121s / 48GB | 62s / 58GB |
q8, 320×180, 20 steps | 34s / 39GB | 15s / 56GB |
q8, 1280×720, 20 steps | 419s / 41GB | 201s / 59GB |
q8, 640×360, 40 steps | 193s / 39GB | 69s / 57GB |
q8, 640×360, 20 steps, 12 frames | N/A | 180s / 55GB (15s/frame) |
q8, 640×360, 20 steps, 25 frames | N/A | Hung twice (55GB, 61GB), then 731s / 43GB (25s/frame) |
q8, 640×360, 20 steps, 121 frames (5 sec) | 8348s (2:19 hr) / 52GB (66s/frame) | Hung once (57GB), then 3300s (55 min) / 57GB (29s/frame) |
q8, 640×360, 20 steps, 361 frames (15 sec) | N/A | Overnight / 81GB, many fails and hangs, not to spec |
M4 Max is 2x M1 Max
So the performance difference in four years is kind of incredible. Some might be due to the larger RAM available, but even for smaller configurations this holds. If memory is a real issue, then q4 with slower speed is the way to go. Basically, without much optimization you get 120 frames/hour (or about 3 seconds of video per hour), so a long movie won’t work, but even a short 30-second clip is going to need 10 hours of grinding.
This is why most folks use image-to-video or a cloud service, but it is great for short clips. And it’s practical to do single images locally because the generation times are so fast.
Test at 320×180 in less than 5 minutes
To see if things are working, it is easy to generate a ¼-size image for a 1280×720 video, that is 320×180. One thing to learn is that heavier quantization means a smaller model, but there is a speed tradeoff. For a single frame at 20 steps (320×180×20×1), the sweet spot seems to be q8, at least for this M4 Max, as long as you do not run out of memory:
Quantization (320×180, 20 steps, 1 frame) | Seconds/frame |
fp16 | 34 |
q8 | 19 |
q5 | 86 |
q4 | 42 |
Conclusion: 1280×720 at 20 steps and 129 frames runs out of memory on 128GB machines; 640×360 works
A good compromise at 20 steps seems to be q8, because it’s faster and quality is decent, and 640×360 generates some decent images. I am seeing that for longer-running jobs ComfyUI crashes; looking at the logs in ~/Documents/ComfyUI/comfyui_8000.log I see it load the model and then hang. Basically, although it is supposed to handle up to 129 frames, most of the time it seems to crash, and I’m guessing it is out of memory, given that with 1280×720 I’m seeing overall RAM draw at 73% and rising as it adds more frames.
I did try 360 frames, but a bunch of those runs failed, and it takes 4 hours on an M4 Max and 8 hours on an M1 Max. Since most videos are actually a series of quick cuts, it makes more sense to build something that is a series of cuts, so you might want to learn how to do image-to-video so that you can match them up one by one.
Trained on 2MP, but can go to 4MP according to a Reddit user
First use a small size and tune the prompt, then bump to the big one. So ¼ size is 480×270 (68 sec on the M1 Max), which is what I do, but here are the steps:
- 512×512. 5 seconds, 30 steps, basic
- 1024×1024. Details shining
- 1920×1080. Eye opener
- 2560×1440. More detail but not better, which makes sense since it is trained on 1920×1080
What the heck is Webp? GIF Remixed
This is the format that the default Hunyuan workflow outputs. It’s basically like an animated GIF, a sequence of images compressed with VP8, but it is not accepted by YouTube or other systems, so you need to convert it to a normal video format to use it. I couldn’t get ImageMagick to properly convert a WebP to an MP4, so I had to use an online tool, which worked great; a local command-line route is sketched below.
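If you want to stay local, newer ffmpeg builds can decode animated WebP directly (this is a fairly recent addition, so it may not work on older installs); assuming yours can, the conversion is a one-liner:
ffmpeg -i output.webp -c:v libx264 -pix_fmt yuv420p output.mp4
# if your ffmpeg cannot read the file, libwebp's anim_dump tool (if your build includes it)
# can dump the frames to PNGs, which you can then stitch with ffmpeg -framerate 24 -i dump_%04d.png ...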
Real Usage: Generate 1 frame, then 3 seconds and then 5 seconds
Now the main thing is how long a video you really want. Well, most commercials on YouTube today are 15 seconds, but the most critical part is the first 5 seconds (voices.com). So you can see you can easily generate ad-length clips: first generate a single frame quickly to make sure it is OK, then restart your machine and it will take about an hour to generate the critical 5 seconds. Once that is done, you run it for six hours (say an afternoon for an M4 Max).
If you really want the traditional 30 second ad, then you have to generate each scene individually, so generate 5 seconds and then 5 seconds more.

Then you can generate a few more frames and adjust the prompts
Optimizing even more with LoRA and other tunings
There are lots of other pipelines to try, and I’m a little confused about what works where, but there is more research to be done. Basically, images are easy and video is still hard. You just have to learn some terms and try it.
Image to Video, Video + Prompt to Video
This same model supports more than text-to-video, and image-to-video is next when I get time to try it.
Trying Janus Pro for Image Generation, understanding and text+image to new image
This is the latest hotness from DeepSeek, and there is already a Janus Pro node. It was a bit of a pain, but I learned a lot. The ComfyUI wiki recommends just downloading the Janus-Pro node, but when I did that and tried it, it said “CUDA not installed”.
Looking at the underlying repo, I saw a fix for this which added MPS support back in, so here is the magic that actually works:
- Uninstall the Janus Pro node from CY, as it will confuse things
- Fork the CY repo into your own
- Now you will see a bunch of pull requests that fix things like the MPS support
- So now you add a few upstreams; in my case I added git remote add alvin git@github.com:alvin-change/comfyui-janus-pro and then did a tricky git rebase -i alvin/main to get this into my branch
- Now here’s the real trick: go to ~/Documents/ComfyUI/custom_nodes and symlink that repo in
- We have a monorepo that keeps track of everything, so this means we can track all our changes and the symlink still works. I linked it in as ComfyUI-janus-pro-rich so as not to confuse the ComfyUI Node Manager (see the sketch below)
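Concretely, the dance looks something like this (the paths, the remote name and <you> as your GitHub account are my assumptions; adjust to your setup):
# clone your fork of the CY Janus Pro node wherever you keep source (a monorepo in my case)
git clone git@github.com:<you>/ComfyUI-Janus-Pro.git
cd ComfyUI-Janus-Pro
# pull in the fork that carries the MPS fix and rebase it into your branch
git remote add alvin git@github.com:alvin-change/comfyui-janus-pro
git fetch alvin
git rebase -i alvin/main
# symlink the repo into custom_nodes under a distinct name so the Node Manager doesn’t fight with it
ln -s "$(pwd)" ~/Documents/ComfyUI/custom_nodes/ComfyUI-janus-pro-rich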
Now when you open a workflow that requests those nodes, ComfyUI finds the symlinked repo and it just works!
It looks pretty good, but the test workflow is pretty confusing. For one thing, the image generation and the image understanding are not integrated, which confused me quite a bit. I would have thought the image generation would be tied to the image understanding so you can do image-to-image with text.
Modifying Janus Pro to add Image Saver and Prompt Composer for Image+Text to new image
OK, this is where things get hairy: you really need two helper nodes. The first is Image Saver, which lets you write a JSON file of metadata for each image. To make this work you have to use “Convert widget to input” a lot, but it’s worth it to have all of this in the metadata of the files. The other is Comfy Prompt Composer; these are basically string functions that let you concatenate and save and read prompts from disk. And the Save Text node is really good because it will save all the prompts for you automatically in a text file.
Status of image generation: watch the KSampler bar; set steps to 20 for Flux.1 Dev and 4 steps for Flux.1 Schnell
It turns out the KSampler node actually shows its step progress in its top bar, so that is a good way to figure out what is going on. But the big question is how many steps to run. The folks on Reddit say roughly the following; given this, I leave it at 20 steps as the default, since going to 30 increases render time by 50% and is usually not worth it for me:
- 10 steps unusable
- 15 steps pretty good, some grain
- 20 steps really excellent
- 30 steps hard to tell might be better
- 50 steps really hard to tell
If you are in a real hurry and don’t mind more cartoon-like outputs, then you can use Flux.1 Schnell. As the name implies, this is a distilled Flux.1 Dev and converges in as little as 4 steps.