ai: ComfyUI Missing Mac Manual v1.3

OK, my previous post on Open WebUI was getting a little long, and while it covers ComfyUI integration, the bigger topic of actually using ComfyUI is scattered to the winds. So if you just want to get ComfyUI to work, here is the missing manual for that. Some of this duplicates the base setup, but it then drills down deeper.

Installation

As I said before, the easiest thing to do is to just use ComfyUI Desktop. This is a DMG that you download, and it updates itself too!

Connection to Open WebUI

As previously documented, right now you can only import a single workflow (exported as an API workflow), and you have to fill in which nodes and widgets hold specific variables like height and width. This means carefully paging through the JSON looking for the “inputs” sections that contain these.
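If you are hunting for those, a little jq can save some scrolling. This is a hedged sketch that assumes the usual API-export layout (an object mapping node id to class_type and inputs) and a file named workflow_api.json:

# list each node's id, class type, and the names of its inputs so you can spot width and height
jq -r 'to_entries[] | "\(.key)\t\(.value.class_type)\t\(.value.inputs | keys | join(","))"' workflow_api.json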

Where does ComfyUI live?

Like Open WebUI, this is a bit of a mystery, but basically the default is ~/Documents/ComfyUI, and that is where the models, the user configuration, and so on live.
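If you want to sanity-check your install, the layout looks roughly like this (the exact folders vary a bit by version):

ls ~/Documents/ComfyUI
# custom_nodes  models  output  user  ...
ls ~/Documents/ComfyUI/models
# checkpoints  clip  clip_vision  loras  text_encoders  unet  vae  ...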

Downloading with Huggingface-cli

This is the first model that we tried, and there is much talk online about how to make it work.

One thing you want to make sure to do is brew install huggingface-cli and use it to download things. It also caches downloads, which is very nice so you don't fetch the same file multiple times. The Hugging Face site is really good about letting you copy the org/repo name and the relative paths within the repos, since they use Git and Git LFS to store and version everything:

huggingface-cli download <org>/<repo> <path in repo> --local-dir ~/Documents/ComfyUI/
# this preserves the file's relative path within the repo under --local-dir
huggingface-cli download calcuis/hunyuan-gguf hunyuan-video-t2v-720p-q4_0.gguf --local-dir ~/Documents/ComfyUI/models/unet
# some repos nest files under a split_files/ prefix, which gets preserved too and is inconvenient, so pay attention to where things land

Hunyuan Text to Video: GGUF and FP16

This is the hottest open-source model from Tencent, and the trick is that for GGUF installations you need:

  1. A workflow.json that says how the pieces are tied together. This is normally on Hugging Face, so huggingface-cli download <org>/<repo> workflow.json --local-dir ~/Documents/ComfyUI/user/default/workflows will put it with the default workflows (see the sketch after this list)
  2. Then you go to ComfyUI and open that workflow
  3. At the upper right, click the Manager button and then “Install Missing Custom Nodes”. If things go right, this will load the right nodes for you, and then you restart
  4. Now you download the various models into the right places, such as ./models/unet, ./models/vae, and so on, as the workflow expects
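Putting those steps together, the downloads look something like this. The angle brackets are placeholders for whatever repo and file names the model card actually lists, and the destination folders are the ones the workflow expects:

# the workflow JSON goes with the default workflows so ComfyUI can open it
huggingface-cli download <org>/<repo> <workflow>.json --local-dir ~/Documents/ComfyUI/user/default/workflows
# the GGUF unet, the VAE, and the text encoders each go into their own models/ folder
huggingface-cli download <org>/<repo> <unet>.gguf --local-dir ~/Documents/ComfyUI/models/unet
huggingface-cli download <org>/<repo> <vae>.safetensors --local-dir ~/Documents/ComfyUI/models/vae
huggingface-cli download <org>/<repo> <text_encoder>.safetensors --local-dir ~/Documents/ComfyUI/models/clip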

You can also get this directly from ComfyUI, and the only pain here is that they use the directory split_files, which doesn't work well with --local-dir, and it has a different workflow.json, so these files need to end up under ./models.
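One hedged workaround for that: stage the download in a scratch directory and move the files into the real model folders yourself (the angle brackets are placeholders again):

# --local-dir preserves the split_files/ prefix, so download somewhere temporary and move it
huggingface-cli download <org>/<repo> split_files/vae/<vae>.safetensors --local-dir /tmp/hunyuan
mv /tmp/hunyuan/split_files/vae/<vae>.safetensors ~/Documents/ComfyUI/models/vae/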

How to write a Text to Video Prompt

This is a bit of a mystery to me, and many recommend just using a model like DeepSeek to generate the prompt (it has been trained on lots of prompts). This explanation of what you need is good and this video is a little helpful, but the basic idea is that you want things in a specific order (which makes sense; that is probably how the training videos were tagged):

  1. You want 100-300 words. Most of these descriptions are very wordy, and I have to experiment with cutting them down. I don't think it needs noise words, and a simple YAML-like format is probably better, in this order…
  2. Subject. “A sleek electric car”; note that you should include size, color, and other adjectives to guide the system.
  3. Scene. Describes the general environment, like “minimalist modern apartment”.
  4. Action or Motion. What you want the subject to do, like “gracefully dancing through falling leaves”. Again, the more adjectives, the more guidance.
  5. Camera Work. That is, how the shot moves, like “dramatic circular pan”.
  6. Mood. What the tone is, like “energetic vibrant mood”.
  7. Lighting. Define things like “soft, warm sunlight filtering through trees”.
  8. Shot composition. This tells it what kind of shot, like “wide landscape shot emphasizing scale”.

Using Ollama Node to Generate prompts

I’m not sure how this node is prompted, but you can download a node that uses Ollama to generate the prompt. I’m not clear on what model it uses, what the system prompt is, or even how to print out the text, since it just flows into the next node.

Getting small models with GGUF

The default instructions use their FP8 model, but there is a Q4_0 model which works great. Note that the Q4_K_M does not work at all. The Q4 model halves the memory needed, but it is also slower, so it's a tradeoff. If you have enough memory, Q8 seems quite good, and BF16 is something you can try.
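Swapping quantizations is just a matter of downloading a different GGUF into models/unet. The q8_0 file name below is my guess at the repo's naming, following the q4_0 file above, so check the repo listing before copying it:

# assumed naming pattern based on the q4_0 file; verify the exact name on the repo page
huggingface-cli download calcuis/hunyuan-gguf hunyuan-video-t2v-720p-q8_0.gguf --local-dir ~/Documents/ComfyUI/models/unet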

Best Video Height and Width

At least with Flux.1, it likes to create images of the same size it was trained on, in this case 512×512, but Hunyuan is trained on 720p video (1280×720). So, just checking speeds, we need to find a natural size.

Performance M1 Max vs M4 Max

So I tried some samples of just generation. I did just 2 frames to get a sense, at ¼ resolution and at full resolution, and also with the original fp16 model (there is an fp8 available as well).

The Prompt used followed the format above:

happy family with mother father and kids in summer dress

a beautiful sand beach with clouds waves, seagulls, shells and drift wood

walking along the carefree with wind blowing through their hair

The camera shows faces in closeup then zooms out to see the entire beach and pans from their front to behind

Beautiful sun lit with sun high in the air

The memory used includes the system and everything else, not just the model. This was done on the second run of a single frame to warm the caches. The most interesting thing is that the Q4 model runs slower but uses only marginally less memory. Note that these differences are not a big deal for the Mac, but for PCs with only 24GB of VRAM they make a big difference, since the gap between q4 and bf16 is 13GB.

The overall M4 Max RAM numbers are higher by about 10GB; I suspect this is because we are measuring whole-system usage (the Brave browser was running too), but the trends are about right. It's interesting to see that the q8 model is very fast; it could be because a byte is a very natural size for the machine, I'm not sure. But if you don't care too much about memory, as on a Mac, q8 is a good way to go. We then tried ¼ and 1/16 size, and you can see that for thumbnails it is amazingly fast at 15s per frame, though of course this is more like a cartoon than a realistic image, and you can see why a common trick is to render a small video and then upscale it.

Note this is at 20 diffusion steps; moving to 40 steps essentially doubled the compute time and didn't add much quality, so 20 steps is not a bad place to start. It took 37 seconds for 20 steps at q8 for a 640×360 image but 69 seconds with 40 steps.

Then the test of doing a 5-second video (120 frames) shows how long the batch process can last, and that storing the additional frames really pushes memory, which is why image-to-video is often used, so you can create longer videos in batches. Also note that the 64GB machine's memory use doesn't increase as much, which probably indicates we are hitting limits there, so there is less RAM available for caching.

Note that with the M4 Max I was getting hangs at various frame counts, so I'm starting from 12 frames and seeing what's going on. You can look at the logs at ~/Documents/ComfyUI/comfyui_8000.log, but they show nothing unusual; activity just drops to zero. There are some bugs: I've tried restarting ComfyUI, and next up is a full reboot. I also got rid of some nodes that I was playing with but don't use, which didn't help. I'm not quite sure what is going on, but I do get an “aten::upsample_nearest3d.vec” warning, which means that op falls back to the CPU.
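To watch what the server is doing while a long run grinds along, you can follow that log from a terminal (adjust the path if your install differs):

# follow the ComfyUI server log during a render
tail -f ~/Documents/ComfyUI/comfyui_8000.log
# or just pull the warnings and errors out of the last run
grep -iE "error|warn|fallback" ~/Documents/ComfyUI/comfyui_8000.log | tail -20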

On the second run on the M4 Max, I got a “typecheck: failed to fetch” error, which means out of memory (this seems like a bug).

Time (sec)/Memory                        | M1 Max 64GB                         | M4 Max 128GB
q4, 640×360, 20 steps                    | 137s/35GB                           | 58s/48GB
q4_k_m, 640×360, 20 steps                | No image                            | No image
q8, 640×360, 20 steps                    | 96s/41GB                            | 37s/54GB
bf16, 640×360, 20 steps                  | 121s/48GB                           | 62s/58GB
q8, 320×180, 20 steps                    | 34s/39GB                            | 15s/56GB
q8, 1280×720, 20 steps                   | 419s/41GB                           | 201s/59GB
q8, 640×360, 40 steps                    | 193s/39GB                           | 69s/57GB
q8, 640×360, 20 steps, 12 frames         | N/A                                 | 180s/55GB (15s/frame)
q8, 640×360, 20 steps, 25 frames         | N/A                                 | Hang ×2 (55GB, 61GB), then 731s/43GB (25s/frame)
q8, 640×360, 20 steps, 121 frames (5s)   | 8348s or 2:19 hrs/52GB (66s/frame)  | Hang/57GB, then 3300s or 55 min/57GB (29s/frame)
q8, 640×360, 20 steps, 361 frames (15s)  | N/A                                 | Overnight/81GB

M4 Max is 2x M1 Max

So the performance difference in four years is kind of incredible. Some of it might be due to the larger RAM available, but even for smaller configurations, this holds. If memory is a real issue, then q4 with its slower speed is the way to go. Basically, without much optimization you get roughly 120 frames/hour (or about 3 seconds of video per hour), so a long movie won't work, and even a short 30-second clip is going to need 10 hours of grinding.
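To get a feel for these numbers, here is a trivial back-of-the-envelope calculator. The 29 seconds/frame figure comes from the q8 640×360 runs on the M4 Max above, and the 24 fps is implied by the 121 frames = 5 seconds row, so swap in your own measurements:

#!/bin/sh
# rough render-time estimate: seconds of video in, hours of grinding out
CLIP_SECONDS=${1:-15}  # clip length you want
FPS=24                 # implied by 121 frames for 5 seconds
SEC_PER_FRAME=29       # measured q8 640x360 rate on the M4 Max
FRAMES=$((CLIP_SECONDS * FPS + 1))
TOTAL=$((FRAMES * SEC_PER_FRAME))
echo "$FRAMES frames, roughly $((TOTAL / 3600)) hours $(((TOTAL % 3600) / 60)) minutes"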

This is why most folks use image-to-video or a cloud service, but it is great for short clips. And it's practical to do locally because single-image generation times are so fast.

Conclusion: Q8 at 640×360 is fastest at 15s/frame, but it can hang or fail, so for long runs restart everything

A good compromise is q8 at 20 steps, because it's faster and the quality is decent, and 640×360 seems to generate decent images. I am seeing that for longer runs ComfyUI crashes; looking at the logs in ~/Documents/ComfyUI/comfyui_8000.log, I see it load the model and then hang.

Stability problems at 360 frames

I had a bunch of these, and a 360-frame run takes 4 hours on an M4 Max and 8 hours on an M1 Max. Since most videos are actually a series of quick cuts, it makes more sense to build something as a series of cuts. So you might want to learn how to do image-to-video so that you can match the clips up one by one.
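Once you have a handful of short cuts converted to regular video files (see the WebP section below), stitching them together is easy with ffmpeg's concat demuxer. A sketch, assuming the clips share the same codec, resolution, and frame rate:

# list the clips in order, then concatenate them without re-encoding
cat > cuts.txt <<EOF
file 'cut1.mp4'
file 'cut2.mp4'
file 'cut3.mp4'
EOF
ffmpeg -f concat -safe 0 -i cuts.txt -c copy full_ad.mp4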

What the heck is WebP? GIF Remixed

This is the format that the default Hunyuan workflow outputs. It's basically like an animated GIF: a sequence of images compressed with VP8/9. But it is not accepted by YouTube or other systems, so you need to convert it to a normal video format to use it.
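One way to convert is with ffmpeg, assuming your build can decode animated WebP (support is spotty in older builds; if you only get the first frame, update ffmpeg or switch the workflow to a video output node). The input name is just whatever landed in your output folder:

# convert the animated WebP into an H.264 MP4 that YouTube and friends accept
ffmpeg -i ~/Documents/ComfyUI/output/<clip>.webp -c:v libx264 -pix_fmt yuv420p output.mp4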

Real Usage: Generate 1 frame, then 5 seconds and then 15 seconds

Now, the main thing is how long a video you really want. Most commercials on YouTube today are 15 seconds, but the most critical part is the first 5 seconds (voices.com). So you can easily generate an ad-length clip by first generating a single frame quickly to make sure it looks OK, then restarting your machine and spending about an hour on the critical 5 seconds. Once that is done, you run it for three hours (say, an afternoon) to get the full 15 seconds.

If you really want the traditional 30-second ad, then you can leave it overnight (10 hours) to generate, but I would focus on the 15-second version to be more impactful. So here is the first image:

Single Image generated in 37 seconds for proofing

Then you can generate a few more frames and adjust the prompts.

Optimizing even more with LoRA and other tunings

There are lots of other pipelines to try, and I'm a little confused about what works where, but there is more research to be done. Basically, images are easy and video is still hard. You just have to learn some terms and try it.

Image to Video, Video + Prompt to Video

This same model supports more than text-to-video, and image-to-video is next.

Trying Janus Pro

This is the latest hotness from DeepSeek, and there is already a Janus Pro node.
