CUDA 11.4 and above are recommended (this is for GPU users, flash-attention users, etc.) To run Qwen-72B-Chat in bf16/fp16, at least 144GB GPU memory is required (e.g., 2xA100-80G or 5xV100-32G). To run it in int4, at least 48GB GPU memory is requred (e.g., 1xA100-80G or 2xV100-32G).
It's derived from Qwen-72B, so same specs. Q2 clocks it in at only ~30GB.