With this project, users can run Llama 2 locally with a gradio web UI on GPU or CPU from anywhere (Linux/Windows/Mac).
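As a rough illustration of how such a UI is wired up, here is a minimal gradio chat sketch; `generate_reply` is a hypothetical stand-in for the project's actual inference call, not code from this repository:

```python
# A minimal sketch of a gradio chat UI. generate_reply is a
# hypothetical placeholder for the real Llama 2 inference function.
import gradio as gr

def generate_reply(message, history):
    # In the real app, this would call the loaded Llama 2 model;
    # here we just echo the prompt back as a placeholder.
    return f"(model reply to: {message})"

gr.ChatInterface(generate_reply).launch()
```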
The project supports Llama-2-7B/13B/70B in 8-bit and 4-bit modes, with both GPU inference (from 6 GB of VRAM) and CPU inference.
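For reference, 8-bit and 4-bit loading is commonly done through transformers with bitsandbytes; the sketch below shows that pattern under the assumption of the gated `meta-llama/Llama-2-7b-chat-hf` checkpoint, and is not necessarily this project's exact configuration:

```python
# A minimal sketch of loading Llama-2-7b in 4-bit (or 8-bit) with
# transformers + bitsandbytes; model id and flags are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-chat-hf"  # assumes access has been granted

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # use load_in_8bit=True for 8-bit mode
    bnb_4bit_compute_dtype=torch.float16,   # compute in fp16 on the GPU
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",                      # place layers on available GPU(s)
)
```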
Llama 2 is a collection of pre-trained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters.
Llama-2-7b-Chat-GPTQ contains the GPTQ model files for Meta's Llama 2 7b Chat. The 4-bit GPTQ Llama-2 model needs much less GPU VRAM to run: by comparison, the unquantized fp16 weights require around 14 GB of GPU VRAM for Llama-2-7b and 28 GB for Llama-2-13b.
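One common way to load such a checkpoint is directly through transformers (with the `optimum` and `auto-gptq` packages installed); the sketch below assumes TheBloke's public GPTQ conversion and is an illustration rather than this project's exact loading code:

```python
# A minimal sketch of loading a 4-bit GPTQ checkpoint; requires the
# optimum and auto-gptq packages alongside transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

gptq_id = "TheBloke/Llama-2-7b-Chat-GPTQ"   # public GPTQ conversion

tokenizer = AutoTokenizer.from_pretrained(gptq_id)
model = AutoModelForCausalLM.from_pretrained(
    gptq_id,
    device_map="auto",   # 4-bit weights fit in roughly 6 GB of VRAM
)
```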
If you are running on multiple GPUs, the model is loaded across them automatically and the VRAM usage is split between the cards. This allows you to run Llama-2-7b (which needs around 14 GB of GPU VRAM in fp16) on a setup such as two GPUs with 11 GB of VRAM each.
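As a rough illustration of that split, the sketch below caps per-GPU memory so accelerate shards the fp16 model across two 11 GB cards; the model id and `max_memory` values are illustrative assumptions, not settings taken from this project:

```python
# A minimal sketch, assuming two 11 GB GPUs; accelerate shards the
# fp16 weights across both cards. max_memory values are illustrative.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",          # assumes gated-repo access
    torch_dtype=torch.float16,                # ~14 GB of weights in fp16
    device_map="auto",                        # let accelerate place layers per GPU
    max_memory={0: "10GiB", 1: "10GiB"},      # leave headroom on each 11 GB card
)
```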
Additional details are available here.