Downloaded models

I just downloaded llama3-8b-instruct with 8-bit quantization. It seemed to download close to 20GB, but where does it store the files? Looking at ~/.cache/huggingface/hub does not seem to show any files in the GB range, not even hundreds of MBs. Where did the files go? Also, I would imagine an 8B model with 8-bit quantization should take up about 8GB of space, so why is it downloading almost 20GB? Does it perform the quantization locally? It took a long time to download the models, but they are nowhere to be found.


It depends on the OS and on your administrator rights at the time of execution, but there are two patterns: either the entry is a symbolic link and the data is actually stored somewhere else, or it is in the cache folder as usual.

Also, if you use git to download the files, the size will be larger because it also downloads files related to git management, and the cache location is different as well.

For downloads, I recommend snapshot_download, which is easy and fast. There are faster methods, but they are cumbersome.
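For reference, a minimal snapshot_download sketch (the local_dir value is just an illustrative choice; by default the files go into the usual HF cache):

from huggingface_hub import snapshot_download

# Download the whole repository snapshot; gated repos like Llama 3.1
# additionally require an accepted license and a logged-in token.
local_path = snapshot_download(
    repo_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
    local_dir="./Meta-Llama-3.1-8B-Instruct",  # hypothetical target folder
)
print(local_path)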


This is the code that was utilized to download the model:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

# 8-bit quantization is applied by bitsandbytes while the weights are loaded
quant_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

Yes, I misread the file contents; the models are indeed in ~/.cache/huggingface/hub/. But from what I am reading, llama3 is 15GB and gemma2 is 36GB … did the quantization work at all? Or is this something that I have to run on the models …

The code you executed downloads the model and loads it into RAM or VRAM; it is not code for downloading only. The files are stored in the cache as a side effect, so if you run the same code again, it will finish much faster…

Also, neither Diffusers nor Transformers gives you an already-quantized download: even with the code above, the full-size checkpoint is downloaded and expanded once to bfloat16, and the quantization only happens in memory at load time.
You can call model.save_pretrained(~~) from that state to save the bfloat16 model, but it would be very large.

So I think it is better to treat downloading as a separate step: first download the files manually (or with snapshot_download), and then write code that loads the model from the downloaded folder instead of fetching it directly from HF.
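A rough sketch of that two-step workflow, assuming the folder name from a previous snapshot_download call:

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

local_dir = "./Meta-Llama-3.1-8B-Instruct"  # hypothetical folder created by snapshot_download

quant_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    local_dir,                          # loads from disk, no re-download
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(local_dir)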

If all you need is to be able to use it, just run the inference after the code above and it should already work fine.
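For example, something along these lines should run right after the loading code above (the prompt is just a placeholder):

prompt = "Explain what 8-bit quantization does, in one sentence."
messages = [{"role": "user", "content": prompt}]

# Build the chat-formatted input ids and generate a short reply
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))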


Thanks for the response. So for now, I should probably use some tools to convert it into GGUF format, quantize it, and hand the resulting models to ollama. I wanted to download 8B quantized models because my card has 12GB of VRAM; it should be able to handle the 8B and 9B models and perhaps offer better performance and accuracy than the 4-bit quantized models.

The link you have provided above makes no mention of gguf files… apparently ollama needs gguf format to utilize the model …

I think GGUF would work well in 12GB VRAM, too. BTW, I use 8B models quite a bit for fun as well, and I rarely have trouble with accuracy degradation around GGUF’s Q4_K_M.
Also, if you’re okay with GGUF, HF is home to some quantization experts, so to speak, so you’re better off getting it from one of their repos, which are better and more reliable. The imatrix-quantized ones in particular are good for just about anything.
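As a sketch, a single pre-quantized GGUF file can be pulled from one of those repos with hf_hub_download instead of downloading the whole repository (the repo_id and filename below are only illustrative; check the actual repo for the exact names):

from huggingface_hub import hf_hub_download

gguf_path = hf_hub_download(
    repo_id="bartowski/Meta-Llama-3.1-8B-Instruct-GGUF",  # example quantizer repo
    filename="Meta-Llama-3.1-8B-Instruct-Q8_0.gguf",      # example quant variant
)
print(gguf_path)  # the file lands in the HF cache; point ollama/llama.cpp at this path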


The link you have provided above makes no mention of gguf files…

GGUF is a separate format that comes from llama.cpp’s development rather than from HF, so there is some delay in HF’s support for it. Instead, users and the llama.cpp side have prepared various tools.


You can also trace GGUF and other quantizations from the Quantization section on the bottom right-hand side of this page.


Thanks for your response. I did find the GGUF-format quantized models:

Not knowing any better, I would imagine that the above is as good as any to download.

mradermacher, bartowski, QuantFactory… People around here rarely make mistakes.

In your post above you indicate that 4-bit quantization works well enough. It makes me wonder if it may be better to run two different 4-bit quantized models simultaneously on the GPU rather than one 8-bit model, since I have only 12GB of VRAM available. This would allow two different models to be used in case a specific model did not provide the right answer for a given task. Does this make any kind of sense?
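As a rough sketch of what that would look like with transformers and bitsandbytes (the model names are examples, and whether both actually fit in 12GB depends on context length and other overhead):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Two independently loaded 4-bit models sharing the same GPU;
# roughly ~5GB of weights each, plus KV cache on top, so it is tight.
model_a = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    quantization_config=quant_config,
    device_map="auto",
)
model_b = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b-it",
    quantization_config=quant_config,
    device_map="auto",
)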

It looks like this is the list of things to do:
download the GGUF

utilize it with ollama

test it

fine-tune it with the existing code base; I suppose fine-tuning will end up generating a new model file. I would imagine this step is incredibly important for the model to start behaving like the existing code base (see the sketch after this list).

repeat the above with multiple models, and possibly utilize multiple models together to get the tasks done
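On the fine-tuning item: a common route is to fine-tune the original HF checkpoint with a LoRA/QLoRA adapter via peft, and only afterwards merge and convert the result to GGUF for ollama. A minimal sketch, assuming the peft library and leaving the actual training loop to the usual Trainer/SFT tooling:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16
    ),
    device_map="auto",
)
base = prepare_model_for_kbit_training(base)

# Small LoRA adapter on the attention projections; rank and targets are illustrative.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()

# ... train on the existing code base here (e.g. with transformers.Trainer or trl's SFTTrainer) ...

# Fine-tuning produces a small adapter rather than a whole new checkpoint;
# save it, merge it into the base weights, then convert the merged model to GGUF.
model.save_pretrained("./llama3-codebase-lora")  # hypothetical output folder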

I’ve seen a few use cases like that. I think there was a specific technical name for it, but I don’t remember it, since I really only use these models for fun, on my own.
But I’m sure there is something to it.

P.S.
LangChain?