🧭 2. Overall Lab Workflow - 6

HEAD1TON 2025. 7. 2. 17:07

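✅ QLoRA Lab Overview

🧠 Concept Summary

QLoRA is a fine-tuning technique that applies LoRA adapters to a pretrained model whose weights have been quantized to 4bit. The quantized base weights stay frozen; only the adapters are trained.
Key components:
1. Base model: loaded in 4bit (saves memory)
2. LoRA: small, trainable low-rank modules
3. Gradient checkpointing: saves GPU memory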
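
To make the memory saving from low-bit loading concrete, here is a rough back-of-the-envelope sketch. It counts weight storage only and ignores activations, optimizer state, quantization constants, and CUDA overhead. The helper name weight_memory_gib and the figure of about 1.3 billion parameters are assumptions for illustration, not measurements.

```python
def weight_memory_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory needed to store the weights alone, in GiB."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 2**30


if __name__ == "__main__":
    assumed_params = 1.3e9  # assumed size of a ~1B-class model, for illustration only
    for bits in (16, 8, 4):
        gib = weight_memory_gib(assumed_params, bits)
        print(f"{bits:>2}-bit weights: about {gib:.2f} GiB")
```

Going from 16bit to 4bit cuts weight storage to a quarter. The LoRA adapters are kept in higher precision, but they are small enough that this barely changes the total.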

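💡 Effect

Criterion	Full fine-tune	LoRA	QLoRA

Memory usage	Very high	Low	Very low
Training speed	Slow	Fast	Fast (can be somewhat slower than LoRA because weights are dequantized on the fly)
Performance	Baseline	Nearly identical	Nearly identical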

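📦 Step 1: Install the required packages

pip install torch bitsandbytes accelerate peft transformers datasets

torch: PyTorch, needed by the code below (torch.float16)
bitsandbytes: 8bit and 4bit quantized loading
accelerate: mixed precision and device placement
peft: the core library for LoRA and QLoRA
transformers: Hugging Face models
datasets: training datasets

💡 CUDA 11.7 or newer must be installed in your local environment.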

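⚙️ Step 2: Load the model in 4bit (the core of QLoRA)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "tiiuae/falcon-rw-1b"

# 4bit quantization settings. The llm_int8_* options only affect 8bit loading,
# so they are left out here.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",  # NormalFloat 4bit
    bnb_4bit_compute_dtype=torch.float16,  # dtype used for matmuls after dequantization
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # let accelerate place the layers on the available device(s)
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

load_in_4bit=True: loads the model weights in 4bit
nf4: NormalFloat4, a 4bit format designed for normally distributed weights that balances accuracy and compression well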
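
To give a feel for what a "normal float" 4bit format does, here is a deliberately simplified sketch: 16 levels are placed at quantiles of a standard normal distribution, and each block of weights is scaled by its absolute maximum before being mapped to the nearest level. This is not the bitsandbytes implementation. The real NF4 codebook is built differently (for example, it includes an exact zero), and all names below (normal_levels, quantize_block, dequantize_block) are hypothetical.

```python
from statistics import NormalDist


def normal_levels(num_levels: int = 16) -> list[float]:
    """Levels at evenly spaced quantiles of N(0, 1), rescaled to [-1, 1]."""
    nd = NormalDist()
    # Offset by 0.5 so no probability is exactly 0 or 1 (inv_cdf would fail there).
    quantiles = [nd.inv_cdf((i + 0.5) / num_levels) for i in range(num_levels)]
    top = max(abs(q) for q in quantiles)
    return [q / top for q in quantiles]


LEVELS = normal_levels()


def quantize_block(values: list[float]) -> tuple[list[int], float]:
    """Map each value to the index (0..15) of the nearest level after absmax scaling."""
    absmax = max(abs(v) for v in values) or 1.0  # avoid dividing by zero
    codes = [
        min(range(len(LEVELS)), key=lambda i: abs(LEVELS[i] - v / absmax))
        for v in values
    ]
    return codes, absmax


def dequantize_block(codes: list[int], absmax: float) -> list[float]:
    """Recover approximate values from 4bit codes and the block scale."""
    return [LEVELS[c] * absmax for c in codes]


if __name__ == "__main__":
    block = [0.02, -0.5, 0.13, 0.0, -0.07, 0.31, 0.5, -0.25]
    codes, scale = quantize_block(block)
    print(codes, scale)
    print([round(v, 3) for v in dequantize_block(codes, scale)])
```

Because the levels are denser near zero, small weights (which are the most common in a roughly normal distribution) are reconstructed more precisely than they would be with evenly spaced levels.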

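🧩 Step 3: Configure and apply LoRA

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType

# Freeze the quantized base weights and upcast the remaining half-precision
# parameters (such as layer norms) to float32 for more stable training.
# Gradient checkpointing is switched on separately in Step 4.
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=False)

lora_config = LoraConfig(
    r=8,                # rank of the low-rank update matrices
    lora_alpha=32,      # scaling numerator; the update is scaled by lora_alpha / r
    lora_dropout=0.05,  # dropout applied on the LoRA branch
    bias="none",        # leave bias terms frozen
    task_type=TaskType.CAUSAL_LM
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # prints trainable vs. total parameter counts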
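
To see why only a tiny fraction of parameters ends up trainable, the arithmetic can be sketched as follows. For a single linear layer of shape d_in x d_out, LoRA adds two matrices of shapes r x d_in and d_out x r, and the update is scaled by lora_alpha / r. The layer dimensions used here (2048 and 6144) are assumed values chosen for illustration, and the helper names are hypothetical.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one linear layer: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha: int, r: int) -> float:
    """Factor applied to the low-rank update before it is added to the frozen output."""
    return lora_alpha / r


if __name__ == "__main__":
    d_in, d_out, r = 2048, 6144, 8  # assumed layer shape, for illustration only
    added = lora_param_count(d_in, d_out, r)
    frozen = d_in * d_out
    print(f"LoRA adds {added:,} parameters next to {frozen:,} frozen ones "
          f"({100 * added / frozen:.2f}%)")
    print("scaling factor:", lora_scaling(32, r))
```

With r=8 and lora_alpha=32 as in the configuration above, the scaling factor is 4. Raising r increases adapter capacity and the trainable parameter count linearly.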

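🧠 Step 4: Memory-saving settings (gradient checkpointing, etc.)

model.gradient_checkpointing_enable()  # recompute activations in the backward pass instead of storing them
model.enable_input_require_grads()     # let gradients flow through the frozen embeddings to the LoRA layers
model.config.use_cache = False         # the key/value cache is only for generation and conflicts with checkpointing

These settings trade extra computation for lower memory use: most intermediate activations are discarded during the forward pass and recomputed when the backward pass needs them.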
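
The idea behind gradient checkpointing can be shown with a toy example in plain Python. This is not how transformers implements it. It only illustrates the trade-off on a chain of scalar layers: the checkpointed version stores one value per segment and recomputes the rest during the backward pass, yet produces the same gradient. All names here (make_layer, grad_full, grad_checkpointed) are hypothetical.

```python
import math


def make_layer(w: float):
    """A scalar 'layer' y = tanh(w * x) together with its derivative dy/dx."""
    f = lambda x: math.tanh(w * x)
    df = lambda x: w * (1.0 - math.tanh(w * x) ** 2)
    return f, df


def grad_full(layers, x):
    """Plain backprop: store the input of every layer. Returns (gradient, values stored)."""
    inputs = []
    for f, _ in layers:
        inputs.append(x)
        x = f(x)
    grad = 1.0
    for (_, df), inp in zip(reversed(layers), reversed(inputs)):
        grad *= df(inp)
    return grad, len(inputs)


def grad_checkpointed(layers, x, segment):
    """Store only each segment's input and recompute inside the segment during backward."""
    starts = list(range(0, len(layers), segment))
    checkpoints = []
    for start in starts:
        checkpoints.append(x)
        for f, _ in layers[start:start + segment]:
            x = f(x)
    grad, peak = 1.0, len(checkpoints)
    for start, ckpt in zip(reversed(starts), reversed(checkpoints)):
        seg = layers[start:start + segment]
        inputs, h = [], ckpt
        for f, _ in seg:  # recompute this segment's activations
            inputs.append(h)
            h = f(h)
        # Upper bound on values held at once: all checkpoints plus one recomputed segment.
        peak = max(peak, len(checkpoints) + len(inputs))
        for (_, df), inp in zip(reversed(seg), reversed(inputs)):
            grad *= df(inp)
    return grad, peak


if __name__ == "__main__":
    layers = [make_layer(0.5 + 0.1 * i) for i in range(16)]
    print(grad_full(layers, 0.3))
    print(grad_checkpointed(layers, 0.3, segment=4))
```

Both calls return the same gradient, but the checkpointed version holds fewer values at once at the cost of running each segment's forward pass a second time.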

🔁 Step 5: Run training with Trainer

Training works the same way as in the earlier LoRA lab. The TrainingArguments and Trainer code defined previously can be reused as is.

💾 Step 6: Save the QLoRA result

model.save_pretrained("./output/qlora-adapter")
tokenizer.save_pretrained("./output/qlora-adapter")
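
Calling save_pretrained on a PEFT model writes only the adapter weights, not the base model. To use the result later, the base model has to be loaded again and the adapter attached to it. The following is a minimal sketch of that step, assuming the same model ID and output directory as above. The function name load_qlora_adapter is hypothetical.

```python
def load_qlora_adapter(model_id="tiiuae/falcon-rw-1b",
                       adapter_dir="./output/qlora-adapter"):
    """Rebuild the 4bit base model and attach the saved LoRA adapter for inference."""
    # Heavy dependencies are imported lazily so that defining this function
    # does not by itself require the GPU stack to be installed.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )
    base = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb_config, device_map="auto"
    )
    # Attach the adapter weights saved in Step 6 on top of the quantized base model.
    model = PeftModel.from_pretrained(base, adapter_dir)
    model.eval()
    # The tokenizer was saved into the same directory, so it can be loaded from there.
    tokenizer = AutoTokenizer.from_pretrained(adapter_dir)
    return model, tokenizer
```

Usage would look like model, tokenizer = load_qlora_adapter(), after which model.generate can be called as with any causal language model.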

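📉 Memory usage comparison

Item	16bit LoRA	QLoRA (4bit)

Model loading	About 6~8 GB	About 2~3 GB
VRAM needed for training	8 GB or more	Possible even with 4 GB
Performance	Slight loss possible (vs. full fine-tuning)	On par with or close to 16bit LoRA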
