Status Update Log
v0.0.1 - Initial Setup and Project Kickoff
- Created the project directory structure and initialized version control.
- Set up the basic environment using Python 3.12 and installed the required libraries for Transformer-based model training, including PyTorch and HuggingFace Transformers.
v0.0.2 - Dataset Preparation
- Selected the SQuAD dataset for fine-tuning the model.
- Loaded the dataset and performed initial data cleaning and formatting.
- Created scripts for dataset exploration to understand question-answer distribution and context.
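- As a rough illustration (the actual exploration scripts are not reproduced here), a minimal SQuAD loading and inspection pass with the Hugging Face `datasets` library might look like this; the column names follow the standard SQuAD schema:
```python
from datasets import load_dataset

# Load SQuAD from the Hugging Face hub (standard train / validation splits).
squad = load_dataset("squad")
print(squad)

# Rough look at question/context length distribution on a small sample.
sample = squad["train"].select(range(1000))
q_lens = [len(ex["question"].split()) for ex in sample]
c_lens = [len(ex["context"].split()) for ex in sample]
print(f"avg question length: {sum(q_lens) / len(q_lens):.1f} words")
print(f"avg context length:  {sum(c_lens) / len(c_lens):.1f} words")
print("example:", sample[0]["question"], "->", sample[0]["answers"]["text"][0])
```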
v0.0.3 - Model Selection
- Chose the LLaMA-3.2-1B model as the base for fine-tuning due to its capabilities with language modeling tasks.
- Downloaded and set up the pre-trained model in the development environment.
- Verified model loading and initial inference to ensure correct setup.
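- A minimal sketch of the loading and inference check, assuming the model is pulled from the Hugging Face hub under the `meta-llama/Llama-3.2-1B` identifier (gated; requires accepting the license and authenticating):
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B"  # assumed hub identifier for the base model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Sanity check: short greedy generation to confirm the model loads and runs.
inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```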
v0.0.4 - Experimentation with Quantization
- Tested quantization techniques to fit the LLaMA model in available GPU memory.
- Implemented 8-bit quantization using `BitsAndBytesConfig` to reduce the memory footprint.
- Quantization reduced memory usage and allowed the model to run smoothly on the g4dn.xlarge instance.
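- As a sketch (not the project's exact configuration), the 8-bit loading path with `BitsAndBytesConfig` might look like this; the config values are illustrative and assume the `bitsandbytes` backend is installed:
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Illustrative 8-bit quantization config; the project's actual settings may differ.
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    quantization_config=bnb_config,
    device_map="auto",  # place layers on the available GPU automatically
)
```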
v0.0.5 - Gradient Checkpointing and Mixed Precision
- Enabled gradient checkpointing to save memory during backpropagation by recalculating activations on-the-fly.
- Utilized mixed precision training (fp16) to further optimize memory usage.
- Both approaches combined allowed training without running into memory limitations.
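- A minimal sketch of enabling both, assuming the Hugging Face model object from the previous steps and a plain PyTorch loop; the project may instead have relied on the `Trainer`'s `gradient_checkpointing=True` and `fp16=True` flags:
```python
import torch

# Recompute activations during backprop instead of storing them (trades compute for memory).
model.gradient_checkpointing_enable()
model.config.use_cache = False  # the KV cache is incompatible with checkpointing

# Mixed precision (fp16): autocast the forward pass and scale gradients.
scaler = torch.cuda.amp.GradScaler()

def training_step(batch, optimizer):
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(**batch).loss
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```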
v0.0.6 - Parameter-Efficient Fine-Tuning (PEFT) Setup
- Implemented PEFT with LoRA to add trainable adapters to the LLaMA model.
- Used the `peft` library for seamless integration of LoRA adapters.
- Ensured that the adapters effectively reduced the number of trainable parameters while maintaining model performance.
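- A hedged sketch of the `peft` setup; the rank, alpha, dropout, and target modules below are placeholders, since the exact adapter configuration is not recorded in this log:
```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Prepare the quantized model for training (casts norms, enables input gradients).
model = prepare_model_for_kbit_training(model)

# Placeholder LoRA hyperparameters; the project's actual values are not recorded here.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # confirms only a small fraction of parameters train
```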
v0.0.7 - Custom GPU Memory Allocation Checks
- Encountered discrepancies between reported and actual available memory on the GPU instance.
- Wrote custom scripts to monitor GPU memory allocation and identify bottlenecks.
- Adjusted model configurations based on insights to ensure efficient use of GPU resources.
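- The monitoring scripts themselves are not included here; a minimal sketch of the kind of check they performed (PyTorch-allocated vs. reserved vs. total device memory) could be:
```python
import torch

def log_gpu_memory(tag=""):
    """Print allocated/reserved memory as seen by PyTorch, plus the device total."""
    if not torch.cuda.is_available():
        print(f"[{tag}] no CUDA device available")
        return
    allocated = torch.cuda.memory_allocated() / 1024**3
    reserved = torch.cuda.memory_reserved() / 1024**3
    total = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"[{tag}] allocated: {allocated:.2f} GB | reserved: {reserved:.2f} GB | total: {total:.2f} GB")

log_gpu_memory("after model load")
```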
v0.0.8 - Optimizer and Training Configuration
- Implemented the `paged_adamw_8bit` optimizer to offload optimizer states to CPU, reducing GPU memory requirements.
- Introduced gradient accumulation to simulate larger batch sizes without increasing memory usage.
- Configured learning rate schedules and optimizer settings to suit the training requirements of LLaMA-3.2-1B.
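- A sketch of how these settings map onto Hugging Face `TrainingArguments`; the numeric values are placeholders, since the batch size and accumulation steps were still being tuned at this point:
```python
from transformers import TrainingArguments

# Illustrative values; the project's actual settings evolved in later entries.
training_args = TrainingArguments(
    output_dir="./llama-squad-finetune",
    optim="paged_adamw_8bit",        # pages optimizer states out of GPU memory
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective batch = 4 * 4 = 16
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    fp16=True,
    gradient_checkpointing=True,
    logging_steps=10,
)
```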
v0.0.9 - Initial Training Attempt and CUDA Out of Memory (OOM) Errors
- First training attempt resulted in CUDA OOM errors.
- Identified causes related to batch size and learning rate settings.
- Made adjustments to the batch size and added a configuration to handle memory fragmentation (`PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`).
v0.1.0 - Preventing System Hibernation
- Prevented the system from entering sleep, suspend, or hibernation states to avoid interruptions during training.
```bash
sudo systemctl mask sleep.target suspend.target hibernate.target hybrid-sleep.target
```
- Added symlinks from `/etc/systemd/system/sleep.target`, `suspend.target`, `hibernate.target`, and `hybrid-sleep.target` to `/dev/null`.
v0.1.1 - Iteration Speed Improvements
- Per-iteration time went from 10 s/iteration to 15 s/iteration as larger effective batches were processed, improving overall throughput.
- Adjusted gradient checkpointing settings to reduce computation overhead.
- Achieved more stable training runs by reducing memory consumption through optimized batch sizes and accumulation steps.
v0.1.2 - Learning Rate Finder and Checkpoint Issues
- Initiated the learning rate search process to find an optimal rate for model convergence.
- Received warnings about incompatible settings (`use_cache=True`) with gradient checkpointing; resolved by switching to `use_cache=False`.
- Iteration speed during the learning rate search was approximately 5 s/iteration, which needed adjustments to reduce overall time.
- Ended with a `KeyboardInterrupt` after the 27th iteration, requiring further refinements in checkpointing.
v0.1.3 - Learning Rate Finder Success
- Successfully ran the learning rate finder without OOM errors.
- Estimated total time for the learning rate search was about 17 minutes, an acceptable duration.
- Achieved a significant improvement in stability, with consistent training progress without memory-related interruptions.
- Prepared for the main training loop using the identified learning rate.
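- The learning-rate-finder code is not reproduced in this log; the sketch below shows the general approach (ramping the learning rate exponentially over a fixed number of mini-batches while recording the loss), with all names and ranges being assumptions rather than the project's actual implementation:
```python
import math

def find_learning_rate(model, optimizer, dataloader, device,
                       min_lr=1e-7, max_lr=1.0, num_iterations=50):
    """Sweep the LR exponentially from min_lr to max_lr, recording (lr, loss) per step."""
    mult = (max_lr / min_lr) ** (1.0 / max(num_iterations - 1, 1))
    lr, history = min_lr, []
    model.train()
    data_iter = iter(dataloader)  # assumes the loader yields at least num_iterations batches
    for _ in range(num_iterations):
        batch = {k: v.to(device) for k, v in next(data_iter).items()}
        for group in optimizer.param_groups:
            group["lr"] = lr
        optimizer.zero_grad()
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        history.append((lr, loss.item()))
        if math.isnan(loss.item()):
            break  # stop once the loss diverges
        lr *= mult
    return history
```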
v0.1.4 - Taming the Llama: Training Begins
- Began fine-tuning LLaMA-3.2-1B on the SQuAD dataset.
- Set batch size to 12 and gradient accumulation steps to 3 for balanced training and memory efficiency.
- DataLoader workers were set to 4, contributing to efficient data handling during training.
- At 16.17 s/iteration, a full training cycle completed in around 4 hours.
- Addressed frequent tokenizer parallelism warnings by setting the following environment variable:
```python
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```
v0.1.5 - Training Results and Warnings Addressed
- Fine-tuned the LLaMA model over 834 iterations, successfully completing training.
- Observed loss reduction from 9.73 initially to 7.95 at the end of epoch 3.
- Learning rate started from 0.0099 and adjusted dynamically during training for optimal learning.
- Validation loss decreased consistently across regular evaluations, reaching 8.11.
- Encountered warnings related to quantization and checkpointing, resolved by modifying training parameters.
v0.1.6 - Evaluation Planning for LLaMA Model
- Developed an evaluation plan to assess the performance of the fine-tuned LLaMA-3.2-1B model on the SQuAD dataset.
- Key steps included:
- Loading the fine-tuned model and tokenizer using the Transformers library.
- Defining evaluation metrics, focusing on Exact Match (EM) and F1 Score.
- Preparing the validation set from the SQuAD dataset, loaded through DataLoader.
- Generating answers using the fine-tuned model to predict responses for questions in the validation set.
- Calculating EM and F1 scores for each prediction to evaluate performance.
- Looping through the entire validation set to derive average scores, providing insights into overall model efficacy.
- Saving evaluation results for future reference.
- Next steps involve running the evaluation process, interpreting results, and identifying areas for further improvement or fine-tuning.
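- For reference, a sketch of the standard SQuAD-style EM and F1 computation the plan refers to (lowercasing, punctuation/article stripping, then token overlap); the `evaluate` library's `squad` metric would be an equivalent off-the-shelf option:
```python
import re
import string
from collections import Counter

def normalize_answer(s):
    """Lowercase, strip punctuation/articles/extra whitespace (standard SQuAD normalization)."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction, ground_truth):
    return float(normalize_answer(prediction) == normalize_answer(ground_truth))

def f1_score(prediction, ground_truth):
    pred_tokens = normalize_answer(prediction).split()
    gt_tokens = normalize_answer(ground_truth).split()
    common = Counter(pred_tokens) & Counter(gt_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)
```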
v0.1.7 - Prevented System Hibernation (Note Added)
- Added a note that masking these targets persists across reboots; to restore the default power-saving behavior later, unmask them:
```bash
sudo systemctl unmask sleep.target suspend.target hibernate.target hybrid-sleep.target
```
v0.1.8 - Increased Training Iteration Speed, Adjusting for Duration
- Per-iteration time increased from 10 s/iteration to 15 s/iteration.
- However, the number of iterations also increased from 5000 to 8000, resulting in an estimated training duration of 32 hours.
- Searching for new ways to decrease overall training time.
- Attached are screenshots of progress for two different versions (try1 and try2) for comparison.
v0.1.9 - Learning Rate Search and Issues with Checkpointing
- Began the process of finding the optimal learning rate for the LLaMA model fine-tuning.
- Successfully loaded the SQuAD dataset and prepared it for training.
- GPU Memory updates:
- After loading the tokenizer: Allocated: 0.00 GB, Reserved: 0.00 GB.
- After model loading with quantization and adding LoRA configuration: Allocated: 2.03 GB, Reserved: 2.64 GB.
- Set up gradient checkpointing and started the learning rate search.
- Result: at roughly 5 s/iteration, the learning rate search was on track to take about 2 hours.
- Issues encountered:
- `use_cache=True` was found to be incompatible with gradient checkpointing; switched to `use_cache=False`.
- Received user warnings regarding deprecated parameters (the `use_reentrant` parameter) and quantization specifics.
- The learning rate search showed increasing learning rates with their corresponding loss values.
- Ended with a KeyboardInterrupt after the 27th iteration, requiring adjustments to proceed further.
- Planning to run a subset of the data to reduce the time to 5 odd minutes
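- As a minimal sketch (the subset size is an assumption), the subset run could be set up with the `datasets` API like this:
```python
# Assumes `tokenized_train` is the tokenized SQuAD training split prepared earlier.
# A small shuffled slice keeps the learning rate search representative but fast.
lr_search_subset = tokenized_train.shuffle(seed=42).select(range(500))
print(f"LR search subset size: {len(lr_search_subset)}")
```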
v0.1.10 - Learning Rate Finder Running Without OOM Error
- The learning rate finder is now running without encountering Out of Memory (OOM) errors, which is a significant improvement.
- Current iteration speed is approximately 2.49 seconds per iteration, with an estimated total time of 17 minutes to complete.
- Observations and potential next steps:
- Total time of 17 minutes is longer than the initial 10-minute target but still reasonable.
- Potential adjustments to reduce time further:
- Reduce `num_iterations` from 50 to 40 or 30.
- Increase `batch_size` if GPU memory allows.
- Decrease `accumulation_steps` if needed.
- Once the learning rate finder completes, the optimal learning rate will be used for the main training loop.
- Monitoring the following once the learning rate search completes:
- Optimal learning rate found.
- Initial training loss.
- Training speed in the main loop.
v0.1.11 - Increased Batch Size and Improved Training Time
- Improved training throughput (per-iteration time grew from 7 s -> 10 s -> 15.5 s -> 16.17 s/it as each step processed more samples) by adjusting several hyperparameters:
- Batch size increased to 12.
- Gradient accumulation steps set to 3.
- DataLoader workers set to 4.
- Achieved better GPU utilization without running into Out of Memory (OOM) errors.
v0.1.12 - Evaluation Plan Execution for LLaMA Model
- Initiated the evaluation process of the fine-tuned LLaMA-3.2-1B model on the SQuAD validation set.
- Key stages completed:
- Loaded the fine-tuned model and tokenizer from the saved directory.
- Defined Exact Match (EM) and F1 Score as evaluation metrics.
- Prepared the validation set using DataLoader, ensuring efficient data batching.
- Generated predictions for the validation set questions using a question-context input format.
- Calculated EM and F1 scores for each prediction.
- Evaluated the entire validation set to derive average EM and F1 scores.
- Saved evaluation results to a JSON file for future analysis.
- Next steps involve analyzing the results, interpreting the model’s performance against benchmarks, and making decisions on further training or adjustments.
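- A condensed sketch of the evaluation loop described above; the prompt format, generation settings, and output filename are assumptions, and `exact_match` / `f1_score` refer to the metric helpers sketched under v0.1.6:
```python
import json
import torch

def evaluate_squad(model, tokenizer, validation_set, device, max_new_tokens=32):
    """Generate an answer per question and average EM/F1 against the references."""
    model.eval()
    em_total, f1_total = 0.0, 0.0
    for example in validation_set:
        prompt = f"Context: {example['context']}\nQuestion: {example['question']}\nAnswer:"
        inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=1024).to(device)
        with torch.no_grad():
            output = model.generate(**inputs, max_new_tokens=max_new_tokens)
        # Decode only the newly generated tokens as the predicted answer.
        prediction = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
        reference = example["answers"]["text"][0]
        em_total += exact_match(prediction, reference)   # helper sketched under v0.1.6
        f1_total += f1_score(prediction, reference)      # helper sketched under v0.1.6
    n = len(validation_set)
    results = {"exact_match": em_total / n, "f1": f1_total / n, "num_examples": n}
    with open("evaluation_results.json", "w") as f:      # assumed output filename
        json.dump(results, f, indent=2)
    return results
```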
v0.1.13 - Final Training Summary and Future Plans
- Completed training and evaluation of LLaMA-3.2-1B on the SQuAD dataset.
- Achieved a final loss of 7.95 and a validation loss of 8.11.
- Recorded average F1 Score and Exact Match (EM) metrics from the evaluation process.
- Evaluation results indicated areas of strong performance and potential for further fine-tuning.
- Planned next steps:
- Analyze specific instances of lower performance to understand failure cases.
- Adjust hyperparameters based on evaluation insights.
- Consider additional rounds of fine-tuning or model architecture adjustments if needed.
- Start preparing for model deployment if evaluation results meet project benchmarks.