

NVIDIA LLM Full-Stack Solution: Usage and Optimization Best Practices

Agenda
• NVIDIA Full-Stack Solution for LLM
• Best Practices of NVIDIA Megatron-Core for LLM Training
• Best Practices of NVIDIA TensorRT-LLM for LLM Inference
• Best Practices of NVIDIA Triton Inference Server for LLM

NVIDIA Full-Stack Solution for LLM
• NVIDIA Megatron-Core (M-core) for LLM Training
• NVIDIA TensorRT-LLM for LLM Inference

Overview of NVIDIA's Large Language Model Offerings for Training
Solutions at Each Level of the...
• NeMo Framework: Easy to use OOTB FW with a large model...
• Megatron-LM: A lightweight framework reference for using...
• Megatron-Core: Library for GPU optimized techniques for LLM...
• Transformer Engine: Hopper accelerated Transformer models.

Why We Need NVIDIA Megatron-Core?

NVIDIA TensorRT-LLM
• FasterTransformer to leverage its optimized kernels for perfo...
• Other components for the customizations of LLM inference, such as CUTLASS

Key Features in NVIDIA TensorRT-LLM

What is NVIDIA Triton Inference Server?
Features of Triton Inference Server:

Best Practices of NVIDIA Megatron-Core for LLM Training

Best Practice for NVIDIA Megatron-Core
• Enable distributed optimizer to shard optimizer states.
• Utilize distributed optimizer to shard optimizer states simultaneously.
• Enable Transformer Engine (--transformer-impl transformer_engine)
• Enable Flash Attention (--use-flash-attn)
• Enable communication overlapping
• Enable kernel fusions

[Figure: Megatron-LM (M-LM) and Megatron-Core components. Labels: Training Loop, Models, Config/Spec (Customization), Transformer Block, Transformer Layer, Embeddings, Attention, Normalization, MLP, Activation Recompute, Sequence Parallelism, Pipeline Schedule and Communication, Distributed Checkpointing, Distributed Optimizer]

Best Practices of NVIDIA TensorRT-LLM for LLM Inference

How to Use NVIDIA TensorRT-LLM
• Ease the use efforts

    # Convert huggingface llama-7b model to trt-llm checkpoint
    # Optionally with tensor and/or pipeline parallelism, e.g., tp=2
    python examples/llama/convert_checkpoint.py \
        --model_dir llama-7b-hf \
        --dtype float16 \
        --tp_size 2 \
        --output_dir tllm_ckpt/llama-7b-fp16-tp2

    # Quantize huggingface llama-7b and export to trt-llm checkpoint
    # Optionally with tensor and/or pipeline parallelism, e.g., tp=2
    python examples/quantization/quantize.py \
        --model_dir llama-7b-hf \
        --dtype float16 \
        --qformat fp8 \
        --tp_size 2 \
        --output_dir tllm_ckpt/llama-7b-fp8-tp2

    # Build trt-llm engines from trt-llm checkpoint
    # Optionally enable/disable building options
    trtllm-build --checkpoint_dir tllm_ckpt/llama-7b-fp8-tp2 \
        --gemm_plugin float16 \
        --output_dir tllm_engines/llama-7b-fp8-tp2 \
        --workers 2

    # Run inference with the trt-llm engines
    mpirun -n 2 --allow-run-as-root python examples/run.py \
        --engine_dir tllm_engines/llama-7b-fp8-tp2 \
        --tokenizer_dir llama-7b-hf \
        --max_output_len 30 \
        --input_text "Born in north-east France, Soyer trained as a"

    # Example generated output
    Output [Text 0 Beam 0]: "chef in Paris and London before moving to New York in 1850. He was the first chef to be hired by the newly"

TensorRT-LLM checkpoint format
• One or more safetensors files storing rank weights
• Each file saves a dict mapping h...

    {
        'transformer.vocab_embedding.weight': torch.Tensor(...),
        'transformer.layers.0.attention.qkv.weight': torch.Tensor(...),
        'transformer.layers.0.attention.dense.weight': torch.Tensor(...),
        'transformer.layers.0.mlp.fc.weight': torch.Tensor(...),
        'j.weight': torch.Tensor(...),
        'lm_head.weight': torch.Tensor(...)
    }

Build Options
• In-flight batching is enabled by default with trtllm-build, which requires th...
• Custom AllReduce Plugin: recommended for NVLink-based nodes
• Embedding parallelism and sharing features: recommended, to improve throughput and reduce memory usage

Runtime Options
• gpt_model_type: use inflight_fused_batching to increase throughput and reduce latency
• batch_scheduler_policy: start with guaranteed_no_evict, then change to max_utilization for possibly higher...
• kv_cache_free_gpu_mem_fraction (default=0.9) is preferred over max_tokens_in_paged_kv_cache due to ease of use. They...
• enable_trt_overlap: set to false at first

Performance Best Practices: Quantization
• Weight-only Quantization: ...latency; get the scales from external libraries.
• Weight and Activation Quantization

Best Practices of NVIDIA Triton Inference Server for LLM

How to Use NVIDIA Triton Inference Server
• Option 2: Build via dockerfile (the dockerfile can be modified easily).

    # Update the submodules
    cd tensorrtllm_backend
    git lfs install
    git submodule update --init --recursive

    # Use the Dockerfile to build the backend in a container
    # For x86_64
    DOCKER_BUILDKIT=1 docker build -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend .
    # For aarch64
    DOCKER_BUILDKIT=1 docker build -t triton_trt_llm --build-arg TORCH_INSTALL_TYPE="src_non_cxx11_abi" -f dockerfile/Dockerfile.trt_llm_backend .

    # Prepare the TRT-LLM base image using the dockerfile from tensorrtllm_backend.
    cd tensorrtllm_backend
    # Specify the build args for the dockerfile.
    BASE_IMAGE=nvcr.io/nvidia/tritonserver:24.01-py3-min
    TRT_VERSION=9.2.0.5
    TRT_URL_x86=/downloads/compute/machine-learning/tensorrt/9.2.0/tensorrt-.linux.x86_64-gnu.cuda-12.2.tar.gz
    TRT_URL_ARM=/downloads/compute/machine-learning/tensorrt/9.2.0/tensorrt-.Ubuntu-22.04.aarch64-gnu.cuda-12.2.tar.gz
    docker build -t trtllm_base \
        --build-arg BASE_IMAGE="${BASE_IMAGE}" \
        --build-arg TRT_VER="${TRT_VERSION}" \
        --build-arg RELEASE_URL_TRT_x86="${TRT_URL_x86}" \
        --build-arg RELEASE_URL_TRT_ARM="${TRT_URL_ARM}" \
        -f dockerfile/Dockerfile.triton.trt_llm_backend .

    # Run the build script from Triton Server repo.
    # The flags for some features or endpoints can be removed if not needed.
    TRTLLM_BASE_IMAGE=trtllm_base
    cd server
    ./build.py -v --no-container-interactive --enable-logging --enable-stats --enable-tracing \
        --enable-metrics --enable-gpu-metrics --enable-cpu-metrics \
        --filesystem=gcs --filesystem=s3 --filesystem=azure_storage \
        --endpoint=http --endpoint=grpc --endpoint=sagemaker --endpoint=vertex-ai \
        --backend=ensemble --enable-gpu --endpoint=http --endpoint=grpc \
        --image=base,${TRTLLM_BASE_IMAGE} \
        --backend=tensorrtllm:${TENSORRTLLM_BACKEND_REPO_TAG} \
        --backend=python:${PYTHON_BACKEND_REPO_TAG}

    # Go to the tensorrt_llm/examples/llama directory
    cd tensorrt_llm/examples/llama
    # Convert the LLaMA model into tensorrt-llm checkpoint format.
    python convert_checkpoint.py --model_dir /path/to/llama-7b-hf \
        --output_dir ./tllm_checkpoint_1gpu_fp16 \
        --dtype float16
    # Build the LLaMA 7B model using a single GPU and FP16.
    trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_fp16 \
        --output_dir ./llama_model/fp16/1-gpu \
        --gemm_plugin float16 \
        --context_fmha enable \
        --max_beam_width 1 \
        --max_batch_size 8 \
        --max_input_len --gpt_attention_plugin float16 \
        d_kv_cache enable \
        --remove_input_padding enable

Models in the Triton model repository (descriptions are truncated in the source)
• ...connection of input and output tensors b...
• ...number of requests that must be sent to Triton.
• ...prompts (string) to input_ids (list of ints).
• ...for inference.
• ...from output_ids (list of ints) to outputs (string).
• ...tensorrt_llm and postprocessing models together. Also supports more f...

    # Enter Triton NGC container
    docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 \
        --gpus all -v /path/to/tensorrtllm_backend:/tensorrtllm_backend nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3 bash

    # Launch Triton server
    cd /tensorrtllm_backend
    # --world_size is the number of GPUs you want to use for serving
    python3 scripts/launch_triton_server.py --world_size=4 --model_repo=/tensorrtllm_backend/all_models/inflight_batcher_llm

    +--------------+---------+--------+
    | Model        | Version | Status |
    +--------------+---------+--------+
    | <model_name> | <v>     | READY  |
    | ..           | .       | ..     |
    +--------------+---------+--------+
    I0919 14:52:10.475738 293 grpc_server.cc:2451] Started GRPCInferenceService at :8001
    I0919 14:52:10.475968 293 http_server.cc:3558] Started HTTPService at :8000
    I0919 14:52:10.517138 293 http_server.cc:187] Started Metrics Service at :8002

    cd /tensorrtllm_backend
    # Use inflight_batcher_llm_client.py
    python3 inflight_batcher_llm/client/inflight_batcher_llm_client.py --request-output-len 200 \
        --tokenizer-dir /path/to/llama/tokenizer \
        --text "Bor
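
The runtime options in the deck prefer kv_cache_free_gpu_mem_fraction over max_tokens_in_paged_kv_cache. As a rough illustration of how the two relate, the sketch below converts a memory fraction into an approximate token budget. It is back-of-the-envelope arithmetic, not TensorRT-LLM code: the helper names are hypothetical, the 40 GiB of free memory is an assumed figure, and a real allocator also rounds to cache blocks and splits the cache across tensor-parallel ranks.

```python
# Hypothetical helpers, not part of TensorRT-LLM. The formula is the usual
# KV-cache footprint of a decoder-only Transformer with multi-head attention.

def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    """Bytes of KV cache one token occupies: one K and one V vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_tokens_for_fraction(free_gpu_bytes, fraction, bytes_per_token):
    """Approximate tokens that fit when `fraction` of free memory goes to the KV cache."""
    return int(free_gpu_bytes * fraction) // bytes_per_token


# Llama-7B shape: 32 layers, 32 attention heads, head_dim 128, FP16 cache.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128, dtype_bytes=2)

# Assumption for illustration: 40 GiB remain free after the engine is loaded.
free_bytes = 40 * 1024**3
tokens = max_tokens_for_fraction(free_bytes, 0.9, per_token)

print(f"KV cache per token: {per_token} bytes")  # 524288 bytes, i.e. 0.5 MiB
print(f"Approximate token budget at fraction 0.9: {tokens}")
```

Setting the fraction lets the budget follow whatever memory is actually free, which is one plausible reading of the "ease of use" remark; a fixed token count would have to be recomputed for every model and GPU.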

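The quantization slide mentions weight-only quantization and obtaining scales. A minimal sketch of the idea, assuming symmetric INT8 quantization with one scale per output channel: weights are stored as small integers plus a scale and dequantized before the matrix multiply, while activations stay in floating point. This is plain Python for illustration only; the function names are hypothetical and it does not reproduce TensorRT-LLM's kernels or its quantize.py workflow.

```python
# Illustrative only: symmetric per-channel INT8 weight-only quantization.

def quantize_per_channel(weight):
    """Return (int8-range rows, per-row scales) for a list-of-lists weight matrix."""
    q_rows, scales = [], []
    for row in weight:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        scales.append(scale)
        # Round to the nearest integer and clamp to the symmetric INT8 range.
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
    return q_rows, scales


def dequantize(q_rows, scales):
    """Recover approximate floating-point weights from integers and scales."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


def matvec(weight, x):
    """Plain matrix-vector product; the activation vector x is never quantized."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


weight = [[0.50, -1.27, 0.03],
          [0.002, 0.010, -0.004]]
x = [1.0, 2.0, 3.0]

q, scales = quantize_per_channel(weight)
w_hat = dequantize(q, scales)

reference = matvec(weight, x)
approx = matvec(w_hat, x)
print("quantized rows:", q)
print("max output error:", max(abs(a - b) for a, b in zip(reference, approx)))
```

Using one scale per row keeps the small-magnitude second row from being crushed by the large values in the first row, which is why per-channel scales are a common choice over a single per-tensor scale.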