NVIDIA LLM Full-Stack Solution: Usage and Optimization Best Practices

Agenda
- NVIDIA Full-Stack Solution for LLM
- Best Practices of NVIDIA Megatron-Core for LLM Training
- Best Practices of NVIDIA TensorRT-LLM for LLM Inference
- Best Practices of NVIDIA Triton Inference Server for LLM

NVIDIA Full-Stack Solution for LLM
- NVIDIA Megatron-Core (M-core) for LLM training
- NVIDIA TensorRT-LLM for LLM inference

Overview of NVIDIA's Large Language Model Offerings for Training
Solutions at each level of the stack:
- NeMo Framework: easy-to-use, out-of-the-box framework for large models
- Megatron-LM: a lightweight reference framework for using Megatron-Core
- Megatron-Core: a library of GPU-optimized techniques for LLM training
- Transformer Engine: Hopper-accelerated Transformer models

Why We Need NVIDIA Megatron-Core

NVIDIA TensorRT-LLM
- Builds on FasterTransformer, leveraging its optimized kernels for performance
- Other components for the customization of LLM inference, such as CUTLASS

Key Features in NVIDIA TensorRT-LLM

What is NVIDIA Triton Inference Server?
Features of Triton Inference Server

Best Practices of NVIDIA Megatron-Core for LLM Training

Best Practice for NVIDIA Megatron-Core
- Enable the distributed optimizer to shard optimizer states across data-parallel ranks.

Best Practice for NVIDIA Megatron-Core
- Enable Transformer Engine (--transformer-impl transformer_engine)
- Enable FlashAttention (--use-flash-attn)
- Enable communication overlapping
- Enable kernel fusions
A combined launch sketch is shown below.
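A minimal sketch of how these recommendations might be combined in a Megatron-LM pretraining launch. The flags shown are Megatron-LM's public arguments, but the script name, GPU count, and parallel sizes are illustrative assumptions, not settings from the deck:

    # Hypothetical single-node, 8-GPU launch; --use-distributed-optimizer shards
    # optimizer states, while --overlap-grad-reduce / --overlap-param-gather
    # overlap data-parallel communication with computation.
    torchrun --nproc_per_node 8 pretrain_gpt.py \
        --tensor-model-parallel-size 2 \
        --use-distributed-optimizer \
        --overlap-grad-reduce \
        --overlap-param-gather \
        --transformer-impl transformer_engine \
        --use-flash-attn \
        --bf16
    # ...plus the usual model, data, and training-schedule arguments.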
Best Practice for NVIDIA Megatron-Core
[Diagram: Megatron-LM supplies the training loop and model definitions; Megatron-Core supplies the GPU-optimized building blocks: embeddings, attention, normalization, MLP, transformer layer and transformer block, pipeline schedule and communication, distributed checkpointing, activation recompute, sequence parallelism, the distributed optimizer, and config/spec-based customization.]

Best Practices of NVIDIA TensorRT-LLM for LLM Inference

How to Use NVIDIA TensorRT-LLM
- Eases the effort of use: convert a checkpoint, build engines, then run.

How to Use NVIDIA TensorRT-LLM

    # Convert huggingface llama-7b model to trt-llm checkpoint
    # Optionally with tensor and/or pipeline parallelism, e.g., tp=2
    python examples/llama/convert_checkpoint.py \
        --model_dir llama-7b-hf \
        --dtype float16 \
        --tp_size 2 \
        --output_dir tllm_ckpt/llama-7b-fp16-tp2

    # Quantize huggingface llama-7b and export to trt-llm checkpoint
    # Optionally with tensor and/or pipeline parallelism, e.g., tp=2
    python examples/quantization/quantize.py \
        --model_dir llama-7b-hf \
        --dtype float16 \
        --qformat fp8 \
        --tp_size 2 \
        --output_dir tllm_ckpt/llama-7b-fp8-tp2

How to Use NVIDIA TensorRT-LLM

    # Build trt-llm engines from trt-llm checkpoint
    # Optionally enable/disable building options
    trtllm-build --checkpoint_dir tllm_ckpt/llama-7b-fp8-tp2 \
        --gemm_plugin float16 \
        --output_dir tllm_engines/llama-7b-fp8-tp2 \
        --workers 2

    # Run inference with the trt-llm engines
    mpirun -n 2 --allow-run-as-root python examples/run.py \
        --engine_dir tllm_engines/llama-7b-fp8-tp2 \
        --tokenizer_dir llama-7b-hf \
        --max_output_len 30 \
        --input_text "Born in north-east France, Soyer trained as a"

    # Example generated output
    Output [Text 0 Beam 0]: "chef in Paris and London before moving to New York in 1850. He was the first chef to be hired by the newly"

How to Use NVIDIA TensorRT-LLM
The TensorRT-LLM checkpoint format:
- One or more safetensors files storing the weights of each rank
- Each file saves a dict mapping tensor names to tensors:

    {'transformer.vocab_embedding.weight': torch.Tensor(...),
     'transformer.layers.0.attention.qkv.weight': torch.Tensor(...),
     'transformer.layers.0.attention.dense.weight': torch.Tensor(...),
     'transformer.layers.0.mlp.fc.weight': torch.Tensor(...),
     'transformer.layers.0.mlp.proj.weight': torch.Tensor(...),
     'lm_head.weight': torch.Tensor(...)}

How to Use NVIDIA TensorRT-LLM

Build Options
- In-flight batching is enabled by default with trtllm-build, which requires the GPT attention plugin, the paged KV cache, and removal of input padding.
- Custom AllReduce plugin: recommended to enable on NVLink-based nodes.
- Embedding parallelism and sharing: recommended to enable to improve throughput and reduce memory usage.

Runtime Options
- gpt_model_type: recommended to use inflight_fused_batching to increase throughput and reduce latency.
- batch_scheduler_policy: start with guaranteed_no_evict, then try max_utilization for possibly higher throughput.
- kv_cache_free_gpu_mem_fraction (default 0.9) is preferred over max_tokens_in_paged_kv_cache due to ease of use; both bound the GPU memory reserved for the paged KV cache.
- enable_trt_overlap: recommended to set false at first.
A sketch of how these options appear in the backend configuration follows.
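These runtime options are set as parameters in the tensorrt_llm model's config.pbtxt inside the Triton model repository. A minimal sketch of the relevant entries, assuming the tensorrtllm_backend template layout; the repository path and key spellings should be verified against your backend version:

    # Hypothetical excerpt from all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt
    parameters: {
      key: "gpt_model_type"
      value: { string_value: "inflight_fused_batching" }
    }
    parameters: {
      key: "batch_scheduler_policy"
      value: { string_value: "guaranteed_no_evict" }
    }
    parameters: {
      key: "kv_cache_free_gpu_mem_fraction"
      value: { string_value: "0.9" }
    }
    parameters: {
      key: "enable_trt_overlap"
      value: { string_value: "false" }
    }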
How to Use NVIDIA TensorRT-LLM

Performance Best Practices: Quantization
- Weight-only quantization: reduces latency; get the scales from external libraries.
- Weight and activation quantization.

Best Practices of NVIDIA Triton Inference Server for LLM

How to use NVIDIA Triton Inference Server
- Option 2: Build via dockerfile (the dockerfile can be modified easily).

    # Update the submodules
    cd tensorrtllm_backend
    git lfs install
    git submodule update --init --recursive

    # Use the Dockerfile to build the backend in a container
    # For x86_64
    DOCKER_BUILDKIT=1 docker build -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend .
    # For aarch64
    DOCKER_BUILDKIT=1 docker build -t triton_trt_llm \
        --build-arg TORCH_INSTALL_TYPE="src_non_cxx11_abi" \
        -f dockerfile/Dockerfile.trt_llm_backend .

How to use NVIDIA Triton Inference Server

    # Prepare the TRT-LLM base image using the dockerfile from tensorrtllm_backend.
    cd tensorrtllm_backend
    # Specify the build args for the dockerfile.
    BASE_IMAGE=nvcr.io/nvidia/tritonserver:24.01-py3-min
    TRT_VERSION=9.2.0.5
    TRT_URL_x86=/downloads/compute/machine-learning/tensorrt/9.2.0/tensorrt-.linux.x86_64-gnu.cuda-12.2.tar.gz
    TRT_URL_ARM=/downloads/compute/machine-learning/tensorrt/9.2.0/tensorrt-.Ubuntu-22.04.aarch64-gnu.cuda-12.2.tar.gz

    docker build -t trtllm_base \
        --build-arg BASE_IMAGE="${BASE_IMAGE}" \
        --build-arg TRT_VER="${TRT_VERSION}" \
        --build-arg RELEASE_URL_TRT_x86="${TRT_URL_x86}" \
        --build-arg RELEASE_URL_TRT_ARM="${TRT_URL_ARM}" \
        -f dockerfile/Dockerfile.triton.trt_llm_backend .

    # Run the build script from the Triton Server repo.
    # The flags for some features or endpoints can be removed if not needed.
    TRTLLM_BASE_IMAGE=trtllm_base
    cd server
    ./build.py -v --no-container-interactive --enable-logging --enable-stats --enable-tracing \
        --enable-metrics --enable-gpu-metrics --enable-cpu-metrics \
        --filesystem=gcs --filesystem=s3 --filesystem=azure_storage \
        --endpoint=http --endpoint=grpc --endpoint=sagemaker --endpoint=vertex-ai \
        --backend=ensemble --enable-gpu \
        --image=base,${TRTLLM_BASE_IMAGE} \
        --backend=tensorrtllm:${TENSORRTLLM_BACKEND_REPO_TAG} \
        --backend=python:${PYTHON_BACKEND_REPO_TAG}

How to use NVIDIA Triton Inference Server

    # Go to the tensorrt_llm/examples/llama directory
    cd tensorrt_llm/examples/llama

    # Convert the LLaMA model into tensorrt-llm checkpoint format.
    python convert_checkpoint.py --model_dir /path/to/llama-7b-hf \
        --output_dir ./tllm_checkpoint_1gpu_fp16 \
        --dtype float16

    # Build the LLaMA 7B model using a single GPU and FP16.
    trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_fp16 \
        --output_dir ./llama_model/fp16/1-gpu \
        --gemm_plugin float16 \
        --gpt_attention_plugin float16 \
        --context_fmha enable \
        --paged_kv_cache enable \
        --remove_input_padding enable \
        --max_beam_width 1 \
        --max_batch_size 8 \
        --max_input_len <max_input_len>

How to use NVIDIA Triton Inference Server
The model repository contains the following models:
- preprocessing: converts prompts (string) to input_ids (list of ints).
- tensorrt_llm: runs the TensorRT-LLM engine for inference.
- postprocessing: converts output_ids (list of ints) back to outputs (string).
- ensemble: specifies the connection of input and output tensors between models, chaining the preprocessing, tensorrt_llm and postprocessing models together and reducing the number of requests that must be sent to Triton.

How to use NVIDIA Triton Inference Server

    # Enter the Triton NGC container
    docker run --rm -it --net host --shm-size=2g \
        --ulimit memlock=-1 --ulimit stack=67108864 \
        --gpus all -v /path/to/tensorrtllm_backend:/tensorrtllm_backend \
        nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3 bash

    # Launch the Triton server
    cd /tensorrtllm_backend
    # --world_size is the number of GPUs you want to use for serving
    python3 scripts/launch_triton_server.py --world_size=4 \
        --model_repo=/tensorrtllm_backend/all_models/inflight_batcher_llm

When the server is ready, it prints the model table and service endpoints:

    +--------------+---------+--------+
    | Model        | Version | Status |
    +--------------+---------+--------+
    | <model_name> | <v>     | READY  |
    | ...          | ...     | ...    |
    +--------------+---------+--------+
    I0919 14:52:10.475738 293 grpc_server.cc:2451] Started GRPCInferenceService at 0.0.0.0:8001
    I0919 14:52:10.475968 293 http_server.cc:3558] Started HTTPService at 0.0.0.0:8000
    I0919 14:52:10.517138 293 http_server.cc:187] Started Metrics Service at 0.0.0.0:8002

How to use NVIDIA Triton Inference Server

    cd /tensorrtllm_backend
    # Use inflight_batcher_llm_client.py to send requests to the server
    python3 inflight_batcher_llm/client/inflight_batcher_llm_client.py --request-output-len 200 \
        --tokenizer-dir /path/to/llama/tokenizer \
        --text "Born in north-east France, Soyer trained as a"
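For a quick check without the Python client, the ensemble model can also be queried over plain HTTP. A minimal sketch, assuming a Triton build with the generate endpoint enabled and the default tensorrtllm_backend ensemble tensor names; text_input, max_tokens, bad_words, and stop_words are assumptions from the backend templates, not from the deck:

    # Hypothetical smoke test against the HTTP endpoint on port 8000
    curl -X POST localhost:8000/v2/models/ensemble/generate -d \
        '{"text_input": "Born in north-east France, Soyer trained as a", "max_tokens": 30, "bad_words": "", "stop_words": ""}'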