This project extends the `vllm serve` functionality to support data parallelism (DP) for serving large language models (LLMs). It lets you serve a model across multiple GPUs with data parallelism, enabling efficient scaling for high-throughput inference.
## Features
- Data Parallelism (DP): Distribute model inference across multiple GPUs.
- OpenAI-Compatible API: Fully compatible with the OpenAI API specification.
- Multi-GPU Support: Utilize multiple GPUs by setting `CUDA_VISIBLE_DEVICES`.
- Easy Integration: Inherits all parameters from `vllm serve` for seamless compatibility.
## Usage
### 1. Start the Server
To install llmkit_data, run:

```bash
git clone http://62.234.201.16/llm-kit/Data.git
pip install -e Data
```
To start the server, use the following commands:

```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3  # Set visible GPUs
python -m llmkit_data.cli.serve --config {config_path} --log-dir {log_directory}
```

An example config file can be found at http://62.234.201.16/llm-kit/Data/blob/main/examples/config.yaml.
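Once the server is running, a quick way to check that it is reachable is to list the served models with the OpenAI Python client. This is only a sanity-check sketch; the base URL and API key below match the client example later in this README, so adjust them to your deployment:

```python
from openai import OpenAI

# Quick sanity check: list the models exposed by the server.
client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumes the default local server URL
    api_key="token-abc123",               # same key configured on the server
)
print([model.id for model in client.models.list()])
```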
#### Parameters
- `/path/to/model`: Path to the model to be served.
- `--dtype auto`: Automatically infer the data type for the model.
- `--api-key token-abc123`: API key for authentication.
- `-dp 4`: Number of data-parallel workers. This must match the total number of GPUs required, calculated as: Total GPUs = Tensor Parallel Size (tp) * Pipeline Parallel Size (pp) * Data Parallel Size (dp). For example, if `tp=1`, `pp=1`, and `dp=4`, then `CUDA_VISIBLE_DEVICES` must include 4 GPUs (see the sketch after this list).
- Consult the vLLM documentation on the OpenAI-Compatible Server for information regarding other parameters: https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html.
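As a rough illustration of the formula above (not part of llmkit_data), the snippet below computes the required GPU count from `tp`, `pp`, and `dp` and compares it with what `CUDA_VISIBLE_DEVICES` exposes:

```python
import os

# Values you intend to pass to the server; adjust as needed.
tp, pp, dp = 1, 1, 4
required_gpus = tp * pp * dp  # Total GPUs = tp * pp * dp

# Count the GPUs exposed through CUDA_VISIBLE_DEVICES (e.g. "0,1,2,3" -> 4).
visible = os.environ.get("CUDA_VISIBLE_DEVICES", "")
available_gpus = len([d for d in visible.split(",") if d.strip()])

assert available_gpus == required_gpus, (
    f"CUDA_VISIBLE_DEVICES exposes {available_gpus} GPU(s), "
    f"but tp * pp * dp = {required_gpus}"
)
```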
#### Notes
- `CUDA_VISIBLE_DEVICES`: Must be set manually if not using Slurm. This ensures that the server only uses the specified GPUs and avoids conflicts with other users.
- Slurm: If running in a Slurm environment, Slurm will automatically set `CUDA_VISIBLE_DEVICES`.
### 2. Client Script
Use the following Python script to interact with the server:
```python
from openai import OpenAI
from concurrent.futures import ThreadPoolExecutor, as_completed

# Initialize the OpenAI client
client = OpenAI(
    base_url="http://localhost:8000/v1",  # Server URL
    api_key="token-abc123",               # API key
)

# Function to make a single API call
def generate_completion(messages):
    completion = client.chat.completions.create(
        model="/path/to/model",  # Model path (same as server)
        messages=messages,
    )
    return completion.choices[0].message

# List of messages to process
messages_list = [
    [{"role": "user", "content": "Hello!"}],
    [{"role": "user", "content": "Explain the concept of multithreading."}],
    [{"role": "user", "content": "What is the capital of France?"}],
    [{"role": "user", "content": "Tell me a joke."}],
]

# Function to make multithreaded API calls
def multithread_openai_calls(messages_list):
    results = []
    with ThreadPoolExecutor(max_workers=5) as executor:  # Adjust max_workers as needed
        # Submit tasks to the executor
        future_to_messages = {
            executor.submit(generate_completion, messages): messages
            for messages in messages_list
        }
        # Process results as they complete
        for future in as_completed(future_to_messages):
            messages = future_to_messages[future]
            try:
                result = future.result()
                results.append((messages, result))
            except Exception as e:
                print(f"Error processing messages '{messages}': {e}")
                results.append((messages, None))
    return results

# Run the multithreaded OpenAI calls
results = multithread_openai_calls(messages_list)

# Print the results
for messages, result in results:
    print(f"Messages: {messages}")
    print(f"Completion: {result}")
    print("-" * 50)
```
## Design
### Data Parallelism (DP)
The `llmkit_data.cli.serve` module extends `vllm serve` to support data parallelism. This allows the model to be served across multiple GPUs, with each GPU handling a portion of the incoming requests. Key design points:

- Worker Processes: Each GPU runs a separate worker process, managed by the main server.
- Load Balancing: Requests are distributed evenly across workers (see the sketch after this list).
- Compatibility: All parameters from `vllm serve` are inherited, ensuring compatibility with existing workflows.
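The actual dispatch strategy is internal to llmkit_data; the minimal sketch below only pictures "even distribution" as a round-robin over hypothetical worker endpoints, so both the URLs and the round-robin choice are assumptions for illustration:

```python
import itertools

# Hypothetical worker endpoints, one per data-parallel worker.
worker_urls = [f"http://localhost:{8001 + i}/v1" for i in range(4)]
next_worker = itertools.cycle(worker_urls)

def route(request):
    """Pick the next worker in round-robin order for an incoming request."""
    return next(next_worker), request

# Four requests land on four different workers, then the cycle repeats.
for i in range(5):
    url, _ = route({"id": i})
    print(i, "->", url)
```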
### GPU Management
- Topology-Based Grouping: The server uses `nvidia-smi topo -m` to retrieve the GPU topology matrix and groups GPUs to minimize communication costs within each group.
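The grouping algorithm itself is not documented here. The sketch below only illustrates the general idea: parse the `nvidia-smi topo -m` matrix, rank link types by cost, and greedily group GPUs that talk to each other cheaply. The link-cost ranking and the greedy strategy are assumptions, not llmkit_data's actual implementation:

```python
import subprocess

# Rough cost ranking of link types from the `nvidia-smi topo -m` legend
# (cheapest first). This ranking is an assumption made for the sketch.
LINK_COST = {"X": 0, "NV": 1, "PIX": 2, "PXB": 3, "PHB": 4, "NODE": 5, "SYS": 6}

def read_topology():
    """Parse `nvidia-smi topo -m` into (num_gpus, {(i, j): link_label})."""
    out = subprocess.run(["nvidia-smi", "topo", "-m"],
                         capture_output=True, text=True, check=True).stdout
    rows = [line.split() for line in out.splitlines() if line.startswith("GPU")]
    n = len(rows)
    links = {}
    for i, row in enumerate(rows):
        for j, label in enumerate(row[1:1 + n]):  # GPU-to-GPU columns only
            links[(i, j)] = label
    return n, links

def link_cost(label):
    """Map a link label such as 'NV2' or 'SYS' to its assumed cost."""
    for prefix, cost in LINK_COST.items():
        if label.startswith(prefix):
            return cost
    return max(LINK_COST.values())

def group_gpus(n, links, group_size):
    """Greedy grouping: each group starts from the lowest free GPU index and
    pulls in the GPUs it can reach most cheaply."""
    unassigned = set(range(n))
    groups = []
    while unassigned:
        seed = min(unassigned)
        unassigned.remove(seed)
        partners = sorted(unassigned, key=lambda g: link_cost(links[(seed, g)]))
        group = sorted([seed] + partners[:group_size - 1])
        unassigned -= set(group)
        groups.append(group)
    return groups

if __name__ == "__main__":
    n, links = read_topology()
    print(group_gpus(n, links, group_size=2))  # e.g. groups of 2 GPUs
```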
## Requirements
- `vllm` (see the vLLM documentation)