Paid tool

Together AI

AI Acceleration Cloud for fast open-source model inference, fine-tuning, and training.

Visittogether.ai
Intro

What is Together AI?

Together AI is a comprehensive AI Acceleration Cloud designed for fast inference, fine-tuning, and training of open-source generative AI models. It features the Together Inference Engine, which leverages custom FP8 kernels and speculative decoding to deliver 4x faster performance than standard setups. The platform serves as a powerful alternative to native setups like RunPod or closed ecosystems like the Gemini API, providing developers with OpenAI-compatible APIs to deploy leading language, vision, code, and image models. Through its infrastructure, users can seamlessly access high-throughput models like Qwen or deploy a fast DeepSeek API endpoint for advanced reasoning tasks, alongside specialized tools like Llama Coder setups for autonomous development.

Together AI at a glance
Free endpoints available; Paid inference from $0.008/1M tokens; GPU Clusters from $1.30/hr756K monthly visitsPaid access
Pricing

Together AI Pricing Plans

Compare Together AI free options, Together AI paid pricing plans, and usage notes before you choose the best way to use this AI tool in 2026.

Free endpoints available; Paid inference from $0.008/1M tokens; GPU Clusters from $1.30/hr

$0.27 / 1M input tokens, $0.85 / 1M output tokens

Per-token pricing for the 128-expert MoE powerhouse model

$0.18 / 1M input tokens, $0.59 / 1M output tokens

Per-token pricing for the 109B parameter model optimized for multi-document analysis

$1.25 / 1M tokens

Flat rate per 1 million tokens for the open Mixture-of-Experts model

$3.00 / 1M input tokens, $7.00 / 1M output tokens

Per-token pricing for the state-of-the-art reasoning model

$0.55 / 1M input tokens, $2.19 / 1M output tokens

Optimized, high-throughput variant of DeepSeek-R1

$0.54 (Lite), $0.88 (Turbo), $0.90 (Reference) / 1M tokens

Tiered performance options based on full precision, optimization, or lowest cost

$0.20 / 1M input tokens, $0.60 / 1M output tokens

Hybrid instruct + reasoning MoE model optimized for throughput

$0.08 / Megapixel image

In-context image generation and editing endpoint yielding roughly 12.5 images per $1

Free

Free serverless endpoint for the fastest state-of-the-art image generation model

Free

Free serverless endpoint to experiment with distilled reasoning model capabilities

$0.025 / minute ($1.49 / hour)

Customizable GPU endpoints billed per minute for deploying standard hardware instances

$0.056 / minute ($3.36 / hour)

Dedicated single-tenant Hopper GPU deployment for demanding inference workloads

$0.083 / minute ($4.99 / hour)

High-memory dedicated GPU endpoint for large scale deployment

$0.48 / 1M tokens (LoRA), $0.54 / 1M tokens (Full FT)

Price per million tokens processed in the training dataset multiplied by the number of epochs

$2.90 / 1M tokens (LoRA), $3.20 / 1M tokens (Full FT)

Fine-tuning rates for large-scale language model weights

Starting at $1.75 / hour

Reserved training clusters with 80GB HBM2e memory and high-speed InfiniBand networking

Starting at $2.09 / hour

Reserved training clusters with 141GB HBM3e memory

Contact Sales

Next-generation training infrastructure clusters featuring 384GB or 192GB memory options

$0.0446 / hour per vCPU + $0.0149 / hour per GiB RAM

Custom VM sandbox environments for large automated AI development pipelines

$0.03 / session

Per 60-minute session execution cost for processing LLM-generated code

Pricing updated:Jun 11, 2026

Features

Together AI AI Features

Serverless Inference API supporting over 100 open-source chat, vision, language, code, image, and embedding modelsDedicated Endpoints for deploying customized models on single-tenant NVIDIA hardware with per-minute billingSupervised Fine-Tuning and DPO (Direct Preference Optimization) for both LoRA and full fine-tuning with full model ownershipTogether GPU Clusters powered by NVIDIA Blackwell (GB200, B200) and Hopper (H200, H100) architecturesTogether Code Sandbox and Code Interpreter for executing LLM-generated code and building AI development environmentsFlashAttention-3 and custom software stack optimizations achieving up to 75% GPU utilization on H100s
Pros & Cons

Together AI Pros and Cons

Pros

  • Extremely fast inference speeds, running models like Llama-3 8B at up to 400 tokens/second
  • Cost-efficient pricing, offering up to 11x lower costs compared to certain closed models like GPT-4o
  • No vendor lock-in, granting complete model ownership and enterprise VPC deployment flexibility
  • Introductory 50% discount on input and output tokens for batch inference workloads
  • High-speed interconnects (InfiniBand and NVLink) with a 99.9% uptime SLA for dedicated clusters

Limitations

  • Dedicated GPU endpoints require complex hardware orchestration configuration
  • Fine-tuning pricing scales dynamically based on the multiple of dataset size, evaluations, and epochs, requiring careful cost tracking

Together AI FAQ

Together AI accelerates open-source model performance using research-driven innovations. This includes transformer-optimized CUDA kernels that are over 75% faster than base PyTorch, quality-preserving quantization techniques like QTIP, and algorithmic speculative decoding using draft models trained on the RedPajama dataset.