What happened

A comparison of seven GPU/LLM providers, compiled from public pricing data into a single spreadsheet, has revealed some notable patterns, particularly around caching costs. The comparison covers input and output token pricing, context windows, and each provider's caching policy.

Why this matters

The most striking finding was how drastically caching can alter costs. A cache hit can be far cheaper than a cache miss, in some cases tens of times cheaper. This matters for projects that rely on large system prompts, reusable context in retrieval-augmented generation (RAG) pipelines, or multi-turn conversations, because all of these resend the same tokens on every request. In such cases, the headline token price may matter less than the provider's caching policy.
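To make the arithmetic concrete, here is a minimal sketch of how a blended input cost might be estimated from a cache hit rate. The function name `effective_input_cost` and all prices are hypothetical placeholders, not figures from the spreadsheet or from any real provider; the sketch also assumes a simple per-token split between cached and uncached input and ignores cache-write surcharges or storage fees that some providers may apply.

```python
def effective_input_cost(tokens, miss_price, hit_price, hit_rate):
    """Estimate input cost in dollars.

    tokens     -- total input tokens sent
    miss_price -- price per million uncached input tokens (hypothetical)
    hit_price  -- price per million cached input tokens (hypothetical)
    hit_rate   -- fraction of input tokens served from cache, 0.0 to 1.0
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    hit_tokens = tokens * hit_rate
    miss_tokens = tokens - hit_tokens
    return (hit_tokens * hit_price + miss_tokens * miss_price) / 1_000_000


# Two made-up providers: A has the higher headline price but a 10x cache
# discount; B is cheaper per token but offers no cache discount at all.
TOKENS = 100_000_000
provider_a = effective_input_cost(TOKENS, miss_price=1.00, hit_price=0.10, hit_rate=0.9)
provider_b = effective_input_cost(TOKENS, miss_price=0.80, hit_price=0.80, hit_rate=0.9)

print(f"Provider A: ${provider_a:.2f}")
print(f"Provider B: ${provider_b:.2f}")
```

Under these assumed numbers, the provider with the higher headline price ends up costing roughly a quarter as much once 90% of input tokens are cache hits. With a hit rate of zero the ordering flips, which is why the expected hit rate of a workload needs to be estimated before comparing providers.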

Context

Historically, comparing pricing for machine learning models has been difficult because the information is not centralized. Caching is often a hidden cost factor that can significantly change the overall expense of using LLMs. As demand for AI applications grows, understanding these pricing details becomes increasingly important for businesses trying to control costs.

What this means

The findings suggest that anyone evaluating LLM providers should weigh caching policies alongside token pricing. The comparison also highlights inconsistencies in model availability and context windows across providers, which complicates decision-making. Other metrics remain hard to compare, such as real throughput, cold-start times, and network costs. As this area evolves, comprehensive data will be crucial for choosing the right provider.