LLM call caching
Prompt caching allows you to save costs and significantly speed up LLM responses for the most common prompts/questions.
Semantic caching
PromptWatch uses semantic caching, which means that the prompt is compared to previously seen prompts by semantic (cosine) similarity. If the prompt matches a previously cached prompt within a predefined similarity threshold (by default >0.97 ... which can be interpreted as over 97% similar), the cached response will be reused.
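To illustrate the idea, here is a minimal sketch of a semantic cache lookup. This is not PromptWatch's actual implementation; the embed and llm callables and the cache list are hypothetical placeholders.

import numpy as np

def cosine_similarity(a, b) -> float:
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cached_completion(prompt, cache, embed, llm, similarity_limit=0.97):
    """Return a cached response if a previous prompt is similar enough, otherwise call the LLM."""
    query_embedding = embed(prompt)
    for cached_embedding, cached_response in cache:
        if cosine_similarity(query_embedding, cached_embedding) >= similarity_limit:
            return cached_response  # reuse the previous response
    response = llm(prompt)
    cache.append((query_embedding, response))  # store for future lookups
    return response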
Caching options
When using the cache you have these configuration options available (see the sketch after this list):

cache_namespace_key: Optional[str]
- cache partition key, to create a separate cache storage for this LLM ... to ensure that cached responses won't get mixed up between different LLMs/Chains

cache_embeddings: Optional[Embeddings]
- LangChain embeddings object used for the semantic search ... if None, OpenAI's ada-002 model will be used

token_limit: int = None
- token encoding window of the model used for the embeddings ... if the prompt goes above this window, the cache won't be used, to ensure accurate predictions

similarity_limit: float = 0.97
- minimum required similarity (cosine similarity = 1 - distance)
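The sketch below shows how these options might be passed when constructing CachedLLM (introduced in the Usage section below). The keyword-argument form is an assumption, so check the PromptWatch API reference for the exact signature.

from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from promptwatch.langchain import CachedLLM

llm = OpenAI(temperature=0)
cached_llm = CachedLLM(
    llm,
    cache_namespace_key="product-qa",      # separate cache partition for this chain (hypothetical key)
    cache_embeddings=OpenAIEmbeddings(),   # defaults to OpenAI's ada-002 if None
    token_limit=2048,                      # skip the cache for prompts beyond this window
    similarity_limit=0.97,                 # minimum cosine similarity required to reuse a response
)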
Usage
For completion LLMs
from langchain.llms import OpenAI
from promptwatch.langchain import CachedLLM

llm = OpenAI(temperature=0)
cached_llm = CachedLLM(llm)
# now you can add your LLM into chain as you would the original
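For example, here is a minimal sketch of plugging the cached LLM into a standard LLMChain (the prompt template is just an illustration):

from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

# the cached LLM goes exactly where the original OpenAI LLM would
prompt = PromptTemplate(
    input_variables=["question"],
    template="Answer the following question:\n{question}",
)
chain = LLMChain(llm=cached_llm, prompt=prompt)
answer = chain.run(question="What is prompt caching?")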
For chat LLMs
from langchain.chat_models import ChatOpenAI
from promptwatch.langchain import CachedChatLLM

chat_llm = ChatOpenAI(temperature=0)
cached_chat_llm = CachedChatLLM(chat_llm)
# now you can add your LLM into chain as you would the original
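As a quick sanity check, the cached chat model can be invoked just like ChatOpenAI itself (a sketch, assuming the drop-in behaviour described in the comment above):

from langchain.schema import HumanMessage

# invoked exactly like the original ChatOpenAI
response = cached_chat_llm([HumanMessage(content="What is prompt caching?")])
print(response.content)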
How and when to use caching
In general, it is not a great idea to use a cached LLM for the whole chain. Semantic similarity can be tricky ... since two prompts that are very similar can still require different responses.
For example:

What is the distance from Earth to the Sun?
What is the distance from Earth to the Moon?

are very similar (cosine similarity ~0.99), but the answers are quite different.
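You can check this yourself; the sketch below compares the two prompts using LangChain's OpenAIEmbeddings (the exact value depends on the embedding model, but it typically lands above the default 0.97 limit):

import numpy as np
from langchain.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
a, b = embeddings.embed_documents([
    "What is the distance from Earth to the Sun?",
    "What is the distance from Earth to the Moon?",
])
similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(similarity)  # very similar prompts, yet the correct answers differ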
Ideal use cases for LLM caching are subchains that produce a larger amount of text and have another LLMChain processing that output. A good example is a setup with a Critic: the LLM's criticism for one scenario can be relevant for another scenario, and there is still another LLMChain running after the cache that has a chance to correct potential problems.
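Here is a sketch of such a setup: the Critic chain is cached, while the follow-up chain is not, so it can correct a criticism that was reused from a slightly different scenario. The prompt templates and the cache_namespace_key value are hypothetical.

from langchain.chains import LLMChain
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from promptwatch.langchain import CachedLLM

llm = OpenAI(temperature=0)

# the Critic's output is cached: similar scenarios can reuse the same criticism
critic_chain = LLMChain(
    llm=CachedLLM(llm, cache_namespace_key="critic"),
    prompt=PromptTemplate(
        input_variables=["scenario"],
        template="List the weaknesses of the following plan:\n{scenario}",
    ),
)

# the follow-up chain is NOT cached, so it can correct a slightly-off criticism
revise_chain = LLMChain(
    llm=llm,
    prompt=PromptTemplate(
        input_variables=["scenario", "criticism"],
        template="Improve the plan below based on the criticism.\nPlan:\n{scenario}\nCriticism:\n{criticism}",
    ),
)

scenario = "Launch the new feature without a beta test."
criticism = critic_chain.run(scenario=scenario)
improved = revise_chain.run(scenario=scenario, criticism=criticism)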