The general strategy for summarization is to combine a list of recent messages, included verbatim, with a compressed LLM-generated summary of the older messages that are not included. Implementing this correctly means producing a context that is:
- Exhaustive: the combination of recent messages and summary should cover the entire conversation
- Dynamically sized: the tokens allotted to both the summary and the recent messages should be adjustable based on desired token usage
- Performant: while generating the summary with an LLM necessarily introduces latency, this should never add latency to an arbitrary end-user request
## Creating Summaries
Honcho already has an asynchronous task queue for the purpose of deriving facts from messages. This is the ideal place to create summaries, where they won't add latency to a message. Currently, Honcho has two configurable summary types:

- Short summaries: by default, enqueued every 20 messages and given a token limit of 1000
- Long summaries: by default, enqueued every 60 messages and given a token limit of 4000
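To make the cadence concrete, here is a minimal sketch of how a queue worker might decide when to enqueue summary tasks. The names and structure are hypothetical, not Honcho's actual internals; only the default intervals and token limits come from the configuration above.

```python
# Hypothetical sketch of the enqueue cadence described above;
# not Honcho's actual internals.
SHORT_SUMMARY_EVERY = 20     # default: short summary every 20 messages
LONG_SUMMARY_EVERY = 60      # default: long summary every 60 messages
SHORT_SUMMARY_TOKENS = 1000  # default short-summary token limit
LONG_SUMMARY_TOKENS = 4000   # default long-summary token limit

def maybe_enqueue_summaries(message_count: int, enqueue) -> None:
    """Run by the async worker after each message; never on the request path."""
    if message_count % SHORT_SUMMARY_EVERY == 0:
        enqueue(kind="short", token_limit=SHORT_SUMMARY_TOKENS)
    if message_count % LONG_SUMMARY_EVERY == 0:
        enqueue(kind="long", token_limit=LONG_SUMMARY_TOKENS)
```

Because this runs on the task queue rather than in the request handler, an end-user message is never blocked waiting for an LLM summarization call.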
## Retrieving Summaries
Summaries are retrieved from the session by the `get_context` method. This method has two parameters:

- `summary`: a boolean indicating whether to include the summary in the return type. The default is true.
- `tokens`: an integer indicating the maximum number of tokens to use for the context. If not provided, `get_context` will retrieve as many tokens as are required to create exhaustive conversation coverage.
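For example, using the Python SDK (the client and session setup shown here is illustrative; `summary` and `tokens` are the two documented parameters):

```python
from honcho import Honcho  # assumes the Honcho Python SDK is installed

honcho = Honcho()  # connection/workspace configuration omitted
session = honcho.session("support-thread-1")  # hypothetical session id

# Exhaustive coverage: a summary of older messages plus recent messages.
context = session.get_context(summary=True)

# Bounded: summary + messages together capped at roughly 3,000 tokens.
small_context = session.get_context(summary=True, tokens=3000)
```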
Be aware of how `get_context` degrades when the token limit is tight:

- If the last message alone contains more tokens than the context token limit, neither a summary nor a message list is possible: both will be empty.
- If the last few messages contain more tokens than the context token limit, no summary is possible: the context will only contain the last 1 or 2 messages that fit in the token limit.
- If the summaries contain more tokens than the context token limit, no summary is possible: the context will only contain the X most recent messages that fit in the token limit. Note that while summaries will often be smaller than their token limits, avoiding this scenario means passing a higher token limit than the Honcho-configured summary size(s). For this reason, the default token limit for `get_context` is a few times larger than the configured long summary size.
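Continuing the earlier sketch, a simple way to stay clear of these fallback cases is to budget a multiple of the configured long-summary size (the multiplier here is an assumption; pick one that fits your application):

```python
LONG_SUMMARY_TOKENS = 4000  # Honcho's default long-summary limit (see above)

# Budgeting a few multiples of the long-summary limit leaves room for the
# summary plus a healthy window of recent messages, avoiding the
# messages-only fallback described above.
context = session.get_context(summary=True, tokens=4 * LONG_SUMMARY_TOKENS)
```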
To retrieve only recent messages, set `summary` to false and `tokens` to some multiple of your desired message count. Note that context messages are not paginated, so there's a hard limit on the number of messages that can be retrieved (currently 100,000 tokens).
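For instance, to fetch roughly the last 25 messages without a summary (the tokens-per-message estimate is an assumption; tune it to your data):

```python
desired_messages = 25
avg_tokens_per_message = 100  # assumption: adjust to your message sizes

recent_only = session.get_context(
    summary=False,
    tokens=desired_messages * avg_tokens_per_message,
)
```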
As a final note, remember that summaries are generated asynchronously and therefore may not be available immediately. If you batch-save a large number of messages, assume that summaries will not be available until those messages are processed, which can take seconds to minutes depending on the number of messages and the configured LLM provider. Exhaustive `get_context` calls performed during this time will likely just return the messages in the session.
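As a rough illustration, continuing the sketch above (the batch-save call shown is hypothetical; check the SDK docs for the exact method):

```python
messages = [{"role": "user", "content": f"message {i}"} for i in range(500)]

# Batch-save a large backlog; Honcho's deriver queue processes it asynchronously.
session.add_messages(messages)  # hypothetical batch-save call; see SDK docs

# Called immediately afterward, this will likely return only the raw messages,
# with no summary, until the queue catches up (seconds to minutes).
context = session.get_context(summary=True)
```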