Streaming Responses with Honcho

When working with AI-generated content, streaming the response as it's generated can significantly improve the user experience. Honcho's SDKs provide streaming support so your application can display content as it arrives, rather than waiting for the complete response.

When to Use Streaming

Streaming is particularly useful for:

  • Real-time chat interfaces
  • Long-form content generation
  • Applications where perceived speed is important
  • Interactive agent experiences
  • Reducing time-to-first-token in user interactions

Streaming with the Dialectic Endpoint

One of the primary use cases for streaming in Honcho is the Dialectic endpoint, which lets you stream the AI's reasoning about a user in real time.

Prerequisites

from honcho import Honcho

honcho = Honcho()

# Create or get an existing App
app = honcho.apps.get_or_create(name="demo-app")

# Create or get user
user = honcho.apps.users.get_or_create(app_id=app.id, name="demo-user")

# Create a new session
session = honcho.apps.users.sessions.create(app_id=app.id, user_id=user.id)

# Store some messages for context (optional)
honcho.apps.users.sessions.messages.create(
    app_id=app.id,
    user_id=user.id,
    session_id=session.id,
    content="Hello, I'm testing the streaming functionality",
    is_user=True
)

Streaming from the Dialectic Endpoint

import time

# Basic streaming example
with honcho.apps.users.sessions.with_streaming_response.stream(
    app_id=app.id,
    user_id=user.id,
    session_id=session.id,
    queries="What can you tell me about this user?",
) as response:
    for chunk in response.iter_text():
        print(chunk, end="", flush=True)  # Print each chunk as it arrives
        time.sleep(0.01)  # Optional delay for demonstration

Working with Streaming Data

When working with streaming responses, consider these patterns:

  1. Progressive Rendering - Update your UI as chunks arrive instead of waiting for the full response
  2. Buffered Processing - Accumulate chunks until a logical break (like a sentence or paragraph); a sketch of this pattern, combined with error handling, follows this list
  3. Token Counting - Monitor token usage in real-time for applications with token limits
  4. Error Handling - Implement appropriate error handling for interrupted streams
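As a minimal sketch of the buffering and error-handling patterns, reusing the app, user, and session objects from the prerequisites above, you might accumulate chunks until a sentence boundary and guard against an interrupted stream:

buffer = ""
full_response = ""

try:
    with honcho.apps.users.sessions.with_streaming_response.stream(
        app_id=app.id,
        user_id=user.id,
        session_id=session.id,
        queries="What can you tell me about this user?",
    ) as response:
        for chunk in response.iter_text():
            buffer += chunk
            full_response += chunk
            # Flush at sentence boundaries instead of per chunk
            if buffer.endswith((".", "!", "?", "\n")):
                print(buffer, end="", flush=True)
                buffer = ""
except Exception as exc:
    # The stream can be cut off mid-response; decide whether to retry,
    # surface the partial text, or discard it
    print(f"\n[stream interrupted: {exc}]")

if buffer:
    # Emit any trailing text that didn't end on a boundary
    print(buffer, end="", flush=True)

The same accumulation leaves you with full_response to store once the stream completes, as the full example below does.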

Example: Restaurant Recommendation Chat

import asyncio
from honcho import AsyncHoncho

async def restaurant_recommendation_chat():
    # Use the async client so the awaited SDK calls below work
    honcho = AsyncHoncho()
    app = await honcho.apps.get_or_create(name="food-app")
    user = await honcho.apps.users.get_or_create(app_id=app.id, name="food-lover")
    session = await honcho.apps.users.sessions.create(app_id=app.id, user_id=user.id)
    
    # Store multiple user messages about food preferences
    user_messages = [
        "I absolutely love spicy Thai food, especially curries with coconut milk.",
        "Italian cuisine is another favorite - fresh pasta and wood-fired pizza are my weakness!",
        "I try to eat vegetarian most of the time, but occasionally enjoy seafood.",
        "I can't handle overly sweet desserts, but love something with dark chocolate."
    ]
    
    # Store the user's messages in the session
    for message in user_messages:
        await honcho.apps.users.sessions.messages.create(
            app_id=app.id,
            user_id=user.id,
            session_id=session.id,
            content=message,
            is_user=True
        )
        print(f"User: {message}")
    
    # Ask for restaurant recommendations based on preferences
    print("\nRequesting restaurant recommendations...")
    print("Assistant: ", end="", flush=True)
    full_response = ""
    
    # Stream the response; with the async client, use async with / async for
    async with honcho.apps.users.sessions.with_streaming_response.stream(
        app_id=app.id,
        user_id=user.id,
        session_id=session.id,
        queries="Based on this user's food preferences, recommend 3 restaurants they might enjoy in the Lower East Side.",
    ) as response:
        async for chunk in response.iter_text():
            print(chunk, end="", flush=True)
            full_response += chunk
            await asyncio.sleep(0.01)  # Optional delay for demonstration
    
    # Store the assistant's complete response
    await honcho.apps.users.sessions.messages.create(
        app_id=app.id,
        user_id=user.id,
        session_id=session.id,
        content=full_response,
        is_user=False
    )

# Run the async function
if __name__ == "__main__":
    asyncio.run(restaurant_recommendation_chat())

Performance Considerations

When implementing streaming:

  • Consider connection stability for mobile or unreliable networks
  • Implement appropriate timeouts for stream operations (see the sketch after this list)
  • Be mindful of memory usage when accumulating large responses
  • Use appropriate error handling for network interruptions
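For instance, you can bound both the wait and the accumulated size. A minimal sketch, reusing the app, user, and session objects from earlier; the timeout argument and the size cap here are illustrative assumptions rather than documented defaults, so check them against your SDK version:

from honcho import Honcho

# Assumption: the client accepts an httpx-style timeout in seconds
honcho = Honcho(timeout=30.0)

MAX_RESPONSE_CHARS = 100_000  # illustrative cap to bound memory use
full_response = ""

try:
    with honcho.apps.users.sessions.with_streaming_response.stream(
        app_id=app.id,
        user_id=user.id,
        session_id=session.id,
        queries="What can you tell me about this user?",
    ) as response:
        for chunk in response.iter_text():
            full_response += chunk
            if len(full_response) > MAX_RESPONSE_CHARS:
                # Stop accumulating rather than growing without bound
                break
except Exception as exc:
    # Covers timeouts and network interruptions; retry or surface the partial text
    print(f"Stream failed after {len(full_response)} characters: {exc}")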

Streaming responses provide a more interactive and engaging user experience. By implementing streaming in your Honcho applications, you can create more responsive AI-powered features that feel natural and immediate to your users.