2 minute read / Oct 7, 2023 /
The Promise and Pitfalls of Chaining Large Language Models for Email
Over the last few weeks I’ve been experimenting with chaining together large language models.
I dictate emails & blog posts often. Recently, I started using Whisper for drafting emails and documents. (Initially there were some issues with memory management, but I’ve since found a compiled version called whisper.cpp that works well on my Mac.)
After trying Google’s Duet I wondered if I could replicate something similar. I’ve been chaining the Whisper dictation model together with the LLaMA 2 model from Meta (Facebook). When drafting an email, I can dictate a response to LLaMA 2, which will then generate a reply using the context from the email I’m answering.
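A minimal sketch of what this kind of chain might look like. The `transcribe` and `generate` arguments are hypothetical stand-ins for whatever wrappers you write around Whisper and LLaMA 2; neither is a real library call, and the prompt wording is an assumption rather than the prompt I actually use.

```python
from typing import Callable


def draft_reply(
    audio_path: str,
    original_email: str,
    transcribe: Callable[[str], str],  # hypothetical wrapper around Whisper
    generate: Callable[[str], str],    # hypothetical wrapper around LLaMA 2
) -> str:
    """Chain a speech-to-text step into a text-generation step."""
    # Step 1: dictation -> text.
    dictated = transcribe(audio_path).strip()

    # Step 2: combine the email being answered with the dictated notes.
    prompt = (
        "Draft an email reply on my behalf.\n\n"
        f"Email I am replying to:\n{original_email}\n\n"
        f"My dictated notes:\n{dictated}\n\n"
        "Reply:"
    )

    # Step 3: text -> draft reply.
    return generate(prompt)


# Stand-in models so the sketch runs without any model installed.
fake_transcribe = lambda path: " sounds good, see you Tuesday "
fake_generate = lambda prompt: f"[draft based on {len(prompt)} chars of prompt]"

print(draft_reply("reply.wav", "Can we meet Tuesday?", fake_transcribe, fake_generate))
```

Passing the two models in as plain functions keeps the chain itself trivial to test, and makes it easy to swap either model later.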
So far it works sometimes, but there are some clear limitations:
First, the default tone of the generated emails is far too formal.
Second, if I prompt LLaMA 2 to use a more casual tone, it often goes too far in the other direction. The problem is a lack of nuanced context - the appropriate level of familiarity varies greatly between emails to close colleagues versus board communications or potential investors. Without that nuance labeled and incorporated into the training data, it’s hard for the model to strike the right tone.
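One prompt-level workaround, sketched below, is to pass the audience explicitly rather than hoping the model infers it. This is not the same as labeling tone in the training data; it only steers the prompt. The audience labels and the instruction wording are hypothetical examples.

```python
# Hypothetical audience labels and wording; tune these to your own contacts.
TONE_BY_AUDIENCE = {
    "close_colleague": "Write casually and briefly, as to a teammate you know well.",
    "investor": "Write warmly but professionally; no slang, no stiff formalities.",
    "board": "Write formally and concisely, leading with the key point.",
}
DEFAULT_TONE = "Write in a neutral, friendly professional tone."


def tone_instruction(audience: str) -> str:
    """Return the tone line for this audience, or a neutral default."""
    return TONE_BY_AUDIENCE.get(audience, DEFAULT_TONE)


def with_tone(prompt: str, audience: str) -> str:
    """Prefix an existing prompt with an audience-specific tone instruction."""
    return f"{tone_instruction(audience)}\n\n{prompt}"


print(with_tone("Reply:", "board"))
```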
Third, in multi-party email threads, things can get confusing. If Lauren introduces Rafa to me, then Rafa bccs Lauren on the email, LLaMA 2 often replies as Lauren.
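One thing that might help is stating the roles outright at the top of the prompt. The sketch below assumes the thread headers are already parsed; the `Email` class and `role_preamble` function are hypothetical names, and I have not verified that this fixes the problem with LLaMA 2.

```python
from dataclasses import dataclass, field


@dataclass
class Email:
    sender: str
    to: list[str]
    cc: list[str] = field(default_factory=list)
    bcc: list[str] = field(default_factory=list)


def role_preamble(me: str, latest: Email) -> str:
    """Spell out who is writing and who is being addressed."""
    # Reply to the sender first, then anyone else still on to/cc (except me).
    others = [p for p in latest.to + latest.cc if p not in (me, latest.sender)]
    recipients = [latest.sender] + others
    lines = [
        f"You are writing as {me}.",
        f"Address the reply to: {', '.join(recipients)}.",
    ]
    if latest.bcc:
        # People moved to bcc have left the conversation.
        lines.append(f"Do not address or write as: {', '.join(latest.bcc)} (moved to bcc).")
    return "\n".join(lines)


# Rafa replies to me and moves Lauren to bcc.
print(role_preamble("Me", Email(sender="Rafa", to=["Me"], bcc=["Lauren"])))
```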
Fourth, figuring out exactly the right settings for the model can be tough. Sometimes I dictate long emails, in which case the context window (how much the computer listens to before transcribing) should be very long so the system can remember what I’ve said previously.
Other times I’m just returning a very fast email: a quick “see you soon” or “thank you very much.” In that case a long context window doesn’t make sense, and I’m left waiting for the system to process.
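A rough sketch of switching the setting by expected length rather than fixing it once. The function name and every threshold below are made-up illustrations, not tuned values or real Whisper parameters.

```python
def pick_context_seconds(expected_dictation_s: float) -> int:
    """Pick how many seconds of audio to buffer before transcribing.

    All thresholds are illustrative assumptions.
    """
    if expected_dictation_s <= 10:   # "see you soon", "thank you very much"
        return 5
    if expected_dictation_s <= 60:   # a short paragraph
        return 15
    return 30                        # long, multi-paragraph dictation


for seconds in (3, 45, 300):
    print(seconds, "->", pick_context_seconds(seconds))
```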
I’m wondering whether small errors in the first model compound in the second model. Bad data from the transcription -> inaccurate prompt to the LLM -> incorrect output.
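A back-of-the-envelope way to think about it: if each stage is right some fraction of the time, and the second stage cannot recover from the first stage’s mistakes, the fractions multiply. The numbers below are invented for illustration, not measurements of Whisper or LLaMA 2, and the independence assumption is a simplification.

```python
import math


def chain_accuracy(stage_accuracies):
    """End-to-end accuracy if every stage must succeed and errors are independent."""
    return math.prod(stage_accuracies)


# Illustrative only: 95% accurate transcription feeding a 90% accurate LLM step.
end_to_end = chain_accuracy([0.95, 0.90])
print(f"{end_to_end:.3f}")  # 0.855
```

Under those assumptions, two stages that each look good on their own leave roughly one draft in seven needing a fix.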
Overall the potential is exciting, but there are still challenges around tone, context, and multi-party interactions that need to be addressed before this can become a seamless productivity tool. In machine learning systems, achieving an 80% solution is pretty rapid. The marginal 15% - the magic behind ML - takes a huge amount of effort, data, & tuning.