As a weekend project I tried to code up a quick triage agent to categorize my emails, seeing as I can never seem to keep my inbox organized myself. To make sure I didn’t have any obvious holes in the set of tools I’d need to build, I threw a quick query at an LLM (Qwen3) running locally on my PC through Ollama to churn out an outline of what tools I should include with my agent. Turns out my prompting needed some work.
“
I want to build an LLM agent that leverages tool usage and reasoning in order to act as an Email Triage Agent. I imagine having a list of categories (or tags i guess since I’m on gmail?) that the LLM agent can then sort the emails into their respective folders by using the tools available to them, but I want to make sure that the agent has all the tools it could need in order to get all the information and context about the inbox and its emails, and how to allow it to organize the data in a way that is safe to my data. What tools should I include with the agent
“
This was the original prompt used to query Qwen3 (30B-A3B for initial attempt) about the required tools to add to the agent. Initial chat txt
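For reference, sending a prompt like this to a local Ollama instance is a single HTTP call. Here’s a minimal sketch, assuming Ollama’s default /api/chat endpoint on port 11434 and a `qwen3:30b-a3b` model tag; it’s not the exact script I ran, just the shape of the request.

```python
# Rough sketch of the query: one user message to a local Ollama server via its
# /api/chat endpoint, with the full reply returned as a single JSON object.
import requests

prompt = (
    "I want to build an LLM agent that leverages tool usage and reasoning "
    "in order to act as an Email Triage Agent. ..."  # the full prompt quoted above
)

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "qwen3:30b-a3b",  # or "qwen3:8b" for the smaller model
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # wait for the whole reply instead of streaming tokens
    },
    timeout=600,
)
resp.raise_for_status()

print(resp.json()["message"]["content"])  # the assistant's reply, reasoning and all
```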
I thought that was a fine prompt at first. The quality of work an LLM can do for you, and how closely it matches your expectations, depends heavily on how much relevant information you give it. And when the task doesn’t have an easily verifiable answer, judging one output against another is quite subjective from person to person. So I told it what I was trying to build, how I thought it should work, what I wanted it to ponder, and what I wanted it to give me back.
As you’ll see, this prompt was missing just a bit of critical information needed to make its output actually usable. As a point of comparison, the final prompt differs very little from the original, yet we’ll find that these small changes make a large difference in the behavior and quality of the LLM’s output for real work. This is all the final prompt added:
“
I’ll handle the logic of actually implementing everything myself, that part is not too hard, I just want to think on a high abstract level of what tools I should even think about to add to the LLM calls. Lets make a list of tools in a numbered list with a short description for what each tool should do, and what kinda data it should approximately take in and put out
“
The two sentences were added at the end, very simple. Final chat txt
So let’s see where our LLM takes us on the first attempt, with 125 tokens of input. Since Qwen3 is a reasoning model, it will think to itself before giving a final answer; let’s follow along.
> We are building an Email Triage Agent for Gmail. The agent will categorize emails into folders (labels in Gmail) based on a predefined list of categories.
>
> Key components:
> 1. **Tools for Information Gathering**: The agent needs to access the email data to make decisions.
> 2. **Tools for Action**: The agent needs to move emails to the correct folders (labels).
>
> Since we are using Gmail, we can leverage the Gmail API. However, note that the agent must be run in a secure environment to protect user data.
>
> Tools to include:
>
> 1. **Gmail API Access**:
> - `list_emails`: Retrieve a list of emails from the inbox (with optional filters like unread, date range, etc.).
> - `get_email_details`: Fetch the full content (subject, sender, date, body) of a specific email by its ID.
> - `search_emails`: Search emails based on criteria (query string, date range, etc.).
> - `update_email_labels`: Add or remove labels (folders) from an email.
(...)
The first 22 lines of the reasoning block from the LLM as it starts up with the first prompt.
A good start indeed! The LLM correctly restates the scenario, showing that it understands the context of what we’re trying to achieve. It also identifies that there are two general types of tools I should equip the agent with, and then it lists out what I would call a perfectly minimal set of tools that enables the basic functionality needed for an Email Triage Agent.
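To make that minimal list concrete, here is roughly how those tools could be exposed to the agent as function schemas. This is a sketch under my own assumptions: the tool names come from the model’s reasoning above, but the parameter shapes are placeholder guesses, not anything the model specified.

```python
# Rough sketch of the minimal tool set as OpenAI-style function schemas, as they
# might be handed to the agent. search_emails would follow the same pattern.
email_tools = [
    {
        "type": "function",
        "function": {
            "name": "list_emails",
            "description": "Retrieve a list of emails from the inbox, with optional filters.",
            "parameters": {
                "type": "object",
                "properties": {
                    "unread_only": {"type": "boolean"},
                    "after_date": {"type": "string", "description": "ISO date, e.g. 2025-01-01"},
                    "max_results": {"type": "integer"},
                },
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "get_email_details",
            "description": "Fetch subject, sender, date, and body of one email by ID.",
            "parameters": {
                "type": "object",
                "properties": {"email_id": {"type": "string"}},
                "required": ["email_id"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "update_email_labels",
            "description": "Add or remove Gmail labels on an email.",
            "parameters": {
                "type": "object",
                "properties": {
                    "email_id": {"type": "string"},
                    "add_labels": {"type": "array", "items": {"type": "string"}},
                    "remove_labels": {"type": "array", "items": {"type": "string"}},
                },
                "required": ["email_id"],
            },
        },
    },
]
```

Ollama’s chat endpoint accepts tool definitions in this style for models that support tool calling, so a list like this is roughly what the agent would eventually be handed.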
It didn’t need to tell me to use the Gmail API, though; that’s not quite what I was asking for. And I’m not sure what it means by ‘a secure environment’, which is vague and not really relevant to the original query. Those familiar with Qwen models will be painfully aware of how many tokens the model prefers to pour into its reasoning chain before coming back with a final answer. It’s a trade-off (and a bad one at that): you pay extra in tokens generated for Qwen3’s increased performance over its predecessors. Let’s see how the LLM continues.
(...)
> 3. **Safety and Data Protection**:
> - The agent should only have access to the user's Gmail account via OAuth 2.0 tokens (with minimal scopes: read, send, and modify).
> - We must ensure that the agent does not leak data. Therefore, we should run the agent in a secure environment (e.g., on the user's device or in a secure cloud environment with restricted access).
> - We can implement logging and monitoring to track the agent's actions.
>
> 4. **User Interface (Optional)**:
> - A way to define the categories and their corresponding labels (folder names) in Gmail.
> - A dashboard to review the agent's actions and adjust settings.
(...)
Lines 28-35 of the reasoning block from the LLM with the first prompt.
Uh oh, we’re going off the rails here. Maybe it’ll turn it around?
(...)
> Implementation Steps:
>
> 1. **Set up Gmail API**:
(...)
> 4. **Security Measures**:
> - Store OAuth tokens securely (e.g., encrypted storage).
> - Run the agent in a sandboxed environment (e.g., Docker container) to prevent unauthorized access to the host
(...)
Lines 37-59 of the reasoning block from the LLM with the first prompt.
(...)
> Example Code Snippet (Python with Gmail API):
>
> ```python
> from google
> ```
Lines 65-69 (end) of the reasoning block from the LLM, where I had cancelled generating the rest of the response.
Nope, we’ve completely derailed. Crashed and burned. I was holding on to some last sliver of hope as it worked through the implementation steps, but once the model started trying to write Python code I was done with it. It had already arrived at the answer within the first 200 tokens, yet it still continued to reason for another 450 before I stopped it.
As mentioned, I had originally tried that first prompt with the larger 30B-A3B version of Qwen3. With more parameters available, a model can hold onto important context for longer throughout a response, whereas smaller models typically lose the picture quickly and become unintelligible. For whatever reason, I switched from the 30B-A3B model to the 8B model for the later attempts, so to keep variables equivalent between runs you can see how the 8B model handled this same first prompt here. But I’ll spoil it: it was just as bad, touching on unrelated things like app security and legal compliance in the final answer.
From here I thought I could prompt my way out of this issue: I knew that if I was just a bit more specific about what I wanted, I could get the model to output something closer to my expectations. My first problem was that I did not specify any bounds in the prompt. The LLM was trying to respond with the precision of a hammer, touching on every related topic out of the gate, when I was expecting something more like the precision of a scalpel, giving me just an outline of some tools to add to my agent.
Take two:
“
I want to build an LLM agent that leverages tool usage and reasoning in order to act as an Email Triage Agent. I imagine having a list of categories (or tags i guess since I’m on gmail?) that the LLM agent can then sort the emails into their respective folders by using the tools available to them, but I want to make sure that the agent has all the tools it could need in order to get all the information and context about the inbox and its emails, and how to allow it to organize the data in a way that is safe to my data. What tools should I include with the agent
I’ll handle the logic of actually implementing everything myself, that part is not too hard, I just want to think on a high abstract level of what tools I should even think about to add to the LLM calls
“
The second prompt to Qwen3 (8B)
I put some very simple bounds on the LLM: don’t think about any of the logic of the application; implementation will be handled at a later time. I believe I kind of shot myself in the foot by attaching ‘abstract’ to the prompt, as we’ll come to see.
To build a robust **Email Triage Agent** using an LLM, you’ll need a combination of **tools** and **data access mechanisms** to ensure the agent can process, categorize, and securely manage emails. Here's a structured breakdown of the tools and considerations:
---
### **1. Core Tools for Email Access and Processing**
#### **a. Email Provider API (e.g., Gmail API)**
- **Purpose**: Fetch emails, metadata (sender, subject, date, attachments), and labels.
- **Key Features**:
- Read emails and their content (plaintext, HTML, attachments).
- Apply labels/tags (e.g., "Work", "Personal", "Urgent").
- Move emails between folders.
- Access email headers and thread information.
- **Why It’s Critical**: Direct access to the email data is essential for triage.
#### **b. Natural Language Processing (NLP) Tools**
- **Purpose**: Extract structured information from email content.
- **Key Features**:
- Sentiment analysis (to prioritize urgent or negative emails).
- Named Entity Recognition (NER) to identify people, organizations, or PII (e.g., phone numbers, addresses).
- Keyphrase extraction (to summarize email content).
- **Why It’s Critical**: Helps the LLM understand context and prioritize categorization.
(...)
Lines 39-59 of the response block (final response) from the LLM with the second prompt. Second chat txt
It’s an alright response, but it is a bit too abstract. It’s already an endeavor to wade through the swamp of LLM outputs, where quality is like a coin flip, and reading a response like this it’s easy to have your eyes glaze over. Not to mention the model continues to delve into sidebars on data anonymization, security and legal compliance tools, malware detection and attachment scanning, and much more as the response marches on. It’s clear I’m still not being specific enough.
To be specific about what I’m not being specific enough about: I need to tell the model what I want the output to look like. Currently the model is using its own judgement to come up with the best structure for its response and what to include in it. What if we told the model to say exactly what I want to hear? Enter our final prompt:
“
I want to build an LLM agent that leverages tool usage and reasoning in order to act as an Email Triage Agent. I imagine having a list of categories (or tags i guess since I’m on gmail?) that the LLM agent can then sort the emails into their respective folders by using the tools available to them, but I want to make sure that the agent has all the tools it could need in order to get all the information and context about the inbox and its emails, and how to allow it to organize the data in a way that is safe to my data. What tools should I include with the agent
I’ll handle the logic of actually implementing everything myself, that part is not too hard, I just want to think on a high abstract level of what tools I should even think about to add to the LLM calls. Lets make a list of tools in a numbered list with a short description for what each tool should do, and what kinda data it should approximately take in and put out
“
The final prompt, total additions from initial: 76 tokens, or 71 words.
What has changed now is the addition of what I want the output to look like. It’s also in my own self-interest to have some control over this: if I’m going to be the one reading the model’s outputs, I want to be sure they’re in a format I can easily read and understand, depending on what I need. So I instruct the model: make a numbered list of tools with their expected inputs, outputs, and a little description for each. This narrows the scope of the model’s behavior incredibly. No longer will the model wander off into data anonymization, since that has no place in a numbered list of tools. Same goes for the other unnecessary sidebars into legal compliance and user interfaces (thank god).
Here’s a list of tools you should consider including in your LLM agent for email triage, along with their purposes, input/output, and how they contribute to the workflow:
---
### **1. Email Fetcher Tool**
**Purpose**: Retrieve emails from the inbox (e.g., Gmail) based on filters (sender, date, keyword, etc.).
**Input**: Filters (e.g., `"from:example.com"`, `"label:unread"`, `"after:2023-01-01"`).
**Output**: List of raw email data (headers, body, attachments, metadata).
**Why**: Enables the agent to access and process emails dynamically based on user-defined criteria.
---
### **2. Email Parser Tool**
**Purpose**: Extract structured data from raw email content (headers, body, attachments, metadata).
**Input**: Raw email data (e.g., raw JSON from Gmail API).
**Output**: Structured data (e.g., sender, subject, body text, attachment filenames, timestamps).
(...)
### **3. Categorization Tool**
(...)
### **4. Folder Management Tool**
(...)
Lines 42-70 of the response block (final response) from the LLM with the last prompt. Final chat txt
Now that is a list I can read, and it serves the original purpose I needed the LLM for! The first few are pretty obvious, but that’s good; it should always get those right. After covering the basics of fetching batches of emails, reading individual email summaries, categorizing the individual emails, and doing basic folder management, we see the model get into some new ones. Most of them are not useful, but once in a while there will be something that stands out.
### **5. Data Sanitizer Tool**
(...)
### **6. Preference Loader Tool**
(...)
### **7. Reasoning Tool**
(...)
### **8. Attachment Handler Tool**
(...)
### **9. Validation Tool**
(...)
### **10. Log Tool**
(...)
### **Bonus: User Feedback Tool**
(...)
Lines 78-126 of the response block (final response) from the LLM with the last prompt.
As expected, most of these are pretty bad. Numbers 5, 8, 9, and 10 are pretty unhelpful in my opinion. However, numbers 6, 7, and the bonus have some teeth. Having a bank of information with my specific categorization preferences available as a tool, for when the context gets overloaded, is not a bad idea; it may turn out further down the road to be unnecessary, but it made me think more than the others did. Same goes for the reasoning tool and a tool to get user feedback on items. Those could prove useful, and I think they warrant some attention after getting the first four (plus) essential tools functioning.
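The preference loader in particular is easy to picture: a tool that reads my categorization preferences out of a local file and hands them back to the model only when asked, rather than stuffing them all into the prompt. A quick sketch of the idea; the file name and structure here are placeholders I made up, not anything from the model’s answer.

```python
import json
from pathlib import Path

# Hypothetical preference store: my category names plus a few rules-of-thumb per
# category, kept outside the prompt so they can be pulled in only when needed.
PREFS_PATH = Path("triage_preferences.json")

def load_preferences(category: str | None = None) -> dict:
    """Tool body: return all categorization preferences, or just one category's.

    Input: optional category name. Output: dict of category -> guidance string.
    """
    prefs = json.loads(PREFS_PATH.read_text())
    if category is not None:
        return {category: prefs.get(category, "no preferences recorded")}
    return prefs

# Example contents of triage_preferences.json:
# {
#   "Receipts": "Anything with an order number or an invoice attached.",
#   "Newsletters": "Recurring senders I read weekly at most; never mark urgent.",
#   "Action Needed": "Addressed directly to me and asks a question or sets a deadline."
# }
```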
Takeaways
Let’s take a moment to recap the changes we’ve made and how they have influenced the behavior of the model. First, we restricted the scope of what the model is responsible for figuring out, reducing the number of possible avenues it may decide to run down as it loses attention on the main idea:
I’ll handle the logic of actually implementing everything myself, that part is not too hard, I just want to think on a high abstract level of what tools I should even think about to add to the LLM calls
While the model still rambles on about similar topics like compliance and risk mitigation, we have restricted it to thinking about our project abstractly (for better or worse), which removes the code generation and Gmail API getting-started instructions we saw in the first example. Another way to do this is to set out a list of rules that you can append to or shorten as the agent is tuned, which can be as simple as saying:
RULE #1: Do NOT generate any code for this task
RULE #2: You are to think about the user’s project from a top down perspective
RULE #3: You are ONLY allowed to respond to the user in the style of a pirate (Argh!)
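If I went that route, I’d keep the rules as plain data that gets joined onto the end of the prompt, so tuning the agent is just adding or deleting a line. A small sketch of that idea (the base prompt is trimmed here for brevity):

```python
# Keep behavioral rules as data so they can be appended or trimmed between runs
# without rewriting the prompt itself.
BASE_PROMPT = "I want to build an LLM agent that acts as an Email Triage Agent ..."  # trimmed

rules = [
    "RULE #1: Do NOT generate any code for this task",
    "RULE #2: You are to think about the user's project from a top-down perspective",
    # "RULE #3: You are ONLY allowed to respond to the user in the style of a pirate (Argh!)",
]

def build_prompt(base: str, active_rules: list[str]) -> str:
    """Join the base request with whatever rules are currently enabled."""
    return base + "\n\n" + "\n".join(active_rules)

print(build_prompt(BASE_PROMPT, rules))
```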
With the model’s output domain restricted by our explanation of the scenario and of how we need the LLM to behave, we are on our way to a usable result. To finish off the last of the misbehavior, I needed to constrain the possible ways the model could format and put together its response by adding the following line:
Lets make a list of tools in a numbered list with a short description for what each tool should do, and what kinda data it should approximately take in and put out
By adding this I told the model what I’d be happy reading. If all I need is a numbered list of tools X, Y, and Z to go code up, then that’s all it should give me. Without telling the model exactly how you want it to give the information back, it will try to fill in the blanks itself, introducing randomness into its output. To extract any bit of usefulness from the outputs, I need to focus its attention on the important aspects of the context so it produces the expected output.
The example we covered asks for a very forgiving prompt structure. If you need sentence-level control over what the LLM comes back with (chat naming, summarization, transcribing, etc.), that can be achieved by specifying instructions and/or examples:
OUTPUT STRUCTURE: A bullet point list with the tool name in bold, and its importance to the MVP as a percentage in parentheses. Each bullet point will have 2 sub-points. One with 1-2 sentences (under 50 words) on what the tool will do and what information it needs to function. Then the second sub-point with a quick, concise statement (less than 15 words) on whether it is required for the MVP of the email triage agent.
EXAMPLE:
- **Get Inbox Tool** (100%)
- Sub-point 1
- Sub-point 2
While the above example is far more verbose than the ~30 words I added to my prompt, it doesn’t take much to transform the output of the model from one form to another.
How do current state-of-the-art models fare? (10-2025)
After spending a couple of minutes appending to my prompt, I wondered how much better the brand new models would perform on the initial prompt. Over time we’ve seen frontier AI models get better and better at filling in the missing context themselves; however, there is still quite a spread between models in how well they understand the unspoken context of what I’m looking for. I tested the original prompt against Claude Sonnet 4.5, GLM 4.6, GPT-5, Qwen3 Max, and Gemini 2.5 Flash 09-2025 on OpenRouter. The differences between them could be a post in itself.
Claude 4.5 provides the best response of all those tested, in my opinion. Considering it is the newest of the bunch, and the most SOTA of the SOTA, that makes sense to me. It does overextend a bit to recommend some useless tools, but otherwise it comes right away with a list of tools.
GLM 4.6 doesn’t quite hit the mark. It doesn’t list an easy-to-read set of tools; it goes more to the abstract side of things, though there eventually are defined tools hidden among the other headings, numbers, and sub-points. Not friendly to read or navigate.
GPT-5 continues to disappoint me, and not for a lack of information; this time it goes way too deep into the topic. It comes back with technically the most complete answer, with really well-defined tools and the essential ones front and center at the top. But then it rambles on for thousands of tokens about super-niche tools like inspecting attachment metadata and setting up automatic email filters, using a total of 5,313 tokens at a cost of $0.053 for the single request. Way too much thinking and output, almost giving Grok a run for its money.
(Compare that to Sonnet 4.5 at 1,269 tokens and $0.019 for the request. The efficiency with which Sonnet operates while still returning a better response is staggering.)
Qwen3 Max performs best of the ‘open’ models. It is a bit overly flashy with emojis and a deep outline structure, but it does list out the tools in a pleasant, easy-to-understand way, hitting all the marks for required tools along with fine options for add-ons beyond the minimal required items. It still fails the dismount, so to speak, as it finishes up with some workflow examples, implementation suggestions, and development checklists. Those seem fine, but they’re not quite what I needed; a bit too overzealous.
Gemini 2.5 Flash surprised me by being pretty good, in my opinion better than GLM and GPT-5 at understanding the context of what I wanted as the output. It does litter more garbage suggestions alongside the good ones, so it’s a mixed bag. While models like GPT-5 and Sonnet 4.5 gave similar answers, they put their best suggestions at the top and then descended into the weirder ones; Gemini categorized the tools so the good and bad were mixed together. Either way, with 2.5 Flash being an order of magnitude cheaper and faster than GPT-5, it’s clearly in a class of its own.
So, why didn’t you use this/that to start?
Simple, I will always choose to run my own model locally whenever feasible.
I could have saved a couple of minutes by being sloppy with my prompt and using a proprietary model like Gemini for fractions of a cent, but do you really want Google having more on you?