Commit 97fdb06

feat(eval): add openai memory on locomo with eval guide (#54)
* feat(eval): add eval dependencies
* feat(eval): add configs example
* docs(eval): update README.md
* feat(eval): remove the dependency (pydantic)
* feat(eval): add run locomo eval script
* fix(eval): delete about memos redundant search branches
* chore: fix format
* feat(eval): add openai memory on locomo - eval guide
* docs(eval): modify openai memory on locomo - eval guide
1 parent 8abc88a commit 97fdb06

File tree

4 files changed (+322, -1 lines)


evaluation/README.md

Lines changed: 3 additions & 1 deletion
````diff
@@ -25,10 +25,12 @@
 ## Evaluation Scripts
 
 ### LoCoMo Evaluation
-To evaluate the **LoCoMo** dataset using one of the supported memory frameworks — `memos`, `mem0`, or `zep` — run the following command:
+⚙️ To evaluate the **LoCoMo** dataset using one of the supported memory frameworks — `memos`, `mem0`, or `zep` — run the following [script](./scripts/run_locomo_eval.sh):
 
 ```bash
 # Edit the configuration in ./scripts/run_locomo_eval.sh
 # Specify the model and memory backend you want to use (e.g., mem0, zep, etc.)
 ./scripts/run_locomo_eval.sh
 ```
+
+✍️ For evaluating OpenAI's native memory feature with the LoCoMo dataset, please refer to the detailed guide: [OpenAI Memory on LoCoMo - Evaluation Guide](./scripts/locomo/openai_memory_locomo_eval_guide.md).
````

evaluation/scripts/locomo/locomo_openai.py

Lines changed: 173 additions & 0 deletions
```python
import argparse
import json
import os
import time

from collections import defaultdict
from multiprocessing.dummy import Pool

from dotenv import load_dotenv
from openai import OpenAI
from tenacity import retry, stop_after_attempt, wait_random_exponential
from tqdm import tqdm


load_dotenv()

# Retry policy constants
WAIT_MIN = 5  # minimum backoff delay in seconds
WAIT_MAX = 30  # maximum backoff delay in seconds
MAX_TRIES = 10  # maximum number of retry attempts

WORKERS = 5  # number of parallel workers (multiprocessing.dummy uses threads)

ANSWER_PROMPT = """
You are an intelligent memory assistant tasked with retrieving accurate information from conversation memories.

# CONTEXT:
You have access to memories from a conversation. These memories contain
timestamped information that may be relevant to answering the question.

# INSTRUCTIONS:
1. Carefully analyze all provided memories
2. Pay special attention to the timestamps to determine the answer
3. If the question asks about a specific event or fact, look for direct evidence in the memories
4. If the memories contain contradictory information, prioritize the most recent memory
5. If there is a question about time references (like "last year", "two months ago", etc.),
   calculate the actual date based on the memory timestamp. For example, if a memory from
   4 May 2022 mentions "went to India last year," then the trip occurred in 2021.
6. Always convert relative time references to specific dates, months, or years. For example,
   convert "last year" to "2022" or "two months ago" to "March 2023" based on the memory
   timestamp. Ignore the reference while answering the question.
7. Focus only on the content of the memories. Do not confuse character
   names mentioned in memories with the actual users who created those memories.
8. The answer should be less than 5-6 words.

# APPROACH (Think step by step):
1. First, examine all memories that contain information related to the question
2. Examine the timestamps and content of these memories carefully
3. Look for explicit mentions of dates, times, locations, or events that answer the question
4. If the answer requires calculation (e.g., converting relative time references), show your work
5. Formulate a precise, concise answer based solely on the evidence in the memories
6. Double-check that your answer directly addresses the question asked
7. Ensure your final answer is specific and avoids vague time references

Memories:

{context}

Question: {question}
Answer:
"""


class OpenAIPredict:
    def __init__(self, model="gpt-4o-mini"):
        self.model = model
        self.openai_client = OpenAI(
            api_key=os.getenv("OPENAI_API_KEY"), base_url=os.getenv("OPENAI_BASE_URL")
        )
        self.results = defaultdict(list)

    def search_memory(self, idx):
        # Memories are pre-extracted and consolidated per conversation (see the
        # evaluation guide), so "search" is just reading a file from disk.
        with open(f"openai_memory/{idx}.txt", encoding="utf-8") as file:
            memories = file.read().strip().replace("\n\n", "\n")

        return memories, 0  # search time is reported as 0 ms

    def process_question(self, val, idx):
        question = val.get("question", "")
        answer = val.get("answer", "")
        category = val.get("category", -1)

        response, search_memory_time, response_time, context = self.answer_question(idx, question)

        result = {
            "question": question,
            "answer": response,
            "category": category,
            "golden_answer": answer,
            "search_context": context,
            "response_duration_ms": response_time,
            "search_duration_ms": search_memory_time,
        }

        return result

    @retry(
        wait=wait_random_exponential(min=WAIT_MIN, max=WAIT_MAX),
        stop=stop_after_attempt(MAX_TRIES),
        reraise=True,
    )
    def answer_question(self, idx, question):
        memories, search_memory_time = self.search_memory(idx)

        answer_prompt = ANSWER_PROMPT.format(context=memories, question=question)

        t1 = time.time()
        response = self.openai_client.chat.completions.create(
            model=self.model,
            messages=[{"role": "system", "content": answer_prompt}],
            temperature=0.0,
        )
        t2 = time.time()
        response_time = (t2 - t1) * 1000
        return response.choices[0].message.content, search_memory_time, response_time, memories

    def process_data_file(self, file_path, output_file_path):
        with open(file_path, encoding="utf-8") as f:
            data = json.load(f)

        # Function to process each conversation
        def process_conversation(item):
            idx, conversation = item
            results_for_conversation = []

            # Process each question in the conversation
            for question_item in tqdm(
                conversation["qa"], desc=f"Processing questions for conversation {idx}", leave=False
            ):
                # Category-5 questions are skipped
                if int(question_item.get("category", "")) == 5:
                    continue
                result = self.process_question(question_item, idx)
                results_for_conversation.append(result)

            return idx, results_for_conversation

        # Use a thread pool to process the conversations in parallel
        with Pool(processes=WORKERS) as pool:
            results = list(
                tqdm(
                    pool.imap(process_conversation, list(enumerate(data))),
                    total=len(data),
                    desc="Processing conversations",
                )
            )

        # Reorganize results and store them in self.results
        for idx, results_for_conversation in results:
            self.results[f"locomo_exp_user_{idx}"] = results_for_conversation

        # Save results to output file
        with open(output_file_path, "w") as f:
            json.dump(self.results, f, indent=4)


def main(version):
    os.makedirs(f"results/locomo/openai-{version}/", exist_ok=True)
    output_file_path = f"results/locomo/openai-{version}/openai_locomo_responses.json"
    openai_predict = OpenAIPredict()
    openai_predict.process_data_file("data/locomo/locomo10.json", output_file_path)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--version",
        type=str,
        default="default",
        help="Version identifier for saving results (e.g., 1010)",
    )
    args = parser.parse_args()
    version = args.version
    main(version)
```
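
For reference, the responses file written by this script has the following shape; this is a sketch assembled from `process_question` above, with placeholder values:

```python
# Shape of results/locomo/openai-{version}/openai_locomo_responses.json as
# written by process_data_file; all values below are illustrative placeholders.
example_responses = {
    "locomo_exp_user_0": [
        {
            "question": "...",
            "answer": "...",            # model response
            "category": 2,
            "golden_answer": "...",
            "search_context": "...",    # consolidated memories used as context
            "response_duration_ms": 812.5,
            "search_duration_ms": 0,    # memories are read from disk
        },
        # one entry per non-category-5 question
    ],
    # one key per conversation in locomo10.json
}
```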

evaluation/scripts/locomo/openai_memory_locomo_eval_guide.md

Lines changed: 115 additions & 0 deletions
# OpenAI Memory on LoCoMo - Evaluation Guide

This document outlines the evaluation process for OpenAI's Memory feature using the LoCoMo dataset.

## 1. Introduction

Since OpenAI's [Memory feature](https://openai.com/index/memory-and-new-controls-for-chatgpt/) does not have a public API, the evaluation requires a manual process. Dialogues from the LoCoMo dataset are formatted and manually input into the ChatGPT web interface. The resulting memories are then retrieved from the account's memory management page and saved locally.

To evaluate the quality of these memories, we use the `gpt-4o-mini` model via the API. The model is asked questions from the LoCoMo dataset, with the full history of memories for the relevant conversation provided as context. This simulates a perfect memory retrieval system, giving the model the best possible information to answer each question.

## 2. Step-by-Step Workflow

### Step 2.1: Generate Input Context for Memory Extraction

Run the following Python script to generate the input prompts for each session in each conversation. The script creates a separate `.txt` file per session, containing the formatted conversation history and the extraction prompt.

**Script:**
```python
import json
import os

# Ensure the path to the dataset is correct
LOCOMO_DATA_PATH = "data/locomo/locomo10.json"
SAVE_DIR = "openai_inputs"

os.makedirs(SAVE_DIR, exist_ok=True)

TEMPLATE = """Can you please extract relevant information from this conversation and create memory entries for each user mentioned? Please store these memories in your knowledge base in addition to the timestamp provided for future reference and personalized interactions.

{context}
"""

with open(LOCOMO_DATA_PATH, "r", encoding="utf-8") as f:
    data = json.load(f)

for conv_idx, item in enumerate(data):
    conv = item["conversation"]

    for i in range(1, 35):
        session_key = f"session_{i}"
        session_dt_key = f"session_{i}_date_time"
        if session_key not in conv:
            continue

        session = conv[session_key]
        session_dt = conv[session_dt_key]

        session_context = ""
        for chat in session:
            chat_str = f"({session_dt}) {chat['speaker']}: {chat['text']}\n"
            session_context += chat_str

        input_string = TEMPLATE.format(context=session_context)

        output_filename = os.path.join(SAVE_DIR, f"{conv_idx}-D{i}.txt")
        with open(output_filename, "w", encoding="utf-8") as f:
            f.write(input_string)

print(f"Generated {len(os.listdir(SAVE_DIR))} input files in '{SAVE_DIR}' directory.")
```

**Example Input (`0-D9.txt`):**
```plaintext
Can you please extract relevant information from this conversation and create memory entries for each user mentioned? Please store these memories in your knowledge base in addition to the timestamp provided for future reference and personalized interactions.

(2:31 pm on 17 July, 2023) Melanie: Hey Caroline, hope all's good! I had a quiet weekend after we went camping with my fam two weekends ago. It was great to unplug and hang with the kids. What've you been up to? Anything fun over the weekend?
(2:31 pm on 17 July, 2023) Caroline: Hey Melanie! That sounds great! Last weekend I joined a mentorship program for LGBTQ youth - it's really rewarding to help the community.
... (rest of the conversation)
```
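
For orientation, here is a schematic of the `locomo10.json` fields that the scripts in this commit rely on. This is a sketch inferred from the field accesses above; the actual dataset entries carry additional metadata:

```python
# Schematic shape of one entry in locomo10.json, inferred from the scripts in
# this guide; values are illustrative placeholders, not real dataset content.
locomo_item = {
    "conversation": {
        "session_1": [
            {"speaker": "Melanie", "text": "Hey Caroline, hope all's good! ..."},
            # ... more turns in this session
        ],
        "session_1_date_time": "2:31 pm on 17 July, 2023",
        # ... session_2 / session_2_date_time, and so on
    },
    "qa": [
        {"question": "...", "answer": "...", "category": 1},
        # category-5 entries are skipped by locomo_openai.py
    ],
}
```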

### Step 2.2: Extract and Save Memories from ChatGPT

1. **Enable Memory:** In ChatGPT, go to **Settings -> Personalization** and ensure **Memory** is turned on.
2. **Clear Existing Memories:** Before processing a new conversation, click **Manage** and then **Clear all** to ensure a clean slate.
3. **Input and Verify:**
   * Open a new chat.
   * Ensure the model is set to **GPT-4o**.
   * Copy the content of a generated `.txt` file (e.g., `0-D1.txt`) and paste it into the chat.
   * After the model responds, verify that you see the "Memory updated" confirmation.
4. **Save Memories:**
   * Click **Manage** in the memory confirmation to view the newly generated memories.
   * Create a new local `.txt` file with the same name as the input file (e.g., `0-D1.txt`).
   * Copy each memory entry from ChatGPT and paste it into the new file, one memory per line.
5. **Reset Memories for the Next Conversation:**
   * Once all sessions for a conversation are complete, it is essential to **delete all memories to ensure a clean state for the next conversation**. Navigate to **Settings -> Personalization -> Manage** and click **Delete all**.

**Example Memory Output (`0-D9.txt`):**
```plaintext
As of November 17, 2023, Dave has taken up photography and enjoys capturing nature scenes like sunsets, beaches, waves, rocks, and waterfalls.
Dave recently purchased a vintage camera that takes high-quality photos.
Dave discovered a serene park nearby with a peaceful spot featuring a bench under a tree with pink flowers.
As of November 17, 2023, Calvin attended a fancy gala in Boston where he had an inspiring conversation with an artist about music and art.
Calvin finds music a powerful connector and source of creativity.
Calvin took a photo in a Japanese garden that he shared with Dave.
Calvin accepted an invitation to perform at an upcoming show in Boston, expressing excitement about the musical experience.
```

### Step 2.3: Consolidate Memories

The memories are currently saved per session. Write a small script to consolidate all memories belonging to the same conversation into a single file. For example, all memories from `0-D1.txt`, `0-D2.txt`, etc. should be merged into one file for conversation 0. Note that the evaluation script (`locomo_openai.py`) reads the consolidated memories from `openai_memory/{idx}.txt`, so the merged file for conversation 0 should be saved as `openai_memory/0.txt`. A minimal consolidation sketch is shown below.
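
The following is a minimal sketch, assuming the per-session memory files saved in Step 2.2 are collected in a local `openai_memory_sessions/` directory (that directory name is a hypothetical choice; only the `openai_memory/{idx}.txt` output path is fixed, since it is what `locomo_openai.py` reads):

```python
import os

SESSION_DIR = "openai_memory_sessions"  # hypothetical location of per-session files
OUTPUT_DIR = "openai_memory"            # path expected by locomo_openai.py
NUM_CONVERSATIONS = 10                  # locomo10.json contains 10 conversations
MAX_SESSIONS = 35                       # matches the session range in Step 2.1

os.makedirs(OUTPUT_DIR, exist_ok=True)

for conv_idx in range(NUM_CONVERSATIONS):
    merged = []
    for session_idx in range(1, MAX_SESSIONS):
        path = os.path.join(SESSION_DIR, f"{conv_idx}-D{session_idx}.txt")
        if not os.path.exists(path):
            continue  # not every conversation has every session
        with open(path, encoding="utf-8") as f:
            merged.append(f.read().strip())
    out_path = os.path.join(OUTPUT_DIR, f"{conv_idx}.txt")
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(merged) + "\n")
    print(f"Wrote {out_path} ({len(merged)} session files merged)")
```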

### Step 2.4: Automated Evaluation

Once the memories for all conversations have been extracted and saved, you can run the automated [evaluation script](../run_openai_eval.sh). It handles generating answers, evaluating them, and calculating metrics.

```bash
# Edit the configuration in ./scripts/run_openai_eval.sh
./scripts/run_openai_eval.sh
```

## 3. Considerations

- **Account Differences:** Be aware of potential differences between free and Plus accounts, such as context-length limitations and the number of memories that can be stored.
- **Granularity:** The evaluation process adds memories at the session level. To ensure high-quality memory extraction, follow this same principle: feeding an entire conversation to the model at once has been shown to be ineffective, often causing it to overlook important details and leading to substantial information loss.

evaluation/scripts/run_openai_eval.sh

Lines changed: 31 additions & 0 deletions
```bash
#!/bin/bash

# Common parameters for all scripts
LIB="openai"
VERSION="063001"
WORKERS=10
NUM_RUNS=3


echo "Running locomo_openai.py..."
python scripts/locomo/locomo_openai.py --version $VERSION
if [ $? -ne 0 ]; then
    echo "Error running locomo_openai.py."
    exit 1
fi

echo "Running locomo_eval.py..."
python scripts/locomo/locomo_eval.py --lib $LIB --version $VERSION --num_runs $NUM_RUNS
if [ $? -ne 0 ]; then
    echo "Error running locomo_eval.py"
    exit 1
fi

echo "Running locomo_metric.py..."
python scripts/locomo/locomo_metric.py --lib $LIB --version $VERSION
if [ $? -ne 0 ]; then
    echo "Error running locomo_metric.py"
    exit 1
fi

echo "All scripts completed successfully!"
```
