Commit 97fdb06

feat(eval): add openai memory on locomo with eval guide (#54)
* feat(eval): add eval dependencies
* feat(eval): add configs example
* docs(eval): update README.md
* feat(eval): remove the dependency (pydantic)
* feat(eval): add run locomo eval script
* fix(eval): delete about memos redundant search branches
* chore: fix format
* feat(eval): add openai memory on locomo - eval guide
* docs(eval): modify openai memory on locomo - eval guide
1 parent 8abc88a commit 97fdb06

File tree

4 files changed (+322, -1 lines)


evaluation/README.md

Lines changed: 3 additions & 1 deletion
````diff
@@ -25,10 +25,12 @@
 ## Evaluation Scripts
 
 ### LoCoMo Evaluation
-To evaluate the **LoCoMo** dataset using one of the supported memory frameworks — `memos`, `mem0`, or `zep` — run the following command:
+⚙️ To evaluate the **LoCoMo** dataset using one of the supported memory frameworks — `memos`, `mem0`, or `zep` — run the following [script](./scripts/run_locomo_eval.sh):
 
 ```bash
 # Edit the configuration in ./scripts/run_locomo_eval.sh
 # Specify the model and memory backend you want to use (e.g., mem0, zep, etc.)
 ./scripts/run_locomo_eval.sh
 ```
+
+✍️ For evaluating OpenAI's native memory feature with the LoCoMo dataset, please refer to the detailed guide: [OpenAI Memory on LoCoMo - Evaluation Guide](./scripts/locomo/openai_memory_locomo_eval_guide.md).
````

evaluation/scripts/locomo/locomo_openai.py

Lines changed: 173 additions & 0 deletions
```python
import argparse
import json
import os
import time

from collections import defaultdict
from multiprocessing.dummy import Pool

from dotenv import load_dotenv
from openai import OpenAI
from tenacity import retry, stop_after_attempt, wait_random_exponential
from tqdm import tqdm


load_dotenv()

# Retry policy constants
WAIT_MIN = 5  # minimum backoff delay in seconds
WAIT_MAX = 30  # maximum backoff delay in seconds
MAX_TRIES = 10  # maximum number of retry attempts

WORKERS = 5  # number of parallel workers (multiprocessing.dummy uses threads)

ANSWER_PROMPT = """
You are an intelligent memory assistant tasked with retrieving accurate information from conversation memories.

# CONTEXT:
You have access to memories from a conversation. These memories contain
timestamped information that may be relevant to answering the question.

# INSTRUCTIONS:
1. Carefully analyze all provided memories
2. Pay special attention to the timestamps to determine the answer
3. If the question asks about a specific event or fact, look for direct evidence in the memories
4. If the memories contain contradictory information, prioritize the most recent memory
5. If there is a question about time references (like "last year", "two months ago", etc.),
   calculate the actual date based on the memory timestamp. For example, if a memory from
   4 May 2022 mentions "went to India last year," then the trip occurred in 2021.
6. Always convert relative time references to specific dates, months, or years. For example,
   convert "last year" to "2022" or "two months ago" to "March 2023" based on the memory
   timestamp. Ignore the reference while answering the question.
7. Focus only on the content of the memories. Do not confuse character
   names mentioned in memories with the actual users who created those memories.
8. The answer should be less than 5-6 words.

# APPROACH (Think step by step):
1. First, examine all memories that contain information related to the question
2. Examine the timestamps and content of these memories carefully
3. Look for explicit mentions of dates, times, locations, or events that answer the question
4. If the answer requires calculation (e.g., converting relative time references), show your work
5. Formulate a precise, concise answer based solely on the evidence in the memories
6. Double-check that your answer directly addresses the question asked
7. Ensure your final answer is specific and avoids vague time references

Memories:

{context}

Question: {question}
Answer:
"""


class OpenAIPredict:
    def __init__(self, model="gpt-4o-mini"):
        self.model = model
        self.openai_client = OpenAI(
            api_key=os.getenv("OPENAI_API_KEY"), base_url=os.getenv("OPENAI_BASE_URL")
        )
        self.results = defaultdict(list)

    def search_memory(self, idx):
        # Memories are pre-extracted and consolidated per conversation (see the
        # evaluation guide), so "search" is just reading a file from disk.
        with open(f"openai_memory/{idx}.txt", encoding="utf-8") as file:
            memories = file.read().strip().replace("\n\n", "\n")

        return memories, 0  # search time is reported as 0 ms

    def process_question(self, val, idx):
        question = val.get("question", "")
        answer = val.get("answer", "")
        category = val.get("category", -1)

        response, search_memory_time, response_time, context = self.answer_question(idx, question)

        result = {
            "question": question,
            "answer": response,
            "category": category,
            "golden_answer": answer,
            "search_context": context,
            "response_duration_ms": response_time,
            "search_duration_ms": search_memory_time,
        }

        return result

    @retry(
        wait=wait_random_exponential(min=WAIT_MIN, max=WAIT_MAX),
        stop=stop_after_attempt(MAX_TRIES),
        reraise=True,
    )
    def answer_question(self, idx, question):
        memories, search_memory_time = self.search_memory(idx)

        answer_prompt = ANSWER_PROMPT.format(context=memories, question=question)

        t1 = time.time()
        response = self.openai_client.chat.completions.create(
            model=self.model,
            messages=[{"role": "system", "content": answer_prompt}],
            temperature=0.0,
        )
        t2 = time.time()
        response_time = (t2 - t1) * 1000
        return response.choices[0].message.content, search_memory_time, response_time, memories

    def process_data_file(self, file_path, output_file_path):
        with open(file_path, encoding="utf-8") as f:
            data = json.load(f)

        # Function to process each conversation
        def process_conversation(item):
            idx, conversation = item
            results_for_conversation = []

            # Process each question in the conversation
            for question_item in tqdm(
                conversation["qa"], desc=f"Processing questions for conversation {idx}", leave=False
            ):
                # Category-5 questions are skipped
                if int(question_item.get("category", "")) == 5:
                    continue
                result = self.process_question(question_item, idx)
                results_for_conversation.append(result)

            return idx, results_for_conversation

        # Use a thread pool to process the conversations in parallel
        with Pool(processes=WORKERS) as pool:
            results = list(
                tqdm(
                    pool.imap(process_conversation, list(enumerate(data))),
                    total=len(data),
                    desc="Processing conversations",
                )
            )

        # Reorganize results and store them in self.results
        for idx, results_for_conversation in results:
            self.results[f"locomo_exp_user_{idx}"] = results_for_conversation

        # Save results to output file
        with open(output_file_path, "w") as f:
            json.dump(self.results, f, indent=4)


def main(version):
    os.makedirs(f"results/locomo/openai-{version}/", exist_ok=True)
    output_file_path = f"results/locomo/openai-{version}/openai_locomo_responses.json"
    openai_predict = OpenAIPredict()
    openai_predict.process_data_file("data/locomo/locomo10.json", output_file_path)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--version",
        type=str,
        default="default",
        help="Version identifier for saving results (e.g., 1010)",
    )
    args = parser.parse_args()
    version = args.version
    main(version)
```
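
For reference, the responses file written by this script has the following shape; this is a sketch assembled from `process_question` above, with placeholder values:

```python
# Shape of results/locomo/openai-{version}/openai_locomo_responses.json as
# written by process_data_file; all values below are illustrative placeholders.
example_responses = {
    "locomo_exp_user_0": [
        {
            "question": "...",
            "answer": "...",            # model response
            "category": 2,
            "golden_answer": "...",
            "search_context": "...",    # consolidated memories used as context
            "response_duration_ms": 812.5,
            "search_duration_ms": 0,    # memories are read from disk
        },
        # one entry per non-category-5 question
    ],
    # one key per conversation in locomo10.json
}
```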

evaluation/scripts/locomo/openai_memory_locomo_eval_guide.md

Lines changed: 115 additions & 0 deletions
# OpenAI Memory on LoCoMo - Evaluation Guide

This document outlines the evaluation process for OpenAI's Memory feature using the LoCoMo dataset.

## 1. Introduction

Since OpenAI's [Memory feature](https://openai.com/index/memory-and-new-controls-for-chatgpt/) does not have a public API, the evaluation requires a manual process. Dialogues from the LoCoMo dataset are formatted and manually input into the ChatGPT web interface. The resulting memories are then retrieved from the account's memory management page and saved locally.

To evaluate the quality of these memories, we use the `gpt-4o-mini` model via the API. The model is asked questions from the LoCoMo dataset, with the full history of memories for the relevant conversation provided as context. This simulates a perfect memory retrieval system, giving the model the best possible information to answer each question.

## 2. Step-by-Step Workflow

### Step 2.1: Generate Input Context for Memory Extraction

Run the following Python script to generate the input prompts for each session in each conversation. The script creates a separate `.txt` file per session, containing the formatted conversation history and the extraction prompt.

**Script:**
```python
import json
import os

# Ensure the path to the dataset is correct
LOCOMO_DATA_PATH = "data/locomo/locomo10.json"
SAVE_DIR = "openai_inputs"

os.makedirs(SAVE_DIR, exist_ok=True)

TEMPLATE = """Can you please extract relevant information from this conversation and create memory entries for each user mentioned? Please store these memories in your knowledge base in addition to the timestamp provided for future reference and personalized interactions.

{context}
"""

with open(LOCOMO_DATA_PATH, "r", encoding="utf-8") as f:
    data = json.load(f)

for conv_idx, item in enumerate(data):
    conv = item["conversation"]

    for i in range(1, 35):
        session_key = f"session_{i}"
        session_dt_key = f"session_{i}_date_time"
        if session_key not in conv:
            continue

        session = conv[session_key]
        session_dt = conv[session_dt_key]

        session_context = ""
        for chat in session:
            chat_str = f"({session_dt}) {chat['speaker']}: {chat['text']}\n"
            session_context += chat_str

        input_string = TEMPLATE.format(context=session_context)

        output_filename = os.path.join(SAVE_DIR, f"{conv_idx}-D{i}.txt")
        with open(output_filename, "w", encoding="utf-8") as f:
            f.write(input_string)

print(f"Generated {len(os.listdir(SAVE_DIR))} input files in '{SAVE_DIR}' directory.")
```

**Example Input (`0-D9.txt`):**
```plaintext
Can you please extract relevant information from this conversation and create memory entries for each user mentioned? Please store these memories in your knowledge base in addition to the timestamp provided for future reference and personalized interactions.

(2:31 pm on 17 July, 2023) Melanie: Hey Caroline, hope all's good! I had a quiet weekend after we went camping with my fam two weekends ago. It was great to unplug and hang with the kids. What've you been up to? Anything fun over the weekend?
(2:31 pm on 17 July, 2023) Caroline: Hey Melanie! That sounds great! Last weekend I joined a mentorship program for LGBTQ youth - it's really rewarding to help the community.
... (rest of the conversation)
```
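
For orientation, here is a schematic of the `locomo10.json` fields that the scripts in this commit rely on. This is a sketch inferred from the field accesses above; the actual dataset entries carry additional metadata:

```python
# Schematic shape of one entry in locomo10.json, inferred from the scripts in
# this guide; values are illustrative placeholders, not real dataset content.
locomo_item = {
    "conversation": {
        "session_1": [
            {"speaker": "Melanie", "text": "Hey Caroline, hope all's good! ..."},
            # ... more turns in this session
        ],
        "session_1_date_time": "2:31 pm on 17 July, 2023",
        # ... session_2 / session_2_date_time, and so on
    },
    "qa": [
        {"question": "...", "answer": "...", "category": 1},
        # category-5 entries are skipped by locomo_openai.py
    ],
}
```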

### Step 2.2: Extract and Save Memories from ChatGPT

1. **Enable Memory:** In ChatGPT, go to **Settings -> Personalization** and ensure **Memory** is turned on.
2. **Clear Existing Memories:** Before processing a new conversation, click **Manage** and then **Clear all** to ensure a clean slate.
3. **Input and Verify:**
   * Open a new chat.
   * Ensure the model is set to **GPT-4o**.
   * Copy the content of a generated `.txt` file (e.g., `0-D1.txt`) and paste it into the chat.
   * After the model responds, verify that you see the "Memory updated" confirmation.
4. **Save Memories:**
   * Click **Manage** in the memory confirmation to view the newly generated memories.
   * Create a new local `.txt` file with the same name as the input file (e.g., `0-D1.txt`).
   * Copy each memory entry from ChatGPT and paste it into the new file, one memory per line.
5. **Reset Memories for the Next Conversation:**
   * Once all sessions for a conversation are complete, it is essential to **delete all memories to ensure a clean state for the next conversation**. Navigate to **Settings -> Personalization -> Manage** and click **Delete all**.

**Example Memory Output (`0-D9.txt`):**
```plaintext
As of November 17, 2023, Dave has taken up photography and enjoys capturing nature scenes like sunsets, beaches, waves, rocks, and waterfalls.
Dave recently purchased a vintage camera that takes high-quality photos.
Dave discovered a serene park nearby with a peaceful spot featuring a bench under a tree with pink flowers.
As of November 17, 2023, Calvin attended a fancy gala in Boston where he had an inspiring conversation with an artist about music and art.
Calvin finds music a powerful connector and source of creativity.
Calvin took a photo in a Japanese garden that he shared with Dave.
Calvin accepted an invitation to perform at an upcoming show in Boston, expressing excitement about the musical experience.
```

### Step 2.3: Consolidate Memories

The memories are currently saved per session. Write a small script to consolidate all memories belonging to the same conversation into a single file. For example, all memories from `0-D1.txt`, `0-D2.txt`, etc. should be merged into one file for conversation 0. Note that the evaluation script (`locomo_openai.py`) reads the consolidated memories from `openai_memory/{idx}.txt`, so the merged file for conversation 0 should be saved as `openai_memory/0.txt`. A minimal consolidation sketch is shown below.
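
The following is a minimal sketch, assuming the per-session memory files saved in Step 2.2 are collected in a local `openai_memory_sessions/` directory (that directory name is a hypothetical choice; only the `openai_memory/{idx}.txt` output path is fixed, since it is what `locomo_openai.py` reads):

```python
import os

SESSION_DIR = "openai_memory_sessions"  # hypothetical location of per-session files
OUTPUT_DIR = "openai_memory"            # path expected by locomo_openai.py
NUM_CONVERSATIONS = 10                  # locomo10.json contains 10 conversations
MAX_SESSIONS = 35                       # matches the session range in Step 2.1

os.makedirs(OUTPUT_DIR, exist_ok=True)

for conv_idx in range(NUM_CONVERSATIONS):
    merged = []
    for session_idx in range(1, MAX_SESSIONS):
        path = os.path.join(SESSION_DIR, f"{conv_idx}-D{session_idx}.txt")
        if not os.path.exists(path):
            continue  # not every conversation has every session
        with open(path, encoding="utf-8") as f:
            merged.append(f.read().strip())
    out_path = os.path.join(OUTPUT_DIR, f"{conv_idx}.txt")
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(merged) + "\n")
    print(f"Wrote {out_path} ({len(merged)} session files merged)")
```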

### Step 2.4: Automated Evaluation

Once the memories for all conversations have been extracted and saved, you can run the automated [evaluation script](../run_openai_eval.sh). It handles generating answers, evaluating them, and calculating metrics.

```bash
# Edit the configuration in ./scripts/run_openai_eval.sh
./scripts/run_openai_eval.sh
```

## 3. Considerations

- **Account Differences:** Be aware of potential differences between free and Plus accounts, such as context-length limitations and the number of memories that can be stored.
- **Granularity:** The evaluation process adds memories at the session level. To ensure high-quality memory extraction, follow this same principle: feeding an entire conversation to the model at once has been shown to be ineffective, often causing it to overlook important details and leading to substantial information loss.

evaluation/scripts/run_openai_eval.sh

Lines changed: 31 additions & 0 deletions
```bash
#!/bin/bash

# Common parameters for all scripts
LIB="openai"
VERSION="063001"
WORKERS=10
NUM_RUNS=3


echo "Running locomo_openai.py..."
python scripts/locomo/locomo_openai.py --version $VERSION
if [ $? -ne 0 ]; then
    echo "Error running locomo_openai.py."
    exit 1
fi

echo "Running locomo_eval.py..."
python scripts/locomo/locomo_eval.py --lib $LIB --version $VERSION --num_runs $NUM_RUNS
if [ $? -ne 0 ]; then
    echo "Error running locomo_eval.py"
    exit 1
fi

echo "Running locomo_metric.py..."
python scripts/locomo/locomo_metric.py --lib $LIB --version $VERSION
if [ $? -ne 0 ]; then
    echo "Error running locomo_metric.py"
    exit 1
fi

echo "All scripts completed successfully!"
```
