missing trials when doing local experiment with runners-cpus

Hi @DonggeLiu @jonathanmetzman

Lately I've been running a lot of local experiments on FuzzBench, and I noticed that after I added the `--runners-cpus` flag, reports were sometimes incomplete because of a race condition.

This is my config:
```
# The number of trials of a fuzzer-benchmark pair.
trials: 5

# The amount of time in seconds that each trial is run for.
# 1 day = 24 * 60 * 60 = 86400
max_total_time: 3600

# The location of the docker registry.
# FIXME: Support custom docker registry.
# See https://github.com/google/fuzzbench/issues/777
docker_registry: gcr.io/fuzzbench

# The local experiment folder that will store most of the experiment data.
# Please use an absolute path.
experiment_filestore: /home/zuka/hexhive/data/local-runs/experiment-data

# The local report folder where HTML reports and summary data will be stored.
# Please use an absolute path.
report_filestore: /home/zuka/hexhive/data/local-runs/report-data

# Flag that indicates this is a local experiment.
local_experiment: true
```
and I use this command to start the experiment:
```
PYTHONPATH=. python3 experiment/run_experiment.py \
--experiment-config experiment-config.yaml \
--benchmarks curl_curl_fuzzer_http freetype2_ftfuzzer bloaty_fuzz_target jsoncpp_jsoncpp_fuzzer libxml2_xml sqlite3_ossfuzz vorbis_decode_fuzzer \
--experiment-name libafl-1h-with-seeds \
--fuzzers libafl_default libafl_random libafl_weighted libafl_valprof libafl_covaccount \
--concurrent-builds 15 --runners-cpus 15 --measurers-cpus 1
```


Besides restricting the number of usable CPUs, `--runners-cpus` also adds CPU pinning to the docker command. Most of the time I get only the first cycle of trials (for example, if I run with `--runners-cpus 16`, only 16 trials appear in the report). For the remaining trials there are fuzzer logs and corpus archives, but no coverage archives.

The reason is that `measurer_main_process` ends before the next cycle of trials starts. I see `Finished measure loop.` in the logs after the first cycle, and the loop is never restarted.

After some more debugging, I found the issue in this piece of code inside `measure_manager_loop`:

```python3
        while not scheduler.all_trials_ended(experiment):
            continue_inner_loop = measure_manager_inner_loop(
                experiment, max_cycle, request_queue, response_queue,
                queued_snapshots)
            if not continue_inner_loop:
                break
            time.sleep(MEASUREMENT_LOOP_WAIT)
```
After the first cycle ends, `measure_manager_inner_loop` returns `False` and the loop breaks, because there are no unmeasured snapshots in the database *yet*.

I don't really understand the need for this break, so to fix the issue for my runs I removed the `break` logic from the measurer loop and let it run until `scheduler.all_trials_ended` reports that all trials have ended. If you think this is an acceptable solution, I can create a PR.
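
To make the race concrete, here is a toy model of the outer loop. It is only a sketch and not FuzzBench code: `run_measurer`, `snapshot_batches` and `break_when_idle` are hypothetical names, and the snapshot counts are made up. The model assumes the inner loop reports `False` whenever it finds nothing to measure, as described above.

```python
# Toy model only: every name here is a hypothetical stand-in, not FuzzBench code.
from collections import deque


def run_measurer(snapshot_batches, break_when_idle):
    """Simulate the outer measurer loop.

    snapshot_batches: one entry per loop iteration, giving how many
        unmeasured snapshots are visible at that moment. Running out of
        entries stands in for "all trials have ended".
    break_when_idle: if True, mimic the current behaviour and stop as
        soon as one iteration finds nothing to measure.
    Returns the total number of snapshots measured.
    """
    pending = deque(snapshot_batches)
    measured = 0
    while pending:  # stands in for `not scheduler.all_trials_ended(experiment)`
        found = pending.popleft()
        measured += found
        continue_inner_loop = found > 0  # stands in for the inner loop's result
        if break_when_idle and not continue_inner_loop:
            break
    return measured


# Illustrative timeline: the first cycle yields 16 snapshots, then there is
# a gap while the second cycle of trials starts, then 9 more snapshots appear.
timeline = [16, 0, 9]
with_break = run_measurer(timeline, break_when_idle=True)
without_break = run_measurer(timeline, break_when_idle=False)
print(with_break, without_break)
```

With the break, the model stops during the gap and never sees the second batch. Without it, the loop keeps polling until the trials are over and measures everything.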







missing trials when doing local experiment with runners-cpus #2075

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

missing trials when doing local experiment with runners-cpus #2075

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions