Summary
Tidier.jl is already in Project.toml but underutilized. Refactor aggregation code to use @chain, @summarize, and @group_by patterns for cleaner, more readable data transformations.
Motivation
Current situation:
src/CampaignAnalysis.jl has manual loop-based aggregations:
for deg in degrees
    metric_values = []
    exp_ids = []
    for (exp_id, stats) in exp_stats
        if haskey(stats, metric_label)
            metric_data = stats[metric_label]
            key_value = extract_key_metric_value(metric_label, metric_data)
            if key_value !== nothing && !isnan(key_value)
                push!(metric_values, key_value)
                push!(exp_ids, exp_id)
            end
        end
    end
    # Compute statistics
    push!(results, mean(metric_values))
end
Problems:
- Verbose and imperative
- Hard to read and maintain
- Error-prone (manual array management)
- Doesn't leverage existing Tidier dependency
Better with Tidier.jl:
@chain campaign_df begin
    @group_by(degree, experiment_id)
    @summarize(
        mean_l2 = mean(l2_error),
        std_l2 = std(l2_error),
        min_l2 = minimum(l2_error),
        max_l2 = maximum(l2_error),
        num_points = n()
    )
    @arrange(degree, mean_l2)
end
Benefits:
- ✅ More readable and declarative
- ✅ Less code (fewer lines, less complexity)
- ✅ Leverages existing dependency (already in Project.toml!)
- ✅ Familiar syntax for R/dplyr users
- ✅ Easier to test and debug
- ✅ Potentially better performance (Tidier.jl delegates to DataFrames.jl grouped operations; to be confirmed by the benchmarks listed under Tasks)
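The chain above assumes a long-format `campaign_df`, while the current loop reads a nested `exp_stats` dictionary. Below is a minimal sketch of bridging the two. The `exp_stats` layout used here (experiment id => metric label => named tuple with `degree` and `value`) is a hypothetical stand-in; the real structure and the logic inside `extract_key_metric_value` may differ.

```julia
using DataFrames   # DataFrame constructor
using Statistics   # mean
using Tidier       # @chain, @group_by, @summarize, @arrange

# Hypothetical stand-in for the nested `exp_stats` structure.
exp_stats = Dict(
    "exp_a" => Dict("l2_error" => (degree = 4, value = 0.10)),
    "exp_b" => Dict("l2_error" => (degree = 4, value = 0.30)),
    "exp_c" => Dict("l2_error" => (degree = 6, value = NaN)),   # dropped below
    "exp_d" => Dict("l2_error" => (degree = 6, value = 0.05)),
)

metric_label = "l2_error"

# Flatten to one row per experiment, keeping the same guards as the loop
# (metric present, value not NaN).
rows = [
    (experiment_id = exp_id,
     degree = stats[metric_label].degree,
     l2_error = stats[metric_label].value)
    for (exp_id, stats) in exp_stats
    if haskey(stats, metric_label) && !isnan(stats[metric_label].value)
]
campaign_df = DataFrame(rows)

# Aggregate per degree with the same macros as the chain above.
summary_df = @chain campaign_df begin
    @group_by(degree)
    @summarize(
        mean_l2 = mean(l2_error),
        num_points = n()
    )
    @arrange(degree)
end

println(summary_df)
```

Doing the flattening once up front keeps the validity checks in one place, so every later aggregation can be a short chain over `campaign_df`.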
Current Tidier Usage
Already in dependencies:
[deps]
Tidier = "f0413319-3358-4bb0-8e7c-0c83523a93bd"
But only used minimally in the codebase. We're paying the dependency cost without getting the benefits!
Files to Refactor
Primary targets:
- src/CampaignAnalysis.jl - aggregate_metrics_across_experiments()
- src/CampaignAnalysis.jl - aggregate_csv_metrics_by_degree()
- src/StatisticsCompute.jl - Various aggregation helpers
Example Refactors
Before:
function compute_mean_by_degree(data::Dict)
    results = Dict{Int, Float64}()
    for (deg, values) in data
        results[deg] = mean(values)
    end
    return results
end
After:
function compute_mean_by_degree(df::DataFrame)
    @chain df begin
        @group_by(degree)
        @summarize(mean_value = mean(value))
    end
end
Tasks
- Audit codebase for manual aggregation loops
- Identify top 5 functions to refactor (prioritize readability gains)
- Refactor aggregate_metrics_across_experiments() to use Tidier
- Refactor aggregate_csv_metrics_by_degree() to use Tidier
- Add Tidier.jl usage examples to documentation
- Create coding style guide recommending Tidier for aggregations
- Add performance benchmarks (Tidier vs manual loops)
- Update CONTRIBUTING.md with Tidier best practices
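For the refactor and benchmark tasks, one possible way to check that a Tidier version matches the loop version before swapping it in is sketched below, based on the `compute_mean_by_degree` example above. The names `mean_by_degree_loop` and `mean_by_degree_tidier` and the sample data are hypothetical. Note that the return type changes from `Dict` to `DataFrame`, so callers that expect the old type need a conversion step. `@elapsed` gives only a rough timing; a dedicated tool such as BenchmarkTools.jl would likely give more reliable numbers.

```julia
using DataFrames
using Statistics
using Tidier

# Loop-based version, mirroring the "Before" example.
function mean_by_degree_loop(data::Dict)
    results = Dict{Int, Float64}()
    for (deg, values) in data
        results[deg] = mean(values)
    end
    return results
end

# Tidier-based version, mirroring the "After" example.
function mean_by_degree_tidier(df::DataFrame)
    @chain df begin
        @group_by(degree)
        @summarize(mean_value = mean(value))
    end
end

# Hypothetical sample data, in both the Dict and the long DataFrame shape.
data = Dict(2 => [1.0, 3.0], 4 => [2.0, 4.0, 6.0])
df = DataFrame(
    degree = [deg for (deg, vals) in data for _ in vals],
    value  = [v for (_, vals) in data for v in vals],
)

loop_result = mean_by_degree_loop(data)
tidier_df = mean_by_degree_tidier(df)

# Convert back to a Dict for callers that still expect the old return type.
tidier_result = Dict(zip(tidier_df.degree, tidier_df.mean_value))

# Equivalence check: both versions should agree on every degree.
@assert keys(loop_result) == keys(tidier_result)
@assert all(isapprox(loop_result[k], tidier_result[k]) for k in keys(loop_result))

# Rough timing. Both functions were already called once above,
# so compilation time is not included here.
t_loop = @elapsed mean_by_degree_loop(data)
t_tidier = @elapsed mean_by_degree_tidier(df)
println("loop: $(t_loop) s, Tidier: $(t_tidier) s")
```

On data this small the timings say little; the point of the sketch is the equivalence check, which could be reused as a regression test for each refactored function.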
Documentation Additions
Add section to README:
## Data Aggregation with Tidier.jl
This package uses Tidier.jl for data aggregation following dplyr-style patterns:
# Example: Aggregate campaign metrics by degree
@chain campaign_results begin
    @group_by(degree)
    @summarize(
        mean_error = mean(l2_error),
        experiments = n()
    )
end
Priority
MEDIUM - Code quality improvement, leverages existing dependency
Estimated Effort
1-2 weeks (audit, refactor, test, document)