Leverage existing Tidier.jl dependency for cleaner aggregation code #4

@gescholt

Summary

Tidier.jl is already in Project.toml but is underutilized. Refactor the aggregation code to use @chain, @group_by, and @summarize patterns for cleaner, more readable data transformations.

Motivation

Current situation:

src/CampaignAnalysis.jl has manual loop-based aggregations:

for deg in degrees
    metric_values = []
    exp_ids = []
    
    for (exp_id, stats) in exp_stats
        if haskey(stats, metric_label)
            metric_data = stats[metric_label]
            key_value = extract_key_metric_value(metric_label, metric_data)
            
            if key_value !== nothing && !isnan(key_value)
                push!(metric_values, key_value)
                push!(exp_ids, exp_id)
            end
        end
    end
    
    # Compute statistics
    push!(results, mean(metric_values))
end

Problems:

  • Verbose and imperative
  • Hard to read and maintain
  • Error-prone (manual array management)
  • Doesn't leverage existing Tidier dependency

Better with Tidier.jl:

@chain campaign_df begin
    @group_by(degree, experiment_id)
    @summarize(
        mean_l2 = mean(l2_error),
        std_l2 = std(l2_error),
        min_l2 = minimum(l2_error),
        max_l2 = maximum(l2_error),
        num_points = n()
    )
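    # @summarize removes only the last grouping level (experiment_id),
    # so ungroup before sorting to get a plain DataFrame back.
    @ungroup
    @arrange(degree, mean_l2)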
end
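
The pipeline above assumes a tidy campaign_df with one row per measurement and columns degree, experiment_id, and l2_error. This issue does not show how that table is built from the nested exp_stats dictionary, so the following is only a minimal sketch under an assumed input layout: the `raw` dictionary and its (degree, l2_error) records are hypothetical stand-ins, not the real structure in src/CampaignAnalysis.jl.

```julia
using DataFrames
using Statistics
using Tidier

# Hypothetical input layout (an assumption, not the real `exp_stats` structure):
# experiment id => vector of (degree, l2_error) measurements.
raw = Dict(
    "exp_a" => [(degree = 4, l2_error = 0.10), (degree = 4, l2_error = 0.30),
                (degree = 6, l2_error = 0.02)],
    "exp_b" => [(degree = 4, l2_error = 0.20), (degree = 6, l2_error = NaN)],
)

# Flatten to one row per measurement, dropping NaN values up front
# (the same filter the manual loop applies with `isnan`).
campaign_df = DataFrame(
    [(experiment_id = id, degree = m.degree, l2_error = m.l2_error)
     for (id, ms) in raw for m in ms if !isnan(m.l2_error)]
)

# Once the data is tidy, the per-degree aggregation is a short pipeline.
summary_df = @chain campaign_df begin
    @group_by(degree)
    @summarize(mean_l2 = mean(l2_error), num_points = n())
    @arrange(degree)
end
```

Doing the NaN filtering once, while building the table, keeps the aggregation step free of validity checks.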

Benefits:

  • ✅ More readable and declarative
  • ✅ Less code (fewer lines, less complexity)
  • ✅ Leverages existing dependency (already in Project.toml!)
  • ✅ Familiar syntax for R/dplyr users
  • ✅ Easier to test and debug
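  • ✅ Potentially better performance (Tidier macros expand to DataFrames.jl operations, and the current loops collect into untyped arrays); to be confirmed by the benchmarks listed under Tasks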

Current Tidier Usage

Already in dependencies:

[deps]
Tidier = "f0413319-3358-4bb0-8e7c-0c83523a93bd"

However, it is used only minimally in the codebase, so we're paying the dependency cost without getting the benefits!

Files to Refactor

Primary targets:

  1. src/CampaignAnalysis.jl - aggregate_metrics_across_experiments()
  2. src/CampaignAnalysis.jl - aggregate_csv_metrics_by_degree()
  3. src/StatisticsCompute.jl - Various aggregation helpers

Example Refactors

Before:

function compute_mean_by_degree(data::Dict)
    results = Dict{Int, Float64}()
    for (deg, values) in data
        results[deg] = mean(values)
    end
    return results
end

After (takes and returns a DataFrame instead of a Dict):

function compute_mean_by_degree(df::DataFrame)
    @chain df begin
        @group_by(degree)
        @summarize(mean_value = mean(value))
    end
end
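
Because the refactored function returns a DataFrame while the original returned a Dict{Int, Float64}, existing callers may need a bridge during migration. A minimal sketch of such an adapter follows; `mean_by_degree_dict` is a hypothetical helper name, not something that exists in the codebase.

```julia
using DataFrames
using Statistics
using Tidier

function compute_mean_by_degree(df::DataFrame)
    @chain df begin
        @group_by(degree)
        @summarize(mean_value = mean(value))
    end
end

# Hypothetical adapter: reproduces the old Dict{Int, Float64} return type
# for callers that have not been migrated to DataFrames yet.
function mean_by_degree_dict(df::DataFrame)
    summarized = compute_mean_by_degree(df)
    return Dict{Int, Float64}(zip(summarized.degree, summarized.mean_value))
end

df = DataFrame(degree = [2, 2, 4], value = [1.0, 3.0, 5.0])
means = mean_by_degree_dict(df)
```

The adapter can be deleted once all call sites consume the DataFrame directly.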

Tasks

  • Audit codebase for manual aggregation loops
  • Identify top 5 functions to refactor (prioritize readability gains)
  • Refactor aggregate_metrics_across_experiments() to use Tidier
  • Refactor aggregate_csv_metrics_by_degree() to use Tidier
  • Add Tidier.jl usage examples to documentation
  • Create coding style guide recommending Tidier for aggregations
  • Add performance benchmarks (Tidier vs manual loops)
  • Update CONTRIBUTING.md with Tidier best practices
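
For the benchmark task, one possible starting point is sketched below. It assumes BenchmarkTools.jl is available in the benchmark environment (it is not mentioned in this issue), uses synthetic data, and the function names `loop_mean_by_degree` and `tidier_mean_by_degree` are hypothetical. It first checks that both implementations agree, then times them.

```julia
using BenchmarkTools   # assumption: added to the benchmark environment
using DataFrames
using Statistics
using Tidier

# Manual baseline: collect values per degree in a typed dictionary.
function loop_mean_by_degree(degrees::Vector{Int}, values::Vector{Float64})
    groups = Dict{Int, Vector{Float64}}()
    for (d, v) in zip(degrees, values)
        push!(get!(() -> Float64[], groups, d), v)
    end
    return Dict(d => mean(vs) for (d, vs) in groups)
end

# Tidier version of the same aggregation.
function tidier_mean_by_degree(df::DataFrame)
    @chain df begin
        @group_by(degree)
        @summarize(mean_value = mean(value))
    end
end

# Synthetic data: six degrees with 1,000 measurements each.
degrees = repeat(2:2:12, inner = 1_000)
values = collect(range(0.0, 1.0; length = length(degrees)))
df = DataFrame(degree = degrees, value = values)

loop_result = loop_mean_by_degree(degrees, values)
tidier_result = tidier_mean_by_degree(df)

# The two implementations should agree before any timings are compared.
@assert all(isapprox(loop_result[r.degree], r.mean_value) for r in eachrow(tidier_result))

# `@belapsed` returns the minimum observed time in seconds;
# `$` interpolation keeps global-variable overhead out of the measurement.
t_loop = @belapsed loop_mean_by_degree($degrees, $values)
t_tidier = @belapsed tidier_mean_by_degree($df)
println("loop: ", t_loop, " s, Tidier: ", t_tidier, " s")
```

Timings will depend on data size and group count, so the real benchmarks should use representative campaign data rather than this synthetic table.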

Documentation Additions

Add section to README:

## Data Aggregation with Tidier.jl

This package uses Tidier.jl for data aggregation following dplyr-style patterns:

# Example: Aggregate campaign metrics by degree
@chain campaign_results begin
    @group_by(degree)
    @summarize(
        mean_error = mean(l2_error),
        experiments = n()
    )
end

Priority

MEDIUM - Code quality improvement, leverages existing dependency

Estimated Effort

1-2 weeks (audit, refactor, test, document)
