FEAT: Dataset Loading Changes #1451

Merged
ValbuenaVC merged 47 commits into Azure:main from ValbuenaVC:datasetloader
Mar 19, 2026

ValbuenaVC commented Mar 10, 2026

Description

Features:

  • Adds a filters argument (type SeedDatasetFilters) to get_all_dataset_names; datasets that don't meet the filter criteria are left out of the result.
  • SeedDatasetProviders have two options for storing static metadata (dynamic metadata, such as attributes derived at runtime, has been scoped out of this PR):
    • Remote datasets store it directly as named class attributes (e.g. harm_categories), using types like SeedDatasetFoobar.
    • Local datasets store it as tags in the *.prompt file, from which it is extracted.
  • In all cases, SeedDatasetMetadata acts as the unified schema and source of truth for metadata-parsing logic, and we expect only a few class attributes to count as metadata.
  • Datasets that have no metadata fields never match a filter and so are left out of filtered results, with one exception: if the filter asks for tags = {"all"}, all filtering logic is bypassed.
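
For reviewers who want a concrete picture of the filtering rules above, here is a minimal, self-contained sketch. It does not use the real classes from this PR: MetadataSketch, FiltersSketch, the two provider classes, extract_metadata, and get_names are all hypothetical stand-ins, and the "any overlap counts as a match" rule and the "no filter returns everything" behavior are assumptions made for illustration, not the merged implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MetadataSketch:
    """Hypothetical stand-in for a unified metadata schema."""
    tags: frozenset = frozenset()
    harm_categories: frozenset = frozenset()


@dataclass(frozen=True)
class FiltersSketch:
    """Hypothetical stand-in for a filter object; None means 'no constraint'."""
    tags: Optional[frozenset] = None
    harm_categories: Optional[frozenset] = None


class RemoteProviderSketch:
    """Remote-style provider: static metadata lives in named class attributes."""
    dataset_name = "remote_example"
    tags = frozenset({"safety"})
    harm_categories = frozenset({"violence"})


class BareProviderSketch:
    """Provider that declares no metadata fields at all."""
    dataset_name = "bare_example"


def extract_metadata(provider) -> Optional[MetadataSketch]:
    """Build metadata from class attributes, or return None if none are declared."""
    tags = getattr(provider, "tags", None)
    harms = getattr(provider, "harm_categories", None)
    if tags is None and harms is None:
        return None
    return MetadataSketch(frozenset(tags or ()), frozenset(harms or ()))


def matches(metadata: MetadataSketch, filters: FiltersSketch) -> bool:
    """Assumed rule: each constrained field must share at least one value."""
    if filters.tags is not None and not (filters.tags & metadata.tags):
        return False
    if filters.harm_categories is not None and not (
        filters.harm_categories & metadata.harm_categories
    ):
        return False
    return True


def get_names(providers, filters: Optional[FiltersSketch] = None) -> list:
    """Return dataset names, applying the exclusion and bypass rules."""
    # No filter (assumed behavior) or tags == {"all"}: bypass all filtering.
    if filters is None or filters.tags == frozenset({"all"}):
        return sorted(p.dataset_name for p in providers)
    names = []
    for provider in providers:
        metadata = extract_metadata(provider)
        if metadata is None:
            continue  # datasets without metadata never match a filter
        if matches(metadata, filters):
            names.append(provider.dataset_name)
    return sorted(names)
```

With these stand-ins, get_names([RemoteProviderSketch, BareProviderSketch], FiltersSketch(tags=frozenset({"safety"}))) returns only "remote_example", while passing tags=frozenset({"all"}) returns both names, including the provider with no metadata.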

Notes for Follow-Up PRs:

  • Populating dataset metadata is a follow-up item that's out of scope here.
  • Derived attributes (e.g. exact size after downloading a dataset) are also out of scope.

Tests and Documentation

  • Addition of test_seed_dataset_metadata.py under unit tests.
  • Addition of two integration tests under test_seed_dataset_provider_integration.py in tests.integration.datasets to account for local and remote metadata population.

ValbuenaVC requested a review from rlundeen2 March 13, 2026 20:02
ValbuenaVC marked this pull request as ready for review March 13, 2026 20:02
ValbuenaVC changed the title from "[DRAFT] FEAT: Dataset Loading Changes" to "FEAT: Dataset Loading Changes" Mar 13, 2026
ValbuenaVC and others added 7 commits March 18, 2026 13:23
Co-authored-by: hannahwestra25 <hannahwestra@microsoft.com>

hannahwestra25 left a comment

lgtm thanks for addressing all my comments 🙂

ValbuenaVC merged commit cfc56d1 into Azure:main Mar 19, 2026
38 checks passed

3 participants