Conversation
Pull Request Overview
This PR adds two new source types—DataFrameSource and ListSource—to enable streaming data from pandas DataFrames and Python lists with customizable tagging.
- Introduces DataFrameSource for iterating over DataFrame rows as tagged packets
- Introduces ListSource for iterating over list elements as tagged packets
- Updates src/orcapod/core/sources.py with new imports and class definitions
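For reference, a minimal usage sketch of the two sources. The constructor arguments and the iteration protocol shown here are assumptions based on this review (tagged packets keyed by row_index / element_index), not the confirmed API:

import pandas as pd
from orcapod.core.sources import DataFrameSource, ListSource

# Assumed usage; the real signatures may differ.
df = pd.DataFrame({"name": ["alice", "bob"], "value": [1, 2]})
df_source = DataFrameSource(df)            # default tag assumed to be {"row_index": <idx>}
list_source = ListSource(["x", "y", "z"])  # default tag assumed to be {"element_index": <idx>}

# Assuming the sources yield (tag, packet) pairs when iterated:
for tag, packet in df_source:
    print(tag, packet)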
Comments suppressed due to low confidence (3)
src/orcapod/core/sources.py:336
- The docstring refers to 'expected_keys' but the parameter is named 'expected_tag_keys'. Update the doc to match the code.
If expected_keys are provided, they will be used instead of the default keys.
src/orcapod/core/sources.py:494
- The docstring refers to 'expected_keys' but the parameter is named 'expected_tag_keys'. Update the doc to match the code.
If expected_keys are provided, they will be used instead of the default keys.
src/orcapod/core/sources.py:207
- New classes DataFrameSource and ListSource have been introduced without accompanying unit tests. Please add tests to cover the core functionality and edge cases.
class DataFrameSource(Source):
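A hedged sketch of the kind of coverage being requested follows. It assumes the sources are directly iterable, yield (tag, packet) pairs, use default tag keys row_index and element_index, and that DataFrameSource accepts a columns argument; none of this is confirmed by the diff shown here.

import pandas as pd
import pytest

from orcapod.core.sources import DataFrameSource, ListSource

def test_dataframe_source_yields_one_tagged_packet_per_row():
    df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
    results = list(DataFrameSource(df))
    assert len(results) == 2
    tag, packet = results[0]
    assert tag == {"row_index": 0}          # assumed default tag key
    assert packet == {"a": 1, "b": "x"}

def test_dataframe_source_rejects_missing_columns():
    df = pd.DataFrame({"a": [1]})
    with pytest.raises(ValueError):
        DataFrameSource(df, columns=["not_a_column"])  # assumed 'columns' parameter

def test_list_source_yields_one_tagged_packet_per_element():
    results = list(ListSource(["x", "y"]))
    assert len(results) == 2
    tag, _packet = results[0]
    assert tag == {"element_index": 0}      # assumed default tag key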
from typing import Any, Literal

import pandas as pd
import polars as pl
The import 'polars as pl' is not used in this file. Consider removing it to keep imports clean.
for idx, row in self.dataframe.iterrows():
    tag = self.tag_function(row, idx)
    packet = {col: row[col] for col in self.columns}
Using DataFrame.iterrows() can be slow for large data sets. Consider iterating with itertuples or another vectorized approach for better performance.
Suggested change:
-    for idx, row in self.dataframe.iterrows():
-        tag = self.tag_function(row, idx)
-        packet = {col: row[col] for col in self.columns}
+    for row in self.dataframe.itertuples(index=True, name=None):
+        idx, *values = row
+        row_series = pd.Series(values, index=self.dataframe.columns)
+        tag = self.tag_function(row_series, idx)
+        packet = {col: row_series[col] for col in self.columns}
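Another option, sketched below, skips the per-row Series reconstruction entirely; it assumes tag_function can accept a plain dict instead of a pandas Series, which the current signature may not allow:

# Assumes tag_function tolerates a mapping rather than a pandas Series.
for idx, record in zip(self.dataframe.index, self.dataframe.to_dict(orient="records")):
    tag = self.tag_function(record, idx)
    packet = {col: record[col] for col in self.columns}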
if missing_columns:
    raise ValueError(f"Columns not found in DataFrame: {missing_columns}")

if tag_function is None:
[nitpick] Similar tag_function initialization logic appears in both DataFrameSource and ListSource—consider abstracting this into a shared helper or the base class to reduce duplication.
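One possible shape for that refactor, assuming the two defaults differ only in the tag key they emit ("row_index" vs "element_index"); the helper name here is hypothetical:

def _default_index_tagger(tag_key: str):
    """Build a tag_function that tags each item with its position under tag_key."""
    def tagger(item, idx):
        return {tag_key: idx}
    return tagger

# Hypothetical use in the two constructors:
#   self.tag_function = tag_function or _default_index_tagger("row_index")      # DataFrameSource
#   self.tag_function = tag_function or _default_index_tagger("element_index")  # ListSource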
I manually rebased the brian branch from the dev branch and added this code back in.
I addressed your previous comments about specifying expected_tag_keys as row_index for the DataFrameSource and element_index for the ListSource. There was one comment you made that I didn't know how to interpret, left on the line specifying the tag_function in the __init__ of the DataFrameSource class:
"Would you mind adding a feature where if a list of strings are passed, they are interpreted as columns whose values should be used as the tags?"
I'm assuming you mean the tags should have the same keys as the column names passed in? Just double checking.
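If that reading is right, one way to support it, sketched under the assumption that tag_function is allowed to also be a list of column names:

if isinstance(tag_function, (list, tuple)):
    tag_columns = list(tag_function)

    def tag_function(row, idx, _cols=tag_columns):
        # Tag keys mirror the given column names; values come from the row itself.
        return {col: row[col] for col in _cols}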