Conditional value selection using CASE WHEN semantics — mirrors pandas.Series.case_when() (pandas 2.2+).
+
+
+
1 — Basic grade classification
+
caseWhen(series, caselist) applies an ordered list of [condition, replacement] pairs. The first matching condition determines the output; if no condition matches, the original value is kept.
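The branch semantics described above can be illustrated without tsb itself. The sketch below is a minimal stand-in that works on plain arrays rather than Series; `caseWhenSketch`, `CaseCondition`, and `CaseReplacement` are hypothetical names introduced here, and the real `caseWhen` signature may differ.

```ts
// Hypothetical stand-in for caseWhen that works on plain arrays, not tsb Series.
type CaseCondition = readonly boolean[] | ((value: number, index: number) => boolean);
type CaseReplacement = string | readonly string[];

function caseWhenSketch(
  values: readonly number[],
  caselist: readonly (readonly [CaseCondition, CaseReplacement])[],
): (number | string)[] {
  return values.map((value, i) => {
    for (const [condition, replacement] of caselist) {
      const matched =
        typeof condition === "function" ? condition(value, i) : condition[i] === true;
      if (matched) {
        // First matching branch wins; later branches are never consulted.
        return typeof replacement === "string" ? replacement : (replacement[i] ?? value);
      }
    }
    // No branch matched: keep the original value.
    return value;
  });
}

const examScores = [95, 82, 67, 40];
const examGrades = caseWhenSketch(examScores, [
  [(v: number) => v >= 90, "A"],
  [(v: number) => v >= 80, "B"],
  [examScores.map((v) => v >= 60), "C"], // boolean-array condition
]);
console.log(examGrades); // ["A", "B", "C", 40]
```

The last score matches no branch, so it passes through unchanged, which is the "no implicit else" behaviour the later sections rely on.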
+
+
+
+
+
+
+
2 — Using boolean Series as conditions
+
Conditions can be boolean Series objects (e.g. from comparison operations).
+
+
+
+
+
+
+
3 — Using predicate functions
+
Conditions can be predicate functions (value, index) => boolean.
+
+
+
+
+
+
+
4 — Series as replacement values
+
Replacements can be Series objects — the matching positional value is used.
+
+
+
+
+
+
+
5 — Unmatched rows keep original values
+
Any row not matched by any condition retains its original value — there is no implicit "else" replacement.
+
+
+
+
+
+
+
6 — First matching condition wins
+
When multiple conditions match the same row, the first one in caselist takes effect — just like CASE WHEN … THEN … WHEN … THEN … END in SQL.
+
+
+
+
+
+
+
7 — Positional index in predicate
+
Predicate functions receive both the value and its positional index as the second argument.
+
+
+
+
+
+
+
8 — String Series classification
+
caseWhen works on any Series type — numbers, strings, booleans, or mixed.
+
+
+
+
+
+
+
9 — Comparison with where / mask
+
caseWhen generalises whereSeries to multiple branches. Use whereSeries for a single condition; use caseWhen for multi-branch logic.
+
+
+
+
+
+
+
+
+
+
diff --git a/playground/flags.html b/playground/flags.html
new file mode 100644
index 00000000..18c8cbf6
--- /dev/null
+++ b/playground/flags.html
@@ -0,0 +1,300 @@
+
+
+
+
+
+ tsb — Flags: metadata for DataFrame and Series
+
+
+
+
+
Reshape with aggregation. pivot() for unique reshaping; pivotTable() for aggregation (mean/sum/count/min/max/first/last) with fill_value and dropna support.
Metadata flags for DataFrame and Series. The flags getter returns a Flags object with allowsDuplicateLabels property. Setting allowsDuplicateLabels = false on an object with duplicate index labels raises DuplicateLabelError. Mirrors pandas.DataFrame.flags / pandas.core.flags.Flags.
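As a rough illustration of the flag behaviour described above, the sketch below keeps per-object state in a `WeakMap` and rejects `allowsDuplicateLabels = false` when the labels repeat. `LabelledObject`, `flagRegistry`, and the two helper functions are hypothetical names for this example only, and a plain `Error` stands in for `DuplicateLabelError`. Checking before storing the flag is an assumption of this sketch.

```ts
// Hypothetical minimal model: any object exposing index.values can carry flags.
interface LabelledObject {
  readonly index: { readonly values: readonly unknown[] };
}

// Flag state lives outside the object, keyed by reference, so it can be
// garbage-collected together with the object it describes.
const flagRegistry = new WeakMap<LabelledObject, { allowsDuplicateLabels: boolean }>();

function getAllowsDuplicateLabels(obj: LabelledObject): boolean {
  return flagRegistry.get(obj)?.allowsDuplicateLabels ?? true; // default: allowed
}

function setAllowsDuplicateLabels(obj: LabelledObject, value: boolean): void {
  if (!value) {
    // Validate first so a rejected update leaves the stored flag untouched.
    const seen = new Set<unknown>();
    for (const label of obj.index.values) {
      if (seen.has(label)) {
        throw new Error(`Index has duplicate keys: [${String(label)}]`);
      }
      seen.add(label);
    }
  }
  flagRegistry.set(obj, { allowsDuplicateLabels: value });
}

const uniqueLabels: LabelledObject = { index: { values: ["a", "b", "c"] } };
const repeatedLabels: LabelledObject = { index: { values: ["a", "b", "a"] } };

setAllowsDuplicateLabels(uniqueLabels, false); // fine: no duplicates

let rejected = false;
try {
  setAllowsDuplicateLabels(repeatedLabels, false);
} catch {
  rejected = true;
}
console.log(getAllowsDuplicateLabels(uniqueLabels), rejected);
```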
readTable(text, opts?) — parse delimiter-separated text into a DataFrame. Defaults to tab separator; all ReadCsvOptions forwarded. Mirrors pandas.read_table().
Reshape wide-format data to long format using named column groups —
+ mirrors pandas.lreshape().
+ Edit any code block below and press ▶ Run
+ (or Ctrl+Enter) to execute it live in your browser.
+
+
+
+
+
1 · Basic lreshape
+
Stack two wide columns (v1, v2) into a single long
+ column v, repeating the id column for each block.
+
+
+
+
+
+
+
+
2 · Multiple groups
+
Reshape with multiple output columns simultaneously. Each output column is
+ fed from a separate list of input columns.
+
+
+
+
+
+
+
+
3 · dropna option
+
By default rows where any value column is null/NaN
+ are dropped. Pass dropna: false to keep them.
+
+
+
+
+
+
+
+
4 · Real-world: survey scores
+
Stack multiple rounds of survey scores into a long-format table.
+
+
+
+
+
+
+
+
API Reference
+
Reshape wide-format data to long format by explicitly naming which input
+ columns map to each output column.
All input columns not mentioned in groups
+ become identity (id) columns and are repeated for each block. All group lists must
+ have the same length k; the result has nRows × k rows
+ (before applying dropna).
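The reshaping rule in the lreshape reference above (identity columns repeated per block, nRows × k rows before dropna) can be sketched on plain row objects. This is an illustrative model only: `lreshapeSketch` and `WideRow` are hypothetical names, and tsb's `lreshape` operates on DataFrames with its own option types.

```ts
// Hypothetical row-object model; tsb's lreshape works on DataFrames instead.
type WideRow = Record<string, number | string | null>;

function lreshapeSketch(
  rows: readonly WideRow[],
  groups: Record<string, readonly string[]>,
  dropna = true,
): WideRow[] {
  const targets = Object.keys(groups);
  const consumed = new Set<string>();
  for (const target of targets) {
    for (const column of groups[target] ?? []) consumed.add(column);
  }
  // All group lists are assumed to share the same length k.
  const firstTarget = targets[0];
  const k = firstTarget === undefined ? 0 : (groups[firstTarget] ?? []).length;

  const out: WideRow[] = [];
  for (let block = 0; block < k; block++) {
    for (const row of rows) {
      const next: WideRow = {};
      // Identity columns are repeated unchanged in every block.
      for (const column of Object.keys(row)) {
        if (!consumed.has(column)) next[column] = row[column] ?? null;
      }
      let hasMissing = false;
      for (const target of targets) {
        const source = (groups[target] ?? [])[block];
        const value = source === undefined ? null : (row[source] ?? null);
        if (value === null || (typeof value === "number" && Number.isNaN(value))) {
          hasMissing = true;
        }
        next[target] = value;
      }
      if (!(dropna && hasMissing)) out.push(next);
    }
  }
  return out;
}

const wideRows: WideRow[] = [
  { id: 1, v1: 10, v2: 30 },
  { id: 2, v1: 20, v2: null },
];
const longRows = lreshapeSketch(wideRows, { v: ["v1", "v2"] });
const longRowsKept = lreshapeSketch(wideRows, { v: ["v1", "v2"] }, false);
console.log(longRows.length, longRowsKept.length); // 3 4
```

With two input rows and k = 2, the unfiltered result has four rows; the default dropna removes the one row whose value is null.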
+ Parse delimiter-separated text into a DataFrame
+ with readTable(). Mirrors
+ pandas
+ read_table() — identical to readCsv() but defaults
+ to a tab (\t) separator.
+ Edit any code block below and press ▶ Run
+ (or Ctrl+Enter) to execute it live in your browser.
+
+
+
+
+
1 · Basic tab-separated file
+
By default readTable() splits on tabs, infers column dtypes,
+ and returns a DataFrame.
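To make the tab default concrete, here is a small self-contained parser sketch. It is not tsb's implementation: `parseDelimitedSketch` is a hypothetical helper that returns plain records instead of a DataFrame, and its numeric inference is deliberately crude.

```ts
// Hypothetical helper: splits delimiter-separated text into plain records.
function parseDelimitedSketch(text: string, sep = "\t"): Record<string, string | number>[] {
  const lines = text.split(/\r?\n/).filter((line) => line.length > 0);
  const [headerLine, ...dataLines] = lines;
  if (headerLine === undefined) return [];
  const header = headerLine.split(sep);
  return dataLines.map((line) => {
    const cells = line.split(sep);
    const record: Record<string, string | number> = {};
    header.forEach((column, i) => {
      const raw = cells[i] ?? "";
      // Rough inference: cells that look like an int or float become numbers.
      record[column] = /^-?\d+(\.\d+)?$/.test(raw) ? Number(raw) : raw;
    });
    return record;
  });
}

const tsvText = "name\tage\tcity\nAlice\t30\tNY\nBob\t25\tLA";
const parsedPeople = parseDelimitedSketch(tsvText); // tab is the default separator
const parsedPipes = parseDelimitedSketch("a|b\n1|2", "|"); // explicit separator
console.log(parsedPeople, parsedPipes);
```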
+
+
+
+
+
+
+
+
2 · Custom separator
+
Pass sep to use any delimiter — pipe, semicolon, or
+ multi-character strings.
+
+
+
+
+
+
+
+
3 · Handling missing values
+
readTable() recognises common NA strings (NA,
+ N/A, null, …) and converts them to
+ NaN. Extend the list with naValues.
+
+
+
+
+
+
+
+
4 · Index column, row limits & skip rows
+
Use indexCol to promote a column to the row index.
+ nRows caps the number of data rows read; skipRows
+ skips rows after the header.
+
+
+
+
+
+
+
+
API Reference
+
Parse a delimiter-separated text string into a DataFrame.
+ Defaults to tab (\t) unlike readCsv which uses
+ a comma.
+
readTable(text: string, options?: ReadTableOptions): DataFrame
+
+interface ReadTableOptions {
+ sep?: string; // separator (default: "\t")
+ header?: number | null; // header row index (default: 0)
+ indexCol?: string | number | null; // column to use as row index
+ dtype?: Record<string, DtypeName>; // force dtype for named columns
+ naValues?: readonly string[]; // extra NA string values
+ skipRows?: number; // data rows to skip after header
+ nRows?: number; // maximum data rows to read
+}
+ readSql, readSqlQuery, readSqlTable, and toSql
+ mirror pandas
+ read_sql() and
+ DataFrame.to_sql().
+ Because tsb has zero runtime dependencies, you pass
+ a SqlConnection adapter for your database driver.
+ Edit any code block below and press ▶ Run
+ (or Ctrl+Enter) to execute it live in your browser.
+
+
+
+
+
1 · readSqlQuery — run a SELECT statement
+
Pass a SQL string and a SqlConnection adapter. The result is a
+ DataFrame. An optional indexCol promotes a column to the row
+ index.
+
+
+
+
+
+
+
+
2 · readSqlTable — load an entire table
+
Pass a table name (not a SQL string). Use columns to select a subset,
+ or indexCol to set the row index.
+
+
+
+
+
+
+
+
3 · readSql — auto-detect query vs table name
+
readSql inspects the first argument: if it looks like a SQL statement
+ it calls readSqlQuery; otherwise it calls readSqlTable.
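The text does not say how readSql decides that its argument "looks like a SQL statement". One plausible heuristic, shown purely as an assumption, is to treat a leading SQL keyword or any embedded whitespace as a sign of a query. `looksLikeSqlStatement` and `routeReadSql` are hypothetical names for this sketch.

```ts
// Assumed heuristic, not necessarily the one tsb uses.
function looksLikeSqlStatement(sqlOrTable: string): boolean {
  const trimmed = sqlOrTable.trim();
  // A leading keyword, or any whitespace inside the string, suggests a statement.
  return /^(select|with|values|explain)\b/i.test(trimmed) || /\s/.test(trimmed);
}

// Reports which reader the dispatcher would pick for a given argument.
function routeReadSql(sqlOrTable: string): "readSqlQuery" | "readSqlTable" {
  return looksLikeSqlStatement(sqlOrTable) ? "readSqlQuery" : "readSqlTable";
}

console.log(routeReadSql("SELECT * FROM users")); // readSqlQuery
console.log(routeReadSql("users")); // readSqlTable
```

A bare identifier such as `with_totals` is still routed to the table reader, because the keyword test requires a word boundary after `with`.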
+
+
+
+
+
+
+
+
4 · toSql — write a DataFrame to a SQL table
+
Writes rows from a DataFrame into the database. Returns the number of
+ rows written. The ifExists option controls what happens when the table
+ already exists: "fail", "replace", or
+ "append".
+
+
+
+
+
+
+
+
API Reference
+
All four functions accept a SqlConnection adapter — implement
+ query() plus optional listTables() and insert()
+ for your database driver.
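The adapter contract above can be exercised without a real database. The sketch below re-declares the adapter types locally (copied from the shapes in src/io/sql.ts so the block is self-contained) and implements a toy in-memory adapter. `InMemoryAdapter` is a hypothetical class that only understands `SELECT * FROM <table>`; a real adapter would forward SQL to its driver.

```ts
// Local copies of the adapter types so this sketch runs on its own.
type SqlValue = string | number | boolean | null | Uint8Array;
type SqlRow = Record<string, SqlValue>;
type IfExistsStrategy = "fail" | "replace" | "append";
interface SqlResult {
  readonly columns: readonly string[];
  readonly rows: readonly SqlRow[];
}
interface SqlConnection {
  query(sql: string, params?: readonly SqlValue[]): SqlResult;
  listTables?(): readonly string[];
  insert?(
    tableName: string,
    rows: readonly SqlRow[],
    columns: readonly string[],
    ifExists: IfExistsStrategy,
  ): number;
}

// Hypothetical toy adapter backed by a Map.
class InMemoryAdapter implements SqlConnection {
  private readonly tables = new Map<string, SqlResult>();

  query(sql: string): SqlResult {
    const tableName = /^select \* from (\w+)$/i.exec(sql.trim())?.[1];
    const table = tableName === undefined ? undefined : this.tables.get(tableName);
    if (table === undefined) {
      throw new Error(`Unsupported query or unknown table: ${sql}`);
    }
    return table;
  }

  listTables(): readonly string[] {
    return Array.from(this.tables.keys());
  }

  insert(
    tableName: string,
    rows: readonly SqlRow[],
    columns: readonly string[],
    ifExists: IfExistsStrategy,
  ): number {
    const existing = this.tables.get(tableName);
    if (existing !== undefined && ifExists === "fail") {
      throw new Error(`Table already exists: ${tableName}`);
    }
    // "append" keeps the old rows; "replace" (or a new table) starts empty.
    const kept = existing !== undefined && ifExists === "append" ? existing.rows : [];
    this.tables.set(tableName, { columns, rows: [...kept, ...rows] });
    return rows.length;
  }
}

const memoryDb = new InMemoryAdapter();
const writtenCount = memoryDb.insert(
  "users",
  [{ id: 1, name: "Alice" }, { id: 2, name: "Bob" }],
  ["id", "name"],
  "fail",
);
memoryDb.insert("users", [{ id: 3, name: "Cara" }], ["id", "name"], "append");
const allUsers = memoryDb.query("SELECT * FROM users");
console.log(writtenCount, allUsers.rows.length, memoryDb.listTables());
```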
Read and write Stata DTA files from TypeScript.
+ toStata(df) serializes a DataFrame to a Stata DTA v118 binary buffer.
+ readStata(buf, options) parses the buffer back into a DataFrame.
+ Numeric missing values are represented as null. Mirrors
+ pandas.read_stata() and DataFrame.to_stata().
+ Edit any code block below and press ▶ Run
+ (or Ctrl+Enter) to execute it live in your browser.
+
+
+
+
+
1 · Basic round-trip — write and read back
+
Create a DataFrame, serialize it to a Stata DTA v118 binary buffer with
+ toStata(), then parse it back with readStata().
+ All columns, values, and shape are preserved.
+
+
+
+
+
+
+
+
2 · Missing values — null round-trip
+
Stata represents missing numeric values as special sentinel bit patterns.
+ readStata maps all missing sentinels to null.
+ toStata writes the standard Stata system-missing value for each type.
+
+
+
+
+
+
+
+
3 · Options — dataLabel & variableLabels
+
Embed a dataset description with dataLabel and per-column annotations
+ with variableLabels. These metadata fields are stored in the DTA header
+ and are visible in Stata's describe command.
+
+
+
+
+
+
+
+
4 · Options — usecols, nRows, indexCol
+
Restrict columns with usecols, limit rows with nRows,
+ and promote a column to the DataFrame index with indexCol.
+
+
+
+
+
+
+
+
5 · Boolean columns
+
Boolean values are stored as Stata byte (int8) with
+ true → 1 and false → 0. Reading converts
+ them back to numbers; use .map() or comparison operators
+ to recover booleans if needed.
+
+
+
+
+
+
+
+
6 · writeIndex — include the row index
+
Pass writeIndex: true to include the DataFrame's row index
+ as an extra _index column in the DTA file.
Parse XML text into a DataFrame with
+ auto-detection of row elements, attribute and child-element columns, entity decoding,
+ CDATA support, namespace stripping, and numeric coercion. Serialize any DataFrame
+ back to well-formed XML with full formatting control. Mirrors
+ pandas.read_xml() and pandas.DataFrame.to_xml().
+ Edit any code block below and press ▶ Run
+ (or Ctrl+Enter) to execute it live in your browser.
+
+
+
+
+
1 · Basic readXml — child-element rows
+
The most common XML layout: a root element containing repeating row elements,
+ each with child elements as columns. readXml auto-detects the row
+ tag and coerces numeric strings automatically.
+
+
+
+
+
+
+
+
2 · Attribute rows
+
XML elements can carry data as attributes instead of (or in addition to) child
+ elements. Use attribs: true (the default) to include them.
+
+
+
+
+
+
+
+
3 · usecols, nrows, indexCol
+
Restrict the columns returned with usecols, limit rows with
+ nrows, and promote a column to the index with indexCol.
+
+
+
+
+
+
+
+
4 · naValues — custom NA strings
+
Built-in NA strings include "", "NA", "NaN",
+ "N/A", "null", "None", "nan".
+ Use naValues to add your own.
+
+
+
+
+
+
+
+
5 · Entities & CDATA
+
Named entities (&, <, …), decimal/hex
+ character references (A, A), and
+ CDATA sections (<![CDATA[…]]>) are all handled transparently.
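A minimal sketch of this kind of text decoding follows. It handles the five predefined XML entities, decimal and hexadecimal character references, and CDATA sections in a single pass. `decodeXmlText` and `NAMED_ENTITIES` are hypothetical names; tsb's readXml may handle these cases differently internally.

```ts
// The five entities predefined by XML.
const NAMED_ENTITIES: Record<string, string> = {
  amp: "&",
  lt: "<",
  gt: ">",
  quot: '"',
  apos: "'",
};

// Hypothetical helper: decodes entity references and unwraps CDATA sections.
function decodeXmlText(raw: string): string {
  return raw.replace(
    /<!\[CDATA\[([\s\S]*?)\]\]>|&(#x[0-9a-fA-F]+|#\d+|\w+);/g,
    (whole: string, cdata: string | undefined, entity: string | undefined) => {
      // CDATA content is taken literally, with no entity decoding inside it.
      if (cdata !== undefined) return cdata;
      if (entity === undefined) return whole;
      if (entity.startsWith("#x")) return String.fromCodePoint(parseInt(entity.slice(2), 16));
      if (entity.startsWith("#")) return String.fromCodePoint(parseInt(entity.slice(1), 10));
      // Unknown named entities are left as written.
      return NAMED_ENTITIES[entity] ?? whole;
    },
  );
}

console.log(decodeXmlText("Tom &amp; Jerry &lt;3")); // Tom & Jerry <3
console.log(decodeXmlText("&#65;&#x41;")); // AA
console.log(decodeXmlText("<![CDATA[a < b &amp; c]]>")); // a < b &amp; c
```

Doing the replacement in one pass matters: decoding `&amp;` first and `&lt;` afterwards would wrongly turn `&amp;lt;` into `<`.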
+
+
+
+
+
+
+
+
6 · toXml — child elements (default)
+
toXml(df) produces a well-formed XML document with an XML declaration,
+ a configurable root element, and one child element per row containing one sub-element
+ per column.
+
+
+
+
+
+
+
+
7 · toXml — attribs mode
+
Set attribs: true to emit column values as XML attributes on each
+ row element instead of as child elements — produces more compact output.
+
+
+
+
+
+
+
+
8 · toXml — namespaces & CDATA columns
+
Declare XML namespace prefixes on the root element with namespaces.
+ Wrap sensitive columns in CDATA sections with cdataCols to preserve
+ special characters literally.
+
+
+
+
+
+
+
+
9 · Round-trip: toXml → readXml
+
Serializing a DataFrame to XML and reading it back should produce an identical
+ DataFrame (shape and values).
+
+
+
+
+
+
+
+
diff --git a/src/core/flags.ts b/src/core/flags.ts
new file mode 100644
index 00000000..546cb031
--- /dev/null
+++ b/src/core/flags.ts
@@ -0,0 +1,186 @@
+/**
+ * Flags — metadata flags for DataFrame and Series objects.
+ *
+ * Mirrors `pandas.core.flags.Flags`. Provides the `allowsDuplicateLabels`
+ * flag that controls whether duplicate row/column labels are permitted in the
+ * associated DataFrame or Series.
+ *
+ * @example
+ * ```ts
+ * import { DataFrame, DuplicateLabelError } from "tsb";
+ *
+ * const df = DataFrame.fromColumns({ a: [1, 2, 3] });
+ * df.flags.allowsDuplicateLabels; // true (default)
+ *
+ * df.flags.allowsDuplicateLabels = false;
+ * // Setting false on a DataFrame with no duplicates is fine.
+ *
+ * const dfDup = new DataFrame(
+ * new Map([["a", df.col("a")]]),
+ * df.index.append(df.index), // duplicate index
+ * );
+ * dfDup.flags.allowsDuplicateLabels = false; // throws DuplicateLabelError
+ * ```
+ *
+ * @packageDocumentation
+ */
+
+import { DuplicateLabelError } from "../errors.ts";
+
+// ---------------------------------------------------------------------------
+// Structural interfaces (no imports from frame.ts / series.ts)
+// ---------------------------------------------------------------------------
+
+/**
+ * Minimal structural interface satisfied by any `Index` instance.
+ * Defined here (instead of importing from base-index.ts) to avoid circular
+ * imports — frame.ts → flags.ts must not require flags.ts → frame.ts.
+ */
+interface IndexLike {
+ readonly values: readonly unknown[];
+ readonly size: number;
+}
+
+/**
+ * Structural interface satisfied by both `DataFrame` and `Series`.
+ * Used as the WeakMap key so flags.ts never imports the concrete classes.
+ */
+export interface FlaggedObject {
+ /** Row index of the object. */
+ readonly index: IndexLike;
+}
+
+// ---------------------------------------------------------------------------
+// Internal state registry
+// ---------------------------------------------------------------------------
+
+interface FlagsState {
+ allowsDuplicateLabels: boolean;
+}
+
+const registry = new WeakMap<FlaggedObject, FlagsState>();
+
+function getState(obj: FlaggedObject): FlagsState {
+ let state = registry.get(obj);
+ if (state === undefined) {
+ state = { allowsDuplicateLabels: true };
+ registry.set(obj, state);
+ }
+ return state;
+}
+
+// ---------------------------------------------------------------------------
+// Flags class
+// ---------------------------------------------------------------------------
+
+/**
+ * Metadata flags for a `DataFrame` or `Series`.
+ *
+ * Accessible via `df.flags` or `series.flags`. Mutations are reflected
+ * immediately on the underlying object because state is stored in a
+ * module-level WeakMap keyed by the object reference.
+ *
+ * ### pandas reference
+ * `pandas.core.flags.Flags`
+ */
+export class Flags {
+ private readonly _obj: FlaggedObject;
+
+ /**
+ * @param obj - The DataFrame or Series this Flags object is bound to.
+ * @param opts.allowsDuplicateLabels - Initial value for `allowsDuplicateLabels`.
+ * Defaults to `true` when not previously set.
+ */
+ constructor(obj: FlaggedObject, opts: { allowsDuplicateLabels?: boolean } = {}) {
+ this._obj = obj;
+ if (opts.allowsDuplicateLabels !== undefined) {
+ getState(obj).allowsDuplicateLabels = opts.allowsDuplicateLabels;
+ }
+ }
+
+ // ── allowsDuplicateLabels ─────────────────────────────────────────────────
+
+ /**
+ * Whether duplicate labels (along any axis) are allowed.
+ *
+ * Defaults to `true`. When set to `false`, any existing duplicate labels
+ * trigger a `DuplicateLabelError` immediately. Future operations that would
+ * produce duplicate labels also raise.
+ *
+ * @example
+ * ```ts
+ * df.flags.allowsDuplicateLabels; // true
+ * df.flags.allowsDuplicateLabels = false;
+ * df.flags.allowsDuplicateLabels; // false
+ * ```
+ */
+ get allowsDuplicateLabels(): boolean {
+ return getState(this._obj).allowsDuplicateLabels;
+ }
+
+ set allowsDuplicateLabels(value: boolean) {
+ if (!value) {
+ this._validateNoDuplicates();
+ }
+ getState(this._obj).allowsDuplicateLabels = value; // only reached when validation passed
+ }
+
+ // ── helpers ───────────────────────────────────────────────────────────────
+
+ /**
+ * Raise `DuplicateLabelError` if the bound object currently has duplicate
+ * row-index labels.
+ */
+ private _validateNoDuplicates(): void {
+ const { values } = this._obj.index;
+ const seen = new Set<unknown>();
+ for (const label of values) {
+ if (seen.has(label)) {
+ throw new DuplicateLabelError(`Index has duplicate keys: [${String(label)}]`);
+ }
+ seen.add(label);
+ }
+ }
+
+ /**
+ * Raise `DuplicateLabelError` if `allowsDuplicateLabels` is `false` and
+ * the bound object has duplicate labels. Called by DataFrame/Series methods
+ * after operations that could introduce duplicates.
+ */
+ raiseOnDuplicates(): void {
+ if (!this.allowsDuplicateLabels) {
+ this._validateNoDuplicates();
+ }
+ }
+
+ /**
+ * Return a copy of this Flags object bound to the **same** underlying object.
+ *
+ * The returned `Flags` shares state with the original — mutations to either
+ * are reflected in both (they both write to the same WeakMap entry).
+ */
+ copy(): Flags {
+ return new Flags(this._obj);
+ }
+
+ /** Human-readable representation mirroring pandas' `repr(df.flags)`. */
+ toString(): string {
+ return `<Flags(allowsDuplicateLabels=${this.allowsDuplicateLabels})>`;
+ }
+}
+
+// ---------------------------------------------------------------------------
+// Registry accessor (used by DataFrame.flags / Series.flags getters)
+// ---------------------------------------------------------------------------
+
+/**
+ * Return a `Flags` wrapper for the given object.
+ *
+ * Each call creates a *new* `Flags` wrapper object, but all wrappers for the
+ * same `obj` share the same state via the module-level WeakMap registry.
+ *
+ * @param obj - The DataFrame or Series to get flags for.
+ */
+export function getFlags(obj: FlaggedObject): Flags {
+ return new Flags(obj);
+}
diff --git a/src/core/frame.ts b/src/core/frame.ts
index ec18d144..e21c341e 100644
--- a/src/core/frame.ts
+++ b/src/core/frame.ts
@@ -26,6 +26,8 @@ import type { ExpandingOptions } from "../window/index.ts";
import { Rolling } from "../window/index.ts";
import type { RollingOptions } from "../window/index.ts";
import { Index } from "./base-index.ts";
+import { getFlags } from "./flags.ts";
+import type { Flags } from "./flags.ts";
import { RangeIndex } from "./range-index.ts";
import { Series } from "./series.ts";
@@ -245,6 +247,21 @@ export class DataFrame {
return this.index.size === 0 || this.columns.size === 0;
}
+ /**
+ * Metadata flags for this DataFrame.
+ *
+ * Controls behaviour such as whether duplicate labels are allowed.
+ *
+ * @example
+ * ```ts
+ * df.flags.allowsDuplicateLabels; // true (default)
+ * df.flags.allowsDuplicateLabels = false;
+ * ```
+ */
+ get flags(): Flags {
+ return getFlags(this);
+ }
+
// ─── column access ────────────────────────────────────────────────────────
/**
diff --git a/src/core/index.ts b/src/core/index.ts
index 130c748e..2ac9ba64 100644
--- a/src/core/index.ts
+++ b/src/core/index.ts
@@ -151,3 +151,6 @@ export type {
ExtensionDtypeConstructor,
ExtensionArrayConstructor,
} from "./extensions.ts";
+
+export { Flags, getFlags } from "./flags.ts";
+export type { FlaggedObject } from "./flags.ts";
diff --git a/src/core/series.ts b/src/core/series.ts
index 29063e91..03815a8b 100644
--- a/src/core/series.ts
+++ b/src/core/series.ts
@@ -21,6 +21,8 @@ import type { CatSeriesLike } from "./cat_accessor.ts";
import { DatetimeAccessor } from "./datetime_accessor.ts";
import type { DatetimeSeriesLike } from "./datetime_accessor.ts";
import { Dtype } from "./dtype.ts";
+import { getFlags } from "./flags.ts";
+import type { Flags } from "./flags.ts";
import { RangeIndex } from "./range-index.ts";
import { StringAccessor } from "./string_accessor.ts";
import type { StringSeriesLike } from "./string_accessor.ts";
@@ -286,6 +288,21 @@ export class Series {
return this._values.length === 0;
}
+ /**
+ * Metadata flags for this Series.
+ *
+ * Controls behaviour such as whether duplicate labels are allowed.
+ *
+ * @example
+ * ```ts
+ * s.flags.allowsDuplicateLabels; // true (default)
+ * s.flags.allowsDuplicateLabels = false;
+ * ```
+ */
+ get flags(): Flags {
+ return getFlags(this);
+ }
+
/** Snapshot of the underlying values as a plain array. */
get values(): readonly T[] {
return this._values;
diff --git a/src/errors.ts b/src/errors.ts
index 4ea24681..83099389 100644
--- a/src/errors.ts
+++ b/src/errors.ts
@@ -86,6 +86,19 @@ export class EmptyDataError extends Error {
}
}
+/**
+ * Raised when an operation would produce (or encounters) duplicate labels
+ * on an object where `flags.allowsDuplicateLabels` is `false`.
+ *
+ * Equivalent to `pandas.errors.DuplicateLabelError`.
+ */
+export class DuplicateLabelError extends ValueError {
+ override readonly name = "DuplicateLabelError";
+ constructor(message = "Index has duplicates") {
+ super(message);
+ }
+}
+
/** Raised when casting to integer would lose data due to NaN values. */
export class IntCastingNaNError extends Error {
override readonly name = "IntCastingNaNError";
@@ -233,6 +246,7 @@ export const errors = {
DatabaseError,
DataError,
DtypeWarning,
+ DuplicateLabelError,
EmptyDataError,
IntCastingNaNError,
InvalidColumnName,
diff --git a/src/index.ts b/src/index.ts
index 2f49842f..d0048033 100644
--- a/src/index.ts
+++ b/src/index.ts
@@ -62,6 +62,26 @@ export { toJsonDenormalize, toJsonRecords, toJsonSplit, toJsonIndex } from "./io
export type { JsonDenormalizeOptions, JsonSplitOptions, JsonSplitResult } from "./io/index.ts";
export { readHtml } from "./io/index.ts";
export type { ReadHtmlOptions } from "./io/index.ts";
+export { readXml, toXml } from "./io/index.ts";
+export type { ReadXmlOptions, ToXmlOptions } from "./io/index.ts";
+export { readTable } from "./io/index.ts";
+export type { ReadTableOptions } from "./io/index.ts";
+export { readSql, readSqlQuery, readSqlTable, toSql } from "./io/index.ts";
+export { TableExistsError, TableNotFoundError } from "./io/index.ts";
+export { readStata, toStata } from "./io/index.ts";
+export type { ReadStataOptions, ToStataOptions } from "./io/index.ts";
+export type {
+ SqlValue,
+ SqlRow,
+ SqlResult,
+ SqlConnection,
+ IfExistsStrategy,
+ ReadSqlBaseOptions,
+ ReadSqlQueryOptions,
+ ReadSqlTableOptions,
+ ReadSqlOptions,
+ ToSqlOptions,
+} from "./io/index.ts";
export { pearsonCorr, dataFrameCorr, dataFrameCov } from "./stats/index.ts";
export type { CorrMethod, CorrOptions, CovOptions } from "./stats/index.ts";
export { Rolling } from "./window/index.ts";
@@ -103,6 +123,8 @@ export { wideToLong } from "./reshape/index.ts";
export type { WideToLongOptions } from "./reshape/index.ts";
export { pivotTableFull } from "./reshape/index.ts";
export type { PivotTableFullOptions } from "./reshape/index.ts";
+export { lreshape } from "./reshape/index.ts";
+export type { LreshapeGroups, LreshapeOptions } from "./reshape/index.ts";
export { MultiIndex } from "./core/index.ts";
export type { MultiIndexOptions } from "./core/index.ts";
export { rankSeries, rankDataFrame } from "./stats/index.ts";
@@ -783,3 +805,8 @@ export {
IndexError,
} from "./errors.ts";
export type { PandasError } from "./errors.ts";
+export { DuplicateLabelError } from "./errors.ts";
+export { caseWhen } from "./stats/index.ts";
+export type { CaseWhenBranch, CaseWhenPredicate } from "./stats/index.ts";
+export { Flags, getFlags } from "./core/index.ts";
+export type { FlaggedObject } from "./core/index.ts";
diff --git a/src/io/csv.ts b/src/io/csv.ts
index 687355f0..331ee944 100644
--- a/src/io/csv.ts
+++ b/src/io/csv.ts
@@ -144,6 +144,7 @@ function isNaRaw(raw: string, naSet: ReadonlySet<string>): boolean {
/** Infer the most specific dtype for a column from its raw string values. */
function inferColumnDtype(raws: readonly string[], naSet: ReadonlySet<string>): DtypeName {
const nonNa = raws.filter((r) => !isNaRaw(r, naSet));
+ const hasNa = nonNa.length < raws.length;
if (nonNa.length === 0) {
return "object";
}
@@ -153,18 +154,23 @@ function inferColumnDtype(raws: readonly string[], naSet: ReadonlySet<string>):
}
const allInt = nonNa.every((r) => RE_INT.test(r));
if (allInt) {
- return "int64";
+ // Upgrade to float64 when NAs are present so NaN can represent missing values.
+ return hasNa ? "float64" : "int64";
}
const allFloat = nonNa.every((r) => RE_FLOAT.test(r));
if (allFloat) {
return "float64";
}
- return "string";
+ return "object";
}
/** Parse a raw string to a Scalar for an inferred dtype. */
function parseInferred(raw: string, dtype: DtypeName, naSet: ReadonlySet<string>): Scalar {
if (isNaRaw(raw, naSet)) {
+ // Numeric columns use NaN so callers can detect missing values via Number.isNaN().
+ if (dtype === "float64" || dtype === "int64") {
+ return Number.NaN;
+ }
return null;
}
if (dtype === "bool") {
diff --git a/src/io/index.ts b/src/io/index.ts
index 6c5edea0..93f3060d 100644
--- a/src/io/index.ts
+++ b/src/io/index.ts
@@ -23,6 +23,28 @@ export type {
} from "./to_json_normalize.ts";
export { readHtml } from "./read_html.ts";
export type { ReadHtmlOptions } from "./read_html.ts";
+export { readXml, toXml } from "./xml.ts";
+export type { ReadXmlOptions, ToXmlOptions } from "./xml.ts";
+export { readTable } from "./read_table.ts";
+export type { ReadTableOptions } from "./read_table.ts";
+
+export { readSql, readSqlQuery, readSqlTable, toSql } from "./sql.ts";
+export { TableExistsError, TableNotFoundError } from "./sql.ts";
+
+export { readStata, toStata } from "./stata.ts";
+export type { ReadStataOptions, ToStataOptions } from "./stata.ts";
+export type {
+ SqlValue,
+ SqlRow,
+ SqlResult,
+ SqlConnection,
+ IfExistsStrategy,
+ ReadSqlBaseOptions,
+ ReadSqlQueryOptions,
+ ReadSqlTableOptions,
+ ReadSqlOptions,
+ ToSqlOptions,
+} from "./sql.ts";
// readExcel / xlsxSheetNames use node:zlib and cannot be bundled for the
// browser. Import them directly from "tsb/io/read_excel" when running in
diff --git a/src/io/read_table.ts b/src/io/read_table.ts
new file mode 100644
index 00000000..0290afa1
--- /dev/null
+++ b/src/io/read_table.ts
@@ -0,0 +1,52 @@
+/**
+ * readTable — read a general delimiter-separated text file into a DataFrame.
+ *
+ * Mirrors `pandas.read_table()`:
+ * - Same signature as `readCsv` but defaults `sep` to `"\t"`.
+ * - Handles any single-character (or multi-character) delimiter.
+ * - All `ReadCsvOptions` are supported; when `sep` is omitted it falls back
+ * to `"\t"` (tab), distinguishing this function from `readCsv` (whose
+ * default is `","`).
+ *
+ * @module
+ */
+
+import type { DataFrame } from "../core/index.ts";
+import { readCsv } from "./csv.ts";
+import type { ReadCsvOptions } from "./csv.ts";
+
+// ─── public types ─────────────────────────────────────────────────────────────
+
+/**
+ * Options for {@link readTable}.
+ *
+ * Identical to {@link ReadCsvOptions} except the default `sep` is `"\t"`.
+ */
+export interface ReadTableOptions extends ReadCsvOptions {
+ /** Column separator. Default: `"\t"` (tab). */
+ readonly sep?: string;
+}
+
+// ─── implementation ───────────────────────────────────────────────────────────
+
+/**
+ * Parse a delimiter-separated text string into a {@link DataFrame}.
+ *
+ * Equivalent to `pandas.read_table()` — the same as {@link readCsv} but
+ * defaults to a tab separator instead of a comma.
+ *
+ * ```ts
+ * import { readTable } from "tsb";
+ *
+ * const tsv = "name\tage\tcity\nAlice\t30\tNY\nBob\t25\tLA";
+ * const df = readTable(tsv);
+ * // DataFrame with columns: name, age, city
+ * ```
+ *
+ * @param text Raw text content of the file.
+ * @param options Parsing options (see {@link ReadTableOptions}).
+ */
+export function readTable(text: string, options: ReadTableOptions = {}): DataFrame {
+ const sep = options.sep ?? "\t";
+ return readCsv(text, { ...options, sep });
+}
diff --git a/src/io/sql.ts b/src/io/sql.ts
new file mode 100644
index 00000000..2e5ace04
--- /dev/null
+++ b/src/io/sql.ts
@@ -0,0 +1,654 @@
+/**
+ * read_sql / to_sql — SQL I/O for DataFrame.
+ *
+ * Mirrors the pandas SQL I/O API:
+ * - {@link readSqlQuery} — execute a SQL SELECT and return a DataFrame
+ * - {@link readSqlTable} — read an entire table into a DataFrame
+ * - {@link readSql} — auto-detect query vs table name
+ * - {@link toSql} — write a DataFrame to a SQL table
+ *
+ * Because tsb has zero runtime dependencies, this module does **not** ship a
+ * database driver. Instead it defines the {@link SqlConnection} adapter
+ * interface. Pass a conforming adapter for your driver of choice
+ * (better-sqlite3, postgres, mysql2, …) to any of the functions here.
+ *
+ * @example
+ * ```ts
+ * import type { SqlConnection, SqlResult, SqlValue } from "tsb";
+ * import { readSql, toSql } from "tsb";
+ *
+ * // Minimal in-memory adapter (illustrative — not a real DB)
+ * class MockAdapter implements SqlConnection {
+ * query(sql: string): SqlResult {
+ * return { columns: ["id", "name"], rows: [{ id: 1, name: "Alice" }] };
+ * }
+ * }
+ *
+ * const db = new MockAdapter();
+ * const df = readSql("SELECT * FROM users", db);
+ * ```
+ *
+ * @module
+ */
+
+import { DataFrame } from "../core/index.ts";
+import { Index } from "../core/index.ts";
+import type { Label, Scalar } from "../types.ts";
+
+// ─── SQL value types ──────────────────────────────────────────────────────────
+
+/**
+ * A scalar value that may be returned from a SQL query column.
+ *
+ * Covers the common ground across DB drivers: numbers, strings, booleans,
+ * `null` (SQL NULL), and raw byte buffers (SQL BLOB / BYTEA).
+ */
+export type SqlValue = string | number | boolean | null | Uint8Array;
+
+/**
+ * A single row from a SQL result set, mapping column name → value.
+ */
+export type SqlRow = Record<string, SqlValue>;
+
+/**
+ * The complete result of executing a SQL query.
+ */
+export interface SqlResult {
+ /** Ordered list of column names as returned by the database. */
+ readonly columns: readonly string[];
+ /** All data rows. Each row is an object keyed by column name. */
+ readonly rows: readonly SqlRow[];
+}
+
+// ─── connection adapter interface ─────────────────────────────────────────────
+
+/**
+ * Strategy for handling a pre-existing table in {@link toSql}.
+ *
+ * - `"fail"` — throw {@link TableExistsError} if the table already exists (default).
+ * - `"replace"` — drop and recreate the table, then insert all rows.
+ * - `"append"` — insert rows into the existing table without dropping it.
+ */
+export type IfExistsStrategy = "fail" | "replace" | "append";
+
+/**
+ * Adapter interface for a SQL database connection.
+ *
+ * Implement this interface for your specific database driver and pass instances
+ * to {@link readSql}, {@link readSqlQuery}, {@link readSqlTable}, and
+ * {@link toSql}.
+ *
+ * Only {@link query} is required; all other methods are optional and enable
+ * more efficient or richer behaviour.
+ *
+ * @example
+ * ```ts
+ * // Minimal adapter wrapping better-sqlite3
+ * import Database from "better-sqlite3";
+ * import type { SqlConnection, SqlResult, SqlRow, SqlValue } from "tsb";
+ *
+ * class BetterSqlite3Adapter implements SqlConnection {
+ * constructor(private readonly db: Database.Database) {}
+ *
+ * query(sql: string, params?: readonly SqlValue[]): SqlResult {
+ * const stmt = this.db.prepare(sql);
+ * const rows = stmt.all(...(params ?? [])) as SqlRow[];
+ * const columns = rows.length > 0 ? Object.keys(rows[0]!) : [];
+ * return { columns, rows };
+ * }
+ *
+ * listTables(): string[] {
+ * return (this.db.prepare(
+ * "SELECT name FROM sqlite_master WHERE type='table'",
+ * ).all() as { name: string }[]).map((r) => r.name);
+ * }
+ * }
+ * ```
+ */
+export interface SqlConnection {
+ /**
+ * Execute a SQL query and return the result set.
+ *
+ * @param sql SQL string, which may include `?` (positional) or `$N`
+ * (numbered) placeholders — semantics depend on the driver.
+ * @param params Optional positional parameters bound to the placeholders.
+ */
+ query(sql: string, params?: readonly SqlValue[]): SqlResult;
+
+ /**
+ * Return the names of all tables visible through this connection.
+ *
+ * Used by {@link readSqlTable} to validate that the requested table exists.
+ * When omitted, no up-front validation is performed.
+ */
+ listTables?(): readonly string[];
+
+ /**
+ * Insert rows into a table, applying the specified {@link IfExistsStrategy}.
+ *
+ * When provided, {@link toSql} delegates bulk insertion to this method,
+ * allowing the adapter to use database-native batch APIs.
+ * When omitted, {@link toSql} falls back to individual `INSERT INTO …`
+ * statements executed via {@link query}.
+ *
+ * @param tableName Target table.
+ * @param rows Row objects — each key is a column name.
+ * @param columns Ordered column names (matches keys in `rows`).
+ * @param ifExists How to handle a pre-existing table.
+ * @returns Number of rows inserted.
+ */
+ insert?(
+ tableName: string,
+ rows: readonly SqlRow[],
+ columns: readonly string[],
+ ifExists: IfExistsStrategy,
+ ): number;
+}
+
+// ─── public option types ──────────────────────────────────────────────────────
+
+/**
+ * Options shared by all read functions.
+ */
+export interface ReadSqlBaseOptions {
+ /**
+ * Column name or zero-based position to use as the DataFrame row index.
+ * When a string is given the column must exist in the result.
+ * When a number is given it selects by position.
+ * Default: `null` — a default `RangeIndex` is used.
+ */
+ readonly indexCol?: string | number | null;
+
+ /**
+ * Column names to parse as timestamps.
+ * Values are converted to milliseconds-since-epoch using `Date.parse()`.
+ * Non-parseable values are left as-is.
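+ *
+ * A standalone sketch of that rule, assuming nothing beyond `Date.parse()`;
+ * the helper name `parseDateCell` is illustrative and not part of tsb:
+ *
+ * ```ts
+ * // Strings that Date.parse() understands become epoch milliseconds;
+ * // every other value is returned unchanged.
+ * function parseDateCell(v: string | number | null): string | number | null {
+ *   if (typeof v !== "string") return v;
+ *   const ms = Date.parse(v);
+ *   return Number.isNaN(ms) ? v : ms;
+ * }
+ *
+ * const parsed = parseDateCell("2024-01-01T00:00:00Z"); // 1704067200000
+ * const kept = parseDateCell("not a date"); // "not a date"
+ * ```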
+ */
+ readonly parseDates?: readonly string[];
+}
+
+/**
+ * Options for {@link readSqlQuery}.
+ */
+export interface ReadSqlQueryOptions extends ReadSqlBaseOptions {
+ /**
+ * Positional parameter bindings for the SQL query.
+ * Passed verbatim to {@link SqlConnection.query}.
+ */
+ readonly params?: readonly SqlValue[];
+}
+
+/**
+ * Options for {@link readSqlTable}.
+ */
+export interface ReadSqlTableOptions extends ReadSqlBaseOptions {
+ /**
+ * Schema qualifier to prefix the table name (e.g. `"public"` in PostgreSQL).
+ * When provided the query uses `"<schema>"."<table>"`.
+ */
+ readonly schema?: string;
+
+ /**
+ * Subset of columns to retrieve. When omitted all columns are returned.
+ */
+ readonly columns?: readonly string[];
+}
+
+/**
+ * Options for {@link readSql}.
+ * Combines {@link ReadSqlQueryOptions} and {@link ReadSqlTableOptions}.
+ */
+export interface ReadSqlOptions extends ReadSqlQueryOptions, ReadSqlTableOptions {}
+
+/**
+ * Options for {@link toSql}.
+ */
+export interface ToSqlOptions {
+ /**
+ * Behaviour when a table named `name` already exists.
+ * Default: `"fail"`.
+ */
+ readonly ifExists?: IfExistsStrategy;
+
+ /**
+ * Whether to write the DataFrame's row index as a column.
+ * Default: `true`.
+ */
+ readonly index?: boolean;
+
+ /**
+ * Column label to use for the written index column.
+ * Only effective when `index` is `true`.
+ * Default: the index name when set, otherwise `"index"`.
+ */
+ readonly indexLabel?: string | null;
+
+ /**
+ * Number of rows to insert per batch.
+ * Ignored when the adapter provides {@link SqlConnection.insert}.
+ * Default: all rows in a single batch.
+ */
+ readonly chunksize?: number;
+}
+
+// ─── errors ───────────────────────────────────────────────────────────────────
+
+/**
+ * Thrown by {@link toSql} when `ifExists: "fail"` (the default) and the
+ * target table already exists.
+ */
+export class TableExistsError extends Error {
+ /** @param tableName The table that already exists. */
+ constructor(tableName: string) {
+ super(`Table "${tableName}" already exists. Use ifExists: "replace" or "append".`);
+ this.name = "TableExistsError";
+ }
+}
+
+/**
+ * Thrown by {@link readSqlTable} when the requested table is not found.
+ */
+export class TableNotFoundError extends Error {
+ /** @param tableName The table that was not found. */
+ constructor(tableName: string) {
+ super(`Table "${tableName}" not found in the database.`);
+ this.name = "TableNotFoundError";
+ }
+}
+
+// ─── internal helpers ─────────────────────────────────────────────────────────
+
+/** Convert a {@link SqlValue} to a tsb {@link Scalar}. */
+function sqlValueToScalar(v: SqlValue): Scalar {
+ if (v instanceof Uint8Array) {
+ // Represent a BLOB as a hex-encoded string so it can sit in a
+ // string-typed Series without losing data.
+ return Buffer.from(v).toString("hex");
+ }
+ return v;
+}
+
+/**
+ * Build a DataFrame from a {@link SqlResult}, applying common options.
+ *
+ * @internal
+ */
+function resultToDataFrame(result: SqlResult, options: ReadSqlBaseOptions): DataFrame {
+ const { indexCol = null, parseDates } = options;
+
+ // Resolve the index column name (if any).
+ let idxColName: string | null = null;
+ if (indexCol !== null && indexCol !== undefined) {
+ if (typeof indexCol === "number") {
+ const col = result.columns[indexCol];
+ if (col !== undefined) {
+ idxColName = col;
+ }
+ } else {
+ idxColName = indexCol;
+ }
+ }
+
+ // Build column arrays, excluding the index column.
+ const dataColumns: string[] = [];
+ const columnData: Record<string, Scalar[]> = {};
+
+ for (const col of result.columns) {
+ if (col === idxColName) continue;
+ dataColumns.push(col);
+ columnData[col] = [];
+ }
+
+ // Populate column arrays.
+ for (const row of result.rows) {
+ for (const col of dataColumns) {
+ const arr = columnData[col];
+ if (arr !== undefined) {
+ const raw = row[col];
+ arr.push(raw !== undefined ? sqlValueToScalar(raw) : null);
+ }
+ }
+ }
+
+ // Parse date columns (convert to ms-since-epoch numbers).
+ if (parseDates !== undefined) {
+ for (const col of parseDates) {
+ const arr = columnData[col];
+ if (arr !== undefined) {
+ for (let i = 0; i < arr.length; i++) {
+ const v = arr[i];
+ if (v !== null && v !== undefined && typeof v === "string") {
+ const ms = Date.parse(v);
+ arr[i] = Number.isNaN(ms) ? v : ms;
+ }
+ }
+ }
+ }
+ }
+
+ // Build the row index.
+ const indexVals: Label[] = [];
+ if (idxColName !== null) {
+ for (const row of result.rows) {
+ const raw = row[idxColName];
+ const v: SqlValue = raw !== undefined ? raw : null;
+ if (v instanceof Uint8Array) {
+ indexVals.push(Buffer.from(v).toString("hex"));
+ } else {
+ indexVals.push(v);
+ }
+ }
+ }
+
+ const rowIndex = idxColName !== null ? new Index(indexVals, idxColName) : undefined;
+
+ return DataFrame.fromColumns(
+ columnData as Record<string, readonly Scalar[]>,
+ rowIndex !== undefined ? { index: rowIndex } : {},
+ );
+}
+
+/** Quote an identifier with double-quotes (ANSI SQL). */
+function quoteIdent(name: string): string {
+ return `"${name.replace(/"/g, '""')}"`;
+}
+
+/** Build a SELECT statement for {@link readSqlTable}. */
+function buildSelectQuery(tableName: string, options: ReadSqlTableOptions): string {
+ const { schema, columns } = options;
+
+ const qualifiedTable =
+ schema !== undefined ? `${quoteIdent(schema)}.${quoteIdent(tableName)}` : quoteIdent(tableName);
+
+ const colList =
+ columns !== undefined && columns.length > 0 ? columns.map(quoteIdent).join(", ") : "*";
+
+ return `SELECT ${colList} FROM ${qualifiedTable}`;
+}
+
+/**
+ * Heuristic: does the string look like a SQL query (contains whitespace) or a
+ * plain table name?
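+ *
+ * A standalone sketch of the heuristic and its edge cases; `isQueryLike` is
+ * an illustrative copy so the snippet runs on its own:
+ *
+ * ```ts
+ * const isQueryLike = (s: string): boolean => /\s/.test(s.trim());
+ *
+ * const results = [
+ *   isQueryLike("orders"), // false: bare table name
+ *   isQueryLike("  orders  "), // false: outer whitespace is trimmed first
+ *   isQueryLike("SELECT 1"), // true: inner whitespace
+ *   isQueryLike("order items"), // true: a table name containing a space
+ * ];
+ * ```
+ *
+ * Table names that contain whitespace are classified as queries, so they are
+ * better passed to {@link readSqlTable} directly.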
+ */
+function looksLikeQuery(sqlOrTable: string): boolean {
+ return /\s/.test(sqlOrTable.trim());
+}
+
+// ─── public API ───────────────────────────────────────────────────────────────
+
+/**
+ * Execute a SQL SELECT query and return the result as a {@link DataFrame}.
+ *
+ * Mirrors `pandas.read_sql_query()`.
+ *
+ * ```ts
+ * import { readSqlQuery } from "tsb";
+ *
+ * const df = readSqlQuery("SELECT id, name FROM users WHERE active = ?", db, {
+ * params: [1],
+ * indexCol: "id",
+ * });
+ * ```
+ *
+ * @param sql SQL SELECT string (may include parameter placeholders).
+ * @param conn Database adapter implementing {@link SqlConnection}.
+ * @param options See {@link ReadSqlQueryOptions}.
+ */
+export function readSqlQuery(
+ sql: string,
+ conn: SqlConnection,
+ options: ReadSqlQueryOptions = {},
+): DataFrame {
+ const { params } = options;
+ const result = params !== undefined ? conn.query(sql, params) : conn.query(sql);
+ return resultToDataFrame(result, options);
+}
+
+/**
+ * Read an entire database table into a {@link DataFrame}.
+ *
+ * Mirrors `pandas.read_sql_table()`.
+ *
+ * ```ts
+ * import { readSqlTable } from "tsb";
+ *
+ * const df = readSqlTable("products", db, {
+ * schema: "inventory",
+ * columns: ["id", "name", "price"],
+ * });
+ * ```
+ *
+ * @param tableName Name of the table to read.
+ * @param conn Database adapter implementing {@link SqlConnection}.
+ * @param options See {@link ReadSqlTableOptions}.
+ */
+export function readSqlTable(
+ tableName: string,
+ conn: SqlConnection,
+ options: ReadSqlTableOptions = {},
+): DataFrame {
+ if (conn.listTables !== undefined) {
+ const tables = conn.listTables();
+ const tableNameLower = tableName.toLowerCase();
+ const found = tables.some((t) => t.toLowerCase() === tableNameLower);
+ if (!found) {
+ throw new TableNotFoundError(tableName);
+ }
+ }
+
+ const sql = buildSelectQuery(tableName, options);
+ const result = conn.query(sql);
+ return resultToDataFrame(result, options);
+}
+
+/**
+ * Read a SQL query **or** table name into a {@link DataFrame}.
+ *
+ * Mirrors `pandas.read_sql()`.
+ *
+ * - If `sqlOrTable` contains whitespace it is treated as a SQL query string
+ * and executed via {@link readSqlQuery}.
+ * - Otherwise it is treated as a table name and delegated to
+ * {@link readSqlTable}.
+ *
+ * ```ts
+ * import { readSql } from "tsb";
+ *
+ * // Using a query
+ * const df1 = readSql("SELECT * FROM orders WHERE status = 'open'", db);
+ *
+ * // Using a table name
+ * const df2 = readSql("orders", db);
+ * ```
+ *
+ * @param sqlOrTable SQL query string or bare table name.
+ * @param conn Database adapter implementing {@link SqlConnection}.
+ * @param options See {@link ReadSqlOptions}.
+ */
+export function readSql(
+ sqlOrTable: string,
+ conn: SqlConnection,
+ options: ReadSqlOptions = {},
+): DataFrame {
+ if (looksLikeQuery(sqlOrTable)) {
+ return readSqlQuery(sqlOrTable, conn, options);
+ }
+ return readSqlTable(sqlOrTable, conn, options);
+}
+
+/**
+ * Write a {@link DataFrame} to a SQL table.
+ *
+ * Mirrors `pandas.DataFrame.to_sql()`.
+ *
+ * When the adapter provides an {@link SqlConnection.insert} method, writes are
+ * delegated to it (enabling driver-native batching). Otherwise each row is
+ * written via an individual `INSERT INTO` statement through
+ * {@link SqlConnection.query}.
+ *
+ * ```ts
+ * import { toSql } from "tsb";
+ *
+ * const rowsWritten = toSql(df, "staging_data", db, { ifExists: "replace" });
+ * ```
+ *
+ * @param df Source DataFrame.
+ * @param tableName Destination table name.
+ * @param conn Database adapter implementing {@link SqlConnection}.
+ * @param options See {@link ToSqlOptions}.
+ * @returns Number of rows written.
+ */
+export function toSql(
+ df: DataFrame,
+ tableName: string,
+ conn: SqlConnection,
+ options: ToSqlOptions = {},
+): number {
+ const { ifExists = "fail", index = true, indexLabel = null, chunksize } = options;
+
+ // Build ordered column list.
+ const dataCols = [...df.columns.values] as string[];
+ const allCols: string[] = [];
+ let idxLabel = "index";
+ if (index) {
+ const nameFromIndex = df.index.name;
+ if (indexLabel !== null && indexLabel !== undefined) {
+ idxLabel = indexLabel;
+ } else if (typeof nameFromIndex === "string" && nameFromIndex.length > 0) {
+ idxLabel = nameFromIndex;
+ }
+ allCols.push(idxLabel);
+ }
+ for (const c of dataCols) {
+ allCols.push(c);
+ }
+
+ // Build row objects.
+ const records = df.toRecords();
+ const indexValues = [...df.index.values] as Label[];
+ const rows: SqlRow[] = [];
+
+ for (let i = 0; i < records.length; i++) {
+ const rec = records[i];
+ const row: SqlRow = {};
+ if (index) {
+ const idxVal = indexValues[i];
+ row[idxLabel] = labelToSqlValue(idxVal !== undefined ? idxVal : null);
+ }
+ if (rec !== undefined) {
+ for (const col of dataCols) {
+ const v = rec[col];
+ row[col] = scalarToSqlValue(v !== undefined ? v : null);
+ }
+ }
+ rows.push(row);
+ }
+
+ if (conn.insert !== undefined) {
+ return conn.insert(tableName, rows, allCols, ifExists);
+ }
+
+ // Fallback: emit INSERT statements via query().
+ return insertViaQuery(tableName, rows, allCols, ifExists, chunksize, conn);
+}
+
+// ─── helpers for toSql ────────────────────────────────────────────────────────
+
+/** Convert a {@link Label} to a {@link SqlValue}. */
+function labelToSqlValue(label: Label): SqlValue {
+ if (label === null) return null;
+ if (typeof label === "boolean") return label;
+ if (typeof label === "number") return label;
+ if (typeof label === "string") return label;
+ if (label instanceof Date) return label.toISOString();
+ return String(label);
+}
+
+/** Convert a tsb {@link Scalar} to a {@link SqlValue}. */
+function scalarToSqlValue(s: Scalar): SqlValue {
+ if (s === null || s === undefined) return null;
+ if (typeof s === "boolean") return s;
+ if (typeof s === "number") return s;
+ if (typeof s === "string") return s;
+ if (typeof s === "bigint") return Number(s);
+ if (s instanceof Date) return s.toISOString();
+ // TimedeltaLike — store as total milliseconds
+ if (typeof s === "object" && "totalMs" in s) return s.totalMs;
+ return null;
+}
+
+/**
+ * Escape a string for inclusion in a SQL literal.
+ * Only used in the fallback query path.
+ */
+function escapeSqlString(s: string): string {
+ return s.replace(/'/g, "''");
+}
+
+/** Format a {@link SqlValue} as a SQL literal for the fallback path. */
+function sqlLiteral(v: SqlValue): string {
+ if (v === null) return "NULL";
+ if (typeof v === "boolean") return v ? "1" : "0";
+ if (typeof v === "number") {
+ if (Number.isNaN(v)) return "NULL";
+ if (!Number.isFinite(v)) return "NULL";
+ return String(v);
+ }
+ if (typeof v === "string") return `'${escapeSqlString(v)}'`;
+ // Uint8Array (blob): represent as hex literal (SQLite: X'…')
+ return `X'${Buffer.from(v).toString("hex")}'`;
+}
+
+/**
+ * Insert rows by emitting individual INSERT statements through
+ * {@link SqlConnection.query}. This is the fallback path for adapters that do
+ * not implement {@link SqlConnection.insert}.
+ */
+function insertViaQuery(
+ tableName: string,
+ rows: readonly SqlRow[],
+ columns: readonly string[],
+ ifExists: IfExistsStrategy,
+ chunksize: number | undefined,
+ conn: SqlConnection,
+): number {
+ if (rows.length === 0) return 0;
+
+ const quotedTable = quoteIdent(tableName);
+ const colList = columns.map(quoteIdent).join(", ");
+
+ // Check for pre-existing table when strategy is "fail".
+ if (ifExists === "fail" && conn.listTables !== undefined) {
+ const tables = conn.listTables();
+ const tl = tableName.toLowerCase();
+ if (tables.some((t) => t.toLowerCase() === tl)) {
+ throw new TableExistsError(tableName);
+ }
+ }
+
+ // "replace": attempt DROP TABLE first.
+ if (ifExists === "replace") {
+ try {
+ conn.query(`DROP TABLE IF EXISTS ${quotedTable}`);
+ } catch {
+ // Some minimal adapters may not support DDL via query().
+ }
+ }
+
+ const batchSize = chunksize !== undefined && chunksize > 0 ? chunksize : rows.length;
+ let written = 0;
+
+ for (let start = 0; start < rows.length; start += batchSize) {
+ const end = Math.min(start + batchSize, rows.length);
+
+ for (let i = start; i < end; i++) {
+ const row = rows[i];
+ if (row === undefined) continue;
+ const valList = columns.map((col) => sqlLiteral(row[col] ?? null)).join(", ");
+ conn.query(`INSERT INTO ${quotedTable} (${colList}) VALUES (${valList})`);
+ written += 1;
+ }
+ }
+
+ return written;
+}
diff --git a/src/io/stata.ts b/src/io/stata.ts
new file mode 100644
index 00000000..b5151660
--- /dev/null
+++ b/src/io/stata.ts
@@ -0,0 +1,1149 @@
+/**
+ * readStata / toStata — Stata DTA file I/O for DataFrame.
+ *
+ * Mirrors `pandas.read_stata()` and `DataFrame.to_stata()`:
+ * - `readStata(data, options?)` — parse a Stata DTA binary buffer into a DataFrame
+ * - `toStata(df, options?)` — serialize a DataFrame to a Stata DTA binary buffer
+ *
+ * Supported DTA versions:
+ * - Reading: v114/v115 (old binary format, auto-detects byte order)
+ * - Reading: v117/v118/v119 (new XML-tagged format, auto-detects byte order)
+ * - Writing: v118 (new format, little-endian)
+ *
+ * Column types handled:
+ * - byte (int8), int (int16), long (int32), float (float32), double (float64)
+ * - str1..str2045 (fixed-width strings), strl (long strings, v117+)
+ * - Missing values → `null`
+ * - Value labels optionally applied with `convertCategoricals: true`
+ *
+ * @module
+ */
+
+import { DataFrame } from "../core/frame.ts";
+import { Index } from "../core/index.ts";
+import type { Label, Scalar } from "../types.ts";
+
+// ─── Public Types ─────────────────────────────────────────────────────────────
+
+/** Options for {@link readStata}. */
+export interface ReadStataOptions {
+ /**
+ * Column name or 0-based index to use as the row index.
+ * Default: `null` (RangeIndex).
+ */
+ readonly indexCol?: string | number | null;
+ /** Maximum number of data rows to read. Default: unlimited. */
+ readonly nRows?: number;
+ /**
+ * Apply value labels to integer columns that have them, replacing
+ * numeric codes with their string labels. Default: `false`.
+ */
+ readonly convertCategoricals?: boolean;
+ /**
+ * Only include these column names. `null` = all columns.
+ * Default: `null`.
+ */
+ readonly usecols?: readonly string[] | null;
+}
+
+/** Options for {@link toStata}. */
+export interface ToStataOptions {
+ /** Dataset label (up to 80 characters). Default: `""`. */
+ readonly dataLabel?: string;
+ /**
+ * Write the DataFrame's row index as a column named `"_index"`.
+ * Default: `false`.
+ */
+ readonly writeIndex?: boolean;
+ /**
+ * Map of column name → variable label (up to 80 characters).
+ * Default: `{}`.
+ */
+ readonly variableLabels?: Readonly<Record<string, string>>;
+}
+
+// ─── Internal Types ───────────────────────────────────────────────────────────
+
+/** Column descriptor parsed from a DTA file. */
+interface ColDesc {
+ readonly name: string;
+ /** Raw Stata type code. */
+ readonly code: number;
+ /** Byte width of this column in the data section. */
+ readonly width: number;
+ /** True if this column holds a strl reference (v117+). */
+ readonly isStrl: boolean;
+}
+
+/** Internal representation of a fully parsed DTA file. */
+interface DtaData {
+ readonly cols: ColDesc[];
+ readonly rows: Scalar[][];
+ readonly lblNames: string[];
+ readonly varLabels: string[];
+ readonly valueLabels: Map<string, Map<number, string>>;
+}
+
+// ─── Constants ────────────────────────────────────────────────────────────────
+
+/** New-format (v117+) numeric type codes. */
+const TC_DOUBLE = 65526;
+const TC_FLOAT = 65527;
+const TC_LONG = 65528;
+const TC_INT = 65529;
+const TC_BYTE = 65530;
+const TC_STRL = 32768;
+
+/** Missing-value sentinels for integer types. */
+const MISS_BYTE = 101; // int8 >= 101 is missing
+const MISS_INT = 32741; // int16 >= 32741 is missing
+const MISS_LONG = 2147483621; // int32 >= 2147483621 is missing
+
+/** Stata float missing: bit pattern 0x7f000000 or higher. */
+const MISS_F32_BITS = 0x7f000000;
+/** Stata double missing: high-32-bit pattern 0x7fe00000 or higher. */
+const MISS_F64_HI = 0x7fe00000;
+/** Stata double missing written as uint32 pair (LE). */
+const MISS_F64_LO32 = 0x00000000;
+const MISS_F64_HI32 = 0x7fe00000;
+
+// ─── Missing Value Helpers ────────────────────────────────────────────────────
+
+function isMissF32(view: DataView, pos: number, le: boolean): boolean {
+ const bits = view.getUint32(pos, le);
+ // Stata float missing values have sign=0 and bits >= 0x7f000000.
+ // Negative floats have bit 31 set (bits >= 0x80000000) and must not be treated as missing.
+ return bits >= MISS_F32_BITS && bits < 0x80000000;
+}
+
+function isMissF64(view: DataView, pos: number, le: boolean): boolean {
+ const hiOff = le ? pos + 4 : pos;
+ const hi = view.getUint32(hiOff, le);
+ // Stata double missing values have sign=0 and high bits >= 0x7fe00000.
+ // Negative doubles have bit 31 set (hi >= 0x80000000) and must not be treated as missing.
+ return hi >= MISS_F64_HI && hi < 0x80000000;
+}
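+
+/*
+ * A standalone sketch of the float32 range test above, assuming only the
+ * standard `DataView` API; `isMissingF32Bits` is an illustrative name and is
+ * not part of this module.
+ *
+ * ```ts
+ * // Missing: sign bit clear and raw bits at or above 0x7f000000.
+ * const isMissingF32Bits = (bits: number): boolean =>
+ *   bits >= 0x7f000000 && bits < 0x80000000;
+ *
+ * const view = new DataView(new ArrayBuffer(4));
+ *
+ * view.setFloat32(0, -1.5, true); // negative value: sign bit set
+ * const negativeIsMissing = isMissingF32Bits(view.getUint32(0, true)); // false
+ *
+ * view.setFloat32(0, 3.25, true); // ordinary positive value
+ * const positiveIsMissing = isMissingF32Bits(view.getUint32(0, true)); // false
+ *
+ * view.setUint32(0, 0x7f000000, true); // lowest missing bit pattern
+ * const lowestIsMissing = isMissingF32Bits(view.getUint32(0, true)); // true
+ * ```
+ */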
+
+// ─── Text Codecs ──────────────────────────────────────────────────────────────
+
+const ENC = new TextEncoder();
+const LATIN1 = new TextDecoder("latin1");
+const UTF8D = new TextDecoder("utf-8");
+
+// ─── BinReader ────────────────────────────────────────────────────────────────
+
+class BinReader {
+ pos = 0;
+ /** Byte order: `true` = little-endian, `false` = big-endian. Mutable. */
+ le: boolean;
+ private readonly view: DataView;
+ readonly u8: Uint8Array;
+
+ constructor(data: Uint8Array | ArrayBuffer, le = true) {
+ if (data instanceof ArrayBuffer) {
+ this.u8 = new Uint8Array(data);
+ this.view = new DataView(data);
+ } else {
+ this.u8 = data;
+ this.view = new DataView(data.buffer, data.byteOffset, data.byteLength);
+ }
+ this.le = le;
+ }
+
+ seek(p: number): void {
+ this.pos = p;
+ }
+
+ skip(n: number): void {
+ this.pos += n;
+ }
+
+ readU8(): number {
+ return this.view.getUint8(this.pos++);
+ }
+
+ readI8(): number {
+ return this.view.getInt8(this.pos++);
+ }
+
+ readU16(): number {
+ const v = this.view.getUint16(this.pos, this.le);
+ this.pos += 2;
+ return v;
+ }
+
+ readI16(): number {
+ const v = this.view.getInt16(this.pos, this.le);
+ this.pos += 2;
+ return v;
+ }
+
+ readU32(): number {
+ const v = this.view.getUint32(this.pos, this.le);
+ this.pos += 4;
+ return v;
+ }
+
+ readI32(): number {
+ const v = this.view.getInt32(this.pos, this.le);
+ this.pos += 4;
+ return v;
+ }
+
+ readF32(): number {
+ const v = this.view.getFloat32(this.pos, this.le);
+ this.pos += 4;
+ return v;
+ }
+
+ readF64(): number {
+ const v = this.view.getFloat64(this.pos, this.le);
+ this.pos += 8;
+ return v;
+ }
+
+ /** Read uint64 as a JS number (safe for values ≤ 2^53). */
+ readU64(): number {
+ const a = this.view.getUint32(this.pos, this.le);
+ const b = this.view.getUint32(this.pos + 4, this.le);
+ this.pos += 8;
+ return this.le ? a + b * 4294967296 : b + a * 4294967296;
+ }
+
+ readBytes(n: number): Uint8Array {
+ const s = this.u8.subarray(this.pos, this.pos + n);
+ this.pos += n;
+ return s;
+ }
+
+ /** Read a fixed-width field as a null-terminated Latin-1 string. */
+ readCStr(fieldLen: number): string {
+ const b = this.readBytes(fieldLen);
+ let end = 0;
+ while (end < b.length && (b[end] ?? 0) !== 0) {
+ end++;
+ }
+ return LATIN1.decode(b.subarray(0, end));
+ }
+
+ /** Read a fixed-width field, trim trailing null bytes and spaces. */
+ readTrimStr(fieldLen: number): string {
+ const b = this.readBytes(fieldLen);
+ let end = b.length;
+ while (end > 0 && ((b[end - 1] ?? 0) === 0 || (b[end - 1] ?? 0) === 0x20)) {
+ end--;
+ }
+ return LATIN1.decode(b.subarray(0, end));
+ }
+
+ /** Read and verify an ASCII tag. Throws on mismatch. */
+ expectTag(tag: string): void {
+ const tb = ENC.encode(tag);
+ for (let i = 0; i < tb.length; i++) {
+ if ((this.u8[this.pos + i] ?? -1) !== (tb[i] ?? 0)) {
+ const got = LATIN1.decode(this.u8.subarray(this.pos, this.pos + tb.length));
+ throw new Error(`Stata DTA: expected "${tag}", got "${got}" at offset ${this.pos}`);
+ }
+ }
+ this.pos += tb.length;
+ }
+
+ /** Scan forward until the given ASCII tag is found and consumed. */
+ skipToTag(tag: string): void {
+ const tb = ENC.encode(tag);
+ const len = tb.length;
+ for (let i = this.pos; i + len <= this.u8.length; i++) {
+ let ok = true;
+ for (let j = 0; j < len; j++) {
+ if (this.u8[i + j] !== tb[j]) {
+ ok = false;
+ break;
+ }
+ }
+ if (ok) {
+ this.pos = i + len;
+ return;
+ }
+ }
+ throw new Error(`Stata DTA: tag "${tag}" not found`);
+ }
+
+ get dataView(): DataView {
+ return this.view;
+ }
+}
+
+// ─── BinWriter ────────────────────────────────────────────────────────────────
+
+class BinWriter {
+ private buf: Uint8Array;
+ private _pos = 0;
+ private view: DataView;
+ readonly le: boolean;
+
+ constructor(capacity = 8192, le = true) {
+ this.buf = new Uint8Array(capacity);
+ this.view = new DataView(this.buf.buffer);
+ this.le = le;
+ }
+
+ get pos(): number {
+ return this._pos;
+ }
+
+ private grow(need: number): void {
+ if (this._pos + need <= this.buf.length) return;
+ let next = this.buf.length * 2;
+ while (this._pos + need > next) next *= 2;
+ const nb = new Uint8Array(next);
+ nb.set(this.buf.subarray(0, this._pos));
+ this.buf = nb;
+ this.view = new DataView(nb.buffer);
+ }
+
+ writeU8(v: number): void {
+ this.grow(1);
+ this.view.setUint8(this._pos++, v);
+ }
+
+ writeI8(v: number): void {
+ this.grow(1);
+ this.view.setInt8(this._pos++, v);
+ }
+
+ writeU16(v: number): void {
+ this.grow(2);
+ this.view.setUint16(this._pos, v, this.le);
+ this._pos += 2;
+ }
+
+ writeI16(v: number): void {
+ this.grow(2);
+ this.view.setInt16(this._pos, v, this.le);
+ this._pos += 2;
+ }
+
+ writeU32(v: number): void {
+ this.grow(4);
+ this.view.setUint32(this._pos, v, this.le);
+ this._pos += 4;
+ }
+
+ writeI32(v: number): void {
+ this.grow(4);
+ this.view.setInt32(this._pos, v, this.le);
+ this._pos += 4;
+ }
+
+ writeF32(v: number): void {
+ this.grow(4);
+ this.view.setFloat32(this._pos, v, this.le);
+ this._pos += 4;
+ }
+
+ writeF64(v: number): void {
+ this.grow(8);
+ this.view.setFloat64(this._pos, v, this.le);
+ this._pos += 8;
+ }
+
+ writeU64(v: number): void {
+ this.grow(8);
+ const lo = v >>> 0;
+ const hi = Math.floor(v / 4294967296) >>> 0;
+ if (this.le) {
+ this.view.setUint32(this._pos, lo, true);
+ this.view.setUint32(this._pos + 4, hi, true);
+ } else {
+ this.view.setUint32(this._pos, hi, false);
+ this.view.setUint32(this._pos + 4, lo, false);
+ }
+ this._pos += 8;
+ }
+
+ /** Overwrite a previously-written uint64 value at `offset`. */
+ patchU64(offset: number, v: number): void {
+ const lo = v >>> 0;
+ const hi = Math.floor(v / 4294967296) >>> 0;
+ if (this.le) {
+ this.view.setUint32(offset, lo, true);
+ this.view.setUint32(offset + 4, hi, true);
+ } else {
+ this.view.setUint32(offset, hi, false);
+ this.view.setUint32(offset + 4, lo, false);
+ }
+ }
+
+ writeBytes(b: Uint8Array): void {
+ this.grow(b.length);
+ this.buf.set(b, this._pos);
+ this._pos += b.length;
+ }
+
+ writeAscii(s: string): void {
+ this.writeBytes(ENC.encode(s));
+ }
+
+ /** Write a null-padded fixed-length ASCII field of exactly `fieldLen` bytes. */
+ writeFixed(s: string, fieldLen: number): void {
+ this.grow(fieldLen);
+ const b = ENC.encode(s);
+ const n = Math.min(b.length, fieldLen);
+ for (let i = 0; i < n; i++) this.view.setUint8(this._pos + i, b[i] ?? 0);
+ for (let i = n; i < fieldLen; i++) this.view.setUint8(this._pos + i, 0);
+ this._pos += fieldLen;
+ }
+
+ finalize(): Uint8Array {
+ return this.buf.slice(0, this._pos);
+ }
+}
+
+// ─── Old Format Parser (v114/v115) ────────────────────────────────────────────
+
+function parseOldFormat(u8: Uint8Array, version: number): DtaData {
+ const byteOrderCode = u8[1] ?? 2;
+ const le = byteOrderCode === 2; // 2 = LOHI (little-endian), 1 = HILO (big-endian)
+ const r = new BinReader(u8, le);
+
+ r.skip(4); // ds_format, byte_order, filetype, padding
+ const nvar = r.readU16();
+ const nobs = r.readU32();
+ r.readCStr(81); // data_label (ignored)
+ r.readCStr(18); // time_stamp (ignored)
+ // offset = 109
+
+ // typlist: 1 byte per column
+ const stataTypes: number[] = [];
+ for (let i = 0; i < nvar; i++) stataTypes.push(r.readU8());
+
+ // varlist
+ const colSize = version > 113 ? 33 : 10;
+ const names: string[] = [];
+ for (let i = 0; i < nvar; i++) names.push(r.readCStr(colSize));
+
+ // srtlist (skip)
+ r.skip((nvar + 1) * 2);
+
+ // fmtlist (skip)
+ const fmtSize = version > 113 ? 49 : 13;
+ r.skip(nvar * fmtSize);
+
+ // lbllist (value label names)
+ const lblSize = version > 113 ? 33 : 10;
+ const lblNames: string[] = [];
+ for (let i = 0; i < nvar; i++) lblNames.push(r.readCStr(lblSize));
+
+ // variable_labels
+ const varLabels: string[] = [];
+ for (let i = 0; i < nvar; i++) varLabels.push(r.readCStr(81));
+
+ // Expansion fields (characteristics): each entry is a 1-byte type and a
+ // 4-byte length followed by `len` bytes of payload; type 0 ends the list.
+ while (r.pos + 5 <= u8.length) {
+ const chType = r.readU8();
+ const len = r.readU32();
+ if (chType === 0) break;
+ r.skip(len);
+ }
+
+ // Build column descriptors
+ const cols: ColDesc[] = [];
+ for (let i = 0; i < nvar; i++) {
+ const t = stataTypes[i] ?? 255;
+ let width: number;
+ if (t <= 244) {
+ width = t; // str
+ } else if (t === 251) {
+ width = 1; // byte
+ } else if (t === 252) {
+ width = 2; // int
+ } else if (t === 253 || t === 254) {
+ width = 4; // long or float
+ } else {
+ width = 8; // double (255) or unknown
+ }
+ cols.push({ name: names[i] ?? `var${i}`, code: t, width, isStrl: false });
+ }
+
+ // Read data rows
+ const dv = r.dataView;
+ const rows: Scalar[][] = [];
+ for (let row = 0; row < nobs; row++) {
+ const rowData: Scalar[] = [];
+ for (const col of cols) {
+ const t = col.code;
+ if (t <= 244) {
+ rowData.push(r.readTrimStr(t));
+ } else if (t === 251) {
+ // byte (int8): missing if >= MISS_BYTE
+ const v = r.readI8();
+ rowData.push(v >= MISS_BYTE ? null : v);
+ } else if (t === 252) {
+ // int (int16): missing if >= MISS_INT
+ const v = r.readI16();
+ rowData.push(v >= MISS_INT ? null : v);
+ } else if (t === 253) {
+ // long (int32): missing if >= MISS_LONG
+ const v = r.readI32();
+ rowData.push(v >= MISS_LONG ? null : v);
+ } else if (t === 254) {
+ // float (float32): check bit pattern
+ const missing = isMissF32(dv, r.pos, le);
+ const v = r.readF32();
+ rowData.push(missing ? null : v);
+ } else {
+ // double (float64): check bit pattern
+ const missing = isMissF64(dv, r.pos, le);
+ const v = r.readF64();
+ rowData.push(missing ? null : v);
+ }
+ }
+ rows.push(rowData);
+ }
+
+ const valueLabels = parseOldValueLabels(r, version);
+ return { cols, rows, lblNames, varLabels, valueLabels };
+}
+
+function parseOldValueLabels(r: BinReader, version: number): Map<string, Map<number, string>> {
+ const result = new Map<string, Map<number, string>>();
+ const lblSize = version > 113 ? 33 : 10;
+
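+ // Each table: int32 table length, label name, 3 padding bytes, then n and txtlen.
+ while (r.pos + 4 + lblSize + 11 < r.u8.length) {
+ r.skip(4); // table length (unused; n and txtlen describe the layout)
+ const labname = r.readCStr(lblSize);
+ r.skip(3); // padding
+ const n = r.readU32();
+ const txtlen = r.readU32();
+ if (labname.length === 0 || n === 0 || txtlen === 0) break;
+ if (r.pos + n * 8 + txtlen > r.u8.length) break;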
+
+ const offsets: number[] = [];
+ for (let i = 0; i < n; i++) offsets.push(r.readU32());
+ const values: number[] = [];
+ for (let i = 0; i < n; i++) values.push(r.readI32());
+ const txt = r.readBytes(txtlen);
+
+ const map = new Map<number, string>();
+ for (let i = 0; i < n; i++) {
+ const off = offsets[i] ?? 0;
+ let end = off;
+ while (end < txt.length && (txt[end] ?? 0) !== 0) end++;
+ const label = LATIN1.decode(txt.subarray(off, end));
+ const val = values[i];
+ if (val !== undefined) map.set(val, label);
+ }
+ result.set(labname, map);
+ }
+ return result;
+}
+
+// ─── New Format Parser (v117/v118/v119) ───────────────────────────────────────
+
+function parseNewFormat(u8: Uint8Array, version: number): DtaData {
+ const r = new BinReader(u8, true); // initially LE; updated after reading byteorder
+
+ r.expectTag("<stata_dta>");
+ r.expectTag("<header>");
+ r.expectTag("<release>");
+ r.skip(3); // 3-byte ASCII version string
+ r.expectTag("</release>");
+ r.expectTag("<byteorder>");
+ const bo = LATIN1.decode(r.readBytes(3));
+ r.le = bo !== "MSF"; // "LSF" = little-endian, "MSF" = big-endian
+ r.expectTag("</byteorder>");
+ r.expectTag("<K>");
+ const nvar = r.readU16();
+ r.expectTag("</K>");
+ r.expectTag("<N>");
+ const nobs = version >= 119 ? r.readU64() : r.readU32();
+ r.expectTag("</N>");
+ r.expectTag("");
+ r.expectTag("");
+ const tsLen = version > 117 ? r.readU16() : r.readU8();
+ r.skip(tsLen);
+ r.expectTag("</timestamp>");
+ r.expectTag("</header>");
+
+ // Map: 14 × uint64 file offsets
+ r.expectTag("");
+
+ // variable_types
+ const seekVT = mapOff[2] ?? 0;
+ if (seekVT > 0) r.seek(seekVT);
+ r.expectTag("<variable_types>");
+ const varCodes: number[] = [];
+ for (let i = 0; i < nvar; i++) varCodes.push(r.readU16());
+ r.expectTag("</variable_types>");
+
+ // varnames
+ const seekVN = mapOff[3] ?? 0;
+ if (seekVN > 0) r.seek(seekVN);
+ r.expectTag("<varnames>");
+ const varNameLen = version >= 119 ? 129 : 33;
+ const names: string[] = [];
+ for (let i = 0; i < nvar; i++) names.push(r.readCStr(varNameLen));
+ r.expectTag("</varnames>");
+
+ // value_label_names (skip sortlist and formats)
+ const seekVLN = mapOff[6] ?? 0;
+ if (seekVLN > 0) r.seek(seekVLN);
+ r.expectTag("<value_label_names>");
+ const vlNameLen = version >= 119 ? 129 : 33;
+ const lblNames: string[] = [];
+ for (let i = 0; i < nvar; i++) lblNames.push(r.readCStr(vlNameLen));
+ r.expectTag("</value_label_names>");
+
+ // variable_labels
+ const seekVL = mapOff[7] ?? 0;
+ if (seekVL > 0) r.seek(seekVL);
+ r.expectTag("<variable_labels>");
+ const varLabels: string[] = [];
+ for (let i = 0; i < nvar; i++) varLabels.push(r.readCStr(81));
+ r.expectTag("</variable_labels>");
+
+ // Build column descriptors
+ const cols: ColDesc[] = [];
+ for (let i = 0; i < nvar; i++) {
+ const code = varCodes[i] ?? TC_DOUBLE;
+ let width: number;
+ let isStrl = false;
+ if (code <= 2045) {
+ width = code; // str (fixed string of that length)
+ } else if (code === TC_STRL) {
+ // strl reference: uint16 v + uint32 o (v117) or uint64 o (v118+)
+ width = version >= 118 ? 10 : 6;
+ isStrl = true;
+ } else if (code === TC_BYTE) {
+ width = 1;
+ } else if (code === TC_INT) {
+ width = 2;
+ } else if (code === TC_LONG || code === TC_FLOAT) {
+ width = 4;
+ } else {
+ width = 8; // TC_DOUBLE or unknown
+ }
+ cols.push({ name: names[i] ?? `var${i}`, code, width, isStrl });
+ }
+
+ // Read strls section if any strl columns exist
+ const strlMap = new Map<string, string>(); // "v,o" → string value
+ const seekST = mapOff[10] ?? 0;
+ if (seekST > 0 && cols.some((c) => c.isStrl)) {
+ r.seek(seekST);
+ r.expectTag("<strls>");
+ while (r.pos + 3 <= r.u8.length) {
+ if ((r.u8[r.pos] ?? 0) === 0x3c) break; // '<' = start of </strls>
+ // Check for "GSO" magic
+ if (
+ (r.u8[r.pos] ?? 0) !== 0x47 ||
+ (r.u8[r.pos + 1] ?? 0) !== 0x53 ||
+ (r.u8[r.pos + 2] ?? 0) !== 0x4f
+ ) {
+ break;
+ }
+ r.skip(3); // "GSO"
+ const gsoV = r.readU16();
+ const gsoO = version >= 118 ? r.readU64() : r.readU32();
+ const t = r.readU8(); // 129=binary, 130=string
+ const len = r.readU32();
+ const data = r.readBytes(len);
+ if (t === 130) {
+ // string: null-terminated UTF-8
+ let end = 0;
+ while (end < data.length && (data[end] ?? 0) !== 0) end++;
+ strlMap.set(`${gsoV},${gsoO}`, UTF8D.decode(data.subarray(0, end)));
+ }
+ }
+ r.skipToTag("</strls>");
+ }
+
+ // Read data section
+ const seekDA = mapOff[9] ?? 0;
+ if (seekDA > 0) r.seek(seekDA);
+ r.expectTag("<data>");
+ const dv = r.dataView;
+ const rows: Scalar[][] = [];
+ for (let row = 0; row < nobs; row++) {
+ const rowData: Scalar[] = [];
+ for (const col of cols) {
+ const code = col.code;
+ if (code <= 2045) {
+ rowData.push(r.readTrimStr(code));
+ } else if (col.isStrl) {
+ const gv = r.readU16();
+ const go = version >= 118 ? r.readU64() : r.readU32();
+ rowData.push(strlMap.get(`${gv},${go}`) ?? null);
+ } else if (code === TC_BYTE) {
+ const v = r.readI8();
+ rowData.push(v >= MISS_BYTE ? null : v);
+ } else if (code === TC_INT) {
+ const v = r.readI16();
+ rowData.push(v >= MISS_INT ? null : v);
+ } else if (code === TC_LONG) {
+ const v = r.readI32();
+ rowData.push(v >= MISS_LONG ? null : v);
+ } else if (code === TC_FLOAT) {
+ const missing = isMissF32(dv, r.pos, r.le);
+ const v = r.readF32();
+ rowData.push(missing ? null : v);
+ } else {
+ // TC_DOUBLE
+ const missing = isMissF64(dv, r.pos, r.le);
+ const v = r.readF64();
+ rowData.push(missing ? null : v);
+ }
+ }
+ rows.push(rowData);
+ }
+ r.expectTag("</data>");
+
+ // Value labels
+ const seekVA = mapOff[11] ?? 0;
+ if (seekVA > 0) r.seek(seekVA);
+ const valueLabels = parseNewValueLabels(r, version);
+ return { cols, rows, lblNames, varLabels, valueLabels };
+}
+
+function parseNewValueLabels(r: BinReader, version: number): Map<string, Map<number, string>> {
+ const result = new Map<string, Map<number, string>>();
+ const lblSize = version >= 119 ? 129 : 33;
+
+ r.expectTag("<value_labels>");
+ while (r.pos + 5 < r.u8.length) {
+ if ((r.u8[r.pos] ?? 0) === 0x3c && (r.u8[r.pos + 1] ?? 0) === 0x2f) break; // "</value_labels>"
+ r.expectTag("<lbl>");
+ r.readU32(); // total byte length (informational)
+ const labname = r.readCStr(lblSize);
+ r.skip(3); // padding
+ const n = r.readU32();
+ const txtlen = r.readU32();
+ const offsets: number[] = [];
+ for (let i = 0; i < n; i++) offsets.push(r.readU32());
+ const values: number[] = [];
+ for (let i = 0; i < n; i++) values.push(r.readI32());
+ const txt = r.readBytes(txtlen);
+ r.expectTag("</lbl>");
+
+ if (labname.length > 0 && n > 0) {
+ const map = new Map<number, string>();
+ for (let i = 0; i < n; i++) {
+ const off = offsets[i] ?? 0;
+ let end = off;
+ while (end < txt.length && (txt[end] ?? 0) !== 0) end++;
+ const label = UTF8D.decode(txt.subarray(off, end));
+ const val = values[i];
+ if (val !== undefined) map.set(val, label);
+ }
+ result.set(labname, map);
+ }
+ }
+ return result;
+}
+
+// ─── DataFrame Builder ────────────────────────────────────────────────────────
+
+function isLabel(v: Scalar): v is Label {
+ return (
+ v === null ||
+ typeof v === "number" ||
+ typeof v === "string" ||
+ typeof v === "boolean" ||
+ v instanceof Date
+ );
+}
+
+function buildDataFrame(data: DtaData, opts: ReadStataOptions): DataFrame {
+ const { cols, rows, lblNames, valueLabels } = data;
+ const { indexCol = null, nRows, convertCategoricals = false, usecols = null } = opts;
+ const limit = nRows !== undefined ? Math.min(nRows, rows.length) : rows.length;
+
+ // Determine active column indices
+ let activeIdx = cols.map((_, i) => i);
+ if (usecols !== null) {
+ const keep = new Set(usecols);
+ activeIdx = activeIdx.filter((i) => keep.has(cols[i]?.name ?? ""));
+ }
+
+ // Build column arrays from rows
+ const arrays: Scalar[][] = activeIdx.map(() => []);
+ for (let ri = 0; ri < limit; ri++) {
+ const row = rows[ri];
+ if (row === undefined) continue;
+ for (let ci = 0; ci < activeIdx.length; ci++) {
+ const colIdx = activeIdx[ci] ?? 0;
+ (arrays[ci] ?? []).push(row[colIdx] ?? null);
+ }
+ }
+
+ // Apply value labels (convertCategoricals)
+ if (convertCategoricals) {
+ for (let ci = 0; ci < activeIdx.length; ci++) {
+ const colIdx = activeIdx[ci] ?? 0;
+ const lblName = lblNames[colIdx] ?? "";
+ if (lblName.length === 0) continue;
+ const lblMap = valueLabels.get(lblName);
+ if (lblMap === undefined) continue;
+ const arr = arrays[ci];
+ if (arr === undefined) continue;
+ for (let ri = 0; ri < arr.length; ri++) {
+ const v = arr[ri];
+ if (typeof v === "number") {
+ const label = lblMap.get(v);
+ if (label !== undefined) arr[ri] = label;
+ }
+ }
+ }
+ }
+
+ // Build column data record
+ const colData: Record<string, Scalar[]> = {};
+ for (let ci = 0; ci < activeIdx.length; ci++) {
+ const colIdx = activeIdx[ci] ?? 0;
+ colData[cols[colIdx]?.name ?? `var${colIdx}`] = arrays[ci] ?? [];
+ }
+
+ // Handle indexCol
+ let idxName: string | null = null;
+ if (typeof indexCol === "string") {
+ idxName = indexCol;
+ } else if (typeof indexCol === "number") {
+ const mapped = activeIdx[indexCol];
+ if (mapped !== undefined) idxName = cols[mapped]?.name ?? null;
+ }
+
+ if (idxName !== null && idxName in colData) {
+ const idxData = (colData[idxName] ?? []).filter(isLabel);
+ const rest: Record<string, Scalar[]> = {};
+ for (const [k, v] of Object.entries(colData)) {
+ if (k !== idxName) rest[k] = v;
+ }
+ return DataFrame.fromColumns(rest, { index: new Index(idxData) });
+ }
+
+ return DataFrame.fromColumns(colData);
+}
+
+// ─── readStata ────────────────────────────────────────────────────────────────
+
+/**
+ * Parse a Stata DTA file into a {@link DataFrame}.
+ *
+ * Supports DTA versions 114/115 (old binary format) and 117/118/119
+ * (new XML-tagged format). Numeric missing values are represented as `null`.
+ *
+ * @example
+ * ```ts
+ * import { readStata } from "tsb";
+ * const buf = await Bun.file("data.dta").arrayBuffer();
+ * const df = readStata(buf);
+ * df.shape; // [nobs, nvar]
+ * df.columns.toArray(); // ["age", "income", ...]
+ * ```
+ */
+export function readStata(
+ data: Uint8Array | ArrayBuffer,
+ options: ReadStataOptions = {},
+): DataFrame {
+ const u8 = data instanceof Uint8Array ? data : new Uint8Array(data);
+ if (u8.length < 4) throw new Error("Stata DTA: buffer too small");
+
+ let parsed: DtaData;
+ const firstByte = u8[0] ?? 0;
+
+ if (firstByte === 0x3c) {
+ // New format: starts with "<stata_dta>"
+ const header100 = LATIN1.decode(u8.subarray(0, Math.min(100, u8.length)));
+ const m = /<release>(\d+)<\/release>/.exec(header100);
+ const version = m?.[1] !== undefined ? Number.parseInt(m[1], 10) : 118;
+ parsed = parseNewFormat(u8, version);
+ } else {
+ // Old binary format: first byte is the version number
+ const version = firstByte;
+ if (version < 104 || version > 115) {
+ throw new Error(`Stata DTA: unsupported version byte ${version}`);
+ }
+ parsed = parseOldFormat(u8, version);
+ }
+
+ return buildDataFrame(parsed, options);
+}
+
+// ─── toStata ─────────────────────────────────────────────────────────────────
+
+/**
+ * Serialize a {@link DataFrame} to a Stata DTA v118 binary file.
+ *
+ * Column type mapping:
+ * - `number` → `double` (float64)
+ * - `boolean` → `byte` (int8, stored as 0/1)
+ * - `string` → `str` (fixed-width, up to 2045 bytes; longer strings truncated)
+ * - `null` / `undefined` → Stata missing value for the column's type
+ *
+ * @example
+ * ```ts
+ * import { DataFrame, toStata } from "tsb";
+ * const df = DataFrame.fromColumns({
+ * age: [25, 30, null],
+ * name: ["Alice", "Bob", "Carol"],
+ * });
+ * const buf = toStata(df);
+ * await Bun.write("data.dta", buf);
+ * ```
+ */
+export function toStata(df: DataFrame, options: ToStataOptions = {}): Uint8Array {
+ const { dataLabel = "", writeIndex = false, variableLabels = {} } = options;
+
+ // Collect columns
+ const colNames: string[] = [];
+ const colArrays: Scalar[][] = [];
+
+ if (writeIndex) {
+ colNames.push("_index");
+ colArrays.push([...df.index.toArray()]);
+ }
+ for (const name of df.columns.values) {
+ colNames.push(name);
+ colArrays.push([...df.col(name).toArray()]);
+ }
+
+ const nvar = colNames.length;
+ const nobs = df.shape[0];
+
+ // Determine Stata type for each column
+ const stataTypes: number[] = [];
+ for (let ci = 0; ci < nvar; ci++) {
+ const arr = colArrays[ci] ?? [];
+ let hasStr = false;
+ let maxStrLen = 0;
+ let allBoolOrNum = true;
+ let allBool = true;
+ for (const v of arr) {
+ if (v === null || v === undefined) continue;
+ if (typeof v === "string") {
+ hasStr = true;
+ allBoolOrNum = false;
+ allBool = false;
+ const len = ENC.encode(v).length;
+ if (len > maxStrLen) maxStrLen = len;
+ } else if (typeof v !== "boolean") {
+ allBool = false;
+ }
+ }
+ if (hasStr) {
+ stataTypes.push(Math.max(1, Math.min(maxStrLen, 2045)));
+ } else if (allBool && allBoolOrNum) {
+ stataTypes.push(TC_BYTE);
+ } else {
+ stataTypes.push(TC_DOUBLE);
+ }
+ }
+
+ // Compute row width
+ let rowWidth = 0;
+ for (const t of stataTypes) {
+ if (t <= 2045) rowWidth += t;
+ else if (t === TC_BYTE) rowWidth += 1;
+ else if (t === TC_INT) rowWidth += 2;
+ else if (t === TC_LONG || t === TC_FLOAT) rowWidth += 4;
+ else rowWidth += 8; // TC_DOUBLE
+ }
+
+ // Encode data label (UTF-8, max 80 bytes)
+ const labelRaw = dataLabel.length > 80 ? dataLabel.slice(0, 80) : dataLabel;
+ const labelBytes = ENC.encode(labelRaw);
+
+ // Format timestamp: "dd Mon YYYY HH:MM" (always 17 bytes)
+ const now = new Date();
+ const mos = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"];
+ const tsStr = [
+ String(now.getUTCDate()).padStart(2, " "),
+ mos[now.getUTCMonth()] ?? "Jan",
+ String(now.getUTCFullYear()),
+ `${String(now.getUTCHours()).padStart(2, "0")}:${String(now.getUTCMinutes()).padStart(2, "0")}`,
+ ].join(" ");
+ const tsBytes = ENC.encode(tsStr);
+
+ const w = new BinWriter(65536);
+ const mapSlots: number[] = []; // positions of each map uint64 in the output
+
+ // Track offsets as we write sections
+ const sectionOffs = new Array(14).fill(0);
+ sectionOffs[0] = 0; // <stata_dta>
+
+ // ── <stata_dta> ──
+ w.writeAscii("<stata_dta>");
+
+ // ── <header> ──
+ w.writeAscii("<header>");
+ w.writeAscii("<release>118</release>");
+ w.writeAscii("<byteorder>LSF</byteorder>");
+ w.writeAscii("<K>");
+ w.writeU16(nvar);
+ w.writeAscii("</K>");
+ w.writeAscii("<N>");
+ w.writeU32(nobs);
+ w.writeAscii("");
+ w.writeAscii("");
+ w.writeAscii("");
+ w.writeU16(tsBytes.length);
+ w.writeBytes(tsBytes);
+ w.writeAscii("</timestamp>");
+ w.writeAscii("</header>");
+
+ // ── </stata_dta> ──
+ sectionOffs[12] = w.pos; // end-of-data marker
+ w.writeAscii("</stata_dta>");
+
+ // Patch the map with actual section offsets
+ for (let i = 0; i < 14; i++) {
+ const slotPos = mapSlots[i];
+ if (slotPos !== undefined) {
+ w.patchU64(slotPos, sectionOffs[i] ?? 0);
+ }
+ }
+
+ return w.finalize();
+}
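
Reviewer note: `parseOldValueLabels` and `parseNewValueLabels` decode a label table the same way, from `n` offsets, `n` values, and a text blob of NUL-terminated strings. The sketch below isolates that step. `decodeLabelTable` is a hypothetical helper name that this diff does not define, and the sketch assumes the three pieces have already been read from the buffer.

```typescript
// Hypothetical helper (not part of this diff): turn one Stata value-label
// table into a value → label map.
function decodeLabelTable(
  offsets: readonly number[],
  values: readonly number[],
  txt: Uint8Array,
  decoder: TextDecoder = new TextDecoder("utf-8"),
): Map<number, string> {
  const map = new Map<number, string>();
  for (let i = 0; i < values.length; i++) {
    const start = offsets[i] ?? 0;
    // Each label runs from its offset to the next NUL byte (or end of blob).
    let end = start;
    while (end < txt.length && txt[end] !== 0) end++;
    const value = values[i];
    if (value !== undefined) {
      map.set(value, decoder.decode(txt.subarray(start, end)));
    }
  }
  return map;
}

// Usage: two labels packed as "male\0female\0"; "female" starts at byte 5.
const txt = new TextEncoder().encode("male\0female\0");
const labels = decodeLabelTable([0, 5], [1, 2], txt);
console.log(labels.get(1), labels.get(2));
```

Sharing one helper like this between the two parsers would leave only the decoder (Latin-1 for the old format, UTF-8 for the new one) as the difference.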
diff --git a/src/io/xml.ts b/src/io/xml.ts
new file mode 100644
index 00000000..d343e916
--- /dev/null
+++ b/src/io/xml.ts
@@ -0,0 +1,523 @@
+/**
+ * readXml / toXml — XML I/O for DataFrame.
+ *
+ * Mirrors `pandas.read_xml()` and `DataFrame.to_xml()`:
+ * - `readXml(text, options?)` — parse an XML string into a DataFrame
+ * - `toXml(df, options?)` — serialize a DataFrame to an XML string
+ *
+ * Implemented without any external dependencies — uses a hand-rolled
+ * zero-dependency XML tokenizer that handles:
+ * - Attributes on row elements
+ * - Text-content child elements as columns
+ * - xmlns namespace prefixes (stripped for column names)
+ * - CDATA sections
+ * - XML comments (skipped)
+ * - Entity references (&amp; &lt; &gt; &apos; &quot; &#N; &#xN;)
+ * - nrows, usecols, xpath-like row selection (element name filter)
+ * - naValues, converters (auto-numeric coercion)
+ * - indexCol
+ *
+ * @module
+ */
+
+import { DataFrame } from "../core/frame.ts";
+import { Index } from "../core/index.ts";
+import type { Label, Scalar } from "../types.ts";
+
+function isLabel(v: Scalar): v is Label {
+ return (
+ v === null ||
+ typeof v === "number" ||
+ typeof v === "string" ||
+ typeof v === "boolean" ||
+ v instanceof Date
+ );
+}
+
+// ─── public types ─────────────────────────────────────────────────────────────
+
+/** Options for {@link readXml}. */
+export interface ReadXmlOptions {
+ /**
+ * Local-name of the element to treat as a row. Defaults to the first
+ * repeating child element name found inside the document root.
+ */
+ readonly rowTag?: string;
+
+ /**
+ * Column name or 0-based column index to use as the row index.
+ * Defaults to a plain RangeIndex.
+ */
+ readonly indexCol?: string | number | null;
+
+ /**
+ * Only include these column names (subset). `null` = all columns.
+ */
+ readonly usecols?: readonly string[] | null;
+
+ /**
+ * Extra strings to treat as NaN in addition to the built-in defaults
+ * (`""`, `"NA"`, `"NaN"`, `"N/A"`, `"null"`, `"None"`, `"nan"`).
+ */
+ readonly naValues?: readonly string[];
+
+ /**
+ * Whether to try to coerce column values to numbers. Defaults to `true`.
+ */
+ readonly converters?: boolean;
+
+ /**
+ * Maximum number of rows to read. Defaults to unlimited.
+ */
+ readonly nrows?: number;
+
+ /**
+ * Whether to read element attributes as columns. Defaults to `true`.
+ */
+ readonly attribs?: boolean;
+
+ /**
+ * Whether to read child element text content as columns. Defaults to `true`.
+ */
+ readonly elems?: boolean;
+}
+
+/** Options for {@link toXml}. */
+export interface ToXmlOptions {
+ /**
+ * Name of the document root element. Defaults to `"data"`.
+ */
+ readonly rootName?: string;
+
+ /**
+ * Name of each row element. Defaults to `"row"`.
+ */
+ readonly rowName?: string;
+
+ /**
+ * Emit column values as XML attributes instead of child elements.
+ * Defaults to `false`.
+ */
+ readonly attribs?: boolean;
+
+ /**
+ * Whether to include the `<?xml version="1.0" encoding="UTF-8"?>` declaration.
+ * Defaults to `true`.
+ */
+ readonly xmlDeclaration?: boolean;
+
+ /**
+ * Map of prefix → namespace URI to declare on the root element.
+ * E.g. `{ xsi: "http://www.w3.org/2001/XMLSchema-instance" }`.
+ */
+ readonly namespaces?: Readonly<Record<string, string>>;
+
+ /**
+ * Indentation string (spaces or `"\t"`). Defaults to `" "` (2 spaces).
+ * Set to `""` or `null` to disable indentation.
+ */
+ readonly indent?: string | null;
+
+ /**
+ * Names of columns whose values should be wrapped in a CDATA section.
+ */
+ readonly cdataCols?: readonly string[];
+}
+
+// ─── default NA strings ───────────────────────────────────────────────────────
+
+const DEFAULT_NA: readonly string[] = ["", "NA", "NaN", "N/A", "null", "None", "nan"];
+
+// ─── entity decoding ──────────────────────────────────────────────────────────
+
+const NAMED_ENTITIES: Readonly<Record<string, string>> = {
+ amp: "&",
+ lt: "<",
+ gt: ">",
+ apos: "'",
+ quot: '"',
+ nbsp: "\u00a0",
+};
+
+function decodeEntities(s: string): string {
+ return s.replace(/&([^;]+);/g, (_, ref: string) => {
+ if (ref.startsWith("#x") || ref.startsWith("#X")) {
+ const cp = Number.parseInt(ref.slice(2), 16);
+ return Number.isNaN(cp) ? `&${ref};` : String.fromCodePoint(cp);
+ }
+ if (ref.startsWith("#")) {
+ const cp = Number.parseInt(ref.slice(1), 10);
+ return Number.isNaN(cp) ? `&${ref};` : String.fromCodePoint(cp);
+ }
+ return NAMED_ENTITIES[ref] ?? `&${ref};`;
+ });
+}
+
+// ─── entity encoding ──────────────────────────────────────────────────────────
+
+function encodeEntities(s: string): string {
+ return s
+ .replace(/&/g, "&amp;")
+ .replace(/</g, "&lt;")
+ .replace(/>/g, "&gt;")
+ .replace(/"/g, "&quot;")
+ .replace(/'/g, "&apos;");
+}
+
+// ─── local name (strip namespace prefix) ──────────────────────────────────────
+
+function localName(qname: string): string {
+ const colon = qname.indexOf(":");
+ return colon === -1 ? qname : qname.slice(colon + 1);
+}
+
+// ─── sanitize column name for use as an XML element/attribute name ────────────
+
+/**
+ * Convert a column name to a valid XML Name token.
+ *
+ * XML Name start character: letter or `_` (colon excluded for simplicity).
+ * XML Name character: letter, digit, `.`, `-`, `_`.
+ * Any invalid character is replaced with `_`.
+ */
+function toXmlName(name: string): string {
+ if (name.length === 0) {
+ return "_empty";
+ }
+ const sanitized = name.replace(/[^A-Za-z0-9._-]/g, "_");
+ // If the first character is a digit or hyphen/dot it's an invalid start char.
+ return /^[A-Za-z_]/.test(sanitized) ? sanitized : `_${sanitized}`;
+}
+
+type Token =
+ | { kind: "open"; name: string; attrs: Record<string, string>; selfClose: boolean }
+ | { kind: "close"; name: string }
+ | { kind: "text"; text: string }
+ | { kind: "pi" }
+ | { kind: "comment" }
+ | { kind: "doctype" };
+
+function tokenize(xml: string): Token[] {
+ const tokens: Token[] = [];
+ let pos = 0;
+ const len = xml.length;
+
+ while (pos < len) {
+ if (xml[pos] !== "<") {
+ // text node
+ const end = xml.indexOf("<", pos);
+ const raw = end === -1 ? xml.slice(pos) : xml.slice(pos, end);
+ tokens.push({ kind: "text", text: decodeEntities(raw) });
+ pos = end === -1 ? len : end;
+ continue;
+ }
+ // starts with <
+ if (xml.startsWith("<!--", pos)) {
+ const end = xml.indexOf("-->", pos + 4);
+ tokens.push({ kind: "comment" });
+ pos = end === -1 ? len : end + 3;
+ continue;
+ }
+ if (xml.startsWith("<![CDATA[", pos)) {
+ const end = xml.indexOf("]]>", pos + 9);
+ const text = end === -1 ? xml.slice(pos + 9) : xml.slice(pos + 9, end);
+ tokens.push({ kind: "text", text });
+ pos = end === -1 ? len : end + 3;
+ continue;
+ }
+ if (xml.startsWith("<?", pos)) {
+ const end = xml.indexOf("?>", pos + 2);
+ tokens.push({ kind: "pi" });
+ pos = end === -1 ? len : end + 2;
+ continue;
+ }
+ if (xml.startsWith("<!", pos)) {
+ const end = xml.indexOf(">", pos + 2);
+ tokens.push({ kind: "doctype" });
+ pos = end === -1 ? len : end + 1;
+ continue;
+ }
+ if (xml[pos + 1] === "/") {
+ // closing tag
+ const end = xml.indexOf(">", pos + 2);
+ const raw = end === -1 ? xml.slice(pos + 2) : xml.slice(pos + 2, end);
+ tokens.push({ kind: "close", name: raw.trim() });
+ pos = end === -1 ? len : end + 1;
+ continue;
+ }
+ // opening tag
+ const end = xml.indexOf(">", pos + 1);
+ if (end === -1) {
+ pos = len;
+ continue;
+ }
+ const inner = xml.slice(pos + 1, end);
+ const selfClose = inner.endsWith("/");
+ const tagContent = selfClose ? inner.slice(0, -1) : inner;
+ // parse tag name and attributes
+ const match = /^([^\s/]+)([\s\S]*)$/.exec(tagContent.trim());
+ if (!match) {
+ pos = end + 1;
+ continue;
+ }
+ const [, rawName = "", attrStr = ""] = match;
+ const attrs: Record<string, string> = {};
+ // parse attributes: name="value" or name='value'
+ const attrRe = /([^\s=]+)\s*=\s*(?:"([^"]*)"|'([^']*)')/g;
+ let am: RegExpExecArray | null;
+ while ((am = attrRe.exec(attrStr)) !== null) {
+ const [, attrName = "", dq = "", sq = ""] = am;
+ attrs[localName(attrName)] = decodeEntities(dq || sq);
+ }
+ tokens.push({ kind: "open", name: rawName.trim(), attrs, selfClose });
+ pos = end + 1;
+ }
+ return tokens;
+}
+
+// ─── readXml ──────────────────────────────────────────────────────────────────
+
+/**
+ * Parse an XML string into a DataFrame.
+ *
+ * @example
+ * ```ts
+ * const xml = `<data>
+ * <row id="1"><name>Alice</name><age>30</age></row>
+ * <row id="2"><name>Bob</name><age>25</age></row>
+ * </data>`;
+ * const df = readXml(xml);
+ * df.columns.toArray(); // ["id", "name", "age"]
+ * df.shape; // [2, 3]
+ * ```
+ */
+export function readXml(text: string, options: ReadXmlOptions = {}): DataFrame {
+ const {
+ rowTag,
+ indexCol = null,
+ usecols = null,
+ naValues: extraNa = [],
+ converters = true,
+ nrows,
+ attribs = true,
+ elems = true,
+ } = options;
+
+ const naSet = new Set([...DEFAULT_NA, ...extraNa]);
+
+ const tokens = tokenize(text);
+ const rows: Array<Record<string, string | null>> = [];
+
+ // Discover rowTag from first repeating child of root if not specified
+ let resolvedRowTag = rowTag;
+ if (!resolvedRowTag) {
+ const childCounts: Map<string, number> = new Map();
+ let depth = 0;
+ for (const tok of tokens) {
+ if (tok.kind === "open") {
+ depth++;
+ if (depth === 2) {
+ const n = localName(tok.name);
+ childCounts.set(n, (childCounts.get(n) ?? 0) + 1);
+ }
+ if (tok.selfClose) depth--; // a self-closing tag at any depth has no matching close token
+ } else if (tok.kind === "close") {
+ depth--;
+ }
+ }
+ // pick the element with the highest count (most repeated child of root)
+ let best = "";
+ let bestCount = 0;
+ for (const [name, count] of childCounts) {
+ if (count > bestCount) {
+ bestCount = count;
+ best = name;
+ }
+ }
+ resolvedRowTag = best || "row";
+ }
+
+ // Parse rows
+ let depth = 0;
+ let inRow = false;
+ let currentRow: Record<string, string | null> = {};
+ let currentElem = "";
+ let currentText = "";
+ let rowCount = 0;
+
+ for (const tok of tokens) {
+ if (tok.kind === "open") {
+ depth++;
+ if (!inRow && depth >= 2 && localName(tok.name) === resolvedRowTag) {
+ inRow = true;
+ currentRow = {};
+ if (attribs) {
+ for (const [k, v] of Object.entries(tok.attrs)) {
+ currentRow[k] = v;
+ }
+ }
+ if (tok.selfClose) {
+ inRow = false;
+ rows.push({ ...currentRow });
+ rowCount++;
+ if (nrows !== undefined && rowCount >= nrows) break;
+ }
+ } else if (inRow && elems) {
+ currentElem = localName(tok.name);
+ currentText = "";
+ // self-closing child elem → null
+ if (tok.selfClose) {
+ currentRow[currentElem] = null;
+ currentElem = "";
+ }
+ }
+ if (tok.selfClose) depth--;
+ } else if (tok.kind === "text") {
+ if (inRow && currentElem) {
+ currentText += tok.text;
+ }
+ } else if (tok.kind === "close") {
+ const cln = localName(tok.name);
+ if (inRow && elems && currentElem && cln === currentElem) {
+ currentRow[currentElem] = currentText;
+ currentElem = "";
+ currentText = "";
+ } else if (inRow && cln === resolvedRowTag) {
+ inRow = false;
+ rows.push({ ...currentRow });
+ rowCount++;
+ if (nrows !== undefined && rowCount >= nrows) break;
+ }
+ depth--;
+ }
+ }
+
+ if (rows.length === 0) {
+ return DataFrame.fromColumns({});
+ }
+
+ // Collect all column names in order of first appearance
+ const colSet = new Set<string>();
+ for (const row of rows) {
+ for (const k of Object.keys(row)) colSet.add(k);
+ }
+ let cols = [...colSet];
+ if (usecols) cols = cols.filter((c) => usecols.includes(c));
+
+ // Build column arrays
+ const colData: Record<string, Scalar[]> = {};
+ for (const col of cols) {
+ colData[col] = rows.map((row) => {
+ const raw = row[col] ?? null;
+ if (raw === null || naSet.has(raw)) return null;
+ if (converters) {
+ const n = Number(raw);
+ if (!Number.isNaN(n) && raw.trim() !== "") return n;
+ }
+ return raw;
+ });
+ }
+
+ // Determine index
+ let idxCol: string | null = null;
+ if (typeof indexCol === "string") {
+ idxCol = indexCol;
+ } else if (typeof indexCol === "number" && indexCol < cols.length) {
+ idxCol = cols[indexCol] ?? null;
+ }
+
+ if (idxCol !== null && cols.includes(idxCol)) {
+ const idxData = colData[idxCol] ?? [];
+ const dataColNames = cols.filter((c) => c !== idxCol);
+ const dataColData: Record<string, Scalar[]> = {};
+ for (const c of dataColNames) {
+ dataColData[c] = colData[c] ?? [];
+ }
+ const idx = new Index(idxData.filter(isLabel));
+ return DataFrame.fromColumns(dataColData, { index: idx });
+ }
+
+ return DataFrame.fromColumns(colData);
+}
+
+// ─── toXml ────────────────────────────────────────────────────────────────────
+
+/**
+ * Serialize a DataFrame to an XML string.
+ *
+ * @example
+ * ```ts
+ * const df = DataFrame.fromColumns({ name: ["Alice", "Bob"], age: [30, 25] });
+ * console.log(toXml(df));
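+ * // <?xml version="1.0" encoding="UTF-8"?>
+ * // <data>
+ * //   <row><name>Alice</name><age>30</age></row>
+ * //   <row><name>Bob</name><age>25</age></row>
+ * // </data>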
+ * ```
+ */
+export function toXml(df: DataFrame, options: ToXmlOptions = {}): string {
+ const {
+ rootName = "data",
+ rowName = "row",
+ attribs = false,
+ xmlDeclaration = true,
+ namespaces = {},
+ indent = " ",
+ cdataCols = [],
+ } = options;
+
+ const ind = indent ?? "";
+ const nl = ind ? "\n" : "";
+
+ const lines: string[] = [];
+
+ if (xmlDeclaration) {
+ lines.push('<?xml version="1.0" encoding="UTF-8"?>');
+ }
+
+ // Root element opening with optional namespace declarations
+ const nsAttrs = Object.entries(namespaces)
+ .map(([prefix, uri]) => ` xmlns:${prefix}="${encodeEntities(uri)}"`)
+ .join("");
+ lines.push(`<${rootName}${nsAttrs}>`);
+
+ const columns = df.columns.toArray();
+ const nRows = df.shape[0];
+
+ for (let i = 0; i < nRows; i++) {
+ const rowValues: string[] = [];
+ for (const col of columns) {
+ const series = df.col(col);
+ const val = series.iloc(i);
+ rowValues.push(val === null || val === undefined ? "" : String(val));
+ }
+
+ if (attribs) {
+ // emit as attributes on the row element
+ const attrStr = columns
+ .map((c, j) => `${toXmlName(c)}="${encodeEntities(rowValues[j] ?? "")}"`)
+ .join(" ");
+ lines.push(`${ind}<${rowName} ${attrStr}/>`);
+ } else {
+ // emit as child elements
+ const childLines: string[] = [];
+ for (let j = 0; j < columns.length; j++) {
+ const col = columns[j] ?? "";
+ const tag = toXmlName(col);
+ const raw = rowValues[j] ?? "";
+ const isCdata = cdataCols.includes(col);
+ const content = isCdata ? `<![CDATA[${raw}]]>` : encodeEntities(raw);
+ childLines.push(`${ind}${ind}<${tag}>${content}</${tag}>`);
+ }
+ if (childLines.length === 0) {
+ lines.push(`${ind}<${rowName}/>`);
+ } else {
+ lines.push(`${ind}<${rowName}>${nl}${childLines.join(nl)}${nl}${ind}</${rowName}>`);
+ }
+ }
+ }
+
+ lines.push(`</${rootName}>`);
+ return lines.join(nl) + nl;
+}
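
Reviewer note: `encodeEntities` and `decodeEntities` are meant to be inverses, and the order of replacements matters: `&` has to be escaped first so that the ampersands introduced by later replacements are not escaped again. A minimal standalone sketch of the round trip follows. `escapeXml` and `unescapeXml` are hypothetical names used only for illustration, and the sketch leaves out `nbsp`.

```typescript
// Hypothetical standalone versions of the entity helpers, for illustration.
const NAMED: Readonly<Record<string, string>> = {
  amp: "&",
  lt: "<",
  gt: ">",
  apos: "'",
  quot: '"',
};

function escapeXml(s: string): string {
  return s
    .replace(/&/g, "&amp;") // must run first
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

function unescapeXml(s: string): string {
  // Single pass, so "&amp;lt;" decodes to "&lt;" and not to "<".
  return s.replace(/&([^;]+);/g, (whole: string, ref: string) => {
    if (/^#x[0-9a-f]+$/i.test(ref)) {
      return String.fromCodePoint(Number.parseInt(ref.slice(2), 16));
    }
    if (/^#[0-9]+$/.test(ref)) {
      return String.fromCodePoint(Number.parseInt(ref.slice(1), 10));
    }
    return NAMED[ref] ?? whole; // unknown references are left untouched
  });
}

const original = `a < b && "c"`;
const escaped = escapeXml(original);
console.log(escaped);
console.log(unescapeXml(escaped) === original);
```

A round-trip property test along these lines (escape, then unescape, then compare) would be a cheap guard for the two helpers in `xml.ts`.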
diff --git a/src/reshape/index.ts b/src/reshape/index.ts
index 6e03a5c3..3f132c43 100644
--- a/src/reshape/index.ts
+++ b/src/reshape/index.ts
@@ -14,3 +14,5 @@ export { wideToLong } from "./wide_to_long.ts";
export type { WideToLongOptions } from "./wide_to_long.ts";
export { pivotTableFull } from "./pivot_table.ts";
export type { PivotTableFullOptions } from "./pivot_table.ts";
+export { lreshape } from "./lreshape.ts";
+export type { LreshapeGroups, LreshapeOptions } from "./lreshape.ts";
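
Reviewer note: the `lreshape` export added here stacks the grouped columns block by block, repeats the remaining (id) columns for each block, and with `dropna` drops rows whose stacked value is missing. A minimal sketch of that behaviour on plain row objects follows. `stackGroups` and `Row` are hypothetical names for illustration, the sketch does not use `DataFrame`, and it assumes all group lists have the same length.

```typescript
type Row = Record<string, number | string | null>;

// Hypothetical stand-in for lreshape operating on an array of row objects.
function stackGroups(
  rows: readonly Row[],
  groups: Record<string, readonly string[]>,
  dropna = true,
): Row[] {
  const keys = Object.keys(groups);
  const first = keys[0];
  const k = first === undefined ? 0 : (groups[first]?.length ?? 0);
  // Every column named in a group list is a value column; the rest are ids.
  const grouped = new Set(keys.flatMap((key) => groups[key] ?? []));
  const out: Row[] = [];
  for (let block = 0; block < k; block++) {
    for (const row of rows) {
      const next: Row = {};
      for (const [name, v] of Object.entries(row)) {
        if (!grouped.has(name)) next[name] = v;
      }
      let missing = false;
      for (const key of keys) {
        const src = groups[key]?.[block];
        const v = src === undefined ? null : (row[src] ?? null);
        if (v === null || (typeof v === "number" && Number.isNaN(v))) missing = true;
        next[key] = v;
      }
      if (!(dropna && missing)) out.push(next);
    }
  }
  return out;
}

const wide: Row[] = [
  { hr: 14, team: "Red", v1: 1, v2: 2 },
  { hr: 7, team: "Blue", v1: 3, v2: null },
];
const long = stackGroups(wide, { v: ["v1", "v2"] });
console.log(long);
```

In this usage the second block keeps only the "Red" row, because the "Blue" row has a missing `v2`.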
diff --git a/src/reshape/lreshape.ts b/src/reshape/lreshape.ts
new file mode 100644
index 00000000..ff89fdd1
--- /dev/null
+++ b/src/reshape/lreshape.ts
@@ -0,0 +1,197 @@
+/**
+ * lreshape — reshape wide-format data to long format using named column groups.
+ *
+ * Mirrors `pandas.lreshape(data, groups, dropna=True)`:
+ * - `data`: source DataFrame
+ * - `groups`: mapping from long-format column name → list of wide-format column names
+ * - `dropna`: when `true` (default), drop rows where any value column is `null`/`undefined`/`NaN`
+ *
+ * Each key in `groups` becomes a column in the output. The values (lists of column
+ * names) must all have the same length. The function stacks them vertically such
+ * that the first element of each list forms the first block of rows, the second
+ * element forms the second block, and so on.
+ *
+ * All columns in `data` that are **not** mentioned in any group value list become
+ * identity (id) columns — they are repeated for each block.
+ *
+ * @example
+ * ```ts
+ * const df = DataFrame.fromColumns({
+ * hr: [14, 7],
+ * team: ["Red", "Blue"],
+ * v1: [1, 3],
+ * v2: [2, 4],
+ * });
+ * lreshape(df, { v: ["v1", "v2"] });
+ * // hr team v
+ * // 14 Red 1
+ * // 7 Blue 3
+ * // 14 Red 2
+ * // 7 Blue 4
+ * ```
+ *
+ * @module
+ */
+
+import { DataFrame } from "../core/index.ts";
+import type { Index } from "../core/index.ts";
+import { RangeIndex } from "../core/index.ts";
+import type { Label, Scalar } from "../types.ts";
+
+// ─── public types ──────────────────────────────────────────────────────────────
+
+/**
+ * Groups argument for {@link lreshape}.
+ *
+ * Maps each output column name to an ordered list of input column names.
+ * All lists must have the same length.
+ */
+export type LreshapeGroups = Record<string, readonly string[]>;
+
+/** Options for {@link lreshape}. */
+export interface LreshapeOptions {
+ /**
+ * When `true` (default), rows where **any** value column is `null`,
+ * `undefined`, or `NaN` are dropped from the result.
+ */
+ readonly dropna?: boolean;
+}
+
+// ─── helpers ──────────────────────────────────────────────────────────────────
+
+/** True when a scalar is considered missing: null, undefined, or NaN. */
+function isMissing(v: Scalar): boolean {
+ return v === null || v === undefined || (typeof v === "number" && Number.isNaN(v));
+}
+
+// ─── lreshape ─────────────────────────────────────────────────────────────────
+
+/**
+ * Reshape wide-format data to long format.
+ *
+ * Each entry in `groups` maps an output column name to a list of input column
+ * names that should be stacked into that output column. The input lists must
+ * all have the same length `k`; the function produces `nRows * k` output rows.
+ *
+ * Columns not mentioned in any group value list are treated as id columns and
+ * are repeated for every block.
+ *
+ * @param data - Source DataFrame (wide format).
+ * @param groups - Mapping from long-format column name → wide-format column list.
+ * @param options - {@link LreshapeOptions}
+ * @returns A new long-format DataFrame.
+ *
+ * @example
+ * ```ts
+ * const df = DataFrame.fromColumns({
+ * A: ["a", "b"],
+ * B1: [1, 2],
+ * B2: [3, 4],
+ * });
+ * lreshape(df, { B: ["B1", "B2"] });
+ * // A B
+ * // a 1
+ * // b 2
+ * // a 3
+ * // b 4
+ * ```
+ */
+export function lreshape(
+ data: DataFrame,
+ groups: LreshapeGroups,
+ options?: LreshapeOptions,
+): DataFrame {
+ const dropna = options?.dropna ?? true;
+
+ const groupKeys = Object.keys(groups);
+
+ if (groupKeys.length === 0) {
+ // No groups → nothing to stack; return the input DataFrame unchanged (not a copy)
+ return data;
+ }
+
+ // Validate: all group lists must have the same length
+ const firstKey = groupKeys[0] as string;
+ const firstList = groups[firstKey] as readonly string[];
+ const k = firstList.length;
+
+ for (const key of groupKeys) {
+ const list = groups[key] as readonly string[];
+ if (list.length !== k) {
+ throw new Error(
+ `lreshape: all group lists must have the same length, but "${firstKey}" has length ${k} and "${key}" has length ${list.length}`,
+ );
+ }
+ }
+
+ // Validate: all referenced columns must exist in `data`
+ const allGroupCols = new Set<string>();
+ for (const key of groupKeys) {
+ const list = groups[key] as readonly string[];
+ for (const col of list) {
+ allGroupCols.add(col);
+ if (!data.columns.values.includes(col)) {
+ throw new Error(`lreshape: column "${col}" not found in DataFrame`);
+ }
+ }
+ }
+
+ // Determine id columns: all data columns NOT mentioned in any group
+ const idCols = data.columns.values.filter((c) => !allGroupCols.has(c));
+
+ const nRows = data.index.size;
+
+ // Output arrays: id columns + group output columns
+ const outData: Record<string, Scalar[]> = {};
+ for (const id of idCols) {
+ outData[id] = [];
+ }
+ for (const key of groupKeys) {
+ outData[key] = [];
+ }
+ let totalRows = 0;
+
+ // Iterate block by block (one block per position in each group list)
+ for (let blockIdx = 0; blockIdx < k; blockIdx++) {
+ // For each row in the source
+ for (let ri = 0; ri < nRows; ri++) {
+ // Collect value-column values for this row in this block
+ const blockValues: Scalar[] = [];
+ for (const key of groupKeys) {
+ const list = groups[key] as readonly string[];
+ const srcCol = list[blockIdx] as string;
+ const val: Scalar = data.col(srcCol).iat(ri);
+ blockValues.push(val);
+ }
+
+ // Apply dropna filter
+ if (dropna && blockValues.some((v) => isMissing(v))) {
+ continue;
+ }
+
+ totalRows++;
+
+ // Id columns
+ for (const id of idCols) {
+ const col = outData[id];
+ if (col !== undefined) {
+ col.push(data.col(id).iat(ri));
+ }
+ }
+
+ // Value columns
+ for (let vi = 0; vi < groupKeys.length; vi++) {
+ const key = groupKeys[vi] as string;
+ const col = outData[key];
+ if (col !== undefined) {
+ const bv = blockValues[vi];
+ col.push(bv !== undefined ? bv : null);
+ }
+ }
+ }
+ }
+
+ const resultIndex: Index