4 | 4 | "cell_type": "markdown", |
5 | 5 | "metadata": {}, |
6 | 6 | "source": [ |
7 | | - "## 1. How to Use Ray Data?\n", |
| 7 | + "## 2. Loading Data\n", |
8 | 8 | "\n", |
9 | | - "You typically should use the Ray Data API in this way:\n", |
| 9 | + "Our Dataset is the New York City Taxi & Limousine Commission's Trip Record Data\n", |
10 | 10 | "\n", |
11 | | - "1. **Create a Ray Dataset** from external storage or in-memory data.\n", |
12 | | - "2. **Apply transformations** to the data.\n", |
13 | | - "3. **Write the outputs** to external storage or **feed the outputs** to training workers.\n" |
| 11 | + "**Dataset features**\n", |
| 12 | + "\n", |
| 13 | + "| Column | Description | \n", |
| 14 | + "| ------ | ----------- |\n", |
| 15 | + "| `trip_distance` | Float representing trip distance in miles. |\n", |
| 16 | + "| `passenger_count` | The number of passengers |\n", |
| 17 | + "| `PULocationID` | TLC Taxi Zone in which the taximeter was engaged | \n", |
| 18 | + "| `DOLocationID` | TLC Taxi Zone in which the taximeter was disengaged | \n", |
| 19 | + "| `payment_type` | A numeric code signifying how the passenger paid for the trip. |\n", |
| 20 | + "| `tolls_amount` | Total amount of all tolls paid in trip. | \n", |
| 21 | + "| `tip_amount` | Tip amount \u2013 This field is automatically populated for credit card tips. Cash tips are not included. | \n", |
| 22 | + "| `total_amount` | The total amount charged to passengers. Does not include cash tips. |\n" |
| 23 | + ] |
| 24 | + }, |
| 25 | + { |
| 26 | + "cell_type": "code", |
| 27 | + "execution_count": null, |
| 28 | + "metadata": {}, |
| 29 | + "outputs": [], |
| 30 | + "source": [ |
| 31 | + "COLUMNS = [\n", |
| 32 | + " \"trip_distance\",\n", |
| 33 | + " \"passenger_count\",\n", |
| 34 | + " \"PULocationID\",\n", |
| 35 | + " \"DOLocationID\",\n", |
| 36 | + " \"payment_type\",\n", |
| 37 | + " \"tolls_amount\",\n", |
| 38 | + " \"tip_amount\",\n", |
| 39 | + " \"total_amount\",\n", |
| 40 | + "]\n", |
| 41 | + "\n", |
| 42 | + "DATA_PATH = \"s3://anyscale-public-materials/nyc-taxi-cab\"" |
| 43 | + ] |
| 44 | + }, |
| 45 | + { |
| 46 | + "cell_type": "markdown", |
| 47 | + "metadata": {}, |
| 48 | + "source": [ |
| 49 | + "Let's read the data for a single month. It takes up to 2 minutes to run." |
| 50 | + ] |
| 51 | + }, |
| 52 | + { |
| 53 | + "cell_type": "code", |
| 54 | + "execution_count": null, |
| 55 | + "metadata": {}, |
| 56 | + "outputs": [], |
| 57 | + "source": [ |
| 58 | + "df = pd.read_parquet(\n", |
| 59 | + " f\"{DATA_PATH}/yellow_tripdata_2011-05.parquet\",\n", |
| 60 | + " columns=COLUMNS,\n", |
| 61 | + ")\n", |
| 62 | + "\n", |
| 63 | + "df.head()" |
| 64 | + ] |
| 65 | + }, |
| 66 | + { |
| 67 | + "cell_type": "markdown", |
| 68 | + "metadata": {}, |
| 69 | + "source": [ |
| 70 | + "Let's check how much memory the dataset is using." |
| 71 | + ] |
| 72 | + }, |
| 73 | + { |
| 74 | + "cell_type": "code", |
| 75 | + "execution_count": null, |
| 76 | + "metadata": {}, |
| 77 | + "outputs": [], |
| 78 | + "source": [ |
| 79 | + "df.memory_usage(deep=True).sum().sum() / 1024**2" |
| 80 | + ] |
| 81 | + }, |
| 82 | + { |
| 83 | + "cell_type": "markdown", |
| 84 | + "metadata": {}, |
| 85 | + "source": [ |
| 86 | + "Let's check how many files there are in the dataset" |
| 87 | + ] |
| 88 | + }, |
| 89 | + { |
| 90 | + "cell_type": "code", |
| 91 | + "execution_count": null, |
| 92 | + "metadata": {}, |
| 93 | + "outputs": [], |
| 94 | + "source": [ |
| 95 | + "!aws s3 ls s3://anyscale-public-materials/nyc-taxi-cab/ --human-readable | wc -l" |
| 96 | + ] |
| 97 | + }, |
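| 98 | + {
| 99 | + "cell_type": "markdown",
| 100 | + "metadata": {},
| 101 | + "source": [
| 102 | + "We can also ask for a summary; the `--summarize` flag makes `aws s3 ls` print the total object count and total size at the end of its output:"
| 103 | + ]
| 104 | + },
| 105 | + {
| 106 | + "cell_type": "code",
| 107 | + "execution_count": null,
| 108 | + "metadata": {},
| 109 | + "outputs": [],
| 110 | + "source": [
| 111 | + "!aws s3 ls s3://anyscale-public-materials/nyc-taxi-cab/ --summarize --human-readable | tail -n 2"
| 112 | + ]
| 113 | + },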
| 98 | + { |
| 99 | + "cell_type": "markdown", |
| 100 | + "metadata": {}, |
| 101 | + "source": [ |
| 102 | + "We are not making use of all the columns and are already consuming ~1GB of data per file -> will quickly become a problem if you want to scale to entire dataset (~155 files) if we are running on a small node." |
| 103 | + ] |
| 104 | + }, |
| 105 | + { |
| 106 | + "cell_type": "markdown", |
| 107 | + "metadata": {}, |
| 108 | + "source": [ |
| 109 | + "Let's instead make use of a distributed data preprocessing library like Ray Data to load the full dataset in a distributed manner." |
| 110 | + ] |
| 111 | + }, |
| 112 | + { |
| 113 | + "cell_type": "code", |
| 114 | + "execution_count": null, |
| 115 | + "metadata": {}, |
| 116 | + "outputs": [], |
| 117 | + "source": [ |
| 118 | + "ds = ray.data.read_parquet(\n", |
| 119 | + " DATA_PATH,\n", |
| 120 | + " columns=COLUMNS,\n", |
| 121 | + ")" |
| 122 | + ] |
| 123 | + }, |
| 124 | + { |
| 125 | + "cell_type": "markdown", |
| 126 | + "metadata": {}, |
| 127 | + "source": [ |
| 128 | + "There are Ray data equivalents for common pandas functions like `read_csv`, `read_parquet`, `read_json`, etc.\n", |
| 129 | + "\n", |
| 130 | + "Refer to the [Input/Output docs](https://docs.ray.io/en/latest/data/api/input_output.html) for a comprehensive list of read functions." |
| 131 | + ] |
| 132 | + }, |
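| 133 | + {
| 134 | + "cell_type": "markdown",
| 135 | + "metadata": {},
| 136 | + "source": [
| 137 | + "For instance, reading CSV or JSON data follows the same pattern as `read_parquet`. The paths below are hypothetical placeholders, not real files:"
| 138 | + ]
| 139 | + },
| 140 | + {
| 141 | + "cell_type": "code",
| 142 | + "execution_count": null,
| 143 | + "metadata": {},
| 144 | + "outputs": [],
| 145 | + "source": [
| 146 | + "# Hypothetical paths for illustration -- substitute your own data locations.\n",
| 147 | + "# ds_csv = ray.data.read_csv(\"s3://my-bucket/data.csv\")\n",
| 148 | + "# ds_json = ray.data.read_json(\"s3://my-bucket/data.json\")"
| 149 | + ]
| 150 | + },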
| 133 | + { |
| 134 | + "cell_type": "markdown", |
| 135 | + "metadata": {}, |
| 136 | + "source": [ |
| 137 | + "### Dataset\n", |
| 138 | + "\n", |
| 139 | + "Let's view our dataset" |
| 140 | + ] |
| 141 | + }, |
| 142 | + { |
| 143 | + "cell_type": "code", |
| 144 | + "execution_count": null, |
| 145 | + "metadata": {}, |
| 146 | + "outputs": [], |
| 147 | + "source": [ |
| 148 | + "ds" |
| 149 | + ] |
| 150 | + }, |
| 151 | + { |
| 152 | + "cell_type": "markdown", |
| 153 | + "metadata": {}, |
| 154 | + "source": [ |
| 155 | + "Ray Data by default adopts **lazy execution** this means that the data is not loaded into memory until it is needed. Instead only a small part of the dataset is loaded into memory to infer the schema." |
| 156 | + ] |
| 157 | + }, |
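| 158 | + {
| 159 | + "cell_type": "markdown",
| 160 | + "metadata": {},
| 161 | + "source": [
| 162 | + "As a quick check, we can fetch the schema; this reads only a small amount of metadata and does not trigger full execution:"
| 163 | + ]
| 164 | + },
| 165 | + {
| 166 | + "cell_type": "code",
| 167 | + "execution_count": null,
| 168 | + "metadata": {},
| 169 | + "outputs": [],
| 170 | + "source": [
| 171 | + "# Only Parquet metadata / a small sample is read to produce the schema.\n",
| 172 | + "ds.schema()"
| 173 | + ]
| 174 | + },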
| 158 | + { |
| 159 | + "cell_type": "markdown", |
| 160 | + "metadata": {}, |
| 161 | + "source": [ |
| 162 | + "A Dataset specifies a sequence of transformations that will be applied to the data. \n", |
| 163 | + "\n", |
| 164 | + "The data itself will be organized into blocks, where each block is a collection of rows.\n", |
| 165 | + "\n", |
| 166 | + "The following figure visualizes a tabular dataset with three blocks, each block holding 1000 rows each:\n", |
| 167 | + "\n", |
| 168 | + "<img src='https://docs.ray.io/en/releases-2.6.1/_images/dataset-arch.svg' width=50%/>\n", |
| 169 | + "\n", |
| 170 | + "Since a Dataset is just a list of Ray object references, it can be freely passed between Ray tasks, actors, and libraries like any other object reference. This flexibility is a unique characteristic of Ray Datasets." |
14 | 171 | ] |
15 | 172 | } |
16 | 173 | ], |