
Commit 9ad2df0

committed
split data notebook
Signed-off-by: Max Pumperla <[email protected]>
1 parent d47c5de commit 9ad2df0

13 files changed: +375 −454 lines

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -1,2 +1,4 @@
 venv/
 _build/
+
+.DS_Store

_toc.yml

Lines changed: 0 additions & 3 deletions
@@ -47,9 +47,6 @@ parts:
   - file: courses/00_Developer_Intro_to_Ray/output/04b_Intro_Ray_Data_Structured_05.ipynb
   - file: courses/00_Developer_Intro_to_Ray/output/04b_Intro_Ray_Data_Structured_06.ipynb
   - file: courses/00_Developer_Intro_to_Ray/output/04b_Intro_Ray_Data_Structured_07.ipynb
-  - file: courses/00_Developer_Intro_to_Ray/output/04b_Intro_Ray_Data_Structured_08.ipynb
-  - file: courses/00_Developer_Intro_to_Ray/output/04b_Intro_Ray_Data_Structured_09.ipynb
-  - file: courses/00_Developer_Intro_to_Ray/output/04b_Intro_Ray_Data_Structured_10.ipynb
   - file: courses/00_Developer_Intro_to_Ray/output/04c_Intro_Ray_Data_Unstructured_01.ipynb
   - file: courses/00_Developer_Intro_to_Ray/output/04c_Intro_Ray_Data_Unstructured_02.ipynb
   - file: courses/00_Developer_Intro_to_Ray/output/04c_Intro_Ray_Data_Unstructured_03.ipynb

courses/00_Developer_Intro_to_Ray/04b_Intro_Ray_Data_Structured.ipynb

Lines changed: 3 additions & 3 deletions
@@ -71,7 +71,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## 1. How to Use Ray Data?\n",
+"### 1. How to Use Ray Data?\n",
 "\n",
 "You typically should use the Ray Data API in this way:\n",
 "\n",
@@ -705,7 +705,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## 7. Ray Data in Production\n",
+"### 7. Ray Data in Production\n",
 "\n",
 "1. Runway AI is using Ray Data to scale its ML workloads. See [this interview with Runway AI](https://siliconangle.com/2024/10/02/runway-transforming-ai-driven-filmmaking-innovative-tools-techniques-raysummit/) to learn more.\n",
 "2. Netflix is using Ray Data for multi-modal batch inference pipelines. See [this talk at the Ray Summit 2024](https://raysummit.anyscale.com/flow/anyscale/raysummit2024/landing/page/sessioncatalog/session/1722028596844001bCg0) to learn more.\n",
@@ -716,7 +716,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## 8. Upcoming Features in Ray Data\n",
+"### 8. Upcoming Features in Ray Data\n",
 "\n",
 "Here are some relevant upcoming features in Ray Data:\n",
 "\n",

courses/00_Developer_Intro_to_Ray/output/04b_Intro_Ray_Data_Structured_02.ipynb

Lines changed: 13 additions & 0 deletions
@@ -10,6 +10,19 @@
 "\n",
 "It is built on top of Ray, a fast and simple framework for building and running distributed applications. Ray Data is designed to be easy to use, scalable, and fault-tolerant."
 ]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"### 1. How to Use Ray Data?\n",
+"\n",
+"You typically should use the Ray Data API in this way:\n",
+"\n",
+"1. **Create a Ray Dataset** from external storage or in-memory data.\n",
+"2. **Apply transformations** to the data.\n",
+"3. **Write the outputs** to external storage or **feed the outputs** to training workers.\n"
+]
 }
 ],
 "metadata": {
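
The create → transform → write pattern listed in the cell above can be sketched with plain Python. This is a conceptual stand-in, not Ray's API: the row schema and the `fare_estimate` helper column are made up for illustration; the real Ray Data calls they mirror (`ray.data.from_items`, `Dataset.map`, `Dataset.write_parquet`) are named in the comments.

```python
# Stdlib-only sketch of the three-step Ray Data pattern.
# Rows and the fare formula are illustrative, not from the notebook.
import json
import tempfile
from pathlib import Path

# 1. "Create a dataset" from in-memory data (Ray: ray.data.from_items).
rows = [{"trip_distance": d} for d in (1.2, 3.4, 5.0)]

# 2. "Apply transformations" row by row (Ray: Dataset.map / map_batches).
transformed = [
    {**row, "fare_estimate": 2.5 + 1.8 * row["trip_distance"]} for row in rows
]

# 3. "Write the outputs" to external storage (Ray: Dataset.write_parquet).
out_path = Path(tempfile.mkdtemp()) / "output.jsonl"
with out_path.open("w") as f:
    for row in transformed:
        f.write(json.dumps(row) + "\n")

print(len(transformed), out_path.exists())
```

In Ray Data the same three steps run distributed and lazily across a cluster; the sketch only shows the shape of the pipeline.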

courses/00_Developer_Intro_to_Ray/output/04b_Intro_Ray_Data_Structured_03.ipynb

Lines changed: 162 additions & 5 deletions
@@ -4,13 +4,170 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## 1. How to Use Ray Data?\n",
+"## 2. Loading Data\n",
 "\n",
-"You typically should use the Ray Data API in this way:\n",
+"Our dataset is the New York City Taxi & Limousine Commission's Trip Record Data.\n",
 "\n",
-"1. **Create a Ray Dataset** from external storage or in-memory data.\n",
-"2. **Apply transformations** to the data.\n",
-"3. **Write the outputs** to external storage or **feed the outputs** to training workers.\n"
+"**Dataset features**\n",
+"\n",
+"| Column | Description |\n",
+"| ------ | ----------- |\n",
+"| `trip_distance` | Float representing trip distance in miles. |\n",
+"| `passenger_count` | The number of passengers. |\n",
+"| `PULocationID` | TLC Taxi Zone in which the taximeter was engaged. |\n",
+"| `DOLocationID` | TLC Taxi Zone in which the taximeter was disengaged. |\n",
+"| `payment_type` | A numeric code signifying how the passenger paid for the trip. |\n",
+"| `tolls_amount` | Total amount of all tolls paid in the trip. |\n",
+"| `tip_amount` | Tip amount \u2013 automatically populated for credit card tips. Cash tips are not included. |\n",
+"| `total_amount` | The total amount charged to passengers. Does not include cash tips. |\n"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"COLUMNS = [\n",
+"    \"trip_distance\",\n",
+"    \"passenger_count\",\n",
+"    \"PULocationID\",\n",
+"    \"DOLocationID\",\n",
+"    \"payment_type\",\n",
+"    \"tolls_amount\",\n",
+"    \"tip_amount\",\n",
+"    \"total_amount\",\n",
+"]\n",
+"\n",
+"DATA_PATH = \"s3://anyscale-public-materials/nyc-taxi-cab\""
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"Let's read the data for a single month. This can take up to 2 minutes to run."
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"df = pd.read_parquet(\n",
+"    f\"{DATA_PATH}/yellow_tripdata_2011-05.parquet\",\n",
+"    columns=COLUMNS,\n",
+")\n",
+"\n",
+"df.head()"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"Let's check how much memory the dataset is using (in MiB)."
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"df.memory_usage(deep=True).sum() / 1024**2"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"Let's check how many files there are in the dataset."
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"!aws s3 ls s3://anyscale-public-materials/nyc-taxi-cab/ --human-readable | wc -l"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"Even though we only load a subset of the columns, a single file already consumes ~1 GB of memory. This quickly becomes a problem if we want to scale to the entire dataset (~155 files) while running on a small node."
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"Let's instead use a distributed data preprocessing library like Ray Data to load the full dataset in a distributed manner."
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"ds = ray.data.read_parquet(\n",
+"    DATA_PATH,\n",
+"    columns=COLUMNS,\n",
+")"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"There are Ray Data equivalents for common pandas functions like `read_csv`, `read_parquet`, `read_json`, etc.\n",
+"\n",
+"Refer to the [Input/Output docs](https://docs.ray.io/en/latest/data/api/input_output.html) for a comprehensive list of read functions."
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"### Dataset\n",
+"\n",
+"Let's view our dataset."
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"ds"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"Ray Data adopts **lazy execution** by default: the data is not loaded into memory until it is needed. Only a small part of the dataset is read up front to infer the schema."
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"A Dataset specifies a sequence of transformations that will be applied to the data.\n",
+"\n",
+"The data itself is organized into blocks, where each block is a collection of rows.\n",
+"\n",
+"The following figure visualizes a tabular dataset with three blocks, each holding 1000 rows:\n",
+"\n",
+"<img src='https://docs.ray.io/en/releases-2.6.1/_images/dataset-arch.svg' width=50%/>\n",
+"\n",
+"Since a Dataset is just a list of Ray object references, it can be freely passed between Ray tasks, actors, and libraries like any other object reference. This flexibility is a unique characteristic of Ray Datasets."
 ]
 }
 ],
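
The lazy, block-wise execution model these cells describe can be mimicked with plain Python generators: the pipeline is just a plan, and a block is only read when something downstream consumes it. This is a stdlib-only analogy, not Ray's implementation; the block size, column names, and helper functions are invented for illustration.

```python
# Stdlib analogy for Ray Data's lazy, block-based execution:
# a "dataset" is a recipe over blocks (lists of rows) that only
# materializes when iterated. Names here are illustrative only.
calls = []

def read_blocks():
    """Pretend to lazily read three blocks of 1000 rows each."""
    for block_id in range(3):
        calls.append(f"read:{block_id}")  # record when a read really happens
        yield [{"trip_distance": float(i)} for i in range(1000)]

def map_blocks(blocks, fn):
    """Lazily apply fn to every row of every block (cf. map_batches)."""
    for block in blocks:
        yield [fn(row) for row in block]

pipeline = map_blocks(read_blocks(), lambda r: {**r, "km": r["trip_distance"] * 1.609})

# Nothing has executed yet -- the pipeline is only a plan.
assert calls == []

# Consuming one block triggers only that block's read.
first_block = next(pipeline)
print(len(first_block), calls)
```

Ray Data additionally distributes blocks across the cluster as object references, so the same "plan first, execute on demand" behavior scales beyond one process.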
