
Conversation

@tgrigsby-sc (Contributor) commented on May 21, 2021

-- these changes are reasonably well grouped by commit --

  • Use RAWR tile size to determine metatile build zoom levels
  • Change instance type to M5; reduce mem/CPU by 4GB for cheaper instances that still have more memory than we need
  • Switch to the on-demand instance type
  • Add "big jobs" code to spread jobs across the (zoom_max, split_zoom) range instead of using just one zoom or the other
  • Add TileSpecifier. It allows more complex behavior around specifying memory requirements in a file, but for now it just uses the output of the big jobs code to dictate the order in which we enqueue high zoom metatile jobs.
  • Change zoom_max to 6 to allow grouping of the less intense high zoom jobs. This reduces the variance of job runtimes, which allows better utilization of EC2 resources.
  • Bug fixes around usage of /usr/bin/time

@tgrigsby-sc tgrigsby-sc requested review from iandees and peitili on May 21, 2021 23:01
@tgrigsby-sc tgrigsby-sc force-pushed the travisg/20210520-optimizations-from-rawr-tile-size branch from 613942a to 778c09a on May 21, 2021 23:04

# Create the spot fleet role
# https://docs.aws.amazon.com/batch/latest/userguide/spot_fleet_IAM_role.html
spotIamFleetRoleName = "AmazonEC2SpotFleetRole"
Member:

It would be neat if we could think of a way to keep support for spot instance types for folks where Spot is cheaper than on-demand. On the other hand, it might be enough to point to this commit and say "undo this one".

Contributor Author:

We can certainly do it; it's just some plumbing.
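(For illustration, a minimal sketch of what that plumbing might look like, assuming a hypothetical use_spot flag on the compute-environment setup; names and defaults here are illustrative, not the PR's actual code.)

```python
# Hypothetical sketch: toggle between on-demand and Spot when building the
# Batch compute environment's computeResources. The Spot-only fields are
# added only when use_spot is True; everything else stays the same.
def make_compute_resources(use_spot, instance_types, subnets, security_groups,
                           instance_role, spot_fleet_role_arn=None,
                           bid_percentage=60, max_vcpus=256):
    resources = dict(
        type='SPOT' if use_spot else 'EC2',
        minvCpus=0,
        maxvCpus=max_vcpus,
        instanceTypes=instance_types,
        subnets=subnets,
        securityGroupIds=security_groups,
        instanceRole=instance_role,
    )
    if use_spot:
        # these two fields are only valid (and only needed) for Spot
        resources['bidPercentage'] = bid_percentage
        resources['spotIamFleetRole'] = spot_fleet_role_arn
    return resources


# usage sketch:
# import boto3
# batch = boto3.client('batch')
# batch.create_compute_environment(
#     computeEnvironmentName='tile-build',
#     type='MANAGED',
#     computeResources=make_compute_resources(use_spot=False, ...),
# )
```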

'cost_sub_feature': "Tile Build",
'cost_resource_group': run_id,
},
bidPercentage=60,
Member:

This can probably be removed, too.

width = 1 << dz

size = 0
sizes = {}
Member:

This might be easier to read. {} always looks like an empty dict to me, not a set.

Suggested change:
- sizes = {}
+ sizes = set()

Contributor Author:

This is actually a dict! I need to track which tile has that size so I can combine them correctly at lower zooms.

Member:

Oh, well nevermind then. Maybe dict() then? 😄
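(A tiny illustration of why a dict fits here; the tile coordinate and zooms are made up for the example, not taken from the PR.)

```python
# sizes maps a tile coordinate to its size, so the size stays attached to the
# tile and can be added into the correct parent tile at a lower zoom.
sizes = {}
sizes[(10, 163, 395)] = 1234567   # (z, x, y) -> size in bytes

z, x, y = (10, 163, 395)
shift = z - 7                     # going from zoom 10 down to zoom 7
parent = (7, x >> shift, y >> shift)
```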

count_at_this_zoom = counts_at_zoom[z]
zoom_10_equiv_count = count_at_this_zoom * (4 ** (10 - z))
counts_at_zoom_sum += zoom_10_equiv_count
if counts_at_zoom_sum == 4**10:
Member:

Should this 10 be one of the zoom values that gets passed in? What happens when we want to switch away from zoom 10?

Contributor Author:

Doh, yes, definitely.

Contributor Author:

I guess it really just needs to be a number we're comfortable we're not going to go over. It could be zoom 20 and everything would be fine (except overflow?). At hardcoded zoom 10, if we ended up grouping into zoom 11 jobs then we'd be doing a 4^-1 calc, which would compare an int to a float, and things might get bad.
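(One way the hardcoded 10 could be threaded through as a parameter; this is a sketch based on the snippet above, under the assumption that counts_at_zoom only contains zooms at or below the split zoom, not the PR's actual change.)

```python
def full_coverage_at(counts_at_zoom, split_zoom=10):
    """True if the per-zoom tile counts add up to complete coverage when
    expressed as split_zoom-equivalent tiles (integer math only)."""
    total = 0
    for z, count in counts_at_zoom.items():
        # zooms above split_zoom would force 4**(negative) float math,
        # which is exactly the int-vs-float comparison worried about above
        assert z <= split_zoom
        total += count * (4 ** (split_zoom - z))
    return total == 4 ** split_zoom
```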


def viable_container_overrides(mem_mb):
"""
Turns a number into the next highest even multiple that AWS will accept, and the min number of CPUs you need for that amount
Member:

Might be worth mentioning that this is a workaround for what seems to be a bug in Batch that prevents arbitrary memory/CPU requests.
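(For readers without the diff handy, a sketch of the kind of rounding such a workaround does; the step size and memory-per-vCPU ratio below are assumptions, not the values used in the PR.)

```python
import math


def round_to_viable_request(mem_mb, step_mb=1024, mb_per_cpu=8192):
    """Round a memory request up to a multiple Batch will accept and derive
    a minimum vCPU count for that much memory (assumed ratio)."""
    viable_mem_mb = int(math.ceil(mem_mb / float(step_mb))) * step_mb
    min_cpus = max(1, int(math.ceil(viable_mem_mb / float(mb_per_cpu))))
    return viable_mem_mb, min_cpus


# e.g. round_to_viable_request(6500) -> (7168, 1)
```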


# now that we know what we want, pick something AWS actually supports
viable_mem_request, required_min_cpus = viable_container_overrides(adjusted_mem)
print("REMOVEME: [%s] enqueueing %s at %s mem mb and %s cpus" % (time.ctime(), coord_line, viable_mem_request, required_min_cpus))
Member:

Remove REMOVEME? This looks like a useful thing to print maybe.

Contributor Author:

There are A LOT of these: 25k with this configuration.

self.read_metas_to_file(missing_meta_file, compress=True)

print("Splitting into high and low zoom lists")
print("[%s] Splitting into high and low zoom lists" % (time.ctime()))
Contributor:

nit: prefix the log with [make_meta_tiles] too

for line in fh:
c = deserialize_coord(line)
if c.zoom < split_zoom:
this_coord = deserialize_coord(line)
Contributor:

might need to rebase master

for coord in missing_high:
fh.write(serialize_coord(coord) + "\n")

print("[%s] Done splitting into high and low zoom lists" % (time.ctime()))
Contributor:

nit: prefix log with [make_meta_tiles]

parser.add_argument('--allowed-missing-tiles', default=2, type=int,
help='The maximum number of missing metatiles allowed '
'to continue the build process.')
parser.add_argument('--tile-specifier-file',
Contributor:

where is this arg used?


return 0

def get_mem_reqs_mb(self, coord_str):
Contributor:

nit: the name mib is probably more precise, since these are 1024-based units

or zoom_max (usually lower, e.g: 7) depending on whether big_jobs
contains a truthy value for the RAWR tile. The big_jobs are looked up
at zoom_max.
High zoom jobs are output between split_zoom (RAWR tile granularity) and zoom_max
Contributor:

suggestion: give a concrete example. For a new hire on the team, this doc is probably still not easy to follow.
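(For illustration, one concrete reading of the docstring; the split direction is inferred from this docstring and the PR description rather than copied from the code, and the actual PR spreads jobs across the whole (zoom_max, split_zoom) range by size, while this sketch only shows the two extremes.)

```python
def jobs_for_group(coord, big_jobs, zoom_max=7, split_zoom=10):
    """Yield job coordinates for one zoom_max group: the group itself when it
    is small, or its split_zoom descendants when big_jobs marks it as big."""
    z, x, y = coord
    assert z == zoom_max
    if big_jobs.get(coord):
        dz = split_zoom - zoom_max           # e.g. 3 -> 64 children
        for dx in range(1 << dz):
            for dy in range(1 << dz):
                yield (split_zoom, (x << dz) + dx, (y << dz) + dy)
    else:
        yield coord
```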


reordered_lines = tile_specifier.reorder(coord_lines)

print("[%s] Starting to enqueue %d tile batches" % (time.ctime(), len(reordered_lines)))
Contributor:

nit: log prefix of the module name

Look up the RAWR tiles in the rawr_bucket under the prefix and with the
given key format, group the RAWR tiles (usually at zoom 10) by the job
group zoom (usually 7) and sum their sizes. Return an ordered list of job coordinates
by descending raw size sum.
Contributor:

nit: should be "rawr size sum"
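(A rough sketch of the listing-and-grouping this docstring describes, with an assumed key layout of .../z/x/y.zip; the real key format string, bucket, and zooms come from the build config, so treat this as illustrative only.)

```python
import boto3
from collections import defaultdict


def job_coords_by_descending_size(bucket, prefix, group_zoom=7):
    """List RAWR tile objects, sum their sizes per job-group coordinate, and
    return the group coordinates largest-first."""
    s3 = boto3.client('s3')
    sums = defaultdict(int)
    for page in s3.get_paginator('list_objects_v2').paginate(
            Bucket=bucket, Prefix=prefix):
        for obj in page.get('Contents', []):
            # assumed key layout: <prefix>/z/x/y.zip
            z, x, y = (int(p) for p in
                       obj['Key'].rsplit('.', 1)[0].split('/')[-3:])
            shift = z - group_zoom
            sums[(group_zoom, x >> shift, y >> shift)] += obj['Size']
    return sorted(sums, key=sums.get, reverse=True)
```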

Provides the ability to sort tiles based on an ordering and specify memory reqs
"""

def __init__(self, default_mem_gb=8, spec_dict={}):
Contributor:

I feel we probably should have a central place for all the default values such as 8 here.
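(One lightweight shape that suggestion could take; the constant name and the None-instead-of-{} default are additions for the sketch, not part of the PR.)

```python
# a single module-level home for shared defaults
DEFAULT_JOB_MEM_GB = 8


class TileSpecifier(object):

    def __init__(self, default_mem_gb=DEFAULT_JOB_MEM_GB, spec_dict=None):
        self.default_mem_gb = default_mem_gb
        # None instead of a mutable {} default avoids accidental shared state
        self.spec_dict = spec_dict if spec_dict is not None else {}
```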
