
Apache Flink autoscaling


Design

Declarative resource management

In order to support active/reactive mode as well as autoscaling, we propose to introduce declarative resource management. In this model, the JobMaster no longer asks for each slot individually but instead announces to the ResourceManager how many slots of which type it needs. The ResourceManager will then try to fulfill these demands as well as possible by allocating slots for the JobMaster.

The resource requirements should consist of a quadruplet of minimum required resources, target value, maximum required resources and resource spec: (min, target, max, rs). The target value is what the JobMaster would like to get. The minimum value defines how many resources one needs at least in order to begin executing the job.

The FLIP-6 design of the JobMaster and ResourceManager interaction can be changed for this purpose. Currently the JobMaster proactively asks the ResourceManager for the required slots. Instead, the JobMaster can periodically derive the target number of slots from the user settings, the job graph and the operator autoscaling policies. The target number of slots can be regularly sent to the ResourceManager, declaring the desired state of resource consumption, e.g. in heartbeats. If the desired number of slots is higher than the number already allocated for this JobMaster, the ResourceManager can cover the missing slots by
- either starting the respective number of new TaskExecutors in active container mode,
- or allocating slots from the available free slots in reactive container mode.

The dedicated TaskExecutors will, as usual, register with the JobMaster, forming its view of the actually available resources, which might or might not meet the required number. Based on the required and available number of slots, the JobMaster can start the job, recover from failures and periodically trigger up- or downscaling.

With declarative resource management, the difference between the active and the reactive mode boils down to setting the target value to the maximum parallelism in the reactive mode. That way we ensure that all operators are resource greedy and scale up to the maximum of the available resources and the configured maximum resource consumption.

Slot allocation protocol

Currently, the slot allocation is triggered by scheduling the ExecutionGraph. This will allocate a slot for each Execution. If the SlotPool does not have enough slots available, it will ask the ResourceManager for more. This signal triggers the allocation of new containers.

The protocol can be slightly changed. Instead of requesting each slot individually from the ResourceManager, the JobMaster will announce its resource requirements. Based on that, the ResourceManager will assign slots to the respective JobMaster. Once the JobMaster has received enough slots to fulfill the start or rescaling conditions, the ExecutionGraph will be started or rescaled respectively. The difference is that we separate the resource announcement and the scheduling of the ExecutionGraph into two steps. Moreover, this design does not require a strict matching between slots and slot requests, thus it might make the AllocationID superfluous.
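To make the quadruplet concrete, here is a minimal Java sketch of what the declared requirement and the announcement call could look like. All names (ResourceRequirement, declareResourceRequirements, the placeholder ResourceSpec) are illustrative assumptions, not the final API proposed by this document.

import java.util.List;

/** Sketch of the proposed (min, target, max, rs) resource requirement quadruplet. */
final class ResourceRequirement {
    final int min;                   // fewest slots needed to begin executing the job
    final int target;                // number of slots the JobMaster would like to get
    final int max;                   // upper bound on useful slots
    final ResourceSpec resourceSpec; // size/type of each requested slot

    ResourceRequirement(int min, int target, int max, ResourceSpec resourceSpec) {
        if (min > target || target > max) {
            throw new IllegalArgumentException("Expected min <= target <= max");
        }
        this.min = min;
        this.target = target;
        this.max = max;
        this.resourceSpec = resourceSpec;
    }
}

/** Placeholder for Flink's slot resource profile (CPU, memory, ...). */
final class ResourceSpec { }

/** The JobMaster announces its requirements instead of asking slot by slot. */
interface ResourceManagerGateway {
    void declareResourceRequirements(List<ResourceRequirement> requirements);
}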
Scheduler component

In order to execute a JobGraph, we first need to extract the resource requirements from each operator:
● Streaming: extract the requirements from all operators.
● Batch: depending on the execution mode, only the requirements of the first stage need to be known.

Next, these requirements are given to the SlotPool, which sends them to the ResourceManager if it does not hold enough slots. Once the SlotPool gets new resources assigned, it will notify the Scheduler about these changes. The Scheduler will then check whether the current set of resources fulfills the minimum resource requirements. If this is the case, it will start scheduling the ExecutionGraph with the available set of slots. If the ExecutionGraph is already running (with fewer resources than its target value), the Scheduler can decide to scale up in order to make use of the additional resources. The Scheduler might wait for the resources to stabilize in order to avoid too many successive rescaling operations, especially at start-up of the job.

Requirements:
● Decide on the scheduling strategy and announce the required resources.
● Receive notifications of changed resources.
  ○ Wait for stable resources.
  ○ Trigger scheduling of executions once the requirement has been fulfilled.
● Manage rescaling policies.
  ○ Periodically query the rescaling policies to update the target values.
● Decide on rescaling (when is it time to scale up/down).

Separation of scheduler/deployment and ExecutionGraph

In order to make the scheduling of the ExecutionGraph more flexible and to support different implementations, we should decouple the scheduling from the ExecutionGraph. The ExecutionGraph should be the data structure which tracks the current state of the job and is updated by the JobMaster. The scheduling of the ExecutionGraph should be the responsibility of a dedicated component, the Scheduler. The Scheduler will take the ExecutionGraph and check which Executions need to be executed. It will then acquire the required slots from the SlotPool and deploy the individual Executions. Separating the scheduling from the ExecutionGraph could allow us to turn the ExecutionGraph into a synchronous data structure at some point in time. Making it single-threaded would simplify future maintenance considerably. Take also FLINK-10240 (pluggable scheduling strategy for batch jobs) into account when designing the future scheduler component.

Calculation of target number of slots

In order to calculate the resource requirements, the scheduler (or another component) needs to iterate over all operators of the JobGraph. Each operator should return the resource requirement quadruplet (min, target, max, rs). Summing these values up with respect to the ResourceSpec gives the resource requirements which will be announced to the ResourceManager.

Slot sharing

If a set of operators can share the same slot, then we only need to take the maximum over all resource requirements of these operators and sum up the ResourceSpecs: (max_i(min_i), max_i(target_i), max_i(max_i), sum_i(rs_i)).

Re-scaling policies

In order to support autoscaling, we can allow the target value of an operator to change dynamically. This can be achieved by introducing a RescalingPolicy which the user can specify for each operator. The RescalingPolicy can be periodically queried to learn about the current target value. Changes in the target value will be propagated to the ResourceManager, resulting in a changed resource set. Once the SlotPool gets new resources assigned, the Scheduler could trigger the rescaling. The current behaviour could be achieved by using a FixedRescalingPolicy which simply always returns the initial target value.
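The following is a minimal sketch of this idea; the FixedRescalingPolicy name comes from the prose above, but the exact interface shape is an assumption.

/** Queried periodically by the scheduler to learn the current target parallelism. */
interface RescalingPolicy {
    int getTargetParallelism(int currentParallelism);
}

/** Reproduces today's behaviour: the target value never changes. */
final class FixedRescalingPolicy implements RescalingPolicy {
    private final int initialTarget;

    FixedRescalingPolicy(int initialTarget) {
        this.initialTarget = initialTarget;
    }

    @Override
    public int getTargetParallelism(int currentParallelism) {
        return initialTarget; // always return the initial target value
    }
}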
Start and rescaling actions

The scheduler is responsible for starting and rescaling the running job. It gets notified about resource changes by the SlotPool. If the slot number is stable for some time, it can decide to rescale the job:
- If not running yet, start the job with the maximum possible parallelism if all operators can achieve their minimum parallelism.
- If running, upscale if …
- If running, downscale if …

Fault recovery

In case of a fault, the scheduler should check whether it needs to downscale in order to run the job (e.g. if a TaskManager died). The job should only be restarted if the minimum resource requirements are fulfilled.

Configuration

Currently (prior to 1.7.0), Flink resolves the parallelism of every operator, holds it in the job graph and starts the job with exactly this resolved parallelism. The parallelism of an operator is currently resolved in Flink as follows:
- a fixed value if the user called setParallelism(),
- otherwise the job parallelism if it is set in the CLI (-p),
- or the default job parallelism from the Flink config.

For declarative resource management, the user needs to be able to specify the minimum, initial target and maximum resource value. If the user did not explicitly set the parallelism of an operator via setParallelism, we could set the resource requirements to
● Active mode: (1, -p or cluster default, max parallelism)
● Reactive mode: (1, max parallelism, max parallelism)
If the user defined the parallelism via setParallelism(p): (p, p, p).
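As an illustration, the mapping from user configuration to the per-operator (min, target, max) triple could look as follows; the class and method names are hypothetical, not Flink's actual code.

/** Sketch of deriving an operator's (min, target, max) parallelism triple. */
final class ParallelismRange {
    final int min, target, max;

    ParallelismRange(int min, int target, int max) {
        this.min = min;
        this.target = target;
        this.max = max;
    }

    static ParallelismRange derive(
            Integer explicitParallelism, // from setParallelism(p), or null if unset
            int defaultParallelism,      // -p from the CLI or the cluster default
            int maxParallelism,          // the operator's max parallelism
            boolean reactiveMode) {
        if (explicitParallelism != null) {
            // setParallelism(p) pins the operator: (p, p, p)
            return new ParallelismRange(
                    explicitParallelism, explicitParallelism, explicitParallelism);
        }
        if (reactiveMode) {
            // Reactive mode is resource greedy: (1, max, max)
            return new ParallelismRange(1, maxParallelism, maxParallelism);
        }
        // Active mode: (1, -p or cluster default, max)
        return new ParallelismRange(1, defaultParallelism, maxParallelism);
    }
}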
Implementation scope and steps of this feature iteration for Flink 1.9.0

The first version of this feature will be implemented under the following assumption: a passive job cluster which runs a job in eager scheduling mode (= streaming jobs). With the assumption of a passive job cluster (e.g. standalone mode), we know that all resources in the cluster will be allocatable by the JobManager. Moreover, the ResourceManager cannot start new TaskManagers and, thus, we don't need to start the execution of the ExecutionGraph to kick this off. Consequently, we can already reason about the available set of resources without changing the slot allocation protocol, requiring only that usable slots are registered in the SlotPool. Moreover, by only considering the eager scheduling mode, we effectively exclude batch jobs from these changes. This narrows the scope and will make it easier to implement a first working prototype.

Implementation steps

Decouple ExecutionGraph from JobMaster [FLINK-10498]
With declarative resource management we want to react to the set of available resources. Thus, we need a component which is responsible for scaling the ExecutionGraph accordingly. In order to better separate concerns, it is beneficial to introduce a Scheduler/ExecutionGraphDriver component which is in charge of the ExecutionGraph. This component owns the ExecutionGraph and is allowed to modify it. In the first version, this component will simply accommodate all the existing logic of the JobMaster, and the respective JobMaster methods are forwarded to this component. This new component should not change the existing behaviour of Flink. Later, this component will be in charge of announcing the required resources, deciding when to rescale and executing the rescaling operation.

Introduce declarative resource management switch [FLINK-10499]
In order not to affect Flink's behaviour, we propose to add a feature flag to turn the declarative resource management on/off. In the beginning, this feature flag should only be activated if running a streaming job in per-job mode. The switch should control which type of ExecutionGraphDriver will be instantiated. The declarative resource management should become the default once it is fully implemented.

Let ExecutionGraphDriver react to fail signal [FLINK-10500]
In order to scale down when there are not enough resources available or if TMs died, the ExecutionGraphDriver needs to learn about a failure. Depending on the failure type and the available set of resources, it can then decide to scale the job down or simply restart. In the scope of this issue, the ExecutionGraphDriver should simply call into the RestartStrategy.

Obtain resource overview of cluster [FLINK-10501]
In order to decide with which parallelism to run, the ExecutionGraphDriver needs to obtain an overview over all available resources. This includes the resources managed by the SlotPool as well as not yet allocated resources on the ResourceManager. This is a temporary workaround until we have adapted the slot allocation protocol to support resource declaration. Once this is done, we will only take the SlotPool's slots into account.

Periodically check for new resources [FLINK-10503]
In order to decide when to start scheduling or to rescale, we need to periodically check for new resources (slots).

Wait for resource stabilization [FLINK-10502]
Add functionality to wait for resource stabilization. The available set of resources is considered stable if it did not change for a given time. Only if the resource set is stable should we consider triggering the initial scheduling or rescaling actions.
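A possible shape of this stabilization check, sketched with a hypothetical helper class (not the actual Flink implementation): the resource set counts as stable once the observed slot count has not changed for a configured duration.

import java.time.Clock;
import java.time.Duration;
import java.time.Instant;

/** Sketch of the resource-stabilization check from FLINK-10502/FLINK-10503. */
final class ResourceStabilizationTracker {
    private final Duration stabilizationTimeout;
    private final Clock clock;
    private int lastSeenSlotCount = -1;
    private Instant lastChange;

    ResourceStabilizationTracker(Duration stabilizationTimeout, Clock clock) {
        this.stabilizationTimeout = stabilizationTimeout;
        this.clock = clock;
        this.lastChange = clock.instant();
    }

    /** Called on every periodic resource check with the current slot count. */
    boolean isStable(int currentSlotCount) {
        if (currentSlotCount != lastSeenSlotCount) {
            lastSeenSlotCount = currentSlotCount;
            lastChange = clock.instant(); // a change restarts the stabilization window
            return false;
        }
        return Duration.between(lastChange, clock.instant())
                .compareTo(stabilizationTimeout) >= 0;
    }
}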
Decide actual parallelism based on available resources [FLINK-10504]
Check whether a JobGraph can be scheduled with the available set of resources (slots). If the minimum parallelism is fulfilled, then distribute the available set of slots across all available slot sharing groups in order to decide on the actual runtime parallelism. In the absence of minimum, target and maximum parallelism, assume minimum = target = maximum = the parallelism defined in the JobGraph. Ideally, we make the slot assignment strategy pluggable.

Treat fail signal as scheduling event [FLINK-10505]
Instead of simply calling into the RestartStrategy, which restarts the existing ExecutionGraph with the same parallelism, the ExecutionGraphDriver should treat a recovery similarly to the initial scheduling operation. First, one needs to decide on the new parallelism of the ExecutionGraph (scale up/scale down) with respect to the available set of resources. Only if the minimum configuration is fulfilled will the potentially rescaled ExecutionGraph be restarted.

Introduce minimum, target and maximum parallelism to JobGraph [FLINK-10506]
In order to run a job with a variable parallelism, one needs to be able to define the minimum and maximum parallelism for an operator as well as the current target value. In the first implementation, if no explicit parallelism has been specified for an operator, the minimum could be 1 and the maximum the max parallelism of the operator. If a parallelism p has been specified (via setParallelism(p)), then minimum = maximum = p. The target value could be the command line parameter -p or the default parallelism.

Scale job up if new resources are available [FLINK-9957]
Add a rescaling strategy to the ExecutionGraphDriver which decides when to rescale the existing job. The simplest implementation could be to rescale whenever this is possible and after a grace period between successive rescaling events has passed.

Set target parallelism to maximum when using the standalone job cluster mode [FLINK-10507]
In order to enable the reactive container mode, we should set the target value to the maximum parallelism if we run in standalone job cluster mode. That way, we will always use all available resources and scale up if new resources are being added.

Future roadmap

tbd

Appendix: WIP - Autoscaling in passive container mode (previous design)

Scope of this feature iteration for Flink 1.7.0

This document outlines the design for the first iteration of the auto-scaling feature, mostly in the context of passive container mode. A passive container environment means that the Flink job does not have active control over resource allocation or destruction in the cluster (TaskManager workers). It can detect that new resources have become available in the cluster or that some old ones are gone (failed). In active container mode, the Flink RM can proactively request resources from the cluster. Downscaling can also be relevant for a failing/overloaded active mode cluster where resources temporarily cannot be allocated.

Current assumptions:
- There is only one job running in the cluster (single job mode).
- The job is in streaming mode.

Upon initial start, Flink bootstraps the job with the default parallelism as before. If Flink detects a change in the number of available slots, it tries to fit the maximum possible parallelism for the given number of slots and automatically restarts the job with this new parallelism if possible. Flink should also:
- preserve the parallelism fixed by the user (setParallelism()),
- uniformly distribute new slots between slot groups with non-fixed parallelism,
- respect the restart strategy.

Implementation design

Configuration
- Enable the autoscaling flag in the cluster (there could be separate flags for up- and downscaling activation).
- The job parallelism (cli run -p or config default) is now not fixed but just the initial one to bootstrap the job.
- The user has to activate checkpointing; the activated checkpointing mode should be checked for support of key group rescaling upon restoration.
- Optionally, the user can configure the minimum parallelism for the job to run; the job will not be run with less parallelism and might optionally fail completely if enough slots are still unavailable after e.g. some timeout. The default minimum parallelism is one.

High level view

Resource manager

The resource manager can forward slot reports from joining task managers to the job manager. This can happen always or only if upscaling is activated. The resource manager already holds the view of available task managers. It updates the view when new task managers join, and it detects their failure by heartbeating. Therefore the job master can just query the resource manager for the currently available number of slots, e.g.:
ResourceManagerGateway.requestResourceOverview().getNumberRegisteredSlots()

Job manager

Job graph modifications

We have to incorporate a fixed-parallelism flag into the JobVertex of the JobGraph if the user explicitly sets the parallelism for the corresponding operator by calling setParallelism(). This flag will allow the job rescaler to respect it while re-evaluating the changed parallelism. Alternatively, resolve it on the server side and keep the parallelism in JobVertex at -1 if it is not fixed.

Job rescaling

Upscaling of a running job

The JM listens to notifications of new slots from the RM and passes them to the job upscaler. The job upscaler (re)starts a debounce timer upon this notification. Alternatively, the upscaler can just pull the available slot number from the RM with a fixed debounce delay. When the debounce timer elapses, the job rescaler fetches the number of currently available slots from the RM and asks the rescaling strategy for the new maximum possible parallelism. If the parallelism increased, the job upscaler cancels the job and triggers a restart operation with the new parallelism.
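A sketch of this debounce-based upscaler follows; the collaborators (slot-count supplier, restart callback) are hypothetical stand-ins for the RM gateway and the rescaling strategy, not real Flink interfaces.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.IntConsumer;
import java.util.function.IntSupplier;

/** Sketch of a job upscaler driven by slot notifications and a debounce timer. */
final class JobUpscaler {
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor();
    private final IntSupplier availableSlots;        // e.g. backed by requestResourceOverview()
    private final IntConsumer restartWithParallelism; // cancels the job and restarts it
    private final long debounceMillis;
    private volatile int currentParallelism;
    private ScheduledFuture<?> pending;

    JobUpscaler(IntSupplier availableSlots, IntConsumer restartWithParallelism,
                long debounceMillis, int initialParallelism) {
        this.availableSlots = availableSlots;
        this.restartWithParallelism = restartWithParallelism;
        this.debounceMillis = debounceMillis;
        this.currentParallelism = initialParallelism;
    }

    /** Called by the JM whenever the RM reports new slots; (re)starts the debounce timer. */
    synchronized void onNewSlots() {
        if (pending != null) {
            pending.cancel(false); // a newer notification supersedes the pending check
        }
        pending = timer.schedule(this::maybeRescale, debounceMillis, TimeUnit.MILLISECONDS);
    }

    private void maybeRescale() {
        // Stand-in for asking the rescaling strategy for the max possible parallelism.
        int newParallelism = availableSlots.getAsInt();
        if (newParallelism > currentParallelism) {
            currentParallelism = newParallelism;
            restartWithParallelism.accept(newParallelism);
        }
    }
}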
Failure restart and downscaling

This type of restart/rescaling can also be relevant for the active container mode, e.g. if it could not allocate more TMs (instead of failing upon restart because of the previous failure). We can create a special restart strategy for auto-downscaling which substitutes, or wraps and intercepts, the restart strategy configured by the user in the execution graph. It will forward global failures to the job downscaler for a new rescaling attempt or a final global failure.

Failure cases:
- Just restart on any failure other than NoResourceAvailableException. The starting or running job can fail this way. We do not know whether it was a user attempt at downscaling or some other, maybe temporary, failure. The job has to be given a chance to restart with the same parallelism and request new resources (instead of the failed ones in active mode) or try the old ones again in passive mode. The original restart strategy should be consulted on whether to continue restarting attempts.
- Downscale on NoResourceAvailableException. The job start failed, most probably because the available resources decreased or cannot be requested in passive mode after the previous failure. The job downscaler can ask the rescaling strategy for the new (most probably decreased) parallelism. The job can be restarted with the new parallelism immediately or with a smaller timeout to reduce downtime, with or potentially without consulting the original restart strategy, as the downscaler will eventually reach the parallelism at which the job cannot be run.
- Below minimum parallelism. The job downscaler does not start the job with a parallelism less than the minimum. It should wait for enough slots to become available. It can optionally fail completely if the slot number does not increase after several checks with a fixed delay.

Speed up downscaling to decrease downtime: if the job needs to downscale, firstly some TMs will die, because the user will usually try to downscale by killing them in passive mode, or the cluster cannot provide resources in active mode. The running job will fail with some error, but we do not know why exactly. The next restart with the same parallelism will fail with NoResourceAvailableException, and then we can be sure to downscale. The failure of unavailable slot requests will take some time (currently the timeout for queued requests in the FLIP-6 code). The actual downscale will be delayed by it, increasing the job downtime. This might need some tuning of this slot request timeout to decrease downtime.
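To illustrate the wrapping restart strategy, here is a sketch with hypothetical interfaces (Flink's actual RestartStrategy API differs); the point is the interception of NoResourceAvailableException, not the exact signatures.

/** Hypothetical stand-in for the user-configured restart strategy. */
interface RestartStrategy {
    boolean canRestart();
}

/** Hypothetical job downscaler collaborator. */
interface JobDownscaler {
    /** Asks the rescaling strategy for a decreased parallelism and restarts. */
    void attemptDownscale();
}

/** Hypothetical marker for Flink's NoResourceAvailableException. */
class NoResourceAvailableException extends Exception { }

/** Wraps the user's restart strategy and intercepts global failures. */
final class DownscalingRestartStrategy {
    private final RestartStrategy userConfigured;
    private final JobDownscaler downscaler;

    DownscalingRestartStrategy(RestartStrategy userConfigured, JobDownscaler downscaler) {
        this.userConfigured = userConfigured;
        this.downscaler = downscaler;
    }

    /** Decides between downscaling and a plain restart; returns false on final failure. */
    boolean onGlobalFailure(Throwable cause) {
        if (cause instanceof NoResourceAvailableException) {
            // Resources shrank (or cannot be re-requested): try a lower
            // parallelism, possibly without consulting the user's strategy.
            downscaler.attemptDownscale();
            return true;
        }
        // Any other failure: restart with the same parallelism if the
        // user-configured strategy still allows it.
        return userConfigured.canRestart();
    }
}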
