1
00:00:07,450 --> 00:00:11,760
After watching this video, you will be able
to recognize the importance of embracing failure,

2
00:00:11,760 --> 00:00:14,160
explain the importance of quick recovery from
failure,

3
00:00:14,160 --> 00:00:17,960
describe how retry patterns, circuit breaker
patterns, and bulkhead patterns help make

4
00:00:17,960 --> 00:00:22,900
applications resistant to failure, and
describe chaos engineering.

5
00:00:22,900 --> 00:00:27,700
Once you design your application as a collection
of stateless microservices, there are a lot

6
00:00:27,700 --> 00:00:31,930
of moving parts, which means that there is
a lot that can go wrong.

7
00:00:31,930 --> 00:00:36,050
Services will occasionally be slow to respond
or even have outages, so you can’t always

8
00:00:36,050 --> 00:00:38,399
rely on them being available when you need
them.

9
00:00:38,399 --> 00:00:43,480
Hopefully, these incidents are very short-lived,
but you don’t want your application failing

10
00:00:43,480 --> 00:00:47,540
just because a dependent service is running
slow or there is a lot of network latency

11
00:00:47,540 --> 00:00:49,410
on a particular day.

12
00:00:49,410 --> 00:00:54,300
This is why you need to design for failure
at the application level.

13
00:00:54,300 --> 00:00:58,990
Since failure is inevitable, you must build
your software to be resistant to failure and

14
00:00:58,990 --> 00:01:00,000
scale horizontally.

15
00:01:00,000 --> 00:01:02,040
We must design for failure.

16
00:01:02,040 --> 00:01:04,260
We must embrace failure.

17
00:01:04,260 --> 00:01:06,420
Failure is the only constant.

18
00:01:06,420 --> 00:01:12,030
We must change our thinking, moving from
how to avoid failure to how to identify failure

19
00:01:12,030 --> 00:01:14,829
when it happens, and what to do to recover
from it.

20
00:01:14,829 --> 00:01:19,049
This is one of the reasons why we moved DevOps
measurements from “mean time to failure”

21
00:01:19,049 --> 00:01:20,350
to “mean time to recovery.”

22
00:01:20,350 --> 00:01:22,350
It’s not about trying not to fail.

23
00:01:22,350 --> 00:01:27,670
It’s about making sure that when failure
happens, and it will, you can recover

24
00:01:27,670 --> 00:01:29,610
quickly.

25
00:01:29,610 --> 00:01:32,049
Application failure is no longer purely an
operational concern.

26
00:01:32,049 --> 00:01:33,719
It is a development concern as well.

27
00:01:33,719 --> 00:01:39,200
For the application to be resistant or resilient,
developers need to build that resilience

28
00:01:39,200 --> 00:01:40,200
right in from the start.

29
00:01:40,200 --> 00:01:43,700
And because microservices are always making
external calls to services that you don’t

30
00:01:43,700 --> 00:01:48,119
control, your microservices become especially
prone to problems.

31
00:01:48,119 --> 00:01:49,740
Plan to be throttled.

32
00:01:49,740 --> 00:01:53,800
You're going to pay for a certain level of
quality of service from your backing services in

33
00:01:53,800 --> 00:01:56,549
the cloud, and they will hold you to that
agreement.

34
00:01:56,549 --> 00:02:00,689
Let’s say you pick a plan that allows 20
database reads per second.

35
00:02:00,689 --> 00:02:03,390
When you exceed that limit, the service is
going to throttle you.

36
00:02:03,390 --> 00:02:09,600
You are going to get a 429 Too Many Requests
error instead of a 200 OK, and you need

37
00:02:09,600 --> 00:02:10,830
to handle that.

38
00:02:10,830 --> 00:02:14,300
In this case, you would retry.

39
00:02:14,300 --> 00:02:16,940
This logic needs to be in your application
code.
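
A minimal sketch of that handling in Python (the endpoint URL and the use
of the requests library here are assumptions for illustration):

```python
import time

import requests

def read_with_throttle_handling(url, max_attempts=5):
    """Call a rate-limited endpoint, backing off when the service
    throttles us. A sketch only; the URL and response shape are
    hypothetical."""
    delay = 1  # seconds to wait after the first throttle
    for attempt in range(max_attempts):
        response = requests.get(url)
        if response.status_code == 200:   # 200 OK: we got our answer
            return response.json()
        if response.status_code == 429:   # 429 Too Many Requests: throttled
            # Honor the server's Retry-After header if it sends one;
            # otherwise back off exponentially.
            wait = int(response.headers.get("Retry-After", delay))
            time.sleep(wait)
            delay *= 2
        else:
            response.raise_for_status()   # some other failure
    raise RuntimeError(f"still throttled after {max_attempts} attempts")
```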

40
00:02:16,940 --> 00:02:20,760
When a retry fails, you want to back off
exponentially before trying again.

41
00:02:20,760 --> 00:02:23,510
The idea is to degrade gracefully.

42
00:02:23,510 --> 00:02:27,879
If you can, cache where appropriate so that
you don’t always have to make remote

43
00:02:27,879 --> 00:02:33,660
calls to these services if the answer isn’t
going to change.
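
One way to sketch that caching in Python is a small time-based cache in
front of the remote call (the TTL value and the `fetch` callable are
assumptions standing in for whatever remote service call you would
otherwise repeat):

```python
import time

_cache = {}  # key -> (expires_at, value)

def cached_lookup(key, fetch, ttl_seconds=300):
    """Return a cached answer while it is still fresh; otherwise make
    the remote call and remember the result."""
    entry = _cache.get(key)
    if entry and entry[0] > time.time():
        return entry[1]                   # fresh: skip the remote call
    value = fetch(key)                    # remote call only on a cache miss
    _cache[key] = (time.time() + ttl_seconds, value)
    return value
```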

44
00:02:33,660 --> 00:02:39,290
There are a number of patterns that serve as
important strategies for making applications resilient.

45
00:02:39,290 --> 00:02:42,580
I just want to go over a few of the popular
ones.

46
00:02:42,580 --> 00:02:45,450
The first one is the retry pattern.

47
00:02:45,450 --> 00:02:49,701
This enables an application to handle transient
failures when it tries to connect to a service

48
00:02:49,701 --> 00:02:54,780
or a network resource, by transparently retrying
the failed operation.

49
00:02:54,780 --> 00:02:58,599
I've heard developers say, “You must deploy
the database before you start my service

50
00:02:58,599 --> 00:03:02,579
because it expects the database to be there
when it starts.”

51
00:03:02,579 --> 00:03:07,319
That is a fragile design, which isn't appropriate
for cloud native applications.

52
00:03:07,319 --> 00:03:12,409
If the database is not there, your application
should wait patiently and then retry again.

53
00:03:12,409 --> 00:03:16,340
You must be able to connect, reconnect,
fail to connect, and connect again.

54
00:03:16,340 --> 00:03:20,340
That is how you design robust cloud native
microservices.

55
00:03:20,340 --> 00:03:25,859
The key to the retry pattern is to back off
exponentially, delaying longer in between

56
00:03:25,859 --> 00:03:26,859
each try.

57
00:03:26,859 --> 00:03:31,783
Instead of retrying 10 times in a row and overwhelming
the service, you retry and it fails.

58
00:03:31,783 --> 00:03:33,430
Then you wait a second and retry again.

59
00:03:33,430 --> 00:03:38,209
Then you wait 2 seconds, then you wait 4 seconds,
then you wait 8 seconds.

60
00:03:38,209 --> 00:03:42,430
Each time you retry, you increase the wait
time by some factor until all of the retries

61
00:03:42,430 --> 00:03:45,430
have been exhausted and then you return an
error condition.

62
00:03:45,430 --> 00:03:49,080
This gives the backend service time to recover
from whatever is causing the failure.

63
00:03:49,080 --> 00:03:51,390
It could be just temporary network latency.
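
A minimal sketch of that backoff loop in Python (the retry limits and the
`operation` callable are placeholder assumptions):

```python
import time

def retry_with_backoff(operation, max_retries=5, base_delay=1, factor=2):
    """Retry `operation`, increasing the wait between attempts:
    1s, 2s, 4s, 8s, and so on."""
    delay = base_delay
    for attempt in range(max_retries):
        try:
            return operation()
        except Exception:
            if attempt == max_retries - 1:
                raise                 # retries exhausted: surface the error
            time.sleep(delay)         # give the backend time to recover
            delay *= factor           # back off exponentially
```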

64
00:03:51,390 --> 00:03:56,390
The circuit breaker pattern is similar to
the electrical circuit breakers in your home.

65
00:03:56,390 --> 00:03:59,030
You have probably experienced a circuit breaker
tripping in your house.

66
00:03:59,030 --> 00:04:03,670
You may have done something that exceeds the
power limit of the circuit and it causes the

67
00:04:03,670 --> 00:04:04,859
lights to go out.

68
00:04:04,859 --> 00:04:09,440
That’s when you go down to the basement
with a flashlight and you reset the circuit breaker

69
00:04:09,440 --> 00:04:11,099
to turn the lights back on.

70
00:04:11,099 --> 00:04:14,569
Well, the circuit breaker pattern works in
the same way.

71
00:04:14,569 --> 00:04:20,110
It is used to identify a problem and then
do something about it to avoid cascading failures.

72
00:04:20,110 --> 00:04:24,810
A cascading failure is when one service is
not available, and it causes a cascade of

73
00:04:24,810 --> 00:04:26,810
other services to fail.

74
00:04:26,810 --> 00:04:30,590
With the circuit breaker pattern, you can
avoid this by tripping the breaker and having

75
00:04:30,590 --> 00:04:34,830
an alternate path return something useful
until the original service recovers and the

76
00:04:34,830 --> 00:04:37,080
breaker closes again.

77
00:04:37,080 --> 00:04:41,178
The way it works is that everything flows
normally as long as the circuit breaker is

78
00:04:41,178 --> 00:04:42,178
closed.

79
00:04:42,466 --> 00:04:46,120
The circuit breaker is monitoring for failure up
to a certain limit.

80
00:04:46,120 --> 00:04:51,380
Once it reaches that threshold, the circuit
breaker trips open, and all further

81
00:04:51,380 --> 00:04:56,210
calls to the circuit breaker return with an
error, without even calling the protected

82
00:04:56,210 --> 00:04:57,210
service.

83
00:04:57,210 --> 00:05:00,700
Then after a timeout, it enters this half-open
state where it tries to communicate with the

84
00:05:00,700 --> 00:05:01,700
service again.

85
00:05:01,700 --> 00:05:03,740
If it fails, it goes right back to open.

86
00:05:03,740 --> 00:05:08,470
But if it succeeds, it becomes fully closed again.
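
Here is a minimal sketch of those three states in Python (the failure
threshold and reset timeout values are illustrative assumptions):

```python
import time

class CircuitBreaker:
    """Closed: calls flow through. Open: calls fail fast without touching
    the service. Half-open: one trial call decides which way to go."""

    def __init__(self, failure_threshold=5, reset_timeout=30):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None
        self.state = "closed"

    def call(self, protected_service, *args, **kwargs):
        if self.state == "open":
            if time.time() - self.opened_at >= self.reset_timeout:
                self.state = "half-open"   # timeout elapsed: try the service again
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = protected_service(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"        # trip the breaker
                self.opened_at = time.time()
            raise
        self.failures = 0
        self.state = "closed"              # success: fully closed again
        return result
```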

87
00:05:08,470 --> 00:05:13,230
The bulkhead pattern can be used to isolate
failing services to limit the scope of a failure.

88
00:05:13,230 --> 00:05:18,350
This is a pattern where using separate thread
pools can help you recover from a failed database

89
00:05:18,350 --> 00:05:23,000
connection by directing traffic to an alternate
thread pool that’s still active.
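
A small, self-contained sketch of that idea in Python, giving each
dependency its own bounded thread pool (the pool sizes and the stand-in
service functions are assumptions):

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Each downstream dependency gets its own bounded pool: its own
# "compartment." If database calls hang and fill their 5 threads,
# the payment pool still has free threads, so that part of the
# application keeps working.
database_pool = ThreadPoolExecutor(max_workers=5)
payments_pool = ThreadPoolExecutor(max_workers=5)

def slow_database_call():
    time.sleep(60)                # stand-in for a hung database connection
    return "rows"

def payment_call():
    return "payment accepted"     # stand-in for a healthy payment service

# Even if many database calls pile up and exhaust their pool...
stuck = [database_pool.submit(slow_database_call) for _ in range(20)]

# ...payment requests still complete, because they use a separate pool.
print(payments_pool.submit(payment_call).result())
```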

90
00:05:23,000 --> 00:05:26,190
Its name comes from the bulkhead design on
a ship.

91
00:05:26,190 --> 00:05:29,500
Compartments that are below the waterline
have walls called “bulkheads” between

92
00:05:29,500 --> 00:05:30,500
them.

93
00:05:30,500 --> 00:05:34,160
If the hull is breached, only one compartment
will fill with water.

94
00:05:34,160 --> 00:05:40,750
The bulkhead stops the water from affecting
the other compartments and sinking the ship.

95
00:05:40,750 --> 00:05:45,500
Using the bulkhead pattern isolates consumers
from cascading service failures by

96
00:05:45,500 --> 00:05:49,270
allowing them to preserve some functionality
in the event of a service failure.

97
00:05:49,270 --> 00:05:52,640
Other services and features of the application
continue to work.

98
00:05:52,640 --> 00:05:57,560
Finally, there is chaos engineering, otherwise
known as monkey testing.

99
00:05:57,560 --> 00:06:02,312
While not a software design pattern, it is
a good practice to prove that all of your

100
00:06:02,312 --> 00:06:05,460
design patterns work as expected under failure.

101
00:06:05,460 --> 00:06:10,270
In chaos engineering, you deliberately kill
services to see how other services are

102
00:06:10,270 --> 00:06:11,270
affected.

103
00:06:11,270 --> 00:06:16,690
Netflix has a suite of failure-inducing tools
called The Simian Army.

104
00:06:16,690 --> 00:06:20,160
Chaos Monkey solely handles termination
of random instances.

105
00:06:20,160 --> 00:06:23,710
Netflix randomly kills things to see if they
come back up and whether the system will recover

106
00:06:23,710 --> 00:06:24,710
gracefully.
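
As a toy illustration of the idea, not Netflix's actual tooling (the
service names and the use of Docker here are assumptions):

```python
import random
import subprocess

# Candidate services running as containers in a test environment.
services = ["orders", "inventory", "shipping", "billing"]

# Pick a victim at random and kill it, then watch how the rest of the
# system behaves. Only run something like this against an environment
# you are prepared to break.
victim = random.choice(services)
print(f"Killing service: {victim}")
subprocess.run(["docker", "kill", victim], check=False)
```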

107
00:06:24,710 --> 00:06:29,080
You cannot know how something will respond
in a failure in production until it actually

108
00:06:29,080 --> 00:06:30,400
fails in production.

109
00:06:30,400 --> 00:06:33,910
So, Netflix does this on purpose.

110
00:06:33,910 --> 00:06:39,270
All of these patterns can help you build more
robust software that responds gracefully to

111
00:06:39,270 --> 00:06:42,470
intermittent failures.

112
00:06:42,470 --> 00:06:47,330
In this video, you learned that failure is
inevitable, so we design for failure rather

113
00:06:47,330 --> 00:06:49,830
than trying to avoid failure.

114
00:06:49,830 --> 00:06:53,770
Developers need to build in resilience
to be able to recover quickly.

115
00:06:53,770 --> 00:06:57,680
Retry patterns work by retrying failed operations.

116
00:06:57,680 --> 00:07:01,780
Circuit breaker patterns are designed to avoid
cascading failures.

117
00:07:01,780 --> 00:07:05,580
Bulkhead patterns can be used to isolate failing
services.

118
00:07:05,580 --> 00:07:10,240
Chaos engineering is deliberately causing
services to fail to see how other services

119
00:07:10,240 --> 00:07:11,346
are affected.