OT: 512KDay - another Y2K bogeyman?

Jaroslav Lukesh lukesh at seznam.cz
Wednesday August 13 21:10:21 CEST 2014


http://www.theregister.co.uk/2014/08/13/512k_invited_us_out_to_play/


512KDay: Why the internet is BROKEN (Next time, big biz, listen to your 
network admin)
We failed the internet's management challenge
By Trevor Pott, 13 Aug 2014
Yesterday, 12 August 2014, the internet crossed an arbitrary limit: more than
512K routes. This 512K route limit is something we have known about for some
time.
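
To make the failure mode concrete, here is a minimal sketch of a forwarding
table with a hard capacity. It is a toy model, not any vendor's
implementation: the class name ForwardingTable, its methods, and the reading
of "512K" as 512 * 1024 entries are all assumptions made for illustration.

```python
# Toy model of a fixed-capacity forwarding table (hypothetical, not vendor code).
# Assumption: "512K" is read here as 512 * 1024 entries.
LIMIT_512K = 512 * 1024


class ForwardingTable:
    def __init__(self, capacity=LIMIT_512K):
        self.capacity = capacity
        self.routes = {}   # prefix -> next hop
        self.dropped = 0   # routes that did not fit

    def install(self, prefix, next_hop):
        """Install a route; return False if the table is already full."""
        if prefix not in self.routes and len(self.routes) >= self.capacity:
            self.dropped += 1
            return False
        self.routes[prefix] = next_hop
        return True


# A tiny capacity keeps the demonstration readable.
table = ForwardingTable(capacity=4)
results = [table.install(f"10.0.{i}.0/24", "192.0.2.1") for i in range(6)]
print(results)                           # the last two installs fail
print(len(table.routes), table.dropped)
```

Once the table is full, every additional prefix is simply not installed, so
traffic for those destinations has nowhere to go until capacity is raised.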
The fix for Cisco devices – and possibly others – is fairly straightforward.
Internet service providers and businesses around the world chose not to
address the issue in advance, and the result was major outages around the
world.
As part of the outage, punters experienced patchy – or even no – internet 
connectivity and lost access to all sorts of cloud-based services. The 
LastPass outage is being blamed by many on 512KDay, though official 
confirmation of this is still pending. I have been tracking reports of 
inability to access cloud services such as Office365 through to more 
localised phenomena from around the world, many of which look very much like 
they are 512KDay related.
As an example of the latter, while I don't yet have official confirmation
from Canadian ISP Shaw, all indications are that the "mystery routing
sickness" which affected its network (and which continues at time of
publishing) could be related to the "512KDay" issue.
The issues I experienced with Shaw are likely down to routers within Shaw
hitting the 512K limit. These routers hit the magic number and then were
unable to route individual protocols (such as RDP, for example, although we
cannot confirm this is so in Shaw's case) to the Deep Packet Inspection
(DPI) systems that the ISP uses to create a "slow lane", or rather to
"enhance our internet experience".
As the fix for these issues can range from "applying a patch or config 
change and rebooting a core piece of critical network infrastructure" to 
"buy a new widget, the demand for which has just hit peak" there is every 
chance that 512KDay issues will continue for a few days (or even weeks) yet 
to come.
Others around the world have seen issues as well. Consider the issues 
reported by Jeff Bearer of Avere Systems who says "my firewall started 
noting packet loss between it and its upstream router. It wasn't that bad 
until employees started showing up for work, but then it jumped up quite a 
bit. We don't have any real evidence, but I did go back and forth with the 
ISP several times. It looks like it probably was [the 512KDay event] that 
caused this."
Awareness
Bearer asks a critical question: "Why wasn't this in the press, like Y2K or
IPv4?"
Perhaps this is the ghost of Y2K. Globally, we handled the very real issues 
posed by computers being unable to comprehend the passing of the millennium 
so well that the average punter didn't notice the few systems that didn't 
get updated. IPv4 address exhaustion has been a highly publicised apocalypse
that has dragged on for over a decade, and the internet has yet to collapse.
512KDay is simply "yet another arbitrary limit issue" that has for years
been filed away alongside the famous Y2K, IPv4 or 2038 problems. If you're
interested in some of the others, Wikipedia has a brief overview of these 
"time formatting and storage bugs" that explains the big ones, but doesn't 
have a listing for all the known ones.
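
As a worked example of one such arbitrary limit, the 2038 problem comes from
counting seconds since 1 January 1970 in a signed 32-bit integer. The sketch
below only does the arithmetic; the helper wrap_int32 is a hypothetical name
that imitates two's-complement overflow.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
MAX_INT32 = 2**31 - 1  # largest value a signed 32-bit counter can hold


def wrap_int32(n):
    """Imitate two's-complement wraparound of a signed 32-bit integer."""
    return (n + 2**31) % 2**32 - 2**31


# Last second that fits in the counter.
last_moment = EPOCH + timedelta(seconds=MAX_INT32)
print(last_moment.isoformat())   # 2038-01-19T03:14:07+00:00

# One second later the counter wraps to a large negative number.
wrapped = wrap_int32(MAX_INT32 + 1)
after_wrap = EPOCH + timedelta(seconds=wrapped)
print(after_wrap.isoformat())    # 1901-12-13T20:45:52+00:00
```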
Do the media bear some of the blame? Perhaps. I have seen 512KDay issues 
raised in many IPv4 articles over the years, but rarely has it been 
discussed in a major publication as an issue in and of itself. Perhaps this 
is an example of crisis fatigue working its way into the technological 
sphere: as we rush from one manufactured "crisis" to another, we stop having 
brain space and resources to deal with the real issues that confront us.
The finger of blame
One thing I do know is that it is the job of network administrators to know 
about these issues and deal with them. What wasn't in the mainstream media 
has been in the networking-specific trade press, in vendor documentation and 
more.
I have been contacted by hundreds of network administrators in the past 12 
hours with tales of woe. The common thread among them is that they 
absolutely did raise the flag on this, with virtually all of them being told 
to leave the pointy-haired boss's sight immediately.
Based on the evidence so far, I absolutely do not accept the inevitable 
sacrifice of some junior systems administrator to the baying masses. 
Throwing nerds under the bus doesn't cut it. The finger of blame points 
squarely at ISPs and other companies using BGP routers improperly all across 
the internet.
It's easy to make a bogeyman out of ISPs; they're among the most hated
industries in the world, after all. It's easy to point the finger of blame 
at companies that chose not to update their infrastructure because I've 
spent a lifetime fighting that battle from the coalface and it has made me a 
bitter and spiteful person.
Not my problem
Unfortunately, despite any understandable anti-corporate angst I might 
maintain, 512KDay was completely avoidable, and – mark my words – this is 
the beginning, not the end.
Another looming problem is IPv6. Millions of small and medium businesses 
today use two internet links and a simple IPv4 NAT router to provide 
redundancy and failover. Everything behind the NAT router keeps the same IP; 
only the edge IP changes if things go pear-shaped.
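
The dual-link NAT arrangement described above can be sketched as follows.
This is a simplified model under stated assumptions, not router firmware: the
class DualWanNat and its methods are hypothetical, and the addresses come
from the RFC 1918 private range and the RFC 5737 documentation ranges.

```python
# Hypothetical model of a dual-WAN IPv4 NAT router with failover.
class DualWanNat:
    def __init__(self, wan_ips):
        self.wan_ips = list(wan_ips)                  # ordered by preference
        self.link_up = {ip: True for ip in self.wan_ips}

    def set_link(self, wan_ip, up):
        """Record that a WAN link went up or down."""
        self.link_up[wan_ip] = up

    def active_wan(self):
        """Return the first WAN address whose link is still up."""
        for ip in self.wan_ips:
            if self.link_up[ip]:
                return ip
        raise RuntimeError("no WAN link available")

    def translate(self, internal_src):
        """Map an internal source address to (internal, external) addresses."""
        return (internal_src, self.active_wan())


nat = DualWanNat(["198.51.100.10", "203.0.113.20"])
before = nat.translate("192.168.1.50")
nat.set_link("198.51.100.10", False)   # primary ISP link fails
after = nat.translate("192.168.1.50")
print(before)
print(after)
```

The host behind the router keeps 192.168.1.50 throughout; only the external
address chosen at the edge changes when the primary link drops.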
With IPv6 NAT functionally banned by the ivory tower types who designed the 
protocol, currently, there is no neat solution to this in IPv6. Existing 
doctrine states that SMBs should simply get an AS number, get their own 
subnet and then manage and maintain their own BGP routers, announcing routes 
to both ISPs.
Putting aside for one moment that most SMB-facing ISPs won't allow BGP 
across low margin links, the above 512KDay issues should demonstrate 
adequately that the very concept is completely bleeping batbleep insane.
The largest companies in the world – and internet service providers around 
the world – failed to listen to the best trained network engineers in the 
world. Dolly's automated doughnut emporium and internet-driven robobakery is 
not going to have talent or resources anywhere near what those organisations 
have to offer.
Oh, there are proposals that state "in a perfect world where everyone throws 
out every computer, router and application they have, everything will 
understand the concept of multiple addresses, multiple paths, and 
redundancy. Assuming we all agree on this RFC and then implement it." We 
haven't all agreed on an RFC, and anyone who thinks the world is chucking 
its entire installed base of IT to achieve in IPv6 what was a 
fire-and-forget $150 dual-port WAN under IPv4 is barking mad.
Alternatively, we could abandon the idea of maintaining address coherency
internal to our networks altogether and become utterly and completely
dependent on DNS, even inside our own networks and even for such
critical bits of our infrastructure as basic management and diagnostic
tools.
And what if it all comes crumbling down?
Absolute DNS reliance is madness. DNS does fail. Even if you are made out of 
super-macho certifications and have nerd cred that ripples like a 
steroid-abusing superman's muscles at the gym. It fails for the same sorts 
of reasons that 512KDay bit us in our collective proverbial: human error.
Human beings make mistakes. We make the wrong call. We don't spend money 
where we should and we spend money where we shouldn't. Hindsight is 20/20 
and foresight far more hazy, and sometimes we take what we think is an 
"acceptable risk" only to have it turn out to be the thing that bites us.
If 512KDay should teach us anything, it is that no single point of failure 
is acceptable, no matter the size of your business. Yet we also need to bear 
in mind that not all businesses are Fortune 500 companies with the GDP of 
Brunei to spend on networking every year.
If you bet your company on cloud computing to be the thing that enables you 
to operate, there was a good chance that 512KDay was a cold splash of 
reality when compared to the smooth marketing promises of unbeatable 
reliability and 24/7 availability. What good is cloud anything if the 
network connecting you to it fails?
In a similar vein, the series of bad and worse choices for SMBs looking to 
implement IPv6 will lead many of them to choose single points of failure 
even if they don't want to. Absolute reliance on DNS or a single ISP could 
cost them in the future.
What of the other as yet unknown vulnerabilities? Arbitrary limits and the 
design-by-fiat decisions that simply tell significant (but usually not 
well-heeled) chunks of the world to take a hike? Will we, as an industry, 
learn anything from 512KDay, or will we continue along as though nothing is 
wrong, blaming anyone and everyone except those who actually have the
ability to effect change?
I leave you with a quote from Upton Sinclair: "It is difficult to get a man 
to understand something, when his salary depends upon his not understanding 
it."
512KDay was avoidable. Will we choose to avoid the next one? Answers in the 
comments, please. ®


