September 14, 2008

Only NASA Would Create a Zero Fault Tolerant Email System

NASA NOMAD: Email Services Being Stopped at Johnson Space Center (JSC)

"Email, Calendaring, and Outlook Web Access (OWA) services will be stopped today in preparation for hurricane IKE for all customers located on the JSC servers. Customers located primarily at the following centers will not be able to send or receive email messages.

1. Glenn Research Center (GRC)
2. White Sands Test Facility (WSTF)
3. Johnson Space Center (JSC)
4. NASA Shared Services Center (NSSC)
5. Dryden Flight Research Center (DFRC)

If you attempt to send messages to impacted customers, located primarily at the above Centers, your message will not be delivered until the email services are restarted."

Editor's note: I had to re-read this message several times to make sure that I understood the sheer idiocy it contains. Whoever designed this IT infrastructure for NASA should be fired. A month ago a power outage at Redstone Arsenal caused a NOMAD shutdown. I thought that was a dumb IT setup. Well, now we learn that some genius decided to make the NASA NOMAD email system zero fault tolerant for natural disasters - and place email servers in a hurricane prone location - with no backup.

Here is how it seems to work: a hurricane threatens JSC - so NASA shuts off email and other services to a large chunk of the agency. Why? Because NASA deliberately set the system up such that other NASA centers - some of which are thousands of miles away and poised to offer assistance and keep the rest of the agency operating - have their email and other services routed out of JSC - and only JSC (or so it would seem). A few critical users have some service, but everyone else is out of luck for at least 48 hours. Would any self-respecting, profitable, commercial communications company do something as silly as this? No. They'd never stay in business. Only NASA would come up with such a flawed and stupid plan.

It looks like NASA learned nothing from either Katrina or 9-11.


Posted by kcowing at September 14, 2008 11:48 PM
Comments

You are right Keith, NASA has learned nothing. Admiral Gehman said it best in the CAIB Report, "NASA is not a learning organization". We are doomed to repeat our failures.

Posted by: Deja Vu at September 12, 2008 8:14 PM

"Only NASA would come up with such a flawed and stupid plan."
-----------------------------------------------------------

Touché. This statement could be applied to so many of NASA's programs/policies today.

Posted by: JM at September 12, 2008 8:31 PM

The really dumb thing is that the agency insists on putting critical IT infrastructure in a hurricane prone area in the first place. This stuff should be located far inland. Someplace like GRC. Instead it goes to Houston, or the Shared Services Center.

Posted by: Dan at September 12, 2008 8:32 PM

I agree. For those not in the know, the two centers that house NOMAD services are JSC and MSFC, and every other center is slaved to one of these two. It seems crazy to me that the two centers picked as critical network infrastructure sites are some of the most hurricane-disaster-prone. In a review that I attended, some extremely limited fail-over capability between the two sites was described, but apparently they are not even activating that for Ike.

Is it a surprise that when all the (email) eggs are placed in the One NASA basket, a few will break? It's a shame that the natural redundancy that exists between centers was not exploited.

Posted by: CM at September 12, 2008 9:10 PM

guess that's a good reason to use gmail after all huh.

Posted by: A guy at September 12, 2008 9:15 PM

Geee.... you'd think they'd just let the servers run and ride out the storm. I'm sure they have UPS and generators there. Maybe these servers are located in tents... or in a basement and the sump pump is out for service? :-) Losing WAN links is unavoidable - but running servers thru a storm - no big deal. Our home office runs servers thru fires, mudslides, and earthquakes. Remote Desktops... nothing new here.

Posted by: Rich at September 12, 2008 9:27 PM

Everyone that I know at NASA hates NOMAD. It was created by and for people who know nothing and do not care about NASA. Everyone knows that it was and is a single point failure for communications at NASA. Your report is absolutely no surprise to anyone with half a brain. It was a boondoggle and a waste of taxpayers' money. Many civil service and contractor employees at NASA opposed it. But nobody at the top cares. Does that surprise you?

Posted by: astroboy at September 12, 2008 9:55 PM

"This stuff should be located far inland. Someplace like GRC."

But this would mean JSC would have to give up some of its stranglehold on the agency.

Posted by: Sean at September 12, 2008 9:58 PM

"I had to re-read this message several times to make sure that I understood the sheer idiocy it contains."

hehe - so did I when I got it - I must've reread it 4 or 5 times before accepting that it really was that ridiculous. NOMAD is horrible.

Posted by: an employee at September 12, 2008 10:44 PM

Aww, now! You are all thinking LOGICALLY! This isn't Vulcanis...Earthlings (especially at NASA) aren't PERMITTED to think logically! This makes about as much sense as when we assaulted Grenada and an Army Ranger had to use his credit card to call back to the U.S. mainland...so he could get air support from Naval aircraft circling right overhead, because their radio frequencies weren't compatible!

"We have met the enemy...and he is us!" - Walt Kelly's Pogo

Posted by: Trailrider at September 12, 2008 11:01 PM

This time, Keith, I'm afraid I have to agree with you.

As much as I defended how well I thought the overall NOMAD system ran, for being an Exchange system, this is just, well, not smart.

Especially today, when it is so easy to have geographically dispersed redundant servers. The company I host my own site with has server farms at a couple places in the US, Germany and Asia. If one farm goes down, my site traffic simply gets redirected, with no one the wiser.

Posted by: space scott at September 12, 2008 11:27 PM

Many "rank and file" NASA scientists have pointed out the many deficiencies of the NOMAD system. We were basically told "shut up" and live with it because it's going to happen. The decision to go to NOMAD was a "management decision" in the worst sense of that term. That is, it was made by upper level management with no significant input from those it would affect most. The ostensible reasons for doing so were never adequately investigated; would it be cheaper, would it be more efficient, would it be more reliable? There was no attempt to study the systems it would replace; were they effective, were they cost effective, were employees satisfied, even happy with them? The decision was ideological (the antithesis of scientific), a commercial, privatized system must be better. Unfortunately, the ultimate pressures come from the very top of the government, and with such dysfunction at the top, it's not surprising that it trickles all the way down. NASA's civil servant managers simply haven't had the back-bone to resist. We are all worse off for that.

Posted by: TS at September 13, 2008 12:21 AM

This ain't rocket science.

Oh yes, I did.

Posted by: SebiMeyer at September 13, 2008 12:54 AM

Since there might be a large earthquake once every 25 years, the "powers that be" decided to put services like NOMAD at JSC and MSFC, instead of a place like Ames (which, ironically, is an IT center of excellence). But, there have been multiple hurricanes every year affecting both centers.

Posted by: HateNomad at September 13, 2008 2:57 AM

Well, well, well...it seems we have come full circle back to the "management doesn't listen to the engineers" discussion.

Posted by: Shuttle Hugger at September 13, 2008 7:46 AM

Remember: everything (including e-mail) must be controlled by the "Apollo-on-Steroids" crowd, lest the Nation no longer be "inspired."

Posted by: Common Sense at September 13, 2008 10:34 AM

NASA = Need Alternate Server Architecture

Posted by: Ray at September 13, 2008 11:07 AM

It is indeed a single point failure for e-mail, as well as the human capital portal, which is not working as of this morning. The NSSC was a bad idea as well and another boondoggle that was not supported by the rank and file at the Centers, which lost their ability to handle much of their own business services. What most people don't realize is that it is a way to funnel money to the IT division of a large aerospace company. Anyone surprised?

Posted by: astroboy at September 13, 2008 11:23 AM

It gets even better. Coming in the next few weeks is centralized DHCP and DNS services.

Posted by: me at September 13, 2008 11:29 AM

Of course we all should expect something so archaic, corrupt, and dopey... NOMAD is a creation of a government agency!!! What IF JSC had been severely damaged and down for a week?? Whose head would roll over that? Heads should roll anyway - this is sooooooo unacceptable.

If they have commercial web service available, I guess it's time to use a commercial email service that works..... I know I sure would.

Posted by: Charlie at September 13, 2008 12:16 PM

The person in the CIO organization at HQ that made these decisions is a long-time friend of mine and we talked about the risky decision to site any NOMAD servers at JSC given the obvious hurricane risk... TS is right, the decision was essentially forced by upper management pressure and ideology at that level. I believe that he told me that the reason that any servers were placed at JSC was due to one of the key project leaders being from there and unwilling to move anywhere else...there was huge pressure from the top to "do this now" and they didn't want to hear about anything that would slow down the "transition" from e-mail servers at each Center to the centralized NOMAD service. My friend knew the risk that they were taking, but his attitude was pretty much FIGMO and he planned to quietly work later to build in full NOMAD server redundancy at MSFC or another Center in a less vulnerable location. Of course, within six months after this conversation, my friend left the Agency because he couldn't stand the working environment in HQ and dealing with the 9th floor...

Posted by: NASA employee at September 13, 2008 12:21 PM

Isn't NOMAD brought to us by those "wonderful" people at ODIN?

Posted by: some guy at September 13, 2008 1:55 PM

Don't blame all of NASA for the stupid decisions made by folks at headquarters. Any attempts to inject reason into the process are IGNORED at local centers. This is NOT a NASA problem, it is a management problem.

Posted by: Sim at September 13, 2008 4:52 PM

Just curious, but since when has a hurricane struck MSFC? I have lived here for almost 2 decades and haven't seen one come close. By the time any head our way, they are just rain storms. This is not KSC. MSFC is not on the coast. 6 hour drive from the coast.

Though we do have our power failures! :)

Posted by: Curious at September 13, 2008 9:08 PM

I think the problem starts with the fact that NASA does not feel email is critical, nor is it an "official" communication method in their minds. I remember an issue a few years back where someone sent an email that was pretty critical and it didn't get received. Management sent out a policy statement that email was for convenience and not an official communication route for critical communications. It was hilarious because this policy statement was sent out by email! Call or send a letter if it's important, that was the mantra. I think NASA management doesn't feel email is critical so there is no need for failure tolerance. It's just an inconvenience not to be able to email someone.

Posted by: possum at September 13, 2008 10:13 PM

As someone who reads things like this, looks at NASA websites and so on, my opinion holds.

NASA needs some 19 year old to run their IT structure, dev the sites... It seems the people they have now are clueless to technology and the internet.

They also need a programmer at NASA TV but don't get me started on that rant.

Posted by: Danny at September 14, 2008 7:54 AM

Right - I tried to email a friend at JSC to see if he survived Ike ok. And got this:

Technical details of permanent failure:
DNS Error: DNS server returned answer with no data

I can't believe they don't have the ability to shift their server loads between locations. JSC is closed shift it to JPL or White Sands. Sheesh!!!

Posted by: jf at September 14, 2008 11:32 AM

NASA has a lot of good IT people, even better than 19 year olds. When the "requirements" come down from on high, you try to inject some sanity into them, you shut up and obey, or you leave. Guess which option career NASA people are taking.

Centralization, FDCC, HSPD-12, eAuth, Smartcards, NAMS, NCAD, timetables and other efforts are more important than usability. Take a look at the OCIO's NISE page if you haven't seen them yet.

Posted by: IT Guy at September 14, 2008 11:50 AM

The really odd and sad part is there will be knowledgeable people at NASA who will say "to be fair..." the whole story is not really as simple as this story and comments make it out to be. Simple complaints or observations will fail to sway experts precisely because the reasons are too simple. Experts love complexity. Delusion and megalomania are similar in symptoms. Complex internal story telling justifies irrational behavior (or in this case project implementations). Simple logic usually comes from customers. Complex logic comes from implementors to whom the customer was a fund source, or security, or a program manager...or worse "a directive".

Case in point. We keep seeing ODIN-like efforts across the federal government. Increasingly the cost of such work is in IT security, configuration control (or making everyone uniform anyway), and standards (again, making everyone the same so we spend less money tracking differences). The customer wants? Umm...let's see...about 112th on the to do list.

Posted by: A NASA Engineer at September 14, 2008 2:10 PM

Like the rest of IT at NASA, NOMAD sucks in many ways. But hey, it costs less (so they say) because it consolidates e-mail services. Apparently redundancy is a bad thing. It's a wonder we design multi-engine airplanes.

If you think NOMAD sucks, the forced-deployment of ODIN is going to be even worse. We're just getting the tip of the iceberg at Langley, and it is already retarded. I guarantee we will look back on this at some point in the future and realize it was a huge mistake. But, it saves money on paper....

Posted by: NASA Engineer at September 14, 2008 8:39 PM

Thanks for bringing this to light, Keith. Not looking at the possible uses of a system is one of the simplest mistakes a designer can make. I think JSC is one of the many Houston communities that will need to brainstorm lessons learned after the dust (water) settles.

One side effect of this IT architecture is that I received a phone call (which went to voicemail) asking me to call MSFC to report my whereabouts. Email, or even text messages, worked fine for this function during all of the dry-runs, but voice calls are very hard to get out of Clear Lake right now - overloaded cell towers probably. They'll have to learn of my whereabouts later.

Even though Houston companies will all be learning from this "wet-run" - and hopefully improving their IT infrastructure - I hope JSC steps to the plate and recognizes that we need to lead by example here.

Posted by: Another JSC Engineer at September 14, 2008 11:34 PM

Seems like what they really want is a single server suite so that they can track and keep all employee emails in one facility. Couldn't be a big brother effort now could it?

Posted by: Anonymous at September 15, 2008 12:18 PM

Well, suddenly the 200MB inbox size limit makes sense: loss of user data is greatly limited in event of system failure. And to think I complained about having to clean out my Inbox and folders all the time... Thanks, NOMAD! You've got us covered!

Posted by: designer monkey at September 17, 2008 2:31 PM

We hear about "NASA Management".

Some years ago, I coined:

"Leadership is about maximizing gains.
Management is about minimizing losses."

Of course, choosing M$ as the vendor of the e-mail infrastructure was probably not the brightest idea... I tend to pronounce "Outlook" as "Outage". Loathed Notes (a/k/a "Lotus Notes") doesn't look quite so bad in comparison, it seems, but maybe the ancient PROFS system might be a step up?

NOMAD is probably intended to be "economically efficient".

Resiliency is seldom-- if ever-- economically efficient, which is why X programs tend to have only one test article of late.

Posted by: John Campbell at September 18, 2008 9:16 AM