HPC: Coming to an IT Shop Near You?

As smaller commercial IT shops gravitate toward cloud computing, they are also searching for lower-cost raw processing power to accelerate and improve product design. One solution could be high-performance computing (HPC) systems. Cloud computing has helped remove barriers that have kept smaller companies out of the HPC market, but it has also introduced new problems involving data movement to and from the cloud, as well as concerns about the security of data as it moves off site.

But Bill Magro, director of HPC Software Solutions at Intel, believes commercial shops will likely adopt HPC systems in greater numbers once a few technical and commercial barriers are eliminated. We talked to Magro about these barriers as well as the significant opportunity HPC represents for developers. Here’s what he had to say.

[Disclosure: Intel is the sponsor of this program.]

Q. Why hasn’t HPC broken out of the academic, research and government markets and into the commercial market?

Bill Magro: Actually, it has. Today, HPC is a critical design tool for countless enterprises. The Council on Competitiveness (compete.org) has an extensive set of publications highlighting the use of HPC in industry. HPC systems are often run within engineering departments, outside the corporate data center, which is perhaps one reason IT professionals don’t see as much HPC in industry. In any case, we believe over half of HPC is consumed by industry, and the potential for increased usage, especially by smaller enterprises, is enormous.

Q. Most of that potential revolves around what you call the “missing middle.” Can you explain what that is?

B.M.: At the high end of HPC, we have Fortune 500 companies, major universities, and national labs utilizing large to huge HPC systems. At the low end, we have designers working with very capable workstations. We’d expect high usage of medium-sized systems in the middle, but we, as an industry, see little. That is what we call the “missing middle.” Many organizations are now coming together to make HPC more accessible and affordable for users in the middle, but a number of barriers have to be cleared before that can happen.

Q. Might the killer app that gets the missing middle to adopt HPC systems be cloud computing?

B.M.: When cloud first emerged, it meant different things to different people. A common concern among IT managers was: Is cloud going to put me out of a job? That question naturally led to: Does cloud computing compete with HPC? An analyst told me something funny once. He said: “Asking if cloud competes with HPC is like asking if grocery stores compete with bread.” Cloud is a delivery mechanism for computing; HPC is a style of computing. HPC is the tool, and the cloud is one distribution channel for that tool. So, cloud can help, but new users need to learn the value of HPC as a first step.

Q. So will HPC increase the commercial adoption of cloud?

B.M.: I’d turn that question around to ask, “Will cloud help increase the commercial adoption of HPC?” I think the answer is “yes” to both questions. There is a synergy there. Today, we know applications like Facebook and Google Search can run in the cloud. And new applications like SpringPad can run in Amazon’s cloud services. Before we see heavy commercial use of HPC in the cloud, we need a better understanding of which HPC workloads can run safely and competently in the cloud.

Q. So you can assume cloud-computing resources are appropriate for solving HPC problems?

B.M.: Early on the answer was “No,” but more and more it is becoming “Yes – but for the right workloads and with the right configuration.” If you look at Amazon’s EC2 offering, it has specific cluster-computing instances, suitable for some HPC workloads and rentable by the hour. Others have stood up HPC/Cloud offerings, as well. So, yes, a lot of people are looking at how to provide HPC in the cloud, but few see cloud as replacing tightly integrated HPC systems for the hardest problems.

Q. What are the barriers to HPC for the more inexperienced users in the middle?

B.M.: There are a number of barriers, and the Council’s Reveal report explores these in some detail. Expertise in cluster computing is certainly a barrier. A small engineering shop may also lack capital budget to buy a cluster and the full suite of analysis software packages used by all their clients. Auto makers, for instance, use different packages. So a small shop serving the auto industry might have to buy all those packages and hire a cluster administrator, spending hundreds of thousands of dollars. But cloud computing can give small shops access to simulation capabilities on demand. They can turn capital expenses into variable expenses.

Q. What are some of the other barriers to reaching the missing middle?

B.M.: Security is a concern. If you upload a product design to the cloud, is it secure from your competitors? What about the size of those design files? Do most SMB shops have uplink speeds fast enough to send a file up and bring it back, or is it faster to compute locally? Finally, many smaller companies have users proficient in CAD who don’t yet have the expertise to run and interpret a simulation. Even fewer have the experience and confidence to prove that a simulation yields a better result than physical design. These are the questions people are struggling with. We know we will get through it, but it won’t happen overnight.

Q. Can virtualization play a role in breaking down some of these barriers?

B.M.: HPC users have traditionally shied away from virtualization, as it doesn’t accelerate data movement or computing. But I think virtualization does have a role to play in areas like data security. You would have higher confidence if your data were protected by hardware-based virtualization. Clearly, virtualization can help service providers protect themselves from malicious clients or clients that make mistakes. It also helps with isolation -- protecting clients from each other.

Q. So developers can expect to see closer ties between HPC and virtualization moving forward?

B.M.: Traditionally, virtualization and HPC have not mixed because virtualization is used to carve up resources and consolidate use on a smaller number of machines. HPC is just the opposite: it is about aggregating resources in the service of one workload and getting all the inefficiencies out of the way so it can run at maximum speed.

Virtualization will improve to the point where the performance overhead is small enough that it makes sense to take advantage of its management, security, and isolation capabilities. Virtualized storage, networking, and compute all have to come together before you can virtualize HPC.

Q. Is there anything Intel can do at the chip level to help achieve greater acceptance of HPC?

B.M.: Much of the work at the chip level has been done, primarily through the availability of affordable high-performance processors that incorporate many features – such as floating-point vector units – critical to HPC. Today, the cost per gigaflop of compute is at an all-time low, and HPC is in the hands of more users and driving more innovation than ever.

The next wave of HPC usage will be enabled by advances in software. On the desktop, the combination of Intel Architecture and Microsoft Windows established a common architecture, and this attracted developers and drove scale. In HPC, there have been no comparable standards for clusters, limiting scale. As we reach the missing middle, the industry will need to meet the needs of these new users in a scalable way. Intel and many partners are advancing a common architecture, called Intel Cluster Ready, that enables software vendors to create applications compatible with clusters from a wide variety of vendors. Conversely, a common architecture also allows all the component and system vendors to more easily achieve compatibility with the body of ISV software.

Q. How big a challenge will it be to get inexperienced developers to write applications that exploit clustering in an HPC system?

B.M.: It will absolutely be a challenge. The good news is that most HPC software vendors have already developed versions for clusters. So, a wide range of commercial HPC software exists today to meet the needs of first-time commercial customers. Also, more and more universities are teaching parallel programming to ensure tomorrow’s applications are ready, as well. Intel is very active in promoting parallel programming as part of university curricula.

Photo: @iStockphoto.com/halbergman

Why Is Parallel Programming So Hard?

Most people first learn how to program serially. Later they learn about multitasking and threading. But with the advent of multicore processors, most programmers are still pretty intimidated by the prospect of true parallel programming.

To help serial programmers make the transition to parallel, we talked to distinguished IBM engineer Paul E. McKenney. (McKenney maintains RCU in the Linux kernel and has written the detailed guidebook Is Parallel Programming Hard, And, If So, What Can You Do About It?) Here’s what he had to say.

Q: What makes parallel programming harder than serial programming? How much of this is simply a new mindset one has to adopt?

McKenney: Strange though it may seem, although parallel programming is indeed harder than sequential programming, it is not that much harder. Perhaps the people complaining about parallel programming have forgotten about parallelism in everyday life: Drivers deal naturally with many other cars; sports team members deal with other players, referees and sometimes spectators; and schoolteachers deal with large numbers of (sometimes unruly) children. It is not parallel programming that is hard, but rather programming itself.

Nevertheless, parallelism can pose difficult problems for longtime sequential programmers, just as Git can be for longtime users of revision control systems. These problems include design and coding habits that are inappropriate for parallel programming, but also sequential APIs that are problematic for parallel programs.

This turns out to be a problem in both theory and practice. For example, consider a collection API whose addition and deletion primitives return the exact number of items in the collection. This has simple and efficient sequential implementations but is problematic in parallel. In contrast, addition and deletion primitives that do not return the exact number of items in the collection have simple and efficient parallel implementations.
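
A minimal C11 sketch of that tradeoff (ours, not McKenney’s; the toy collection type and function names are invented for illustration): returning an exact count forces every thread to serialize on one shared counter, while dropping that guarantee lets each thread batch its bookkeeping privately.

#include <stdatomic.h>
#include <stdio.h>

/* A toy collection: only the bookkeeping that matters here. */
struct collection {
    atomic_long count;
};

/* Sequential-style API: add() returns the exact number of items.
 * Every concurrent caller must serialize on the shared counter,
 * because the returned value has to reflect all other additions. */
long add_exact(struct collection *c)
{
    return atomic_fetch_add(&c->count, 1) + 1;
}

/* Parallel-friendly API: add() returns nothing.  Each thread keeps a
 * private tally and folds it into the shared count only occasionally,
 * so threads do not contend on every single insertion. */
static _Thread_local long pending;

void add_relaxed(struct collection *c)
{
    pending++;
    if (pending == 1024) {                 /* flush in batches */
        atomic_fetch_add(&c->count, pending);
        pending = 0;
    }
}

int main(void)
{
    struct collection c = { 0 };
    for (int i = 0; i < 2000; ++i)
        add_relaxed(&c);
    printf("flushed so far: %ld (plus %ld still pending)\n",
           atomic_load(&c.count), pending);
    return 0;
}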

As a result, much of the difficulty in moving from sequential to parallel programming is in fact adopting a new mindset. After all, if parallel programming really is mind-crushingly difficult, why are there so many successful parallel open-source projects?

Q: What does a serial programmer have to rethink when approaching parallel programming? Enterprise programmers know how to multitask and thread, but most don’t have a clue how to tap into the power of multicore/parallel programming.

McKenney: The two biggest learning opportunities are a) partitioning problems to allow efficient parallel solutions, and b) using the right tool for the job. Sometimes people dismiss partitioned problems as “embarrassingly parallel,” but these problems are exactly the ones for which parallelism is the most effective. Parallelism is first and foremost a performance optimization, and therefore has its area of applicability. For a given problem, parallelism might well be the right tool for the job, but other performance optimizations might be better. Use the right tool for the job!
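
As a concrete illustration of a well-partitioned problem (again ours, not McKenney’s), here is a loop whose iterations are completely independent, so a single OpenMP directive parallelizes it; compile with an OpenMP-enabled compiler, for example gcc -fopenmp.

#include <stdio.h>

/* Each output pixel depends only on the matching input pixel, so the
 * iterations are fully independent -- "embarrassingly parallel". */
void brighten(const unsigned char *in, unsigned char *out, long n)
{
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        int v = in[i] + 40;               /* per-pixel work, no sharing */
        out[i] = (v > 255) ? 255 : (unsigned char)v;
    }
}

int main(void)
{
    unsigned char in[4] = { 0, 100, 200, 250 }, out[4];
    brighten(in, out, 4);
    printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);  /* 40 140 240 255 */
    return 0;
}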

Q: What can best help serial programmers retool themselves into parallel programmers?   

McKenney: All tools and languages, parallel or not, are domain-specific. So look at the tools and languages used by parallel programmers in application domains that interest you the most. There are a lot of parallel open-source projects out there, and so there is no shortage of existing practice to learn from. If you are not interested in a specific application domain, focus on tools used by a vibrant parallel open-source project. For example, the project I participate in has been called out as having made great progress in parallelism by some prominent academic researchers.

But design is more important than specific tools or languages. If your design fully partitions the problem, then parallelization will be easy in pretty much any tool or language. In contrast, no tool or language will help a flawed design. Finally, if your problem is inherently unpartitionable, perhaps you should be looking at other performance optimizations.

Back in the 1970s and 1980s, the Great Programming Crisis was at least as severe as the current Great Multicore Programming Crisis. The primary solutions were not high-minded tools or languages, but rather the lowly spreadsheet, word processor, presentation manager, and relational database. These four lowly tools transformed the computer from an esoteric curiosity into something that almost no one can live without. But the high-minded tools and languages of the time are long forgotten, and rightly so. The same thing will happen with parallelism: The average person doesn’t care about fancy parallel languages and synchronization techniques; they just want their smartphone’s battery to last longer.

Of course, I have my favorite languages, tools and techniques, mostly from the perspective of someone hacking kernels or the lower levels of parallel system utilities. I have had great fun with parallel programming over the past two decades, and I hope your readers have at least as much fun in the decades to come!

For more on this topic, check out Paul E. McKenney’s blog.

Photo Credit:@iStockphoto.com/michelangelus

Thought Leader James Reinders on Parallel Programming

The explosion of multicore processors means that parallel programming -- writing code that takes the best advantage of those multiple cores -- is required. Here, James Reinders, Intel’s evangelist [please note: Intel is the sponsor of this program] for parallel programming, talks about which applications benefit from parallelism and what tools are best suited for this process. His thoughts may surprise you.

Q: What new tools do you need as you move from serial to parallel programming to get the most out of the process?

Reinders: The biggest change is not the tools, it’s the mindset of the programmer. This is incredibly important. I actually think that human beings [naturally] think about things in parallel, but because early computers were not parallelized, we changed the way we thought. We trained ourselves to work [serially] with PCs, and now parallelism seems a little foreign to us. But people who think of programming as a parallel problem don’t have as much a problem with it as those who have trained themselves to think serially.

[As for the best tools], if you woke up one day and could think only in parallel you’d be frustrated with the tools we have now, but they are changing. From a very academic standpoint, you could say no [computer] languages are designed for parallelism so let’s toss them out and replace them, [but] that is not going to happen.

The languages we use will get augmented. Intel has done some popular things; Microsoft has some extensions to its toolset; Sun’s got stuff for Java and Apple’s got stuff too.

There are some very good things people can look for but they are still emerging, and programmers need to learn them. I can honestly say that as of the last year or so, trying to do parallel programming in FORTRAN or C or C++ is a pretty reasonable thing to do. Five years ago, it was something I couldn’t have done … without a lot of training and classes. Now these [existing] tools support enough of what we need to be successful.

Google has done amazing things in parallelism. Their [Google’s] whole approach in building the search engine was all about parallelism. If you asked most people back then to go examine every Web page on the planet, they’d have written a for loop to process them. But Google looked at this in parallel. They said, “Let’s just go look at all of them.” They thought of it as a parallel program. I can’t emphasize enough how important that is.

Q: How about debuggers and the debugging process? How does that change with parallel programming?

Reinders: Debuggers are also getting extended but they don’t seem to move very fast. They still feel a lot like they did 20 years ago.

There are three things happening in debuggers. First, as we add language extensions, it would be nice if the debuggers knew they existed. That’s an obvious fix that is now happening quickly.

Second, I suddenly have multiple things happening at once on a computer. How can you show that to me? The debugger will usually say, “Show me what’s happening on core 2,” but if you’re working with a lot of cores, the debugger needs to show you what’s happening on a lot of cores without requiring you to open a window for each. Today’s debuggers don’t handle this well, although a few [of the more expensive ones] do.

Third, this is very academic, but how do you deal with determinism? When you get a bug in a parallel program, it can be non-deterministic, meaning the program can run differently each time you run it. If you run a program and it does something dumb, in non-deterministic programming, just the fact that the debugger is running causes the program to run differently, and the bug may not happen the same way. So you need a way to go back to find where it happened, what the break point was.

In serial programming, typically if I run the program 20 times, it will fail the same way, but if the program runs differently every time, it’s not obvious how to fix that. The ability to rewind or go back when you’re in the debugger and look at something that happened earlier instead of rerunning the program … is very helpful. You tell the debugger to back up. To do that, the debugger has to collect more information while you’re running so you can rewind and deal with that non-determinism.
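
A minimal pthreads sketch (ours, not Reinders’) of that non-determinism: two threads update a shared counter with no synchronization, so the final value, and whether the bug shows up at all, changes from run to run and may change again under a debugger.

#include <pthread.h>
#include <stdio.h>

static long counter;                 /* shared, unsynchronized */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; ++i)
        counter++;                   /* data race: load, add, store */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    /* Expected 2000000; the actual value varies between runs. */
    printf("counter = %ld\n", counter);
    return 0;
}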

Q: If you’re a programmer, writing custom apps for your company or an ISV, when does it become essential to employ parallel programming? What apps reap the biggest advantages, and are there apps for which parallel programming has little benefit?

Reinders: Any application that handles a lot of data and tries to process it benefits [from parallelism]. We love our data -- look at the hard drive of your home PC. Parallelism can bring benefits to obvious things that we see every day: processing pictures, video and audio. We love getting higher-res cameras, HDTV. We like higher resolution stereo sound. That’s everyday stuff.

Scientific applications are obvious beneficiaries, and business apps that do knowledge mining of sales data to reach conclusions. They all do well in parallel. That’s the obvious stuff. But then people can get very creative with new things. There are some things that you might think won’t benefit [from parallelism], like word processing software and browsers, but you’d be wrong.

Look at Microsoft Word. There are several things Microsoft does there in parallel that we all enjoy. When you hit print, it will go off, lay out the document and send it to the printer, but it doesn’t freeze on you. If you go back 10 years with Microsoft Word, you might as well have gone for coffee after hitting print.

Or take spelling and grammar checking: as you type in Word, it puts in the squiggles [on questionable spelling or usage]. It’s doing that in parallel. If it weren’t, every time you typed a letter, the program would freeze while it looked the word up. Word is WYSIWYG; if you’re in print mode, it’s justifying and kerning -- doing a lot of things in parallel with many other things.
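
The pattern Reinders describes can be sketched in a few lines of pthreads code (this is our illustration, not how Word is actually implemented; the routines are stand-ins): a slow pass runs on a worker thread while the main loop keeps handling input.

#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Stand-in for an editor's real spell-check pass (illustrative only). */
static void *spell_check_document(void *arg)
{
    (void)arg;
    sleep(2);                              /* pretend to scan all the text */
    puts("squiggles updated");
    return NULL;
}

int main(void)
{
    pthread_t checker;
    /* Kick off the slow pass in parallel with the "UI" loop below. */
    pthread_create(&checker, NULL, spell_check_document, NULL);

    struct timespec delay = { 0, 200 * 1000 * 1000 };   /* 200 ms */
    for (int i = 0; i < 10; ++i) {
        puts("keystroke handled");         /* the UI never blocks */
        nanosleep(&delay, NULL);
    }

    pthread_join(checker, NULL);
    return 0;
}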

From Our Sponsor:

To learn more about Intel’s software technologies and tools, visit Intel.com/software.


Photo: @iStockphoto.com/loops7

Taming the Parallel Beast

Many programmers seem to think parallelism is hard. A quick Internet search will yield numerous blogs commenting on the difficulty of writing parallel programs (or parallelizing existing serial code). There do seem to be many challenges for novices. Here’s a representative list:

  • Finding the parallelism. This can be difficult because when we tune code for serial performance, we often use memory in ways that limit the available parallelism. Simple fixes for serial performance often complicate the original algorithm and hide the parallelism that is present.
  • Avoiding the bugs. Certainly, there is a class of bugs such as data races, deadlocks, and other synchronization problems that affect parallel programs, and which serial programs don’t have. And in some senses they are worse, because timing-sensitive bugs are often hard to reproduce -- especially in a debugger.
  • Tuning performance. Serial programmers have to worry about granularity, throughput, cache size, memory bandwidth, and memory locality. But for parallel programs, the programmer also has to consider parallel overheads and unique problems like false sharing of cache lines (see the sketch after this list).
  • Ensuring future proofing. Serial programmers don’t worry whether the code they are writing will run well on next year’s processors -- it’s the job of the processor companies to maintain upward compatibility. But parallel programmers need to think about how their code will run on a wide range of machines, including machines with two, four, or even more processors. Software that is tuned for today’s quad-core processors may still be running unchanged on future 16-, 32- or even 64-core machines.
  • Using modern programming methods. Object-oriented programming makes it much less obvious where the program is spending its time.
  • Other reasons that parallel programming is considered hard include the complexity of the effort, insufficient help for developers unfamiliar with the techniques, and a lack of tools for dealing with parallel code. When parallelizing existing code, it can also be difficult to make all the necessary changes at once and to ensure that there is enough testing to eliminate timing-sensitive bugs.
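
To make the false-sharing item above concrete, here is a small pthreads sketch (ours, not the article’s): two threads update logically independent counters that happen to share a cache line, so every update invalidates the other core’s copy of that line; padding the counters apart removes the contention. The 64-byte line size is an assumption, not a portable guarantee.

#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 20000000L

/* Two logically independent counters packed into one cache line. */
struct { long a, b; } packed;

/* The same counters padded onto separate (assumed 64-byte) lines. */
struct padded { long v; char pad[64 - sizeof(long)]; };
struct padded separate[2];

static void *bump(void *arg)
{
    volatile long *p = arg;            /* volatile keeps the loop from being collapsed */
    for (long i = 0; i < ITERS; ++i)
        (*p)++;
    return NULL;
}

static double run_pair(long *x, long *y)
{
    pthread_t t1, t2;
    struct timespec start, stop;
    clock_gettime(CLOCK_MONOTONIC, &start);
    pthread_create(&t1, NULL, bump, x);
    pthread_create(&t2, NULL, bump, y);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    clock_gettime(CLOCK_MONOTONIC, &stop);
    return (stop.tv_sec - start.tv_sec) + (stop.tv_nsec - start.tv_nsec) / 1e9;
}

int main(void)
{
    printf("false sharing: %.2fs\n", run_pair(&packed.a, &packed.b));
    printf("padded:        %.2fs\n", run_pair(&separate[0].v, &separate[1].v));
    return 0;
}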

Use Serial Modeling to Evolve Serial Code to Parallel

The key to success in introducing parallelism is to rely on a well-proven programming method called serial modeling. Using serial modeling tools and techniques, programmers can achieve parallelization with enhanced performance and without synchronization issues. The essence of the method is to check for and resolve problems continually, starting early in the process, and to evolve the code gradually from purely serial, to serial but capable of being run in parallel, to truly parallel.

The first step is to measure where the application spends its time -- effort spent in hot areas will pay off, while effort spent elsewhere is wasted. The next step is to use a serial modeling tool to evaluate opportunities for potential parallelization and to determine what would happen if that code ran in parallel. This kind of tool observes the execution of the program and uses its serial behavior to predict the performance and bugs that might occur if the program actually executed in parallel.

Checking for problems early in the evolution process, while a program is still serial, ensures that you don’t waste time on parallelization efforts that are doomed because of poor performance. You can then model parallelizations that resolve the performance issues or, if no alternatives are practical, focus your efforts on more profitable locations.

The tool can also model the correctness of the theoretical parallel program, and detect race conditions and other synchronization errors while still running the serial program. Although the program still runs serially, it is easy to debug and test, and it computes the same results. The programmer can change the program to resolve the potential races, and after each change, the program remains a serial program (with annotations) and can be tested and debugged using normal processes.

When the program has fully evolved, the result is a correct serial program with annotations describing a parallelization with known good performance and no synchronization issues. The final step in the process is to convert those annotations to parallel code. After conversion, the parallel program can undergo final tuning and debugging with the other tools. The beast has been tamed.
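
Here is a hedged sketch of what that last conversion step can look like. The ANNOTATE comments stand in for whatever syntax a real serial modeling tool uses (they are illustrative only), and OpenMP is just one possible target for the converted code.

#include <stdio.h>

/* Evolved serial version: still runs serially, but the intended parallel
 * region is marked so a modeling tool could evaluate it. */
void scale_serial(double *x, int n, double k)
{
    /* ANNOTATE: parallel site begins (illustrative placeholder) */
    for (int i = 0; i < n; ++i)
        x[i] *= k;                     /* iterations are independent */
    /* ANNOTATE: parallel site ends */
}

/* Final step: the annotations are converted into real parallel code. */
void scale_parallel(double *x, int n, double k)
{
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        x[i] *= k;
}

int main(void)
{
    double x[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
    scale_parallel(x, 8, 10.0);
    printf("%g %g\n", x[0], x[7]);     /* 10 80 */
    return 0;
}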


Photo: @iStockphoto.com/angelhell

Which Comes First: Parallel Languages or Patterns?

On the shuttle to the UPCRC (Universal Parallel Computation Research Center) Annual Summit meeting on the Microsoft campus in Redmond, Wash., I was listening in on a discussion about parallel programming patterns. Being a parallel programmer, I was interested in what people (and these were some of the experts in the field) had to say about parallel programming patterns, how they are evolving and how they will impact future parallel coders.

The discussion turned to whether patterns would affect programming languages directly or remain something that would be constructed from statements of the language. I think I’m in the former camp. Here’s why.

For those of us who were programming when Elvis was still alive, think back to writing in assembly language. For the most part, there were instructions for Load, Store, Add, Compare, Jump, plus some variations on these and other miscellaneous instructions. To implement a counting/indexing loop, you would use something like the following:

        Initialize counter
LOOP:   test end condition
        goto EXIT if done
        Loop Body
        increment counter
        goto LOOP
EXIT:   next statement

This is a programming pattern. Surprised? With the proper conditional testing and jumping (goto) instructions within the programming language, this pattern can be implemented in any imperative language. Since this pattern proved to be so useful and pervasive in the computations being written, programming language designers added syntax to “automate” the steps above. For example, the for-loop in C:

for (i = 0; i < N; ++i) {
    Loop Body
}

Once we had threads and the supporting libraries to create and manage threads, parallel coding in shared memory was feasible, but at a pretty crude level since the programmer had to be sure the code handled everything explicitly. For example, dividing the loop iterations among threads can be done with each thread executing code that looks something like this:

start = (N/num_threads) * myid
end = (N/num_threads) * (myid + 1)
if (myid == LAST) end = N
for (i = start; i < end; ++i) {
    Loop Body
}
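
For readers who want to see that division spelled out, here is a runnable version using POSIX threads; the names (worker, NUM_THREADS) and the loop body are illustrative.

#include <pthread.h>
#include <stdio.h>

#define N           1000
#define NUM_THREADS 4

static double data[N];

static void *worker(void *arg)
{
    long myid  = (long)arg;                /* thread index passed as a pointer-sized int */
    int  start = (N / NUM_THREADS) * myid;
    int  end   = (N / NUM_THREADS) * (myid + 1);
    if (myid == NUM_THREADS - 1)           /* last thread takes the remainder */
        end = N;

    for (int i = start; i < end; ++i)
        data[i] = i * 2.0;                 /* Loop Body */
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_THREADS];
    for (long t = 0; t < NUM_THREADS; ++t)
        pthread_create(&threads[t], NULL, worker, (void *)t);
    for (long t = 0; t < NUM_THREADS; ++t)
        pthread_join(threads[t], NULL);
    printf("data[N-1] = %f\n", data[N - 1]);
    return 0;
}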

Parallel programming patterns will be abstractions that can be “crudely” implemented in current languages and parallel libraries, like the pseudocode above. New languages (or language extensions) will make programming parallel patterns easier and less error-prone. For the example above, OpenMP already has syntax to do this; it takes only a single line added to the serial code:

#pragma omp parallel for
for (i = 0; i < N; ++i) {
    Loop Body
}

From the evidence above, I think future parallel programming languages or language extensions supporting parallelism will be influenced by the parallel programming patterns we define and use today. And nothing will remain static. During his UPCRC presentation, Design Patterns’ Ralph Johnson remarked that some of the original patterns saw early use, but that use has since slacked off. Two reasons he noted for this were that some of the patterns couldn’t easily be implemented in Java and that modern OO languages had better ways to accomplish the same tasks -- most likely these newer languages found inspiration in the patterns and their usage.

As for an answer to the question posed in the title, it boils down (no pun intended) to the old chicken-and-egg paradox. There were algorithms (patterns) for doing computations before there were computers, and those algorithms were themselves modifications of earlier algorithms influenced by the tools available. Looking forward, though, we’re still in the relative infancy of programming, let alone parallel programming. Clearly, the next generation of parallel programming languages or libraries or extensions bolted onto serial languages will be influenced by the patterns we use now for specifying parallel computations.