Tuesday, November 30, 2010

Core and threads

Courtesy Jason Lane.

Intel does it, AMD does not:
Intel and AMD have done their best to differentiate their x86 processors as much as possible while retaining compatibility between the two, but the differences between them are growing. One key differentiator is hyperthreading.


To Windows, threads are cores:
Multicore and HyperThreading (referred to as “HT”) are not the same, but you can be suckered into believing they are, because hyperthreading looks like a core to Windows. My computer is a Core i7-860, a quad-core design with two threads per core. To Windows 7, I have eight cores.
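
You can see this for yourself with a few lines of Python (a minimal sketch; it assumes the third-party psutil package is installed, since the standard library by itself only reports logical processors):

# count_cpus.py - compare what the OS reports with the physical core count.
# Assumes the third-party "psutil" package is installed (pip install psutil).
import os
import psutil

logical = os.cpu_count()                    # what Windows "sees": cores x threads per core
physical = psutil.cpu_count(logical=False)  # actual physical cores

print("Logical processors (threads):", logical)
print("Physical cores:              ", physical)
# On a Core i7-860 this reports 8 logical processors and 4 physical cores.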


Multicore was born when processors hit a clock speed wall
Multicore CPUs were introduced as a solution to the fact that in the mid-2000s, processors hit a clock speed wall. The CPUs just were not getting any faster, and could not do so without extreme cooling. Unable to get to 4GHz, 5GHz, and beyond, AMD and Intel turned to dual-core designs.


There are two ways to look at computation: speed and parallelization. In some instances, such as tasks involving massive calculation, it makes sense to have an 8GHz core. The thing is, most business applications don't really need this. In fact, there are vendors arguing that the Xeon is overkill for many server tasks.

So the solution was multicore. AMD was the first consumer chip vendor to release a multicore chip in 2005 with the Athlon X2. Intel followed in 2006 with the Core 2 Duo. (The first non-embedded dual-core CPU was IBM's POWER4 processor in 2001.) If a vendor couldn't improve performance with one 5GHz core, it could get there with two 2.5GHz cores.

Dual cores are like a two-lane highway:
To use a highway analogy, it’s the equivalent of going from a one-lane road to a two-lane road. Even if the two-lane road has a lower speed limit, more cars can travel to their destinations at any given time.


Dual cores are connected by an on-die interconnect rather than the motherboard
You may remember the days of symmetric multiprocessing (SMP), when computer systems had two physical CPUs on the motherboard. There’s no difference in execution between two single-core processors in an SMP configuration and a single dual-core CPU.


The difference, though, is that the dual-core CPU has much, much faster communication between the cores. That's because they are on the same die and connected by a high-speed interconnect. In an SMP system, “communication” between the CPUs has to go out through the CPU socket, cross the motherboard, and go through the socket of the second CPU. So core-to-core communication on a dual-core chip is considerably faster.


Intel first introduced HyperThreading with the Pentium 4 processor in 2002 and that year's model of Xeon processors. Intel dropped the technology when it introduced the Core architecture in 2006 and brought it back with the Nehalem generation in 2008. Intel remains firmly dedicated to HT and is even introducing it in its Atom processor, a tiny chip used in embedded systems and netbooks. As Tom's Hardware found in tests, HyperThreading increased performance by 37%.


Why AMD doesn't believe in HT:
AMD has never embraced hyperthreading. In an interview with TechPulse 360, AMD’s director of business development Pat Patla and server product manager John Fruehe told the blog, “Real men use cores … HyperThreading requires the core logic to maintain two pipelines: its normal pipeline and its hyperthreaded pipeline. A management overhead that doesn’t give you a clear throughput.”


With the June 2009 release of the six-core “Istanbul” line of Opteron processors, AMD introduced something called “HT Assist,” a technology to map the contents of the L3 caches. It reserves a 1MB portion of each CPU's L3 cache to act as a directory that tracks the contents of all of the other CPU caches in the system. AMD believes this will reduce latency because it creates a map of cache data, as opposed to having to search every single cache for data. It's not multithreading and shouldn't be confused with it. It's simply a means of making the server processor more efficient.


HyperThreading Deconstructed
HT is a technique that lets two threads execute on one processor core. To Windows, a core capable of executing two threads is seen as two processors, but it's really not the same. A core is a physical unit you can see under a microscope. Threads are executed inside the core.

HT is like passing on the left on the highway. If a car ahead of you is going too slow, you pass it at your preferred speed. In HyperThreading, if a thread can’t finish immediately, it lets another run by it. But that’s a simplistic explanation.

Here's how it works. The CPU has to constantly move data in and out of memory as it processes code. The CPU's caches attempt to alleviate this. Within each CPU core is the L1 cache, which is usually very small (32KB). Outside of the CPU core, right next to it on the chip, is the L2 cache. This is larger, usually between 256KB and 512KB. The next cache is the L3 cache, which is shared by all of the cores and is several megabytes in size. L3 caches were added with the advent of multicore chips. After all, it is easier and more efficient to keep data in the very fast L3 cache than to let requests go all the way out to main memory.
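
On Linux you can read this hierarchy straight out of sysfs (a minimal sketch; the paths assume the standard Linux layout under /sys/devices/system/cpu and won't exist on Windows):

# list_caches.py - print each cache level for CPU 0 as the Linux kernel reports it.
import glob
import os

def read(path, name):
    with open(os.path.join(path, name)) as f:
        return f.read().strip()

for index in sorted(glob.glob("/sys/devices/system/cpu/cpu0/cache/index*")):
    level = read(index, "level")              # 1, 2, or 3
    ctype = read(index, "type")               # Data, Instruction, or Unified
    size = read(index, "size")                # e.g. 32K, 256K, 8192K
    shared = read(index, "shared_cpu_list")   # which logical CPUs share this cache
    print(f"L{level} {ctype}: {size} (shared by CPUs {shared})")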

The CPU executes one instruction at a time, according to its clock speed. Instructions take a varying number of cycles; some can be done in one cycle, others may require a dozen. It's all based on the complexity of the task. At modern clock speeds, a single cycle lasts a nanosecond or less.

Every CPU core has what's called a pipeline. Think of a pipeline as the stages in an assembly line, except here the product being assembled is an application's task. At some point, the pipeline may stall: it has to wait for data, or for another hardware component in the computer. We're not talking about a hung application; this is a delay measured in nanoseconds while data is fetched from RAM. Still, other threads have to wait in a non-hyperthreaded pipeline, so it looks like:

thread1— thread1— (delay)— thread1— thread2— (delay)— thread2— thread3— thread3— thread3—

With hyperthreading, when the core's execution pipeline stalls, the core begins to execute another thread that's waiting to run. Mind you, the first thread is not stopped. When it gets the data it wants, it resumes execution as well.

thread1— thread1— thread2— thread2— thread1— thread2— thread1— thread2— thread2—

The computer is not slowed by this; it simply doesn’t wait for one thread to complete before it starts executing a new one.
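
The same take-turns-while-waiting idea can be sketched in software (Python OS threads rather than hardware threads, so this is only an analogy for the scheduling behavior, not real hyperthreading):

# interleave.py - two threads that alternate whenever one stalls on a wait.
# Real HT interleaving happens in hardware, per clock cycle; the sleeps here
# just simulate a thread waiting for data.
import threading
import time

def worker(name, steps):
    for i in range(steps):
        print(name, "step", i)
        time.sleep(0.01)   # simulated stall while "data" is fetched

t1 = threading.Thread(target=worker, args=("thread1", 4))
t2 = threading.Thread(target=worker, args=("thread2", 4))
t1.start(); t2.start()
t1.join(); t2.join()
# The output interleaves thread1 and thread2 lines, much like the timeline above.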

HT in Practice

There are two ways HT comes into play. One is execution. A multithreaded computer boots much faster, since multiple libraries, services, and applications are loaded as fast as they can be read off the hard drive. You can start up several applications faster with an HT-equipped computer as well. That's primarily done by Windows, which manages threads on its own. Windows 7 and Windows Server 2008 R2 were both written to better manage application execution, and both operating systems can see up to 256 cores/threads, more than we will see for a long time.

Of course, the same applies to cores. A quad-core processor is inherently faster than a dual-core processor with HT, since full cores can do more than hyperthreads can.

However, benchmarks have found that for system loads, you don’t gain much after four cores (or two cores with two threads each). More cores and threads cannot compensate for other bottlenecks in your computer, like the hard drive.

Then there’s application hyperthreading, wherein an application is written to perform tasks in parallel. That requires programming skill; in addition, the latest compilers search for code that can be parallelized. Parallel processing has been around for years, but up until the last few years it remained an extremely esoteric process done by very few people (who commanded a pretty penny). The advent of multicore processors and the push by Intel and AMD to support multithreading is bringing parallel processing to the masses.
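
As a rough sketch of what "written to perform tasks in parallel" looks like in code (Python's standard library; the crunch() function and its inputs are made up for illustration):

# parallel_sum.py - spread independent chunks of CPU-bound work across the cores.
import concurrent.futures

def crunch(n):
    # stand-in for a heavy calculation
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    jobs = [2_000_000] * 8   # eight independent chunks of work
    # By default ProcessPoolExecutor starts one worker per logical processor,
    # so a quad-core HT machine can run up to eight chunks at once.
    with concurrent.futures.ProcessPoolExecutor() as pool:
        results = list(pool.map(crunch, jobs))
    print(results)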

Intel has never, ever claimed that HT will double performance, because applications have to be written to take advantage of HT to make the most of it. Even then, HT does not deliver a linear doubling. Intel puts the performance gain of HT at between 20% and 40% under ideal circumstances.

Hardware threads can't jump cores. Because the thread contexts are wired into the core, chip vendors can't put too many threads on a core without slowing it down. That's why Intel has just two per core.

Multithreading does not add genuine parallelism to the processing structure, because the core is not executing two threads at once. Basically, it lets whichever thread is ready to go run first. Under certain loads, it can make the processing pipeline more efficient and push multiple threads of execution through the processor a little faster.

So while multithreading is good for system level processing — loading applications and code, executing code, etc. — application-level multithreading is another matter. Multithreading is useless for single-threaded applications and can even degrade performance in certain circumstances.

AMD takes great delight in pointing this out. In a blog entry, “It's all about cores,” the company points to examples of software vendors recommending against HT.

A consultant who deals with Cognos, business intelligence software owned by IBM, recommends disabling HyperThreading because it “frequently degrades performance and proves unstable.”
Microsoft recommends turning off HyperThreading when running PeopleSoft applications because “our lab testing has shown little or no improvement.”

A Microsoft TechNet article recommends disabling HT for production Microsoft Exchange servers and says it should “only [be] enabled if absolutely necessary as a temporary measure to increase CPU capacity until additional hardware can be obtained.”
Why? Because these applications have not been optimized for multithreading, for starters. And, since two threads share the same circuitry and cache in the processor, there is the odd chance of cache overwrites and data contention. Even with Windows Server 2008's multithreading management, the operating system can't fully control what the processor does.

It should be noted that these are exceptions and not the rule. Hypervisors, the software layer that manages a virtualized server, are HT-aware and make full use of the threads. HT provides the virtual machines with more logical processors to run on than if HT were disabled, when the CPUs might otherwise be viewed as busy.

Applications that need lots of I/O, network and disk I/O in particular, can benefit from splitting those operations into multiple threads on an HT system; with the waits overlapped, you might see some performance gains.
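
A minimal sketch of that pattern (Python threads fetching a handful of URLs; the URLs are placeholders, and the point is that each thread spends most of its time waiting on the network, so the others can run in the meantime):

# fetch_many.py - overlap network waits by giving each download its own thread.
import concurrent.futures
import urllib.request

urls = [                         # placeholder URLs, for illustration only
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/c",
]

def fetch(url):
    with urllib.request.urlopen(url, timeout=10) as resp:
        return url, len(resp.read())

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    for url, size in pool.map(fetch, urls):
        print(url, "returned", size, "bytes")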

The bottom line is that when considering hyperthreading systems, you need to check with your software vendors to learn what they recommend. If a number of your applications are better off with HT disabled, that should play into your decision-making process.

A Hidden Cost

The threads vs. cores argument has one more element to consider. Enterprise software vendors have two pricing policies that could potentially impact you: by the core and by the processor. Both Intel and AMD have encouraged the industry to support pricing on a per-processor (or per-socket) basis.

Some do. Microsoft has a stated policy that it charges by the processor, not by the core. VMware's software license is on a per-core basis. Oracle licenses its software both ways: its Standard Edition software is on a per-socket basis, while its Enterprise Editions are on a per-core basis. Fortunately, vendors that charge by the core count physical cores and don't treat HT threads as extra cores.

Because of this variance, it’s incumbent on every company making a purchase decision to ask the software providers if their pricing scheme is per-core or per-socket. A typical blade server from Dell has two sockets on it, and some have four, but CPUs are moving to four, six, eight, and 12 cores.
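
To see how much the answer matters, here is a toy comparison (all prices are invented purely for illustration):

# license_math.py - per-socket vs. per-core licensing for the same server.
sockets = 2
cores_per_socket = 12      # e.g. a 12-core CPU in each socket
price_per_socket = 3000    # hypothetical per-socket license fee
price_per_core = 500       # hypothetical per-core license fee

per_socket_total = sockets * price_per_socket                   # 2 x 3000 = 6,000
per_core_total = sockets * cores_per_socket * price_per_core    # 24 x 500 = 12,000

print("Per-socket licensing:", per_socket_total)
print("Per-core licensing:  ", per_core_total)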

If you purchase an AMD blade server with two Opteron 6100 processors, it’s the difference between two processors or 24 cores. Or you may want either an Intel Xeon 5600 with six cores, or an older AMD Opteron 6000, which also had six cores but no HT. It certainly adds a nice layer of confusion, doesn’t it?
