Multithreading and CPUs: needs to be fixed?

Traak · 2011-10-12, 10:53 PM

Theory: CPU's and their subsystems should decide on threading and thread distribution to the various cores of a CPU, not the software. Otherwise, the software almost has to be tailored for a certain number of CPU's, or it fails to properly utilize them all.

Something I have noticed over the years. CPU's went to dual cores, in lieu of only increasing the capability of the one core.

However, there is a flaw in this whole system that needs to be addressed.

This is the flaw: the software running on any given computer may or may not be designed so it runs on more than one CPU or processor core.

I quote from this page on HARDOCP:

"We used the brand new game F1 2011. For our run-through we played on the Shanghai International Circuit track with clear skies. We played at 1920x1200 with NO AA, 16X AF and in-game settings at 'Ultra' and DX11 enabled. This game does not support DX11 driver multithreading. Our framerates were high throughout our testing."

So, you could have the world's most wonderful CPU, consisting of one million cores, each with 1MB of L1 cache, 20MB of L2 cache, and 1 GB of L3 cache per core. Computing power so vast, that it could literally anticipate any move you could make in a game, have it pre-loaded, and execute it nanoseconds after you chose it. Framerates would be beyond what a cheetah could ever need, if cheetahs played games.

But, wait. It wouldn't work. Because most software is only capable of using a certain number of cores?

Here is how it needs to be.

Program A needs to be completely unthreaded. It just vomits data at the CPU, and receives data from the CPU in one stream. That is how the new SATA drives have FAR higher bandwidth than the old PATA drives. PATA had a certain number of channels, wires, conductors, down which data flowed. It was bottlenecked. SATA, however, just uses a few conductors, and sends data down it or them in serial fashion, similar to how data flows on the internet. You don't need a seventy-two conductor cable to send data to and from the internet.

So the CPU needs to receive the data packets from the software, and vice versa, in serial fashion. Then the CPU sorts out which packet goes to which core, or cores, and processes it, then sends the result out, in one stream, to the software. This makes it so any software, whether it is a game, a spreadsheet, archiving program, or anything else, does its job by allowing the CPU to parse the instructions into as many threads as it needs.

Limiting CPU utilization by having it controlled by the software is like allowing a retarded jockey decide how many legs a horse gets to use to run a race.

The jockey is the software. He inputs he wants to the horse. He does not tell the horse what foot to put forth first, how many legs to use, or in what order. The horse is the CPU. He receives inputs and produces performance, using all four legs at once, that carry the jockey where he wants to go.

The CPU and its subsystems need to parse the data stream into however many few or many threads and to however many cores it can most efficiently utilize, and remove control of that from the software.

What I am curious about is this: how long will it take for them to just set up the software so the CPU can do what it needs to, such as on "heavily threaded applications" that do take advantage of multiple CPU's/cores.

Why isn't everything "completely threaded"?

Mutant · 2011-10-13, 07:52 AM

Welcome to Amdahl's law.

Traak · 2011-10-13, 08:16 AM

Well, that was an interesting read.

It makes it clearer to me why some motherboards are faster, sometimes markedly faster, than others. Same CPU, same RAM, same everything, and the system is faster.

Since AMD still has South Bridge functions offloaded onto the motherboard, could this be what is bottlnecking their systems?

Intel has a few more functions directly on the same die as the CPU cores, so they don't have to travel through the relatively long pathway of the processor die, motherboard leads, processing on the South Bridge, back through the motherboard, and into the CPU again.

Well, in any case, AMD is behind, still, and needs another whole line of CPU's to surpass Intel. And I wonder what Intel has up their sleeve that they are not bothering to tell anyone, so they can continue selling their existing chips without having to invest in retooling for the next generation at the chip fabs yet.

As long as AMD is behind Intel, Intel can go back to making profits instead of spending them on advancing their CPU's at the same pace as we saw back when AMD was faster for a few generations. Faster and cheaper.

Intel may not have a monopoly, but the duopoly that exists is slanted heavily in favor of Intel. If AMD had spent more money on further chip designs, and not so much on CEO bonuses or whatever they spent it on back before the CORE series from Intel came out, they might still be in the lead.

Water under the South Bridge, I guess.

I wonder when or if Google is going to jump into the CPU game, at the same level as Intel and AMD.

Rbstr · 2011-10-13, 07:54 PM

Well, that was an interesting read.

It makes it clearer to me why some motherboards are faster, sometimes markedly faster, than others. Same CPU, same RAM, same everything, and the system is faster.

What? The article doesn't have anything to do with that.

Intel has up their sleeve that they are not bothering to tell anyone, so they can continue selling their existing chips without having to invest in retooling for the next generation at the chip fabs yet.

This is untrue, they do die-shrinks just about every other year. With a new "core" in the alternating.

I'm in a related field in lots of respects...Intel really isn't a rest-on-thier-laurels kind of company. They've done a fairly good job of not letting the engineers get taken over by accountants. Not to say they haven't screwed up (Netburst, for instance, and they have been sort of anti-competetive even while having "better" product in the past).

You do bring up a good point about moving functions to the CPU...it does help speed things up, for more reasons even...eventually clock rates get to the point they're outpacing the electrons over the distance of the leads. Memory will be the first thing.

I wonder when or if Google is going to jump into the CPU game, at the same level as Intel and AMD.

Google doesn't make a single piece of any of its hardware. The kinds of equipment and expertise you need for a chip-fab isn't like anything Google has.

Sirisian · 2011-10-14, 12:18 AM

Originally Posted by Traak

So, you could have the world's most wonderful CPU, consisting of one million cores, each with 1MB of L1 cache, 20MB of L2 cache, and 1 GB of L3 cache per core.

General rule of thumb. Doubling the size of the cache only minimizes the number of misses by 30% to 40%. There's a reason caches are only around 12 MB even in the best processors.

Originally Posted by Traak

Program A needs to be completely unthreaded. It just vomits data at the CPU, and receives data from the CPU in one stream.

You've introduced a dependency. One that requires a lock. Also you must understand that what you described is far slower than the current system. The current system uses a pipeline which effectively breaks a sequence of instructions into microcode to execute independently at the fastest rate possible for each stage.

// Edit: Forgot to mention one of the biggest things about parallel computation: communication cost. If you're doing 1 computation for each communication then communication is your bottleneck. Even at like 20+ instructions it's still your bottleneck. Threading and letting other cores do work is a trade-off when the computation you want to do is less than the communication cost of sending it there.

Originally Posted by Traak

Why isn't everything "completely threaded"?

Lazy programmers. Most software is threaded actually. There's a reason your UI isn't blocking for an action to complete. It might pass a task off like "load a web page" to a separate thread to parse the page and synchronize with the renderer. Multi-threading nowadays is fairly easy, but it wasn't always that way.

Originally Posted by Traak

Well, in any case, AMD is behind, still, and needs another whole line of CPU's to surpass Intel.

Not really. Not sure where you got that idea. The only thing Intel has is the new AVX instruction set which no one uses at all. It allows much faster computing for very specific tasks, but no compiler, except maybe the Intel one for C++, is designed utilize such features. It's normally up to the programmer to design native only code. They'll say stuff like "Most of my users have SSE2 and if I code in a special case for it then the code will run 200% faster (SIMD)".

If you're really curious about this you should take some parallel systems and architecture courses. They're surprisingly not very math heavy from my experience. (However, they tend to be higher level CS courses so they have programming assignments like simulation stuff). You would enjoy a CS degree. It covers all of these things in massive detail.

Traak · 2011-10-14, 06:22 AM

I deal more in paradigms than in details. And I take my best shot at understanding stuff, theorizing how, from what I can tell, it should be changed. And I learn as people who are more deeply involved and in more detail explain it better.

So, You guys are saying that even on the chip die, you can experience instructions getting "lost"?

And, that as you use larger and larger numbers of cores, the communication costs, and the overhead needed to assign bits of code here and there can become prohibitively large.

And, that there are existing practical limits to how much good cramming more cache on a chip die can produce.

That Intel, though they could, are not going to sit around hurting their arms patting themselves on the back, but will continue to produce new architectures and refreshes on their usual schedule.

So, you have a clean sheet of silicon. What do you do with it? How do you improve the CPU?

We have obviously gone from 8-bit to 16-bit, to 32-bit, to 64-bit CPU's. I'm guessing that future CPU's may have the following:

128-bit architecture, so each instruction can do more with a given clock cycle

More stuff on the die, so that clock speeds will not have to deal with the relatively long leads from the die through the motherboard and back again, or to wherever else any more than necessary.

I'm taking it this will lead to processors with more pins (or those pads in lieu of pins that Intel currently uses)

Could we be migrating to entire systems-on-a-chip that are inclusive of all functions that have enough speed and capability for gaming? Sound, graphics, etc? I wonder if this bit will happen, it would be a heck of a chip that could do that even at today's levels of performance for graphics, etc. I wonder if temperature gradients would be a factor, with parts of the chip being hot or cold forming thermal stresses. I doubt that, because they are speeding up and slowing down various cores, so the heat gradient is already in existence, and doesn't seem to be a problem.

What would you do? What do you think is coming?

Rbstr · 2011-10-14, 06:45 PM

Originally Posted by Traak

I
So, you have a clean sheet of silicon. What do you do with it? How do you improve the CPU?

I'm not an electrical engineer so I don't know exactly (as a guy in materials/chemical eng. I know how to make the transistors/chips, not how the transistors are actually arranged, if that makes sense)

I'd say, 128 bit doesn't seem likely in the near/mid term...64bit is more about the amount of memory addressing and the number of decimals you can keep track of natively...we won't be needing 128bits of memory addressing for a good long while, at least. (That is 64 bit works to 16 Exabytes..an exabyte is a million terabytes, 16 of this is a bit less than the ENTIRE INTERNET'S throughput in a month)

More stuff on the die, so that clock speeds will not have to deal with the relatively long leads from the die through the motherboard and back again, or to wherever else any more than necessary.

I'm taking it this will lead to processors with more pins (or those pads in lieu of pins that Intel currently uses)

Could we be migrating to entire systems-on-a-chip that are inclusive of all functions that have enough speed and capability for gaming? Sound, graphics, etc? I wonder if this bit will happen, it would be a heck of a chip that could do that even at today's levels of performance for graphics, etc. I wonder if temperature gradients would be a factor, with parts of the chip being hot or cold forming thermal stresses. I doubt that, because they are speeding up and slowing down various cores, so the heat gradient is already in existence, and doesn't seem to be a problem.

What would you do? What do you think is coming?

Heat gradients are problems in chips...but most of the time just having them attached to something is enough to spread the heat out. But they can't get too hot, as we well know. These things are made of layers of dissimilar material and they have to take care to make sure the expansion coefficients aren't too misaligned.
Entire systems on a chip? Possible (and done at lower performance levels)...not likely for a while.

Quantum computing and optical computing are both well under research, but I don't know much about either.

Really there are lots of reasons Moore's "law" can run out at some point. Feynman said "there's lots of room at the bottom" ...but we're down to what? 24nm processes? In something like silicon atoms are ~.1nm or so apart! Will a single molecule organic transistor pop out of the woodwork ("organic" meaning made of carbon, not necessarily something made by a living thing)?

I dono.

Traak · 2011-10-14, 10:15 PM

I remember in the 130nm era, people were thinking "Wow, how would we ever be able to get much smaller???"

Perhaps with one-atom-wide pathways, we will be at a point where electrons will finally have a corridor that is as narrow as it is going to get to be crammed down.

Who knows? We might hit some quantum tipping point below a certain size that makes it so that WOOO HOOOO electron flow or behavior suddenly has a massive reduction in resistance or time.

Meanwhile, Moore continues to be right, more or less.

And, if it becomes cost- or technology-prohibitive to raise clock speeds or reduce size of the traces on the CPU, we will be forced to go to a higher number of bits being processed simultaneously.

Which means instructions will either get more complex, or be enabled to carry a whole stack of instructions in parallel at once.

For example, instructions that used to be "this bit needs to go here, then have this decision or process done, then return over there" can be done in one big instruction, processed at the same time, instead of three in series.

Memory size is not the only reason for going to more bits is it? I thought being able to do more per clock cycle was the idea, or at least one of the ideas. PCI Express, which is a serial communication method, to my knowledge, still needs more lanes to communicate at higher data rates after a lane reaches its limit.

I figure that more bits is more lanes for data to travel down at once, so more can be done per clock cycle.

Graphene: future looks bright for microprocessors from Wikipedia

Integrated circuits
Graphene has the ideal properties to be an excellent component of integrated circuits. Graphene has a high carrier mobility, as well as low noise, allowing it to be used as the channel in a FET. The issue is that single sheets of graphene are hard to produce, and even harder to make on top of an appropriate substrate. Researchers are looking into methods of transferring single graphene sheets from their source of origin (mechanical exfoliation on SiO2 / Si or thermal graphitization of a SiC surface) onto a target substrate of interest.

In 2008, the smallest transistor so far, one atom thick, 10 atoms wide was made of graphene. IBM announced in December 2008 that they fabricated and characterized graphene transistors operating at GHz frequencies. In May 2009, an n-type transistor was announced meaning that both n and p-type transistors have now been created with graphene. A functional graphene integrated circuit was also demonstrated – a complementary inverter consisting of one p- and one n-type graphene transistor. However, this inverter also suffered from a very low voltage gain.

According to a January 2010 report, graphene was epitaxially grown on SiC in a quantity and with quality suitable for mass production of integrated circuits. At high temperatures, the Quantum Hall effect could be measured in these samples. See also the 2010 work by IBM in the transistor section above in which they made 'processors' of fast transistors on 2-inch (51 mm) graphene sheets.

In June 2011, IBM researchers announced that they had succeeded in creating the first graphene-based integrated circuit, a broadband radio mixer. The circuit handled frequencies up to 10 GHz, and its performance was unaffected by temperatures up to 127 degrees Celsius.

Sirisian · 2011-10-15, 02:15 AM

Originally Posted by Traak

So, You guys are saying that even on the chip die, you can experience instructions getting "lost"?

No. As far as I know when data is communicated in a processor it is never lost. Controllers determine and lock the pathways for communication so collisions are not possible. This topic is fairly complicated, so I'm not going to go into it.

Originally Posted by Traak

Could we be migrating to entire systems-on-a-chip that are inclusive of all functions that have enough speed and capability for gaming? Sound, graphics, etc?

Current modern CPUs have DX10 graphics built into them. So the CPU and GPU are integrated together. (aka Integrated graphics). For instance on a sandy bridge you can remove your GPU and play Starcraft 2 just fine because the CPU has a GPU in it and is decent. (However, dedicated is still much better).

Originally Posted by Traak

What would you do? What do you think is coming?

I want to buy a Knight's Fury card. Research them. They're basically like larrabee with changes.

Originally Posted by Traak

Perhaps with one-atom-wide pathways, we will be at a point where electrons will finally have a corridor that is as narrow as it is going to get to be crammed down.

7 atoms is the current minimum lab tested transistor. However, humans are notoriously bad at making predictions for the future. They overestimate or underestimate all the time. Popular science once stated a modern computer could fit inside a single room thinking that was a peak of engineering.

One thing though will be true. More cores. If you want to think of it like the human brain it helps kind of to get the idea of a complex networking system. Evolution normally over millions of years will get to the most optimal format. (Then again it's important to realize the size of our brain and complexity hasn't change for a long time. We have an identical brain to neanderthals for instance if you study Anthropology).

The human brain has a huge number of synapses. Each of the 10^11 (one hundred billion) neurons has on average 7,000 synaptic connections to other neurons. It has been estimated that the brain of a three-year-old child has about 10^15 synapses (1 quadrillion). This number declines with age, stabilizing by adulthood. Estimates vary for an adult, ranging from 10^14 to 5 x 10^14 synapses (100 to 500 trillion).

If you can imagine each neuron is a core in a computer attached to 7K other cores communicating. (That and as a child grow the brain discards and optimizes itself something a real computing system must do with a dynamic network mesh). That's if the optimal patterns are in the brain. Just an example.

3D processors are being made at the moment by IBM that stack processors wafers on top of one another. This is a good move toward more compact computing.

Traak · 2011-10-15, 02:20 AM

I'm imagining that the stacked 3D processors are going to require direct-contact liquid cooling flowing between the layers.

I wonder if we would see cube-shaped processor stacks marketed, that we would need to have coolers for that would touch all five exposed sides to dissipate the heat?

Sirisian · 2011-10-15, 02:42 AM

Originally Posted by Traak

I'm imagining that the stacked 3D processors are going to require direct-contact liquid cooling flowing between the layers.

Not enough room for a liquid. It's a patented glue for heat dissipation. Before the glue was made this wasn't really possible because of heat. This was a pretty recent thing.

Traak · 2011-10-15, 03:20 AM

Wow. This will bust things open on the performance side.

Intel Qube processor. 20-layer, $200. 50-layer, $5000.

AMD High-Rise processor. 20-layer, $100. 200-layer, $4,000.

Entire computers in the space of a loaf of bread that perform 1000X faster than currently available.

I could see the thing being in the middle of a huge, artistic, passive-cooled heat sink that had enough heat dissipation area that the fins would be cool to the touch.

Or one CPU cooler on each exposed side. Computers could look so much different.

Which brings me to a question. If Air-cooled heat sinks have massive fins, but they are not needed for liquid-cooled heat sinks, why not just immerse an air-cooled heat sink in liquid, and cool that? Wouldn't the heat dissipation be much greater? Wouldn't the temperature reduction be far lower?

Would a chip be kept cooler by spraying liquid coolant directly on it, or having a heat sink on it that has some fins/slots/holes/pins/whatever to increase its surface area exposed to the cooling liquid? The guys who use liquid nitrogen, they just let the cup of liquid nitrogen sit right on top of the processor. What if they bathed the whole computer in liquid nitrogen (the non-moving parts)? Wouldn't that enable even more extreme overclocking?

How about a cooling fin complex that was a foot cubic in size and shape? MASSIVE surface area to expose to cooling. Then liquid-cool it.

I know there is some point of diminishing returns, here. I know air-cooled airplane engines have massive fins up on the cylinder heads and near the tops of the cylinders, not so much at the base, because not so much cooling is needed at the bases of the cylinders.

Well, why not gigantic cooling things that have cooling fins that are a foot tall? That stick right outside the case? Then you could have four fans, one on each side, blasting air into the cooling fin/pin stack, a casing around the rest of the stack, and one huge fan on the end exhausting the air?

I wonder if we have really reached the limits of simple physics when it comes to cooling CPU's.

Even most liquid-cooled systems don't keep the coolant at ambient temperature, limiting their effectiveness.

How do we decide whether we have fins or just coolant blasting or splashing directly on the surface of what we are cooling?

Rbstr · 2011-10-15, 12:03 PM

Cooling systems can't keep the fluid at room temperature without inordinate amounts of effort.
You've gotta dissipate the heat through the radiator still, and because those are of limited size and do not transfer heat perfectly.

Lots of the things you say are just really impractical, yeah you put a bunch of fins in the water cooling block and it would help...but not much, and the pump will have to work harder and your cost goes up and you run into the potential for more surface fouling.

LN2 is hilariously impractical for every day computing use. It's not actually that expensive (about the same as milk!) but it's dangerous because it can displace oxygen, if it's in a sealed dewar it has to have proper venting so it doesn't over-pressure as it boils off, ect. You also run into the opposite problem of heat expansion, things shrink when cold.

And I, for one, don't want some massive beastly computer that sounds like a jet engine, has car radiators sticking out of it and weighs half a ton.

Read http://en.wikipedia.org/wiki/Heat_transfer

Traak · 2011-10-15, 10:47 PM

But, what if, by having a massive, and extremely effective, cooling solution, you could buy cheaper components and overclock them up to the same levels, or beyond, of the more expensive ones?

It isn't like you have to cart your computer around or anything.

Rbstr · 2011-10-16, 12:06 AM

Because I have a real job and don't need to worry about buying decent enough components to last a couple years without bothering with the hassle of overclocking?

I used to do it, back in my highschool days, that 2.8ghz P4 was cooking, that FX5600, I bought this gigantic cooler for it...but if I had just spent the cooler's worth on a better card instead I'd have creamed my oc-ing gains.

Computer performance is too transient to go to such lengths when you don't need to.