Processors: The Inner Workings Revealed



The majority of computer users, engineers, and technicians think they know a great deal about PCs, when they really don’t even scratch the surface. Since day one as a computer engineer, I have yearned to understand every single component inside my PCs. I want to know the purpose of every capacitor, resistor, and chip. Does it sound as though I should have been an electrical engineer? Perhaps. The biggest of these mysteries to me, and the most important, was how a processor works. I have attempted to understand this several times, and each time I have come out with slightly greater knowledge. I believe that now, after three years of computer engineering, several certifications, and lots of work experience, I am ready to figure it out. I’ll be writing this article as I read, in order to explain the entire processing process (excuse my terrifying pun) down to the lowest levels of what everything does.

Data flows in, it flows out, it flows in, it flows out - nobody knows why it happens (sort-of quoting Lewis Black). I don’t know why they keep these things a secret, but I intend to blow their secret wide open. All that I’m assuming you know about processors is this:
- They are the part of the PC that handles transforming and calculating data.
- They take input from memory, and output to memory.
- They “process” things.
- “CPU” is a synonym for processor, standing for Central Processing Unit. (You know it now!)
- The “CPU” is not the same as the “tower”, or the whole PC. That is NOT a CPU (I hate when people make that mistake).

Not too hard of prerequisites, so if you didn’t know that, go read a simple article first. ^-^

The first thing I’m going to do is take a detour to explain clock cycles, or the speed of the processor. There is a chip with a crystal in it that resonates at a certain frequency when electricity is applied. This is essentially the “heart” of the PC. Every pulse of this crystal means that the bits and bytes flow around the PC, similar to the human blood stream (but ridiculously faster). Each pulse is known as a clock cycle, and one cycle per second is 1Hz. You may have a 2GHz processor (a bit older by 2007 standards, unless dual/quad core), and wonder what that means about the crystal (often called the Time Crystal). In the world of computer storage, 1024 is the magic number: 1024 Bytes = 1 Kilobyte, 1024 Kilobytes = 1 Megabyte, 1024 Megabytes = 1 Gigabyte. Hertz (Hz) doesn’t play by that rule, though; it uses plain decimal prefixes. 2GHz = 2000 Megahertz = 2,000,000,000 Hertz. That is how fast the processor’s clock pulses (the crystal itself resonates far slower, and the clock circuitry multiplies its signal up to the rated speed). The speeds we reach in the modern PC are inconceivable. All of this is fundamental to the way the processor works.
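To make the clock arithmetic concrete, here is a small Python sketch. Frequency uses decimal prefixes (1 GHz is 1,000,000,000 Hz), and the function names are just ones I made up for illustration.

```python
# Frequency uses decimal (SI) prefixes, unlike the 1024-based storage units.
HZ_PER_MHZ = 1_000_000
HZ_PER_GHZ = 1_000_000_000

def ghz_to_hz(ghz):
    """Convert a clock speed in GHz to cycles per second (Hz)."""
    return ghz * HZ_PER_GHZ

def cycle_time_ns(ghz):
    """Length of one clock cycle in nanoseconds (1 GHz means 1 ns per cycle)."""
    return 1.0 / ghz

clock_hz = ghz_to_hz(2)
print(clock_hz)                 # 2000000000
print(clock_hz // HZ_PER_MHZ)   # 2000
print(cycle_time_ns(2))         # 0.5
```

So a 2GHz processor gets a new clock cycle every half a nanosecond.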

SpeedStep, a fancy name for the now-standard system of clocking down to conserve power when the processor isn’t heavily used, is based on several MSRs, or model-specific registers, called performance counters. These are read to determine whether the CPU can be clocked down, and hence undervolted, saving power. On mobile processors like our example, the Pentium M, separate power lines feed each execution unit, allowing them to be shut off to save power when not in use for long periods of time (How many times could you possibly need to SSE Shuffle [see execution units later]? Lol).

Alright, so - some amount of data is pulled from memory that makes up a program or part of a program that you’d like to run. The processor somehow pulls the first pieces of data, and the games begin. The first component that the data reaches is called an “instruction decoder”. “But why does it need to decode the instructions, aren’t they already x86 binary code?” - yes, but that’s not what the processor wants, surprisingly enough.

Back in the day, there were two fully separate, incompatible types of processors: CISC (complex instruction set computer) and RISC (reduced instruction set computer). These used fully different types of instructions that could only be processed by the appropriate type of architecture. The insides of today’s Pentium (and equivalent VIA, AMD, etc. models) and newer x86 processors work on simple RISC-style instructions. Every program compiled for x86, including the code that makes up Windows, is CISC. Obviously CISC cannot run on RISC. “WTF”, you may be saying to yourself right about now. The key to this is the aforementioned component, the instruction decoder. It essentially takes the x86 CISC instructions and converts them into micro-ops (micro-operations), which are the RISC-style instructions.
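Here is a toy Python sketch of what “decoding” means. The table below is completely made up for illustration (the real micro-ops are Intel’s business and not public in this form), but it shows the idea: one complicated instruction goes in, one or more simple operations come out.

```python
# Hypothetical decode table: each x86-style instruction maps to a list of
# simpler micro-ops. The names and the splits are illustrative only.
DECODE_TABLE = {
    "add reg, reg": ["add"],
    "add reg, [mem]": ["load", "add"],
    "add [mem], reg": ["load", "add", "store"],
}

def decode(instruction):
    """Return the micro-ops a toy decoder would emit for one instruction."""
    return DECODE_TABLE[instruction]

program = ["add reg, reg", "add [mem], reg"]
micro_ops = [uop for ins in program for uop in decode(ins)]
print(micro_ops)  # ['add', 'load', 'add', 'store']
```

An instruction that only touches registers stays a single operation, while one that reads and writes memory gets split into separate load, compute, and store steps.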

Using the Pentium M as a reference, there are several paths in the decoder that the instructions take, depending on the type of instruction. What matters is the ratio of CISC to RISC operations, that is, how many micro-ops one instruction turns into. In the Pentium M, there is one decoder for instructions that produce 1-4 RISC operations, two decoders for instructions with a 1:1 ratio of CISC to RISC, and a ROM-utilizing sequencer for instructions that produce more than 4. These paths are all separate, but feed into a decoded instruction queue, then to a RAT, or Register Allocation Table (Intel’s own name for it is the Register Alias Table). Registers, for those who aren’t familiar, are very tiny data storage for the processor to work with handfuls of information. This specific system can decode 3 instructions per clock cycle, producing up to 6 RISC instructions: 4+1+1 from the three decoders. The alternative is the sequencer, which can take many clock cycles to finish decoding. This specific (Pentium M) architecture also fuses two 118-bit RISC operations into one 236-bit one for transport, and separates it again for execution (saving time and power).
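A rough Python sketch of that 4+1+1 arrangement, counting how many clock cycles it takes to decode a list of instructions. This is a simplified model under my own assumptions: the first instruction of each cycle goes to the big decoder, the next two are only accepted if they are 1:1, and the sequencer is assumed to emit 4 micro-ops per cycle (that rate is a guess, not a documented figure).

```python
def decode_cycles(uop_counts, sequencer_rate=4):
    """Cycles needed to decode instructions in a toy 4+1+1 decoder model.

    uop_counts: micro-ops produced by each instruction, in program order.
    sequencer_rate: assumed micro-ops per cycle from the sequencer (a guess).
    """
    cycles = 0
    i = 0
    while i < len(uop_counts):
        count = uop_counts[i]
        if count > 4:
            # Too big for any decoder: the sequencer handles it alone.
            cycles += -(-count // sequencer_rate)  # ceiling division
            i += 1
            continue
        # The big decoder takes one instruction of 1-4 micro-ops.
        cycles += 1
        i += 1
        # The two small decoders take following instructions only if 1:1.
        for _ in range(2):
            if i < len(uop_counts) and uop_counts[i] == 1:
                i += 1
            else:
                break
    return cycles

print(decode_cycles([4, 1, 1]))  # 1
print(decode_cycles([2, 2, 2]))  # 3
print(decode_cycles([8]))        # 2
```

The ordering of instructions matters in this model: three 2-micro-op instructions in a row can only use the big decoder, so they take three cycles instead of one.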

Registers… tiny little data cubicles that are just a few bits or bytes in size - WRONG! This is 2007 (at the time of writing, at least, lol if it’s like 2010, 2020… heh…), and we aren’t constrained to such ridiculous numbers! This will blow the mind of an ASM programmer: In the old designs, and in code, there are 8 32-bit registers with cryptic names (which I won’t explain here). In the modern CPU, specifically the Pentium M again, there are 40 internal registers, yes, 40! They are all 80-bit, as well. So the 8 registers we write code against are really being emulated on top of a bigger set, which makes me wonder how much faster things could run if code could use all of them directly! Anyway, the RAT renames the old registers used in the program, mapping each one onto one of the new ones. Since there are far more registers in this scenario than previously, two pieces of code that use the same register of the original 8 can be in flight at once, because each use can be given its own internal register. This allows more efficient execution, I suppose.
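Register renaming is easier to see in code. Below is a minimal Python sketch, assuming 40 internal registers as above; the class and method names are hypothetical, and a real RAT also recycles registers once instructions finish, which I leave out.

```python
class RenameTable:
    """Toy register renaming: program-visible names map to internal slots."""

    def __init__(self, physical_count=40):
        self.free = list(range(physical_count))  # unused internal registers
        self.mapping = {}                        # e.g. "eax" -> 7

    def read(self, name):
        """Internal register currently holding the latest value of `name`."""
        return self.mapping[name]

    def write(self, name):
        """Give `name` a fresh internal register for a new value."""
        phys = self.free.pop(0)
        self.mapping[name] = phys
        return phys

rat = RenameTable()
first = rat.write("eax")    # e.g. mov eax, 1
second = rat.write("eax")   # e.g. mov eax, 2 (independent of the first write)
print(first, second)        # 0 1
print(rat.read("eax"))      # 1
```

Both writes target “eax” in the program, yet they land in different internal registers, so neither has to wait for the other.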

Code in a PC flows in a straight line… until you reach the processing internals. After being decoded, the instructions are sent to the ROB, or Reorder Buffer. This device shoots the instructions forth to be executed out of order (from here-out “OOO”), in whatever order gets them done soonest rather than strictly in program order.
Why? For one, an instruction that is stuck waiting on data shouldn’t hold up later instructions that don’t depend on it. There are also branch instructions, which can be unconditional or conditional. Unconditional branches always jump to a different point in the instruction stream, while conditional ones test a condition first (hence the names). The “things” that the conditional instructions test are the flags, small bits in the processor that get set one way or another by earlier instructions.
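A tiny Python sketch of flags and a conditional branch, in the spirit of a compare followed by a “jump if zero”. The flag names and addresses here are illustrative, not a full model of the x86 flags register.

```python
def compare(a, b):
    """Mimic a compare instruction: subtract, record flags, discard the result."""
    result = a - b
    return {"zero": result == 0, "sign": result < 0}

def next_address(flags, fallthrough, target):
    """Conditional 'jump if zero': go to target only when the zero flag is set."""
    return target if flags["zero"] else fallthrough

flags = compare(5, 5)
print(next_address(flags, fallthrough=104, target=200))  # 200
flags = compare(5, 3)
print(next_address(flags, fallthrough=104, target=200))  # 104
```

The branch itself never looks at the numbers, only at the flag bits the earlier instruction left behind.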

The instructions are sent to the “RS” or Reservation Station (splitting the previously fused 236-bit instructions back into 118-bit ones), then to the execution units (which do the actual processing, we still don’t know how), and then back to the ROB, where they are put back into the original order. They then exit the ROB in order, which is called the retirement stage.
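Here is a small Python sketch of the “finish in any order, retire in order” idea. It is a simplified model of my own: instructions are just numbered 0, 1, 2… in program order, and an instruction may only retire once everything before it has retired.

```python
def retire_in_order(finish_order):
    """Show which instructions retire after each completion.

    finish_order: instruction numbers in the order execution completes.
    Returns one list per completion with the instructions retired then.
    """
    done = set()
    next_to_retire = 0
    retired_log = []
    for idx in finish_order:
        done.add(idx)
        batch = []
        # Retire as far as possible, but never skip an unfinished instruction.
        while next_to_retire in done:
            batch.append(next_to_retire)
            next_to_retire += 1
        retired_log.append(batch)
    return retired_log

print(retire_in_order([2, 0, 1, 3]))  # [[], [0], [1, 2], [3]]
```

Instruction 2 finishes first but has to sit in the buffer until 0 and 1 are done, so the program still sees results in the original order.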

More on the RS - it takes in the instructions, un-fuses them, then sends them through one of several ports, which lead to different execution areas. Below is a list of execution units that the data can then be sent to, ripped directly from http://www.hardwaresecrets.com/:

IEU: Instruction Execution Unit is where regular instructions are executed. Also known as ALU (Arithmetic and Logic Unit). “Regular” instructions are also known as “integer” instructions.

FPU: Floating Point Unit is where complex math instructions are executed. In the past this unit was also known as “math co-processor”.

SIMD: Is where SIMD instructions are executed, i.e. MMX, SSE and SSE2.

WIRE: Miscellaneous functions.

JEU: Jump Execution Unit processes branches and is also known as Branch Unit.

Shuffle: This unit executes a kind of SSE instruction called “shuffle”.

PFADD: Executes a SSE instruction called PFADD (Packed FP Add) and also COMPARE, SUBTRACT, MIN/MAX and CONVERT instructions. This unit is pipelined, so it can start executing a new micro-op at each clock cycle even if it didn’t complete the execution of the previous micro-op. This unit has a latency of three clock cycles, i.e. it delays three clock cycles to deliver each processed instruction.

Reciprocal Estimates: Executes two SSE instructions, one called RCP (Reciprocal Estimate) and another called RSQRT (Reciprocal Square Root Estimate).

Load: Unit to process instructions that ask a data to be read from the RAM memory.

Store Address: Unit to process instructions that ask a data to be written at the RAM memory. This unit is also known as AGU, Address Generator Unit. This kind of instruction uses both Store Address and Store Data units at the same time.

Store Data: Unit to process instructions that ask a data to be written at the RAM memory. This kind of instruction uses both Store Address and Store Data units at the same time.

Back in the day, processors just had one unit, the IEU, and then they added a “math coprocessor”, which is now the FPU. When a processor has several execution units and can run more than one instruction at the same time, it earns the adjective “superscalar”.
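The PFADD entry in the list above mentions a pipelined unit with a latency of three clock cycles. A quick Python sketch of what pipelining buys you, assuming a new micro-op can start every cycle and nothing ever stalls (the function names are my own):

```python
def cycles_unpipelined(n_ops, latency=3):
    """Each micro-op must fully finish before the next one starts."""
    return n_ops * latency

def cycles_pipelined(n_ops, latency=3):
    """A new micro-op starts every cycle; the last one finishes `latency` later."""
    if n_ops == 0:
        return 0
    return latency + (n_ops - 1)

print(cycles_unpipelined(10))  # 30
print(cycles_pipelined(10))    # 12
```

Each individual micro-op still takes three cycles, but ten of them finish in 12 cycles instead of 30.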

With so many execution units available, in this Pentium M scenario up to 12 instructions can be in execution at the same time, even though only 5 at a time can be dispatched through the 5 ports (each of which connects to a group of the above execution units). Intel intelligently paired execution units on the same ports, such as the IEU and FPU, so that when a slower unit like the FPU has an instruction to munch on, the IEU can keep cranking through input on the temporarily free port, thus keeping all ports busy. Up to 3 items per clock cycle can be removed from the ROB, which as said above is where these return once executed.

You may have heard geekier people than you (or be one of those people like myself) talk about “cache”, and that their processor has more “L2 cache” than someone else’s. What is cache? Detour!

There are two types of RAM chips in the world: dynamic and static. Dynamic memory stores each bit as a tiny electrical charge that leaks away, so it has to be refreshed constantly to keep the data stored, but it is cheap to make. Static memory holds its data without any refreshing for as long as it has power, which makes it faster, but also more expensive. “..but my jump drive has chips and it keeps stuff on it when it isn’t plugged in!” - That is a third kind of chip, flash memory. It requires no energy to keep its data, which makes it perfect for storage in the long-term, away from a PC.

(While reading about the internals of cache, I came across the fact that at some point [probably the mid-90s], a 2GB hard drive was “cheap” when costing less than $200. Scaling to 2007, the “cheap when less than $200” size would be 750GB… Big leaps… big leaps… You may be reading this now going “Lol my 3 exabyte hard drive cost $50, nub!”, and I’m sitting in a room with a 10EB hard drive, laughing back… enough sidetrack!)

Dynamic and static RAM (Random Access Memory, the long name for memory…) are abbreviated DRAM and SRAM. SRAM, being more expensive, was too costly to use for the PC’s main memory, so the cheaper DRAM was used. In order to keep up with the increasing processor speeds, small amounts of SRAM were added to the processor itself, allowing it to keep small amounts of data close at hand (but much more than the registers hold), increasing its processing efficiency. My (very high-end) system (note that it is 2007, future people. Don’t lol me.) has 4MB of L2 cache. Back when they were first putting cache in, it was more around 256KB. We have had to scale the amount of cache to keep the efficiency optimal. The faster the processor, and the greater the difference between RAM and processor speed, the more SRAM is needed to keep the processor from idling (and wasting your time). When the processor wants the same information again, it doesn’t have to pull it from memory, it can get it from its cache. This was a huge performance achievement back then.
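To show the “get it from the cache instead of memory” idea, here is a toy Python cache. It is only a sketch under my own assumptions: it holds four entries and throws out the least recently used one when full, which is one common policy but not necessarily what any particular processor does.

```python
class TinyCache:
    """Toy cache sitting in front of a slow 'RAM' dictionary."""

    def __init__(self, memory, capacity=4):
        self.memory = memory
        self.capacity = capacity
        self.lines = {}   # address -> value; dict order tracks recency
        self.hits = 0
        self.misses = 0

    def read(self, address):
        if address in self.lines:
            self.hits += 1
            value = self.lines.pop(address)   # re-inserted below as most recent
        else:
            self.misses += 1
            value = self.memory[address]      # the slow trip out to RAM
            if len(self.lines) >= self.capacity:
                oldest = next(iter(self.lines))
                del self.lines[oldest]        # evict the least recently used
        self.lines[address] = value
        return value

ram = {addr: addr * 10 for addr in range(100)}
cache = TinyCache(ram)
for addr in [1, 2, 1, 1, 3, 2]:
    cache.read(addr)
print(cache.hits, cache.misses)  # 3 3
```

Half of those six reads never had to go out to RAM at all, and that is exactly the time the processor would otherwise spend idling.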
