31 Jul, 2022

[Hiroshige Goto's Weekly Overseas News] Why AMD's "Cayman" Is Going Multi-core + VLIW4

● Cayman's architectural direction, as shown at TFE in October

　The direction of GPU architecture evolution that emerges from AMD's next high-end GPU, "Cayman", is clear. The GPU becomes more thoroughly multi-core, and the inside of the processor moves away from the traditional GPU architecture. The multi-core design is meant to double geometry-processing throughput. Moving the internal architecture away from the classic GPU style is meant to optimize it for more general-purpose use.

　Cayman's direction is basically similar to NVIDIA's. In that sense it follows the current GPU trend, although AMD remains the more conservative of the two. Until now, AMD GPUs have been conservative designs compared with NVIDIA's, with high processing efficiency on traditional 3D graphics. AMD is now trying to step forward to some extent while keeping that conservatism.

　AMD partially revealed the Cayman architecture at its technical conference "AMD Technology Forum and Exhibition (AMD TFE)", held in Taiwan in October. The outline revealed at TFE is as follows. Cayman is a single-die GPU aimed at enthusiasts. In terms of performance, geometry pipeline throughput is increased. The basic structure of the compute processor is also changed. Finally, GPU computing performance, including double-precision performance, is improved.

　From these hints, the architecture can be inferred to some extent. First, AMD is expected to push GPU multi-core design further, moving from the incomplete dual-core structure of the Radeon HD 5870 (Cypress) generation to a full dual-core one. The diagram below shows the Radeon HD 6800 (Barts), a cost-reduced derivative of Cypress: the SIMD (Single Instruction, Multiple Data) arrays are split into two, but the setup engine is not.

Figure: Radeon HD 6800 configuration

　In addition, the "VLIW (Very Long Instruction Word) processor", the basic unit of AMD GPUs, is expected to change from the current 5-way design to a 4-way design. This is a further step away from the traditional graphics core structure. The aim is to simplify the processor to raise efficiency and to make instruction scheduling easier (scheduling is done by the driver software at run time). As a result of the VLIW restructuring, double-precision floating-point (FP) performance improves from one-fifth to one-quarter of single-precision performance.

　Taken as a whole, Cayman's evolution raises throughput on DirectX 11-generation applications relative to earlier AMD GPUs and also increases general-purpose computing performance. However, AMD has not said that Cayman's memory hierarchy will change significantly. The reasons are complicated, but if AMD does change the memory hierarchy, it will probably be in the 28nm process generation or later.

Figure: Northern Islands die size

● Breaking free of the one-triangle-per-clock limit

　In AMD GPUs, geometry throughput relative to operating frequency has not changed since the Radeon 9700 (R300) era. Throughput was one triangle (or vertex) per clock, so even as GPUs evolved, geometry performance improved only as much as the operating frequency rose. The same was true of NVIDIA. In DirectX 11-generation GPUs, however, the NVIDIA and AMD architectures diverged sharply on this point.
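　As a rough illustration of this frequency-bound scaling, peak geometry rate is simply triangles per clock multiplied by clock frequency. The sketch below uses the commonly cited core clocks of 325MHz for Radeon 9700 Pro and 850MHz for Radeon HD 5870; those clock values and the function name are assumptions added for illustration and do not come from the article.

```python
def peak_triangle_rate(triangles_per_clock: float, clock_mhz: float) -> float:
    """Peak setup rate in millions of triangles per second."""
    return triangles_per_clock * clock_mhz


# Assumed clocks (not stated in the article): 325 MHz and 850 MHz.
r300 = peak_triangle_rate(1, 325.0)
cypress = peak_triangle_rate(1, 850.0)

# With one triangle per clock in both designs, the gain equals the clock ratio.
speedup = cypress / r300

if __name__ == "__main__":
    print(f"R300:    {r300:.0f} Mtri/s")
    print(f"Cypress: {cypress:.0f} Mtri/s")
    print(f"Gain:    {speedup:.2f}x (same as the clock ratio)")
```

　Under these assumptions, the only lever on peak geometry rate is the clock, which is the constraint this section describes.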

　AMD kept this one-triangle-per-cycle structure in the DirectX 11-generation Radeon HD 5870 as well. NVIDIA, on the other hand, raised geometry throughput substantially starting with GeForce GTX 480 (GF100).

　To do so, NVIDIA changed the GPU structure substantially. Conventional NVIDIA and AMD GPUs had a single set of fixed-function units for geometry processing and a single rasterizer (plus a tessellator in AMD's case). This is because the vertex stream could only be processed serially in the graphics pipeline.

　In NVIDIA's GF100, however, the tessellator and much of the geometry fixed-function hardware were moved into the "SM (Streaming Multiprocessor)", the smallest unit of the GPU's processor organization. There are 16 SMs physically on the GPU, so this hardware is 16-way parallel. GF100 also groups four SMs into a larger cluster, the "GPC (Graphics Processing Cluster)". Each GPC has its own rasterizer, along with the blocks that handle processing before and after rasterization. To split graphics processing across physically separate units, NVIDIA divides the screen area itself and assigns the pieces to the clusters. This introduces some inefficiency, but on the whole the multi-core arrangement raises geometry throughput.

Figure: Streaming Multiprocessor of GF100

Figure: GPC of GF100
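　The idea of dividing the screen area among clusters can be sketched as follows. This is a hypothetical illustration only: the 16-pixel tile size, the diagonal interleaving rule, and the function names are all assumptions, and nothing here describes NVIDIA's actual assignment scheme. It does show where the inefficiency comes from, because a triangle whose bounding box crosses tile boundaries must be sent to more than one cluster.

```python
def tile_owner(tx: int, ty: int, clusters: int = 4) -> int:
    """Hypothetical rule: interleave tiles diagonally across the clusters."""
    return (tx + ty) % clusters


def clusters_for_bbox(x0: int, y0: int, x1: int, y1: int,
                      tile: int = 16, clusters: int = 4) -> set[int]:
    """Clusters that must process a triangle with this inclusive pixel bounding box."""
    owners: set[int] = set()
    for ty in range(y0 // tile, y1 // tile + 1):
        for tx in range(x0 // tile, x1 // tile + 1):
            owners.add(tile_owner(tx, ty, clusters))
    return owners


small = clusters_for_bbox(2, 2, 10, 10)      # fits inside one tile
straddle = clusters_for_bbox(10, 2, 20, 10)  # crosses one tile boundary
large = clusters_for_bbox(0, 0, 63, 63)      # covers a 4x4 block of tiles

if __name__ == "__main__":
    print(sorted(small), sorted(straddle), sorted(large))
```

　Small triangles, which tessellation produces in large numbers, mostly land in a single cluster, so the work spreads across clusters with little duplication in this simplified model.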

　This new parallel organization can be described as making the GPU multi-core. With GF100, going multi-core raised geometry throughput roughly eightfold over NVIDIA's previous architecture, by NVIDIA's own comparison. In addition, each rasterizer can rasterize up to 8 pixels per cycle, so up to 4 triangles can be converted into up to 32 pixels every cycle. NVIDIA's Jen-Hsun Huang (Co-founder, President and CEO) describes it as "the first GPU architecture with scalable geometry processing."
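　The per-cycle figures quoted for GF100 follow from the cluster counts. The short sketch below reproduces that arithmetic; the class and field names are illustrative, and the rate of one triangle per GPC per clock is inferred from the "up to 4 triangles" figure rather than stated directly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusteredGpu:
    sms: int                          # streaming multiprocessors on the die
    sms_per_gpc: int                  # SMs grouped into one GPC
    triangles_per_gpc_per_clock: int  # inferred, see the note above
    pixels_per_raster_per_clock: int  # one rasterizer per GPC

    @property
    def gpcs(self) -> int:
        return self.sms // self.sms_per_gpc

    @property
    def triangles_per_clock(self) -> int:
        return self.gpcs * self.triangles_per_gpc_per_clock

    @property
    def pixels_per_clock(self) -> int:
        return self.gpcs * self.pixels_per_raster_per_clock


gf100 = ClusteredGpu(sms=16, sms_per_gpc=4,
                     triangles_per_gpc_per_clock=1,
                     pixels_per_raster_per_clock=8)

if __name__ == "__main__":
    print(gf100.gpcs, gf100.triangles_per_clock, gf100.pixels_per_clock)
```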


● Going multi-core is the inevitable course for GPUs in the DirectX 11 era

　In contrast to NVIDIA's aggressive move to multi-core, AMD took a more moderate approach with Radeon HD 5800. Most of the geometry-related functions remained in a single unit, while the rasterizer and the processor arrays were split into two. In other words, it is a "half dual-core" structure. Geometry throughput is therefore the same as before, and this also holds for the latest Radeon HD 6800. This part is improved, however, in the upcoming Cayman.

Photo: Eric Demers

　"The geometry throughput of Radeon HD 6800 is the same per clock as that of Radeon HD 5800. Geometry throughput will be improved somewhat in the next one, Cayman," explained AMD's Eric Demers (Chief Technology Officer, GPG, AMD).

　Going multi-core is almost the only way to raise geometry throughput. More precisely, graphics processing has to be split up starting from the geometry vertex stream.

Figure: Structural transformation of the GPU

　Because the conclusion is already in sight, predicting the Cayman architecture is easy. The logical outcome is to split the geometry fixed-function units and the tessellator, which today form a single pipeline. The GPU would then be rebuilt as two fully separate cores, each including a rasterizer and the units downstream of it. That would double peak geometry throughput. To some extent, this is an approach similar to NVIDIA's.

　Geometry performance is key in DirectX 11. With tessellation added, DirectX 11 gives geometry processing more weight. Effects that used to be faked with pixel-processing tricks are now to be achieved by actually generating complex shapes on the geometry side. For a GPU of the DirectX 11 era, this looks like a natural evolution.

● Reorganizing the VLIW processor from 5-way to 4-way

　Starting with Cayman, AMD changes the organization of the VLIW processor. At TFE, AMD explained: "In Cayman, the performance ratio of double-precision floating-point operations is higher than in Radeon HD 5800, because the internal structure (of the processor) has been changed. In Cypress, the ratio of double-precision to single-precision operations was one-fifth. In Cayman it is one-quarter. So even with the same (number of processors) as Cypress, double-precision performance improves by roughly 20%" (Demers).

　In the conventional AMD architecture, one VLIW processor had five single-precision floating-point (FP) units. The double-precision to single-precision ratio was one-fifth because four of those units were used together to execute one double-precision operation per cycle. A ratio of one-quarter, as Demers describes it, implies a structure with four single-precision units.
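　The ratios can be checked with a few lines of arithmetic. The sketch below assumes that each VLIW processor completes one double-precision operation per clock, and that "the same number of processors" means the same total number of single-precision lanes; both are interpretations rather than statements from AMD. The 1,600-lane figure is the stream processor count commonly cited for Cypress and is used only as an example.

```python
from fractions import Fraction


def dp_to_sp_ratio(lanes_per_vliw: int) -> Fraction:
    """One DP op per VLIW processor per clock versus `lanes_per_vliw` SP ops."""
    return Fraction(1, lanes_per_vliw)


def dp_ops_per_clock(total_lanes: int, lanes_per_vliw: int) -> int:
    """DP ops per clock for a chip with `total_lanes` SP lanes (one DP op per VLIW)."""
    return total_lanes // lanes_per_vliw


TOTAL_LANES = 1600  # assumed equal lane count for both designs

five_way = dp_ops_per_clock(TOTAL_LANES, 5)
four_way = dp_ops_per_clock(TOTAL_LANES, 4)
gain = Fraction(four_way, five_way) - 1

if __name__ == "__main__":
    print(dp_to_sp_ratio(5), dp_to_sp_ratio(4))  # DP:SP ratio per VLIW processor
    print(five_way, four_way, f"{float(gain):.0%}")
```

　Under these assumptions the 4-way design delivers 25% more double-precision operations per clock at an equal lane count. If the number of VLIW processors were held equal instead, double-precision throughput would be unchanged in this model.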

　In graphics, AMD's VLIW processor is the unit that computes one vertex or one pixel. The five arithmetic units of the current VLIW processor are four single-precision FP multiply-add (MAD) units and one special function unit (SFU). The SFU executes operations such as transcendental functions and also serves as a single-precision FP multiply-add unit. Each arithmetic unit can execute its own instruction. The VLIW processor as a whole can execute six instructions (five arithmetic instructions plus one control-flow instruction).

　Both the strengths and the weaknesses of AMD GPUs lie in this processor structure. AMD GPUs use VLIW instructions with six instruction slots. At compile time, multiple instructions that can execute in parallel are extracted and packed into one VLIW word. On the processor side, the instruction in each slot of the VLIW word executes in parallel on its own arithmetic unit.

Figure: VLIW of Radeon HD 6800
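　Compile-time packing of independent instructions into VLIW words can be illustrated with a simple greedy scheduler. This is a minimal sketch and not AMD's shader compiler: it assumes each register is written only once (so only read-after-write dependencies matter), treats all arithmetic slots as interchangeable, and ignores the control-flow slot. All names are hypothetical.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    dest: str
    srcs: tuple[str, ...]


def pack_vliw(program: list[Instr], width: int) -> list[list[Instr]]:
    """Greedily pack instructions, in program order, into bundles of at most `width`."""
    bundles: list[list[Instr]] = []
    written_in: dict[str, int] = {}  # register -> index of the bundle that writes it
    for ins in program:
        # An instruction must come after every bundle that produces one of its inputs.
        slot = max((written_in[s] + 1 for s in ins.srcs if s in written_in), default=0)
        while slot < len(bundles) and len(bundles[slot]) >= width:
            slot += 1
        if slot == len(bundles):
            bundles.append([])
        bundles[slot].append(ins)
        written_in[ins.dest] = slot
    return bundles


def slot_utilization(bundles: list[list[Instr]], width: int) -> float:
    return sum(len(b) for b in bundles) / (len(bundles) * width)


# Four independent per-channel operations followed by a serial scalar chain.
PROGRAM = [
    Instr("r", ("tr", "k")), Instr("g", ("tg", "k")),
    Instr("b", ("tb", "k")), Instr("a", ("ta", "k")),
    Instr("s1", ("r", "g")), Instr("s2", ("s1", "b")), Instr("s3", ("s2", "a")),
]

if __name__ == "__main__":
    for width in (5, 4):
        packed = pack_vliw(PROGRAM, width)
        print(width, [len(b) for b in packed], slot_utilization(packed, width))
```

　In this toy program the four-component work fills one word and the serial chain occupies one slot per word, so the number of words is the same at either width and only the share of idle slots differs.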

　NVIDIA, by contrast, builds its GPU arithmetic units as scalar processors and does not exploit instruction-level parallelism within a thread. Each thread executes as a single scalar stream. As a result, the group of processors that NVIDIA executes in SIMD (Single Instruction, Multiple Data) fashion is smaller, and the overhead of the instruction units is larger. AMD has relatively little control overhead per execution unit. Moreover, graphics processing often applies the same instruction to three or four elements (the RGBA components of a pixel, for example), so instruction-level parallelism is easy to extract. In other words, execution efficiency is high for graphics.

　In exchange, when instructions cannot be scheduled well on the AMD architecture, processor efficiency drops. In particular, when there are many serial scalar instructions, parallelization can be difficult and efficiency can fall. Even when processing of three or four elements dominates, a VLIW with five-way execution units inevitably has a higher chance of leaving units idle. In other words, even with a large number of arithmetic units physically present, efficiency may not be high.

Figure: Changes in shader programs and the differences in how GPUs respond

● Narrowing the VLIW to raise core utilization

　The problem is that the VLIW processor is too wide. AMD is therefore expected to rebuild the VLIW around 4-way arithmetic units in Cayman. A 4-way design still covers the parallel operations on up to four elements that are common in graphics. It also raises processor utilization on scalar instruction streams. For example, if the effective IPC (Instructions Per Cycle) of the VLIW averages 3 instructions, core utilization is 60% with 5-way but 75% with 4-way.
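　The utilization figures are simply the effective IPC divided by the VLIW width. A minimal sketch, with an illustrative function name:

```python
def vliw_utilization(effective_ipc: float, width: int) -> float:
    """Fraction of arithmetic slots doing useful work per cycle."""
    if width <= 0:
        raise ValueError("width must be positive")
    # A VLIW word cannot issue more instructions than it has slots.
    return min(effective_ipc, width) / width


if __name__ == "__main__":
    for width in (5, 4):
        print(f"{width}-way at IPC 3: {vliw_utilization(3, width):.0%}")
```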

Figure: VLIW of Cayman

　So how would the SFU function be implemented in a 4-way VLIW? The reasonable method is to add logic to the multiply-add units so that special functions can be executed across several of them. In AMD's VLIW architecture, multiple arithmetic units can be ganged together by decomposing a complex instruction and issuing it so that it spans several instruction slots. This makes each arithmetic unit slightly larger, but the increase can probably be offset by the gain in efficiency.

　This change to the VLIW processor means leaving the traditional GPU behind. The current organization of four multiply-add units plus one special function unit (SFU) descends from the older GPU arrangement of a 4-way SIMD (Single Instruction, Multiple Data) unit plus an SFU. For running code written for older GPUs, keeping the same ratio of arithmetic units was the safer choice. For this reason, NVIDIA too kept the ratio of multiply-add units to SFUs in GeForce 8800 (G80) the same as in its older GPUs.

　NVIDIA, however, has since been lowering the ratio of SFUs to multiply-add units. This is because, in today's instruction mix, execution on the multiply-add units matters more. In general-purpose computing outside graphics in particular, the mix presumably leans even more heavily toward multiply-add. The same should hold for AMD, so it would be no surprise if AMD dropped the dedicated SFU and implemented its function on the multiply-add units.

　Seen this way, the architecture AMD is adopting for Cayman is a reasonable one. With the Cayman architecture, effective performance is higher than with the conventional AMD architecture even for the same number of processors. In addition, double-precision performance is 25% higher than before for the same number of processors.

　Further architectural changes can be expected after Cayman. The internal bus and the memory hierarchy are left untouched in Cayman. If AMD makes changes next, this is where they will be. Improving geometry processing and general-purpose computing performance requires reforming that part. AMD once failed with a ring bus as its internal bus. In GF100, NVIDIA took a pragmatic route to a bidirectional bus, keeping only the texture-fetch bus separate. Here too, AMD is likely to take an approach similar to NVIDIA's.
