
The well-known Box86/Box64 emulator now has better RISC-V RVV 1.0 support, with a significant performance boost

RISC-V International Talent Training and Certification Center · 2024-10-15 08:08
Partially machine translated. Reposted from: https://box86.org/2024/10/optimizing-the-risc-v-backend/

Hello everyone! A month and a half ago, we wrote about the latest state of the RISC-V DynaRec (dynamic recompiler, i.e. Box64's JIT backend) and shared the gratifying progress of running The Witcher 3 on RISC-V. If you haven't seen that post yet, don't miss it! Anyway, over the last month we did not just sit around doing nothing; we focused on performance improvements, and now we have something to share.

Are We SIMD Yet?

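Over the years, the x86 instruction set has slowly accumulated a large number of SIMD instructions, scattered across multiple SIMD extensions, from the original MMX to SSE, SSE2, SSE3, SSSE3, SSE4, and on to AVX, AVX-2, AVX-512 and the upcoming AVX10. As you may have guessed, these instructions must be widely used to justify such a large impact on the encoding space.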

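In fact, thanks to modern compilers, nearly every x86 program today uses SIMD instructions to some degree. In particular, some performance-sensitive and parallel-friendly programs use hand-written SIMD kernels in hot code paths to improve performance dramatically. Box64 therefore needs to translate these instructions efficiently.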

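Fortunately, x86 is not the only instruction set with SIMD or vector extensions. They are so important that almost every instruction set has them. For example, AArch64 has Neon, SVE and SVE2, LoongArch has LSX and LASX, and RISC-V has the Vector extension (RVV). In essence, these extensions share the same goal, which is to speed up the execution of parallel-friendly code. So even with differences here and there, they are generally similar, and many basic instructions are exactly the same, so an emulator such as box64 can translate them one-to-one.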

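So how well does box64 support these x86 SIMD instructions? Well, that is a complicated question. For example, the AArch64 DynaRec, currently the most complete one, supports almost all instructions from MMX to AVX-2. Simply put, each of these instructions is translated into one or more Neon instructions that do the equivalent work. Meanwhile, the LoongArch64 DynaRec, the least complete one, currently supports only a small subset of MMX and SSE* instructions, and unimplemented opcodes fall back to the interpreter, which is very slow.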

So, what about our beloved RISC-V? Are we SIMD yet?

Well, a month and a half ago, the answer was no. The RISC-V DynaRec did implement most instructions from MMX to SSE4, but they were emulated with scalar instructions.

For example, for the SSE2 paddq opcode, what this instruction does is:

[Image: operation of the paddq instruction]

So how is it emulated on RISC-V? Let’s take a look at it via the dump functionality of Box64:

[Image: Box64 dump of the scalar RISC-V translation of paddq]

You can see that the translation is implemented by two LOAD LOAD ADD STORE sequences, totaling 8 instructions. This is probably the easiest opcode to emulate, so it will be even worse for more complex opcodes.
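To make the cost concrete, here is a minimal Python model of what such a scalar translation computes. It is only a sketch: the name `paddq_scalar` and the representation of a 128-bit register as a list of two 64-bit integers are assumptions made for illustration, not Box64 code.

```python
MASK64 = (1 << 64) - 1  # keep results within 64 bits, like a 64-bit register

def paddq_scalar(dst, src):
    """Model paddq on two 128-bit registers, each given as two 64-bit lanes.

    Every lane needs a LOAD (dst), a LOAD (src), an ADD and a STORE,
    so the two lanes together account for 8 scalar instructions.
    """
    for lane in range(2):
        a = dst[lane]                   # LOAD lane of the destination
        b = src[lane]                   # LOAD lane of the source
        dst[lane] = (a + b) & MASK64    # ADD with wraparound, then STORE
    return dst
```

The lanes are independent: an overflow in the low lane wraps around and never carries into the high lane.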

So how is this implemented on AArch64?

[Image: Box64 dump of the AArch64 translation of paddq]

Ah ha, this opcode is translated one-to-one to the VADD instruction! No surprises at all.

As you can imagine, on RISC-V this approach does perform better than simply falling back to the interpreter, but it is far behind AArch64, which has Neon instructions at hand.

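The RISC-V instruction set is known for its diversity (or, if you hate RISC-V, you could call it fragmentation). This means that beyond the base instruction set, vendors are entirely free to implement or skip official extensions, and to add or not add custom extensions.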

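You see, in AArch64 the Neon extension is mandatory, so box64 can use it freely without worrying about whether it is present. RVV is very different. For example, the JH7110 (VisionFive 2, MilkV Mars, etc.) has no vector extension at all, while the SpacemiT K1/M1 (Banana Pi F3, MilkV Jupiter, etc.) supports RVV 1.0 with 256-bit vector registers, and the SG2042 (MilkV Pioneer) supports the old RVV version 0.7.1 (or XTheadVector) with 128-bit registers.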

In fact, the one on SG2042 is not strictly 0.7.1 but is based on 0.7.1, which is why it is called XTheadVector. Although it has some differences from RVV 1.0, such as the encoding of instructions, the behavior of some instructions, and the lack of a small number of instructions, it is generally very similar.

Anyway, on RISC-V we cannot assume that RVV (or XTheadVector) is always present, so using a scalar implementation as a fallback is reasonable and necessary.

But for a long time, the RISC-V DynaRec only had a scalar solution, which was a huge waste of performance on hardware with RVV (or XTheadVector) support, until recently. Yes, in the past month, we added preliminary RVV and XTheadVector support to the RISC-V backend! Also, we managed to share most of the code between RVV 1.0 and XTheadVector, so there is no additional maintenance burden for supporting 2 extensions at once.

Ohhhh, I can’t wait, let me show you what that paddq looks like now!

[Image: Box64 dump of the RVV translation of paddq]

Hmmm, okay, it looks much nicer. But, you may ask, what the heck is that VSETIVLI? Well… that’s a long story.

In “traditional” SIMD extensions, the width of the selected elements is encoded directly into the instruction itself, e.g. in x86 there is not only paddq for 64bit addition, but also paddb, paddw and paddd for 8bit, 16bit and 32bit respectively.

In RVV, on the other hand, there is only 1 vector-wise addition instruction, which is vadd.vv. The selected element width (SEW) is instead stored in a control register called vtype, and you need to use the dedicated vsetivli instruction to set the value of vtype. Every time a vector instruction is executed, the vtype register must be in a valid state.
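The difference between the two designs can be sketched in a few lines of Python. This is an illustrative model only: `vadd_vv` here is a hypothetical function that treats a 128-bit register as one big integer and takes the element width as a parameter, standing in for the SEW held in vtype. The `X86_ADD_SEW` table simply restates the opcode-to-width mapping from the text.

```python
def vadd_vv(a, b, sew, vlen=128):
    """Add two vlen-bit values lane by lane, with lanes that are sew bits wide."""
    mask = (1 << sew) - 1
    result = 0
    for shift in range(0, vlen, sew):
        # Only the low sew bits of each lane sum are kept, so a carry
        # never leaks into the neighbouring lane.
        lane = ((a >> shift) + (b >> shift)) & mask
        result |= lane << shift
    return result

# On x86 the element width is part of the opcode; in this model it is
# just a different sew argument to the same addition.
X86_ADD_SEW = {"paddb": 8, "paddw": 16, "paddd": 32, "paddq": 64}

def emulate_x86_add(opcode, a, b):
    return vadd_vv(a, b, X86_ADD_SEW[opcode])
```

For example, adding 0xFF and 0x01 wraps to 0x00 under paddb but yields 0x100 under paddw, because the same bits are split into lanes differently.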

In the above vsetivli instruction, we essentially set the SEW to 64bit along with other configurations. However, inserting a vsetivli before every SIMD opcode doesn’t sound like a good idea. If vtype does not need to change between adjacent opcodes, we can safely eliminate the redundant vsetivli instructions. And that’s how we did it in Box64.

Look at these dumps:

[Image: Box64 dumps of five consecutive SSE opcodes translated with RVV]

You can see that among the 5 SSE opcodes, as the actual SEW has not changed, we only call vsetivli once at the top. We achieved this by adding a SEW tracking mechanism to the DynaRec and only inserting vsetvli when it’s necessary. This tracking not only includes the linear part but also considers control flow. A lot of state caching in box64 is done using a similar mechanism, so nothing new here.
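The linear part of such tracking can be sketched as follows. This is a simplified illustration under stated assumptions, not Box64's implementation: the function name `insert_vsetivli`, the `(name, sew)` input format, and the output strings (which are labels, not real assembly syntax) are all hypothetical, and control flow is not modelled at all.

```python
def insert_vsetivli(opcodes):
    """Emit a vtype-setting marker only when the required SEW changes.

    opcodes: list of (name, sew) pairs for one straight-line block.
    Returns the emitted stream as a list of strings.
    """
    emitted = []
    current_sew = None  # vtype is treated as unknown at block entry
    for name, sew in opcodes:
        if sew != current_sew:
            # The tracked state does not match what this opcode needs.
            emitted.append(f"vsetivli (SEW={sew})")
            current_sew = sew
        emitted.append(name)
    return emitted
```

With five opcodes that all need SEW=64, the output contains a single marker at the top followed by the five opcodes, which matches the behaviour described above for straight-line code.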

For now, we haven’t implemented every x86 SIMD instruction in RVV / XTheadVector, but we implemented enough of them to do a proper benchmark. By tradition, we use the dav1d AV1 decoding benchmark as a reference, which happens to use SSE instructions a LOT, and here is the command we used:

dav1d -i ./Chimera-AV1-8bit-480x270-552kbps.ivf --muxer null --threads 8

We did the test on the MilkV Pioneer, which has the XTheadVector extension.

We also tested RVV 1.0 with the SpacemiT K1, and the result is more or less the same.

[Image: dav1d benchmark results]

Compared to the scalar version, we get a nearly 4x performance improvement! Even faster than native! Ugh… well, the faster-than-native part is more symbolic. The comparison would be meaningful only if native dav1d fully supported XTheadVector, which it does not support at all.

Last But Not Least

In the last post, we complained about RISC-V not having bit range insert and extract instructions, and therefore not being able to efficiently implement things like 8bit and 16bit x86 opcodes. camel-cdr came up with a great solution: https://news.ycombinator.com/item?id=41364904. Basically, for an ADD AH, BL, you can implement it using the following 4 instructions!

[Image: four-instruction RISC-V sequence for ADD AH, BL]

The core idea is to shift the actual addition to the high part to eliminate the effect of the carry, which is a pretty neat trick. And it can be applied to almost all of the 8-bit and 16-bit opcodes when there is no eflags computation required, which covers most scenarios. We have adopted this approach as a fast path to box64. Thank you very much camel-cdr!
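The idea can be checked with a small Python model of 64-bit registers. The sequence below (shift, rotate, add, rotate back) is one possible four-step realization consistent with the description; it is not copied from the original listing, and the helper names `slli`, `rori` and `add_ah_bl` are assumptions for illustration.

```python
MASK64 = (1 << 64) - 1

def slli(x, n):
    """Shift left by n and truncate to 64 bits."""
    return (x << n) & MASK64

def rori(x, n):
    """Rotate a 64-bit value right by n bits (0 < n < 64)."""
    return ((x >> n) | (x << (64 - n))) & MASK64

def add_ah_bl(rax, rbx):
    """Model ADD AH, BL without computing any flags."""
    t = slli(rbx, 56)          # move BL into bits 63:56, zeros below
    rax = rori(rax, 16)        # move AH (bits 15:8) into bits 63:56
    rax = (rax + t) & MASK64   # the carry falls off the top and is lost
    return rori(rax, 48)       # rotate the remaining 48 bits to restore the layout
```

Because the addend has zeros in its low 56 bits and the sum sits in the top byte, no other byte of the register can be disturbed. For instance, with RAX = 0x1122334455667788 and BL = 0xFF, AH goes from 0x77 to 0x76 and every other byte is unchanged.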

This method requires an instruction from the Zbb extension called RORI (Rotate Right Immediate). Fortunately, at least all the RISC-V hardware I own provides this extension, so it’s commonly available. (Well, SG2042 does not have Zbb, but it has an equivalent instruction TH.SRRI in the XTheadBb extension).

We also found that in the XTheadBb extension, there is a TH.EXTU instruction, which does the bit extract operation. We’ve adopted this instruction in some places too, for example, the indirect jump table lookup: when box64 DynaRec needs to jump out of the current dynablock to the next dynablock, it needs to find where the next one is.

In short, there are two cases. The first is a direct jump, that is, the jump address is known at compile time. In this case, box64 can directly query the jump table at compile time to obtain the jump address and place it in the built-in data area of dynablock, which can be used directly when jumping at runtime, no problem there.

The second is an indirect jump, that is, the jump address is stored in a register or memory and is unknown at compile time. In this case, box64 has no choice but to generate code that queries the jump table at runtime.

The lookup table is a data structure similar to a page table, and the code for the lookup is as follows:

[Image: RISC-V code for the jump table lookup]

Hmmm, I know, it’s hard to see what’s happening there, but it seems like a lot of instructions there for a jump, right? But with TH.ADDSL and TH.EXTU from XTheadBb, it becomes:

[Image: jump table lookup using TH.ADDSL and TH.EXTU]

Now it’s much more readable; you should be able to see that this is a 4-layer lookup table, and the number of instructions required has also been reduced a bit.
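For readers who cannot see the listings, a page-table-like lookup with four layers can be sketched in Python as below. Several things here are assumptions for illustration: the 16-bits-per-level split of a 64-bit address, the use of nested dictionaries, and the helper names `extu`, `insert_target` and `lookup`. The `extu` helper only models a zero-extended bit-field extract of the kind TH.EXTU performs; it is not the instruction's exact definition.

```python
def extu(value, msb, lsb):
    """Zero-extended extract of bits msb..lsb (inclusive) from value."""
    return (value >> lsb) & ((1 << (msb - lsb + 1)) - 1)

def insert_target(table, addr, target):
    """Register target for addr, creating intermediate levels as needed."""
    node = table
    for msb in (63, 47, 31):
        node = node.setdefault(extu(addr, msb, msb - 15), {})
    node[extu(addr, 15, 0)] = target

def lookup(table, addr, default):
    """Walk the four levels; return default when any level is missing."""
    node = table
    for msb in (63, 47, 31, 15):
        node = node.get(extu(addr, msb, msb - 15))
        if node is None:
            return default
    return node
```

Each level costs one bit-field extract plus one indexed load, which is why dedicated extract and shift-add instructions shorten the generated code.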

Okay, all these optimizations look good, but do they show effects in real-world benchmarks? Well, we tested 7z b, dav1d and coremark, and there are no visible differences in the scores with or without XTheadBb. But, a quote from the SQLite website:

A microoptimization is a change to the code that results in a very small performance increase. Typical micro-optimizations reduce the number of CPU cycles by 0.1% or 0.05% or even less. Such improvements are impossible to measure with real-world timings. But hundreds or thousands of microoptimizations add up, resulting in measurable real-world performance gains.

So, let’s do our best and hope for the best!

In the End

Well, this is the end of this post, but it is not the end of optimizing the RISC-V DynaRec; we’ve barely started!

Next, we’ll add more SSE opcodes to the vector units, as well as MMX opcodes and AVX opcodes, and we will make the RISC-V DynaRec as shiny as the AArch64 one!

So, a bit more work, and we can have a look again at gaming, with, hopefully, playable framerates and more games running, so stay tuned!
