<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Cache-Optimization on Xi's Blog</title><link>https://xichen1997.github.io/tags/cache-optimization/</link><description>Recent content in Cache-Optimization on Xi's Blog</description><generator>Hugo -- 0.154.5</generator><language>en-us</language><lastBuildDate>Mon, 15 Apr 2024 00:05:14 -0400</lastBuildDate><atom:link href="https://xichen1997.github.io/tags/cache-optimization/index.xml" rel="self" type="application/rss+xml"/><item><title>HPC-1-divide-and-conquer-block-matrix-algorithm</title><link>https://xichen1997.github.io/posts/2023-04-14-hpc1-divide-and-conquer-block-matrix/</link><pubDate>Mon, 15 Apr 2024 00:05:14 -0400</pubDate><guid>https://xichen1997.github.io/posts/2023-04-14-hpc1-divide-and-conquer-block-matrix/</guid><description>&lt;h1 id="week-2-block-matrix-algorithm"&gt;Week 2: block matrix algorithm&lt;/h1&gt;
&lt;h1 id="1-blis-reference-high-performance-implitation-vs-naive-methods"&gt;1. BLIS reference high performance implitation v.s. naive methods:&lt;/h1&gt;
&lt;p&gt;&lt;img alt="img" loading="lazy" src="https://raw.githubusercontent.com/xichen1997/picture_for_blog/master/Plot_All_Orderings.png"&gt;&lt;/p&gt;
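&lt;p&gt;As a hypothetical sketch of what the plot compares (the function names and test matrices are my illustration, not code from the exercises): every ordering of the naive triple loop computes the same product, but in the column-major C version the JPI ordering keeps the innermost loop moving down a column, the unit-stride direction, while IJP strides across rows and wastes cache lines. Python nested lists are used here only to show the loop structure.&lt;/p&gt;

```python
# Illustrative sketch (not the exercises' code): the same m x n x p matrix
# product under two loop orderings. Only the loop nesting differs; the
# locality comments describe the column-major C version of each loop.

def gemm_ijp(C, A, B, m, n, p):
    # IJP: the innermost loop walks j across a row of C and B. In
    # column-major storage that is a long-stride access pattern,
    # so spatial locality is poor.
    for i in range(m):
        for q in range(p):
            for j in range(n):
                C[i][j] += A[i][q] * B[q][j]

def gemm_jpi(C, A, B, m, n, p):
    # JPI: the innermost loop walks i down a column of C and A. In
    # column-major storage that is unit stride, so each cache line
    # fetched is fully used.
    for j in range(n):
        for q in range(p):
            for i in range(m):
                C[i][j] += A[i][q] * B[q][j]
```

Both functions produce identical results; the plot above shows how much the memory-access pattern alone separates their performance.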
&lt;h1 id="2-with-different-block-size"&gt;2. With different block size:&lt;/h1&gt;
&lt;p&gt;This run uses block sizes MB = NB = PB = 40.&lt;/p&gt;
&lt;p&gt;If the block size is too small, however, performance drops below even the naive PJI ordering.&lt;/p&gt;
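&lt;p&gt;A minimal sketch of the blocking itself (assuming square MB = NB = PB tiles and illustrative function names; the actual exercise code is in C): the outer JIP loops step over tiles of C, A, and B, and each tile is handed to a JPI inner kernel whose working set is small enough to stay in cache.&lt;/p&gt;

```python
# Illustrative sketch of blocked matrix multiplication, C += A * B,
# with square tiles. Block sizes follow the post; min() trims the
# edge tiles when the dimensions are not multiples of the block size.
MB = NB = PB = 40

def gemm_jpi_tile(C, A, B, i0, j0, q0, mb, nb, pb):
    """Inner JPI kernel over one tile: C[i][j] += A[i][q] * B[q][j]."""
    for j in range(j0, j0 + nb):
        for q in range(q0, q0 + pb):
            for i in range(i0, i0 + mb):
                C[i][j] += A[i][q] * B[q][j]

def gemm_blocked(C, A, B, m, n, p):
    """Outer JIP loops over tiles; each kernel call reuses a cached tile."""
    for j in range(0, n, NB):
        for i in range(0, m, MB):
            for q in range(0, p, PB):
                gemm_jpi_tile(C, A, B, i, j, q,
                              min(MB, m - i), min(NB, n - j), min(PB, p - q))
```

With tiles this size, each kernel invocation touches roughly three 40x40 double-precision tiles, about 37 KB, which fits comfortably in a typical L1/L2 cache; that is where the MB = NB = PB = 40 curve's advantage comes from.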
&lt;p&gt;The ordering of the outer block loops (JIP) has little effect on performance, because the processor spends nearly all of its time inside each per-block kernel call. Register allocation and compiler optimization concentrate on the innermost loops of the Gemm_JPI function, not on the block-level loops of the blocked matrix-matrix multiplication.&lt;/p&gt;</description></item></channel></rss>