<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Posts on Xi's Blog</title><link>https://xichen1997.github.io/posts/</link><description>Recent content in Posts on Xi's Blog</description><generator>Hugo -- 0.154.5</generator><language>en-us</language><lastBuildDate>Sun, 16 Nov 2025 00:06:00 +0000</lastBuildDate><atom:link href="https://xichen1997.github.io/posts/index.xml" rel="self" type="application/rss+xml"/><item><title>Kernel comparison with an MMA in CUDA and near-SOTA/cuBLAS performance kernel</title><link>https://xichen1997.github.io/posts/2025-11-17-introduction-to-cuda-and-high-performance-cuda-kernels/</link><pubDate>Sun, 16 Nov 2025 00:06:00 +0000</pubDate><guid>https://xichen1997.github.io/posts/2025-11-17-introduction-to-cuda-and-high-performance-cuda-kernels/</guid><description>&lt;h1 id="kernel-comparison-with-a-mma-in-cuda-and-near-sotacublas-performance-kernel"&gt;Kernel comparison with an MMA in CUDA and near-SOTA/cuBLAS performance kernel&lt;/h1&gt;
&lt;p&gt;The project is hosted in the repository:
&lt;a href="https://github.com/xichen1997/CUDA-refresh"&gt;CUDA-refresh&lt;/a&gt;&lt;/p&gt;
&lt;h2 id="introduction"&gt;Introduction&lt;/h2&gt;
&lt;p&gt;The kernel is the central concept in CUDA: it directly influences compute efficiency, and it&amp;rsquo;s the key to taking advantage of the GPU&amp;rsquo;s enormous computational resources and memory bandwidth.&lt;/p&gt;
&lt;p&gt;Here is a brief refresher on the CUDA compute and memory hierarchy and their influence on computation efficiency.&lt;/p&gt;</description></item><item><title>Add array support for DSL on Minimal CPU</title><link>https://xichen1997.github.io/posts/2025-09-19-add-array-support-for-dsl-based-on-minimal-cpu/</link><pubDate>Fri, 19 Sep 2025 00:06:00 +0000</pubDate><guid>https://xichen1997.github.io/posts/2025-09-19-add-array-support-for-dsl-based-on-minimal-cpu/</guid><description>&lt;p&gt;The project is hosted in the repository (section-4):
&lt;a href="https://github.com/xichen1997/minimal_turing_complete_CPU"&gt;minimal_CPU&lt;/a&gt;&lt;/p&gt;
&lt;h2 id="introduction"&gt;Introduction&lt;/h2&gt;
&lt;p&gt;Currently, the compiler and DSL support basic operations like assigning values, resetting values, addition, and subtraction. However, one important part of the language is still unsupported: arrays. The purpose of arrays is to make reading and writing strings more convenient, and they can be used to construct a mini-terminal or shell to interact with the simulated hardware (CPU).&lt;/p&gt;</description></item><item><title>DSL and Compiler Based on Minimal CPU</title><link>https://xichen1997.github.io/posts/2025-07-14-dsl-and-compiler-based-on-minimal-cpu/</link><pubDate>Mon, 14 Jul 2025 00:06:00 +0000</pubDate><guid>https://xichen1997.github.io/posts/2025-07-14-dsl-and-compiler-based-on-minimal-cpu/</guid><description>&lt;p&gt;The project is hosted in the repository (section-2):
&lt;a href="https://github.com/xichen1997/minimal_turing_complete_CPU"&gt;minimal_CPU&lt;/a&gt;&lt;/p&gt;
&lt;h2 id="introduction"&gt;Introduction&lt;/h2&gt;
&lt;p&gt;After implementing our minimal CPU, we can write machine code or assembly code to run programs. However, these low-level languages are not easy to read or maintain. We need a way to construct a high-level language that can be translated to machine code, making program development more accessible.&lt;/p&gt;
&lt;p&gt;Thus, the purpose of this section is to design a Domain-Specific Language (DSL) and its compiler with a complete build system.&lt;/p&gt;</description></item><item><title>Learn from MVP: Minimal Instruction Set CPU</title><link>https://xichen1997.github.io/posts/2025-05-30-learn-from-mvp-minimal-instruction-set-cpu/</link><pubDate>Fri, 30 May 2025 00:05:00 +0000</pubDate><guid>https://xichen1997.github.io/posts/2025-05-30-learn-from-mvp-minimal-instruction-set-cpu/</guid><description>&lt;h2 id="introduction"&gt;Introduction&lt;/h2&gt;
&lt;p&gt;The 6502 CPU has been a great inspiration for understanding the foundations of computer science. It&amp;rsquo;s fascinating how basic boolean functions and transistors can form such a complex and beautiful system. However, even the 6502 CPU, with its 150+ instructions, can be overwhelming for those trying to understand the fundamental principles of computing.&lt;/p&gt;
&lt;h2 id="the-importance-of-minimal-viable-products"&gt;The Importance of Minimal Viable Products&lt;/h2&gt;
&lt;p&gt;When learning complex systems, it&amp;rsquo;s crucial to start with a minimal viable product (MVP) - understanding the most essential components that make a program run. This approach led me to explore foundational theories and historical concepts in computing.&lt;/p&gt;</description></item><item><title>Running LLM on mac mini clusters, strategy and practice</title><link>https://xichen1997.github.io/posts/2025-02-27-running-llm-on-mac-mini-clusters-strategy-and-practice/</link><pubDate>Fri, 28 Feb 2025 00:05:00 +0000</pubDate><guid>https://xichen1997.github.io/posts/2025-02-27-running-llm-on-mac-mini-clusters-strategy-and-practice/</guid><description>&lt;h1 id="running-llm-on-mac-mini-clusters-strategy-and-practice"&gt;Running LLM on mac mini clusters, strategy and practice&lt;/h1&gt;
&lt;h2 id="data-parallel"&gt;Data Parallel&lt;/h2&gt;
&lt;p&gt;This is the most straightforward strategy, typically used when batch size &amp;gt; 1. It increases throughput by giving the system more data to process simultaneously.&lt;/p&gt;
&lt;h2 id="pipeline-parallel"&gt;Pipeline Parallel&lt;/h2&gt;
&lt;p&gt;Due to VRAM or unified memory limitations in Mac minis, loading the entire model into memory isn&amp;rsquo;t always possible. Pipeline parallel is an effective strategy to reduce memory usage.&lt;/p&gt;
&lt;p&gt;The approach splits the model into several parts, loading them into memory sequentially. When running, it operates like a factory pipeline - data flows through the system as different model parts process it in sequence.&lt;/p&gt;</description></item><item><title>An interesting flipping coin question</title><link>https://xichen1997.github.io/posts/2024-10-26-an-interesting-flipping-coin-question/</link><pubDate>Sat, 26 Oct 2024 00:05:00 +0000</pubDate><guid>https://xichen1997.github.io/posts/2024-10-26-an-interesting-flipping-coin-question/</guid><description>&lt;h1 id="an-interesting-flipping-coin-question"&gt;An interesting flipping coin question&lt;/h1&gt;
&lt;h2 id="problem-statement"&gt;Problem statement&lt;/h2&gt;
&lt;p&gt;You have a fair coin: each flip comes up tails (T) or heads (H), each with probability 50%.&lt;/p&gt;
&lt;p&gt;You flip the coin until the pattern &amp;ldquo;THTHT&amp;rdquo; or &amp;ldquo;THHHT&amp;rdquo; appears. Which one is easier to get, and with what probability?&lt;/p&gt;
&lt;h2 id="first-thought"&gt;First thought&lt;/h2&gt;
&lt;p&gt;Intuitively, many people think the two patterns are identical because &amp;ldquo;T&amp;rdquo; and &amp;ldquo;H&amp;rdquo; are equally likely to appear, so each pattern should occur with probability $\left(\frac{1}{2}\right)^5$.&lt;/p&gt;</description></item><item><title>Set up remote access for my backup desktop to serve as remote development machine</title><link>https://xichen1997.github.io/posts/2024-09-07-set-up-remote-access-for-my-backup-desktop-to-serve-as-remote-develoement-machine/</link><pubDate>Sat, 07 Sep 2024 00:05:00 +0000</pubDate><guid>https://xichen1997.github.io/posts/2024-09-07-set-up-remote-access-for-my-backup-desktop-to-serve-as-remote-develoement-machine/</guid><description>&lt;h1 id="background"&gt;Background&lt;/h1&gt;
&lt;p&gt;After configuring my new desktop, the previous one sat unused because of electricity costs. But my friend told me that electricity was free in his apartment, so after discussing it with him I decided to move the old desktop into his living room and set it up as a remote server, reducing my spending on cloud servers.&lt;/p&gt;
&lt;h1 id="steps"&gt;Steps&lt;/h1&gt;
&lt;h2 id="step-1-clean-the-computer"&gt;Step 1: clean the computer&lt;/h2&gt;
&lt;p&gt;This step was quite simple: just download the Ubuntu 24.04 LTS image and install it on the NVMe SSD.&lt;/p&gt;</description></item><item><title>When ChatGPT will let you down? After 6 hours on a project</title><link>https://xichen1997.github.io/posts/2024-07-21-when-chatgpt-will-let-you-down-after-6-hours-on-a-project/</link><pubDate>Sun, 21 Jul 2024 00:05:14 -0400</pubDate><guid>https://xichen1997.github.io/posts/2024-07-21-when-chatgpt-will-let-you-down-after-6-hours-on-a-project/</guid><description>&lt;h1 id="introduction"&gt;Introduction&lt;/h1&gt;
&lt;p&gt;Nowadays, ChatGPT has been integrated into every aspect of our lives and work. It greatly enhances the capabilities not only of individual developers, but also of staff at giant tech companies.&lt;/p&gt;
&lt;p&gt;It also has great potential in education. In my opinion, it&amp;rsquo;s more like a very, very patient personal teacher who can answer many strange questions at any time and is willing to discuss the details with you. And it possesses almost all human knowledge, akin to an Aristotle of the 21st century.&lt;/p&gt;</description></item><item><title>HPC-3-use-openmp(shared-memory-method)</title><link>https://xichen1997.github.io/posts/2024-04-16-hpc3-openmp-shared-memory-method/</link><pubDate>Tue, 16 Apr 2024 00:05:14 -0400</pubDate><guid>https://xichen1997.github.io/posts/2024-04-16-hpc3-openmp-shared-memory-method/</guid><description>&lt;h1 id="introduction-to-hpc-shared-memory-parallel-using-openmp"&gt;Introduction to HPC: shared-memory parallelism using OpenMP&lt;/h1&gt;
&lt;h2 id="1-the-multicore-system"&gt;1 The multicore system&lt;/h2&gt;
&lt;p&gt;&lt;img alt="image-20200520102045196" loading="lazy" src="https://raw.githubusercontent.com/OeuFcok/picture_for_blog/master/typora/20200520190314.png"&gt;&lt;/p&gt;
&lt;p&gt;The relationship between the cores and the L1&amp;ndash;L3 caches: the L3 cache is shared, but every core has its own L1 and L2 caches.&lt;/p&gt;
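&lt;p&gt;As an aside added here, on Linux you can inspect this cache hierarchy directly from the shell (a sketch only: getconf -a is a GNU extension and lscpu comes from util-linux, so neither is guaranteed on other systems):&lt;/p&gt;

```shell
# print the cache-related configuration values the C library reports
# (Linux-specific; names like LEVEL1_DCACHE_SIZE, LEVEL3_CACHE_SIZE)
getconf -a | grep -i cache

# a more readable summary of the cores and their caches
lscpu
```

&lt;p&gt;The lscpu output lists the L1, L2, and L3 cache sizes reported by the kernel.&lt;/p&gt;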
&lt;h2 id="2-using-openmp"&gt;2 Using openmp&lt;/h2&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-C++" data-lang="C++"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;#include&lt;/span&gt; &lt;span class="cpf"&gt;&amp;#34;omp.h&amp;#34;&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Before using it, we need to define how many threads we want to use:&lt;/p&gt;
&lt;p&gt;On a Unix system:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-shell" data-lang="shell"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nb"&gt;export&lt;/span&gt; &lt;span class="nv"&gt;OMP_NUM_THREADS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;The directive:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-C++" data-lang="C++"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="cp"&gt;#pragma omp parallel
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;If we put this pragma before a line of code or a block, that line or block will be executed by OMP_NUM_THREADS threads, i.e. OMP_NUM_THREADS times in total, once per thread.&lt;/p&gt;</description></item><item><title>HPC-1-divide-and-conquer-block-matrix-algorithm</title><link>https://xichen1997.github.io/posts/2023-04-14-hpc1-divide-and-conquer-block-matrix/</link><pubDate>Mon, 15 Apr 2024 00:05:14 -0400</pubDate><guid>https://xichen1997.github.io/posts/2023-04-14-hpc1-divide-and-conquer-block-matrix/</guid><description>&lt;h1 id="week-2-block-matrix-algorithm"&gt;Week 2: block matrix algorithm&lt;/h1&gt;
&lt;h1 id="1-blis-reference-high-performance-implitation-vs-naive-methods"&gt;1. BLIS reference high performance implitation v.s. naive methods:&lt;/h1&gt;
&lt;p&gt;&lt;img alt="img" loading="lazy" src="https://raw.githubusercontent.com/xichen1997/picture_for_blog/master/Plot_All_Orderings.png"&gt;&lt;/p&gt;
&lt;h1 id="2-with-different-block-size"&gt;2. With different block size:&lt;/h1&gt;
&lt;p&gt;This is with block sizes MB = NB = PB = 40.&lt;/p&gt;
&lt;p&gt;But if the block size is too small, the performance is not as good as the naive PJI ordering.&lt;/p&gt;
&lt;p&gt;The ordering of the outer JIP loops has little effect on performance, because the runtime is dominated by the work inside each block. That is, register-level optimization happens in the innermost loops, the Gemm_JPI function, while the outer loops over blocks in the block matrix-matrix multiplication are not where the optimization effort pays off.&lt;/p&gt;</description></item><item><title>HPC-2-Memory-hierarchy-in-computer</title><link>https://xichen1997.github.io/posts/2023-04-15-hpc2-memory-hierarchy/</link><pubDate>Mon, 15 Apr 2024 00:05:14 -0400</pubDate><guid>https://xichen1997.github.io/posts/2023-04-15-hpc2-memory-hierarchy/</guid><description>&lt;h1 id="hierarchy-memory"&gt;Memory Hierarchy&lt;/h1&gt;
&lt;h2 id="1-why-use-hierarchy-memory"&gt;1. Why use Hierarchy Memory&lt;/h2&gt;
&lt;p&gt;Registers are much faster than main memory; the difference is about two orders of magnitude. And the performance gap keeps growing, because CPU speed increases faster than main-memory speed.&lt;/p&gt;
&lt;p&gt;In this situation, fetching data from main memory too many times becomes very expensive. So we add a layer of memory that is faster than main memory but a little slower than registers: we call it the cache.&lt;/p&gt;</description></item><item><title>Several issues I met when plugging in a new RTX 4090 GPU for my workstation</title><link>https://xichen1997.github.io/posts/2024-02-24-several-issues-i-meet-when-plug-in-new-rtx-4090-gpu-for-my-workstation/</link><pubDate>Sat, 24 Feb 2024 00:05:14 -0400</pubDate><guid>https://xichen1997.github.io/posts/2024-02-24-several-issues-i-meet-when-plug-in-new-rtx-4090-gpu-for-my-workstation/</guid><description>&lt;h1 id="several-issues-i-meet-when-plug-in-new-rtx-4090-gpu-for-my-workstation"&gt;Several issues I met when plugging in a new RTX 4090 GPU for my workstation&lt;/h1&gt;
&lt;p&gt;Background hardware:&lt;/p&gt;
&lt;p&gt;AMD 7950X3D&lt;/p&gt;
&lt;p&gt;128GB(4x32GB) DDR5 6000MHz&lt;/p&gt;
&lt;p&gt;Gigabyte X670 ATX&lt;/p&gt;
&lt;p&gt;2TB Samsung storage&lt;/p&gt;
&lt;p&gt;Nvidia RTX 4090&lt;/p&gt;
&lt;h2 id="install-4090-hardware-to-the-workstation"&gt;Install 4090 hardware to the workstation&lt;/h2&gt;
&lt;p&gt;At first, the installation was smooth: just plug the card into the upper PCIe slot.&lt;/p&gt;
&lt;p&gt;However, I found that I had forgotten to install the support bracket for the RTX (it&amp;rsquo;s too big and heavy; it definitely needs the support).&lt;/p&gt;</description></item><item><title>The lessons I learned from setting up an API server via a Perl script</title><link>https://xichen1997.github.io/posts/2023-11-21-the-lessons-i-learned-from-setting-up-an-api-server-via-perl-script/</link><pubDate>Tue, 21 Nov 2023 21:05:14 -0400</pubDate><guid>https://xichen1997.github.io/posts/2023-11-21-the-lessons-i-learned-from-setting-up-an-api-server-via-perl-script/</guid><description>&lt;h2 id="problem-statement"&gt;Problem Statement&lt;/h2&gt;
&lt;p&gt;Recently, when I was trying to set up an API server in a Perl script to run some unit tests, I ran into a lot of trouble.&lt;/p&gt;
&lt;p&gt;In conclusion, the trouble can be classified into three parts:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The residual server processes can&amp;rsquo;t be killed properly.&lt;/li&gt;
&lt;li&gt;The program got stuck after launching the API server.&lt;/li&gt;
&lt;li&gt;stdin, stdout, and stderr handling on Windows.&lt;/li&gt;
&lt;/ol&gt;
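&lt;p&gt;For problem 1, a common Unix-side mitigation (an aside added here, not the fix developed in this post, which also has to work on Windows) is to match the leftover daemons by their command line and kill them:&lt;/p&gt;

```shell
# kill any leftover server processes whose full command line matches;
# "server.pl daemon" is the command from this post; adjust the pattern.
# "|| true" keeps the exit status 0 even when nothing matched.
pkill -f "server.pl daemon" || true
```

&lt;p&gt;Note that pkill -f matches against the whole command line, not just the process name, which is what makes it usable for interpreter-launched scripts like this one.&lt;/p&gt;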
&lt;h2 id="error-examples"&gt;Error examples&lt;/h2&gt;
&lt;h3 id="the-residual-server-processes-cant-be-killed"&gt;The residual server processes can&amp;rsquo;t be killed.&lt;/h3&gt;
&lt;div class="highlight"&gt;&lt;div class="chroma"&gt;
&lt;table class="lntable"&gt;&lt;tr&gt;&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code&gt;&lt;span class="lnt"&gt;1
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class="lntd"&gt;
&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-perl" data-lang="perl"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nb"&gt;system&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;perl&lt;/span&gt; &lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt; &lt;span class="n"&gt;daemon&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="n"&gt;http:&lt;/span&gt;&lt;span class="sr"&gt;//&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Using the command above in the main test script launches an API server in a child process. No residual server processes should be left over after the tests. At first I used &amp;ldquo;system&amp;rdquo; to run the command, and then killed the server in a different system command via:&lt;/p&gt;</description></item><item><title>Why I open this blog</title><link>https://xichen1997.github.io/posts/2023-10-01-why-i-open-this-blog/</link><pubDate>Thu, 28 Sep 2023 21:05:14 -0400</pubDate><guid>https://xichen1997.github.io/posts/2023-10-01-why-i-open-this-blog/</guid><description>&lt;p&gt;Feynman told us that the best way to learn something is to explain what you learned to others in concise and understandable words; if they understand it well, then you have mastered the knowledge.&lt;/p&gt;
&lt;p&gt;I have a habit of writing notes/docs for myself, which are hard for others to understand (even I would be lost after several weeks). So I decided to polish my work/study log and share what I think is valuable to others.&lt;/p&gt;</description></item></channel></rss>