Xi’s Blog

Welcome to my blog! Here I share my thoughts on computer science, high-performance computing, and technology.

Welcome to my clubhouse.

Kernel comparison with a MMA in CUDA and near-SOTA/cuBLAS performance kernel

The project is hosted in the repository: CUDA-refresh. Introduction: The kernel is the “kernel” of CUDA: it directly influences compute efficiency, and it is the key to taking advantage of the GPU’s huge computational resources and memory bandwidth. Here is a brief refresher on CUDA’s compute and memory hierarchy and their influence on computational efficiency. ...

November 16, 2025 · 19 min · 3835 words · Xi Chen

Add array support for DSL on Minimal CPU

The project is hosted in the repository (section-4): minimal_CPU. Introduction: Currently, the compiler and DSL support basic operations such as assigning a value, resetting a value, addition, and subtraction. However, one important feature of the language is still missing: arrays. Arrays make it more convenient to read and write strings, and they can be used to build a mini terminal or shell for interacting with the simulated hardware (CPU). ...

September 19, 2025 · 4 min · 718 words · Xi Chen

DSL and Compiler Based on Minimal CPU

The project is hosted in the repository (section-2): minimal_CPU. Introduction: After implementing our minimal CPU, we can write machine code or assembly code to run programs. However, these low-level languages are not easy to read or maintain. We need a way to construct a high-level language that can be translated to machine code, making program development more accessible. Thus, the purpose of this section is to design a Domain-Specific Language (DSL) and its compiler with a complete build system. ...

July 14, 2025 · 5 min · 1015 words · Xi Chen

Learn from MVP: Minimal Instruction Set CPU

Introduction: The 6502 CPU program has been a great inspiration for understanding the foundations of computer science. It’s fascinating how basic boolean functions and transistors can form such a complex and beautiful system. However, even the 6502 CPU, with its 150+ opcodes, can be overwhelming for those trying to understand the fundamental principles of computing. The Importance of Minimal Viable Products: When learning complex systems, it’s crucial to start with a minimal viable product (MVP) - understanding the most essential components that make a program run. This approach led me to explore foundational theories and historical concepts in computing. ...

May 30, 2025 · 5 min · 984 words · Xi Chen

Running LLM on mac mini clusters, strategy and practice

Data Parallel: This is the most straightforward strategy, typically used when batch size > 1. It increases throughput by giving the system more data to process simultaneously. Pipeline Parallel: Due to VRAM or unified memory limitations in Mac minis, loading the entire model into memory isn’t always possible. Pipeline parallel is an effective strategy to reduce memory usage. The approach splits the model into several parts, loading them into memory sequentially. When running, it operates like a factory pipeline - data flows through the system as different model parts process it in sequence. ...

February 28, 2025 · 3 min · 578 words · Xi Chen
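The factory-pipeline idea from the Mac mini post above can be sketched in a few lines. This is only a single-process simulation under assumptions of mine: the names `split_into_stages` and `run_pipeline` are hypothetical, the "layers" are plain Python functions, and on a real cluster each stage would live on its own machine with activations sent over the network.

```python
from typing import Callable, List, Sequence, Tuple

Layer = Callable[[float], float]


def split_into_stages(layers: Sequence[Layer], num_stages: int) -> List[List[Layer]]:
    """Split the layer list into contiguous chunks, one chunk per (hypothetical) machine."""
    size, extra = divmod(len(layers), num_stages)
    stages, start = [], 0
    for s in range(num_stages):
        end = start + size + (1 if s < extra else 0)
        stages.append(list(layers[start:end]))
        start = end
    return stages


def run_pipeline(
    stages: List[List[Layer]], micro_batches: Sequence[float]
) -> Tuple[List[float], List[List[Tuple[int, int]]]]:
    """Simulate the pipeline tick by tick: at tick t, stage s handles micro-batch t - s."""
    activations = list(micro_batches)  # activations[i] is the current value of micro-batch i
    schedule = []                      # schedule[t] lists the (stage, micro_batch) pairs busy at tick t
    for tick in range(len(activations) + len(stages) - 1):
        busy = []
        for s, stage in enumerate(stages):
            i = tick - s
            if 0 <= i < len(activations):
                for layer in stage:
                    activations[i] = layer(activations[i])
                busy.append((s, i))
        schedule.append(busy)
    return activations, schedule


if __name__ == "__main__":
    toy_layers = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]
    outputs, schedule = run_pipeline(split_into_stages(toy_layers, 2), [1.0, 2.0, 3.0])
    print(outputs)
    for tick, busy in enumerate(schedule):
        print(tick, busy)
```

Each stage only ever holds its own chunk of layers, which is where the memory saving comes from; after the first few ticks every stage is busy with a different micro-batch.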

An interesting flipping coin question

Problem statement: You have a fair coin, which means each flip comes up tails (T) or heads (H) with probability 50%. You flip the coin until the pattern “THTHT” or “THHHT” appears. Which one is easier to get, and with what probability? First thought: Many people intuitively think the two are equally likely, because “T” and “H” have the same probability, so each pattern would have probability $\left(\frac{1}{2}\right)^5$. ...

October 26, 2024 · 3 min · 499 words · Xi Chen
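One way to test the "first thought" in the coin question above is to compute and simulate it. The sketch below is not taken from the post; the function names `expected_wait` and `race` are mine. It uses the standard overlap rule for a fair coin (add $2^k$ for every $k$ where the pattern's first $k$ symbols equal its last $k$) and a small Monte Carlo race between the two patterns.

```python
import random


def expected_wait(pattern: str) -> int:
    """Expected number of fair-coin flips until `pattern` first appears.

    Overlap rule: add 2**k for every k where the first k characters
    of the pattern equal its last k characters.
    """
    n = len(pattern)
    return sum(2 ** k for k in range(1, n + 1) if pattern[:k] == pattern[n - k:])


def race(a: str, b: str, trials: int = 20_000, seed: int = 0) -> float:
    """Estimate the probability that pattern `a` appears before pattern `b`."""
    rng = random.Random(seed)
    keep = max(len(a), len(b))  # only the most recent flips can complete a pattern
    wins_a = 0
    for _ in range(trials):
        recent = ""
        while True:
            recent = (recent + rng.choice("HT"))[-keep:]
            if recent.endswith(a):
                wins_a += 1
                break
            if recent.endswith(b):
                break
    return wins_a / trials


if __name__ == "__main__":
    print(expected_wait("THTHT"))  # 42
    print(expected_wait("THHHT"))  # 34
    print(race("THTHT", "THHHT"))  # estimated share of races won by THTHT
```

The two patterns have different expected waiting times because “THTHT” overlaps with itself (“THT” and “T”) while “THHHT” only overlaps on “T”, so the two are not interchangeable even though every single 5-flip window has probability $\left(\frac{1}{2}\right)^5$.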

Set up remote access for my backup desktop to serve as remote development machine

Background: After I configured my new desktop, the previous one was left unused because of electricity costs. But a friend told me that electricity at his apartment is free, so after discussing it with him I decided to move my previous desktop to his living room and set it up as a remote server to reduce my spending on cloud servers. Steps. Step 1: clean the computer. This step was quite simple: just download the Ubuntu 24.04 LTS image and install it on the NVMe SSD. ...

September 7, 2024 · 3 min · 560 words · Xi Chen

When will ChatGPT let you down? After 6 hours on a project

Introduction: Nowadays, ChatGPT has been integrated into every aspect of our lives and work. It greatly enhances the capacity not only of individual developers, but also of staff at tech giants. It also has great potential in education. In my opinion, it’s like a very, very patient personal teacher that can answer many strange questions at any time and is willing to discuss the details with you. And it possesses almost all human knowledge, akin to an Aristotle of the 21st century. ...

July 21, 2024 · 6 min · 1083 words · Xi Chen

HPC-3-use-openmp(shared-memory-method)

Introduction to HPC: shared-memory parallelism using OpenMP. 1. The multicore system: the relationship with the L1-L3 caches. The L3 cache is shared, but every core has its own L1 and L2 caches. 2. Using OpenMP: #include "omp.h" Before using it, we need to define how many threads we want to use. On a Unix system: export OMP_NUM_THREADS=4 The directive: #pragma omp parallel If we put this directive before a line of code or a block, that line or block will be executed once by each thread, that is, $OMP_NUM_THREADS times in total. ...

April 16, 2024 · 4 min · 684 words · Xi Chen

HPC-1-divide-and-conquer-block-matrix-algorithmr

Week 2: block matrix algorithm. 1. BLIS reference high-performance implementation vs. naive methods: 2. With different block sizes: Here MB = NB = PB = 40. If the block size is too small, however, performance is worse than the naive PJI ordering. The order of the outer block loops (JIP) has little effect on performance, because most of the time is spent inside each block. In other words, the optimization effort goes into the innermost loops, the Gemm_JPI function, while the outer loops of the blocked matrix-matrix multiplication are neither parallelized nor optimized. ...

April 15, 2024 · 9 min · 1852 words · Xi Chen
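The loop structure described in the block matrix post above can be illustrated with a small sketch. It is written in plain Python purely to show the ordering, not the performance: the outer loops walk the blocks in J, I, P order and the inner function, named after the post's Gemm_JPI, uses the J, P, I order. The signatures, the list-of-lists matrix layout, and the `naive_gemm` checker are assumptions of mine, not the course's reference code.

```python
def gemm_jpi(A, B, C, i_range, j_range, p_range):
    """Update one block: C[i][j] += A[i][p] * B[p][j], with loops ordered J, P, I."""
    for j in j_range:
        for p in p_range:
            b_pj = B[p][j]
            for i in i_range:
                C[i][j] += A[i][p] * b_pj


def blocked_gemm(A, B, mb=40, nb=40, pb=40):
    """Return A @ B for list-of-lists matrices, walking the blocks in J, I, P order."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0] * n for _ in range(m)]
    for j0 in range(0, n, nb):
        for i0 in range(0, m, mb):
            for p0 in range(0, k, pb):
                # min(...) handles edge blocks when a dimension is not a multiple of the block size
                gemm_jpi(
                    A, B, C,
                    range(i0, min(i0 + mb, m)),
                    range(j0, min(j0 + nb, n)),
                    range(p0, min(p0 + pb, k)),
                )
    return C


def naive_gemm(A, B):
    """Reference triple loop, used only to check the blocked version."""
    m, k, n = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(n)] for i in range(m)]


if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    print(blocked_gemm(A, B, mb=1, nb=1, pb=1))  # [[19, 22], [43, 50]]
```

Changing `mb`, `nb`, and `pb` never changes the result, only how the work is grouped, which is why block size can be tuned freely for cache behavior in a compiled implementation.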