HP Chromebox vivisection – reading/writing ROM chip data

I’ve got a Chromebox and I really enjoy this piece of hardware. It is a beautiful, powerful and yet very cheap x86 computer. I bought one at Amazon for $139 and I doubt you can find a better computer for this price (eBay has a lot of interesting Chromebox offers). What attracts me is that this computer has open firmware, and this is a good chance for me to learn more about it. The ChromeOS version of coreboot is freely available in the ChromeOS repository.

But before doing any firmware development one should understand that recovering a bricked computer is slightly more complicated than recovering a broken operating system. Once your computer’s firmware is corrupted you cannot boot it from disk or from a USB stick. One has to physically connect to the ROM chip pins and write new firmware using a programmer. Today I’ll show how to disassemble the Chromebox, connect to its ROM chip, and read and write the firmware code. Disclaimer: opening the device voids its warranty.

To connect to the SPI flash you need the following hardware:

  • A SOIC-8 test clip to connect to the SPI flash chip (I use a Pomona 5250).
  • A ROM programmer that provides an interface between the host computer and the chip over the SPI protocol. I use a Bus Pirate, which is a powerful piece of hardware, but you can use any other programmer.
  • A bunch of jumper cables.

Ok, let’s start the vivisection. Remove the four rubber pads at the bottom of the device.

IMG_20141230_170607

IMG_20141230_170856

Remove the four screws and carefully remove the bottom plastic cover.

IMG_20141230_171153

Now you need to remove the metal shield. Gently detach the metal mesh glued to the shield.

IMG_20141230_171328

Remove the 5 screws that attach the shield to the case, then carefully lift the shield off. Now you can see the motherboard, but there is no SPI flash on this side 😦 it must be on the other side of the board.

IMG_20150103_064319

Remove 5 more screws that hold the motherboard in the case. Remove the small metal shield on top of the power connector and then pull the motherboard out of the case. Be careful: a cable attaches the motherboard to the case sensors.

IMG_20150103_064835

And here is the SPI flash chip that we were looking for: a Winbond W25Q64FVSIG in a SOIC package.

IMG_20150103_064941

Now connect the Bus Pirate pins to the test clip pins as follows:

Bus Pirate    Flash chip
CS            CS
MISO          DO (IO1)
GND           GND
3V3           VCC
CLK           CLK
MOSI          DI (IO0)

Refer to the Winbond W25Q64 datasheet for the flash pin locations.

IMG_20150110_141012

Attach the Bus Pirate to your host computer and then run flashrom from the host. Reading 8MB over SPI takes a while (about 10 minutes).

$ flashrom -p buspirate_spi:dev=/dev/ttyUSB0 -r chromebox_fw.bin

flashrom v0.9.7-r1711 on Linux 3.18.2-2-ARCH (x86_64)
flashrom is free software, get the source code at http://www.flashrom.org

Calibrating delay loop… OK.
Found Winbond flash chip "W25Q64.V" (8192 kB, SPI) on buspirate_spi.
Reading flash… done.

Here we are. We were able to read the ROM chip contents. To write a saved firmware image back to the chip, use flashrom’s -w flag:

$ flashrom -p buspirate_spi:dev=/dev/ttyUSB0 -w chromebox_fw.bin

Building ChromeOS kernel without chroot

If you work on the ChromeOS kernel you are expected to use the chroot environment for development. This is a kind of container where all the needed development packages are installed.

But today we’ll try to make a kernel development environment that does not use chroot. Why? There are several reasons: the chroot is a large, complex setup that requires a lot of sources to be downloaded and built, and it is more complicated and slower than “normal” kernel development where you just need make. I also wanted to learn how the ChromeOS kernel is built and installed on the target machine, and writing such a script is a great exercise.

Note that you still need the chroot if you are going to create a recovery USB image. This image contains the full ChromeOS distribution, including the kernel and all the required userspace tools. You need it to boot and set up the target machine environment the first time. Make sure that you build the image with development options enabled. I usually build the recovery image with the following flags:

./build_image --noenable_rootfs_verification test --enable_serial=ttyS0

In this article I’ll describe how I set up a simple chroot-less development environment for the ChromeOS kernel on my Arch Linux machine. I am going to build the kernel for an arm64 target platform. If you have a 32-bit ARM or x86 platform you’ll need to modify the instructions accordingly.

First you need a cross-toolchain for your board. I use an arm64 board, so I need an aarch64 toolchain. Arch has it in the AUR repository. Install the toolchain with

yaourt -S aarch64-linux-gnu-gcc aarch64-linux-gnu-binutils aarch64-linux-gnu-gdb

It will take a while, so take your coffee break.

In addition to the toolchain you need to install the other ChromeOS development tools we will be using:

yaourt -S hdctools-git dtc vboot-utils uboot-tools bc

Now let’s compile the kernel. First you need to create the kernel .config file that tells the build system what features/drivers we want to compile:

chromeos/scripts/prepareconfig chromiumos-arm64
yes "" | make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- oldconfig

then compile it:

make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu-

You’ll get a vmlinux binary. Now we need to create a U-Boot format (FIT) image that contains the kernel plus device tree information. A device tree (dts) is a hardware configuration description that tells the kernel what hardware chips are present on the board, how these chips are connected and what drivers should be used.

mkimage -D '-I dts -O dtb -p 1024' -f kernel.its kernel.img

kernel.its is a file that tells mkimage where to find the kernel and dts files.
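A kernel.its is an image source file in device-tree syntax, as used by U-Boot FIT images. A minimal sketch for an arm64 board might look like the following (the image paths, node names and hash settings here are illustrative, not taken from this post):

```
/dts-v1/;

/ {
    description = "ChromeOS arm64 kernel";

    images {
        kernel@1 {
            description = "kernel";
            data = /incbin/("arch/arm64/boot/Image");
            type = "kernel_noload";
            arch = "arm64";
            os = "linux";
            compression = "none";
            load = <0>;
            entry = <0>;
        };
        fdt@1 {
            description = "board device tree blob";
            data = /incbin/("arch/arm64/boot/dts/myboard.dtb");
            type = "flat_dt";
            arch = "arm64";
            compression = "none";
            hash@1 {
                algo = "sha1";
            };
        };
    };

    configurations {
        default = "conf@1";
        conf@1 {
            kernel = "kernel@1";
            fdt = "fdt@1";
        };
    };
};
```

To support several boards with one image you would add more fdt@N nodes and one conf@N per board.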

The image may contain multiple device tree configurations and thus can be booted on multiple boards that share a similar architecture. The board firmware (in our case coreboot) knows which device tree should be used to configure the kernel.

The next step is to create signed ChromeOS kernel binary. It requires several additional files:

futility vbutil_kernel --pack kernel.bin \
  --keyblock kernel.keyblock \
  --signprivate kernel_data_key.vbprivk \
  --version 1 \
  --config config.txt \
  --bootloader bootloader.bin \
  --vmlinuz kernel.img \
  --arch aarch64

Finally you have kernel.bin, which we are going to write to the target machine’s kernel partition.

The storage device (eMMC) on my machine is /dev/mmcblk0. To examine the ChromeOS partition table run cgpt show /dev/mmcblk0. You’ll see that there are three pairs of ChromeOS kernel/rootfs partitions. Having several kernel/rootfs pairs is very handy for development. I usually keep KERN-A/ROOT-A pristine and deploy my changes to KERN-B/ROOT-B. Sometimes my development kernel does not work as expected or crashes; then coreboot reboots into the known-good kernel (KERN-A).

It is interesting to learn how coreboot chooses the partition to boot from. Each ChromeOS kernel GPT partition carries additional flags used by coreboot: “successful”, “tries” and “priority”. Coreboot scans all partitions on all storage devices, selects those whose data looks like a valid kernel, then picks the one with the highest “priority” value and boots it. A partition must have either the “successful” or the “tries” field non-zero. If “successful” is zero, coreboot decrements “tries” and writes it back to storage. We use this feature to boot our development kernel: set the KERN-B “priority” to a large value (e.g. 15), “successful” to zero and “tries” to one.

cgpt add -i 4 -S 0 -T 1 -P 15 /dev/mmcblk0

Thus coreboot will load KERN-B once and update “tries” to zero. If my development kernel crashes, or if I simply reboot the target machine, the stable KERN-A will be used. Isn’t it smart? Here is the code that chooses the partition (the LoadKernel() function).

The kernel boot options contain “root=PARTUUID=%U/PARTNROFF=1”, which says the root partition is the one right after the kernel partition. If the kernel is /dev/mmcblk0p2 then root is /dev/mmcblk0p3, and so on.
The last thing to do is to copy the kernel modules to the target machine. If the kernel and the modules were compiled with different gcc versions you might hit an ABI incompatibility, and the kernel will crash when it tries to load a module. We “install” the kernel modules on the local machine, mount the remote /dev/mmcblk0p5 (ROOT-B) and then rsync them to the remote target:

make INSTALL_MOD_PATH=~/tmp/mytemproot ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- modules_install firmware_install

Reboot your target development board and it will boot into your custom kernel.

Here is another thing that makes life easier: password-less access to the board. We are going to use the testing ssh keys that the development image contains. Download the keys from the ChromeOS repository and then configure your ssh client with something like this:

Host devboard1
User root
IdentityFile /home/anatol/.ssh/chromeos_testing_rsa

Here devboard1 is the hostname I use for my arm64 board; I configured my router to map the board’s MAC address to ‘devboard1’. See ‘man ssh_config’ for more details about ssh client configuration. After configuring it I can ssh to my board with ‘ssh devboard1’ and no password is required. It is very convenient.

Here is a ready-made shell script that does everything we described above: builds the kernel, deploys it to a remote target machine, sets the GPT flags, etc. The script expects the following directory structure:

$DIR/
  configs/
    kernel.keyblock
    kernel_data_key.vbprivk
    config.txt
    kernel.its
  linux/
  update_kernel.sh

Then run update_kernel.sh $TARGET_MACHINE

CPU cache thrashing

When you work as a web developer you rarely think in terms of CPUs, registers and memory cells. Most modern developers know little about what is going on at the low level. Recently I decided to close this gap and started reading about microprocessor architecture and related topics. It is a really interesting area and I recommend learning more about it. You might want to start with the great book “Inside the Machine”.

Today I’ll show you an interesting effect of L1 cache thrashing. As you might have heard, a modern architecture has several types of memory with different speed and cost: hard drives, RAM, CPU caches (L3, L2, L1) and registers. Faster memory is more expensive, so its amount is a bounded resource. That is why developers employ a technique called caching: keep only the actively used information in the faster memory, closer to the processor.

The Level 1 CPU cache is a piece of fast memory, usually a few dozen KB in size (32KB in my case), that works at the same speed as the CPU. If the required information is not in this cache (nor in L2/L3) the CPU accesses RAM. RAM is not fast enough for the CPU: fetching information from it costs dozens to hundreds of CPU cycles. If there were no on-chip cache, a modern CPU would waste most of its time just waiting for responses from RAM. Thankfully we have one, and as Intel reports, ~90% of all memory accesses hit the L1 cache.

Very often memory access has a sequential pattern: if you accessed address X recently, there is a high chance you will access a nearby memory cell soon. That is why when the CPU fetches a cell from RAM into the cache it also fetches a few neighbouring bytes. This block is called a “cache line”; for simplicity this article assumes 16-byte lines (modern x86 CPUs actually use 64-byte lines). Address range 0-15 is line #0, 16-31 is line #1, 32-47 is line #2, and so on. When you fetch a byte at some address you fetch the whole cache line, and when you discard information from the cache you discard the whole line as well. Got it? This cache-line addressing scheme also reduces the complexity and price of the cache circuitry.

Another question is cache associativity. In a direct-mapped cache, a specific line of RAM can be stored in only one line of the L1 cache: RAM addresses 0-15 can be stored in line #0 only, 16-31 in line #1 only. I have 32K of L1 cache, so in my case RAM addresses 32768-32783 also map to cache line #0. In general, any address in the range from L1_SIZE*N + LINE_SIZE*K to L1_SIZE*N + LINE_SIZE*(K+1) - 1 maps to line #K. (Real Intel L1 caches are set-associative rather than direct-mapped, but the direct-mapped model is enough to show the effect.) Again, this is done to simplify the cache circuit.
Imagine a scenario where I try to copy 16 bytes from RAM address 0 to RAM address 32768. In C I would use naive code like *d++ = *s++. Let’s check what this code does:
1) fetch addresses 0-15 from RAM into line #0 of the cache
2) copy the byte at address 0 to a temporary register
3) now we need to copy this byte to address 32768, and for that addresses 32768-32783 must be fetched into the cache first. Their destination is also line #0, so the CPU discards the recently fetched data from line #0 and puts 32768-32783 into it.
4) copy the data from the temporary register to cell 0 of line #0 (which now represents address 32768).

The index variables increment and now the program wants to copy the byte at address 1 to address 32769. But for that it must discard line #0 again and refetch addresses 0-15 from RAM. This refetching repeats for every byte. Remember that fetching data from RAM is a very slow operation from the CPU’s point of view? Copying like this wastes a lot of CPU time and is called “cache thrashing”.

Here is a program in C that demonstrates cache thrashing. It fills a matrix of size 32K*32K with numbers. The most interesting part is the assignment
array[i][j] = i*j;
which on my machine takes 0.3 seconds, but if I swap i and j like this
array[j][i] = i*j;
it takes 4 seconds.

The CPU executes the same number of instructions; the only difference is the memory access pattern. The first example writes to memory serially: load 0-15 into cache line #0, change the values at addresses 0, 1, 2, 3, …, and only then flush the cache line to RAM. The second example thrashes the L1 cache: load 0-15 into line #0, write a value to cell 0, flush it, load 32768-32783 into line #0, write a value to address 32768, flush the line, load the next RAM address into line #0, write and flush it, and so on. These access patterns show a 10x difference in application performance.

This effect is called “cache thrashing”, and developers who work on highly efficient applications try to avoid it, e.g. by laying out frequently accessed data close together.

I love g++ error messages

#include <map>
#include <string>
#include <algorithm>
using namespace std;
int main( int argc, char *argv[] ){
    map< map<string,string>, map<string,string> > hoh;
    sort( hoh.begin(), hoh.end() );
}

Do you see the error here? How big do you think the error message is? Try compiling this with g++ and you’ll have a lot of joy reading the compiler output.

PS: Am I the only person who thinks that error reporting is seriously broken in the GNU tools (gcc, autotools, gmake, …)?

GTK tray example in Vala

If you are starting to implement your own GUI application I highly recommend looking at the Vala language. It is a killer language for GTK.

As a Vala example I decided to implement a small application that shows how to work with the system tray. The example shows how to handle left/right clicks and has a menu with an “about” window and an exit item. It is almost a real application 🙂

Ok, here is the promised code:

Why Tup build tool is great

I played with Tup build tool for a couple of weeks and I think that this is a really great build tool. And here is why:

  • It is god damn fast, especially for incremental builds. With my 20K-file project a null build takes 6 milliseconds. Compare this with SCons: for a project of this size it takes an hour. Your build is not restricted by the build tool anymore.
  • The Tupfile syntax is simple and very similar to make. Although I am not a big fan of make-like syntax, in most cases it is enough. Tup uses pretty much the same idea as make: “input files |> command |> output files”. It is easy to convert a build from make to Tup: I converted 6 of my personal projects in one day.
  • Tup provides strong build consistency. You can forget about “clean builds”: no matter what state you start building your project from, the final state is always the same. A detailed explanation can be found in this document, page 8.
  • It is written in pure C. The code is clean and very readable. After just a few hours of reading the Tup sources I was able to understand its basics.

The main selling point is Tup’s speed, and here is why it is so fast:

  • Tup stores the dependency graph in a local database (sqlite), which means it does not need to reparse all the Tupfiles on every run. Tup parses files only when they change.
  • As Tup has on-disk storage it does not need to keep the whole graph in RAM, so its memory consumption is really low.
  • Tup has a “file monitor” feature on some platforms. The monitor is an inotify daemon that listens for all file changes. With it you don’t need to scan your project for changes every time you run a build.
  • Tup finds the nodes that need to be rebuilt starting from the leaves of the graph. This is a time saver for incremental builds, as you only need to visit a small part of the dependency graph.

Try this tool. I am sure you will love it!

Benchmarking C++ build tools

Here is a good benchmark of various build tools (including Tup). I highly recommend taking a look at it.

http://retropaganda.info/~bohan/work/sf/psycle/branches/bohan/wonderbuild/benchmarks/time.xml

The benchmark results clearly show that Tup’s speed is great for incremental builds, and that scons/autotools totally suck for incremental non-recursive builds.