Friday, February 20, 2015

A look at Raspberry Pi 2 performance and overclocking

Raspberry Pi 2 significantly improves on original model


The Raspberry Pi 2 significantly increases performance when compared to the original Raspberry Pi. It corrects deficiencies in the design of the original SoC used inside the Raspberry Pi by integrating four more modern and faster Cortex-A7 ARMv7 CPU cores in a quad-core configuration, as opposed to the single ARM11 core in the original SoC, all within the constraints of a similar 40 nm manufacturing process. Whereas the CPU inside the BCM2835 processor of the original Raspberry Pi effectively ran without an L2 cache (which was tied to the GPU), the new Broadcom BCM2836 SoC contains a dedicated 512 KB CPU cache, improving memory performance and performance in general. The amount of RAM has also doubled to 1 GB. Other changes include more USB ports and a MicroSD card slot for storage instead of SD.

Compatibility with Raspbian


Otherwise the new SoC as well as the device itself has been engineered to maintain hardware and software compatibility with the original Raspberry Pi, while running considerably faster. When using the Raspbian OS, an ARM11 compatible Debian-based distribution using armhf specifically maintained for the Raspberry Pi, only the kernel is specific to the Raspberry Pi 2 with the entire userland being 100% compatible. Although this misses out on some of the advantages of the newer ARMv7 instruction set (such as the reduced code size of Thumb2 instructions, which are used in ARMv7 Debian), applications that can take advantage of, for example, NEON SIMD instructions usually do so on a run-time detection basis (as they do in ARMv7 Debian), so that the most critical gains from the new instruction set can in theory be taken advantage of in Raspbian.

Nevertheless, the new device can run an OS specifically configured for ARMv7, such as Debian armhf and derived distributions such as Ubuntu, which take advantage of the reduced-size Thumb2 instruction set. An example of such a distribution that has been applied to the Raspberry Pi 2 is Ubuntu Snappy Core.

Components of Raspberry Pi 2 SoC clocked conservatively out of the box


The maximum CPU clock of the Cortex-A7 cores in the Raspberry Pi 2 is 900 MHz, while the L2 cache appears to be clocked at only 250 MHz by default, inheriting the clock rate of the original Pi's GPU cache. SDRAM is clocked at 450 MHz by default. The GPU is clocked at 250 MHz, similar to the original Raspberry Pi.

The configured speed of the L2 cache is particularly low, as we will see, since speeds up to 600 MHz seem to be stable when overclocking, resulting in a large performance increase. The CPU clock speed can also be bumped up somewhat.

The raspi-config utility in Raspbian at the time of writing contains just one overclocking option for the Raspberry Pi 2, which clocks the CPU at 1000 MHz, doubles L2 cache speed to 500 MHz and clocks SDRAM also at 500 MHz. Unfortunately, this setting turned out to be unstable on my device. This appears to be due to the SDRAM clock speed being set too high and causing problems. Bumping the SDRAM speed down to 483 MHz results in a stable system.

Overclocking test set-up


I have performed a number of overclocking tests with different clock configurations. The test set-up was as follows.

To prevent corruption of the root file system, I modified /etc/fstab to mount the root filesystem read-only at boot by adding "ro" to the mount flags. To remount with read-write capability when necessary after boot (on a stable system), I ran "sudo mount -o remount,rw /dev/mmcblk0p2 /".

The main stability test was performed using the single-threaded memtester package (available in Raspbian and Debian) using the command line "memtester 16M 10" (16 MB memory region, 10 loops). In several cases four of these commands were run in parallel to fully occupy the CPU and provide reliable stability information. In unstable configurations, this test almost always shows errors.
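
The parallel variant of this test can be scripted. The following is a minimal sketch rather than the exact procedure used for the results below; run_parallel is a hypothetical helper name, and only the memtester invocation quoted above is taken from the test set-up.

```shell
# Run N copies of a command in parallel and report failure if any copy fails.
# run_parallel is a hypothetical helper, not part of memtester.
run_parallel() {
    count=$1; shift
    pids=""
    i=0
    while [ "$i" -lt "$count" ]; do
        "$@" &                      # start one copy in the background
        pids="$pids $!"
        i=$((i + 1))
    done
    status=0
    for pid in $pids; do
        wait "$pid" || status=1     # any non-zero exit marks the run as failed
    done
    return $status
}

# On the Pi, four concurrent testers would be started with:
#   run_parallel 4 memtester 16M 10
```

The output of the concurrent processes is interleaved on the terminal, so redirecting each run to its own log file may be preferable in practice.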

Memory performance was tested using a slightly modified version of the fastarm package (https://www.github.com/hglm/fastarm) with the command line "for x in 0 1 2 3 4 5 6 7 8 9; do ./benchmark --duration 1 --repeat 1 --memcpy e --test 0; done". Because of result variation due to cache allocation effects, I took the best result out of ten. Tests number 0 (memcpy of varying size, aligned, depends on CPU as well as memory) and 43 (4K page-aligned memcpy, a more pure memory subsystem test) were used.
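
Picking the best of ten runs is easy to automate once the throughput figure of each run has been extracted. The sketch below assumes one numeric result per line on standard input; how the number is pulled out of the benchmark output is deliberately left open, and best_of is a hypothetical helper name.

```shell
# Print the largest of the numbers given one per line on standard input.
best_of() {
    sort -n | tail -n 1
}

# Example with three made-up run results (MB/s):
printf '702\n716\n710\n' | best_of
```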

For a real-world CPU performance indication I used the command line "time zcat bullet3-Bullet-2.83-alpha.tar.gz >/dev/null" performed multiple times, which is effectively gzip decompression of a large file out of buffer cache memory.

Table with stability testing results


The following table shows stability testing results for a large number of CPU clock, core clock (L2 cache clock), and SDRAM clock configurations. Also included are some benchmark scores, including memory performance and CPU performance.

CPU     +Volt   Core    SDRAM   +Volt   Stability       Memcpy perf.
                                p i c   (memtester)     Varied  4K      zcat

Default:
900     ?       250     450     0 0 0   OK (slow)       716     1015    2.388s
Standard overclock (raspi-config "Pi 2" option):
1000    2       500     500     0 0 0   Fail
Other settings:
900     0       450     450     0 0 0   OK              778     1270    2.380s
900     0       600     467     0 0 0   Almost          804     1431    2.379s
900     2       600     467     0 0 0   OK (multi-test)
1000    0       467     467     0 0 0   OK (multi-test) 867     1410    2.146s
1000    0       500     483     0 0 0   OK (multi-test) 880     1502    2.146s
1000    0       500     483     2 0 0   OK (multi-test) 878     1502    2.169s
1000    2       500     500     0 0 0   Almost
1000    4       500     500     0 0 0   Almost
1000    0       500     500     2 2 0   Almost
1000    0       500     500     4 4 0   Almost?
1000    0       500     500     4 0 0   Fail            886     1415    2.143s
1000    2       500     500     4 0 0   Fail
1000    4       500     500     4 4 0   Fail (multi)
1000    0       500     500     6 6 6   ?
1000    2       600     467     0 0 0   OK (multi-test) 885     1518    2.145s
1000    2       600     500     4 0 0   OK (multi-test) 890     1553    2.142s
1000    2       667     500     4 0 0   Fail (freeze)
1000    6       667     500     6 0 0   Fail (freeze)
1050    0       466     466     4 4 4   OK
1050    0       466     533     4 4 4   Fail
1050    0       466     533     6 6 6   Fail (bitspr.)
1050    4       600     450     0 0 0   OK (multi-test) 916     1528    2.045s
1050    4       600     483     2 0 0   OK (multi-test) 924     1571    2.041s
1067    6       533     533     6 6 6   Fail
1067    4       533     533     8 8 0   Fail (bitflip)
1067    6       533     533     8 8 0   Fail (bitflip)
1067    6       533     500     4 4 0   Almost
1067    4       533     466     0 0 0   OK (multi-test) 925     1521    2.010s
1100    0       466     466     0 0 0   Fail (boot)
1100    4       466     466     0 0 0   OK?
1100    4       600     467     0 0 0   Fail
1100    4       500     500     6 6 6   OK?
1100    4       500     500     6 6 0   OK?
1100    4       500     500     4 0 0   Almost
1100    4       500     500     6 0 0   OK?             950     1532    1.950s
1100    6       500     500     6 0 0   Almost
1100    4       533     533     6 0 4   Fail            962     1593    1.948s
1100    4       550     483     0 0 0   OK (multi-test) 944     1549    1.951s
1133    4       567     466     0 0 0   Almost          974     1578    1.893s
1133    4       567     467     4 0 0   Almost
1133    5       567     453     0 0 0   Almost          971     1571    1.896s
1133    8       567     453     0 0 0   Fail
1166    4       466     466     0 0 0   Almost          960     1451    1.841s
1167    4       466     466     2 2 4   Fail
1166    6       466     466     0 0 0   Fail            962     1451    1.841s
1167    8       500     500     4 0 0   Fail                            1.839s
1167    8       500     500     8 8 8   Fail
1200    8       600     450     4 0 0   Fail
The stable configurations show "OK (multi-test)" in the stability column, meaning they were stable during a test with multiple memtester processes running concurrently. Most unstable configurations have an SDRAM clock speed of 500 MHz or higher, or a CPU speed higher than 1100 MHz.

CPU frequency corresponds with the "arm_freq=" setting in /boot/config.txt. The CPU/main SoC voltage is set with the over_voltage setting. The core clock (the L2 cache speed on the Raspberry Pi 2) is set with core_freq. The SDRAM frequency is set with sdram_freq, while voltage settings for the SDRAM physical layer, I/O and controller are set using over_voltage_sdram_p, over_voltage_sdram_i and over_voltage_sdram_c, of which the physical layer voltage seems to be the most relevant to overclocking. An example of the relevant lines in /boot/config.txt for a particular overclocking configuration (1000 MHz CPU, with stable 483 MHz SDRAM, as well as 256 MB memory reserved for GPU) follows.
arm_freq=1000
over_voltage=0
core_freq=500
sdram_freq=483
over_voltage_sdram_p=0
over_voltage_sdram_i=0
over_voltage_sdram_c=0
gpu_mem=256
See the official documentation for more details.
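
When trying many configurations it helps to list only the active overclocking lines of the configuration file before rebooting. The following sketch does that with grep; show_overclock_settings is a hypothetical helper name, and the default path is the /boot/config.txt location mentioned above.

```shell
# Print the active (uncommented) overclocking-related settings of a config file.
# Defaults to /boot/config.txt; another path can be given as the first argument.
show_overclock_settings() {
    grep -E '^(arm_freq|core_freq|sdram_freq|gpu_mem|over_voltage(_sdram_[pic])?)=' \
        "${1:-/boot/config.txt}"
}

# On the Pi:
#   show_overclock_settings
```

After a reboot, the clock actually in effect can be compared against these values with the firmware's "vcgencmd measure_clock arm" command (not run as part of this sketch).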

Observations based on stability testing


The following is apparent from testing my device:
  • The core_freq setting seems to be directly correlated with the L2 CPU cache in the new SoC, which has a large effect on performance. Depending on other frequencies, core_freq frequencies up to 600 MHz seem to be stable, giving a significant performance boost over the default configuration of 250 MHz.
  • When increasing CPU speed beyond roughly 1000 MHz, the CPU core voltage has to be bumped up.
  • Increasing SDRAM speed beyond about 483 MHz seems to cause instability on my device. Bumping up the SDRAM voltage (in particular the physical layer voltage, but not the I/O voltage or SDRAM controller voltage) may help a little for potential stability. However, SDRAM speeds of 500 MHz and higher tend to cause stability problems regardless of voltages on my device.
  • Certain divisor relationships between CPU clock and core (L2 cache) clock (such as 2:1) seem to enhance stability and performance.

CPU overclocking conclusions


  • The default Raspberry Pi 2 core_freq (L2 CPU cache) setting of 250 MHz appears to be extremely conservative. At the default CPU frequency of 900 MHz, 450 MHz (which has a nice divisor of two) appears to be very stable and even 600 MHz can be stable.
  • Unfortunately, the standard Raspberry Pi 2 overclocking setting available in raspi-config at the time of writing (1000 MHz CPU, 500 MHz core clock, 500 MHz SDRAM) appears to be unstable on my device due to a SDRAM clock speed that is slightly too high. Instead of bumping the CPU voltage as performed by this setting, increasing the SDRAM voltage (primarily the physical layer voltage) may improve stability, but clocking the SDRAM slightly lower at 483 or 467 MHz seems to be the best solution.
  • It seems likely that certain SDRAM parameters (CAS delay, etc) are set to fixed values by the kernel and that higher SDRAM speeds will be possible when these parameters are configurable or appropriately adjusted by the kernel for higher SDRAM clock speeds. However, the actual RAM chip used is an Elpida/Micron EDB8132B4PB-8D-F LPDDR2-800 chip specified for 400 MHz clock frequency, so the overclocking headroom may not be that high.

Table with stable high-performance clock configurations


The following table shows stable high-performance clock configurations tested on my device and their clock frequency ratios:
CPU     Over-   Core    Base
clock   volt    clock   Clock   CPU : Core      SDRAM   Overv.

1067    +4      533     533     2 : 1           467
1050    +4      600     150     7 : 4           483     +2
1000    +2      600     100     5 : 3           500     +4
1000            500     500     2 : 1           483     +2
 900    +2      600     133     3 : 2           467
 900            450     450     2 : 1           450
However, I may have to retest the configuration with an SDRAM frequency of 500 MHz because other configurations show such a setting to be unstable after extensive testing. Additionally, the 1100 MHz CPU frequency setting turned out not to be completely stable.

Overclocking the GPU


By default, the Raspberry Pi as well as the Raspberry Pi 2 will use dynamic clocking, whereby the CPU speed, "core_freq" speed and SDRAM frequency are dynamically adjusted based on CPU load. Any GPU frequency settings, as governed by the "v3d_freq", "h264_freq" and "isp_freq" settings in config.txt, are ignored by default.

Using "force_turbo=1" allows overclocking of the GPU using the "v3d_freq", "h264_freq" and "isp_freq" options. "v3d_freq" corresponds to the frequency of the 3D block (the most relevant for overclocking), while "h264_freq" is the H.264 video block and "isp_freq" governs the camera interface. However, "force_turbo=1" also disables dynamic clocking, locking the CPU, core and SDRAM speeds to fixed maximum values, which is highly undesirable. Also note that using "force_turbo=1" may void the warranty of the device.

There is another setting, "avoid_pwm_pll=1", that allows "core_freq" to be set independently from that of the GPU on the original Raspberry Pi, at the cost of slightly reducing analog audio output quality. However, "force_turbo=1" is still required to be able to modify the GPU clock frequencies.

Because the Raspberry Pi 2 has an independent GPU with its own independent L2 cache separate from the L2 cache of the CPU, some of these limitations may have become unnecessary (in particular the requirement that the CPU is locked at a high speed with "force_turbo=1" in order to be able to overclock the GPU), and if that is the case these restrictions will hopefully be removed in the future.

When running 3D benchmarks, the following CPU and SDRAM settings were used (note that when using "force_turbo=1" to overclock the GPU, these frequencies are locked and do not scale down when the CPU is idle):
arm_freq=900
over_voltage=0
core_freq=450
sdram_freq=483
When running 3D GPU benchmarks without overclocking the GPU (force_turbo=0), it looks like the CPU / L2 cache frequencies are scaled down quickly because the CPU load is relatively low, negatively affecting the throughput of the 3D benchmarks because of a CPU bottleneck, resulting in an initial peak in fps dropping to a lower base. To avoid this, we modify the sampling_down_factor of the ondemand cpufreq governor from 50 to 1000:
sudo sh -c "echo 1000 >/sys/devices/system/cpu/cpufreq/ondemand/sampling_down_factor"
The following settings overclock the 3D block (V3D) of the GPU from 250 MHz to 300 MHz:
force_turbo=1
avoid_pwm_pll=1
v3d_freq=300
These are the results of benchmark testing with different V3D clock speeds:
v3d_freq        demo1   demo1   demo2   demo2   demo2   demo5   demo9   game
                        lights          lights  shadows
default          81.1    20.5    26.1     8.87    0.98   50.5    46.4   112
300              95.3            28.4     9.88    1.12   56.7    49.3   130
350             109*     27.4    29.9    10.9     1.24   62.3    51.6   148
400             120*     30.6    31.4    11.7     1.35   40-52*  53.5   108*
450              80*     33.7    20.2*   12.3     1.45   40-56*  55.0   111*
Although the clock frequency of either the CPU or the 3D block seemed to be scaled down in some cases at higher V3D speeds (presumably due to temperature measurements or voltage readings resulting in throttling), there were never any signs of stability issues when overclocking the GPU, up to the maximum tested speed of 450 MHz.

Regular dynamic downclocking of the CPU can occur due to USB power supply/cable issues


Initially, downclocking by the Raspberry Pi 2 kernel's under-voltage monitor seemed to be triggered a lot more frequently than it is on the original Raspberry Pi. This results in a rainbow-colored icon being displayed in the top-right corner of the screen. This even happens briefly during boot. On such occasions, presumably the CPU and other components are downclocked in order to ensure stability.

The rainbow-colored square suggests a power supply issue since it indicates a voltage that is too low. As it turns out, replacing the USB power cable I was using with a shorter one that is better insulated eliminates the under-voltage warnings, with the same 5V/2A power supply.

Updated 1 March 2015 (update explanation for CPU speed throttling).
Updated 25 March 2015 (update with USB power cable findings).

Sunday, February 15, 2015

Optimizing performance on the Raspberry Pi and Raspberry Pi 2

The Raspberry Pi is a popular platform that usually runs from a flash-based SD card as the root filesystem. Most of the tips from the previous article apply to the Raspberry Pi. They work best on the common 512 MB memory variant of Raspberry Pi Model B, rather than the early 256 MB version.

Raspberry Pi's standard Raspbian OS configuration by default has an extensive set of logging options enabled for rsyslog, and the file system is configured in ordered data mode. While such a configuration might be understandable from the viewpoint of system stability and error reporting, it is not beneficial for performance to say the least, causing a flash write access bottleneck for overall system performance.

Reducing logging activity, using tmpfs and using ramdisk for cache directories


The Raspberry Pi's rsyslog configuration is stored in /etc/rsyslog.conf. Commenting out the many logging rules by prefixing them with a '#' will eliminate logging activity.

The following lines added to /etc/fstab cause /tmp and /var/tmp to be stored in ramdisks:
    tmpfs    /tmp       tmpfs    defaults    0 0
    tmpfs    /var/tmp   tmpfs    defaults    0 0
To move cache directories to ramdisk, create the file /etc/profile.d/xdg_cache_home.sh with the following content:
    #!/bin/bash
    export XDG_CACHE_HOME="/dev/shm/.cache"

Optimizing the file system


The Raspberry Pi does not accept the journal_async_commit mount option. However, write-back mode can be enabled and barriers can be disabled. Note that the risk of file system corruption with these settings is greater for the Raspberry Pi than for battery-powered devices.

It is best to store the performance options as default mount options in the filesystem itself using tune2fs. Data write-back mode (as opposed to the default ordered data mode) and no barriers are configured with the following command, assuming the root file system is stored in the /dev/mmcblk0p2 partition as it is on Raspbian:
    sudo tune2fs -o journal_data_writeback,nobarrier /dev/mmcblk0p2
When an error is detected during mounting at boot (including an option that is not accepted), the root file system will be mounted read-only, so that configuration files cannot be changed. It is however possible to remount the filesystem in read-write mode using the following command:
    sudo mount -o remount,rw /dev/mmcblk0p2 / 
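
Whether the default mount options were actually stored in the filesystem can be checked from the superblock listing printed by "tune2fs -l", which contains a "Default mount options:" line. The sketch below extracts that line; default_mount_options is a hypothetical helper name, and the device name is the Raspbian one assumed above.

```shell
# Extract the default mount options from "tune2fs -l" output read on stdin.
default_mount_options() {
    sed -n 's/^Default mount options:[[:space:]]*//p'
}

# On the Pi:
#   sudo tune2fs -l /dev/mmcblk0p2 | default_mount_options
```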

Disabling X windows system error logging


Edit the file /etc/X11/Xsession and change the relevant lines (somewhere in the middle of the file) to look like this. The log-generating line is commented out and replaced with one that routes messages to /dev/null:
    #exec >> "$ERRFILE" 2>&1
    exec >> /dev/null 2>&1

Increasing the size of the console font


The Raspberry Pi comes configured with a relatively small font in the text console. However, it is easy to configure a larger font that greatly improves readability by editing /etc/default/console-setup to include, for example, the following two lines:
    FONT_FACE="TerminusBold"
    FONT_SIZE="16x32"
This configures a highly readable 16x32 font. This amounts to 80x32 characters on a 1280x1024 monitor. On a 1920x1080 monitor, this amounts to 120x33 characters with 24 scanlines unused.
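
The character grid for other resolutions follows from integer division of the screen size by the font cell size. A small sketch of that arithmetic (console_grid is a hypothetical helper name; the second call assumes a display with 1080 lines):

```shell
# Print the text grid for a given screen size and font cell size.
# Arguments: screen width, screen height, font width, font height (pixels).
console_grid() {
    width=$1; height=$2; font_w=$3; font_h=$4
    cols=$((width / font_w))
    rows=$((height / font_h))
    unused=$((height - rows * font_h))   # scanlines left over at the bottom
    echo "${cols}x${rows} characters, ${unused} scanlines unused"
}

console_grid 1280 1024 16 32   # 80x32 characters, 0 scanlines unused
console_grid 1920 1080 16 32   # 120x33 characters, 24 scanlines unused
```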

Editing nano context highlighting configuration


The convenient nano text editor comes configured with context-sensitive color highlighting by default. It operates using rules read from a configuration file, which depends on the file extension of the file being edited. Some of these rules are very complex and much too slow for a device such as the Raspberry Pi, resulting in sluggish editing. For example, the rules for C/C++ source files in /usr/share/nano/c.nanorc include a rule labelled with the comment "This string is VERY resource intensive!". Commenting out this rule by putting a '#' in front of it results in a greatly improved experience when editing C/C++ source files. Similar resource-intensive rules exist in the configuration for assembler files in /usr/share/nano/asm.nanorc.
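
The edit can also be done with a filter instead of by hand. The sketch below assumes that the slow rule is the single line directly following the marker comment, which should be verified in the installed c.nanorc before relying on it; comment_out_slow_rule is a hypothetical helper name.

```shell
# Comment out the line that follows the "VERY resource intensive" marker.
# Reads a nanorc file on stdin and writes the modified version to stdout.
# Assumption: the rule occupies exactly one line right after the marker comment.
comment_out_slow_rule() {
    awk '
        /VERY resource intensive/ {
            print
            if ((getline line) > 0) print "#" line
            next
        }
        { print }
    '
}

# On the Pi (review the result before copying it back with sudo):
#   comment_out_slow_rule < /usr/share/nano/c.nanorc > /tmp/c.nanorc
```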

Overclocking


Overclocking options can be set using the raspi-config configuration utility. Overclocking is often stable on the device. I had no problems with the highest "Turbo" setting, which overclocks the CPU to 1000 MHz and the RAM to 600 MHz. The "core frequency" is also doubled from 250 MHz to 500 MHz. This frequency is tied to the L2 cache used by the GPU (it does appear to cause a large increase in GPU performance).

On the Raspberry Pi 2, the default clock speed is 900 MHz, while the "core frequency" (which has a different meaning than it has for the original Raspberry Pi) is a conservative 250 MHz. The main overclocking option intended for the Raspberry Pi 2 clocks the CPU cores at 1000 MHz, RAM at 500 MHz, and doubles the "core frequency" to 500 MHz. This has a positive effect on performance (especially memory performance), suggesting that the core clock is indeed correlated with CPU cache speed on the Raspberry Pi 2, which has a separate L2 cache for the CPU. However, this setting turned out to be not quite 100% stable, with the culprit being an SDRAM speed that is slightly too high.

The amount of memory allocated to the GPU (memory split) should be set to something like 128 MB if you want to be able to run a wider range of OpenGL ES 2.0 applications, assuming a 512 MB Raspberry Pi.

Benchmarks


The performance increase from overclocking the CPU and RAM is measurable in low-level benchmarks. The following benchmarks were run with framebuffer_depth=32 in /boot/config.txt and a monitor resolution of 1280x1024.

Running the benchmark program from the "fastarm" repository (http://www.github.com/hglm/fastarm) as follows shows a significant performance increase from the default settings:
    ./benchmark --memset a --test 4
    ./benchmark --memcpy a --test 43
    ./benchmark --memcpy a --test 0
The memset benchmark (which reflects sequential DRAM write access) shows a performance increase from 1220 MB/s to 1929 MB/s (58%). The memcpy benchmark (which reflects copying a sequential memory region of 4 KB) shows an increase from 320 MB/s to 432 MB/s (35%). The third benchmark (which reflects smaller memory copies of varying size) is also dependent on CPU performance. It shows an impressive increase from 206 MB/s to 353 MB/s (71%).
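
The percentages quoted here are plain relative increases, (after - before) / before. A short sketch that reproduces them from the measured throughput figures (pct_increase is a hypothetical helper name):

```shell
# Print the relative increase from "before" to "after" as a rounded percentage.
pct_increase() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.0f\n", (after - before) / before * 100 }'
}

pct_increase 1220 1929   # memset: 58
pct_increase 320 432     # 4K memcpy: 35
pct_increase 206 353     # varied-size memcpy: 71
```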

On the Raspberry Pi 2, the memset benchmark only scores 1090 MB/s with default settings. This seems to be a bandwidth bottleneck related to CPU and L2 cache speed limitations since other memset variants (including one using NEON SIMD instructions) show the same result. Multi-threaded benchmarks may potentially show higher throughput. However, when using the overclocking option, which significantly increases the clock speed of the L2 CPU cache, memset performance increases to 1730 MB/s, in line with the Raspberry Pi.

The second benchmark (4K memcpy) reports 965 MB/s, which is significantly faster than the original Raspberry Pi. This seems to imply that write bandwidth provides the bottleneck. When overclocking, the result further increases to 1380 MB/s. The third benchmark (memcpy of regions of varying size) reports 673 MB/s, significantly higher than the original Raspberry Pi, which increases to 785 MB/s when overclocking.

The low-level benchx program (http://www.github.com/hglm/benchx) can be used to measure pixel performance in the X server. A command line similar to the one below was used:
    ./benchx --window All
Overclocking shows the following improvements (measurements in MBytes/s) on the Raspberry Pi:
    Test                          Standard     Overclock (Turbo)   Speed-up
    ScreenCopy (33x33)               84.7           150               77%
    ScreenCopy (549x549)            184             277               51%
    FillRect (33x33)                842            1282               52%
    FillRect (549x549)             1161            1835               58%
    PutImage (549x549)               45.1            72.8             74%
    ShmPutImage (549x549)           229             365               59%
    ShmPutImageFullWidth (109x109)  298             534               79%
    ShmPutImageFullWidth (549x549)  233             364               56%
The SRE real-time rendering library (http://www.github.com/hglm/sre) allows us to measure 3D GPU performance. The following benchmarks were run:
    ./sre-demo --benchmark --multi-pass --multiple-lights --shadow-volumes demo2
    ./sre-demo --benchmark demo2
    ./sre-demo --benchmark demo9
The first benchmark, a complex 3D benchmark with multiple lights, multi-pass rendering and shadows, shows an increase from 1.118 fps to 1.325 fps (19%) when overclocking. The second benchmark uses only a single light with single-pass rendering and no shadows and shows an impressive increase from 13.2 fps to 27.2 fps (106%). The third benchmark, a lighter benchmark which uses a lot of alpha blending, increased from 30.2 fps to 48.0 fps (59%).

On the Raspberry Pi 2, the first benchmark reports an even slower framerate of 0.91 fps, while the second benchmark scores 20.5 fps, and the third benchmark scores 32.5 fps. Surprisingly, these results do remain about the same when overclocking. This is probably because on the Raspberry Pi 2, the core_freq setting does not directly affect the GPU.

File system benchmarks


Testing low-level file system performance using flash-bench (http://github.com/hglm/flash-bench) gives an indication of flash storage performance. A high-speed UHS 1 MicroSD card was used (using an SD card adapter). Measurements in MB/s.

                                          Seq.    Seq.    Random  Random
                                          Read    Write   Read    Write
    PC (using USB SD card adapter)        18.1    15.7     3.55   15.3
    Raspberry Pi, optimized fs options    17.2    13.3     4.35    0.96
    Raspberry Pi (turbo overclock)        17.2    14.4     4.83    0.86
    Raspberry Pi 2                        17.4    14.5     5.09    1.47
    Raspberry Pi 2 (same card via USB)    16.2    10.9     4.38    1.19
    Raspberry Pi 2 (different card)       17.7     9.5     4.46    1.31
    Raspberry Pi 2 (same card, overclock) 17.7     9.7     4.52    1.56
    Raspberry Pi 2 eMMC via USB           17.8    14.4     4.38    6.80
The results show very slow random write performance, despite the use of the journal_data_writeback and nobarrier options, when compared to testing on other devices. Whether this is reflected in actual real-world performance is unclear.

The last entry is for a high-performance eMMC flash module fitted on an eMMC-to-MicroSD adapter fitted on a USB Micro-SD reader. This shows that higher random write performance is possible given high performance flash memory. Although I have not tested this set-up with the Raspberry Pi 2's internal MicroSD card slot, it would probably deliver similar or better performance.

Updated February 25, 2015.

Wednesday, October 29, 2014

Optimizing system performance of flash-based Linux systems

Here are a few tips to significantly improve interactive performance on basic flash-based devices running Linux, such as the Raspberry Pi, other ARM or x86-based development boards, simple netbooks, mini PCs and media boxes, and mobile devices such as tablets. These tips apply primarily to Debian and derived distributions such as Raspbian, Ubuntu and Linux Mint, but in general terms apply to most Linux distributions or even Android to some extent.

Typical Linux distributions configured for HDD-based server applications, not optimized for flash-based systems


When the root filesystem used on the device is on a flash memory card or another type of simple flash memory (but not a higher performance recent SSD), the flash storage can quickly become an enormous performance bottleneck, especially given the fact that Linux distributions are typically configured out of the box to continuously access and write to the file system for all kinds of logging activity and temporary files used by many applications. While this configuration is suitable for HDD-based PCs or servers, it is not at all suitable for a flash-based device.

Possible measures to reduce and optimize flash disk access


As listed below, several types of optimization are possible, each of which can make the system significantly faster, and together they can make a difference of night and day, turning an unusably sluggish system into a fairly quick, usable one.
  • Reducing system logging activity. Out of the box, Linux distributions tend to be configured with full logging that produces a significant amount of ongoing write access to the disk. A lot of data such as kernel messages is often generously logged into two or three different logs. Although logging has its purpose for diagnosing a system problem in a mission-critical system, this usually does not apply to a flash-based device so almost all logging can be safely disabled.
  • Using ramdisk filesystems for temporary storage. Most of the time temporary files can easily be stored in RAM, avoiding the significant overhead of storing and modifying a temporary file on flash storage. This involves mounting /tmp and similar directories on a ramdisk (tmpfs), and coaxing applications to store their internal cache or temporary storage directories on a ramdisk. Of course, it is helpful if the device has a reasonable amount of RAM (512MB is already sufficient for extensive use of ramdisks, while 1GB or more is convenient).
  • In general when local cache storage of an application is configurable, an example of which is the web content cache used by a web browser such as Firefox, it can be helpful to eliminate local storage as much as possible (set the size to zero). While obviously more content being kept in RAM by the application of its own volition would be good (this may happen), reloading from the network (internet) is often preferable to the high overhead and bottlenecks caused by the continuous flash disk access for local cache storage.
  • The filesystem used, most commonly ext4, can be extensively tweaked to provide much better performance on flash storage. Measures include using write-back mode with longer "sync" delays instead of ordered data mode with relatively short delays before writing to disk, resulting in much more effective write caching, which can take the edge off a flash write access bottleneck. Another obvious trick is to eliminate unnecessary bookkeeping such as file access time updates (using the "noatime" mount option). Further gains are possible by forgoing features such as journalling and huge file support altogether, since these cater for larger systems and maximal stability on externally powered systems such as PCs. On a battery-powered device, write-back mode/write caching can often be extended without a disproportionate decrease in reliability and stability.
The relevant configuration changes are described below. Most of this section has been copied from the recent netbook blog article. Note that superuser privileges (for example, using sudo) are required for editing most of the configuration files mentioned and for running any of the command lines.

Reducing system logging activity


System logging is often configured to be pretty active out of the box with most Linux distributions. Much of the logging that takes place is not a requirement for an (often single-user) flash-based device and can be disabled without consequences. Although kernel logs can be useful, the dmesg command can also show the kernel messages for the current boot. The system logger is usually rsyslog, the rules for which may be stored in the /etc/rsyslog.d/ directory (for example, in a file named 50-default.conf). Disabling most logging can be accomplished by commenting out the rules by putting a '#' in front of them, for example:
    #auth,authpriv.*        /var/log/auth.log
    #*.*;auth,authpriv.none -/var/log/syslog
    #cron.*                 /var/log/cron.log
    #daemon.*               -/var/log/daemon.log
    #kern.*                 -/var/log/kern.log
    #lpr.*                  -/var/log/lpr.log
    #mail.*                 -/var/log/mail.log
    #user.*                 -/var/log/user.log
Of course, for security purposes or in a multi-user system it might be preferable to keep some of these logs, such as auth.log. Some kernels can be very noisy due to frequent messages or non-fatal errors or warnings which can affect performance when logging is enabled. In this case, it seems reasonable to disable kernel logging via rsyslog because dmesg already produces a similar log.
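The commenting-out step can also be scripted. The following is a minimal sketch, assuming GNU sed (for the -i option) and a rules file laid out like the example above; disable_log_rules is a hypothetical helper name, not part of rsyslog. It only touches single-line rules that start in the first column, so multi-line rules continued with a backslash are left as they are.

```shell
#!/bin/sh
# disable_log_rules FILE  (hypothetical helper, not an rsyslog tool)
# Comments out every active single-line rule that writes below /var/log/.
# Lines that already start with '#' or with whitespace are left alone.
disable_log_rules() {
    # Keep a backup copy next to the original before editing in place.
    cp "$1" "$1.bak" &&
    sed -i -e '/^[^#[:space:]].*\/var\/log\//s/^/#/' "$1"
}

# Example (requires root; restart the logger afterwards):
#   disable_log_rules /etc/rsyslog.d/50-default.conf
#   service rsyslog restart
```

To keep a particular log such as auth.log, remove the '#' from that rule again after running the helper.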

Using ramdisks (tmpfs) for temporary files


It is quite easy to move directories where temporary files are stored (primarily /tmp) to a ramdisk, and it can make a significant difference in performance. The following lines added to /etc/fstab cause /tmp and /var/tmp to be stored in ramdisks:
    tmpfs    /tmp       tmpfs    defaults    0 0
    tmpfs    /var/tmp   tmpfs    defaults    0 0
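After a reboot it is worth confirming that the directories really ended up on a ramdisk. The sketch below reads the kernel's mount table for that purpose; fs_type_of is a hypothetical helper name, and the optional second argument exists only so the function can be tried on a sample file.

```shell
#!/bin/sh
# fs_type_of MOUNTPOINT [MOUNTS_FILE]  (hypothetical helper)
# Prints the filesystem type of the last entry mounted at MOUNTPOINT.
# MOUNTS_FILE defaults to /proc/mounts; returns non-zero if nothing is mounted there.
fs_type_of() {
    awk -v mp="$1" '
        $2 == mp { fstype = $3 }      # field 2: mount point, field 3: type
        END { if (fstype == "") exit 1; print fstype }
    ' "${2:-/proc/mounts}"
}

# Example, after rebooting with the new /etc/fstab entries:
#   fs_type_of /tmp        # should print tmpfs
#   fs_type_of /var/tmp    # should print tmpfs
```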

Optimizing cache directories


Applications like browsers and window managers that use a disk cache may conform to the XDG Base Directory Specification. In that case, the environment variable XDG_CACHE_HOME defines the directory where local temporary cache files are stored. By setting this variable to a ramdisk location, it is possible to significantly speed up certain browsers that are otherwise affected by heavy writing to the disk cache on the flash device. This can be accomplished by creating a new file in /etc/profile.d/, for example /etc/profile.d/xdg_cache_home.sh, which is sourced at the start of every login shell.
    #!/bin/bash
    export XDG_CACHE_HOME="/dev/shm/.cache"
Note this may not affect the main internet web content cache with certain browsers (such as Firefox), speeding up other types of cached information used by Firefox instead, while it does cause the main internet content cache to be stored on the ramdisk in the case of the lightweight browser Midori. In the case of Firefox, it can be beneficial to reduce the internet cache as much as possible (down to 8MB or zero), since extra network access (as long as it is fast enough and not associated with extra cost) is likely to be faster than the constant writing to the internet content cache on the flash card that otherwise happens. In the case of Midori, the internet content cache directory on the ramdisk can build up in size, affecting free RAM, which can be fixed by instructing the browser to empty the internet cache on exit.
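On a system with more than one user, a single shared /dev/shm/.cache directory can lead to permission clashes. A possible variant of the profile script is sketched below, assuming /dev/shm is a tmpfs mount (as it is on most Linux distributions); the per-user ".cache-<uid>" directory name is this example's own choice and not part of the XDG specification.

```shell
# Sketch for /etc/profile.d/xdg_cache_home.sh
# Assumption: /dev/shm is a tmpfs mount writable by ordinary users.
cache_dir="/dev/shm/.cache-$(id -u)"   # per-user name chosen for this example

# Create the directory with owner-only access and export the variable only
# on success, so applications fall back to their default (~/.cache) when
# the ramdisk is unavailable.
if mkdir -p -m 700 "$cache_dir" 2>/dev/null; then
    export XDG_CACHE_HOME="$cache_dir"
fi
unset cache_dir
```

Because the directory lives in RAM, its contents are lost at every reboot, which is acceptable for cache data.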

Optimizing the filesystem (ext4) using write-back mode and other settings


Resources exist on the web on how to improve filesystem performance with ext4. The following line in /etc/fstab illustrates an optimized set of mount options for the ext4 root filesystem that should make a big difference in performance (UUID=nnnn or /dev/sdXn is the partition device used, which depends on the system):
    UUID=nnnn / ext4 noatime,journal_async_commit,data=writeback,barrier=0,nobh,errors=remount-ro 0 1
You should also change the physical flag for journal_data_writeback mode, stored in the filesystem itself:
    tune2fs -o journal_data_writeback /dev/sdXn
where sdX is the SD card device and sdXn is the partition where the filesystem is stored. These changes should improve performance a lot, allowing it to reach an acceptable level.

Although these filesystem options have the potential to jeopardize stability and recoverability somewhat in case of system crash or power interruption, when a device is battery-powered the risk is much less.
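Whether the root filesystem actually came up with the intended options can be checked in the kernel's mount table after a reboot. The following sketch assumes the option appears there under the same spelling as in /etc/fstab, which holds for noatime and data=writeback but not necessarily for every option; has_mount_opt is a hypothetical helper name.

```shell
#!/bin/sh
# has_mount_opt MOUNTPOINT OPTION [MOUNTS_FILE]  (hypothetical helper)
# Succeeds if OPTION is among the options of the last entry mounted at
# MOUNTPOINT. MOUNTS_FILE defaults to /proc/mounts.
has_mount_opt() {
    awk -v mp="$1" -v opt="$2" '
        $2 == mp { opts = $4 }            # field 4: comma-separated options
        END {
            n = split(opts, list, ",")
            for (i = 1; i <= n; i++) if (list[i] == opt) found = 1
            exit (found ? 0 : 1)
        }
    ' "${3:-/proc/mounts}"
}

# Example:
#   has_mount_opt / data=writeback && echo "write-back mode is active"
#   has_mount_opt / noatime        && echo "access times are not updated"
```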

Disabling X Window System error logging


Finally, the X Window System maintains a log file called .xsession-errors in your home directory that gets filled with warning and error messages from programs running in the X session. In some cases this log file can fill up quickly and affect system performance. To disable it, edit the file /etc/X11/Xsession and change the relevant lines (somewhere in the middle of the file) to look like the following. Here the log-generating line has been commented out and replaced with one that routes messages to /dev/null:
    #exec >> "$ERRFILE" 2>&1
    exec >> /dev/null 2>&1
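The same change can be applied with a small script instead of a text editor. This is only a sketch, assuming GNU sed and that the file contains the redirect line in the form shown above (with or without a space after ">>"); silence_xsession_log is a hypothetical helper name.

```shell
#!/bin/sh
# silence_xsession_log FILE  (hypothetical helper)
# Comments out the line that appends session output to "$ERRFILE" and adds
# a replacement line that discards the output instead.
silence_xsession_log() {
    # Keep a backup, then rewrite the matching line in place (GNU sed).
    cp "$1" "$1.bak" &&
    sed -i -e 's|^\([[:space:]]*\)\(exec >> *"\$ERRFILE" 2>&1\)$|\1#\2\n\1exec >> /dev/null 2>\&1|' "$1"
}

# Example (requires root):
#   silence_xsession_log /etc/X11/Xsession
```

If the file does not contain a matching line, it is left unchanged, so it is worth inspecting the result before logging out.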

Conclusion


In summary, the configuration changes above, which are relative to standard configuration settings in typical Linux distributions, can help transform a flash-based Linux system from very slow, continuously stalling behaviour to a reasonably consistent, fairly quick response, making it much more usable.

Sources: SmartLogic

Updated November 2, 2014 (spelling).

Tuesday, October 28, 2014

Project -- Revitalizing an old Asus Eee PC

I was recently given a disused Asus Eee PC model 701SD with a view to making it usable again, because it was very slow. The Eee PC 701SD comes with a Celeron M processor up to 900 MHz, 512 MB DDR2, 8GB of early (2008 era) SSD storage and a 7" 800x480 screen (although the netbook is physically larger than 7" with a big border including speakers around the screen), and Windows XP installed. The device has 802.11b/g WiFi, an Ethernet port and VGA output.

This Asus Eee PC model provides convenient hardware upgrade options


Although some Asus Eee PC models originally came with a version of Linux installed, this particular model (dating from about 2008) came with Windows XP installed on the internal early Phison 8GB SSD storage device. However, as I received it, performance was slow, probably because of severe speed limitations associated with the early SSD model. Limited RAM and the fact that the storage space was almost full also contributed to the bad performance.

One of the first things I did was to assess to what extent the hardware could be physically upgraded. There is a convenient panel on the bottom of the device that gives easy access to the DDR2 SO-DIMM module and the SSD storage device. I replaced the 512MB DDR2 module with a 1GB one (333 MHz DDR2-667), which should be a big boost.

However, I am not certain the memory is running at the optimal speed. The BIOS seems to configure it running at about 150 MHz (DDR-300 effective) which may reflect a power saving state. Asus includes a Hybrid Engine driver in Windows XP that may regulate the RAM frequency as well as the CPU frequency. Because of this, it is possible that RAM is stuck at a low speed when running Linux.

The internal SSD is easily removable and it looks like it is connected using some kind of IDE-like interface. I have read the device may use a CompactFlash-compatible interface so that installing a more recent CompactFlash storage device may significantly increase performance. There is supposed to be room and board space beyond the cover area to accommodate a fairly large device.

Windows XP slow, but somewhat improved after optimization


I flashed the BIOS to the latest available version, and after cleaning up the XP installation by removing unnecessary applications and cleaning up temporary files, I updated most of the Windows device drivers for the various hardware devices. However, the drivers that are still listed on the Asus website do not seem to be the most recent versions in many cases. I was able to breathe new life into the WiFi chip by downloading an updated driver from Realtek. Overall, the slowness of Windows XP seemed to be improved somewhat.

Linux Mint Mate seems a good match for a netbook


For running Linux, I picked Linux Mint 17 Mate. I have had good experiences with the more demanding Cinnamon variant of Linux Mint, and although Cinnamon does not put high demands on hardware, the Mate desktop is supposed to be considerably more lightweight and does look and function well. The main drawback of Mate would be limitations caused by the continuing use of the GTK+ 2.x libraries instead of the current GTK+ 3.x, although I have yet to encounter such limitations on this system.

I believe there is little against porting desktops such as Mate to GTK+ 3.x, because the perceived heavy overhead associated with GTK+3/Gnome is much more associated with the Gnome desktop environment rather than the underlying low-level GTK+ 3.x libraries. I have in the past happily run GTK+ 3.x applications in a GTK+ 2.x desktop environment on a relatively slow ARM-based device with few repercussions for speed or memory use.

I installed Mint on an SD card, which can conveniently be inserted into the netbook's card reader. Although the BIOS can be instructed to boot directly from the SD card or a USB stick, I let the USB stick-based installer install grub on the boot record of the internal SSD, so that Linux and Windows XP are now selectable at boot without going into the BIOS, as long as the SD card with Linux is kept inside the SD card slot.

Despite the netbook's relatively small 7" 800x480 screen, Linux Mint Mate looks pretty good, and menus generally fit on the screen after setting the DPI to the lower value of 80. Performance was already much better than under Windows XP.

Optimizations to reduce and speed up disk (write) access


However, system performance was clearly affected by delays associated with regular and excessive disk access. After some research on the web, I came up with the following set of mount options for the ext4 root filesystem in order to improve performance (modified in /etc/fstab):
    UUID=nnnn / ext4 noatime,journal_async_commit,data=writeback,barrier=0,nobh,errors=remount-ro 0 1
I also changed the physical flag for journal_data_writeback mode, stored in the filesystem itself:
    tune2fs -o journal_data_writeback /dev/sdXn
where sdX is the SD card device and sdXn is the partition where the filesystem is stored. These changes certainly seem to improve performance a lot, allowing it to reach an acceptable level, even though the SD card used (a standard 8GB Class 10 HC card dating from 2013) is not very new or particularly fast.

Although these filesystem options have the potential to jeopardize stability and recoverability somewhat in case of system crash or power interruption, the fact that the device is battery-powered provides considerable insurance.
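To confirm that the tune2fs change was stored in the filesystem, the superblock listing printed by "tune2fs -l" can be inspected; it contains a "Default mount options:" line. The sketch below just filters that line out of the listing; default_mount_opts is a hypothetical helper name.

```shell
#!/bin/sh
# default_mount_opts  (hypothetical helper)
# Reads "tune2fs -l" output on standard input and prints the default mount
# options recorded in the superblock.
default_mount_opts() {
    sed -n 's/^Default mount options:[[:space:]]*//p'
}

# Example (requires root; sdXn is the partition, as in the text):
#   tune2fs -l /dev/sdXn | default_mount_opts
# The printed list should include journal_data_writeback.
```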

Using ramdisk for tmp directories


I moved the /tmp and /var/tmp directories to a ramdisk by adding the following lines to /etc/fstab:
    tmpfs    /tmp       tmpfs    defaults    0 0
    tmpfs    /var/tmp   tmpfs    defaults    0 0

Moving application cache directories to ramdisk


Applications like browsers and window managers that use a disk cache may conform to the XDG Base Directory Specification. In that case, the environment variable XDG_CACHE_HOME defines the directory where local temporary cache files are stored. By setting this variable to a ramdisk location, it is possible to significantly speed up certain browsers that are otherwise affected by heavy writing to the disk cache on the flash device. This can be accomplished by creating a new file in /etc/profile.d/, for example /etc/profile.d/xdg_cache_home.sh, which is sourced at the start of every login shell.
    #!/bin/bash
    export XDG_CACHE_HOME="/dev/shm/.cache" 
Note this may not affect the main internet web content cache with certain browsers (such as Firefox), speeding up other types of cached information instead, while it does cause the main internet content cache to be stored on the ramdisk in the case of the lightweight browser Midori. In the case of Firefox, it can be beneficial to reduce the internet cache as much as possible (down to 8MB or zero), since extra network access (as long as it is fast enough and not associated with extra cost) is likely to be faster than the constant writing to the internet content cache on the flash card that otherwise happens. In the case of Midori, the internet cache directory on the ramdisk can build up in size, affecting free RAM, which can be fixed by instructing the browser to empty the internet cache on exit.
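For Firefox, the cache size can be set in the preferences dialog or on the about:config page. As an alternative, the sketch below appends the relevant preferences to the user.js file of a profile, which Firefox reads at startup. The preference names browser.cache.disk.capacity (a size in KB) and browser.cache.disk.smart_size.enabled are Firefox preferences; set_firefox_disk_cache is a hypothetical helper name, and the profile directory has to be looked up on the system in question.

```shell
#!/bin/sh
# set_firefox_disk_cache PROFILE_DIR SIZE_KB  (hypothetical helper)
# Appends preferences to PROFILE_DIR/user.js that switch off automatic cache
# sizing and cap the disk cache at SIZE_KB kilobytes (8192 = 8MB, 0 = no room).
set_firefox_disk_cache() {
    {
        printf 'user_pref("browser.cache.disk.smart_size.enabled", false);\n'
        printf 'user_pref("browser.cache.disk.capacity", %s);\n' "$2"
    } >> "$1/user.js"
}

# Example (the profile directory name differs per installation):
#   set_firefox_disk_cache "$HOME/.mozilla/firefox/PROFILE_DIR" 8192
```

Firefox needs to be restarted for the new values to take effect.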

Reducing/eliminating system logging


System logging is also configured to be pretty active out of the box with Linux Mint 17 Mate. The system logger is rsyslog, and the rules it uses are stored at /etc/rsyslog.d/50-default.conf in Mint. Although Mint does not enforce synchronous log updates (the dash in front of the log file means syncing on update is omitted), several logs are still being kept. Although kernel logs can be useful, the dmesg command is also able to log kernel messages for the current boot. It seems dmesg itself also logs kernel messages in /var/log in addition to rsyslog. Disabling most logging can be accomplished by commenting out the rules by putting a '#' in front of them.
    #auth,authpriv.*        /var/log/auth.log
    #*.*;auth,authpriv.none -/var/log/syslog
    #cron.*                 /var/log/cron.log
    #daemon.*               -/var/log/daemon.log
    #kern.*                 -/var/log/kern.log
    #lpr.*                  -/var/log/lpr.log
    #mail.*                 -/var/log/mail.log
    #user.*                 -/var/log/user.log
Of course, for security purposes or in a multi-user system it might be preferable to keep some of these logs, such as auth.log. Some kernels can be very noisy due to frequent messages or non-fatal errors or warnings which can affect performance when logging is enabled. In this case, it seems reasonable to disable kernel logging via rsyslog because dmesg already produces a similar log.

Disabling X Window System error logging


Finally, the X Window System maintains a log file called .xsession-errors in your home directory that gets filled with warning and error messages from programs running in the X session. In some cases this log file can fill up quickly and affect system performance. To disable it, edit the file /etc/X11/Xsession and change the relevant lines (somewhere in the middle of the file) to look like the following. Here the log-generating line has been commented out and replaced with one that routes messages to /dev/null:
    #exec >> "$ERRFILE" 2>&1
    exec >> /dev/null 2>&1

Conclusion


In summary, the configuration changes above, which are relative to standard configuration settings in typical Linux distributions, can help transform a flash-based Linux system from very slow, continuously stalling behaviour to a reasonably consistent, fairly quick response, making it much more usable.

Sources: SmartLogic

Monday, October 27, 2014

Running Linux distributions from a flash card or stick -- customization required for good performance!

Linux on flash storage is ubiquitous


There are quite a few Linux-based systems running with the main (root) file system on cheap flash memory (not a full SSD), ranging from development boards to the highest-volume consumer mobile devices sold today.

Full Linux distributions are run on devices such as the popular Raspberry Pi educational development board, and there are communities dedicated to running full Linux distributions on other ARM-based development boards or consumer devices such as tablets and media boxes.

Many consumer devices can be made to run full Linux, but tinkering often required


In fact, if a consumer device such as a cheap tablet has external interfaces such as a USB port and HDMI and the ability to boot an alternative OS after flashing the bootloader or directly from an SD card, it is probably possible to run a pretty much full-featured Linux desktop on it, including mouse, keyboard and a big PC monitor. Although this used to be relatively slow (especially for a full GUI environment), recent ARM SoCs have become faster and can provide a more pleasant experience.

However, on anything but a development board, this is not usually 'plug-and-play', and technical expertise and experimentation are often required, with the different ARM chip platforms being in different states of development regarding running full Linux on them, and development efforts for a typical platform being fragmented. The often wide variation between devices using the same chip platform in peripheral chips such as WiFi chips, LCD screens, RAM configurations and other factors adds to the complexity. If a functional kernel with enough working device drivers is available, any typical Linux root filesystem, such as one based on Debian compiled for ARM, can be run on it.

Of course, billions of mobile devices such as smartphones and tablets running Android already run Linux-like systems (including an actual kernel) on cheap flash-based storage. There is also significant cross-over between Android kernel driver source code and device drivers for full Linux, with many kernels and drivers being directly usable or requiring limited adaptation.


Intel/AMD x86 systems too


Running Linux on a cheap or older x86 system using simple flash storage for the root filesystem is not uncommon either. This includes early netbooks such as the Asus Eee PC series, which can still be pretty usable when properly upgraded, configured and optimized with a lightweight Linux distribution. Additionally the cheapest and smallest x86 mini-PCs often have basic flash storage (although low-cost SSDs offer high performance for a higher price).

Optimization required and does wonders for performance


It is often overlooked that standard Linux distributions have a strong legacy in full-featured, hard disk drive-based systems, including heavy-duty applications like servers, and that their default configurations and settings are therefore not at all suited to a small, flash-based single-user device.

Even popular development boards such as the Raspberry Pi for a long time shipped with a Debian-based distribution that was still largely configured for heavier use, with extensive logging and with temporary files and application caches being continuously stored and modified on the local flash disk (as they would be on HDD-based systems). Given the nature of flash drives, this is an obvious way to make a system very slow and unresponsive: writing to a flash drive in particular can be costly in terms of time taken, and excessive write access is also detrimental to the stability and lifetime of the flash memory.

Several major file system and OS configuration optimizations possible


As will be described in subsequent posts, it is not at all difficult to largely eliminate excessive flash disk access (especially writes), resulting in a dramatically better user experience and performance. Measures include largely eliminating system logging, clever use of RAM-disks for every possible kind of temporary storage, and configuring a Linux filesystem such as ext4 with settings such as write-back mode for more effective write caching, eliminating unnecessary bookkeeping such as access times, and other tweaks such as forgoing features such as journalling and huge file support, which cater for larger systems and for maximal stability on externally powered systems such as PCs.

In fact, because many of the devices mentioned are battery powered or can easily be modified to be battery powered, there is actually potential to run a fairly stable system even with heavy optimizations such as write-back settings and limited journalling that would normally impact stability on HDD-based desktop systems. Apart from a common filesystem like ext4, many older or new filesystems can in principle be configured to work well on flash memory.