Sunday, October 21, 2012

FUSE musings.

If you're actually trying to write a real file system (not just a pass-through or a filter), then you need to keep track of files, directories, links, open reference counts and path lookup, i.e. the typical file system baggage that any self-respecting operating system already has a well-debugged and tuned implementation of, like VFS in *nixes, IFS in NT and FSS in ESX.

For FUSE, I'm surprised that no one has made a generic file system metadata implementation or library that could be consumed by things that are more complicated than sshfs and the like. I suppose I should take a look at "real" file systems like ntfs-3g and the FUSE ZFS port, although I think I already know what I'm going to find... I suppose there's definite sport in writing your own ;-).

Saturday, October 20, 2012

Linux kernel file I/O from a block driver

This code shows how a hypothetical block driver might handle submitted BIOs, servicing them with a file.
/* Handle I/O for a BIO component. */
int file_do_bvec(struct file *file, struct bio_vec *bvec, loff_t pos, int rw)
{
        u8 *buf;
        ssize_t bw;
        int len = bvec->bv_len;
        mm_segment_t old_fs = get_fs();

        buf = kmap_atomic(bvec->bv_page, KM_USER0) + bvec->bv_offset;

        /* buf is a kernel address, so lift the user-pointer check around the VFS call. */
        set_fs(get_ds());
        if (rw == WRITE)
                bw = vfs_write(file, buf, len, &pos);
        else
                bw = vfs_read(file, buf, len, &pos);
        set_fs(old_fs);

        kunmap_atomic(buf, KM_USER0);

        if (likely(bw == len))
                return 0;

        printk(KERN_ERR "Error at byte offset %llu, length %i.\n",
                        (unsigned long long)pos, len);
        if (bw >= 0)
                bw = -EIO;
        return bw;
}

/* Handle I/O for a BIO. */
int file_do_bio(struct file *file, struct bio *bio, loff_t pos)
{
        struct bio_vec *bvec;
        int i, ret = 0;

        /* Should have read-only check here, BIOs other than READ/WRITE. */

        bio_for_each_segment(bvec, bio, i) {
                ret = file_do_bvec(file, bvec, pos, bio_rw(bio));
                if (ret < 0)
                        break;
                pos += bvec->bv_len;
        }

        /*
         * At some point we need to fsync. In this simple example - I'll do it here.
         * TBD: should check error.
         */
        vfs_fsync(file, 0);
        return ret;
}

PM and QA (Apple Maps)

It's not a secret that for every PM who turned to the role to exert a larger sphere of influence than is possible in an individual contributor role, there are thousands who did so to simulate work and walk around meetings with laptops. In my six years I've dealt with exactly one PM who could serve as a poster example of what this role really should be about. Like any other track in life, not everyone is cut out for the job, just like not everyone is cut out for QA, research or people management.

Looks like Apple maps got the right people under one roof :-P

I'm not going to give the usual gripe about missing locations, etc., but here is a perfect example of something being really wrong with the team working on the maps application. I have the fortune of using a non-English locale (Russian) on my phone, so the navigation is localized as well. The text-to-speech engine is horrible. Basically, when pronouncing foreign names (in this case, English ones) you should either stick to the original (English) phonetics, or use the local (Russian) equivalents where possible. Ideally, this should be a configurable option, just like showing native or foreign street names. In the case of the maps application, it's impossible to tell what the TTS engine is talking about, as the pronunciation is neither English nor Russian.

That would be nothing, except that once in a while the maps application refers to miles as milliliters, which sounds about the same in Russian as it does in English. The abbreviation used is "мл." (ml.), and, apparently lacking context, once in a while the TTS engine says milliliters instead of miles. But not always. Astounding. This tells me that foreign localization was not a real deliverable, was not done by native speakers, and was not tested or dogfooded internally. Overall - great PM and QA efforts that deserve a promotion to explore other opportunities.

Thursday, October 18, 2012

Linux kernel file I/O

I had a colleague tell me today that file I/O in the Linux kernel is contrived. Let's examine...

Wednesday, October 17, 2012

Hiding mount points.

Consider using a user-space file system that acts as a filter on top of a "real" file system. Your user-space driver might mount the "real" file system in the background. What happens if the user-space driver crashes? That's right. No cleanup and left-over mounts.

There's another aspect here - you want to ensure exclusive access to the "real" file system, even from the perspective of PEBCAK-type behaviors, since modifications made directly to the "real" file system could corrupt the filtered one.
So it turns out this is well possible in Linux. The sequence of operations is something like -
mount("source", "/tmp/target", "ext4", 0, "");
dir = opendir("/tmp/target"); /* open so the umount2 defers */
fd = dirfd(dir);
umount2("/tmp/target", MNT_DETACH);
rmdir("/tmp/target"); /* fine too */
/* do stuff in hidden mounted fs through fd */
closedir(dir); /* finally unmounted on close */
In fact, after the MNT_DETACH (deemed a "lazy" umount) you can well rmdir(2) the mount point away (or mount something else on it). Very useful. If you're wondering how you can perform file and directory operations without having a named path, then openat(2) and related are your friends :-).
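As a minimal sketch of that last point, here are two helpers that work purely relative to a directory descriptor, such as the fd obtained from dirfd(dir) in the sequence above. The helper names write_file_at and read_file_at are made up for illustration; only openat(2), read(2), write(2) and close(2) are real calls.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Create (or truncate) `name` relative to dirfd and write `data` into it. */
int write_file_at(int dirfd, const char *name, const char *data)
{
        size_t len = strlen(data);
        ssize_t bw;
        int fd;

        assert(dirfd >= 0);
        fd = openat(dirfd, name, O_WRONLY | O_CREAT | O_TRUNC, 0600);
        if (fd < 0)
                return -1;
        bw = write(fd, data, len);
        close(fd);
        return (bw == (ssize_t)len) ? 0 : -1;
}

/* Read up to size - 1 bytes of `name` relative to dirfd; NUL-terminates buf. */
ssize_t read_file_at(int dirfd, const char *name, char *buf, size_t size)
{
        ssize_t br;
        int fd;

        assert(size > 0);
        fd = openat(dirfd, name, O_RDONLY);
        if (fd < 0)
                return -1;
        br = read(fd, buf, size - 1);
        close(fd);
        if (br >= 0)
                buf[br] = '\0';
        return br;
}
```

No absolute path ever appears, so the same code keeps working after the mount point has been detached and removed. mkdirat(2), unlinkat(2) and renameat(2) follow the same pattern.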

Wednesday, October 3, 2012


It's been a rocky ride so far.

Observation #1: PMON is a POS (and that's not "point of sale terminal")
  • Why does the serial console only get the firmware boot-time messages?
  • Why can't I look at all those messages scrolling by into oblivion?
...actually, you could write a book about the "why can't I XXX". Why PMON? If they did a TI where the braindead firmware just ran something like "boot.elf" in the order of USB, then HD, it would still be miles better. Would u-boot really be harder to port? It would certainly be less obscure. I think some grad students ported UEFI. It's pseudo-UEFI, since UEFI doesn't technically *do* MIPS... *sigh*. There is EFI-MIPS, but, curiously enough, it's EDK-based, which makes me feel like I'm back in 2006, and is an obvious dead end for that reason alone.

Observation #2: Booting Linux via a hex input panel, if possible, would probably be simpler.
  • The numbering thankfully doesn't change from different USB ports.
  • Curiously enough, 'load' works faster than 'initrd'. I timed it against loading the same file. Expect to run to Voltage and back for a coffee before the initrd completes...
Observation #3: Booting doesn't actually work.
  • Without initrd, I get a hang after the i8042 probe.
  • With initrd, I get the MIPS equivalent of a data abort from the firmware.
  • ...PMON can't handle an initrd.
Observation #4: For an architecture endorsed by RMS, this is all unreasonably obscure.
  • I decided to chain-load through GRUB2, but there are no GRUB2 prebuilts.
  • There is a GRUB2 prebuilt for the Yeelong, but this one just flashes "press ESC to skip loading on-disk grub.cfg", while ignoring both the USB HID and the serial.
  • There are no "newer" firmwares you can test chain-boot.
Technically, I don't care, but booting Linux is good proof that the hardware is alive, and allows for self-hosted development.

There was something already present on the disk (rays Linux), but it didn't boot with the default boot options. Manually doing -
PMON> load /dev/fs/ext2@wd0/boot/vmlinux-2.6.18-fl-v1.02
PMON> g init=/bin/bash
This got me far enough to change the password so I could use the system, copy the kernel and initrd from the USB stick, and add a boot.cfg entry (root=/dev/sda1 console=/dev/ttyS0).

Of course, I still can't boot the Arch Linux build. I get a bunch of unaligned kernel accesses followed by a page fault.



Tuesday, October 2, 2012

FUSE, redux

Here is a useful tutorial for FUSE. In particular, the section on unclear FUSE functions is very useful.

Btw, all paths passed into FUSE callbacks are already canonical, and "/" is in fact valid, and refers to the root of the volume.

What am I writing? Secret for now. Hold on :).


Monday, October 1, 2012

More love for VHDtool

Finally got around to making my VHDtool very useful indeed for folks struggling with the VHD image format.

With Hyper-V, generated VHD images could of course be used either with the emulated IDE disks, or with the virtual SCSI adapter. However, if the created disk size is < 127GiB, you have to be careful with the actual size. If the size implied by the specification-mandated C/H/S calculations doesn't match the created disk size, you will run into problems when:

1) Partitioning a VHD on SCSI, then moving it onto IDE.
2) Converting a raw image into a VHD, and using it on IDE.

The problems will manifest themselves as the disk appearing smaller than it was created, and partitions may be corrupted (and VMs unbootable ;-().

Now there is the '-c' option, which ensures that the created disk size is never smaller than desired when used with the emulated IDE adapter.
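To make the size mismatch concrete, here is a minimal sketch of the C/H/S calculation, assuming the algorithm given in the appendix of the VHD specification. The struct and function names are made up for illustration, and the constants should be checked against the specification before relying on them. This is not VHDtool's code.

```c
#include <assert.h>
#include <stdint.h>

struct chs {
        uint32_t cylinders;
        uint32_t heads;
        uint32_t sectors_per_track;
};

/* Derive a CHS geometry from a count of 512-byte sectors. */
struct chs vhd_chs(uint64_t total_sectors)
{
        struct chs g;
        uint64_t cth; /* cylinders * heads */

        assert(total_sectors > 0);
        if (total_sectors > 65535ULL * 16 * 255)
                total_sectors = 65535ULL * 16 * 255;

        if (total_sectors >= 65535ULL * 16 * 63) {
                g.sectors_per_track = 255;
                g.heads = 16;
                cth = total_sectors / g.sectors_per_track;
        } else {
                g.sectors_per_track = 17;
                cth = total_sectors / g.sectors_per_track;
                g.heads = (uint32_t)((cth + 1023) / 1024);
                if (g.heads < 4)
                        g.heads = 4;
                if (cth >= (uint64_t)g.heads * 1024 || g.heads > 16) {
                        g.sectors_per_track = 31;
                        g.heads = 16;
                        cth = total_sectors / g.sectors_per_track;
                }
                if (cth >= (uint64_t)g.heads * 1024) {
                        g.sectors_per_track = 63;
                        g.heads = 16;
                        cth = total_sectors / g.sectors_per_track;
                }
        }
        g.cylinders = (uint32_t)(cth / g.heads);
        return g;
}

/* Number of sectors actually addressable through the geometry. */
uint64_t chs_sectors(struct chs g)
{
        return (uint64_t)g.cylinders * g.heads * g.sectors_per_track;
}
```

For a 1 GiB image (2097152 sectors) this yields 2080/16/63, which addresses only 2096640 sectors. The 512-sector shortfall is the kind of gap that makes a disk look smaller on IDE than it was created.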

VHDtool is a *nix utility for creating fixed and dynamic VHD images, and for converting from raw to fixed VHDs. Support for converting to dynamic VHD images will come soon.

In other news, I'll soon be putting out a tool to correct certain VMDK corruptions that prevent a virtual disk from being attached, so that you can at least try file system recovery tools.


Sunday, September 30, 2012


A lot of people don't appreciate just how amazing FUSE is. Beyond enabling cool productivity enhancers, like the Host Profile File System (paper), it's a great vehicle for research prototyping new block and file system technologies. Of course, FUSE has some inherent performance issues, but for an initial prototype it's basically the difference between productively working on your project or chasing strange dcache issues, especially if you're not doing it full time (or wrote a major Linux file system, and uh... went to jail).

Of course, to help with doing a sensible performance analysis, it would be quite interesting if somebody made a Linux VFS shim on top of FUSE to port existing file systems into FUSE.

Of course, it pays to pay attention to the documentation and examples, since the FUSE API differs from the POSIX file API in several ways. For example, the FUSE readlink doesn't behave like readlink(2), since on success it returns 0 instead of the byte count written.
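A minimal sketch of the readlink difference, written as a plain function with the same shape as the high-level FUSE readlink callback so that it builds without libfuse. The names example_readlink and lookup_target, and the "/link" entry, are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical lookup; a real file system would consult its own metadata. */
static const char *lookup_target(const char *path)
{
        return strcmp(path, "/link") == 0 ? "/some/target" : NULL;
}

/* Same shape as the FUSE high-level readlink callback. */
int example_readlink(const char *path, char *buf, size_t size)
{
        const char *target = lookup_target(path);

        assert(size > 0);
        if (target == NULL)
                return -ENOENT;

        /*
         * Unlike readlink(2): truncate to fit, always NUL-terminate,
         * and return 0 on success rather than a byte count.
         */
        strncpy(buf, target, size - 1);
        buf[size - 1] = '\0';
        return 0;
}
```

Returning the length here, as you would from readlink(2), is an easy mistake to make when porting code.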


DABT/PABT handling.

Figured out why at least kernel panic handling of DABT/PABT cases was yielding garbage. Going to rewrite the handlers to use the existing SAVE_CONTEXT and RESTORE_CONTEXT. Jury still out on whether the VM trap handling really has to happen the way it is currently, but the code can surely be simplified.

In an unrelated project, I'm investigating lost event upcalls in an x64 PV OS. It's very aggravating that the PV interface is documented so poorly that you have to comb through the sources to find out the call stack layout and the contract between the hypervisor and the PV OS. For x86-32 at least there was a book. I shouldn't need to spend a week studying the x64 hypervisor to figure out how to write a PV OS for it.

Here's hoping I do a better job with xen3-arm-tegra.


Friday, September 28, 2012


Finally got my hands on a Loongson 2F machine :-).

Thursday, September 27, 2012

Not such a long way.

Framebuffer support is in, which means that finally, I can almost ditch the serial cable for a lot of work. It also means you don't need a special Xoom (or potentially - some other Tegra2 device) with a serial port. Retail Xoom owners can now proudly do...ummm...something. Not a whole lot, yet ;-(.

Found a page fault abort within __divdi3, but couldn't really debug it since the DABT handler (or should I say - the CPU context save code) is woefully broken. So before I can figure out what's wrong with __divdi3, I need to fix the trap handlers.

But wait, Andrei, how could you have the timer working then??? The IRQ path is completely different (as is the CPU context structure manipulated). Why? No clue... Won't be when I'm done with it ;-).

Btw, ARMv8 will support a division instruction. Until then, I'll have an excuse to read Knuth.


Wednesday, September 26, 2012

On my long way to a framebuffer console...

Here I was hoping to have the framebuffer console done so I can finally move to working on more interesting aspects of my ARMv7 Xen3 rewrite, but alas before getting there I had to make the console layer more generic.

For an extra bonus I've simplified and cleaned up the serial layer. The whole L/H muxing to handle gdb and dom0/xen console traffic was a little insane :-). I'd rather use "O" gdb packets out (btw is this extension officially documented anywhere?)

Can't wait to move on...but having visible things to show people is cool too ;-).

Tuesday, September 25, 2012

I'm back

So I've decided I should continue to write here. After all, I work on relatively interesting (and, to me, exciting) things that I feel I need to share with the world :-).


It's been an interesting summer. After attending the Linux Symposium in Ottawa with three papers [0] and being again a Tiano Core mentor for Google Summer of Code, I'm back to hacking things filesystem and virtualization related.

My Xen port to Tegra2 has been seeing a lot of attention from me. It's hard to call it a port, as I'm mostly approaching it from a rewrite-everything angle, especially given the research prototype status of the initial Samsung ARMv5 tree I forked over two years ago. Of course, EmbeddedXen and ARM PV efforts haven't been sitting still in the meantime, but I'm treating the project as a personal "let's write a hypervisor" effort. There's no explicit desire to be compatible with any other ARM Xen effort.

My branch is targeting ARMv7, initially since that's what I've got around the house, but after digging through the ARM ARM for both v6 and v7 I'm glad I'm not targeting anything older. The current differences from the original Samsung tree -
* ARMv7-only support.
* Tegra2 platform, targeting the Motorola Xoom
* Dom0/U configuration is not hardcoded.
* Boot through ATAG-compatible bootloader, with all images passed through a "boot volume".
* Kernel threads ("xen domains"), which are currently cooperative, but full preemption is a design goal.
* No ACM, given likely hypercall changes.
* A FIQ-based extensible serial kernel debugger with useful commands to help debugging and bring-up.
* In progress work on a framebuffer debug console.
* Reworked VMM and PT support with AFE and XN.

I'm concentrating on fleshing out crash debugging and framebuffer debug console support at the moment. After that it's (in no particular order) -
* L2 cache integration.
* Synchronization.
* Switching to VM and back.
* Full preemption for xen domains.
* Hypercall definitions.

I'll leave you with a short video demonstration of what I've got at the moment. It's not too amazing, the most interesting stuff is well ahead.


Tuesday, March 6, 2012

So I found a bug with SMP, NMIs and KDB...

If two (or more) unknown NMIs arrive on different CPUs, there is a large chance both CPUs will wind up inside panic(). This is fine, unless you want to enter KDB, since now KDB cannot round up all CPUs, because some of them are stuck inside panic_smp_self_stop with an NMI latched. This is easy to replicate with QEMU. Boot with -smp 4 and send an untargeted, broadcast NMI using the monitor.

The solution is simple - add a new call, try_panic, which will be invoked in cases where some special behavior is desired if someone else is already panicking. For handling unknown NMIs, we now call try_panic instead. If panic() is already active in the system, just exit out of the NMI handler. This lets KDB round up the CPUs.
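The idea can be modelled in user space with a single atomic "owner" variable. This is only a sketch of the logic, not the kernel patch: try_panic_model, unknown_nmi_model and panic_cpu are invented names, and the real handler would call panic() where the model returns true.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NO_CPU (-1)

/* CPU that owns the panic, or NO_CPU if nobody is panicking yet. */
static atomic_int panic_cpu = NO_CPU;

/* Returns true if this CPU won the right to panic, false if another CPU already did. */
bool try_panic_model(int cpu)
{
        int expected = NO_CPU;

        assert(cpu >= 0);
        return atomic_compare_exchange_strong(&panic_cpu, &expected, cpu);
}

/* Model of the unknown-NMI path: true means "panic here", false means "just return". */
bool unknown_nmi_model(int cpu)
{
        if (!try_panic_model(cpu))
                return false; /* leave the NMI handler so this CPU can still be rounded up */
        return true;          /* the real handler would call panic() at this point */
}
```

Only the first CPU through the compare-and-exchange proceeds to panic. Every later CPU leaves NMI context instead of spinning with the NMI latched.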

This affects linux-next.

Friday, March 2, 2012

Blowing the dust off of my Xoom.

ATAG    (P): 0x00000100
Linked  (V): 0xff008000
Actual  (P): 0x00a00800
Desired (P): 0x00108000

Handing off to C...
[XEN]  __  __            _____  ___   ____    ____  
[XEN]  \ \/ /___ _ __   |___ / / _ \ |___ \  |___ \ 
[XEN]   \  // _ \ '_ \    |_ \| | | |  __) |__ __) |
[XEN]   /  \  __/ | | |  ___) | |_| | / __/|__/ __/ 
[XEN]  /_/\_\___|_| |_| |____(_)___(_)_____| |_____|
[XEN] Xen/ARMv7 virtual machine monitor for ARM architecture
[XEN] Copyright (C) 2012 Andrei Warkentin <>
[XEN] Copyright (C) 2007 Samsung Electronics Co, Ltd. All Rights Reserved.
[XEN]  University of Cambridge Computer Laboratory
[XEN]  Xen version 3.0.2-2 (andreiw@(none)) (gcc version 4.4.3 (GCC) ) Fri Mar  2 01:59:34 EST 2012
[XEN]  Platform: arm-tegra
[XEN]  GIT SHA: ffd558debcf08dcf59a0c38115906030bf6f261c
[XEN] TTB PA 0x104000
[XEN] idle_pgd VA 0xff004000
[XEN] xen_pstart 0x0
[XEN] xen_pend 0x40000000
[XEN] _end 0xff03e708
[XEN] _end VA 0x13E708
[XEN] nr_pages needed for all page_infos = 0x500
[XEN] frame table is at 0xff03f000-0xff53f000
[XEN] xenheap_phys_start = 0x648000 (VA 0xff548000)
[XEN] xenheap_phys_end = 0x848000 (VA 0xff748000)
[XEN] looking at bank 0
[XEN]         base - 0x0
[XEN]         end  - 0x40000000
[XEN] calling init_boot_pages on 0x648000-0x40000000
[XEN] Using scheduler: Simple EDF Scheduler (sedf)
[XEN] Initializing ARM FCSE Unit
[XEN] *** LOADING DOMAIN : 0 ***
[XEN] DOM0 image is not a Xen-compatible Elf image.
[XEN] Could not set up DOM0 guest OS
[XEN] VMM Panic at xensetup.c:357

Hopefully the next time I work on this won't be in another half a year.

Wednesday, February 22, 2012

ARM virtualization.

So this is pretty awesome. It's not just a prototype for the ARM big.LITTLE architecture, but also a good example of writing a hypervisor targeting ARM.

Monday, February 6, 2012

How not to write specifications.

Since my current definition of "havin' a good time" means attempting to start a native NT application in Linux, I am forced to be quite familiar with the PE-COFF format.

Needless to say, this is a poorly written specification. Here are some of the questions you won't find an answer to.
  1. Endianness of the format. Is it always little-endian? (i.e. for big-endian machines as well?). Apparently, yes.
  2. Endianness of applying the relocation records. The base relocation record is obviously LE, but what about the modified VAs? I would assume target-endianness, but this isn't actually noted.
  3. Optional header checksum: what's the actual algorithm? I mean, it can't be any more interesting than a CRC32, and an *interested party* will obviously reverse engineer this, so you can't actually think that hiding such details is a security mechanism?
  4. What is the expected result of IMAGE_REL_BASED_HIGHADJ? Community consensus implies that the high value of the 32-bit word modified needs to be sign adjusted. Why not just say that in the specification?
  5. Why not list what base relocation types apply to what architectures?
I am sure there are more...
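On the first question, a loader that wants to be host-independent can simply decode every header field byte by byte as little-endian. Below is a minimal sketch that finds the PE signature and reads two COFF header fields that way. The function names are made up. The offsets (0x3c for the PE header offset, then Machine and NumberOfSections right after the "PE\0\0" signature) are the standard ones.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Decode little-endian fields explicitly, regardless of host byte order. */
static uint16_t le16(const uint8_t *p)
{
        return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t le32(const uint8_t *p)
{
        return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
               ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns 0 and fills machine / nsections on success, -1 if this is not a PE image. */
int pe_parse(const uint8_t *img, size_t len, uint16_t *machine, uint16_t *nsections)
{
        uint32_t pe_off;

        assert(img != NULL);
        if (len < 0x40 || img[0] != 'M' || img[1] != 'Z')
                return -1;

        pe_off = le32(img + 0x3c);
        /* The 4-byte signature plus the 20-byte COFF file header must fit. */
        if (pe_off > len || len - pe_off < 24)
                return -1;
        if (memcmp(img + pe_off, "PE\0\0", 4) != 0)
                return -1;

        *machine = le16(img + pe_off + 4);
        *nsections = le16(img + pe_off + 6);
        return 0;
}
```

This sidesteps the question for the headers. It does not answer question 2, since the byte order of the patched addresses still has to be assumed.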

Sunday, January 29, 2012

Linux Kernel unit testing.

Most software engineers, when asked how they would approach designing and implementing a particular component, will always say something about unit tests. Especially so if put on the spot in an interview setting. Yet, examining something like the Linux kernel, Xen or Tiano Core UEFI, you see plenty of new complex code that has no unit or component testing anywhere in sight.

Varying SCSI queue depth for VMware PVSCSI block devices.

I was toying with the block subsystem a bit in a Linux virtual machine running under ESX 5.0, when I realized I could not change the SCSI queue depth. It turns out that the driver simply didn't implement the interface! It was, however, pretty easy to fix this.

Now you can do something like the following:
# for i in {a..z}; do echo 1 > /sys/block/sd$i/device/queue_depth; done

The patch should make it to mainline when the current PVSCSI maintainer, Arvind Kumar, gets it integrated.

Why would you care? Because /sys/block/sda/device/queue_depth is quite different from /sys/block/sda/queue/nr_requests. nr_requests controls the request flow before the I/O scheduler, while queue_depth controls the flow of actual dispatch to the I/O device. You might be interested in either of those if you run with multiple VMDKs, have an intensive I/O workload on one disk, and notice starvation on others.
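If you would rather set both knobs from a program than from the shell, something like the following sketch would do. The helper names are made up. The two sysfs paths are the ones discussed above, and writing them requires root.

```c
#include <assert.h>
#include <stdio.h>

/* Write a decimal value to a sysfs-style attribute file. Returns 0 on success. */
int write_attr(const char *path, unsigned int value)
{
        FILE *f;
        int ok;

        assert(path != NULL);
        f = fopen(path, "w");
        if (f == NULL)
                return -1;
        ok = fprintf(f, "%u\n", value) > 0;
        if (fclose(f) != 0)
                ok = 0;
        return ok ? 0 : -1;
}

/* Set the device dispatch depth and the pre-scheduler request limit for one disk, e.g. "sda". */
int tune_disk(const char *disk, unsigned int queue_depth, unsigned int nr_requests)
{
        char path[256];
        int n;

        n = snprintf(path, sizeof(path), "/sys/block/%s/device/queue_depth", disk);
        if (n < 0 || (size_t)n >= sizeof(path))
                return -1;
        if (write_attr(path, queue_depth) != 0)
                return -1;

        n = snprintf(path, sizeof(path), "/sys/block/%s/queue/nr_requests", disk);
        if (n < 0 || (size_t)n >= sizeof(path))
                return -1;
        return write_attr(path, nr_requests);
}
```

A call such as tune_disk("sda", 32, 128) would then throttle dispatch to the device separately from the queueing in front of the I/O scheduler. The values here are arbitrary examples, not recommendations.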


One gloomy evening I decided to look at the latest Portable Executable specification, and thought it would be pretty cool to write a PE loader.

Doing so under Linux is not particularly difficult, given the binfmt infrastructure, already well used to support legacy and emulation targets.

Two gloomy evenings later I had something that could load a rudimentary PE-COFF executable :-). It doesn't handle an IAT yet, so no shared objects, and since I was in a hurry and tired, no relocations and section alignment must equal file alignment, but I'll work those three out eventually.

Since I didn't have a PE tool chain on hand, I assembled the headers manually, kindly borrowing from the Tiny PE work.

So, uh, why? Firstly, because I can. It's fun, and it exposes me to those parts of the kernel that you don't otherwise have much opportunity to meddle in (and where the learning curve is steeper than usual). But lately I was wondering what it would take to run the ReactOS userspace under Linux... So you could say my end goal is to write an NT personality for Linux, so I can run an unmodified ReactOS smss.exe with an unmodified ntdll.dll.

Anyway, as usual, the patch against 3.2+ and an example hello.asm are on my GitHub account.

Have fun!