codingFun codingFun - 1 year ago 119
Linux Question

How to estimate the seek speed in file system

Suppose there is a big file on 1TB SSD (read and write at 500MB/sec) with ext4 file system. This file is close to 1TB in size. How to estimate the speed of fseek() to the middle of the file. Is it going to take seconds or a few milliseconds? Thanks.

Answer Source

To estimate latency of fseek, we should break it into two parts: software work and hardware seek time latency. Software work is implementation of ext4 filesystem (FS, in Linux this is VFS subsystem of kernel), which will generate several "random" requests (I/O operations) to the hardware block storage device. Hardware will use some time to handle each random request.

Classic UNIX filesystems (UFS/FFS) and linux filesystems, designed after them, use superblock to describe fs layou on disk, store files as inodes (there are arrays of inodes in known places), and store file data in blocks of fixed size (up to 4KB in Linux). To find inode from file name OS must read superblock, find every directory in the path, read data from directory to find what inode number the file has (ls -i will show you inodes of current dir). Then, using data from superblock OS may calculate where the inode stored and read the inode.

Inode contains list of file data blocks, usually in tree-like structures, check or ext2 inode pointer structure, gpl from

First part of file, several dozens of KB is stored in blocks, listed directly in the inode (direct blocks; 12 in ext2/3/4). For larger files inode has pointer to one block with list of file blocks (indirectly addressed blocks). If file is larger, next pointer in inode is used, to describe "double indirect blocks". It points to block which enumerate other blocks, each of them contains pointers to blocks with actual data. Sometimes triple indirect block pointer is needed. These trees are rather efficient, at every level there is a degree of ~512 (4KB block, 8 bytes per pointer). So, to access the data from middle of the file, ext2/3/4 may generate up to 4-5 low-level I/O requests (superblock is cached in RAM, inode may be cached too). These requests are not consequent in addresses, so they are almost random seeks to the block device.

Modern variants of linux FS (ext4, XFS) have optimization for huge file storage, called extent ( Extents allow FS to describe file placement not as block lists, but as array of file fragments / pointer pairs (start_block, number_of_consequent_blocks). Every fragment is probably from part of MB up to 128 MB. 4 first extents are stored inside inode, more extents are stored as tree-like structure again. So, with extents you may need 2-4 random I/O operations to access middle of the file.

HDD had slow access time for random requests, because they should physically move headers to correct circular track (and position headers exactly on the track; this needs some part of rotation, like 1/8 or 1/16 to be done) and then wait for up to 1 rotation (revolution) of disks (platters) to get requested part of track. Typical rotation speeds of HDDs are 5400 and 7200 rpm (revolutions per minute, 90 rps and 120 rps) or for high-speed enterprise HDDs - 10000 rpm and 15000 rpm (160 rps and 250 rps). So, the mean time needed to get data from random position of the disk is around 0.7-1 rotations, and for typical 7200 rpm hdd (120rps) it is around 1/120 seconds = 8 ms (milliseconds) = 0.008 s. 8ms are needed to get data for every random request, and there are up to 4-5 random requests in your situation, so with HDD you may expect times up to near 40 ms to seek in file. (First seek will cost more, next seeks may be cheaper as some part of block pointer tree is cached by OS; seeks to several next blocks are very cheap because linux can read them just after first seek was requested).

SSD has no rotating or moving parts, and any request on SSD is handled in the same way. SSD controller resolves requested block id to internal nand chip+block id using its own translation tables and then reads real data. Read of data from NAND will check error correction codes, and sometimes several internal rereads are needed to correctly read the block. Read is slower in cheaper NAND type - TLC with 3 bits data stored in every cell - 8 levels; faster in MLC - 2 bit of data with 4 levels; very fast in non-existent SLC SSD with 1 bit and only 2 levels. Read is also slower in worn out SSDs or in SSDs with errors (wrong models of cell charge degradation) in their firmware.

Speed of such random accesses in SSD are very high, they are usually declared is SSD specification like 50000 - 100000 IOPS (I/O operations per second, usually 4KB). High IOPS counts may be declared for deeper queue, so real average random read latency of SSD (with QD1) is 200-300 microseconds per request (0.2 - 0.3 ms; in 2014; some part of the latency is slow SATA/SCSI emulation; NVMe SSDs may be faster as they use simpler software stack). With our 4-5 requests we can estimate fseek on SSD as few milliseconds, for example up to 1 - 1.5 ms or sometimes longer.

You can check time needed for fseek using strace -T ./your_fseek_program. It will report time needed to execute every syscall. But to get real latency, you should check not only seek time, but also time of next read syscall. Before each running of this test you may want to flush kernel caches with echo 3 > /proc/sys/vm/drop_caches command from root (

You may also try some I/O benchmarks, like iozone, iometer or fio to estimate seek latencies.

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download