
Here's a little mystery for you. Imagine you have three files on a Windows machine, called:

HTML_File.htm
PDF_File.pdf
Text_File.txt


Windows will have no problem opening the appropriate program when you double-click them, because it relies on those three-letter extensions. If, however, you drop the extensions or mix them up, you'll have problems.

Now copy the three files to a Linux machine. You can see how the operating system perceives them by typing the file command, so file * will list them all ...

HTML_File.htm:   HTML document text
PDF_File.pdf:    PDF document, version 1.3
Text_File.txt:   ASCII text

Okay, let's try mixing up the extensions ...

HTML_File.txt:   HTML document text
PDF_File.htm:    PDF document, version 1.3
Text_File.pdf:   ASCII text

How about dropping them altogether?

HTML_File:       HTML document text
PDF_File:        PDF document, version 1.3
Text_File:       ASCII text

Still no difference. So how does the system know what's what?

Linux uses file to determine a file's type by the use of 'magic numbers' -- specific bytes stored in particular locations, typically near the beginning of the file.
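
To make the idea concrete, here is a minimal sketch in Python of a magic-number lookup. It is not how file itself is implemented; the table layout and the names SIGNATURES and sniff are invented for illustration, though the byte signatures listed are the well-known ones for these formats.

```python
# Illustrative signature table: (offset, magic bytes, description).
# The table layout and names are assumptions made for this sketch.
SIGNATURES = [
    (0, b"%PDF-", "PDF document"),
    (0, b"GIF87a", "GIF image data, version 87a"),
    (0, b"GIF89a", "GIF image data, version 89a"),
    (0, b"PK\x03\x04", "Zip archive data"),
    (0, b"\x1f\x8b", "gzip compressed data"),
]


def sniff(data: bytes) -> str:
    """Return the description of the first matching signature, else 'unknown'."""
    for offset, magic, description in SIGNATURES:
        # Compare the bytes at the expected offset with the known signature.
        if data[offset:offset + len(magic)] == magic:
            return description
    return "unknown"
```

Feed it the first few bytes of a file, for example sniff(open("PDF_File", "rb").read(16)), and the file name never enters into the decision.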

Actually, file performs three checks. First it looks to see if the file is empty or is some sort of special file like a directory or a link. Then it checks for known magic numbers. If that fails it checks if the file is plain text, and if so what type -- ASCII, for example, or ISO-8859-x, non-ISO 8-bit extended-ASCII, or UTF-8-encoded Unicode, etc. If all those checks fail, the file is reported as being 'data'.
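
The order of those checks can be sketched roughly as follows. This is a simplified illustration in Python, not the real implementation: the function name classify, the two-entry MAGIC table and the exact wording of the results are assumptions, and the text test is far cruder than the one file applies.

```python
import codecs
import os
import stat

# Tiny stand-in for the real magic database (an assumption for this sketch).
MAGIC = [(b"%PDF-", "PDF document"), (b"GIF8", "GIF image data")]


def classify(path: str) -> str:
    # Check 1: empty or special files. lstat does not follow symbolic links.
    info = os.lstat(path)
    if stat.S_ISLNK(info.st_mode):
        return "symbolic link to " + os.readlink(path)
    if stat.S_ISDIR(info.st_mode):
        return "directory"
    if not stat.S_ISREG(info.st_mode):
        return "special file"
    if info.st_size == 0:
        return "empty"

    with open(path, "rb") as handle:
        head = handle.read(4096)

    # Check 2: known magic numbers at the start of the file.
    for magic, description in MAGIC:
        if head.startswith(magic):
            return description

    # Check 3: does the content look like text, and in which encoding?
    if b"\x00" not in head:
        try:
            head.decode("ascii")
            return "ASCII text"
        except UnicodeDecodeError:
            pass
        try:
            # final=False tolerates a multi-byte character cut off at 4096 bytes.
            codecs.getincrementaldecoder("utf-8")().decode(head, final=False)
            return "UTF-8 Unicode text"
        except UnicodeDecodeError:
            pass

    # All checks failed.
    return "data"
```

The point of the ordering is that cheap, certain answers come first, and 'data' is only what remains when nothing else matched.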

A simple file call can tell you quite a lot about a file's contents, such as in the following examples:

Backup.zip:         Zip archive data, at least v2.0 to extract
Bike Ride.mpg:      MPEG sequence, v2, program multiplex
Help.rtf:           Rich Text Format data, version 1, ANSI
myzip.tar.gz:       gzip compressed data, from Unix, last modified: Thu Jun 11 02:30:36 2009
Notes:              ASCII text
Pictures:           symbolic link to `/home/geoff/Pictures'
print.gif:          GIF image data, version 89a, 560 x 174
Shorts.avi:         RIFF (little-endian) data, AVI, 320 x 240, ~30 fps, video: XviD, audio: MPEG-1 Layer 3 (stereo, 22050 Hz)
Ski Jump.mov:       ISO Media, Apple QuickTime movie
Turino.wmv:         Microsoft ASF
Video.mov:          ISO Media, Apple QuickTime movie
Web Notes:          UTF-8 Unicode English text, with very long lines
yuk!.exe:           MS-DOS executable PE  for MS Windows (GUI) Intel 80386 32-bit
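
Extra details like the GIF version and dimensions come from fixed positions in the file header, just past the magic number. As a rough sketch (the function name describe_gif is made up, and this is only an illustration of the header layout, not the code file uses), a GIF starts with a six-byte signature followed by the width and height as little-endian 16-bit integers:

```python
import struct


def describe_gif(header: bytes) -> str:
    """Describe a GIF from the first 10 bytes of the file."""
    if len(header) < 10 or header[:6] not in (b"GIF87a", b"GIF89a"):
        return "not a GIF"
    # Bytes 3-5 of the signature hold the version, e.g. "89a".
    version = header[3:6].decode("ascii")
    # Bytes 6-9: width and height, two little-endian unsigned 16-bit integers.
    width, height = struct.unpack_from("<HH", header, 6)
    return f"GIF image data, version {version}, {width} x {height}"
```

Calling describe_gif(open("print.gif", "rb").read(10)) on a real GIF should give a line in the same style as the print.gif entry above.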

Add the -s parameter and you can look at special files, such as hard disk formats! (Note, you need admin privileges for this, hence the 'sudo'.)

sudo file -s /dev/sda

/dev/sda: x86 boot sector; partition 2: ID=0x83, starthead 254, startsector 954646560, 21880530 sectors; partition 3: ID=0x83, starthead 254, startsector 566419770, 388226790 sectors; partition 4: ID=0x5, starthead 1, startsector 63, 566419707 sectors, code offset 0x4, Bytes/sector 1766, sectors/cluster 87, reserved sectors 36434, FATs 192, root entries 64763, sectors 191 (volumes <=32 MB) , Media descriptor 0x6, sectors/FAT 185, heads 165, hidden sectors 1568, sectors 3141645394 (volumes > 32 MB) , physical drive 0xaa, physical drive 0x2a, reserved 0x55, dos < 4.0 BootSector (0x31)
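
The partition details in that output are read from the classic boot sector layout: a two-byte signature (0x55 0xAA) at the end of the first 512-byte sector, preceded by four 16-byte partition entries starting at offset 446. Here is a hedged sketch of pulling out the same ID, startsector and sector count fields in Python. The function name read_partition_types is invented for illustration, and it works on a bytes buffer rather than opening the device itself.

```python
import struct


def read_partition_types(sector: bytes) -> list:
    """Return (number, type_id, start_sector, sector_count) for used entries."""
    # The boot sector signature occupies the last two bytes of the sector.
    if len(sector) < 512 or sector[510:512] != b"\x55\xaa":
        raise ValueError("no boot sector signature found")
    entries = []
    for index in range(4):
        entry = sector[446 + 16 * index:446 + 16 * (index + 1)]
        # Entry layout: status(1) CHS start(3) type(1) CHS end(3)
        # LBA start(4) sector count(4); the last two are little-endian.
        type_id = entry[4]
        start, count = struct.unpack_from("<II", entry, 8)
        if type_id != 0:  # type 0 marks an unused slot
            entries.append((index + 1, type_id, start, count))
    return entries
```

With root privileges you could try it on open("/dev/sda", "rb").read(512). A type of 0x83 is a Linux partition and 0x5 is an extended partition, matching the ID values shown above.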


You can also look at individual partitions ...

sudo file -s /dev/sda{1,2,3,4,5}

/dev/sda1: Linux rev 1.0 ext3 filesystem data
/dev/sda2: x86 boot sector; partition 2: ID=0x5, starthead 254, startsector 29302560, 204941205 sectors, extended partition table
/dev/sda3: Linux rev 1.0 ext3 filesystem data (needs journal recovery) (large files)
/dev/sda4: ERROR: cannot open `/dev/sda4' (No such file or directory)
/dev/sda5: Linux/i386 swap file (new style) 1 (4K pages) size 487973 pages

You'll find a list of magic numbers in /usr/share/file/magic. You can add your own file types in /etc/magic (to make them system-wide) or in $HOME/.magic (for your own account only). The format is described -- with no offence to feminists intended -- in man magic.




Comments

Just one correction:
"Linux uses file to determine a file's type..."

It doesn't use "file"; file managers use "libmagic", the same library the "file" command uses. Otherwise, far too many child processes would be spawned.
