arvindh arvindh - 1 year ago 82
Python Question

What is best for string extraction or pattern matching: regex, awk, or Emacs Lisp?

I am supposed to determine whether a given file is a media file, not from its extension but from its header information. So I opened some file formats in Emacs just to observe what is inside and what can be done. On analyzing the contents, I found that some strings were not only in the first line (the header information) but also in the last few lines. So basically, the strings I was looking for were on a handful of lines at both the start and the end of the file.

Also, finding specific strings manually was impractical, so I thought of automating the process.

For example, this was the first line:

\00\00\00 ftypqt \00qt \00\00\00\00\00\00\00\00\00\00\00\00\00\00\00wide\00\CF\E1mdat\00\00\00wide\00\00\00\00mdat\00\00\00\00\00\00\00\00\E0\00\00\00\00\FF\A6\00\00\00\00\00\00 \00\00\00\008\00\00\82X\00\00\00@\80\00\87\F4N\CD

The last line was:

\F7\00\80\004\8D\00Z\A2\00\84p\00\9D\8F\00\B6\A5\00\CDt\00\DF\00\ED\8F\007\004\8C\00A\9D\00\00\00udta\00\00\00\00\00\00\00Wudta\00\00\00hinv7.6\00\00\00@hnti\00\00\008rtp sdp b=AS:265

So I had to scan the whole file line by line for a specific type of string. But first I had to know which strings to look for in each line of the file. So I thought of scanning some random media files and extracting the contents that looked like words to me (inside these files a word did not necessarily have a space character on either side; what I was looking for was within a/A - z/Z and 0-9).

Given this scenario, the first thing that came to mind was to use
. But I later realized from SO that

can do paragraph-oriented operations.

Then I came across a page saying that:

Emacs Lisp is a good choice if you need sophisticated
string or pattern matching capabilities.

So, finally, I want to go through each file (files with various extensions) and look for words (things that look like words to me, say any run of at least 3 consecutive English letters; the first line/header information quoted above contains several such runs). Then I want to write those words into a different file, so that I can open that file and see only the words picked from each line of each file.

Can anyone please give me some idea of which would work well for this: Emacs Lisp or anything else? Please forgive me if my English is bad.

Answer Source

As you said, you are supposed to determine whether a given file is a media file without relying on its extension. I can offer you 2 possibilities:

  1. Magic number in file

    So have a look at the Wikipedia definition of a magic number for files:

Magic numbers are common in programs across many operating systems. Magic numbers implement strongly typed data and are a form of in-band signaling to the controlling program that reads the data type(s) at program run-time. Many files have such constants that identify the contained data. Detecting such constants in files is a simple and effective way of distinguishing between many file formats and can yield further run-time information.

To read it in Python:

How to check type of files without extensions in python?
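As a rough illustration of the magic-number idea, here is a minimal Python sketch that compares the first bytes of a file against a small signature table. The function names `sniff_header` and `sniff_file` and the table itself are hypothetical and far from exhaustive; the `ftyp` entry at offset 4 matches the QuickTime header shown in the question.

```python
# Hypothetical, non-exhaustive signature table: (offset, magic bytes, label).
SIGNATURES = [
    (4, b"ftyp", "ISO base media / QuickTime (MP4, MOV)"),
    (0, b"RIFF", "RIFF container (AVI, WAV)"),
    (0, b"OggS", "Ogg"),
    (0, b"\x1a\x45\xdf\xa3", "Matroska / WebM (EBML)"),
    (0, b"ID3", "MP3 with ID3v2 tag"),
    (0, b"\xff\xd8\xff", "JPEG"),
    (0, b"\x89PNG\r\n\x1a\n", "PNG"),
]


def sniff_header(header):
    """Return a label for the first matching signature, or None."""
    for offset, magic, label in SIGNATURES:
        if header[offset:offset + len(magic)] == magic:
            return label
    return None


def sniff_file(path, size=16):
    """Read only the first `size` bytes of the file and sniff them."""
    with open(path, "rb") as f:
        return sniff_header(f.read(size))
```

Calling `sniff_file("some_video.mov")` (a placeholder path) would return the label of the first matching signature, or `None` when nothing in the table matches. A real detector would need many more signatures, which is what libraries and tools built for this purpose provide.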

  2. Use a tool that does this (it also reads the header):

    Extract metadata from files like this:

hachoir-metadata extracts metadata from multimedia files: music, picture, video, but also archives. It supports most common file formats:

    Archives: bzip2, gzip, zip, tar
    Audio: MPEG audio (“MP3”), WAV, Sun/NeXT audio, Ogg/Vorbis (OGG), MIDI, AIFF, AIFC, Real audio (RA)
    Misc: Torrent
    Program: EXE
    Video: ASF format (WMV video), AVI, Matroska (MKV), Quicktime (MOV), Ogg/Theora, Real media (RM)
$ hachoir-metadata pacte_des_gnous.avi
- Duration: 4 min 25 sec
- Comment: Has audio/video index (248.9 KB)
- MIME type: video/x-msvideo
- Endian: Little endian
Video stream:
- Image width: 600
- Image height: 480
- Bits/pixel: 24
- Compression: DivX v4 (fourcc:"divx")
- Frame rate: 30.0
Audio stream:
- Channel: stereo
- Sample rate: 22.1 KHz
- Compression: MPEG Layer 3
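
Separately, for the exploratory step described in the question (collecting word-like runs from binary files into another file), a regex over raw bytes in Python may be enough. This is only a sketch: the names `extract_words` and `dump_words` are hypothetical, and it assumes a "word" means at least 3 consecutive ASCII letters or digits, as the question describes.

```python
import re

# Runs of at least 3 consecutive ASCII letters or digits, matched on raw bytes.
WORD_RE = re.compile(rb"[A-Za-z0-9]{3,}")


def extract_words(data):
    """Return all word-like runs found in a bytes object, as str."""
    return [match.decode("ascii") for match in WORD_RE.findall(data)]


def dump_words(paths, out_path):
    """Write one line per input file: the file path, then its word-like runs.

    Each file is read fully into memory, which is fine for a quick survey
    but may be too much for very large media files.
    """
    with open(out_path, "w") as out:
        for path in paths:
            with open(path, "rb") as f:
                words = extract_words(f.read())
            out.write(f"{path}: {' '.join(words)}\n")
```

Because the file is opened in binary mode and the pattern is a bytes pattern, non-text bytes are simply skipped rather than causing decoding errors. The output is similar in spirit to the Unix `strings` utility.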