I have a script that lets the user upload text files (PDF or doc) to the server, then the plan is to convert them to raw text. But until the file is converted, it's in its raw format, which makes me worried about viruses and all kinds of nasty things.
Any ideas what I need to do to minimize the risk from these unknown files? How can I check that a file is clean, that it's really the format it claims to be, and that it won't crash the server?
I posted this as a comment to Aerik, but it's really the answer to the question.
If you have PHP >= 5.3 use finfo_file(). If you have an older version of PHP you can use mime_content_type() (less reliable) or load the Fileinfo extension from PECL.
Both of these functions return the MIME type of the file (by looking at the type of data inside it). For PDF it should be application/pdf.
For a Word doc it could be a few things. Generally it should be application/msword.
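A minimal sketch of that check, assuming the Fileinfo extension is loaded. The helper name detect_allowed_mime() and the allow-list are made up for illustration; adjust the list to whatever types your converter actually handles.

```php
<?php
// Hypothetical helper (not part of PHP): returns the detected MIME type
// if it is on the allow-list, or null otherwise.
function detect_allowed_mime($path, $allowed = array('application/pdf', 'application/msword'))
{
    // FILEINFO_MIME_TYPE asks for just the type, e.g. "application/pdf",
    // without the charset suffix.
    $finfo = finfo_open(FILEINFO_MIME_TYPE);
    if ($finfo === false) {
        return null; // magic database could not be opened
    }

    // The type is detected from the file's contents, not its name.
    $mime = finfo_file($finfo, $path);
    if ($mime === false) {
        return null; // file unreadable or detection failed
    }

    // Strict comparison so only exact allow-list matches pass.
    return in_array($mime, $allowed, true) ? $mime : null;
}
```

In an upload handler you would call it on $_FILES['userfile']['tmp_name'] (the field name 'userfile' is an assumption) and reject the upload when it returns null. The type the browser sends in $_FILES[...]['type'] is supplied by the client, so it shouldn't be trusted for this.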
If your server is running *nix, make sure the files you're saving aren't executable. Even better: save them to a folder outside the web root, so the web server never serves them. Your code can still read the files, but someone requesting a web page won't be able to access them at all.
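Here is one way that could look, as a sketch only. The function name store_upload_privately() and the directory you pass in are hypothetical; it assumes PHP 7+ for random_bytes() and a target directory that lives outside the web root and is writable by PHP.

```php
<?php
// Hypothetical helper (not part of PHP): moves an uploaded file into a
// private directory under a random name and strips execute permission.
// Returns the new path, or null if the file was not a genuine upload.
function store_upload_privately($tmpPath, $privateDir)
{
    // A random name means the user-supplied file name never touches the
    // filesystem. random_bytes() requires PHP 7 or later.
    $target = rtrim($privateDir, '/') . '/' . bin2hex(random_bytes(16));

    // move_uploaded_file() refuses anything that did not arrive through
    // an HTTP POST upload and returns false in that case.
    if (!move_uploaded_file($tmpPath, $target)) {
        return null;
    }

    // Owner read/write, group read, nobody may execute.
    chmod($target, 0640);

    return $target;
}
```

The conversion script can then open the returned path directly, while no URL maps to the stored file.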