I just love to see where these technical discussions sometimes end up, or start off for that matter. While reading a /. post on the Largest Hacking Scam in Canadian History, I watched an interesting debate kick off on “There is no data which is not also a program/piece of executable (in some way or the other) code”. Following are excerpts from the thread:
It doesn’t even really matter at this point. Let’s be honest… the average computer user doesn’t know the difference between U2-Somesong.mp3 and U2-SomeSong.exe.
To make matters worse, some attacks may even occur if you are dealing with safe file types, like a PNG [microsoft.com] or even PDF [softpedia.com]. Some security problems exist due to the user’s ignorance or idiocy but “some” isn’t exactly the same thing as “all”.
There are no safe file types. All files can be viewed as programs meant to run in a specialized virtual machine (the program which is used to open them). For example, a PNG file is a program which, when run, will compute an array of bytes (the image pixels). The same goes for PDF. In this view, since all files are programs, it is in principle possible that any of them could contain code which can result in unexpected behavior of the virtual machine executing them.
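To make the “file as a program” view concrete, here is a minimal sketch in Python. The format is a made-up run-length encoding, not PNG, and the name `run_image_program` is purely illustrative: each pair of bytes in the file is treated as one instruction, “emit this pixel value N times”.

```python
def run_image_program(data: bytes) -> bytes:
    """Interpret `data` as (count, value) byte pairs and return the pixels.

    Hypothetical toy format for illustration only; real image formats
    such as PNG are far more complex.
    """
    if len(data) % 2 != 0:
        # A malformed "program": the last instruction has no operand.
        raise ValueError("truncated file: odd number of bytes")

    pixels = bytearray()
    for i in range(0, len(data), 2):
        count, value = data[i], data[i + 1]
        # One instruction: append `value` to the output `count` times.
        pixels.extend([value] * count)
    return bytes(pixels)


# Three white pixels followed by two black ones.
toy_file = bytes([3, 0xFF, 2, 0x00])
print(run_image_program(toy_file))
```

Even in this tiny decoder the file, not the decoder, decides how much memory gets allocated and what ends up in it, which is the sense in which the viewer acts as a virtual machine for the file.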
Of course some file types are easier to compromise than others, either due to sheer complexity or ambiguity of the specification or because they are Turing complete. However, it is impossible to guarantee that every viewer for any file type is free of defects. Anyone still remember ANSI codes for DOS, which could be embedded in text to change color but also to assign macros to keyboard keys when the file was viewed? And of course SQL injection attacks are based on formatting a text string so it will cause unexpected results, not to mention causing a buffer overflow with an overlong string.
I repeat: there are no safe file types. They all have a potential to contain malicious code, because there is no such thing as data which is not also a program. From a certain point of view, GIMP is simply a very specialized compiler…
Is a text file containing a single line of text followed by a carriage return a program? How about the standard input device? When I type at the console keyboard, is that a program feeding into a “virtual machine” created by the console driver? If not, why is a disk device different from another device?
I think you’re missing the fundamental theorem of modern computer science — that “data” and “instruction” are completely interchangeable. See generally, the halting problem.
Is a text file containing a single line of text followed by a carriage return a program?
It can be. For example:
'; ROLLBACK; UPDATE users SET admin = true WHERE username = 'ultranova'; '
If the virtual machine which handles the username field of the Slashdot login form naively passed this string to the database layer without properly quoting it, this text string would make my account an admin account. Well, actually, since I haven’t studied Slashcode, it probably wouldn’t, but the point still stands: even text is not an inherently safe data format in all circumstances.
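The scenario can be reproduced in a few lines with Python's built-in `sqlite3` module. This is only a sketch under stated assumptions: the `users` table, the helper names, and the payload (a variation on the string above, adapted to SQLite) are all made up for illustration and have nothing to do with how Slashdot actually works.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create a throwaway in-memory database with one non-admin user."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (username TEXT PRIMARY KEY, admin INTEGER)")
    conn.execute("INSERT INTO users VALUES ('ultranova', 0)")
    conn.commit()
    return conn


def is_admin(conn: sqlite3.Connection, username: str) -> bool:
    row = conn.execute(
        "SELECT admin FROM users WHERE username = ?", (username,)
    ).fetchone()
    return bool(row and row[0])


def naive_lookup(conn: sqlite3.Connection, username: str) -> None:
    # Deliberately vulnerable: the input is pasted into the SQL text,
    # so the database layer "runs" whatever the user typed.
    conn.executescript(f"SELECT * FROM users WHERE username = '{username}';")


def safe_lookup(conn: sqlite3.Connection, username: str) -> None:
    # Parameter binding keeps the input as a plain value, never as SQL.
    conn.execute("SELECT * FROM users WHERE username = ?", (username,)).fetchall()


# Hypothetical payload: closes the string, then smuggles in an UPDATE.
payload = "'; UPDATE users SET admin = 1 WHERE username = 'ultranova'; --"

db = make_db()
naive_lookup(db, payload)
print("naive:", is_admin(db, "ultranova"))

db = make_db()
safe_lookup(db, payload)
print("safe:", is_admin(db, "ultranova"))
```

The same line of text is harmless data in `safe_lookup` and a working program in `naive_lookup`; the only difference is which “virtual machine” it is handed to.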
How about the standard input device? When I type at the console keyboard, is that a program feeding into a “virtual machine” created by the console driver?
The virtual machine in this case would be whatever program receives the input. And yes, the text you type is indeed a program being executed by that machine; each time it receives a keypress from you, that keypress instructs it to do something, right? Even if that something is merely to output the letter (although a text editor would also store the input internally, of course). And that is what a program is: a list of instructions.
If not, why is a disk device different from another device?
It isn’t.
I’m with you on this. I know there may be a True Computer Science definition that makes the GP true, but I don’t tend to think of data as a program. Some binary data could be considered code to execute, but surely not text files that are parsed?
Okay, sure, there are scripts, but they have special parsers that turn the text into Real Code that CAN execute. I don’t think notepad can turn a text document into Real Code.
The OP has suggested a view that I have often thought about myself, although I have rarely found anyone who quickly grasps the concept.
Think of Notepad, since you have mentioned it. When Notepad opens a file it looks at the contents and does certain things depending on the content of the file. If the first character is hex 61 then Notepad will display an “a” in the first character location on the screen. OK, so that is because hex 61 is ASCII “a”, but that is an arbitrary choice that has been standardised. You can, if you like, look at Notepad as if it is an interpreter for a rather strange and limited language where 0x61 is one of the commands. In some ways it is rather like those old interpreted BASICs, since it is responding both to the file you have opened and to the keys you press on the keyboard. There have been attempts to make languages where instead of typing in commands you select icons with a GUI and join them up in a flowchart-like sequence. The ones I saw were interpreted, but there is nothing to stop them being a compiled language and thus eventually resulting in real code in a binary file. It is only a small step from there to looking at, say, Photoshop as being a sort of real time mode interpreted language. (Real time in the sense that the commands execute straight away, like the mode in the old BASICs.)
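As a rough illustration of that point, here is a toy “viewer” in Python that treats every byte of a file as a command. The function name `view` and the handful of commands it understands are assumptions made for this sketch; it is not a description of how Notepad is actually implemented.

```python
def view(data: bytes, tab_width: int = 8) -> list[str]:
    """Run the bytes of a text file as commands and return the screen rows."""
    rows = [""]
    for byte in data:
        if byte == 0x0A:
            # Command: start a new row.
            rows.append("")
        elif byte == 0x09:
            # Command: pad the current row up to the next tab stop.
            pad = tab_width - len(rows[-1]) % tab_width
            rows[-1] += " " * pad
        elif 0x20 <= byte <= 0x7E:
            # Command: draw one glyph, e.g. 0x61 draws "a".
            rows[-1] += chr(byte)
        else:
            # Unknown opcode: this sketch just draws a placeholder.
            rows[-1] += "?"
    return rows


print(view(b"a\tb\nc"))
```

The last branch is where things get interesting in a real viewer: what the program does with a byte it did not expect is exactly the part that is rarely specified or tested.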
In some ways this insight is interesting, although not necessarily very useful. But it should serve to remind us that much of our thinking about computers is based on elaborate analogies which the computer itself has no knowledge of. So the distinction between data and code is purely arbitrary. This tends to be more obvious when you play around with assembly, where the machine will happily let you attempt to execute data. For example you can set up a jump into a block of what is meant to be data and the machine will not object in the slightest. The results will of course be unlikely to have any meaning in terms of the analogies we have set up for ourselves, but the machine neither knows nor cares since it has no means of doing so.
So Notepad will in fact execute certain real code in response to both the contents of the data file and the keyboard actions of the user. That is fine and good and need not be of any concern to the user, unless what it does is not what we expected in terms of the intended behaviour. An example of this sort of thing would be a buffer overflow allowing an external person to push what should (in terms of our analogies) be data into a place where it will get executed as if it was code.
This is the case for von Neumann machines [wikipedia.org] because they have a single memory area for programs and data. An attacker only has to move the current program control flow to some compromised place in the data (say some lines of machine code hidden in a corrupt bitmap) and the processor will happily execute those instructions. In other architectures, namely the Harvard architecture [wikipedia.org], there are physically separate memory locations for programs and data, and the processor WILL not carry out instructions “hidden” in data. A shift towards separate memory architectures is required to secure computers. Unfortunately a paradigm shift at this level is all but impossible in general purpose computing.
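The difference between the two memory layouts can be sketched with a toy machine in Python. Everything here is invented for illustration (the three opcodes, the `run` function, the memory contents); it models no real processor, only the idea of fetching instructions from shared versus separate memory.

```python
HALT, EMIT, JMP = 0, 1, 2  # hypothetical opcodes for this toy machine


def run(instruction_memory: list[int], max_steps: int = 100) -> list[int]:
    """Fetch and execute instructions from `instruction_memory` only."""
    output = []
    pc = 0
    for _ in range(max_steps):
        if not 0 <= pc < len(instruction_memory):
            raise IndexError(f"fetch outside instruction memory at {pc}")
        op = instruction_memory[pc]
        if op == HALT:
            break
        elif op == EMIT:
            output.append(instruction_memory[pc + 1])
            pc += 2
        elif op == JMP:
            pc = instruction_memory[pc + 1]
        else:
            raise ValueError(f"illegal opcode {op} at {pc}")
    return output


# A program with a bad jump target (address 5 is past its own end),
# and attacker-supplied "data" that happens to decode as instructions.
program = [EMIT, 1, JMP, 5, HALT]
data = [EMIT, 99, HALT]

# Von Neumann style: one address space, so the jump lands in the data.
print(run(program + data))

# Harvard style: data lives elsewhere and is never fetched as code.
try:
    run(program)
except IndexError as fault:
    print("fault:", fault)
```

In the shared layout the stray jump quietly executes the attacker's bytes; with separate instruction memory the same jump ends in a fault instead.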
No, but whatever program is running on the processor and interpreting the data will. SQL database, Python interpreter, Mozilla… all of these are based on treating text (data) as a list of instructions (program). It is obvious in the case of Python, since that is openly a programming language, but HTML itself can be considered a series of instructions for building the DOM tree, which then gets rendered, as dictated by default rules and those given by optional CSS; and of course there is always JavaScript.
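The “HTML as instructions for building a tree” reading is easy to try out with the standard library's `html.parser`. The sketch below is a simplification, not a real DOM: the `TreeBuilder` class and the dict-based nodes are assumptions for illustration, and it ignores attributes and void elements such as `<br>`.

```python
from html.parser import HTMLParser


class TreeBuilder(HTMLParser):
    """Treat each tag as an instruction that edits a tree of nodes."""

    def __init__(self) -> None:
        super().__init__()
        self.root = {"tag": "#document", "children": []}
        self.stack = [self.root]  # path from the root to the open node

    def handle_starttag(self, tag, attrs):
        # Instruction: create a child node and descend into it.
        node = {"tag": tag, "children": []}
        self.stack[-1]["children"].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Instruction: climb back up, if the tag matches the open node.
        if len(self.stack) > 1 and self.stack[-1]["tag"] == tag:
            self.stack.pop()

    def handle_data(self, data):
        # Instruction: attach non-blank text to the open node.
        if data.strip():
            self.stack[-1]["children"].append(data.strip())


def build_tree(html: str) -> dict:
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


print(build_tree("<p>Hi <b>there</b></p>"))
```

A browser does the same kind of thing on a vastly larger scale, with error recovery, CSS, and scripting layered on top, which is why the size of the “instruction set” matters so much for security.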
It is impossible for a general-purpose computer to be immune to this class of attacks. Not just “all but impossible”, but flat out impossible due to a logical flaw: the very ability to simulate different machines which treat data as a list of instructions - program - is what makes it a “general purpose” computer. If you can program it, you can program it to misbehave when it reads a suitably malformed PDF/PNG/HTML/SQL/whatever file. The only way around that would be for the computer to be intelligent and capable of common sense, so it could understand that the programmer probably didn’t mean for it to execute any random piece of SQL someone feeds into a Web forum login box; but then it would be vulnerable to social engineering.