pdfsearch - search tool for PDF, PS files

                PDF indexer README                      7 sep 2002

                        jose nazario 

i had this problem in grad school. i love reading papers, they're such a
great way to learn stuff. however, i wind up with piles and stacks of
papers. so i try to keep PDFs on my laptop, but i find that they're hard
to sift through to find the ones i need to read. so, after some discussion
with bob, another of scooter's groomsmen, i hacked up a bit of shell
scripting magic to make an index of the PDF and PS files in my home
directory and allow me to search them.

it comes in two parts: the first is mk_pdf_index, a small shell script that
converts PDF and PS files into text; the second is search, which does the
actual searching. some notes: you'll need the xpdf package, which contains
pdftotext, and ghostscript 5.5 or later, which contains ps2pdf. if you have
"antiword" or "pptHtml", you can also index word docs and powerpoint
presentations, respectively. the index maker looks for these tools in
/usr/local/bin and, if it finds them, indexes those file types too.
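
a minimal sketch of how that kind of detection might look in shell. the
names have_tool, index_types and TOOLDIR are made up for illustration, and
the real mk_pdf_index may well do this differently.

```shell
# hypothetical helper: succeed if an executable named $1 exists in
# $TOOLDIR (defaults to /usr/local/bin, the location named above)
have_tool() {
    [ -x "${TOOLDIR:-/usr/local/bin}/$1" ]
}

# hypothetical: print the file types worth indexing, given what is installed
index_types() {
    types="pdf ps"
    if have_tool antiword; then types="$types doc"; fi
    if have_tool pptHtml; then types="$types ppt"; fi
    echo "$types"
}
```

calling index_types on a machine without antiword or pptHtml should just
print the two base types.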

it works by converting the files it finds into ascii text and then splitting
the text into words. you then look for these words in the index file. it keeps
the file location and the first 20 lines at the top of the index file. it
doesn't work for all files, but for most.
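
the steps above can be sketched roughly like this. it is only an
illustration: the function name make_index_entry and the exact layout of
the entry are assumptions, and tr is used as a stand-in for wsplit.

```shell
# sketch only: the entry layout and the name make_index_entry are assumptions.
#   $1 = path of the original document
#   $2 = text already extracted from it (for example by pdftotext)
make_index_entry() {
    printf '%s\n' "$1"   # file location goes first
    head -n 20 "$2"      # then the first 20 lines of the text
    # then every word on its own line; tr stands in for wsplit here,
    # and sed drops any empty line left by leading punctuation
    tr -cs '[:alnum:]' '\n' < "$2" | sed '/^$/d'
}
```

for a PDF you would first run something like "pdftotext paper.pdf paper.txt"
and then "make_index_entry paper.pdf paper.txt > paper.idx" (the .idx name
is also just an example).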

this has only been tested on openbsd.

lastly, it needs some refinement, which maybe i'll do soon. the search
currently does a boolean OR, and a boolean AND might be more useful.
however, it works:

	$ search paxson      
	    matches      filename
		1        /home/jose/papers/SP-supplement.pdf
		4        /home/jose/papers/norm-usenix-sec-01.ps
		17       /home/jose/papers/stationarity-May00.ps
		4        /home/jose/papers/tbit.ps

so, i found some papers i didn't even realize i had. how cool is that? so,
no more printing out PDF papers for me, i can keep them organized. i run
the index generator every week or so, it takes about 30 minutes to fully
run (i have a very full home directory).
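
for what it's worth, the boolean AND mentioned earlier could look something
like the sketch below. it assumes one index file per document in a single
directory, with the document path on the first line and one word per line
after that; the real search script and index layout may differ, and the
name search_and is made up.

```shell
# hypothetical: print the document path for every index file in
# directory $1 that contains ALL of the remaining arguments as words
search_and() {
    dir=$1
    shift
    for idx in "$dir"/*; do
        [ -f "$idx" ] || continue
        hit=yes
        for w in "$@"; do
            # -i ignore case, -q quiet, -x match the whole line, -F literal
            grep -iqxF -e "$w" "$idx" || { hit=no; break; }
        done
        if [ "$hit" = yes ]; then
            head -n 1 "$idx"    # first line is assumed to be the path
        fi
    done
    return 0
}
```

so "search_and ~/.pdfindex paxson tcp" would list only papers that mention
both words (the directory name is just an example).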

it doesn't work on all papers: some have protection embedded, and some
were made by scanning images of pages, so there is no text to extract.
however, it works for most PDFs you'll run across.

INSTALLATION

you'll need to compile "wsplit", a small utility that splits text files into
their component words. in the directory 'wsplit', run make. you will need
"flex" to build it.
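
before building, it may help to check that the tools named in this README
are actually on your PATH. a small sketch (the function name missing_tools
is made up):

```shell
# hypothetical: print the name of each given tool that is not found in PATH
missing_tools() {
    for t in "$@"; do
        command -v "$t" >/dev/null 2>&1 || echo "$t"
    done
    return 0
}
```

for example, "missing_tools pdftotext ps2pdf flex make" prints nothing if
everything needed is installed.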

copy the three files, mk_pdf_index, search, and wsplit, into a
directory in your path. i use ~/bin, you can use that or /usr/local/bin,
for example.

now run the indexer: mk_pdf_index. this will take a while. now you can
search your PDF and PS files using "search".

CHANGELOG

18 jul 02
        initial version released

7 sep 02
        version 0.2 released
        now supports indexing of word docs
        now supports ppt presentations
        handles spaces in names
                - support done by Anton Chuvakin, PhD, with tweaks
        case insensitive filename extensions with [] matching
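
        as an illustration of the last two items, here is a sketch of
        [] matching on extensions and of reading paths so that spaces
        survive. the function names are made up and this is not
        necessarily how mk_pdf_index does it.

```shell
# hypothetical: list PDF and PS files under $1, whatever the case of the
# extension, one path per line
list_docs() {
    find "$1" -type f \( -name '*.[pP][dD][fF]' -o -name '*.[pP][sS]' \)
}

# hypothetical: read the paths line by line so names with spaces stay in
# one piece (names containing newlines would still break this)
for_each_doc() {
    list_docs "$1" | while IFS= read -r f; do
        printf 'would index: %s\n' "$f"
    done
}
```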

LICENSE: BSD type.

DOWNLOAD:

pdfsearch-0.1.tar.gz
pdfsearch-0.2.tar.gz

KNOWN ISSUES:

some PDF files cannot be processed by the pdftotext tool. for these, the
indexer produces a one-line index file containing only the filename, and
searching that index file will break.
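
one possible workaround, sketched under the assumption that each document
has its own index file: skip any index file that has only the filename
line in it. the name index_is_usable is made up.

```shell
# hypothetical: succeed only if index file $1 has more than one line,
# i.e. something beyond the filename made it into the index
index_is_usable() {
    awk 'END { exit !(NR > 1) }' "$1"
}
```

a search loop could then do "index_is_usable "$idx" || continue" before
looking inside each index file.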

PDF, Portable Document Format, PS, and PostScript are all registered
trademarks of Adobe Systems, Inc. Word and PowerPoint are trademarks of
Microsoft.