Grep cache

I mentioned in my last post that I’m working for a games company now. I’m also working with the largest code base that I’ve ever seen, or really even heard of. To give you an idea, the mainline code base is over 85MB of text in 14,000 files. If you include the 3rd party packages that we link against, we have 600+MB of code text in something like 85k files. I think that is large by any reasonable standard. Oh, I just did a line count: 18 million lines of text, though I am counting multiple versions of some packages.

Most developers at work use Visual Studio and Visual Assist to edit and navigate their code. I can’t really stand Visual Studio as an editor, so I use Emacs with Vimpulse mode (basically Vim key bindings). For a while I was using Visual Assist to navigate the code. On a code base this large, I think the single most useful tool VA provides is Alt+g, which jumps to the definition of the symbol you are on. The context info is quite smart: for instance, if the cursor is on something like someVar->SomeFunction(), then VA will figure out the type of someVar and put the cursor on typeof_someVar::SomeFunction. Not too shabby, and generating the VA database is not too painful really. But there are two problems here: how do I find every place that SomeFunction is used? And besides, I use Emacs.

So I do what any Unix guy in a Windows world would do: install Cygwin and the Unix tools for Win32, and try the standard code indexing tools. Tools like ctags, cscope, GNU Global, and Source Navigator. All of them choke; there are just too many files. I did manage to build a subset database for ctags, which resulted in a 70MB tags file, but Emacs just took too long looking up tags. Using find and grep also takes far too long, on the order of 5+ minutes. None of this works in a satisfactory manner, and I end up using Visual Studio to navigate the code and Emacs to edit. After much grumbling at my situation, I looked at the raw code size numbers and thought to myself, “How long does it actually take to grep a 500MB file?” I did some tests, and it turns out not that long. On my machine, grepping a single 500MB file takes less than 10 seconds. Grepping an 80MB file is so fast as to be instant.

Enter Grep Cache

The basic idea of grep cache is that you concatenate all of your code into one huge file and grep that file instead. Grepping a single file, you get back the line number within the big file and the matching line of text. What you don’t get back directly is the original file that contained the hit. My solution is to create a dumb index file, which simply contains lines in the format

<number-of-lines-in-file> <filename>\n

Then a Perl script does a linear search of the index file: it keeps a running sum of the line counts, and the last file whose starting line is less than or equal to the grep hit’s line number is the original file. A little bit of text massaging gets the Perl script’s output into the grep format that Emacs expects, and we’re done.
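
For example, suppose the index holds two hypothetical files:

100 src/a.cpp
50 src/b.cpp

A grep hit on line 120 of the big file falls past a.cpp’s 100 lines, so it maps to line 20 of b.cpp.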

How fast is this then? I can grep the smaller (85MB) code base that I usually work with pretty much instantly. The bigger database takes 10-20 seconds, and most of that time is spent in the Perl script doing the dumb-as-a-rock linear search. If I wanted to grep the bigger file faster, I’d write a proper index file that I could binary search. But for right now, I’m happy with my new tool. Super fast grepping actually lets me do pretty much everything that a proper tags-based system does.
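
For the curious, here’s a minimal sketch of that binary search, assuming the same [starting-line, filepath] pairs that the script below builds (findItemFast and the sample paths are my own, not part of the tool):

sub findItemFast {
    # $files is an arrayref of [starting-line, filepath] pairs,
    # sorted ascending by starting line (as @files is in the script below).
    my ($num, $files) = @_;
    my ($lo, $hi) = (0, $#$files);
    my $best = 0;
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        if ($files->[$mid][0] <= $num) {
            $best = $mid;       # candidate; a later entry may still qualify
            $lo = $mid + 1;
        } else {
            $hi = $mid - 1;
        }
    }
    # Same return shape as findItem: (line within original file, filename).
    ($num - $files->[$best][0], $files->[$best][1]);
}

# e.g. findItemFast(120, [[0, "a.cpp"], [100, "b.cpp"]]) gives (20, "b.cpp")

That turns the lookup into a handful of comparisons instead of a scan over every file in the index.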

Code

Here’s the code; it runs on Windows with the Unix tools installed. Thanks to Dean for the Perl code.

Bat file to replace grep (fast-grep.bat):
grep -n %* database.txt | perl lookup-file-name.pl
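
Usage then looks just like grep; for example (the path and hit here are hypothetical):

fast-grep SomeFunction
src/render/thing.cpp:123:    someVar->SomeFunction();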

Generate database (gendatabase.bat):

set ROOT=C:\dev\project

del files.lst
del index.txt
del database.txt
rem List every source file; sed turns backslashes into forward slashes.
rem (ufind is presumably the Unix find, renamed to avoid the Windows find command.)
ufind %ROOT% -iname *.cpp -or -iname *.h -or -iname *.c | sed s_\\_/_g > files.lst
rem index.txt: one "<line-count> <filename>" entry per file (wc -l output).
for /f "delims=" %%i in (files.lst) do @wc -l %%i >> index.txt
rem database.txt: all of the source concatenated into one big file, in the same order.
for /f "delims=" %%i in (files.lst) do @cat %%i >> database.txt

Perl code (lookup-file-name.pl):

#! perl.exe -w

use strict;

# Index file produced by gendatabase.bat: one "<line-count> <filename>" line per file.
my $indexName = "index.txt";
open my $index, '<', $indexName or die "Can't open index file $indexName\n";

my $total = 0;   # running count of lines seen so far
my @files;       # [starting-line-in-database, filepath] pairs, in file order

# Record a file: it starts at the current running total.
sub addItem {
    my ($numLines, $filepath) = @_;

    push @files, [$total, $filepath];
    $total += $numLines;
}

# Linear search for the file containing line $num of database.txt:
# the last entry whose starting line is <= $num is the one.
sub findItem {
    my $num = shift;
    my $filename;
    my $offset;
    foreach my $entry (@files) {
        if ($entry->[0] <= $num) {
            $filename = $entry->[1];
            $offset = $num - $entry->[0];
        }
    }

    ($offset, $filename);
}

# Build the in-memory index from wc -l output ("  <count> <filename>").
LINE: while (<$index>) {
    next LINE unless ( /^\s*(\d+) (.*)$/ );

    my $numLines = $1;
    my $filepath = $2;

    #print "$numLines $filepath\n";
    addItem($numLines, $filepath);
}

#map { print "@{$_}\n"; } @files;
#print findItem(1000);

# Translate grep -n output on stdin ("<line>:<text>")
# into the "<file>:<line>:<text>" format that Emacs expects.
FILE: while (<STDIN>) {
    next FILE unless ( /^(\d+):(.*)/ );

    my $num = $1;
    my $line = $2;

    my ($offset, $filepath) = findItem($num);

    print "${filepath}:${offset}:$line\n";
}

15 Responses to “Grep cache”

  1. Krishna Says:

    Have you tried Xrefactory? OpenGrok is another indexer that comes to mind. It works quite well with my source tree of 3000+ files.

    Cheers!

  2. Evil Otto Says:

    Sounds like you just reinvented glimpse.

  3. Eric Says:

    How do you keep the cache up to date? Does it take much time to create?

  4. bradbeveridge Says:

    Krishna – I’ve not used (or heard of) Xrefactory. It does sound very cool, but the price tag basically puts it out of the running.
    Otto – I haven’t used Glimpse before. The price is much better, but still costs. Also, it doesn’t look like it will do regular expressions.
    Eric – I update the cache nightly. Yes, when I change code it is out of date – but usually I read code rather than write much, or I’m working only in a file or two. I think it takes about an hour or so to create.

  5. sean Says:

    Another approach might be to prepend the filename and line number when creating the index, effectively faking the “grep -n” output that Emacs uses:

find src … | xargs awk '{ print FILENAME ":" FNR ":" $0 }' > index

  6. bradbeveridge Says:

    The Perl code used to be broken. I had a <= sign – WordPress ate some of the code after that. There may still be breakages.

  7. mike Says:

You have seen the “Find References” entry on the “VAssistX” menu, haven’t you?

  8. phil Says:

Have you tried tagbrowser? It’s a GUI tool (Win, Mac, Linux) for searching within Exuberant Ctags files. It can search tag files of tens of MB in less than a second.

    http://www3.telus.net/ngan/tagbrowser/tagbrowser.html

    If you like vim and use visual studio you should also check out viemu.com. The nice thing is that you still get the facilities of visual studio as well as vim editing functions.

    — Phillip
    (author of tagbrowser)

  9. krishna Says:

    Tried OpenGrok? OpenSolaris uses it for cross referencing. It works well with large code bases.

  10. Larry Clapp Says:

    Brad,

    Thanks for posting this. I finally got around to reading it.

    It sounds like an interesting tool. I’d probably go with sean’s suggestion and put the filename and line number right in the “database”.

    That said — have you read the ctags FAQ? (http://ctags.sourceforge.net/faq.html#15) It has several tips for using ctags in a directory hierarchy. Combine that with Vim’s ability to use multiple tag files, or even search an entire directory hierarchy for tags files, and I think ctags would work for you again.

    If that doesn’t work, I bet the ctags maintainer would like to hear from you.

  11. bradbeveridge Says:

I haven’t tried the ctags multiple file approach. I tried a single ctags file on our 14MB code base. I think I had two problems:
1) Searching was very slow – this is probably a function of Emacs’ tags lookup more than anything else
2) Our C++ code has many classes that have the same member function names – names like Render. I seem to remember ctags not dealing very well with that 🙂
    I’ll probably try ctags again when my frustration levels get high enough.

  12. Larry Clapp Says:

    1) Ah, right, I forgot you’d moved to Emacs. Does Emacs not do a binary search on the tags file (like, um, Vim does)? Or does it do something really silly like load it all into a buffer and then search it?

2) Lots of “Render”? That’s stupid. You should get them to fix their code. (Yes, I’m kidding. 😉)

    Anyway, best of luck.

  13. Ignas Says:

    1) Ah, right, I forgot you’d moved to Emacs. Does Emacs not do a binary search on the tags file (like, um, Vim does)? Or does it do something really silly like load it all into a buffer and then search it?

IIRC, no, it does not do a binary search, and yes, it surely loads it all and searches it.

    I am using GNU id-utils for a fast search. It is a nice indexed text search utility, not sure if there is a windows version though.

  14. Chris J Says:

I’ve been trying a few things out and resorted to trying this method (rejected before due to the size of the codebase, but let’s see how it goes).

Just as a heads up, you might be able to do away with some of the Cygwin-y stuff and speed things up with native Windows tools (removing the overhead of Cygwin, which IME can be slow when dealing with lots of stuff). Instead of 'grep' look at 'findstr'; similarly for ufind, try 'dir /s /b'. I’m still using Perl though (Strawberry Perl) 🙂

    Thus I’m creating my index with:
findstr /n /s /i /e . *.asp *.cs *.cls *.xml *.xsl *. …. | perl -pe "s/[ \t]+/ /g" > ..\index.idx

… I’ll then use native NTFS compression on the file and findstr to search it 🙂

  15. Michael Says:

    Funny, cscope (under Linux) doesn’t have much trouble with our 270MB code base.
