Quetzalcoatal: 2011

Wednesday, December 14, 2011

2011 LLVM Developers' Meeting

This feels a bit late to be talking about the 2011 LLVM Developers' Meeting (seeing as how it happened almost exactly a month ago), but since the slides have been put up over the past week and the talks were only put up on Youtube this week, I suppose I can finally back notes up with links and not talk so abstractly.

Of the talks I went to, I think by far the most interesting one was Doug Gregor's talk on Extending Clang. It covered various extension points in Clang and some of their capabilities with simple examples. It also ended in what might be impolitely titled "Where Extending Clang Sucks" and what was politely titled "Help Wanted." This boils down to "plugins are hard to use" and "one-off rewriters are hard to write". Indeed, I think the refrain of problematic architecture for further tools was repeated in more than a few presentations: Clang has the information you want and need, but squeezing the information out is inordinately difficult.

What was probably the most popular talk was Chandler Carruth's talk on Clang MapReduce—Automatic C++ Refactoring at Google Scale (his slides appear to not yet be posted). From what I recall, the more interesting parts of the talk are closer to the end. He had a discussion on developing a language for semantic queries to identify code that needs to be replaced (the example query was essentially "find all calls to Foo::get()"); the previous things people have tried are regexes (C++ is not syntactically regular, it's recursively enumerable), XPath (unfortunately, ASTs, despite their name, aren't exactly tree structures), and pattern matching ASTs (not always sufficient textual clues). The team's idea was to use a matcher library on "AST" values as predicates. He also spoke a bit about some of the efforts they did on using Clang with Chromium.

There is a point brought up in Chandler's talk that I want to reiterate. One of the problems with writing refactoring tools (or static analysis tools in general) is in getting people to trust that they work. For any sufficiently large codebase—millions of lines of code or more—it is fairly certain that the project will run up against the nasty edge cases in parsing tools. If a tool is intended to run mostly automated analysis on such code, it is impossible for anyone to be able to look at the output and ensure complete correctness. However, most sufficiently large codebases come with massive testsuites to help people believe their code is correct; if the same parser is capable of producing code that passes the entire testsuite, then one only needs to trust that the analysis is done correctly, a much smaller task. Without such a capability, no one would be willing to trust the analysis; in other words, any parser which is not capable of then generating code is never going to be trusted as a basis for further work, at least, not if you are looking at million-line projects.

Speaking of Chromium and Clang, there was talk on this too. As I am sure most of my audience is well aware, Chromium uses Clang for several of its buildbots (and all of its recent Mac development); I wish Mozilla could get Clang to be a tier 2 or tier 1 platform for Mac and Linux. As for why they prefer it, the brief rundown is this: better diagnostics, faster, smaller object sizes, they can write a style checker, and they can build better tools off of it (like AddressSanitizer, which, unsurprisingly, itself had a talk). Again, there were the complaints: building the rewriter proved to be difficult a task (notice a trend here?). Incidentally, I also attended the AddressSanitizer talk, but I'm getting tired of copying down notes of talks, so I'll let those slides and the video speak for themselves.

Finally, I did give a talk on DXR. It seemed to be well-received; I had a few people coming up to me later in the day thanking me for the talk and asking more questions about DXR. Something I did discover when giving it, though, is just how difficult giving a talk really is. I had specific notes on the slides written in my notebook, only to discover that I wasn't able to gracefully retreat to the podium and read the notes long enough to figure out what to say next. If you're wondering what I forgot to mention in the talk, it's mostly a more coherent explanation during the undead demo.

Saturday, October 15, 2011

How I got involved with Mozilla

My journey to become a contributor to Mozilla started during the summer of 2007, which was when I started very heavily using Thunderbird. The initial impetus is probably best traced to this thread in a newsgroup: a long, semi-flame war. I wanted to excise the flame war portions from the rest of the thread, so I filed bug 392404 (the first bug I ever filed!). That bug was marked as a duplicate of bug 11054.

At that time, I had a fair amount of free time (this was around the time my summer internship ended but before school started). I downloaded the source code to Thunderbird and then built it, only to discover that every time I built it, it crashed in a linker. After bugging people about it on IRC, I learned that I needed a swap file, and I finally managed to get a successful build. Then, having had programming experience, I decided just to fix the bug myself. I honestly can't remember for certain, but I believe I built the original patch without much aide from people on IRC--it was only the inability to build that I had needed to ask for help.

I do remember that I needed guidance on how to get a patch committed. This being mailnews code around the time that Mozilla Messaging was being set up, the pool of potential reviewers and superreviewers was rather slim. I had been advised that David Bienvenu and Scott McGregor were my best candidates for review. Unfortunately, around this time, Scott had cut off all work with Mozilla; after two months of not getting any response from him whatsoever, I switched to another reviewer.

While I was waiting for the patch to be reviewed, I recall Dan Mosedale talking about bug 132340 indirectly; later, when I asked if there was a fix that people would be interested in, that bug was what I was pointed towards. This patch took about two months and four review cycles to be accepted, but it eventually satisfied my reviewer and superreviewer and became my first contribution to Mozilla. That first patch I worked on? It turned out to be significantly more complicated than I first appreciated and finally was closed 9 months after I started work on it. After bug 132340, I started getting more and more involved with Mozilla development and have been a contributor since then.

Monday, September 19, 2011

Public Service Announcement #2

A brief note to all application and operating system developers:

Failure to respond to a prompt does not indicate accession to perform the actions that would otherwise occur. Especially if those actions have a high risk of losing my data. So I would like to thank you (whichever product is responsible for my most recent undesired restart) for completely obliterating my class notes. At the very least, you did not destroy all of my research too, just a day's worth.

Wednesday, August 3, 2011

Not-so-random mozilla-central factoids

Back when I was looking at reducing disk usage of running DXR, I used the sizes of the generated CSV files to make a rough estimate of how many times the average line of mozilla-central needs to be parsed and compiled (the answer is around 20). Now, having just respun a build of mozilla-central with the newest version of DXR, I have a larger database with some fairly accurate statistics of its size. This all started when I wondered what our most common function name was, so I decided to make a list of factoids (it's also a chance to refresh my SQL memory). All of the statistics come from a build of Firefox today on Linux x86-64 with debug and tests disabled.

Notes on accuracy

DXR tries as hard as possible to correlate data back to the original source code, and it is very likely that it can get itself confused in some weird circumstances. I don't yet have a good idea of where all of the buggy cases are, but I am aware of some broad strokes. Pretty much all information that deals with references is likely missing large swathes of the codebase. Information about callers only counts explicit calls (i.e, call expressions). The most accurate data I have is the macro information and the least accurate is information involving a templated class in almost any fashion. I also suspect that scope information is malformed in a non-empty set of cases, and I'm pretty sure that the number of global objects in any count is overcounted.

Type statistics

I count in mozilla-central just 33,555 distinct types. Of these, we have 13,648 typedefs, 6,523 structs, 6,470 classes, 5,147 enums, 1,410 interfaces, and 357 unions. Of all of these, 13,740 types are nested in another in some fashion, and another 4,524 types are templated. I'm not counting separate template instantiations as new types, but I am counting specializations individually.

Then there's the inheritance. I found 7,715 direct inheritance relations, all but a handful of which (around 200) are public. These relations account for 2,317 distinct base classes and 6,157 distinct subclasses. Naturally, some types are inherited much more than others. The winner, by a factor of 6, is nsISupports, having a whopping 3,156 implementations. Pickle and IPC::Message tie for second place with 501 implementations each; nsIRunnable takes 4th place at 282 implementations. The 5th place goes to nsXPCOMCycleParticipant (242). Rounding out the top 10 are nsIDOMElement (226), nsRunnable (224), nsScriptObjectTracer (209), nsSupportsWeakReference (154), and nsIObserver with 144. Subclass relationships involving templates are not counted, which may bump a few classes up into this list.

Macros

There are 42,457 distinct definitions of macros to produce 38,045 distinct names. Of these, 30,475 look like variables and 7,647 look like functions, which implies that 77 macros take on both depending on how they got defined.

How about calling them? I count just 482,142 macro invocations, so each macro is being invoked about 12 times on average. But... 20,628 of our macros are never used (or almost 48.6% of them), so the average is closer to 25 times. Of course, some macros really get used. Here are the top 5:

Count	Macro name
23,109	nsnull
22,941	PR_FALSE
20,139	NS_OK
13,154	NS_IMETHOD
13,138	NS_IMETHODIMP

Functions

There are 137,903 functions in mozilla-central. Of these, there are 68,867 having a distinct name. Of these, I found 33,693 in the global scope, and 8,943 that were a member of a class. Directly templated functions comprise about 2,291 functions. Just for fun, I found that there are 774 functions named exactly "Init" and a further 1,681 that begin with "Init" (case-insensitively in the last case).

In terms of calling these functions, I found 291,598 distinct edges in the callgraph (this is definitely an underestimate, since I am missing a large number of cases). For my usage, a callgraph is not a traditional directed graph but rather a hypergraph, where each edge goes from a single head to a set of nodes in the tail. These comprise 85,144 distinct callers and 67,098 distinct targets. Of the targets, I found 51,590 distinct functions being called statically, 13,274 distinct virtual functions invoked, and 2,234 distinct function pointers or pointers-to-member-functions being called. If I break it up by calls, 246,086 of the calls are static function calls, 41,770 virtual function calls, and 3,742 function pointer calls. I want to emphasize here that information pertaining to templates, in particular nsCOMPtr is completely missing, so a lot of calls to nsISupport's methods are missing, which is going to horribly skew the statistics.

Counting the function pointers, we have 65 pointers-to-member-functions and around 1700-1800 function pointers (those numbers do not add up to what I should get above, but I'm not sure who's in error here). I count about 19,721 virtual functions that I generated target information for (a subset of all virtual functions), and 65,761 implementations of those virtual functions, so the average virtual function has about 3.4 implementations. In addition, I found about 981 of these virtual functions were also called statically.

Now I'm sure, having mentioned it earlier, that you too are now wondering what the most common function name in mozilla-central is. The answer should be pretty obvious when you consider the most heavily-implemented class. And the winners are...

Count	Function name
1,687	Release
1,680	AddRef
1,600	GetIID
1,587	QueryInterface
774	Init
659	operator=
549	Log
517	Read
506	Write
396	GetType

The top 4 methods are related to XPCOM; the famed Init method is a mere 5th place. Of interesting note is that there are 659 assignment operators; I'm guessing some of these may be default copy constructors implemented for non-POD classes.

Variables

There's not much to say here. We have some 623,237 variables of some kind. Most of these, naturally, are local variables or parameters: 516,627, to be precise. We additionally have 82,008 members of some compound type, and 24,602 global variables. Some interesting statistics would be to compute the number of variables whose names defy our naming convention or the number of static constructors that need to be run before startup, but that data is harder to compute.

Warnings

I count 5,283 reported warnings for mozilla-central. Of these, 1,608 are warning about the use of non-virtual destructors with virtual functions. Another 940 warn about our use of mismatched enumeration types. There are 1,348 warnings of unused things. Finally, there are 1,551 warnings about use of extensions, leaving 466 "miscellaneous warnings".

Build statistics

The final set of statistics I have is mere size statistics for the build information. There are about 8,581,746 non-empty lines of text comprising some 320MiB of data. These are organized into about 51,469 files, including 1,469 files of generated files in the build directory. My output SQLite file was 464MiB, easily comprising around 3.5 million rows of data. We also have about 38MiB of binary files (like PNGs) in the source tree. Almost 15MiB of the generated included files were produced, comprising around 354,893 non-empty lines of text.

I think the best way to summarize this data is "mozilla-central is a massive codebase." It also goes to show you why just looking at source code without compiling is a bad idea: around 3-5% of our source code actually isn't in the source tree to begin with but is instead automatically created at compile time.

Tuesday, August 2, 2011

Fixing sqlite compatibility issues

While testing the separation of database generation and the actual webpage, I had discovered that there was a weird bug in that SQLite cheerily opened the database but then complained that the database was invalid. I thought this was quite weird, and double-checked that the file had copied correctly: same size, correct permissions, file gives the same information (SQLite database, version 3). After a bit of thought, I found the problem:

[mozbuild@dm-dxr01 ~]$ sqlite3 -version
3.3.6
...
jcranmer@xochiquetazal ~ $ sqlite3 -version
3.7.7 2011-06-23 19:49:22 4374b7e83ea0a3fbc3691f9c0c936272862f32f2

The databases were incompatible. So I did some investigation and I found an easy way to fix this:

jcranmer@xochiquetzal ~ $ sqlite3 -line '.dump' spidermonkey.sqlite > /tmp/statements.sql
[mozbuild@dm-dxr01 ~]$ sqlite3 -init /tmp/statements.sql spidermonkey.sqlite

That worked wonderfully. So if you ever need to fix a problem with SQLite-incompatible versions, that is how you dump a database to sqlite and import it again. While I'm on the topic, this is worth paying attention to:

-rw-r--r-- 1 jcranmer   jcranmer   27394048 Aug  2 17:13 spidermonkey.sqlite
-rw-r--r-- 1 jcranmer   jcranmer   27419572 Aug  2 17:17 statements.sql

The list of SQL statements is only 0.1% larger than the SQLite file. If I look at the older database, it's actually smaller than the database. Food for thought.

Tuesday, July 26, 2011

Callgraph

There remains, as of this blog post, one major regression from the original dehydra-based DXR, and that is the lack of support for callgraph. The main problem I have here is trying to pin down the scope of the feature. To give an idea of why this hard, I pose the following challenge: how many different ways are there in C++ of invoking a function, as would be viewed from the generated assembly code? Here is my list:

Global function
Invocation of non-virtual member
Nonvirtual invocation of virtual member
Constructor, copy constructor invocations
Implicit destructor invocation
Functors (objects that implement operator())
C++0x lambdas (which appears to be a functor): Count all of these as "one," if you like, since they're all still more or less boil down to a call addr instruction, so it really boils down to how pedantic you get about the differences. Most of these look fairly distinct in source code level as well too. But, on the plus side, being a statically-known function call, all of these are easy to handle.
Virtual member function invoked virtually: This is clearly different at an assembly level, since it requires a vtable lookup with the added possibility of thunking. From a handling perspective, though, this is more or less statically known, since we know the possible targets of a virtual method call are limited to subclasses of the method, if we assume that people aren't sneaky.
Function pointers: Easy to dip into, impossible to think about off the top of your head (how many tries does it take you to write the type of a function pointer that returns a function pointer without using typedefs?), and it requires data-flow analysis to solve. The main question is if it's even worth trying to at least plan to support function pointers, or if they should just be left out entirely.
Pointer-to-member function pointers: Theoretically easier than function pointers (because we again have a more limited state space), but they still suffer from the same need to data flow analysis. Not to mention that they are much harder to use, and that their implementations are much different from function pointers.
Template dependent name resolution: Templates, as far as C++ is concerned, is essentially a macro system that is more powerful and more inane than the C preprocessor. So if the expression that is to be called is dependent on the type parameter, then the function call can be any (or all) of the above depending on what was passed in.
throw: The entire exception handling semantics in C++ essentially gets handled by the ABI as a combination of magic function calls and magic data sections; undoubtedly a throw itself is a function call which gets to do stack unwinding. Most of this stuff isn't exposed in the code, it just leaks out in semantics.

One goal I've had in mind is to make sure I can do some reasonable support for dynamic languages as well as static languages. To that end, I've looked up other callgraph implementations to see what they do in the realm of indirect function calls. The answer is a mixture of either "do runtime instrumentation to build the callgraph" or "ignore them altogether." The closest I've seen is one implementation that indicated when the address of a function was taken.

Friday, July 22, 2011

Viewsource up and running

This morning, I have made available a public version of the viewsource code. Viewsource was originally a web tool to output a pretty-printed version of the Dehydra objects for simple snippets of source code. Since I found it a useful idea for debugging, I also added dumps for the now-defunct Pork rewriting toolset, JSHydra, and now, finally Clang.

The odd tool out here is clang, as it has neither a debug-useful XML output nor a simple JS representation that can report the output, which requires me to make the tool a two-stage process: the first stage is a compiler plugin that figures out from clang::RecursiveASTVisitor which functions and classes it needs to worry about dumping out, and then writes out the code for the second stage, another compiler plugin that actually dumps the generated AST information to the console. In other words, I have a plugin to write a plugin that is used to dump information to write plugins.

Eventually, I hope to be more complete in the information I can dump out (particularly type locations and type information), but this is complete enough to be at least somewhat useful for finding interesting things, such as understanding what the location information actually refers to. As I work on this tool more and more, this should enable me to find and fix many bugs in DXR.

Thursday, July 21, 2011

DXR alpha 3 release

I am pleased to announce the third release of DXR. Compared to the previous release, the UI has been tweaked and several bugs have been noticed and fixed. In particular, I have improved the support of links, and fixed the bug that prevented the type hierarchy from being properly realized. I have also introduced support for indexing the IDL code in Mozilla's codebase, as well as indexing generated code in general. I have also added yet another tree, comm-central.

Compared to the original implementation of DXR by Dave Humphrey, there remains one unimplemented feature, namely support for the callgraph. My plan for the next release will focus on implementing this feature as well as any other bugs I uncover via real-world testing. The other major features for the next release will be a JSON query API for data and improved support for multiple trees.

In a more general viewpoint, the plugin architecture has been much improved. There is now support for viewing code-coverage data in the source tree (I do not have a visible demo of this yet), and the support for IDL is entirely a separate plugin. Implementing these plugins has caused me to realize that there is need for better hooks in the current architecture (especially with regards to the generated HTML), and these will also be forthcoming in the next release.

The final point to make is that I have moved the viewsource code into its own repository, as well as adding support for a limited subset of the clang AST. The website does not properly support clang at this time owing to problems in the server's current toolchain configuration; when these issues are resolved, I will be demonstating the tool better.

Wednesday, July 13, 2011

Random discovery

I've been playing around with the idea of replacing mork for leveldb, the two being at roughly the same "no feature database" level. The sanest way to do a decent head-to-head comparison involves picking the least powerful wrapping around this API (the message folder cache) and then reimplementing it twice. Actually, I originally intended to try implementing mdb with a leveldb backend for a drop-in replacement to mork. Five minutes successfully dissuaded me, as the mdb interfaces require you to implement about five interfaces to get to "open a database".

For realistic testing, I grabbed a real database and instrumented all of the accesses in real use cases (i.e., startup and shutdown). Since the real data is what I love to call my Profile from Hell™, there's enough data that I can get some reasonable statistical accuracy in timing. I'm also testing what is a path of note in Thunderbird startup, so I made sure to test cold startup, which, incidentally, is very annoying. A 1 second test takes several seconds to reload xpcshell.

The first performance results were very surprising to me. I expected that not touching the knobs would probably result in a mild performance regression (mork is fast by virtue of choosing the "space" option in the space/time tradeoff). What I wasn't expecting was something that was an order of magnitude slower. I eventually got off my butt and hooked up jprof to see what was going wrong. Unfortunately, I forgot that jprof's visualization of output results sucks, so I traipsed back to my very old post to pull down my script to turn the dumb HTML file into a more manageable dot file (which needed fixing, since the only commit to jprof in 3 years changed the output syntax on me).

Clearly, the output file implicates JS as being the hotpath. Since this is xpcshell and I had to hack the jprof initialization in, I decided to move the jprof initialization to the execution point to eliminate the effect of parsing my ~20K-line input file (I did say it loads a lot of data). When I last used jprof for serious performance tracking, I noted that the main regression appeared to be caused by XPCThrower::ThrowBadResult (about 60%, apparently). So here, the main regression appears to be... you guessed it, JS exceptions (the method names have changed in 3 years, though).

That is truly unexpected, since I'm not supposed to be actually changing any API. After looking back at the code and inspecting much more deeply, I found out that I actually was changing API. You see, the nsIMsgFolderCacheElement has two ways to access properties: as strings or as integers. Accessing integers clearly throws an error if the value isn't there; accessing strings also appears to throw an error. Assuming this to be the case, I explicitly threw an error in both cases in my new implementation. It turns out that the current implementation actually doesn't throw for strings (I'm not sure if it's intended to or not). Fixing that greatly improved the times. A brief summary of the real results is that mork is faster on startup, even on general property access, and slower on shutdown, only the last of which is not surprising.

In summary: throwing JS exceptions, at least via XPCOM, is very, very, very slow. If that code is a potential hotpath, just say no to exceptions.

Tuesday, July 12, 2011

Location information

One of the hardest parts about writing a compiler is producing correct location information. Well, it is very easy to get the precise location of a token (most of the time, in any case), but the hard part is assigning token locations to actual AST productions, since they are composed of several tokens. Take a simple function definition: void foo(int bar) {}. Nearly every token in that definition can be argued to be the right location of the function. Hence showing only line information is common for compilers: in most normal usage, you get the same information no matter which token you pick. However, there are times when the actual column number is crucial: if you need to correlate something to the original source code (e.g., for rewriting or for DXR), you want that column number to be spot on.

Now let's consider pathological cases. Suppose we define that function in a templated class. Do I want the location of the template or the location of the template instantiation? The former is often better, but there are times you want the latter. But you can also make the function in a macro: #define FUNCTION(name) void name(int bar) {} and FUNCTION(foo). Do I want the location within the macro definition or the location in the macro instantiation? Macros let you be really evil: #define FUNCTION(name) void start##name(int bar) {}. Now which location do I want? Oh yeah, and while we're on the topic, there is also the issue of autogenerated code (e.g., I want the location to be what it was in the original file). So which location do I use?

There are two orthogonal issues here: which token do I use in the "final" version of the text, and which version of the text do I grab it from. There are several versions of the text (think of the lexer not as a single unit but instead of a chain of steps that pass the information between them); we don't care about some of the intermediate representations (e.g., what the code looks like after replacing trigraphs). The lowest level of these is the position in the last form of text, just before being parsed for real. It's also important to track the position through repeated invocations of macros (most of the time). We also want the location in the text file as it is when fed to the compiler to begin with, as well as the position in the original file before it is fed to whatever output the C/C++ source file.

Clang gives us some of this information for any SourceLocation. The source of the actual token is the spelling location. On the other hand, the instantiation location is the location of the outermost macro invocation to generate the production. It is possible to get the tree of locations by repeatedly calling getImmediateSpellingLocation, which only looks down one level of invocation. To get the result after following #line directives, you need to use the presumed location, which looks at the instantiation location. It is possible to pass in the spelling location to getPresumedLoc to produce the correct results, though. Unfortunately, precise location information goes out the window if your macro uses # or ## (it goes into "scratch space" whose line numbers don't correlate well enough to the original source code for me to make sense of).

The other direction is one of the possible methods to get the source location in the class. A function declaration merely has 6 of these: the start and end of the declaration, the location of the name identifier, the location of the type specifier, the location after the template stuff, and the location of the right brace. I would go in more detail, but I haven't yet catalogued all of the ways to get this information to be sure that the method names actually mean what they claim to mean.

Wednesday, July 6, 2011

DXR alpha 2 release

I am pleased to announce a second alpha release of DXR. Several improvements have been made, including changes to the UI as well as better cooperation with macros. A complete list of changes can be found at the end. As this is the second release, I have expanded the live demo to include not one but two separate indexing trees: clang and mozilla-central, both of which are current as of last night.

The advantage of DXR is that it sees the source code by instrumenting the compiler at build time, so it is able to understand classes most of whose methods are defined by macros, like Clang's RecursiveASTVisitor. Like any source code browser, it is possible to click on an identifier and then go see information about the identifier, such as its points of declaration, or its definition. Unlike LXR and MXR, however, it knows what type of identifier it's working with, so you won't be offered the definition of BuiltinType::getKind if you look up the getKind method of an Expr object.

Another new feature is support for plugins, to allow people to use different compilers, add in support for more languages, or even augment the UI with new features like capturing line-coverage information. The current support is rudimentary, but there is already active work on a dehydra-based indexer.

A list of new features:

Improved search speed at runtime, and dropped glimpse as a prerequisite
The web UI works in newer versions of Firefox
Support for plugins has been added
Disk space requirements have been radically reduced
Setup configuration is easier
Link extents have been fixed in many cases
Declarations can now be seen along with definitions
Macros can now be browsed like source code

Thursday, June 30, 2011

Google Groups Public Service Announcement

For the past few days, it appears that Google Groups has stopped mirroring newsgroups properly; this included the various Mozilla mailing lists. It does, however, appear to be forwarding posted messages, so if you send a message via the UI and you don't see anything show up... don't panic, and don't repost it 5 times.

I don't know how long Google Groups will be down, but, in the meantime, you can use your favorite newsreader to read the newsgroups. Heck, Thunderbird has a built in news client. If you wish to use mozilla mailing lists, just add in the server news.mozilla.org. You can then happily see everything that you would be able to see with Google Groups, if it were working.

Wednesday, June 29, 2011

Mozilla-central and DXR

Some of you may be wondering why it is taking me so long to build a DXR index of mozilla-central. Part of the answer is that I'm waiting to bring DXR on clang back up to the original instantiation in terms of feature support. The other part of the answer is that mozilla-central is rather big to index. How big? Well, some statistics await…

I first noticed a problem when I was compiling and found myself at 4 GiB of space remaining and dwindling fast. I'm used to fairly tight space restrictions (a result of malportioning the partitions on my last laptop), so that's not normally a source of concern. Except I had been at 10 GiB of free space an hour before that, before I started the run to index mozilla-central.

The original indexing algorithm is simple: for each translation unit (i.e., C++ file), I produce a corresponding CSV file of the "interesting" things I find in that translation unit—which includes all of the header files included. As a result, I emit the data for every header file each time it is included by some compiled file. The total size of all data files produced by the mozilla-central run turned out to be around 12GiB. When I collected all of the data and removed duplicate lines, the total size turned out to be about 573MiB.

Step back and think about what this means for a moment. Since "interesting" things to DXR basically boil down to all warnings, declarations, definitions, and references (macros and templates underrepresented), this implies that every declaration, definition, and reference is parsed by the compiler, on average, about 20-25 times. Or, if you take this as a proxy for the "interesting" lines of code, the compiler must read every line of code in mozilla-central about 20-25 times.

The solution for indexing is obviously not to output that much data in the first place. Since everyone who includes, say, nscore.h does so in the same way, there's no reason to reoutput its data in all of the thousands of files that end up including it. However, a file like nsTString.h (for the uninitiated, this is part of the standard internal string code interfaces, implemented using C's versions of templates, more commonly known as "the preprocessor") can be included multiple times but produce different results: one for nsString and one for nsCString in the same place. Also, there is the question of making it work even when people compile with -jN, since I have 4 cores that are begging for the work.

It was Taras who thought up the solution. What we do is we separate out all of the CSV data by the file that it comes in. Then, we store each of the CSV data in a separate file whose name is a function of both the file it comes in and its contents (actually, it's <file>.<sha1(contents)>.csv, for the curious). This also solves the problem of multiple compilers trying to write the same file at the same time: if we open with O_CREAT | O_EXCL and the open fails because someone else created the file… we don't need to do anything because the person who opens the file will write the same data we wanted to write! Applying this technique brings the total generated CSV file data down to around 1GiB (declaration/definition mappings account for the need for duplicates), or down to about 2 times the real data size instead of 20 times. Hence why the commit message for fixing this is titled Generate much less data. MUCH LESS.

Of course, that doesn't solve all of my problems. I still have several hundred MiBs of real data that need to be processed and turned into the SQL database and HTML files. Consuming the data in python requires a steady state of about 3-3.5 GiB of resident memory, which is a problem for me since I stupidly gave my VM only 4GiB of memory and no swap. Switching the blob serialization from using cPickle (which tries to keep track of duplicate references) to marshal (which doesn't) allowed me to postpone crashing until I generate the SQL database, where SQL allocates just enough memory to push me over the edge, despite generating all of the SQL statements using iterators (knowing that duplicate processing is handled more implicitly). I also switched from using multiple processes to using multiple threads to avoid Linux failing to fork Python due to not enough memory (shouldn't it only need to copy-on-write?).

Obviously, I need to do more work on shrinking the memory usage. I stripped out the use of the SQL for HTML generation and achieved phenomenal speedups (I swore that I had to have broken something since it went from a minute to near-instantaneous). I probably need to move to a light-weight database solution, for example, LevelDB, but I don't quite have a simple (key, value) situation but a (file, plugin, table, key, value) one. Still not hard to do, but more than I want to test for a first pass.

In short, DXR indexing for mozilla-central is being recreated as I write this. Enjoy.

Wednesday, June 22, 2011

Why autoconf sucks

I'm going to go out on a limb and guess that the GNU autotools are one of the most heavily used build systems in existence. What I'm not going to guess is that, as a build system, it sucks. I will grant you that C and C++ code in particular are annoying to compile by virtue of the preprocessor mechanics, and I will also grant that autotools can be useful in trying to work out the hairiness of working on several close-but-not-quite-the-same platforms. But that doesn't justify why you need to make a build system that is so bad.

One of the jobs of autotools (the configure script in particular) is to figure out which compilers to use, how to invoke them, and what capabilities they support. For example, is char signed or unsigned? Or does the compiler support static_assert(expr) or need to resort to extern void func(int arg[expr ? 1 : -1])? Language standards progress, and, particularly in the case of C++, compilers can be slow to correct their implementations to the language.

The reason I bring this up is because I discovered today that my configure failed because my compiler "didn't produce an executable." Having had to deal with this error several times (my earlier work with dxr spent a lot of time figuring out how to manipulate $CXX and still get a workable compiler), I immediately opened up the log, expecting to have to track down configure guessing the wrong file is the executable (last time, it was a .sql file instead of .o). No, instead the problem was that the compiler had crashed (the reason essentially boiling down to std::string doesn't like being assigned NULL). That reason, this is the program that autoconf uses to check that the compiler works:

#line 1861 "configure"
#include "confdefs.h"

main(){return(0);}

(I literally copied it from mozilla's configure script, go look around line 1861 if you don't believe me). That is a problem because it's not legal C99 code. No, seriously, autoconf has decided to verify that my C compiler is working by relying on a feature so old that it's been removed (not deprecated) in a 12-year old specification. While I might understand that there are some incredibly broken compilers out there, I'm sure this line of code is far more likely to fail working compilers than the correct code would be, especially considering that it is probably harder to write a compiler to accept this code than not except it. About the only way I can imagine writing this "test" program to make it more failtastic is to use trigraphs (which is legal C code that gcc does not honor by default). Hey, you could be running on systems that don't have a `#' key, right?

Addendum: Okay, yes, I'm aware that the annoying code is a result of autoconf2.13 and that the newest autoconfs don't have this problem. In fact, after inspecting some source history (I probably have too much time on my hands), the offending program was changed by a merge of the experimental branch in late 1999. But the subtler point, which I want to make clearer, is that the problem with autoconf is that it spends time worrying about arcane configurations that the projects who use them probably don't even support. It also wraps the checks for these configurations in scripts which render the actual faults incomprehensible, including "helpfully" cleaning up after itself so you can't actually see the offending command lines and results. Like, for example, the fact that your compiler never produced a .o file to begin with.

Friday, June 17, 2011

Alpha release of dxr

I am pleased to announce an alpha release of DXR built on top of Clang. A live demo of DXR can be found at http://dxr.mozilla.org/clang/, which is an index of a relatively recent copy of the Clang source code. Since this is merely an alpha release, expect to find bugs and inconsistencies in the output. For more information, you can go to #static on irc.mozilla.org or contact the static analysis mailing list. A list of most of the bugs I am aware of is at the end of this blog post.

So what is DXR? DXR is a smart code browser that works by using instrumented compilers to use what the compiler knows about the code to provide a database of the code. For C and C++ in particular, using an instrumented compiler is necessary, since it is the only reliable way to fix the issue of macros. Take, for instance, RecursiveASTVistor in the Clang codebase. Most of the almost 1200 functions are defined via macros as opposed to in raw code; as a consequence, the doxygen output for this class is useless: as far as I can tell, there are only five methods I can override to visit AST nodes. On the other hand, DXR neatly tells me all of the methods that are defined, and can point me to the place where that function is defined (within the macro, of course).

Where can you get the code? DXR is available both as a github repository (use the dxr-clang branch) and as a Mercurial repository. Instructions on how to use can be found on the wiki page.

The following is a list of known problems:

Links occur at odd boundaries
Some lines have id="l234"/> prepended
Non-root installs (i.e., installing to http://dxr.mozilla.org/clang/) cause issues. Interestingly, refreshing the page often causes things to work.
There is a long list of scrolling text when compiling code. Ignore it.
HTML generation produces IndexErrors
.csv files are created in the source directory and HTML code is generated.
Inheritance searches don't match the full hierarchy, only one or two levels.

Friday, June 10, 2011

Documentation lies

For various reasons, I want to get the extent of a particular expression in clang. Most AST objects in clang have a method along the lines of getSourceRange(), which, on inspection, returns start and end locations for the object in question. Naturally, it's the perfect fit (more or less), so I start using it. However, it turns out that the documentation is a bit of a liar. I would expect the end location to return the location (either file offset or file/line) of the end of the expression, give or take a character depending on whether people prefer half-open or fully closed ranges. Instead, it returns the location of the last token in the expression. Getting the true last token requires this mess: sm.getFileOffset(Lexer::getLocForEndOfToken(end, 0, sm, features)). Oh yeah, in case you couldn't have guessed, that causes it to relex the file starting at that point. For an idea of my current mood, pick a random image from Jennifer's recent post and you'll likely capture it.

DXR updates

It's been almost a week since I last discussed DXR, but, if you look at the commit log, you can see that I've not exactly been laying back and doing nothing. No, the main reason is because most of the changes I've been doing so far aren't exactly ground-breaking.

In terms of UI, the only thing that's changed is that I know actually link the variable declarations correctly, a result of discovering the typo that was causing it not to work properly in the midst of the other changes. From a user's perspective, I've also radically altered the invocation of DXR. Whereas before it involved a long series of steps consisting of "build here, export these variables, run this script, build there, run that script, then run this file, and I hope you did everything correctly otherwise you'll wait half an hour to find you didn't" (most of which takes place in different directories), it's now down to . dxr/setup-env.sh (if you invoke that incorrectly, you get a loud error message telling you how to do it correctly. Unfortunately, the nature of shells prohibits me from just doing the right thing in the first place), build your code, and then python dxr/dxr-index.py in the directory with the proper dxr.config. Vastly improved, then. But most of my changes are merely in cleaning up the code.

The first major change I made was replacing the libclang API with a real clang plugin. There's only one thing I've found harder (figuring out inheritance, not that I'm sure I've had it right in the first place), and there are a lot of things I've found easier. Given how crazy people get with build systems, latching onto a compiler makes things go a lot smoother. The biggest downside I've seen with clang is that it's documentation is lacking. RecursiveASTVisitor is the main visitor API I'm using, but the documentation for that class is complete and utter crap, a result of doxygen failing to handle the hurdle of comprehending macros. I ended up using the libindex-based dxr to dump a database of all of the functions in the class, of which there are something like 1289, or around 60 times the functions listed in the documentation. Another example is Decl, where the inheritance picture is actually helpful. It, however, manages to have failed to document DeclCXX.h, which is a rather glaring omission if what you are working with is C++ source.

The last set of changes I did was rearchitecting the code to make it hackable by other people. I have started on a basic pluggable architecture for actually implementing new language modules, although most of the information is still hardcoded to just use cxx-clang. In addition, I've begun work on ripping out SQL as the exchange medium of choice: the sidebar list is now directly generated using the post-processing ball of information, and linkification is now set up so that it can be generated independent of tokenization. In the process, I've greatly sped up HTML generation times only to regress a lot of it by tokenizing the file twice (the latter part will unfortunately remain until I get around to changing tokenization API). It shouldn't take that much longer for me to rip SQL fully out of the HTML builder and shove SQL generation into a parallel process for improved end-to-end time.

Some things I've discovered about python along the way. First, python closures don't let you mutate closed-over-variables. That breaks things. Second, if you have a very large python object, don't pass it around as an argument to multiprocessing: just make it a global and let the kernel's copy-on-write code make cloning cheap. Third, apparently the fastest way to concatenate strings in python is to use join with a list comprehension. Finally, python dynamic module importing sucks big time.

Where am I going from here? I'll have HTMLification (i.e., remove SQL queries and replace with ball lookups) fixed up tomorrow; after that, I'll make the cgi scripts work again. The next major change after that is getting inheritance and references in good shape, and then making sure that I can do namespaces correctly. At that point in time, I'll think I'll be able to make a prerelease of DXR.

Wednesday, June 8, 2011

Fakeserver and future changes

This is perhaps a bit belated, but I thought I'd give a brief overview of where I think fakeserver should be going over the next several months (i.e., when I find time to work on it). For those who don't know, Fakeserver is a generalized testing framework for the 4 major mail protocols: IMAP, NNTP, SMTP, and POP. For various reasons, I'm actually getting around to fixing problems with it right now. The following are the features I am going to be adding:

Multiple connections

This is a long-standing bug in fakeserver, which I first discovered about 3 years ago, when I started testing the IMAP implementation. Fakeserver uses the same connection handler for every connection, which means the state of the connection is shared between all connections, obviously very problematic for a protocol as stateful as IMAP. In my defense, I cribbed from the stateless (more or less) HTTP server, and used the barely-stateful NNTP as my first testbed. This bug is probably the most API-breaking of any change to fakeserver code: every invocation of the server must replace new nsMailServer(new handler(daemon)) with new nsMailServer(function (d) { return new handler(d); }, daemon). Subsequent to that, handlers are no longer easily exposed by the server, which means all manipulation must be on the daemon end. This causes major changes to SMTP code and NNTP code for different reasons; pretty much everyone else hides between helper functions that act as a firewall to minimizing code changes.

Multi-process fakeserver

This change (which has no patch yet) will move the fakeserver running into another process. I have some working prototypes for manipulating objects across two different processes, but I don't yet have a good IPC mechanism (which Mozilla strangely lacks) for actually communicating this information. As for API changes, I haven't gotten far enough yet to know the impact of multiprocess fakeserver, but the answer will probably that handler information goes from "hard to get" to "impossible to get". On the plus side, this should essentially enable mozmill tests to be able to use fakeserver.

SSL fakeserver

I figure fakeserver needs SSL support--it's common in the mail world to use SSL instead of plaintext, so we need to support it. In terms of external API, the only change will be some added methods to start up an SSL server (and methods to get STARTTLS working to, provided my knowledge of how SSL works is correct). The API pseudo-internal to maild.js gets more convoluted, though: essentially, I need to support multiple pumping routines for communication. On the plus side, though, if the API I'm planning on implementing turns out to be feasible, we should also be able to get generic charset decoding for nearly free, for i18n tests.

LDAP fakeserver

LDAP address book code is almost completely untested. I'd like to see that fixed. However, LDAP is a binary protocol (BER, to be precise). If I can implement the SSL support in the API framework I want to, then it shouldn't be that much more to get a BER-to-text-ish layer thrown in on top of it. The downside is that I'm pretty much looking at using our own LDAP library to do BER encoding/decoding since I don't want to write that myself.

Saturday, June 4, 2011

DXR thinking aloud

A primary goal of DXR is to be easy to use. Like any simple statement, translating that into design decisions is inordinately difficult, and I best approach such issues by thinking out loud. Normally, I do this in IRC channels, but today I think it would be best to think in a louder medium, since the problem is harder.

There are three distinct things that need to be made "easy to use". The first is the generation of the database and the subsequent creation of the web interface. The second is extension of DXR to new languages, while the last is the customization of DXR to provide more information. All of them have significant issues.

Building with DXR

Starting with the first one, build systems are complicated. For a simple GNU-ish utility, ./configure && make is a functioning build system. But where DXR is most useful is on the large, hydra-like projects where figuring out how to build the program is itself a nightmare: Mozilla, OpenJDK, Eclipse, etc. There is also a substantial number of non-autoconf based systems which throw great wrenches in everything. At the very least, I know this much about DXR: I need to set environment variables before configuring and building (i.e., replace the compiler), I need to "watch" the build process (i.e., follow warning spew), and I need to do things after the build finishes (post-processing). Potential options:

Tell the user to run this command before the build and after the build. On the plus side, this means that DXR needs to know absolutely nothing about how the build system works. On the down side, this requires confusing instructions: in particular, since I'm setting environment variables, the user has to specifically type ". " in the shell to get them set up properly. There are a lot of people who do not have significant shell exposure to actually understand why that is necessary, and general usage is different enough from the commands that people are liable to make mistakes doing so.
Guess what the build system looks like and try to do it all by ourselves. This is pretty much the opposite extreme, in that it foists all the work on DXR. If your program is "normal", this won't be a problem. If your program isn't... it will be a world of pain. Take also into consideration that any automated approach is likely to fail hard on Mozilla code to begin with, which effectively makes this a non-starter.
Have the user input their build system to a configuration file and go from there. A step down from the previous item, but it increases the need for configuration files.
Have DXR spawn a shell for the build system. Intriguing, solves some problems but causes others.

Conclusion: well, I don't like any of those options. While the goal of essentially being able to "click" DXR and have it Just Work™ is nice, I have reservations about such an approach being able to work in practice. I think I'll go for a "#1 and punt on this issue to someone with more experience."

Multiple language

I could devote an entire week's worth of blog posts to this topic, I think, and I would wager that this is more complicated and nebulous than even build systems are. In the end, all we really need to worry about with build systems is replacing compilers with our versions and getting to the end; with languages, we actually need to be very introspective and invasive to do our job.

Probably the best place to start is actually laying out what needs to be done. If the end goal is to produce the source viewer, then we need to at least be able to do syntax highlighting. That by itself is difficult, but people have done it before: I think my gut preference at this point is to basically ask authors of DXR plugins to expose something akin to vim's syntax highlighting instead of asking them to write a full lexer for their language.

On the other end of the spectrum is generating the database. The idea is to use an instrumenting compiler, but while that works for C++ or Java, someone whose primary code is a website in Perl or a large Python utility has a hard time writing a compiler. Perhaps the best option here is just parsing the source code when we walk the tree. There is also the question about what to do with the build system: surely people might want help understanding what it is their Makefile.in is really doing.

So what does the database look like? For standard programming languages, we appear to have a wide-ranging and clear notion of types/classes, functions, and variables, with slightly vaguer notions of inheritance, macros (in both the lexical preprocessing sense and the type-based sense of C++'s templates), and visibility. Dynamic languages like JavaScript or Python might lack some reliable information (e.g., variables don't have types, although people often still act as if they have implicit type information), but they still uphold this general contract. If you consider instead things like CSS and HTML or Makefiles in the build system, this general scheme completely fails to hold, but you can still desire information in the database: for example, it would help to be able to pinpoint which CSS rules apply to a given HTML element.

This begs the question, how does one handle multiple languages in the database? As I ponder this, I realize that there are multiple domains of knowledge: what is available in one language is not necessarily available in another. Of the languages Mozilla uses, C, assembly, C++, and Objective-C[++] all share the same ability to access any information written in the other languages; contrast this to JS code, which can only interact with native code via the use of a subset of IDL or interactions with native functions. IDL is a third space, which is a middle ground between native and JS code, but is insufficiently compatible with either to be lumped in with one. Options:

Dump each language into the same tables with no distinction. This has problems in so far as some languages can't be shoehorned into the same models, but I think that in such cases, one is probably looking for different enough information anyways that it doesn't matter. The advantage of this is that searching for an identifier will bring it up everywhere. The disadvantage... is that it gets brought up everywhere.
Similar to #1, but make an extra column for language, and let people filter by language.
Going a step further, take the extra language information and build up the notion of different bindings: this foo.bar method on a python object may be implemented by this Python_foo_bar C binding. In other words, add another table which lists this cross-pollination and takes it into account when searching or providing detailed information<.
Instead of the language column in tables, make different tables for every language.
Instead of tables, use databases?

Hmm. I think the binding cross-reference is important. On closer thought, it's not really languages themselves that are the issue here, it's essentially the target bindings: if we have a system that is doing non-trivial build system work that involves cross-compiling, it matters if what we are doing is being done for the host or being done for the target. Apart from that, I think right now that the best approach is to have different tables.

Extraneous information

The previous discussion bleeds into this final one, since they both ultimately concern themselves with one thing: the database. This time, the question is how to handle generation of information beyond the "standard" set of information. Information, as I see it, comes in a few forms. There is additional information at the granularity of identifiers (this function consumes 234 bytes of space or this is the documentation for the function), lines (this line was not executed in the test suite), files (this file gets compiled to this binary library), and arguably directories or other concepts not totally mappable to constructs in the source tree (e.g., output libraries).

The main question here is not on the design of the database: it's only a question of extra tables or extra columns (or both!). Instead, the real question is in the design of the programmatic mechanisms. In dxr+dehydra, the simple answer is to load multiple scripts. For dxr+clang, however, the question becomes a lot more difficult since the code is written in C++ and isn't dynamically loading modules like dehydra does. It also begins to beg the question of the exposed API. On the other hand, I'm not sure I know enough of the problem space to be able to actually come up with solutions. I think I'll leave this one for later

Thursday, June 2, 2011

Who's responsible for libxul

As you might have gathered from my last post, I spend a fair amount of time thinking up ways of visualizing software metrics. Today's visualization is inspired by a lunch conversation I had yesterday. Someone asked "how much space does SVG take up in libxul?" This isn't a terribly hard problem to solve: we have a tool that can tell us the size of symbols (i.e., codesighs), and I have a tool that gives me a database of where every symbol is defined in source code.

Actually, it's not so simple. codesighs, as I might have predicted, doesn't appear to work that well, so I regressed back to nm. And when I actually poked the generated DXR database for mozilla-central using dehydra (I haven't had the guts to try my dxr-clang integration on m-c yet), I discovered that it appears to only lack the information about functions. Both static functions and member C++ functions—my suspicions of breakage are now confirmed. So I resorted to one of my "one-liner" scripts, as follows:

cat <<HEREDOC
<!DOCTYPE html>
<html><head>
<title>Code size by file</title>
<script type="application/javascript" src="https://www.google.com/jsapi"></script>
<script type="application/javascript">
google.load("visualization", "1", {packages: ["treemap"]});
google.setOnLoadCallback(drawChart);
function drawChart() {
  var data = new google.visualization.DataTable();
  data.addColumn('string', 'File');
  data.addColumn('string', 'Path');
  data.addColumn('number', 'Code size (B)');
HEREDOC
nm --format=bsd --size-sort /src/build/trunk/browser/dist/bin/libxul.so | cut -d' ' -f 3 > /tmp/libxul.symbols.txt
find -name "*.o"| xargs nm --format=bsd --size-sort --print-file-name | sed -e 's/:/ /g' | python -c '
import sys
symbols = set(open("/tmp/libxul.symbols.txt").readlines())
files = {}
for line in sys.stdin:
  args = line.split(" ")
  if args[-2] in "tTvVwW" and args[-1] in symbols:
    files[args[0]] = files.get(args[0], 0) + int(args[1], 16)
out_lines = set()
for f in files:
  f2 = f.replace("./", "")
  path = f2.split("/")
  str = "src/"
  for p in path[:-1]:
    out_lines.add((str + p + "/", str, 0))
    str += p + "/"
  out_lines.add((f2, str, files[f]))
print "data.addRows([" + ",\n".join([repr(list(x)) for x in out_lines]) + "]);"
'

cat <<HEREDOC
  data.addRow(["src/", null, 0]);
  var tree = new google.visualization.TreeMap(document.getElementById('tmap'));
  tree.draw(data, {maxDepth: 3, headerHeight: 0, noColor: "#ddd", maxColor: "#ddd",
    minColor: "#ddd", midColor: "#ddd"});
}
</script></head>
<body>
  <div id="tmap" style="width: 100%; height: 1000px;"></div>
</body>
</html>
HEREDOC

It's really 4 commands, which just happens to include two HEREDOC cats and a multi-line python script. The output is a ~~little~~not-so-little HTML file that loads a treemap visualization using the google API. The end result can be seen here. To answer the question posed yesterday: SVG accounts for about 5-6% of libxul, in terms of content, and about another 1% for the layout component; altogether, about as much as our xpconnect code accounts for libxul. For Thunderbird folks, I've taken the time to also similarly divvy up libxul, and that can be found here.

P.S., if you're complaining about the lack of hard numbers, blame Google for giving me a crappy API that doesn't allow me to make custom tooltips.

Visualization toolkits suck

This is a rant in prelude to a blog post I expect to make tomorrow morning (or later today, depending on your current timezone). As someone who is mildly interested in data, I spend a fair amount of time thinking up things that would be interesting to just see dumped out visually. And I'm not particularly interested in small datasets: one of my smaller datasets I've been playing with is "every symbol in libxul.so"; one of my larger datasets is "every news message posted to Usenet this year, excluding binaries" (note: this is 13 GiB worth of data, and that's not 100% complete).

So while I have a fair amount of data, what I need is a simple way to visualize it. If what I have is a simple bar chart or scatter plot, it's possible to put it in Excel or LibreOffice... only to watch those programs choke on a mere few thousand data points (inputting a 120K file caused LibreOffice to hang for about 5 minutes to produce the scatter plot). But spreadsheet programs clearly lack the power for serious visualization information; the most glaring chart they lack is the box-and-whiskers plot. Of course, if the data I'm looking at isn't simple 1D or 2D data, then what I really need won't be satiable with them anyways.

I also need to visualize large tree data, typically in a tree map. Since the code for making squarified tree maps is more than I care to do for a simple vis project, I'd rather just use a simple toolkit. But which to use? Java-based toolkits (i.e., prefuse) require me to sit down and make a full-fledged application for what should be a quick data visualization, and I don't know any Flash to be able to use Flash toolkits (i.e., flare). For JavaScript, I've tried the JavaScript InfoVis Toolkit, Google's toolkit, and protovis, all without much luck. JIT and protovis both require too much baggage for a simple "what does this data look like", and Google's API is too inflexible to do anything more than "ooh, pretty treemap". Hence why my previous foray into an application for viewing code coverage used a Java applet: it was the only thing I could get working.

What I want is a toolkit that gracefully supports large datasets, allows me to easily drop data in and play with views (preferably one that doesn't try to dynamically change the display based on partial options reconfiguration like most office suites attempt), supports a variety of datasets, and has a fine-grained level of customizability. Kind of like SpotFire or Tableau, just a tad bit more in my price range. Ideally, it would also be easy to put on the web, too, although supporting crappy old IE versions isn't a major feature I need. Is that really too much to ask for?

Wednesday, June 1, 2011

Building with dxr-clang

As I'm building yet another program from scratch (I've now run dxr on four separate trees in the past week, and I'm still looking for a good one to use for tests), let me take the time to explain how to set up DXR on your own computer. Instead of stuffing it in this blogpost, most of the information is instead on wiki.mozilla.org. One thing not on that page is that I currently have the sql files in the clang dumper dump to the source directory instead of the object directory, so the DXR config needs to have the source and object directory be the same setting, even if you don't actually build the source in-place!

Friday, May 27, 2011

clang-DXR

As mentioned previously, the first thing I started doing was getting DXR to use libclang to build the indexing. As of right now, I have a tool which kind of works. In that it can successfully build a program but utterly fails to produce reliable SQL output due to constantly swallowed segfaults (I think). My test program uses cmake to build, so it "helpfully" hides the noise of what is actually going on, so debugging will have to wait until next week.

The main unexpected hurdle is that for all of the wonderful praise clang has had lavished on it, I have found it surprisingly difficult to use. I had expected to be able to do a more or less straightforward port of dxr.js, with some cosmetic changes to handle differences in naming, etc., between dehydra and libclang. That expectation is now hopelessly gone; libclang provides nothing near the clean API that dehydra does. For example, to get the fully-qualified name of a class, instead of being able to say type.name, I have to walk up the entire AST hierarchy of the class to find all of its containing classes and namespaces to do it myself. If I want to get one for a function (which, in C++, requires differentiating between overloads, so I need type parameters), I have to build my own type-to-string dumper. Indeed, no less than half of my code so far is merely trying to build the type names I need to dump out. In addition, it seems that the clang people are dismissive of an idea to include a higher level API than is currently present.

If you didn't notice from the links, by the way, dxr is now on github.

Monday, May 23, 2011

Summer DXR work

So, far this summer, I will be mostly putting away Thunderbird work and will instead be focusing my attention on reviving DXR. Sometime this week, I expect to be updating the installation on dxr.mozilla.org (simply so I can match up better on DXR to what I'm doing locally).

What will I be doing on DXR? First and foremost, I will be making it easier to run DXR on anything that is not Mozilla. Or anything that is Mozilla, for that matter. This will be done by cleanly separating the build/instrumentation steps from the rest of DXR. In other words, setting CC/CXX/LD/etc. should be all you need to do build your program.

After that, I will be working on using Clang to get information instead of using gcc+dehydra. If I understand the current Apple development process, supporting clang is essential to supporting Mac OS X builds in the future. I will also work on getting IDL and JS support added.

One other thing I want to do is (time permitting) to get DXR to represent the results of different builds, so that you can get results other than just the current Linux build. I have a list of other ideas to play around with, but I'll leave that for later, since I don't know if I'll time to start them or not.

Don't expect dxr.mozilla.org to see all of the new features immediately. I'll update when I feel like it (i.e., I see enough stability to do it), and it's easier for me to work more locally on my laptop. I also find it easier to do rapid updates on a smaller project than Mozilla which takes quite some time to do clobber rebuilds; I don't know which project(s) I'll be using for all of my work yet, so don't bother asking.

Friday, January 21, 2011

Usage share of newsreaders, update

A few months ago, I logged the usage share of various newsreaders for roughly the month of August. Since then, I ran updated tests, alternating monthly between logging by users and messages, which gives me statistics for September through November. Later months are not available because the newsserver I used has now gone away (without notifying users!), and I did not want to switch this script to the new one, since comparability is lost.

One of the uses I had for original statistics collection was to argue why NNTP support for Thunderbird still matters. During an IRC discussion, it was brought up that August is a poor month for logging since there is a tradition of using that month for vacation. Pulling up the data for the month of October, the last one for which I have this data, indicates that approximately 720,000 messages were posted that month, indicating that August is indeed a poor month for indicating volume.

Have the statistics changed much? Google Groups and Thunderbird are both within .2% absolute difference of the scores I calculated last time (44.02% and 12.3%, respectively). Down the line, things change: Outlook Express had 8.98%, followed by Forte Agent at 8.86%. Live Mail had 2.83% and MT-NewsWatcher had 2.51%. Indeed, the tail is longer, with 20.52% as compared to before.

As my new server has a longer retention time, I no longer wish to use the same script as before. My next goal is to log every header of every message posted this year, so that I may collect more information without having to list everything I need, particularly information useful in determining the user of mail-to-news gateways and information to help identify spamminess of messages. I have lots of ideas for possible analysis of data, but first I want usable data.

Monday, January 17, 2011

News URI handling

I have built a version of Thunderbird that should fix news URI handling issues, obtainable from this site. This is a patch queue based off of trunk just after the 3.3a2 release branch, so it should have all of the features of 3.3a2 if not the version number.

In particular, both news and nntp links of any type should work, including news URIs without a server. If any of those links do not work, please tell me, including the circumstances in which they didn't work (e.g., where did you click it, was TB open, were you subscribed to the group or not, etc.). Also, there is a chance that this could regress other handling in news code or OS command line handling, so if you see such regresses, please also tell me.

Thanks in advance for your testing!

Saturday, January 15, 2011

The Great Codec War

At the Second Battle of Chrome, WebM seems to have struck a surprise victory against H.264, when Google announced that it was dropping support for H.264 in <video>. Well, maybe it was only a pyrrhic victory. Reactions seem to differ a lot, but I think a lot of them miss the mark.

I've seen some people (I'm looking at you, Ars) claim that this will help kill HTML 5 video. This claim seems to me to be bogus: HTML 5 video effectively died years ago when no one could agree on a codec. Of the several video sites I use, the only one to support HTML 5 is YouTube; everyone else uses Flash. Since it's already dead, you can hardly kill it by switching codecs. This just shifts the balance more in favor of WebM. And as for H.264 working nearly everywhere, AppleInsider, your chart is just Blatant Lies.

Another thing that people do is compare video codecs to image codecs, most particularly GIF. But H.264 is not GIF: GIF became wildly popular before Unisys seemed to realize that it violated LZW. It is also unclear, looking back a decade after the fact, if Unisys targeted only encoders or both (lzw.info implies both, but Mozilla had GIF code long before the patent expired, and I don't see any information about LZW licensing). H.264, however, clearly mentions licensing for the decoder. Furthermore, it was relatively easy to make a high-quality image codec that doesn't require stepping on patents (which we now call PNG); the video codec market is much more strangled to make that impossible.

While on the topic, I've also seen a few statements that point out that H.264 is an ISO standard and WebM is not. Since people love to make the comparison between H.264 and GIF, I will point out that GIF is not an ISO standard nor an RFC, ITU, W3C, or IEEE document (although the W3C does have a copy of it on their website, it appears to not be accessible from their website by internal links). The commentary about "open standards" typically means "I can implement it by reading this/these specification(s) without paying a fee to anybody", not "there exists an officially-approved, freely-available standard" (incidentally, ISO standards generally are NOT freely-available).

But what about Flash, most people say, both on the issue of support for H.264 (although it will be supporting WebM as well) as well as the fallacy of open support. The answer to that is two simple words: "Legacy content." Flash works for everybody but a few prissy control freaks, and so much stuff—more than just video, in fact—uses it that not supporting it is impractical. Remember, half the web's users do not have HTML 5-capable browsers.

All of that said, where things go in the future is very much an open question. I see several possible directions:

The U.S. declares software patents invalid. Mr. O'Callahan can tell you one scenario that could cause this. It's actually not implausible: the Supreme Court in Bilski seemed mildly skeptical of expansive patentability claims, and a relatively clean software patent claims would probably allow them to make a coherent "narrow" ruling on software patents in general. And, though the U.S. is not the world, an anti-software patent U.S. ruling would probably lead to nullification of software patents worldwide.
MPEG-LA changes their minds and allows royalty-free decoding (not encoding) of H.264. This solution is fairly implausible, unless MPEG-LA desperately decides to try this gambit to stop H.264 from becoming obsolete. The circumstances which would lead them to do this would probably be on the back of a steep descent in H.264 popularity, so the actual value of this outcome would be minor.
Apple caves in and allows either Flash or WebM on iOS. With alternative browsers on mobile allowing these options, that means only about 17% of the mobile market has no support for video other than H.264. Depending on the success of other OSs, this may force Apple to support one of the two alternatives to allow video to work on iOS. I don't know how plausible this is, but seeing as how Android is both newer than iOS and more popular, a long-term decline in Apple's fortunes is not unreasonable.
The world continues as it does today, with no single solution supporting everybody. Not ideal, but it is the path of least resistance. Unfortunately, it's also probably the most likely.

Monday, January 10, 2011

Developing new account types, Part 4: Displaying messages

This series of blog posts discusses the creation of a new account type implemented in JavaScript. Over the course of these blogs, I use the development of my Web Forums extension to explain the necessary actions in creating new account types.

In the previous blog post, I showed how to implement the folder update. Our next step is to display the messages themselves. As of this posting, I will refer less frequently to my JSExtended-based framework (Kent James's SkinkGlue is a less powerful variant; a final version will likely be a hybrid of the two technologies)—it will be slowly phased out over the rest of these series of blog posts.

URLs and pseudo-URLs

As mentioned earlier, messages have several representations. Earlier, we used the message key and the message header as our representations; now, we will be using two more forms: message URIs and necko URLs [1]. The message URI is more or less a serialization of the folder and key unique identifier. It does not have any further property of a "regular" URL (hence the title); most importantly, it is not (necessarily) something that can be run with necko. To convert them to necko URLs, you need to use the message service.

Because message URIs require an extra step to convert to necko URLs, most of the message service uses the message URI instead of the URL (anytime you see a raw string or a variable named messageURI, or (most of the time) URI, it is this pseudo-URL that is being referred to). Displaying messages involves a call to the aptly-named DisplayMessage. Unfortunately, it's also not quite so aptly-named in that it can also effectively mean "fetch the contents of this message to a stream," but I will discuss this later.

This is where the bad news starts. First off, mailnews is a bit lazy when it comes to out parameters. Technically, XPCOM requires that you pass in pointers to all outparams to receive the values; a lot of the calls to DisplayMessage don't pass this value because they ignore it anyways. Second, one of the key calls needed in DisplayMessage turns out to be a [noscript] method on an internal Gecko object. What this means is you can't actually implement the message service in JavaScript.

There is good news, however. Many of the methods in nsIMsgMessageService are actually variants of "fetch the contents of this message"; indeed, the standard implementations typically funnel the methods to a FetchMessage. My solution is to reduce all of this to a single method that you have to implement, and you get your choice of two ways to run it. Owing to implementation design artifacts, I've done it both ways and can show it to you.

Body channel

The first way is to stream the body as a channel. This is probably not the preferred method. Telling us that you did this is simple:

wfService.prototype = {
  getMessageContents: function (aMsgHdr, aMsgWindow, aCallback) {
    let task = new LoadMessageTask(aMsgHdr);
    aCallback.deliverMessageBodyAsChannel(task, "text/html");
  }
};

Seriously, that's the full code to say that you have a channel. The channel itself implements nsIChannel, but we only use very few methods: asyncOpen (we never synchronously open), isPending, cancel, suspend, and resume. The primary purpose of the channel is just to funnel the input stream of the body (not the message headers; those will be written based on the message header). The channel implementation is moderately simple:

function LoadMessageTask(hdr) {
  this._hdr = hdr;
  this._uri = hdr.folder.getUriForMsg(hdr);
  this._server = hdr.folder.server;
}
LoadMessageTask.prototype = {
  runTask: function (protocol) {
    this._listener.onStartRequest(this, this._channelCtxt);
    this._pipe = Cc["@mozilla.org/pipe;1"].createInstance(Ci.nsIPipe);
    this._pipe.init(false, false, 4096, 0, null);
    /* load url */
  },
  onUrlLoaded: function (document) {
    let body = /* body */;
    this._pipe.outputStream.write(body, body.length);
    this._listener.onDataAvailable(this, this._channelCtxt,
      this._pipe.inputStream, 0, this._pipe.inputStream.available());
  },
  onTaskCompleted: function (protocol) {
  this._listener.onStopRequest(this, this._channelCtxt, Cr.NS_OK);
  },
  QueryInterface: XPCOMUtils.generateQI([Ci.nsIChannel, Ci.nsIRequest]),
  asyncOpen: function (listener, context) {
    if (this._listener)
      throw Cr.NS_ERROR_ALREADY_OPENED;
    this._listener = listener;
    this._channelCtxt = context;

    // Fire off the task!
    this._server.wrappedJSObject.runTask(this);
  }
};

There are some things to note. First, this code can synchronously callback onStartRequest from runTask, which is a necko no-no. However, our magic glue channel gracefully handles this (by posting the call to asyncOpen in another event). Loading the input stream is done with a pipe here, and I'm doing a quick-and-easy implementation that does not take into account potential internationalization issues. I also haven't bothered to implement the other methods I should here, mostly because this code is primarily an artifact of an earlier approach, whose only purpose now is demonstrating channel-based loading.

Body input streams

The second method of implementation is just to give us the message body as an input stream:

getMessageContents: function (aMsgHdr, aMsgWindow, aCallback) {
  let pipe = Cc["@mozilla.org/pipe;1"].createInstance(Ci.nsIPipe);
  pipe.init(false, false, 4096, 0, null);
  aCallback.deliverMessageBodyAsStream(pipe.inputStream, "text/html");
  aMsgHdr.folder.server.wrappedJSObject.runTask(
    new LoadMessageTask(aMsgHdr, pipe.outputStream));
}

function LoadMessageTask(hdr, outstream) {
  this._hdr = hdr;
  this._outputStream = outstream;
}
LoadMessageTask.prototype = {
  runTask: function (protocol) {
    protocol.loadUrl(/* url */, protocol._oneShot);
  },
  onUrlLoaded: function (document) {
    let body = /* body */;
    this._outputStream.write(body, body.length);
  },
  onTaskCompleted: function (protocol) {
    this._outputStream.close();
  }
};

Here, the basic appraoch is still the same: we open up a pipe, stuff our body in one end and give the other end to the stream code. However, we don't need to do the other work that comes with loading the URI, which streamlines the code greatly. We can also pass in to the callback method an underlying request that will take care of network load stopping, etc., for us if we so choose, but the argument is optional.

More implementation

Naturally, you have to add some more contract implementations to get all of the services to work right. The following is a sample of my chrome.manifest as it stands:

component {207a7d55-ec83-4181-a8e7-c0b3128db70b} components/wfFolder.js
component {6387e3a1-72d4-464a-b6b0-8bc817d2bbbc} components/wfServer.js
component {74347a0c-6ccf-4b7a-a429-edd208288c55} components/wfService.js
contract @mozilla.org/nsMsgDatabase/msgDB-webforum {e8b6b6ca-cc12-46c7-9a2c-a0855c311e07}
contract @mozilla.org/rdf/resource-factory;1?name=webforum {207a7d55-ec83-4181-a8e7-c0b3128db70b}
contract @mozilla.org/messenger/server;1?type=webforum {6387e3a1-72d4-464a-b6b0-8bc817d2bbbc}
contract @mozilla.org/messenger/protocol/info;1?type=webforum {74347a0c-6ccf-4b7a-a429-edd208288c55}
contract @mozilla.org/messenger/backend;1?type=webforum {74347a0c-6ccf-4b7a-a429-edd208288c55}
contract @mozilla.org/messenger/messageservice;1?type=webforum-message {7e3d2918-d073-4c98-9ec7-f419a05c29de}

The first and last CIDs, as you'll notice, were not implemented by me (well, kind of). The first is the CID of nsMsgDatabase that I've exposed in one of my comm-central patches; the latter is the CID of my extension message service implementation. Also of importance is that I included a second contract-ID for my service implementation, this is for my new interface msgIAccountBackend, which is the source of the getMessageContents method I implemented earlier, and which you also need to implement to get it to work.

Finally, you need to generate the message URI properly. Fortunately, this just requires you to implement one method:

wfFolder.prototype = {
  get baseMessageURI() {
    if (!this._inner["#mBaseMessageURI"])
      this._inner["#mBaseMessageURI"] = "webforum-message" +
        this._inner["#mURI"].substring("webforum".length);
    return this._inner["#mBaseMessageURI"];
  }
};

Under the hood

For those who wish to know more about is actually going on, I am going to describe the full loading process, from the moment you click on the header to the time you see the output.

Clicking on the header (after some Gecko code that I'll elide) leads you to nsMsgDBView::SelectionChanged. This code is kicked back to the front-end via nsIMsgWindow::commandUpdater's summarizeSelection method. For Thunderbird, this is the method that handles clearing some updates and also decides whether or not to show the message summary (which is "yes if there is a collapsed thread, this pref is set, and this is not a news folder" [2]). Summarization is a topic I'll handle later.

In the case of a regular message, the result of the loading is to display the message. The message URI is constructed, and then passed to nsMessenger::OpenURL, which calls either nsIMsgMessageService::DisplayMessage or nsIWebNavigation::LoadURI, depending on whether or not it can find the message service. The message service converts its URI to the necko URL and then passes that—since it's passed in with the docshell as a consumer—to LoadURI with slightly different flags. And thus begins the real message loading.

Loading URLs by the docshell is somewhat complicated, but it boils down to creating the channel, opening it via AsyncOpen. When the channel is opened (OnStartRequest is called), it tries to find someone who can display it, based on the content type. It turns out that there is a display handler in the core mailnews code that can display message/rfc822 messages, which it does by converting the text into text/html (via libmime) and using the standard HTML display widget. I'm going to largely treat libmime as a black box; it processes text as OnDataAvailable is called and spits out HTML via a mixture of OnDataAvailable and callbacks via the channel's url's header sink, or the channel's url's message window's header sink.

The special extension message service implementation goes a few steps further. By managing the display and channel code itself, it allows new implementors to not worry so much about some of the particular requirements during the loading process. Its AsyncOpen method is guaranteed to not run OnStartRequest synchronously, and also properly manages the load groups and content type manipulation. Furthermore, the channel manually synthesizes the full RFC 822 envelope (the code inspired by some compose code), and ensures that the nsIStreamListener methods are called with the proper request parameter (the original loaded channel must be the request passed).

Alternative implementation

It is still possible to do this without using the helper implementation. In that case, there are alternatives. The first thing to do is to implement the network handler, for which you'll definitely need a protocol implementation, and probably a channel and url as well. A url that does not implement nsIMsgMailNewsUrl and nsIMsgMessageUrl is likely to run into problems with some parts of the code. You can possibly get by without a message service for now, but I suspect it is necessary for some other portions of the code. To get the message header display right, you need a message/rfc822 content-type (which gets changed to text/html, so it has to be settable!).

A possible alternate implementation would be to send a straight text/html channel for the body and then manually call the methods on the header sink, i.e., bypass libmime altogether. A word of caution about this approach is that libmime can output different things based on the query parameters in the URL, and I don't know which of those outputs are used or not.

Next steps

Now that we have message display working, we pretty much have a working implementation of the process of getting new messages and displaying them. There are several ways I can go from here, but for now, I'll make part 5 deal with the account manager. Other parts that I am planning to do soon include dealing with subscription, filters, and other such code.

Notes

If you are not aware, "necko" refers to the networking portion of the Mozilla codebase. The terms "URI" and "URL" also have standard meanings, but for the purposes of this guide, they mean different things. I will try to keep them distinct, but I have a tendency to naturally prefer "URI" most of the time, so I may slip up.
Unfortunately, a lot of the front-end code has taken it upon itself to hardcode checks for certain implementations to enable/disable features. Hopefully, as Kent James and I progress on this work, these barriers can be reduced.