Note: this page is somewhat of a work-in-progress. Please feel free to edit with additions, clarifications, better formatting, etc.

Getting started

1. Skim this document to get a feeling for IR: http://llvm.org/docs/LangRef.html

Look at the table of contents to get the basic idea. Check out the "Type System" section, and browse some of the instructions.
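For a first taste of what LangRef describes, here is a minimal hand-written IR function (a sketch, not Impala code) showing a type, an instruction, and a terminator:

```llvm
define i32 @add_one(i32 %x) {
entry:
  ; i32 is a 32-bit integer type; %x is an SSA value (assigned exactly once)
  %result = add i32 %x, 1
  ; every basic block must end with a terminator such as ret or br
  ret i32 %result
}
```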

2. Learn to read IR, and then learn to read IR-generating functions. First focus on the IR itself (most codegen functions have sample IR output at the top of the implementation), then on how the IR is built.

Some examples to get started:

  • hdfs-avro-scanner.cc, CodegenMaterializeTuple
    • understand how we type the tuple* based on the tuple layout

3. Review the Adding a builtin function to Impala documentation for the relatively easy-to-understand UDF use case.

4. Cross-compiled functions (TODO) 


I do not recommend plowing through the LLVM tutorial. It has some useful info, but is mostly useless for our purposes.

Debugging

Most codegen bugs will manifest by Impala crashing. Here are things to look for to help diagnose:

  • LLVM includes many asserts that trigger when you try to generate bad IR (e.g. creating a call instruction with the wrong arguments). These print a message to stderr (redirected to impalad.ERROR by default). The asserts include no stack trace or line numbers, so you will have to use log statements or gdb to find which call is triggering the assert.
  • LlvmCodegen::FinalizeFunction() calls VerifyFunction(), which will catch more problems (e.g. a basic block not ending with a terminating instruction). Search for "Function corrupt" in the ERROR log (the message may be far above the end of the log, because the whole function is dumped after it). Currently this doesn't crash Impala immediately; instead FinalizeFunction() returns a NULL Function*, which will either crash Impala later or silently disable codegen.
  • If you segfault while running the generated code after the module has been successfully optimized, you're screwed (just kidding, but these are the hardest crashes to debug).
    • Look at the offending assembly in gdb (command to view instruction at frame: x/i $pc)
    • Study the generated IR (more on that below)
    • If you're running a cross-compiled function, comment out ReplaceCallSites() calls to narrow in on the buggy IR function
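As an illustration of the "Function corrupt" case above, here is a sketch (hypothetical, not Impala code) of a function VerifyFunction() would reject, next to the fixed version:

```llvm
; Corrupt: the entry block has no terminating instruction, so
; verification fails and "Function corrupt" is logged.
define void @broken(i32 %x) {
entry:
  %y = add i32 %x, 1
  ; missing terminator here
}

; Fixed: even a void function must end its block with ret void.
define void @fixed(i32 %x) {
entry:
  %y = add i32 %x, 1
  ret void
}
```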

Here are some more things you can do to debug:

  • Run a single impalad instance if you're not already. A single instance is easier to manage and reason about. (start-impala-cluster.py -s 1)
  • Check that the query executes successfully with codegen disabled. If not, it's probably a bug in cross-compiled code. 
  • Inspect the generated IR
    • Every LLVM Value* object (including Function*) has a dump() method that will print human-readable IR to stderr.
      • This is useful for printing instructions, e.g. when LLVM complains about calling a function with the wrong args, dump the offending instruction.
    • If you'd like to dump an object to a string instead of stderr, use LlvmCodegen::Print().
    • The -dump_ir impalad process flag will make the impalad call dump() on every generated function (see LlvmCodegen::FinalizeFunction()). When using start-impala-cluster.py, use the flags "--impalad_args -dump_ir". Output goes to the INFO log file in $IMPALA_HOME/logs/cluster/impalad.<something>.INFO.<timestamp>. Look for "Dump of Function" in the log.
    • The -opt_module_dir process flag takes an already-created directory and writes every optimized module to it (the similar -unopt_module_dir flag does the same for unoptimized modules). The unoptimized IR is usually more useful: the optimized module omits all unused functions, including ones that were inlined, usually leaving you with one or two giant unreadable functions. For the unoptimized IR, -dump_ir is usually sufficient.
    • The -asm_module_dir flag is similar to -opt_module_dir, except that it outputs assembly for the generated functions.
  • By default we don't build a debug version of LLVM. If you want, you can manually build a debug version to link against so you can introspect LLVM functions in gdb. I generally find this not very helpful, though; it's faster and more effective to add liberal dump() calls.

Common pitfalls:

  • Calling a JIT'd function with the wrong signature. Check that your function typedefs match the IR signature.
  • Every basic block must end with a terminator: br or ret (usually)
    • Even if it's returning from a void function or branching to the next block
  • LLVM does NO implicit type casting; all types must match exactly. Use bitcasts for pointers and truncation/extension for ints.
  • Make sure your codegen'd code is actually running! There are error paths that disable codegen without crashing or failing the query. Try adding a DCHECK(false) to the non-codegen'd path.
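To illustrate the exact-typing pitfall, here are hedged examples (the struct name is hypothetical) of the explicit casts LLVM requires:

```llvm
; LLVM never casts implicitly; mismatched types trip asserts or
; VerifyFunction(). Cast explicitly instead:
%tuple  = bitcast i8* %mem to %"class.impala::Tuple"*  ; pointer -> pointer
%narrow = trunc i64 %wide_val to i32                   ; wider int -> narrower
%wide   = sext i32 %narrow to i64                      ; sign-extend
%uwide  = zext i32 %narrow to i64                      ; zero-extend
```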

Misc. tips

  • Never use IRBuilder::CreateAlloca() directly. Instead, use LlvmCodegen::CreateEntryBlockAlloca(). Creating allocas in the middle of functions yields strange results (e.g. bad optimizations, blowing the stack).
  • When you create a new cross-compiled file (i.e. a *-ir.cc file), you need to add it to codegen/impala-ir.cc to include it in the IR module.
  • After generating an IR function, you should call OptimizeFunctionWithExprs() or FinalizeFunction() on it, as well as AddFunctionToJit() (TODO: expand on this)
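A sketch of why entry-block allocas matter: LLVM's mem2reg optimization only promotes allocas that appear in the entry block to registers, and an alloca inside a loop allocates fresh stack space on every iteration. A hand-written (non-Impala) illustration:

```llvm
define void @loop_fn(i32 %n) {
entry:
  ; good: one stack slot, created once, promotable to a register
  %tmp = alloca i32
  br label %loop
loop:
  ; bad (shown commented out): an alloca here would grow the stack
  ; on every iteration and defeat mem2reg
  ; %bad = alloca i32
  br label %loop
}
```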

Code Overview

The planner divides a query into fragments, and the BE runs multiple fragment instances. Each instance is represented by FragmentInstanceState in fragment-instance-state.cc. The Open() method checks whether code generation should be done and, if so, triggers the codegen work by calling the CodeGen() method on the set of exec nodes (operators), which will call CodeGen() on expressions (for those nodes that have expressions).

Consider an expression in the SELECT clause. Impala is row-based, with each row represented by a TupleRow (tuple.h). TupleRow::MaterializeExprs() iterates over the expressions used to materialize a row, each of which is either a slot reference or a scalar expression (ScalarExpr). For each, GetValue() is called, which is either a code-generated function or an interpreted function. MaterializeExprs() itself can also be code generated.

Useful APIs and references

https://github.com/apache/impala/blob/master/be/src/codegen/llvm-codegen.h: the main entry point for generating IR

https://github.com/apache/impala/blob/master/be/src/codegen/codegen-anyval.h: internal API for handling Expr output, grep for CreateCallUnwrapped for example usage

http://llvm.org/docs/LangRef.html: IR reference

http://llvm.org/docs/ProgrammersManual.html: Guide for writing LLVM C++ code. Not super applicable but has some useful patterns (e.g. iterating through a BasicBlock)

LLVM hosts generated docs for each class, which are indispensable as the C++ API is very large. You can usually just google "llvm foo" where foo is the class you're interested in. Some important ones:

http://llvm.org/docs/doxygen/html/classllvm_1_1IRBuilder.html

http://llvm.org/docs/doxygen/html/classllvm_1_1IRBuilderBase.html: most of what you want is in IRBuilder, but IRBuilderBase has a few unexpected gems

http://llvm.org/docs/doxygen/html/classllvm_1_1Value.html

http://llvm.org/docs/doxygen/html/classllvm_1_1Function.html

http://llvm.org/docs/doxygen/html/classllvm_1_1Type.html

http://adriansampson.net/blog/llvm.html: LLVM for Grad Students; gives a good and simple introduction to LLVM
