This is a FAQ covering common questions that come up when debugging the operation of a running Flume cluster.

Configuration and Settings

How can I tell if I have a library loaded when flume runs?

From the command line, you can run flume classpath to see the jars and the order Flume is attempting to load them in.

How can I tell if a plugin has been loaded by a flume node?

You can look at the node's plugin status web page – http://<node>:35862/extension.jsp

Alternately, you can look at the logs.

Why does the master need to have plugins installed?

The master needs to have plugins installed in order to validate configs it is sending to nodes.

How can I tell if a plugin has been loaded by a flume master?

You can look at the master's plugin status web page – http://<master>:35871/masterext.jsp

Alternately, you can look at the logs.

How can I tell if my flume-site.xml configuration values are being read properly?

You can go to the node or master's static config web page to see what configuration values are loaded.

http://<node>:35862/staticconfig.jsp
http://<master>:35871/masterstaticconfig.jsp

I'm having a hard time getting the LZO codec to work.

By default, Flume reads $HADOOP_CONF_DIR/core-site.xml, which may have the io.compression.codecs setting set. You can mark the setting <final> so that Flume does not attempt to override it.
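
As a sketch, the relevant core-site.xml entry marked final might look like the following; the codec list shown here is only illustrative and should match the codecs actually installed on your cluster:

<!-- In $HADOOP_CONF_DIR/core-site.xml; the codec list is illustrative -->
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
  <!-- final prevents clients such as Flume from overriding this value -->
  <final>true</final>
</property>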

Operations

I lose my configurations when I restart the master. What's happening?

The master stores its configuration state under the directory given by the property below. The default points into /tmp, which is typically cleared on reboot, so you may want to override it to a location that persists across reboots, such as /var/lib/flume.

<property>
  <name>flume.master.zk.logdir</name>
  <value>/tmp/flume-${user.name}-zk</value>
  <description>The base directory in which the ZBCS stores data.</description>
</property>
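
For example, a minimal flume-site.xml override might look like the following sketch; the exact path is an assumption and just needs to be a persistent directory writable by the user running the master:

<property>
  <name>flume.master.zk.logdir</name>
  <!-- illustrative path; any persistent directory writable by the master's user works -->
  <value>/var/lib/flume</value>
</property>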

How can I get metrics from a node?

Flume nodes report metrics which you can use to debug and to see progress. You can look at a node's status web page by pointing your browser to port 35862. (http://<node>:35862).

How can I tell if data is arriving at the collector?

When events arrive at a collector, the source counters should be incremented on the node's metric page. For example, if you have a node called foo, the following fields should show growing values when you refresh the page.

  • LogicalNodeManager.foo.source.CollectorSource.number of bytes
  • LogicalNodeManager.foo.source.CollectorSource.number of events

How can I tell if data is being written to HDFS?

Data doesn't "arrive" in HDFS until the file is closed or certain size thresholds are met. As events are written to HDFS, the sink counters on the collector's metric page should be incrementing. In particular, look for fields that match the following names:

  • *.Collector.GunzipDecorator.UnbatchingDecorator.AckChecksumChecker.InsistentAppend.append*

*.appendSuccesses counts successful writes. If other values such as appendRetries or appendGiveups are incrementing, they indicate problems with the attempts to write.

I am getting a lot of duplicated event data. Why is this happening and what can I do to make this go away?

tail/multiTail have been reported to restart reading files from the beginning when the files are modified faster than a certain rate. This is a fundamental problem with a non-native implementation of tail. A workaround is to use the OS's tail mechanism in an exec source (exec("tail -n +0 -F filename")). Alternately, many people have modified their applications to push to a Flume agent that has an open RPC port, such as syslogTcp, thriftSource, or avroSource.

In E2E mode, agents will attempt to retransmit data if no acks are received after flume.agent.logdir.retransmit milliseconds have expired (this is a flume-site.xml property). Acks do not return until after the collector's roll time, flume.collector.roll.millis, expires (this can be set in the flume-site.xml file or as an argument to a collector). Make sure that the retry time on the agents is at least 2x the roll time on the collector.
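
A hedged flume-site.xml sketch of that relationship; the millisecond values below are illustrative rather than recommended defaults:

<!-- illustrative values: the agent retransmit timeout is kept well above 2x the collector roll time -->
<property>
  <name>flume.collector.roll.millis</name>
  <value>60000</value>
</property>
<property>
  <name>flume.agent.logdir.retransmit</name>
  <value>300000</value>
</property>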

If an agent in E2E mode goes down, on restart it will attempt to recover and resend any data for which it did not receive acknowledgements. This may result in some duplicates.

I have encountered a "Could not increment version counter" error message.

This is a ZooKeeper issue that seems related to virtual machines or to machines that change IP address while running. It should only occur in a development environment; the workaround is to restart the master.

I have encountered an IllegalArgumentException related to checkArgument and EventImpl.

Here's an example stack trace:

2011-07-11 01:12:34,773 ERROR
com.cloudera.flume.core.connector.DirectDriver: Driving src/sink
failed! LazyOpenSource | LazyOpenDecorator because null
java.lang.IllegalArgumentException
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:75)
at com.cloudera.flume.core.EventImpl.<init>(EventImpl.java:97)
at com.cloudera.flume.core.EventImpl.<init>(EventImpl.java:87)
at com.cloudera.flume.core.EventImpl.<init>(EventImpl.java:71)
at com.cloudera.flume.handlers.syslog.SyslogWireExtractor.buildEvent(SyslogWireExtractor.java:120)
at com.cloudera.flume.handlers.syslog.SyslogWireExtractor.extract(SyslogWireExtractor.java:192)
at com.cloudera.flume.handlers.syslog.SyslogWireExtractor.extractEvent(SyslogWireExtractor.java:89)
at com.cloudera.flume.handlers.syslog.SyslogUdpSource.next(SyslogUdpSource.java:88)
at com.cloudera.flume.handlers.debug.LazyOpenSource.next(LazyOpenSource.java:57)
at com.cloudera.flume.core.connector.DirectDriver$PumperThread.run(DirectDriver.java:89)

This indicates an attempt to create an event body that is larger than the maximum allowed body size (default 32k). You can increase the maximum event size by setting flume.event.max.size.bytes in your flume-site.xml file to a larger value. We are addressing this with issue FLUME-712.
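
For example, a hedged flume-site.xml override raising the limit; the 64k value is arbitrary and should be sized to your largest expected event body:

<property>
  <name>flume.event.max.size.bytes</name>
  <!-- illustrative value (64k); increase beyond your largest expected event body -->
  <value>65536</value>
</property>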

I'm getting OutOfMemoryExceptions in my collectors or agents.

Add -XX:+HeapDumpOnOutOfMemoryError to the JVM_MEM_OPTS env variable or flume-env.sh file. This should dump the heap when these errors occur and allow you to determine which objects are consuming excessive memory by using the jhat java heap viewer program.

There have been instances of queues that are unbounded. Several of these have been fixed in v0.9.5.

There are situations where queue sizes are too large for certain messages. For example, if batching is used, each event can take up more memory. The default queue size in thrift sources is 1000 items. With batching, individual events can become megabytes in size, which may cause memory exhaustion. For example, making batches of 1000 1000-byte messages with a queue of 1000 events works out to roughly 1000 x 1000 x 1000 bytes, so Flume could require about 1GB of memory!

In these cases, reduce the size of the thrift queue to bound the potential memory usage by setting flume.thrift.queuesize:

<property>
  <name>flume.thrift.queuesize</name>
  <value>500</value>
</property>