When profiling operators, the call count reported for each operator in AggregateStats is 2x the actual value. In other words, every operator call is counted twice. This issue is isolated to operators: AggregateStats entries in "Device Storage", "MXNET_C_API", and other domains are not affected.
In profiler.h, we have a number of classes such as ProfileTask, ProfileEvent, ProfileOperator, etc. These "profile classes" have start() and stop() functions that are called before and after the event we want to profile. Within those classes, we also have nested classes such as OprExecStat, EventStat, TaskStat, etc. These "stat classes" hold the information that we want to dump for each profiled event.
The idea is that each "profile class" sends an instance of its "stat class" to AggregateStats. AggregateStats then processes the stats one by one and maps the name of each stat to its AggregateStats entry.
This explains the duplication: when we profile operators, we use ProfileOperator; however, ProfileOperator also has a member variable "as_task_" of class ProfileTask. The intention is to generate two events/stats that fall into different domains for a single operator call. However, because those two stats have the same operator name, both are counted under the same AggregateStats entry, which doubles the call count.
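The duplication can be reproduced with a small toy model. This is only a sketch and does not use the real MXNet classes: `Stat`, `ToyAggregateStats`, `ProfileOneOperatorCall`, and `ProfileOneApiCall` are hypothetical names, the domain strings are placeholders, and keying the aggregate entries by name alone is an assumption made to mirror the behavior described above.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical stand-in for a "stat class": only a domain and a name.
struct Stat {
  std::string domain;  // placeholder values, not the real MXNet domain names
  std::string name;    // operator or API name
};

// Hypothetical stand-in for AggregateStats.
// Assumption: entries are keyed by stat name only.
class ToyAggregateStats {
 public:
  void OnProfileStat(const Stat& stat) { ++counts_[stat.name]; }

  std::size_t CallCount(const std::string& name) const {
    auto it = counts_.find(name);
    return it == counts_.end() ? 0 : it->second;
  }

 private:
  std::map<std::string, std::size_t> counts_;
};

// Models one profiled operator call: the operator emits its own stat, and
// its as_task_ member emits a second stat with the same name.
inline void ProfileOneOperatorCall(ToyAggregateStats* agg,
                                   const std::string& op_name) {
  assert(agg != nullptr);
  agg->OnProfileStat(Stat{"operator", op_name});
  agg->OnProfileStat(Stat{"task", op_name});
}

// Models one C API call: only a task stat is emitted.
inline void ProfileOneApiCall(ToyAggregateStats* agg,
                              const std::string& api_name) {
  assert(agg != nullptr);
  agg->OnProfileStat(Stat{"task", api_name});
}
```

In this model, a single call to `ProfileOneOperatorCall` leaves `CallCount` at 2 for that operator name, while a single call to `ProfileOneApiCall` leaves it at 1.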
"MXNET_C_API" calls will not cause duplications, because for them, we use ProfileTask only. In other words, we are only generating one event/stat for each call.
All the "stat classes" inherit from ProfileStat in profiler.h. There, we add a new bool member variable "enable_aggregate_". This variable defaults to true and controls whether the stat is used or skipped in AggregateStats (an if statement is added in OnProfileStat() in aggregate_stats.cc). We also add a matching "enable_aggregate_" member to ProfileTask. The idea is that we can set this bool on the ProfileTask and propagate its value to ProfileStat's "enable_aggregate_" through the lambda function in ProfileTask::SendStat(). Finally, in ProfileOperator, we set the "enable_aggregate_" of "as_task_"/ProfileTask to false. This way, we still produce two events/stats for each operator call, but only the one generated by ProfileOperator is registered in AggregateStats. The stat generated by "as_task_"/ProfileTask is skipped, so the duplication is gone.
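A minimal sketch of this flag-based approach is shown below. It is a simplified model, not the actual profiler.h or aggregate_stats.cc code: every `Toy*` class and the `EmitStat` helper are hypothetical, the public `enable_aggregate_` member on the task is a simplification, and only the names `enable_aggregate_`, `as_task_`, `OnProfileStat`, and `SendStat` are taken from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for ProfileStat with the proposed flag.
struct ToyProfileStat {
  std::string name;
  bool enable_aggregate_ = true;  // defaults to true, as proposed
};

// Hypothetical stand-in for AggregateStats.
class ToyAggregateStats {
 public:
  void OnProfileStat(const ToyProfileStat& stat) {
    ++total_received_;                    // the stat is still produced
    if (!stat.enable_aggregate_) return;  // the added if statement
    ++counts_[stat.name];
  }
  std::size_t CallCount(const std::string& name) const {
    auto it = counts_.find(name);
    return it == counts_.end() ? 0 : it->second;
  }
  std::size_t TotalReceived() const { return total_received_; }

 private:
  std::map<std::string, std::size_t> counts_;
  std::size_t total_received_ = 0;
};

// Hypothetical helper: builds a stat, lets a lambda fill it in, then sends it.
template <typename SetExtraInfo>
void EmitStat(ToyAggregateStats* agg, SetExtraInfo set_extra_info) {
  assert(agg != nullptr);
  ToyProfileStat stat;
  set_extra_info(&stat);
  agg->OnProfileStat(stat);
}

// Hypothetical stand-in for ProfileTask.
class ToyProfileTask {
 public:
  explicit ToyProfileTask(std::string name) : name_(std::move(name)) {}
  void SendStat(ToyAggregateStats* agg) const {
    // The lambda propagates the task's flag to the stat.
    EmitStat(agg, [this](ToyProfileStat* stat) {
      stat->name = name_;
      stat->enable_aggregate_ = enable_aggregate_;
    });
  }
  bool enable_aggregate_ = true;  // public here only to keep the sketch short

 private:
  std::string name_;
};

// Hypothetical stand-in for ProfileOperator.
class ToyProfileOperator {
 public:
  explicit ToyProfileOperator(const std::string& name)
      : name_(name), as_task_(name) {
    as_task_.enable_aggregate_ = false;  // skip the task stat when aggregating
  }
  void SendStat(ToyAggregateStats* agg) const {
    EmitStat(agg, [this](ToyProfileStat* stat) { stat->name = name_; });
    as_task_.SendStat(agg);
  }

 private:
  std::string name_;
  ToyProfileTask as_task_;
};
```

In this model, one `ToyProfileOperator::SendStat` call still delivers two stats to the aggregator, but `CallCount` for the operator name is 1. A standalone `ToyProfileTask` keeps the default of true and is counted as before.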