No more graphs, SQL DB is populated



    For a few days now the graphs have not been updating. We did not touch our Centreon server for weeks (possibly months), and suddenly the graphs stopped working.
    After some troubleshooting I realized that the "central rrd master output" is creating queue files in /var/lib/centreon-broker and is late in processing data.
    In the logs the only errors we can see are:
    [1525437814] error: RRD: ignored update error in file '/data/centreon/metrics/18111.rrd': /data/centreon/metrics/18111.rrd: illegal attempt to update using time 1525355270 when last update time is 1525372156 (minimum one second step)
    As you can see, the first timestamp is the current time (log entry), the second is the time of the metric being processed, and the third is the last entry time in the RRD file.
    Watching the logs, the second timestamp keeps going up, meaning Centreon Broker is still processing, but it is nearly one day behind the current time.
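    For what it's worth, the error itself is RRD refusing out-of-order data: an .rrd file only accepts updates with strictly increasing timestamps, so any queued metric older than the file's last update is dropped. A minimal sketch of that rule (plain Python; `apply_update` is an illustrative name, not Centreon or rrdtool code):

```python
# Each .rrd file remembers the timestamp of its last accepted update.
# Any update at or before that time is rejected -- exactly the
# "illegal attempt to update using time X when last update time is Y
# (minimum one second step)" error in the broker log.

def apply_update(last_update, new_time):
    """Return (accepted, new_last_update) following RRD's rule."""
    if new_time <= last_update:       # needs at least a one second step
        return False, last_update    # the late update is ignored
    return True, new_time

# The situation from the log: the queued metric (1525355270) is older
# than what the file already holds (1525372156), so it is dropped.
accepted, last = apply_update(1525372156, 1525355270)
print(accepted)   # False: the late update is ignored
```

    This also means that once the broker falls a day behind, some queued updates can arrive older than data already written and are silently lost.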

    The number of hosts and services monitored did not change, so I don't understand why this is happening all of a sudden.
    The fact that the data being processed is hours behind schedule is also a mystery. How can this happen?

    When I look at the poller statistics I clearly see a huge queue, and the Central RRD master output is in state: Blocked.
    The server itself is fine, with plenty of free memory and CPU available, so I don't understand why the RRD output is no longer processing in real time like it was before.

    I tried stopping Centreon and deleting the "central-rrd-master.queue.central-rrd-master-output" files, but the graphs do not come back when I restart the service, and after an hour the "central-rrd-master.queue.central-rrd-master-output" files reappear in /var/lib/centreon-broker.

    Nothing at all in the logs (centreon-engine or centreon-broker). Everything is working fine except the graphs. When I check the SQL DB I can see the metrics history just fine.

    We have about 800 hosts with 8600 services. In the /metrics directory I can see 25000 .rrd files; is it normal to have three times more .rrd files than monitored services? Is there a limit beyond which Centreon can't handle .rrd files anymore?

    Does anybody have an idea on what could be wrong?


  • #2
    We were able to see what's causing this, but there is no solution for the moment:

    In our architecture we have custom agents installed on the servers that send metrics every 5 minutes to a self-developed application acting as a proxy. This application behaves like a broker: it stores the metrics in its own DB but also forwards them to NSCA. From that point on it's the classic Centreon architecture, where NSCA listens for the passive checks and forwards them to the Centreon broker.

    What changed and caused the issue is that we added a second "proxy" (which is not sending anything for the moment), so now NSCA gets its data from two sources instead of one. That's the only change.
    As soon as we disable the new "proxy", the Centreon broker RRD output starts processing the queued data at a high rate (around 500 events/second). When we re-enable it, it starts to throttle again and can't get above 20 events/s. I don't understand why it acts like this, because the architecture didn't change: Centreon still gets its data directly from NSCA and there are no performance issues (RAM, CPU and I/O are low).
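    Rough arithmetic (my assumptions: the ~8600 services from the first post, 5-minute passive checks, one broker event per check) shows why 20 events/s means "Blocked" while 500 events/s drains the backlog:

```python
# Back-of-the-envelope queue arithmetic; the service count and check
# interval are taken from the thread, the rates from the observation above.
services = 8600
interval_s = 300                        # passive checks every 5 minutes
incoming = services / interval_s        # ~28.7 events/s arriving

slow_rate = 20                          # throttled RRD output
fast_rate = 500                         # healthy RRD output

print(f"incoming: {incoming:.1f} events/s")
# At 20 events/s the output can't keep up, so the queue only grows:
print(f"throttled backlog growth: {incoming - slow_rate:.1f} events/s")
# A one-day backlog drains at (fast_rate - incoming) events/s:
backlog = incoming * 86400
hours = backlog / (fast_rate - incoming) / 3600
print(f"one day of backlog drains in ~{hours:.1f} h at {fast_rate} events/s")
```

    In other words, anything that pushes the RRD output below roughly 29 events/s guarantees an ever-growing queue, which matches the "nearly one day behind" symptom.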

    Any idea on this strange behaviour? I couldn't find details on how the RRD broker processes the information and why it would get throttled (some safeguard that kicks in when a certain condition is met).