break in ods graphs on server restart


  • break in ods graphs on server restart

    I'm using the new ODS function, which logs to RRD only. The problem is that whenever I restart Nagios I get a break in my graphs. Does anyone else have a similar problem?

  • #2
    The problem is that an RRD database is configured to hold a fixed number of data points at a fixed interval (based on your service configuration: normal check interval...).
    So if Nagios doesn't get a value back for your service (restart, check latency...), that slot in the RRD database is not filled. If data then arrives for the next slot while the previous one is still empty, you get a break.

    So tune your Nagios configuration to record the state of each service, scheduling info, service and host check spread...
    Romain Le Merlus
    Centreon Forge
    MERETHIS



    • #3
      Never stop Nagios and everything will be OK :-D
      Intel(R) Xeon(TM) CPU 3.4GHz - MemTotal : 1034476 kB
      Centreon 2.4.1 - Nagios 3.2.1 - Nagios Plugins 1.4.15 - Manubulon Plugins tuné
      Fedora Core 5 - 2.6.20-1.2320



      • #4
        ODS bug?

        There are a lot of breaks in my ODS graphs too, and I haven't restarted Nagios in the last few days.
        I think there is a bug in ODS.
        Last night one of my internet routers was down (CRITICAL) from 0:15 to 8:15, and to my surprise all the ODS graph data is missing for exactly 0:15 to 8:15 for ALL my services.
        And these services have nothing to do with this internet router.
        Going to keep an eye on this.

        Regards
        Menno van Bennekom



        • #5
          What is your check frequency?
          300 seconds (5 minutes) seems to be the best choice for getting good graphs.
          Hope this helps. :wink:
          Intel(R) Xeon(TM) CPU 3.4GHz - MemTotal : 1034476 kB
          Centreon 2.4.1 - Nagios 3.2.1 - Nagios Plugins 1.4.15 - Manubulon Plugins tuné
          Fedora Core 5 - 2.6.20-1.2320



          • #6
            Originally posted by DonKiShoot:
            What is your check frequency?
            300 seconds (5 minutes) seems to be the best choice for getting good graphs.
            Hope this helps. :wink:

            I'm not sure this was directed at me, but the check interval is the standard 5 minutes. Certainly not 8 hours ;-)

            Regards
            Menno



            • #7
              I have the same issues, Menno. My check interval is the default five minutes. I have Cacti running at the same location, monitoring the same devices, without any breaks in the data.



              • #8
                Cacti runs one global check every 15 minutes, so you can't get breaks.

                Nagios schedules its checks itself, and filling the RRD depends on that schedule; it's more complicated...
                Romain Le Merlus
                Centreon Forge
                MERETHIS



                • #9
                  still breaks

                  These breaks in the ODS graphs are still a problem for me. I can't find a correlation with other events, only that they seem to happen more often when some service is critical or down. The breaks don't happen at the same time in all graphs; each graph has different breaks at different times.
                  Is nobody else having this problem?
                  Random example attached.

                  Regards
                  Menno



                  • #10
                    ODS RRD files incorrect heartbeat

                    I finally found the cause of the holes in my ODS graphs.
                    The measurements were stored correctly in ODS MySQL, in the perfdata file, and in the check_graph_traffic RRD file, but sometimes not in the ODS RRD file.
                    With a dump of the ODS RRD file (rrdtool dump) I saw some measurements appear with the value 'NaN' (not a number).
                    Then with rrdtool info you can see that the RRD is created with a step of 300 (5 minutes), but the metric is created with a heartbeat of 300 too.
                    That is normally too strict: if a measurement comes in after 301 seconds it already gets a 'NaN' value, so the update HAS to arrive within 300 seconds.
                    So I changed all the RRD files to a heartbeat of 600:
                    Code:
                    # Raise the heartbeat of the 'metric' data source to 600 s in every ODS RRD file
                    cd /usr/local/oreon/OreonDataStorage
                    for f in *.rrd; do rrdtool tune "$f" --heartbeat metric:600; done
                    **Note** This is only this simple in version 1.3, because in 1.4 the data source name is no longer 'metric'
                    but can be different for each file!
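For 1.4-style files, where the data source name varies, one possible approach is to read the names from `rrdtool info` before tuning. This is only a sketch. The function names `ds_names` and `tune_all_heartbeats` are made up for illustration, the 600-second default simply mirrors the value used above, and it assumes `rrdtool info` prints one `ds[NAME].type = ...` line per data source. Trying it on a copy of the files first would be prudent.

```shell
# Hypothetical helper: print the data source names found in
# `rrdtool info` output read from standard input.
# Assumes one line of the form  ds[NAME].type = "GAUGE"  per data source.
ds_names() {
    sed -n 's/^ds\[\([^]]*\)]\.type = .*/\1/p'
}

# Hypothetical helper: set the heartbeat of every data source in every
# .rrd file of a directory.
# $1 = directory, $2 = heartbeat in seconds (defaults to 600).
tune_all_heartbeats() {
    dir=$1
    hb=${2:-600}
    for f in "$dir"/*.rrd; do
        [ -e "$f" ] || continue          # skip if the glob matched nothing
        for ds in $(rrdtool info "$f" | ds_names); do
            rrdtool tune "$f" --heartbeat "$ds:$hb"
        done
    done
}
```

It could then be called as `tune_all_heartbeats /usr/local/oreon/OreonDataStorage 600`.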

                    Since that moment the holes/breaks in the graphs are gone.
                    I think this should be fixed in ODS/lib/updateFunctions.pm.
                    In that module the RRDs are created with the same parameter ($interval) for both step and heartbeat.
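A simplified model of the heartbeat rule described above may make the effect clearer. It is an illustration only, not rrdtool's actual code: `sample_state` is a made-up name, and the model assumes that an update arriving more than `heartbeat` seconds after the previous one leaves the gap as NaN.

```shell
# Simplified model (an assumption, not rrdtool's real implementation):
# an update is usable only if it arrives no later than `heartbeat`
# seconds after the previous update; otherwise the gap is stored as NaN.
# $1 = seconds since the previous update, $2 = heartbeat in seconds
sample_state() {
    if [ "$1" -le "$2" ]; then
        echo known
    else
        echo NaN
    fi
}

sample_state 301 300   # heartbeat equal to the 300 s step
sample_state 301 600   # heartbeat of twice the step
```

In this model a check that arrives one second late is lost with a heartbeat of 300 but kept with a heartbeat of 600, which matches the behaviour seen in the graphs.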

                    Update history:
                    The value of $interval has been changed by successive patches to updateFunctions.pm:
                    original release: $interval = $interval * $data->{'interval_length'} ;
                    first patch: $interval = $interval * $data->{'interval_length'} * 2;
                    fourth patch: $interval = $interval * $data->{'interval_length'} + 10;
                    But I think the $interval value of the original release is the right one. Only the line where the RRD is created should be changed, because step and heartbeat should not be the same:
                    Code:
                    was:
                    RRDs::create ($_[0]."/".$_[1].".rrd", "-b ".$begin, "-s ".$interval,
                    "DS:metric:GAUGE:".$interval.":U:U",
                    "RRA:AVERAGE:0.5:1:".$_[5], "RRA:MIN:0.5:12:".$_[5], "RRA:MAX:0.5:12:".$_[5]);

                    my suggestion:
                    RRDs::create ($_[0]."/".$_[1].".rrd", "-b ".$begin, "-s ".$interval,
                    "DS:metric:GAUGE:".($interval * 2).":U:U",
                    "RRA:AVERAGE:0.5:1:".$_[5], "RRA:MIN:0.5:12:".$_[5], "RRA:MAX:0.5:12:".$_[5]);
                    Regards
                    Menno van Bennekom
                    Last edited by Menno; 18 September 2007, 12:22. Reason: things changed in version 1.4



                    • #11
                      Julio, if he's right, that's a serious blunder. The heartbeat is always twice the step as a matter of principle, isn't it?
                      At least that's how we have always done it for check_graph, as far as I remember.

                      That would explain why unoptimized Nagios installations have such ugly curves: their latency is too high and the heartbeat is too short.
                      Intel(R) Xeon(TM) CPU 3.4GHz - MemTotal : 1034476 kB
                      Centreon 2.4.1 - Nagios 3.2.1 - Nagios Plugins 1.4.15 - Manubulon Plugins tuné
                      Fedora Core 5 - 2.6.20-1.2320



                      • #12
                        I confirm!

                        After modifying the updateFunctions.pm file I no longer have any holes in my graphs.

                        What I put in place:

                        I did not touch the $interval variable (lines 49 and 99)
                        Code:
                        $interval = $interval * $data->{'interval_length'} + 10;
                        Added a variable $interval_hb (lines 50 and 100)
                        Code:
                        $interval_hb = $interval * 2;
                        Changed the code that creates the RRD (lines 55 and 106)
                        Code:
                        RRDs::create ($_[0].$_[1].".rrd", "-b ".$begin, "-s ".$interval, "DS:".substr($_[6], 0, 19).":GAUGE:".$interval_hb.":U:U", "RRA:AVERAGE:0.5:1:".$nb_value, "RRA:MIN:0.5:12:".$nb_value, "RRA:MAX:0.5:12:".$nb_value);
                        Changed the code that writes the log (line 57)
                        Code:
                        writeLogFile("Creating $_[0]$_[1].rrd -b $begin, -s $interval, DS:".substr($_[6], 0, 19).":GAUGE:$interval_hb:U:U RRA:AVERAGE:0.5:1:$nb_value RRA:MIN:0.5:12:$nb_value RRA:MAX:0.5:12:$nb_value\n");
                        For the graphs that already existed, I had to "tune" the RRDs by hand
                        Code:
                        rrdtool tune [rrd_name].rrd --heartbeat [metric_name]:[heartbeat_value]
                        Please confirm whether my method is correct and does not cause any problems.

                        One problem remains: when the RRD is regenerated (by Centreon), the heartbeat is reset to the same value as the step. If anyone knows which file to modify, I would be glad to hear it.



                        • #13
                          I'll confirm all of this for you tomorrow.
                          Romain Le Merlus
                          Centreon Forge
                          MERETHIS



                          • #14
                            Strange, all the same, that Tobi didn't build in the ability to change the heartbeat at creation time...
                            Julien Mathis
                            Centreon Project Leader
                            www.merethis.com



                            • #15
                              So, no news?

                              Because the bug is rather annoying, especially when you have a lot of graphs and they start failing overnight without any action having been taken, and you have to regenerate a good fifty graphs (I do like the rrdtool command, but still...)!

                              After regenerating the RRDs for the graphs in question, I am left with "lovely" week-long holes...

                              I really have serious doubts about the reliability of the Oreon views...
                              Last edited by rzd; 2 October 2007, 13:18.

