break in ods graphs on server restart
breakintheweb
17th April 2007, 05:48
I'm using the new ODS function, which logs to RRD only. The problem is that whenever I restart Nagios I get a break in my graphs. Does anyone else have a similar problem?
rom
17th April 2007, 09:43
The problem is that an RRD database is configured to hold a defined number of data points at a defined interval (based on your service configuration, i.e. the normal check interval).
So if Nagios doesn't return a value for your service in time (restart, check latency...), that slot in the RRD database is not filled. When data arrives for the next slot and the previous one is empty, you get a break.
So tune your Nagios configuration so that it records the state of each service, scheduling info, service & host check spread...
DonKiShoot
17th April 2007, 10:17
Never stop Nagios and everything will be OK :-D
Menno
18th April 2007, 11:00
There are a lot of breaks in my ODS graphs too, and I haven't restarted Nagios in the last few days.
I think there is a bug in ODS.
Last night one of my internet routers was down (CRITICAL) from 0:15 to 8:15, and to my surprise all ODS graph data is missing from exactly 0:15 to 8:15 for ALL my services.
And these services have nothing to do with this internet router.
Going to keep an eye on this.
Regards
Menno van Bennekom
DonKiShoot
18th April 2007, 11:07
What is your check frequency?
300 sec (5 min) seems to be the best choice for getting good graphs.
Hope that helps. :wink:
Menno
18th April 2007, 11:30
Quote (DonKiShoot):
What is your check frequency?
300 sec (5 min) seems to be the best choice for getting good graphs.
Hope that helps. :wink:
I'm not sure this is directed at me, but the check interval is the standard 5 minutes. Certainly not 8 hours ;-)
Regards
Menno
breakintheweb
18th April 2007, 17:56
I have the same issues, Menno. My check interval is the default five minutes. I have Cacti running at the same location, monitoring the same devices, without any breaks in the data.
rom
18th April 2007, 18:12
Cacti runs one global check every 15 minutes... so you can't get a break.
Nagios schedules its checks individually, and filling the RRD depends on that schedule, so it's more complicated...
Menno
26th April 2007, 12:26
These breaks in the ODS graphs are still a problem for me. I can't find a relationship with other occurrences, only that it seems to happen more often when some service is critical or down. The breaks don't happen at the same time in all graphs; each has different breaks at different times.
Is nobody else having this problem?
Random example attached.
Regards
Menno
I finally found the cause for the holes in my ODS graphs :D
The measurements were stored correctly in ODS-mysql, in the perfdata-file, and in the check_graph_traffic RRD file, but sometimes not in the ODS-RRD file.
With a dump of the ODS RRD file (rrdtool dump) I saw some measurements appear with value 'NaN', not-a-number.
Then with rrdtool info you can see that the RRD is created with step 300 (5 minutes) but the metric is created with a heartbeat of 300 too.
That is normally too strict: if a measurement comes in after 301 seconds, the interval already gets a 'NaN' value. Every update HAS to arrive within 300 seconds of the previous one.
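The effect of the heartbeat can be sketched with a small shell function. This is only an illustration of the rule described above, not RRDtool code: the function name find_gaps and the update times are made up for the example.

```shell
# find_gaps HEARTBEAT TIMESTAMP...
# Prints every pair of consecutive update times that are further apart than
# HEARTBEAT seconds. RRDtool would store such an interval as unknown (NaN).
find_gaps() {
    heartbeat=$1
    shift
    prev=""
    for ts in "$@"; do
        if [ -n "$prev" ] && [ $((ts - prev)) -gt "$heartbeat" ]; then
            echo "gap of $((ts - prev))s between $prev and $ts"
        fi
        prev=$ts
    done
}

# Hypothetical update times: the third check arrives one second late.
find_gaps 300 0 300 601 900
find_gaps 600 0 300 601 900
```

With a heartbeat of 300 the first call reports the 301-second interval between 300 and 601. With a heartbeat of 600 the second call prints nothing, because the same late update is still accepted.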
So I changed all the RRD files to a heartbeat of 600:
cd /usr/local/oreon/OreonDataStorage
for f in *.rrd; do rrdtool tune "$f" --heartbeat metric:600; done
**note** This is only simple to do in version 1.3, because in 1.4 the data source name is no longer 'metric' but can be different for each file!!
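For files where the data source name is not 'metric', one possible approach is to read the names from rrdtool info first. The sketch below assumes that rrdtool info prints one line of the form ds[NAME].minimal_heartbeat = VALUE per data source; the function names ds_names and tune_all are made up for this example, and it is worth backing up the RRD files before tuning them.

```shell
# ds_names: read `rrdtool info` output on stdin and print each data source name.
# Assumes info lines of the form: ds[NAME].minimal_heartbeat = 300
ds_names() {
    sed -n 's/^ds\[\(.*\)\]\.minimal_heartbeat = .*/\1/p'
}

# tune_all DIR HEARTBEAT: set HEARTBEAT on every data source of every RRD in DIR.
tune_all() {
    for f in "$1"/*.rrd; do
        [ -e "$f" ] || continue   # skip the literal pattern if DIR has no .rrd files
        rrdtool info "$f" | ds_names | while read -r ds; do
            rrdtool tune "$f" --heartbeat "$ds:$2"
        done
    done
}
```

A call such as tune_all /usr/local/oreon/OreonDataStorage 600 would then do the same as the loop above, whatever the data sources are called.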
Since that moment the holes/breaks in the graphs are gone.
I think this should be fixed in ODS/lib/updateFunctions.pm.
In this program the RRDs are created with the same parameter ($interval) for both step and heartbeat.
Update history:
The value of $interval is changed by the patches on updateFunctions.pm:
original release: $interval = $interval * $data->{'interval_length'} ;
first patch: $interval = $interval * $data->{'interval_length'} * 2;
fourth patch: $interval = $interval * $data->{'interval_length'} + 10;
But I think the $interval value of the original release is the right one. Only the line where the RRD is created should be changed, because step and heartbeat should not be the same:
was:
RRDs::create ($_[0]."/".$_[1].".rrd", "-b ".$begin, "-s ".$interval,
"DS:metric:GAUGE:".$interval.":U:U",
"RRA:AVERAGE:0.5:1:".$_[5], "RRA:MIN:0.5:12:".$_[5], "RRA:MAX:0.5:12:".$_[5]);
my suggestion:
RRDs::create ($_[0]."/".$_[1].".rrd", "-b ".$begin, "-s ".$interval,
"DS:metric:GAUGE:".($interval * 2).":U:U",
"RRA:AVERAGE:0.5:1:".$_[5], "RRA:MIN:0.5:12:".$_[5], "RRA:MAX:0.5:12:".$_[5]);
Regards
Menno van Bennekom
DonKiShoot
8th May 2007, 14:56
Julio, if he's right, that's one hell of a blunder: the heartbeat is always twice the step as a matter of principle, isn't it?
At least that's how we've always done it for check_graph, as far as I remember?
That would explain why non-optimized Nagios installs get lousy curves: their latency is too high and the heartbeat is too short.
rzd
18th September 2007, 12:54
I confirm!
After modifying updateFunctions.pm I no longer have any holes in my graphs.
What I put in place:
I did not touch the $interval variable (lines 49 & 99)
$interval = $interval * $data->{'interval_length'} + 10;
Added a variable $interval_hb (lines 50 & 100)
$interval_hb = $interval * 2;
Changed the code that creates the RRD (lines 55 & 106)
RRDs::create ($_[0].$_[1].".rrd", "-b ".$begin, "-s ".$interval, "DS:".substr($_[6], 0, 19).":GAUGE:".$interval_hb.":U:U", "RRA:AVERAGE:0.5:1:".$nb_value, "RRA:MIN:0.5:12:".$nb_value, "RRA:MAX:0.5:12:".$nb_value);
Changed the code that writes the log (line 57)
writeLogFile("Creating $_[0]$_[1].rrd -b $begin, -s $interval, DS:".substr($_[6], 0, 19).":GAUGE:$interval_hb:U:U RRA:AVERAGE:0.5:1:$nb_value RRA:MIN:0.5:12:$nb_value RRA:MAX:0.5:12:$nb_value\n");
For the graphs that already existed, I had to "tune" the RRDs by hand
rrdtool tune [rrd_name].rrd --heartbeat [metric_name]:[heartbeat_value]
Please confirm whether my method is correct and doesn't cause any problems ;)
One problem remains: when the RRD is regenerated (by Centreon), the heartbeat is reset to the same value as the step. If anyone knows which file to modify... I'd be glad to hear it.
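To spot files whose heartbeat has been reset to the step, the output of rrdtool info can be scanned. This is a rough sketch that assumes the info output contains a line step = VALUE followed by lines of the form ds[NAME].minimal_heartbeat = VALUE; the function names strict_ds and report_strict are made up for this example.

```shell
# strict_ds: read `rrdtool info` output on stdin and print the name of each
# data source whose minimal_heartbeat is not larger than the step.
strict_ds() {
    awk '
        $1 == "step" { step = $3 + 0 }
        $1 ~ /^ds\[.*\]\.minimal_heartbeat$/ && ($3 + 0) <= step {
            name = $1
            sub(/^ds\[/, "", name)
            sub(/\]\.minimal_heartbeat$/, "", name)
            print name
        }
    '
}

# report_strict DIR: list "file: data source" for every strict data source in DIR.
report_strict() {
    for f in "$1"/*.rrd; do
        [ -e "$f" ] || continue   # skip the literal pattern if DIR has no .rrd files
        rrdtool info "$f" | strict_ds | while read -r ds; do
            echo "$f: $ds"
        done
    done
}
```

Running report_strict on the storage directory after a regeneration would list the files that need another rrdtool tune.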
rom
18th September 2007, 17:03
I'll confirm all of that for you tomorrow.
julio
18th September 2007, 20:29
Strange all the same that tobi didn't build in a way to change the heartbeat at creation.....
rzd
2nd October 2007, 14:06
So, no news?
Because the bug is rather annoying... especially when you have a lot of graphs and they start failing overnight without any action having been taken, and you have to regenerate a good fifty graphs (I do like the rrdtool command, but still...)!
After regenerating the RRDs for the graphs in question, I end up with "lovely" week-long holes.....
I really have serious doubts about the reliability of the Oreon views.....
icedance
2nd October 2007, 14:19
http://forum.oreon-project.org/showthread.php?t=4475&page=4
from post 38 onward
if it's important
rzd
2nd October 2007, 14:43
I'll try replacing the files. I'll keep you posted.
PS: I see that the solution I posted above was used in these modified files ;)
rzd
2nd October 2007, 16:39
OK, I've just put the 2 files in place.
So far I see no improvement :(
Still graphs with holes.
I did delete the graphs that were causing problems.
This is beyond me!
rzd
3rd October 2007, 12:24
Even more incredible: I disabled ALL the graphs (even those of services that were working) and SURPRISE: it keeps graphing!!!!!!!
I stopped the ODS daemon and emptied everything: the ODS database (except the config table), the OreonDataStorage folder, the RRD data.
Frankly, this is complete nonsense!