Icinga / icinga-powershell-mssql

Backup Status - Execution Timeouts and Improvements

agougo opened this issue

This script needs to be improved, as it is not designed to work in HA environments with critical data.

If you have an application with critical data and a backup runs every 15 minutes (e.g. for the transaction log), this creates a lot of entries in the database. The script has to go through these entries one by one before applying thresholds. In my experience the script then runs seemingly forever and causes memory pressure on your PROD MSSQL server. If you set a timeout of 60 seconds, it times out and never gives you back any useful data.
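To get a feeling for the numbers, here is a rough back-of-the-envelope sketch. The retention period and the database count are assumptions picked for illustration, not values taken from this environment:

```powershell
# Assumed values for illustration only.
$IntervalMinutes = 15   # log backup interval mentioned above
$RetentionDays   = 30   # hypothetical msdb backup history retention
$DatabaseCount   = 10   # hypothetical number of databases on the instance

# Number of log backups one database produces per day.
$BackupsPerDay = (24 * 60) / $IntervalMinutes

# Rows the check would have to walk through for log backups alone.
$TotalRows = $BackupsPerDay * $RetentionDays * $DatabaseCount

Write-Output "Log backups per database per day: $BackupsPerDay"
Write-Output "History rows to process: $TotalRows"
```

Under these assumptions a single check run has to process tens of thousands of rows, which fits the timeouts described above.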

A better approach would be to change the query in Get-IcingaMSSQLBackupOverallStatus.psm1 to something like this:

$Query = "SELECT
            msdb.dbo.backupset.database_name,
            msdb.dbo.backupset.backup_start_date,
            msdb.dbo.backupset.backup_finish_date,
            msdb.dbo.backupset.is_damaged,
            msdb.dbo.backupset.type,
            msdb.dbo.backupset.backup_size,
            msdb.dbo.backupset.backup_set_uuid,
            msdb.dbo.backupmediafamily.physical_device_name,
            msdb.dbo.backupmediafamily.device_type,
            sys.databases.state,
            sys.databases.recovery_model,
            -- Note: DATEDIFF(MI, ...) returns minutes, although the alias still says hours
            DATEDIFF(MI, msdb.dbo.backupset.backup_finish_date, GETDATE()) AS last_backup_hours,
            DATEDIFF(MI, msdb.dbo.backupset.backup_start_date,  msdb.dbo.backupset.backup_finish_date) AS last_backup_duration_min
        FROM msdb.dbo.backupmediafamily
            INNER JOIN msdb.dbo.backupset ON msdb.dbo.backupmediafamily.media_set_id = msdb.dbo.backupset.media_set_id
            LEFT JOIN sys.databases ON sys.databases.name = msdb.dbo.backupset.database_name
            INNER JOIN (SELECT MAX(backup_start_date) AS backup_start_date, database_name, type
                FROM msdb.dbo.backupset
                --WHERE database_name = 'CBO'
                --WHERE type IN ('D', 'I', 'L')
                GROUP BY msdb.dbo.backupset.database_name, msdb.dbo.backupset.type) a
                ON a.database_name = msdb.dbo.backupset.database_name
                AND a.type = msdb.dbo.backupset.type
                AND a.backup_start_date = msdb.dbo.backupset.backup_start_date
        WHERE sys.databases.source_database_id IS NULL
        ORDER BY
            msdb.dbo.backupset.database_name,
            msdb.dbo.backupset.backup_finish_date"
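
The effect of the INNER JOIN on the grouped subquery can be illustrated outside of SQL. The following is only a minimal sketch: the sample rows are made up and Select-LatestBackupPerType is a hypothetical helper name, not part of the plugin. It keeps the newest entry per database and backup type, which is what the join is meant to achieve:

```powershell
# Made-up sample rows standing in for msdb.dbo.backupset entries.
$Rows = @(
    [pscustomobject]@{ database_name = 'app'; type = 'L'; backup_start_date = [datetime]'2021-03-01T10:00:00' },
    [pscustomobject]@{ database_name = 'app'; type = 'L'; backup_start_date = [datetime]'2021-03-01T10:15:00' },
    [pscustomobject]@{ database_name = 'app'; type = 'D'; backup_start_date = [datetime]'2021-03-01T02:00:00' },
    [pscustomobject]@{ database_name = 'crm'; type = 'L'; backup_start_date = [datetime]'2021-03-01T09:45:00' }
)

# Hypothetical helper, for illustration only.
function Select-LatestBackupPerType {
    param ([object[]]$BackupRows)

    # Group by (database_name, type) and keep the row with the newest start date,
    # mirroring MAX(backup_start_date) ... GROUP BY database_name, type.
    $BackupRows |
        Group-Object -Property database_name, type |
        ForEach-Object {
            $_.Group | Sort-Object -Property backup_start_date -Descending | Select-Object -First 1
        }
}

$Latest = @(Select-LatestBackupPerType -BackupRows $Rows)
$Latest | Format-Table database_name, type, backup_start_date
```

Of the four sample rows only three remain, because the older 'L' entry for 'app' is dropped. The same thing happens to every historical entry in the real table, so thresholds are only evaluated against the most recent backup of each type.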

All counting should also be changed to minutes, not days! For example:

$LastBackupLogAge = ($Entry.last_backup_hours * 60 * 60)
should be changed to $LastBackupLogAge = ($Entry.last_backup_hours * 60)

'LastBackupAge' = (([long]$Entry.last_backup_hours) * 60 * 60);
should be changed to 'LastBackupAge' = (([long]$Entry.last_backup_hours) * 60);

and so on ...
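
To make the unit change concrete, here is a small sketch with a made-up sample entry. It assumes that last_backup_hours holds minutes (because the query above uses DATEDIFF(MI, ...)) and that the resulting age is expected in seconds:

```powershell
# Sample entry: the column name says hours, but with DATEDIFF(MI, ...) the value is in minutes.
$Entry = [pscustomobject]@{ last_backup_hours = 15 }

# Minutes to seconds needs a single factor of 60.
$LastBackupLogAge = ([long]$Entry.last_backup_hours) * 60

# Keeping the old hours-based factor would overstate the age by a factor of 60.
$WrongAge = ([long]$Entry.last_backup_hours) * 60 * 60

Write-Output "Age in seconds: $LastBackupLogAge (old formula would report $WrongAge)"
```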

In addition to the above, the script cannot return the backups of a specific database only, and it puts tremendous pressure on your machine. You need to change

$BackupSet = Get-IcingaMSSQLBackupOverallStatus -SqlConnection $SqlConnection;
to
$BackupSet = Get-IcingaMSSQLBackupOverallStatus -SqlConnection $SqlConnection -IncludeDatabase $IncludeDatabase;
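
How the database filter ends up in the query is not shown here. One possible way, purely as a sketch, is to append an IN clause to the WHERE part of the query. New-DatabaseFilterClause below is a hypothetical helper and not part of the plugin; parameterized queries would be preferable where the connection layer supports them:

```powershell
# Hypothetical helper, for illustration only.
function New-DatabaseFilterClause {
    param ([string[]]$IncludeDatabase)

    # No filter requested: return an empty fragment so the query stays unchanged.
    if ($null -eq $IncludeDatabase -or $IncludeDatabase.Count -eq 0) {
        return ''
    }

    # Double single quotes so a name cannot break out of the SQL string literal.
    $Quoted = $IncludeDatabase | ForEach-Object { "'" + $_.Replace("'", "''") + "'" }

    return ' AND msdb.dbo.backupset.database_name IN (' + ($Quoted -join ', ') + ')'
}

$Clause = New-DatabaseFilterClause -IncludeDatabase @('app', 'crm')
Write-Output $Clause
```

The fragment is meant to be appended after the existing WHERE condition, so the server only returns rows for the requested databases instead of the script filtering them afterwards.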

Thank you for the idea and the issue. How would the query update affect currently implemented monitoring solutions? Is there anything special to be aware of?

With #42 the backup handling was already modified to allow limiting the result to a certain number of days and including the database in the backup query - that part was missed on our side.

For the counting: we use the hour value and convert it to seconds, not days.

#48 will improve the granularity by fetching the data on a per-minute basis instead of an hourly basis.
For the INNER JOIN I would need some more feedback on how it would impact current implementations and whether anything else is dropped in this case.

Because I back up the transaction log every 30 minutes, there were too many backup entries in the database, so the script's initial query timed out (the timeout value was 60 seconds). We therefore modified the query to return only the latest backup of each type, i.e. 'D' (database), 'I' (differential) and 'L' (log).

This will have a big impact on users who already use the backup check command, as it will no longer fetch all backup entries but only the last one, and will apply thresholds only to that (our primary interest was the transaction log). I think limiting the backup entries to the last X days is a more granular approach, and it would also solve the issue we had initially.
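
Limiting by age could look roughly like the fragment below. This is only a sketch and not necessarily how #42 implements it; $MaxAgeDays is a hypothetical variable name:

```powershell
# Hypothetical setting: how many days of backup history to consider.
[int]$MaxAgeDays = 7

# The [int] type ensures only a number is interpolated into the SQL text.
$AgeFilter = "AND msdb.dbo.backupset.backup_start_date >= DATEADD(DAY, -$MaxAgeDays, GETDATE())"

Write-Output $AgeFilter
```

Appended to the WHERE part of the query, this keeps all entries inside the time window, so existing thresholds still see every recent backup while the old history is skipped.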

Thanks for improving this.