microsoft / service-fabric-observer

Highly configurable, extensible and performant Service Fabric watchdog service that, out of the box, monitors a broad range of physical machine resources that tend to be very important to Service Fabric services and maps these metrics to SF entities. It targets both Windows and Linux SF clusters.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

FabricObserver 3.2.15

Deploy to Azure

FabricObserver (FO) is a production-ready watchdog service with an easy-to-use extensibility model, written as a stateless, singleton Service Fabric .NET 6 application that by default

  1. Monitors a broad range of physical machine resources that tend to be very important to all Service Fabric services and maps these metrics to the related Service Fabric entities.
  2. Runs on multiple versions of Windows Server and Ubuntu.
  3. Provides an easy-to-use extensibility model for creating custom Observers out of band (so, you don't need to clone the repo to build an Observer). In this way, FabricObserver is also an "Observer" platform.
  4. Supports Configuration Setting Application Updates for any observer for any supported setting.
  5. Is actively developed in the open.

FabricObserver targets SF runtime versions 9 and higher.

FO is a Stateless Service Fabric Application composed of a single service that runs on every node in your cluster, so it can be deployed and run alongside your applications without any changes to them. Each FO service instance knows nothing about other FO instances in the cluster, by design.

Running side-by-side with existing monitoring services, FabricObserver provides useful and timely health information for the nodes (VMs), apps, and services that make up your Service Fabric deployment.

Read more about Service Fabric health monitoring

FabricObserver is one member of a growing family of open source Service Fabric observability services. The latest member of the family is FabricHealer, which works in conjunction with FabricObserver to auto-mitigate service, node and VM level issues reported by FO.

If you run your apps on Service Fabric, then you should definitely consider deploying FabricObserver to all of your clusters (Test, Staging, Production).

Using FabricObserver

To quickly learn how to use FO, please see the simple scenario-based examples.
You can clone the repo, build, and deploy or simply grab latest tested SFPKG with Microsoft signed binaries from Releases section, modify configs, and deploy.

alt text

How it works

Application and Service Level Warnings:

alt text
alt text
alt text

Node Level Warnings:

alt text
alt text
alt text

Node Level Machine Info:

alt text

When FabricObserver gracefully exits or updates, it will clear all of the health events it created.

alt text

FabricObserver comes with a number of Observers that run out-of-the-box. Observers are specialized objects that monitor, point in time, specific resources in use by user service processes, SF system service processes, containers, virtual/physical machines. They emit Service Fabric health reports, diagnostic telemetry and ETW events, then go away until the next round of monitoring. The resource metric thresholds supplied in the configurations of the built-in observers must be set to match your specific monitoring and alerting needs. These settings are housed in Settings.xml and ApplicationManifest.xml. The default settings are useful without any modifications, but you should design your resource usage thresholds according to your specific needs.

When a Warning threshold is reached or exceeded, an observer will send a Health Report to Service Fabric's Health management system (either as a Node or App Health Report, depending on the observer). This Warning state and related reports are viewable in SFX, the Service Fabric EventStore, and Azure's Application Insights/LogAnalytics/ETW, if enabled.

Most observers will remove the Warning state in cases where the issue is transient, but others will maintain a long-running Warning for applications/services/nodes/security problems observed in the cluster. For example, high CPU usage above the user-assigned threshold for a VM or App/Service will put a Node into Warning State (NodeObserver) or Application Warning state (AppObserver), for example, but will soon go back to Healthy if it is a transient spike or after you mitigate the specific problem :-). An expiring certificate Warning from CertificateObsever, however, will remain until you update your application's certificates (Cluster certificates are already monitored by the SF runtime. This is not the case for Application certificates, so use CertificateObserver for this, if necessary).

Read more about Service Fabric Health Reports

FO ships with both an Azure ApplicationInsights and Azure LogAnalytics telemetry implementation. Other providers can be used by implementing the ITelemetryProvider interface.

For more information about the design of FabricObserver, please see the Design readme.

Build and run

It is highly recommended that you only deploy code built from the main branch into your production clusters.

  1. Clone the repo.
  2. Install .NET 6
  3. Build.

Note: By default, FO runs as NetworkUser on Windows and sfappsuser on Linux. If you want to monitor SF service processes that run as elevated (System) on Windows, then you must also run FO as System on Windows. There is no reason to run as root on Linux under any circumstances (see the Capabilities binaries implementations, which allow for FO to run as sfappsuser and successfully execute specific commands that require elevated privilege).

For Linux deployments, we have ensured that FO will work as expected as normal user (non-root user). In order for us to do this, we had to implement a setup script that sets Capabilities on three proxy binaries which can only run specific commands as root. If you deploy from VS, then you will need to use FabricObserver/PackageRoot/ServiceManifest.linux.xml (just copy its contents into ServiceManifest.xml or add the new piece which is simply a SetupEntryPoint section).

If you use the FO build script, then it will take care of any configuration modifications automatically for linux build output.

The build scripts include code build, sfpkg generation, and nupkg generation. They are all located in the top level directory of this repo.

FabricObserver can be run and deployed through Visual Studio or Powershell, like any SF app. If you want to add this to your Azure Pipelines CI, see FOAzurePipeline.yaml for msazure devops build tasks. Please keep in mind that if your target servers do not already have .net6 installed (if you deploy VM images from Azure gallery, then they will not have .net6 installed), then you must deploy the SelfContained package.

Deploy FabricObserver

Note: You must deploy this version (3.2.15) to clusters that are running SF 9.0 and above. This version also requires .NET 6. You can deploy FabricObserver (and ClusterObserver) using Visual Studio (if you build the sources yourself), PowerShell or ARM. Please note that this version of FabricObserver no longer supports the DefaultServices node in ApplicationManifest.xml. This means that should you deploy using PowerShell, you must create an instance of the service as the last command in your script. This was done to support ARM deployment, specifically. The StartupServices.xml file you see in the FabricHealerApp project now contains the service information once held in ApplicationManifest's DefaultServices node. Note that this information is primarily useful for deploying from Visual Studio. Your ARM template or PowerShell script will contain all the information necessary for deploying FabricObserver.

Deploy FabricObserver using ARM

Learn how to deploy FabricObserver using ARM

Deploy FabricObserver using Client (PowerShell)

After you adjust configuration settings to meet to your needs (this means changing settings in Settings.xml for ObserverManager (ObserverManagerConfiguration section) and in ApplicationManifest.xml for observers).

NOTE: In version 3.2.0 and higher and you must create a service instance after you create the application.

#cd to the top level repo directory where you cloned FO sources.

cd C:\Users\me\source\repos\service-fabric-observer

#Build FO (Release)

./Build-FabricObserver

#create a $path variable that points to the build output:
#E.g., for Windows deployments:

$path = "C:\Users\me\source\repos\service-fabric-observer\bin\release\FabricObserver\win-x64\self-contained\FabricObserverType"

#For Linux deployments:

#$path = "C:\Users\me\source\repos\service-fabric-observer\bin\release\FabricObserver\linux-x64\self-contained\FabricObserverType"

#Connect to target cluster, for example:

Connect-ServiceFabricCluster -ConnectionEndpoint @('sf-win-cluster.westus2.cloudapp.azure.com:19000') -X509Credential -FindType FindByThumbprint -FindValue '[thumbprint]' -StoreLocation LocalMachine -StoreName 'My'

#Copy $path contents (FO app package) to server:

Copy-ServiceFabricApplicationPackage -ApplicationPackagePath $path -CompressPackage -ApplicationPackagePathInImageStore FO3215 -TimeoutSec 1800

#Register FO ApplicationType:

Register-ServiceFabricApplicationType -ApplicationPathInImageStore FO3215

#Create FO application (if not already deployed at lesser version):

New-ServiceFabricApplication -ApplicationName fabric:/FabricObserver -ApplicationTypeName FabricObserverType -ApplicationTypeVersion 3.2.15   

#Create the Service instances (-1 means all nodes, which is what is required for FO):  

New-ServiceFabricService -Stateless -PartitionSchemeSingleton -ApplicationName fabric:/FabricObserver -ServiceName fabric:/FabricObserver/FabricObserverService -ServiceTypeName FabricObserverType -InstanceCount -1

#OR if updating existing version:  

Start-ServiceFabricApplicationUpgrade -ApplicationName fabric:/FabricObserver -ApplicationTypeVersion 3.2.15 -Monitored -FailureAction rollback

Observer Model

FO is composed of Observer objects (instance types) that are designed to observe, record, and report on several machine-level environmental conditions inside a Windows or Linux (Ubuntu) VM hosting a Service Fabric node.

Here are the current observers and what they monitor:

Resource Observer
Application (services) resource usage health monitoring across CPU, File Handles, Memory, Ports (TCP), Threads AppObserver
Looks for dmp and zip files in AppObserver's MemoryDumps folder, compresses (if necessary) and uploads them to your specified Azure storage account (blob only, AppObserver only, and still Windows only in this version of FO) AzureStorageUploadObserver
Application (user) and cluster certificate health monitoring CertificateObserver
Container resource usage health monitoring across CPU and Memory ContainerObserver
Disk (local storage disk health/availability, space usage, IO, Folder size monitoring) DiskObserver
SF System Services resource usage health monitoring across CPU, File Handles, Memory, Ports (TCP), Threads FabricSystemObserver
Networking - general health and monitoring of availability of user-specified, per-app endpoints NetworkObserver
CPU/Memory/File Handles(Linux)/Firewalls(Windows)/TCP Ports usage at machine level NodeObserver
OS/Hardware - OS install date, OS health status, list of hot fixes, hardware configuration, AutoUpdate configuration, Ephemeral TCP port range, TCP ports in use, memory and disk space usage OSObserver
Service Fabric Configuration information SFConfigurationObserver
Another resource you find important Observer that you implement

To learn more about the current Observers and their configuration, please see the Observers readme.

Just observe it.

Operational Telemetry

Please see FabricObserver Operational Telemetry for detailed information on the user agnostic (Non-PII) data FabricObserver sends to Microsoft (opt out with a simple configuration parameter change). Please consider leaving this enabled so your friendly neighborhood Service Fabric devs can understand how FabricObserver is doing in the real world. We would really appreciate it!

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

Please see CONTRIBUTING.md for development process information.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

About

Highly configurable, extensible and performant Service Fabric watchdog service that, out of the box, monitors a broad range of physical machine resources that tend to be very important to Service Fabric services and maps these metrics to SF entities. It targets both Windows and Linux SF clusters.

License:MIT License


Languages

Language:C# 96.2%Language:PowerShell 3.1%Language:C 0.5%Language:Shell 0.2%Language:Batchfile 0.1%