Commit 5c2bbac9 authored by Rafael Monnerat

slapos_crm: Merge ComputeNode_check* alarms into project based alarm

   Refactor the implementation of the alarms:

      - Search from Project rather than querying compute nodes directly,
          respecting whether the project can create tickets

      - Split ticket creation from messaging by introducing
          *_getReportedErrorDict to calculate the error messages
	  (so we can re-use them)

      - Merge 2 alarms (check compute node and check compute software
          installations) into a single one, so we at least halve
	  the number of activities launched for compute nodes.

      - Drop the specific alarm to issue a ticket if the user has an
	instance on a 'close/forever' compute node

      - Reimplement SupportRequest_recheckMonitoring

   Merge SoftwareInstallation_getReportedErrorDict into ComputeNode_getReportedErrorDict:

     We only check for problems from the Compute Node perspective, so we
     should report only one ticket at a time.

     SupportRequest_recheckMonitoring can guarantee that the compute node
     doesn't have other problems while closing the ticket. So, this reduces
     the scope of checks: for example, if the computer isn't connecting,
     there is no point in checking all software installations inside.
     Another example: if multiple software releases are failing, we don't
     need to report each one, since the administrator is already informed
     and creating multiple tickets for the same compute node will only
     spam the administrator.
parent 37d89868
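The "one ticket per compute node" rule from the commit message can be pictured in plain Python. This is a minimal sketch, not the ERP5 script: `NodeStatus`, `get_reported_error` and the boolean fields are hypothetical simplifications. It only illustrates that the checks run in a fixed order and that the first failing check is the one reported.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class NodeStatus:
    # Hypothetical, simplified view of a compute node (not an ERP5 class).
    reference: str
    contacted_recently: bool = True
    modified_file: Optional[str] = None
    stalled_instances: bool = False
    failing_software: List[str] = field(default_factory=list)


def get_reported_error(node: NodeStatus) -> dict:
    """Return a single error report; the first failing check wins."""
    report = {"should_notify": False, "ticket_title": None}
    if not node.contacted_recently:
        title = "Lost contact with compute_node %s" % node.reference
    elif node.modified_file is not None:
        title = "Compute Node %s has modified file" % node.reference
    elif node.stalled_instances:
        title = "Compute Node %s has a stalled instance process" % node.reference
    elif node.failing_software:
        # Only the first failing release is reported, to avoid ticket spam.
        title = "%s is failing or taking too long to build on %s" % (
            node.failing_software[0], node.reference)
    else:
        return report
    report.update(should_notify=True, ticket_title=title)
    return report


node = NodeStatus("COMP-1", contacted_recently=False,
                  failing_software=["SOFTINST-1", "SOFTINST-2"])
print(get_reported_error(node)["ticket_title"])
```

Because the connectivity check comes first, the two failing software releases in the usage example do not produce additional reports.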
<?xml version="1.0"?>
<ZopeData>
<record id="1" aka="AAAAAAAAAAE=">
<pickle>
<global name="Alarm" module="erp5.portal_type"/>
</pickle>
<pickle>
<dictionary>
<item>
<key> <string>active_sense_method_id</string> </key>
<value> <string>Alarm_checkComputeNodeState</string> </value>
</item>
<item>
<key> <string>automatic_solve</string> </key>
<value> <int>0</int> </value>
</item>
<item>
<key> <string>description</string> </key>
<value> <string>Check if a public or a friend compute_node contacted master recently and create a ticket if the compute_node stops to contact master after some time.</string> </value>
</item>
<item>
<key> <string>enabled</string> </key>
<value> <int>1</int> </value>
</item>
<item>
<key> <string>id</string> </key>
<value> <string>slapos_crm_check_compute_node_state</string> </value>
</item>
<item>
<key> <string>periodicity_hour</string> </key>
<value>
<tuple/>
</value>
</item>
<item>
<key> <string>periodicity_hour_frequency</string> </key>
<value>
<none/>
</value>
</item>
<item>
<key> <string>periodicity_minute</string> </key>
<value>
<tuple/>
</value>
</item>
<item>
<key> <string>periodicity_minute_frequency</string> </key>
<value> <int>30</int> </value>
</item>
<item>
<key> <string>periodicity_month</string> </key>
<value>
<tuple/>
</value>
</item>
<item>
<key> <string>periodicity_month_day</string> </key>
<value>
<tuple/>
</value>
</item>
<item>
<key> <string>periodicity_start_date</string> </key>
<value>
<object>
<klass>
<global name="_reconstructor" module="copy_reg"/>
</klass>
<tuple>
<global name="DateTime" module="DateTime.DateTime"/>
<global name="object" module="__builtin__"/>
<none/>
</tuple>
<state>
<tuple>
<float>1288051200.0</float>
<string>GMT</string>
</tuple>
</state>
</object>
</value>
</item>
<item>
<key> <string>periodicity_week</string> </key>
<value>
<tuple/>
</value>
</item>
<item>
<key> <string>portal_type</string> </key>
<value> <string>Alarm</string> </value>
</item>
<item>
<key> <string>sense_method_id</string> </key>
<value>
<none/>
</value>
</item>
<item>
<key> <string>title</string> </key>
<value> <string>Check compute_node\'s state</string> </value>
</item>
</dictionary>
</pickle>
</record>
</ZopeData>
<?xml version="1.0"?>
<ZopeData>
<record id="1" aka="AAAAAAAAAAE=">
<pickle>
<global name="Alarm" module="erp5.portal_type"/>
</pickle>
<pickle>
<dictionary>
<item>
<key> <string>active_sense_method_id</string> </key>
<value> <string>Alarm_searchInstanceOnClosedComputeNode</string> </value>
</item>
<item>
<key> <string>automatic_solve</string> </key>
<value> <int>0</int> </value>
</item>
<item>
<key> <string>description</string> </key>
<value>
<none/>
</value>
</item>
<item>
<key> <string>enabled</string> </key>
<value> <int>0</int> </value>
</item>
<item>
<key> <string>id</string> </key>
<value> <string>slapos_crm_check_instance_on_closed_compute_node</string> </value>
</item>
<item>
<key> <string>periodicity_hour</string> </key>
<value>
<tuple>
<int>1</int>
</tuple>
</value>
</item>
<item>
<key> <string>periodicity_minute</string> </key>
<value>
<tuple>
<int>0</int>
</tuple>
</value>
</item>
<item>
<key> <string>periodicity_month</string> </key>
<value>
<tuple/>
</value>
</item>
<item>
<key> <string>periodicity_month_day</string> </key>
<value>
<tuple/>
</value>
</item>
<item>
<key> <string>periodicity_week</string> </key>
<value>
<tuple/>
</value>
</item>
<item>
<key> <string>periodicity_week_day</string> </key>
<value>
<tuple>
<string>Monday</string>
</tuple>
</value>
</item>
<item>
<key> <string>portal_type</string> </key>
<value> <string>Alarm</string> </value>
</item>
<item>
<key> <string>title</string> </key>
<value> <string>Check Instance on closed Compute Nodes</string> </value>
</item>
</dictionary>
</pickle>
</record>
</ZopeData>
...@@ -8,7 +8,7 @@ ...@@ -8,7 +8,7 @@
<dictionary> <dictionary>
<item> <item>
<key> <string>active_sense_method_id</string> </key> <key> <string>active_sense_method_id</string> </key>
<value> <string>Alarm_checkSoftwareInstallationState</string> </value> <value> <string>Alarm_checkProjectMonitoringState</string> </value>
</item> </item>
<item> <item>
<key> <string>automatic_solve</string> </key> <key> <string>automatic_solve</string> </key>
...@@ -16,9 +16,7 @@ ...@@ -16,9 +16,7 @@
</item> </item>
<item> <item>
<key> <string>description</string> </key> <key> <string>description</string> </key>
<value> <value> <string>Check per project and trigger activities to verify all compute nodes (monitored) per project and as well related instances.</string> </value>
<none/>
</value>
</item> </item>
<item> <item>
<key> <string>enabled</string> </key> <key> <string>enabled</string> </key>
...@@ -26,7 +24,7 @@ ...@@ -26,7 +24,7 @@
</item> </item>
<item> <item>
<key> <string>id</string> </key> <key> <string>id</string> </key>
<value> <string>slapos_crm_check_software_installation_state</string> </value> <value> <string>slapos_crm_monitoring_project</string> </value>
</item> </item>
<item> <item>
<key> <string>periodicity_hour</string> </key> <key> <string>periodicity_hour</string> </key>
...@@ -101,7 +99,7 @@ ...@@ -101,7 +99,7 @@
</item> </item>
<item> <item>
<key> <string>title</string> </key> <key> <string>title</string> </key>
<value> <string>Check software installation\'s state</string> </value> <value> <string>Create tickets for Compute nodes and Instance Trees</string> </value>
</item> </item>
</dictionary> </dictionary>
</pickle> </pickle>
......
portal = context.getPortalObject()
portal.portal_catalog.searchAndActivate(
  portal_type='Project',
  validation_state='validated',
  method_id='Project_checkMonitoringState',
  activate_kw={'tag': tag}
)
context.activate(after_tag=tag).getId()
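A rough way to picture the project-based fan-out: the alarm visits validated projects, and each project that still accepts tickets visits its own monitored compute nodes. The sketch below replaces the catalog and the activity queue with plain lists; `Project`, `Node` and `collect_checks` are hypothetical stand-ins, not ERP5 APIs.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    reference: str
    monitor_enabled: bool = True


@dataclass
class Project:
    title: str
    ticket_creation_closed: bool = False
    nodes: List[Node] = field(default_factory=list)


def collect_checks(projects: List[Project]) -> List[str]:
    """Return the node references that would receive one monitoring check."""
    scheduled = []
    for project in projects:
        if project.ticket_creation_closed:
            # A project that cannot create tickets schedules nothing.
            continue
        for node in project.nodes:
            if node.monitor_enabled:
                scheduled.append(node.reference)
    return scheduled


projects = [
    Project("open", nodes=[Node("COMP-1"), Node("COMP-2", monitor_enabled=False)]),
    Project("closed", ticket_creation_closed=True, nodes=[Node("COMP-3")]),
]
print(collect_checks(projects))
```

Filtering at the project level means nodes of a project with closed ticket creation are never visited, instead of being visited and skipped one by one.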
...@@ -54,7 +54,7 @@ ...@@ -54,7 +54,7 @@
</item> </item>
<item> <item>
<key> <string>id</string> </key> <key> <string>id</string> </key>
<value> <string>Alarm_checkComputeNodeState</string> </value> <value> <string>Alarm_checkProjectMonitoringState</string> </value>
</item> </item>
</dictionary> </dictionary>
</pickle> </pickle>
......
portal = context.getPortalObject()
monitor_enabled_category = portal.restrictedTraverse(
  "portal_categories/monitor_scope/enabled", None)
if monitor_enabled_category is not None:
  portal.portal_catalog.searchAndActivate(
    portal_type='Compute Node',
    validation_state='validated',
    monitor_scope__uid=monitor_enabled_category.getUid(),
    method_id='ComputeNode_checkSoftwareInstallationState',
    activate_kw={'tag':tag}
  )
context.activate(after_tag=tag).getId()
portal = context.getPortalObject()
active_process = context.newActiveProcess().getRelativeUrl()
# Closed compute_nodes like this might contain unremoved instances hanging there.
category_close_forever = portal.restrictedTraverse(
  "portal_categories/allocation_scope/close/forever", None)
category_close_outdated = portal.restrictedTraverse(
  "portal_categories/allocation_scope/close/outdated", None)
return portal.portal_catalog.searchAndActivate(
  method_kw=dict(fixit=fixit, active_process=active_process),
  method_id="ComputeNode_checkInstanceOnCloseAllocation",
  portal_type='Compute Node',
  default_allocation_scope_uid=[category_close_forever.getUid(), category_close_outdated.getUid()],
  validation_state="validated",
  activate_kw={"tag": tag})
<?xml version="1.0"?>
<ZopeData>
<record id="1" aka="AAAAAAAAAAE=">
<pickle>
<global name="PythonScript" module="Products.PythonScripts.PythonScript"/>
</pickle>
<pickle>
<dictionary>
<item>
<key> <string>_bind_names</string> </key>
<value>
<object>
<klass>
<global name="_reconstructor" module="copy_reg"/>
</klass>
<tuple>
<global name="NameAssignments" module="Shared.DC.Scripts.Bindings"/>
<global name="object" module="__builtin__"/>
<none/>
</tuple>
<state>
<dictionary>
<item>
<key> <string>_asgns</string> </key>
<value>
<dictionary>
<item>
<key> <string>name_container</string> </key>
<value> <string>container</string> </value>
</item>
<item>
<key> <string>name_context</string> </key>
<value> <string>context</string> </value>
</item>
<item>
<key> <string>name_m_self</string> </key>
<value> <string>script</string> </value>
</item>
<item>
<key> <string>name_subpath</string> </key>
<value> <string>traverse_subpath</string> </value>
</item>
</dictionary>
</value>
</item>
</dictionary>
</state>
</object>
</value>
</item>
<item>
<key> <string>_params</string> </key>
<value> <string>fixit, tag, **kw</string> </value>
</item>
<item>
<key> <string>id</string> </key>
<value> <string>Alarm_searchInstanceOnClosedComputeNode</string> </value>
</item>
</dictionary>
</pickle>
</record>
</ZopeData>
from Products.CMFActivity.ActiveResult import ActiveResult
portal = context.getPortalObject()
active_process = portal.restrictedTraverse(active_process)
partition_uid_list = [compute_partition.getUid() for compute_partition in context.objectValues(portal_type="Compute Partition")]
if not partition_uid_list:
  return
for software_instance in portal.portal_catalog(
    portal_type="Software Instance",
    default_aggregate_uid=partition_uid_list):
  if software_instance.getSlapState() == "destroy_requested":
    continue
  active_process.postResult(ActiveResult(
    summary="%s" % software_instance.getRelativeUrl(),
    severity=100,
    detail="%s on %s" % (software_instance.getRelativeUrl(), context.getRelativeUrl())))
<?xml version="1.0"?>
<ZopeData>
<record id="1" aka="AAAAAAAAAAE=">
<pickle>
<global name="PythonScript" module="Products.PythonScripts.PythonScript"/>
</pickle>
<pickle>
<dictionary>
<item>
<key> <string>_bind_names</string> </key>
<value>
<object>
<klass>
<global name="_reconstructor" module="copy_reg"/>
</klass>
<tuple>
<global name="NameAssignments" module="Shared.DC.Scripts.Bindings"/>
<global name="object" module="__builtin__"/>
<none/>
</tuple>
<state>
<dictionary>
<item>
<key> <string>_asgns</string> </key>
<value>
<dictionary>
<item>
<key> <string>name_container</string> </key>
<value> <string>container</string> </value>
</item>
<item>
<key> <string>name_context</string> </key>
<value> <string>context</string> </value>
</item>
<item>
<key> <string>name_m_self</string> </key>
<value> <string>script</string> </value>
</item>
<item>
<key> <string>name_subpath</string> </key>
<value> <string>traverse_subpath</string> </value>
</item>
</dictionary>
</value>
</item>
</dictionary>
</state>
</object>
</value>
</item>
<item>
<key> <string>_params</string> </key>
<value> <string>fixit, active_process</string> </value>
</item>
<item>
<key> <string>id</string> </key>
<value> <string>ComputeNode_checkInstanceOnCloseAllocation</string> </value>
</item>
</dictionary>
</pickle>
</record>
</ZopeData>
from DateTime import DateTime
portal = context.getPortalObject()
if context.getMonitorScope() == "disabled":
  return
project = context.getFollowUpValue()
if project.Project_isSupportRequestCreationClosed():
  return
error_dict = context.ComputeNode_getReportedErrorDict()
if not error_dict['should_notify']:
  return
support_request = project.Project_createSupportRequestWithCausality(
  error_dict['ticket_title'],
  error_dict['ticket_description'],
  causality=context.getRelativeUrl(),
  destination_decision=project.getDestination()
)
if support_request is not None:
  support_request.Ticket_createProjectEvent(
    error_dict['ticket_title'], 'outgoing', 'Web Message',
    portal.service_module.slapos_crm_information.getRelativeUrl(),
    text_content=error_dict['ticket_description'],
    content_type='text/plain',
    notification_message=error_dict['notification_message_reference'],
    #language=XXX,
    substitution_method_parameter_dict=error_dict
  )
return support_request
...@@ -54,7 +54,7 @@ ...@@ -54,7 +54,7 @@
</item> </item>
<item> <item>
<key> <string>id</string> </key> <key> <string>id</string> </key>
<value> <string>ComputeNode_checkState</string> </value> <value> <string>ComputeNode_checkMonitoringState</string> </value>
</item> </item>
</dictionary> </dictionary>
</pickle> </pickle>
......
from DateTime import DateTime
portal = context.getPortalObject()
if context.getMonitorScope() == "disabled":
  return
project = context.getFollowUpValue()
if project.Project_isSupportRequestCreationClosed():
  return
software_installation_list = portal.portal_catalog(
  portal_type='Software Installation',
  aggregate__uid=context.getUid(),
  validation_state='validated',
  sort_on=(('creation_date', 'DESC'),)
)
support_request_list = []
should_notify = True
tolerance = DateTime() - 0.5
for software_installation in software_installation_list:
  should_notify = False
  should_notify, ticket_title, description, last_contact = \
    software_installation.SoftwareInstallation_hasReportedError(
      tolerance=tolerance)
  if should_notify:
    project = context.getFollowUpValue()
    support_request = project.Project_createSupportRequestWithCausality(
      ticket_title,
      description,
      causality=context.getRelativeUrl(),
      destination_decision=project.getDestination()
    )
    if support_request is None:
      return
    notification_message_reference = 'slapos-crm-compute_node_software_installation_state.notification'
    support_request.Ticket_createProjectEvent(
      ticket_title, 'outgoing', 'Web Message',
      portal.service_module.slapos_crm_information.getRelativeUrl(),
      text_content=description,
      content_type='text/plain',
      notification_message=notification_message_reference,
      #language=XXX,
      substitution_method_parameter_dict={
        'compute_node_title':context.getTitle(),
        # Maybe a mistake on compute_node_id
        'compute_node_id': software_installation.getReference(),
        'last_contact': last_contact
      }
    )
    support_request_list.append(support_request)
return support_request_list
from DateTime import DateTime
portal = context.getPortalObject()
if (context.getMonitorScope() == "disabled"):
  return
project = context.getFollowUpValue()
if project.Project_isSupportRequestCreationClosed():
  return
reference = context.getReference()
compute_node_title = context.getTitle()
node_ticket_title = "Lost contact with compute_node %s" % reference
instance_ticket_title = "Compute Node %s has a stalled instance process" % reference
ticket_title = node_ticket_title
description = ""
last_contact = "No Contact Information"
issue_document_reference = ""
notification_message_reference = 'slapos-crm-compute_node_check_state.notification'
now = DateTime()
d = context.getAccessStatus()
# Ignore if data isn't present.
should_notify = False
if d.get("no_data") == 1:
  should_notify = True
  description = "The Compute Node %s (%s) has not contacted the server (No Contact Information)" % (
    compute_node_title, reference)
else:
  last_contact = DateTime(d.get('created_at'))
  if (now - last_contact) > 0.01:
    should_notify = True
    description = "The Compute Node %s (%s) has not contacted the server for more than 30 minutes" \
      "(last contact date: %s)" % (compute_node_title, reference, last_contact)
  else:
    data_array = context.ComputeNode_hasModifiedFile()
    if data_array:
      should_notify = True
      notification_message_reference = "slapos-crm-compute_node_check_modified_file.notification"
      ticket_title = "Compute Node %s has modified file" % reference
      issue_document_reference = data_array.getReference()
      description = "The Compute Node %s (%s) has modified file: %s" % (compute_node_title, reference, issue_document_reference)
if not should_notify:
  # Since server is contacting, check for stalled processes
  ticket_title = instance_ticket_title
  notification_message_reference = 'slapos-crm-compute_node_check_stalled_instance_state.notification'
  last_contact = "No Contact Information"
  # If server has no partitions skip
  compute_partition_uid_list = [
    x.getUid() for x in context.contentValues(portal_type="Compute Partition")
    if x.getSlapState() == 'busy']
  if compute_partition_uid_list:
    instance_list = portal.portal_catalog(
      portal_type='Software Instance',
      aggregate__uid=compute_partition_uid_list)
    if instance_list:
      should_notify = True
    for instance in instance_list:
      instance_access_status = instance.getAccessStatus()
      if instance_access_status.get('no_data', None):
        # Ignore if there isn't any data
        continue
      # At least one partition contacted in the last 24h30min.
      last_contact = max(DateTime(instance_access_status.get('created_at')), last_contact)
      if (now - DateTime(instance_access_status.get('created_at'))) < 1.05:
        should_notify = False
        break
    if should_notify:
      description = "The Compute Node %s (%s) didnt process its instances for more than 24 hours, last contact: %s" % (
        context.getTitle(), context.getReference(), last_contact)
if should_notify:
  support_request = project.Project_createSupportRequestWithCausality(
    ticket_title,
    description,
    causality=context.getRelativeUrl(),
    destination_decision=project.getDestination()
  )
  if support_request is None:
    return
  support_request.Ticket_createProjectEvent(
    ticket_title, 'outgoing', 'Web Message',
    portal.service_module.slapos_crm_information.getRelativeUrl(),
    text_content=description,
    content_type='text/plain',
    notification_message=notification_message_reference,
    #language=XXX,
    substitution_method_parameter_dict={
      'compute_node_title':context.getTitle(),
      'compute_node_id':reference,
      'last_contact':last_contact,
      'issue_document_reference': issue_document_reference
    }
  )
  return support_request
portal = context.getPortalObject()
compute_node = context
software_installation_tolerance = DateTime() - 0.5
reference = context.getReference()
compute_node_title = context.getTitle()
d = compute_node.getAccessStatus()
error_dict = {
  'should_notify': None,
  'ticket_title': None,
  'ticket_description': None,
  'notification_message_reference': None,
  'compute_node_title': compute_node_title,
  'compute_node_id': reference,
  'last_contact': None,
  'issue_document_reference': None
}
if compute_node.getMonitorScope() == "disabled":
  for i in ['ticket_title', 'ticket_description', 'last_contact']:
    error_dict[i] = "Monitor is disabled on this Compute Node."
  return error_dict
if d.get("no_data") == 1:
  error_dict['last_contact'] = "No Contact Information"
  error_dict['ticket_title'] = "Lost contact with compute_node %s" % reference
  error_dict['ticket_description'] = \
    "The Compute Node %s (%s) has not contacted the server (No Contact Information)" % (
      compute_node_title, reference)
  error_dict['notification_message_reference'] = 'slapos-crm-compute_node_check_state.notification'
  error_dict['should_notify'] = True
  return error_dict
last_contact = DateTime(d.get('created_at'))
now = DateTime()
if (now - last_contact) > 0.01:
  error_dict['should_notify'] = True
  error_dict['ticket_title'] = "Lost contact with compute_node %s" % reference
  error_dict['last_contact'] = last_contact
  error_dict['notification_message_reference'] = 'slapos-crm-compute_node_check_state.notification'
  error_dict['ticket_description'] = "The Compute Node %s (%s) has not contacted the server for more than 30 minutes " \
    "(last contact date: %s)" % (compute_node_title, reference, last_contact)
  return error_dict
data_array = context.ComputeNode_hasModifiedFile()
if data_array:
  error_dict['last_contact'] = last_contact
  error_dict['should_notify'] = True
  error_dict['notification_message_reference'] = "slapos-crm-compute_node_check_modified_file.notification"
  error_dict['ticket_title'] = "Compute Node %s has modified file" % reference
  error_dict['issue_document_reference'] = data_array.getReference()
  error_dict['ticket_description'] = "The Compute Node %s (%s) has modified file: %s" % (
    compute_node_title, reference, error_dict['issue_document_reference'])
  return error_dict
# Since the server is contacting, check for stalled processes.
# If the server has no busy partitions, skip this check.
compute_partition_uid_list = [
  x.getUid() for x in context.contentValues(portal_type="Compute Partition")
  if x.getSlapState() == 'busy']
if compute_partition_uid_list:
  instance_list = portal.portal_catalog(
    portal_type='Software Instance',
    aggregate__uid=compute_partition_uid_list)
  should_notify = True
  instance_last_contact = -1
  for instance in instance_list:
    instance_access_status = instance.getAccessStatus()
    if instance_access_status.get('no_data', None):
      # Ignore if there isn't any data
      continue
    # At least one partition contacted in the last 24h30min.
    instance_last_contact = max(DateTime(instance_access_status.get('created_at')),
                                instance_last_contact)
    if (now - DateTime(instance_access_status.get('created_at'))) < 1.05:
      should_notify = False
      break
  if len(instance_list) and should_notify:
    if instance_last_contact == -1:
      error_dict['last_contact'] = "No Contact Information"
    else:
      error_dict['last_contact'] = instance_last_contact
    error_dict['should_notify'] = True
    error_dict['notification_message_reference'] = "slapos-crm-compute_node_check_stalled_instance_state.notification"
    error_dict['ticket_title'] = "Compute Node %s has a stalled instance process" % reference
    error_dict['ticket_description'] = "The Compute Node %s (%s) didn't process its instances for more than 24 hours, last contact from the node: %s" % (
      compute_node_title, reference, last_contact)
    return error_dict
for software_installation in portal.portal_catalog(
    portal_type='Software Installation',
    aggregate__uid=context.getUid(),
    validation_state='validated',
    sort_on=(('creation_date', 'DESC'),)
  ):
  si_dict = software_installation.getAccessStatus()
  if software_installation.getCreationDate() > software_installation_tolerance or \
      si_dict.get("no_data", None) == 1 or \
      si_dict.get('text').startswith("#access"):
    continue
  error_dict['notification_message_reference'] = \
    'slapos-crm-compute_node_software_installation_state.notification'
  # An error occurred, we should notify
  access_status_text = si_dict.get('text')
  last_contact = DateTime(si_dict.get('created_at'))
  if access_status_text.startswith("#building") or \
      access_status_text.startswith("#error"):
    error_dict['last_contact'] = last_contact
    error_dict['should_notify'] = True
    error_dict['ticket_title'] = "%s is failing or taking too long to build on %s" % (
      software_installation.getReference(), compute_node.getReference())
    message_list = (software_installation.getUrlString(),
                    compute_node_title,
                    software_installation.getCreationDate())
    if access_status_text.startswith("#building"):
      error_dict['ticket_description'] = \
        "The software release %s is building for more than 12 hours on %s, started on %s" % message_list
    else:
      error_dict['ticket_description'] = \
        "The software release %s is failing to build for too long on %s, started on %s" % message_list
    error_dict['message'] = error_dict['ticket_description']
    return error_dict
return error_dict
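The numeric thresholds in the script above are differences between Zope `DateTime` objects, which are expressed in days. A quick conversion in plain Python (no Zope required; `days_to_minutes` is a helper introduced here for illustration) makes the literal values easier to compare with the wording of the messages and comments:

```python
def days_to_minutes(days: float) -> float:
    """Convert a fraction of a day into minutes."""
    return days * 24 * 60


for label, days in [
    ("node contact threshold", 0.01),
    ("instance contact threshold", 1.05),
    ("software installation tolerance", 0.5),
]:
    print("%s: %.2f days = %.1f minutes" % (label, days, days_to_minutes(days)))
```

Under this conversion 0.01 days is about 14.4 minutes, 1.05 days is 25 hours 12 minutes, and 0.5 days is 12 hours.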
...@@ -54,7 +54,7 @@ ...@@ -54,7 +54,7 @@
</item> </item>
<item> <item>
<key> <string>id</string> </key> <key> <string>id</string> </key>
<value> <string>ComputeNode_checkSoftwareInstallationState</string> </value> <value> <string>ComputeNode_getReportedErrorDict</string> </value>
</item> </item>
</dictionary> </dictionary>
</pickle> </pickle>
......
portal = context.getPortalObject() portal = context.getPortalObject()
monitor_enabled_category = portal.restrictedTraverse( monitor_enabled_category = portal.restrictedTraverse(
"portal_categories/monitor_scope/enabled", None) "portal_categories/monitor_scope/enabled", None)
if context.Project_isSupportRequestCreationClosed():
return
if monitor_enabled_category is not None: if monitor_enabled_category is not None:
portal.portal_catalog.searchAndActivate( portal.portal_catalog.searchAndActivate(
portal_type='Compute Node', portal_type='Compute Node',
validation_state='validated', validation_state='validated',
monitor_scope__uid=monitor_enabled_category.getUid(), monitor_scope__uid=monitor_enabled_category.getUid(),
method_id='ComputeNode_checkState', follow_up__uid=context.getUid(),
method_id='ComputeNode_checkMonitoringState',
# This alarm bruteforce checking all documents, # This alarm bruteforce checking all documents,
# without changing them directly. # without changing them directly.
# Increase priority to not block other activities # Increase priority to not block other activities
activate_kw={'tag':tag, 'priority': 2} activate_kw={'tag':tag, 'priority': 2}
) )
context.activate(after_tag=tag).getId() context.activate(after_tag=tag).getId()
...@@ -50,11 +50,11 @@ ...@@ -50,11 +50,11 @@
</item> </item>
<item> <item>
<key> <string>_params</string> </key> <key> <string>_params</string> </key>
<value> <string>tag, fixit, params</string> </value> <value> <string>tag</string> </value>
</item> </item>
<item> <item>
<key> <string>id</string> </key> <key> <string>id</string> </key>
<value> <string>Alarm_checkSoftwareInstallationState</string> </value> <value> <string>Project_checkMonitoringState</string> </value>
</item> </item>
</dictionary> </dictionary>
</pickle> </pickle>
......
from DateTime import DateTime
if tolerance is None:
  tolerance = DateTime() - 0.5
software_installation = context
reference = software_installation.getReference()
d = software_installation.getAccessStatus()
def return_ok(batch_mode):
  if batch_mode:
    return None
  return None, None, None, None
if software_installation.getCreationDate() > tolerance:
  return return_ok(batch_mode)
if software_installation.getSlapState() != 'start_requested':
  return return_ok(batch_mode)
if d.get("no_data", None) == 1:
  return return_ok(batch_mode)
if d.get("text").startswith("#access"):
  return return_ok(batch_mode)
last_contact = DateTime(d.get('created_at'))
if d.get("text").startswith("#building"):
  if batch_mode:
    # is it a problem...?
    return last_contact
  should_notify = True
  ticket_title = "%s is building for too long on %s" % (reference, software_installation.getAggregateReference())
  description = "The software release %s is building for mode them 12 hours on %s, started on %s" % \
    (software_installation.getUrlString(), software_installation.getAggregateTitle(), software_installation.getCreationDate())
  return should_notify, ticket_title, description, last_contact
if d.get("text").startswith("#error"):
  if batch_mode:
    return DateTime(d.get('created_at'))
  should_notify = True
  ticket_title = "%s is failing to build on %s" % (reference, software_installation.getAggregateReference())
  description = "The software release %s is failing to build for too long on %s, started on %s" % \
    (software_installation.getUrlString(), software_installation.getAggregateTitle(), software_installation.getCreationDate())
  return should_notify, ticket_title, description, last_contact
<?xml version="1.0"?>
<ZopeData>
<record id="1" aka="AAAAAAAAAAE=">
<pickle>
<global name="PythonScript" module="Products.PythonScripts.PythonScript"/>
</pickle>
<pickle>
<dictionary>
<item>
<key> <string>_bind_names</string> </key>
<value>
<object>
<klass>
<global name="_reconstructor" module="copy_reg"/>
</klass>
<tuple>
<global name="NameAssignments" module="Shared.DC.Scripts.Bindings"/>
<global name="object" module="__builtin__"/>
<none/>
</tuple>
<state>
<dictionary>
<item>
<key> <string>_asgns</string> </key>
<value>
<dictionary>
<item>
<key> <string>name_container</string> </key>
<value> <string>container</string> </value>
</item>
<item>
<key> <string>name_context</string> </key>
<value> <string>context</string> </value>
</item>
<item>
<key> <string>name_m_self</string> </key>
<value> <string>script</string> </value>
</item>
<item>
<key> <string>name_subpath</string> </key>
<value> <string>traverse_subpath</string> </value>
</item>
</dictionary>
</value>
</item>
</dictionary>
</state>
</object>
</value>
</item>
<item>
<key> <string>_params</string> </key>
<value> <string>tolerance=None, batch_mode=False</string> </value>
</item>
<item>
<key> <string>id</string> </key>
<value> <string>SoftwareInstallation_hasReportedError</string> </value>
</item>
</dictionary>
</pickle>
</record>
</ZopeData>
from DateTime import DateTime causality_portal_type_list = [
'Compute Node',
'Instance Tree'
]
if context.getSimulationState() == "invalidated": if (context.getSimulationState() == "invalidated") or \
return "Closed Ticket" (context.getPortalType() != "Support Request") or \
(not context.getCausality(portal_type=causality_portal_type_list)):
# Nothing to check
return
if context.getPortalType() != "Support Request":
return "Not a Support Request"
now = DateTime() document = context.getCausalityValue(portal_type=causality_portal_type_list)
portal = context.getPortalObject() causality_portal_type = document.getPortalType()
document = context.getAggregateValue() if causality_portal_type == "Compute Node":
if document is None: error_dict = document.ComputeNode_getReportedErrorDict()
return True return error_dict['message']
aggregate_portal_type = document.getPortalType()
if aggregate_portal_type == "Compute Node":
if document.getMonitorScope() == "disabled":
return "Monitor is disabled to the related %s." % document.getPortalType()
d = document.getAccessStatus()
if d.get("no_data", None) == 1:
return "No Contact Information"
last_contact = DateTime(d.get('created_at'))
if (now - last_contact) < 0.01:
ComputeNode_hasModifiedFile = getattr(
document, "ComputeNode_hasModifiedFile", None)
if ComputeNode_hasModifiedFile:
data_array = ComputeNode_hasModifiedFile()
if data_array:
return "Compute Node %s (%s) has modified file: %s" % (
document.getTitle(), document.getReference(), data_array.getReference())
# If server has no partitions skip
compute_partition_uid_list = [
x.getUid() for x in document.contentValues(portal_type="Compute Partition")
if x.getSlapState() == 'busy']
if compute_partition_uid_list:
is_instance_stalled = True
last_contact = None
instance_list = portal.portal_catalog(
portal_type='Software Instance',
default_aggregate_uid=compute_partition_uid_list)
for instance in instance_list:
instance_access_status = instance.getAccessStatus()
if instance_access_status.get('no_data', None):
# Ignore if there isnt any data
continue
# At lest one partition contacted in the last 24h30min.
last_contact = max(DateTime(instance_access_status.get('created_at')), last_contact)
if (now - DateTime(instance_access_status.get('created_at'))) < 1.05:
is_instance_stalled = False
break
if is_instance_stalled and len(instance_list):
if last_contact is None:
return "Process instance stalled"
return "Process instance stalled, last contact was %s" % last_contact
return "All OK, latest contact: %s " % last_contact
else:
return "Problem, latest contact: %s" % last_contact
if aggregate_portal_type == "Software Installation":
compute_node_title = document.getAggregateTitle()
if document.getAggregateValue().getMonitorScope() == "disabled":
return "Monitor is disabled to the related %s." % document.getPortalType()
if document.getSlapState() not in ["start_requested", "stop_requested"]:
return "Software Installation is Destroyed."
d = document.getAccessStatus()
if d.get("no_data", None) == 1:
return "The software release %s did not started to build on %s since %s" % \
(document.getUrlString(), compute_node_title, document.getCreationDate())
last_contact = DateTime(d.get('created_at'))
if d.get("text").startswith("building"):
return "The software release %s is building for mode them 12 hours on %s, started on %s" % \
(document.getUrlString(), compute_node_title, document.getCreationDate())
elif d.get("text").startswith("#access"):
return "All OK, software built."
elif d.get("text").startswith("#error"):
return "The software release %s is failing to build for too long on %s, started on %s" % \
(document.getUrlString(), compute_node_title, document.getCreationDate())
if aggregate_portal_type == "Instance Tree":
if document.getMonitorScope() == "disabled":
return "Monitor is disabled to the related %s." % document.getPortalType()
if causality_portal_type == "Instance Tree":
message_list = [] message_list = []
instance_tree = document instance_tree = document
...@@ -98,24 +25,8 @@ if aggregate_portal_type == "Instance Tree": ...@@ -98,24 +25,8 @@ if aggregate_portal_type == "Instance Tree":
# Check if at least one software Instance is Allocated # Check if at least one software Instance is Allocated
for instance in software_instance_list: for instance in software_instance_list:
if instance.getSlapState() not in ["start_requested", "stop_requested"]: error_dict = instance.SoftwareInstance_getReportedErrorDict(tolerance=30)
continue if error_dict['should_notify']:
message_list.append(error_dict['message'])
if instance.getAggregate() is not None:
compute_node = instance.getAggregateValue().getParentValue()
if instance.getPortalType() == "Software Instance" and \
instance.getSlapState() == "start_requested" and \
instance.SoftwareInstance_hasReportedError():
message_list.append("%s has error (%s, %s at %s scope %s)" % (instance.getReference(), instance.getTitle(),
instance.getUrlString(), compute_node.getReference(),
compute_node.getAllocationScope()))
if instance.getPortalType() == "Software Instance" and \
compute_node.getAllocationScope() in ["closed/outdated"] and \
instance.getSlapState() == "start_requested" and \
instance.SoftwareInstance_hasReportedError():
message_list.append("%s on a %s compute_node" % (instance.getReference(), compute_node.getAllocationScope()) )
else:
message_list.append("%s is not allocated" % instance.getReference())
return ",".join(message_list) return ",".join(message_list)
return None return None
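The reimplemented recheck dispatches on the portal type of the ticket's causality and reuses the same error reporters as ticket creation. A minimal stand-alone sketch of that dispatch follows; `recheck`, the `checkers` mapping and the lambda reporters are hypothetical substitutes for the ERP5 scripts, and the real script reads these values from the Support Request itself.

```python
from typing import Callable, Dict, Optional


def recheck(portal_type: str, ticket_state: str, causality_type: Optional[str],
            checkers: Dict[str, Callable[[], Optional[str]]]) -> Optional[str]:
    """Return the current problem message for a ticket, or None if nothing to check."""
    if ticket_state == "invalidated" or portal_type != "Support Request":
        return None
    if causality_type not in checkers:
        # No supported causality document, so there is nothing to check.
        return None
    return checkers[causality_type]()


checkers = {
    "Compute Node": lambda: "Lost contact with compute_node COMP-1",
    "Instance Tree": lambda: None,  # no instance reported an error
}
print(recheck("Support Request", "validated", "Compute Node", checkers))
```

Keeping the reporters in one mapping means the message shown when rechecking a ticket comes from the same code path that decided to open it.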