How to Localize Causes of Malfunctions In Complex Systems More Precisely
When a malfunction occurs in a complex mission-critical system, it is important to localize the subsystem in which the cause of that malfunction lies.
Once you know the subsystem,
- You can concentrate your efforts on a much smaller area,
- You know which vendor to consult for further analysis. And you can prove to this vendor that something in their subsystem has caused the problem. So you'll avoid finger-pointing across multiple vendors.
Of course, it is necessary to identify the cause of a malfunction correctly, so that it can be fixed. This will prevent further occurrences of malfunctions with that cause.
Work With Facts Instead of Assumptions
When analyzing malfunctions in mission critical systems, you should not depend on assumptions, but work with facts.
Therefore you should not assume that a specific piece of software or hardware has worked correctly. Instead it is necessary to have enough facts so that you can reconstruct the point of failure based on facts.
Otherwise you may not find the real cause of a past malfunction, and therefore you cannot fix it. (You may instead try to fix something which is not broken, and that can make that other component worse.) If you do not fix the real cause, the malfunction may occur again, repeatedly, and each recurrence may be dangerous and lead to disaster.
Record the Data Flow on All Interfaces Between Subsystems
For that you usually need to record the data flow on all interfaces between subsystems. Only then will you have the facts that show you which subsystem didn't work correctly.
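As a rough illustration of what recording an interface can mean in software, here is a minimal Python sketch. It assumes that the messages crossing an interface can be intercepted by wrapping the receiving handler. The names InterfaceRecorder, record and tap, the interface label and the message contents are hypothetical and chosen only for this example; a real recorder would also need reliable timestamps and tamper-proof storage.

```python
import io
import json
import time
from typing import Any, Callable, TextIO


class InterfaceRecorder:
    """Hypothetical recorder: writes one JSON line per message on an interface."""

    def __init__(self, sink: TextIO, clock: Callable[[], float] = time.time) -> None:
        self._sink = sink
        self._clock = clock

    def record(self, interface: str, direction: str, payload: Any) -> None:
        entry = {
            "t": self._clock(),        # time at which the message was seen
            "interface": interface,    # which subsystem boundary was crossed
            "direction": direction,    # "in" = input to handler, "out" = its output
            "payload": payload,
        }
        self._sink.write(json.dumps(entry) + "\n")
        self._sink.flush()

    def tap(self, interface: str, handler: Callable[[Any], Any]) -> Callable[[Any], Any]:
        """Wrap a handler so that both its input and its output are recorded."""
        def wrapped(message: Any) -> Any:
            self.record(interface, "in", message)
            result = handler(message)
            self.record(interface, "out", result)
            return result
        return wrapped


# Usage: tap a made-up handler that turns a click into a window request.
log = io.StringIO()
recorder = InterfaceRecorder(log)
handle_click = recorder.tap("input->app", lambda msg: {"request": "CreateWindow"})
handle_click({"event": "click", "x": 10, "y": 20})
entries = [json.loads(line) for line in log.getvalue().splitlines()]
```

With both the input and the output of a subsystem on record, a missing or wrong output can later be shown together with the input that should have produced it.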
In this article we will concentrate on the subsystems which are involved in human-computer-interaction.
Subsystems for Human-Computer-Interaction
At first glance, human-computer-interaction means how humans interact with application programs.
However, when looking closer, the situation is more complex.
In Figure 1 we show the subsystems that lie between the applications and the human interface hardware.
You see that an application never communicates directly with the human using it. There are several subsystems between application and human user.
And, of course, any of these subsystems can have malfunctions. This is often overlooked. When you overlook that, you are making an assumption which may or may not be true. We already explained how important it is to work with facts instead of with assumptions.
If any of these subsystems has a failure, then applications cannot be used in a normal way.
Often, it is difficult to tell the difference between an application bug and a problem in any of these subsystems. Here's an example:
If the user clicks with the mouse on a button and expects a popup-window to open, but nothing happens, this could be
- a bug in the application,
- or in the human-input subsystem of the operating system,
- or a hardware problem in the parts of the computer which handle human-input, such as a USB controller,
- or a hardware problem in the mouse,
- or a problem with the graphics subsystem of the OS (e.g., the app correctly sends a CreateWindow request to the graphics subsystem but it is ignored),
- or a problem with the graphics hardware.
Now imagine that this kind of problem has severe consequences, for example that the user cannot prevent a disaster because a mouse click does not work.
How will you know what the cause of that problem was?
(We have listed 6 possibilities for the location of the cause. Locating the subsystem of the cause is just the start of finding out the cause. If you already fail at the start of your analysis, you probably won't get far.)
How will you prevent this problem from reoccurring when you don't know the cause?
What do you tell senior management, government agents, lawyers, victims or their relatives?
You see that this is a serious issue.
In the next section we propose a solution.
Our Proposed Solution
To help with the analysis of malfunctions as described above, we suggest recording the data stream at two places in the data flow, as depicted in orange in Figure 2.
Analyzing the recorded data helps to narrow the location of the malfunction down to one or two of the 6 possibilities listed above.
Here's how this works in detail:
- If the mouse click is not contained in the ATG-recording,
then the mouse did not work properly (or the user has not clicked).
- If the mouse click is contained in the ATG-recording but not in the API-recording,
then the computer hardware or the human-input subsystem of the operating system did not work properly.
- If the mouse click is contained in the API-recording,
but the corresponding CreateWindow request is not in the API-recording,
then the application didn't work properly.
- If the API-recording contains the CreateWindow request but the request
contains wrong data, such as reference to an object that does not exist,
then the application has a bug. (Or this is the consequence of another problem, which will be found out after analyzing why the application has sent a buggy request.)
- If the API-recording contains the CreateWindow request
but the window does not show up in the ATG-recording,
then the graphics subsystem of the OS, or the graphics hardware of your computer did not work properly.
- If the ATG-recording contains the popup-window,
then either this is a very strange malfunction of a monitor (unlikely) or the popup-window was displayed, but the user didn't see it (more likely). (Or the user is lying or has a false memory.)
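The case analysis above can be written down as a small decision function. The following Python sketch is only an illustration. The Evidence fields and Suspect names are hypothetical, and it assumes that an analyst has already reduced the two recordings to yes/no findings. It reads the first case as pointing to the mouse (or to a click that never happened), and the second case as a click that is missing from the API-recording.

```python
from dataclasses import dataclass
from enum import Enum


class Suspect(Enum):
    """Where to look next; one member per case in the list above."""
    MOUSE_OR_USER = "mouse hardware (or the user did not click)"
    INPUT_PATH = "computer hardware or human-input subsystem of the OS"
    APP_NO_REQUEST = "application (no CreateWindow request sent)"
    APP_BAD_REQUEST = "application (CreateWindow request contains wrong data)"
    GRAPHICS_PATH = "graphics subsystem of the OS or graphics hardware"
    MONITOR_OR_USER = "monitor (unlikely) or the user did not see the window"


@dataclass(frozen=True)
class Evidence:
    """Hypothetical yes/no findings extracted from the two recordings."""
    click_in_atg: bool      # mouse click found in the ATG-recording
    click_in_api: bool      # mouse click found in the API-recording
    request_in_api: bool    # CreateWindow request found in the API-recording
    request_valid: bool     # that request carries correct data
    window_in_atg: bool     # popup-window visible in the ATG-recording


def localize(e: Evidence) -> Suspect:
    # Follow the data flow from the input device to the screen and
    # stop at the first point where the expected data is missing or wrong.
    if not e.click_in_atg:
        return Suspect.MOUSE_OR_USER
    if not e.click_in_api:
        return Suspect.INPUT_PATH
    if not e.request_in_api:
        return Suspect.APP_NO_REQUEST
    if not e.request_valid:
        return Suspect.APP_BAD_REQUEST
    if not e.window_in_atg:
        return Suspect.GRAPHICS_PATH
    return Suspect.MONITOR_OR_USER


# Usage: the request was sent correctly, but no window appeared on screen.
incident = Evidence(click_in_atg=True, click_in_api=True,
                    request_in_api=True, request_valid=True,
                    window_in_atg=False)
print(localize(incident).value)
# prints: graphics subsystem of the OS or graphics hardware
```

The checks are ordered along the data flow, so the first missing or wrong item determines the result and later findings are not consulted.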
So, you see that you can discriminate 6 different cases where you previously had everything lumped together.
This means that your search-space for further analysis is reduced considerably: by roughly a factor of 6, if the cause is about equally likely to lie in each of the 6 locations.
And in most cases, you have only one or two vendors to talk to. And you can show them the wrong data their system has generated. Or when some output-data is missing, you can show them the input to their subsystem which should have generated that output-data.
How We Can Help You With This
We can provide the API-recorder: DisplayRecorder is our field-proven software which records human-machine-interaction at API level.
In addition to our DisplayRecorder, you will also need to purchase an ATG-recording device from a third-party vendor. For that you can either search for "at the glass recording" in your favorite search engine, or ask your existing vendors whether they offer an ATG solution.