Returned error codes // what to do with them?
The wrong way
A new application has started. You are full of ideas, eager to start from scratch, and this time it will be much better than the other times. You’ve got a great idea regarding error reporting in your application. Each function will return 0 if it succeeds. It will return a value greater than zero to designate a warning, and a negative value to designate an error. Sounds great! This program is going to be a piece of art!
Time goes by, and you find yourself on a tight schedule. Many features are still missing, which is more than you can say about the bugs: there are more than a handful of them. You might be thinking (as I did): what are those error codes good for anyway? Besides, maybe, not overrunning an array, or avoiding writing to a FLASH memory that returned an error when you tried to unlock it, what else can you do with them? If it’s not working, why waste time on checking them? The schedule is already tight, let’s just code! So you decide to throw away the non-critical tasks that delay you from advancing towards the final goal, and you stop checking returned errors. Indeed there is no point in checking them, because, either way, the application must work. If it does, no errors occur anyway. If it doesn’t, it must be corrected anyway, so why insist on checking returned error codes?
Debugging time arrives. It happens all the time. You find yourself working for hours to figure out why, after the data collection stage, the application gets stuck in the calculation stage. After several trials you find that one of the collected data items contained a corrupted value, a huge value, causing the loop iterating over it to run forever. You look for the cause of the corrupted value, and after some time you find it: the input file, from which you read the data, was renamed two weeks ago, but you forgot to update your application. You haven’t walked that path of the application since, until today. And today the application tried to read a non-existing file, and read whatever it was fed by the “missing file”. So you fix the software, update the file name and the problem is solved. Such a talented programmer you are, you did a great detective job and found it. No bug will ever escape you! And it took only a few hours.
Had that been the only bug, I wouldn’t have had to write this article. The big problem here is that that bug has many bug friends, each of which will decide to attack when it is most inconvenient for you. The following few weeks are going to be real fun!
Is it, somehow, related to ignoring returned errors? (Of course it is! That's what this article is all about). Let’s move on to the right way of doing things.
The right way
First I must confess. It’s not the right way. I don’t know what the right way is. I only know what worked extremely well for me in the last few years. I believe this is going to work also for you. Enough with the philosophy.
Being in the programming business for a few years, you have probably been introduced on more than one occasion to the rule which says that all returned error codes must be checked. Sounds reasonable. I say it’s incomplete. You can check 100% of your functions returning error codes, and then what?
err = do_something(); // now what??
Well, there are a few cases where you can do something. Consider:
int read_device(char *buf, int maxsize)
{
    if (!open_device()) return 0; // fail
    if (amount_to_read() > maxsize) return 0; // fail
    else {
        // Do something hopefully useful here
        // And please, do me a favor and don't forget
        // to close the device. I intentionally didn't
        // close the device, for that requires more
        // error checking, and I wouldn't like this
        // function to get too long.
        // Oops! This comment made it far too long,
        // Never mind..
    }
    return 1; // success
}
In the above example we return a failure code if we failed opening the device, and also if it has more data to read than the supplied array’s size. So it seems that the error checking prevents the system from acting in strange ways.
Or does it? If we hadn’t checked for the error returned from open_device(), we would have certainly filled the array with garbage. But now that we have checked it, we still don’t fill the array with anything useful, because, well, there is a fault in the system. So, either way, whatever the application is supposed to do, it doesn’t do it. The error checking we’ve done saves it from crashing. This is good. But it doesn’t give us a clue where the problem is. We will need to crack open the debugger, try to hunt down the exact place where the problem first struck, locate the faulty device and claim that from this point it’s a hardware team issue, let them fix that device!
And again, what am I supposed to do with the failure code returned from read_device?
What I claim is that checking for returned error codes just to save the application from crashing is certainly not enough. It won’t save our long days of detective work debugging the software. It merely makes the application do nothing and still run. The solution to this is a very important issue. I’ll dedicate the rest of this article to this.
Error reporting
The solution is simple: All error codes and all erroneous situations must be reported.
If we get notified about each and every failure while the application is running, we may be directed to the problematic lines of code in no time.
The above sentence says, actually, a few things, not just one:
- We must report errors.
- We must be notified when they happen.
- We need them for debugging purposes, not for improving the application’s operation.
- The error report directs us straight to the place in the code where the problem occurs, and by this, saves us a long debugging time.
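To make the points above concrete, here is a minimal sketch of the earlier read_device() with a report at each failure point. It is only an illustration under stated assumptions: report_error() and its numeric codes are hypothetical at this stage, and open_device() and amount_to_read() are replaced by stand-ins driven by global flags so the sketch compiles on its own.

```cpp
#include <cstdio>

// Hypothetical stand-ins so the sketch is self-contained; a real system
// would talk to hardware here.
static bool g_device_present = true;  // pretend hardware state
static int  g_pending_bytes  = 0;     // bytes the pretend device wants to deliver
static int  g_last_reported  = 0;     // last code passed to report_error()

bool open_device()    { return g_device_present; }
int  amount_to_read() { return g_pending_bytes; }

// Hypothetical reporting hook: here it just prints and remembers the code.
// It could equally send the code over a link or store it in a log.
void report_error(int code) {
    g_last_reported = code;
    std::fprintf(stderr, "error report %d\n", code);
}

int read_device(char* buf, int maxsize) {
    if (!open_device()) {
        report_error(1);   // device could not be opened
        return 0;          // fail
    }
    if (amount_to_read() > maxsize) {
        report_error(2);   // caller's buffer is too small
        return 0;          // fail
    }
    (void)buf;             // the actual read is omitted, as in the original
    return 1;              // success
}
```

The return values are unchanged, so callers behave exactly as before. The only difference is that each failure path now leaves a trace that points at a specific line.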
The essential thing we must have in our application is an error reporting mechanism. This mechanism consists of 3 components:
- An error reporting API
- A communication link to the outside world (a PC) over which to send the error reports
- A viewer on the outside world (that very PC) on which the reports can be viewed as they occur.
Every potential deviation from the expected behavior of the application must be reported and viewed. Nothing should escape. I remember more than once thinking to myself: “If that specific error, reported on my PC, hadn’t actually shown up, would I ever have had a chance to find it?”
I once attended a presentation about a methodology called design by contract, where special language constructs are provided to catch erroneous situations and try to recover from them. Recovering from errors is very important, when possible, no doubt about that. But this article is not about writing systems that never crash. Rather, I try to deal with an earlier stage of the software: the stage where software development is held back by bug hunting. With a good error reporting mechanism, you can go on and more easily build fault tolerant systems, or whatever you like. The error reporting mechanism is for development only (the reports can also be logged and used for post mortem debugging). It’s for significantly shortening the development time. With good error reporting, the huge time wasted on bug hunting is shortened: the error reports are the bug hunters!
Let’s explain each component of the report-view mechanism.
The reporting API
This is the function that reports an error. In a PC application it may show a popup window reporting the error. If it’s an embedded application, it may send the error report over a communication link to the viewing PC, or log the error in memory, which is periodically polled by a PC over a communication link connected to the embedded system.
The one thing I want to say about this function, which you are going to write, is that it must be super easy to use. It must. Had I known stronger words to emphasize this, I would have used them. If it’s not super easy to use, it won’t be used. When we are concentrating on coding something, I believe we tend to move forward. Write the feature, then go to the next one. Doing an if check on something is perceived as a delay in the process of writing the feature. I don’t like writing conditional code checks, but I do it because I know it’s essential. It’s a burden. This burden must be as light as possible.
A piece of software I worked on had an interesting error report mechanism. It was a PC application, and the guys didn’t want free style error messages. They wanted fixed text, known ahead. They also wanted the system engineers, who were in continuous contact with the customers, to be the ones to set the allowed error messages. So they created an Excel sheet containing all possible error messages. Each error message had an ID associated with it. There was a button on the sheet. Pressing it created a .h file, containing an enum associated with an array of messages, which we, the programmers, included in the application. Now, some of the errors needed parameters, such as a file name for the error message that said that a file could not be found. So a few error reporting functions were defined that took an error enum and zero or more parameters, and displayed the associated message together with the parameters. So whenever you encountered an erroneous situation in the software, which happened quite a bit, you either were familiar with all the possible error messages, or needed to learn them, and find the one that fit the situation, which happened part of the time, or, as happened the other part of the time, no existing error message fit the situation, so you needed to create a new error message, in which case you would turn to the Excel sheet, define the new error message and its ID, click the button to create the new header file, recompile the entire application, and … wait a minute! This is Wroooong (as goes the Simply Red song), how can someone continue coding like this? Soon enough no more error reporting was done.
It took me half an hour to find that an unreported DLL-loading failure caused a NULL-value exception somewhere far away from the unreported failure. What’s funny (?) about that is that I didn’t dare go into the described error reporting mechanism and report the failure using a new message, and a month later the same failure, same bug, same debugging time bit me again! That was the turning point. The way I write software has changed. I understand now that the error reporting mechanism must be very easy to use. Let me say it directly: It must be a one liner.
If you need to do this upon an error:
if (!load_file(filename)) {
    sprintf(msgbuf, "Cannot open file %s", filename);
    errormsg(msgbuf);
}
then it’s one line too many, and chances are good that you are going to miss more than a few opportunities to report errors, because 2 lines for error reporting feel too much like constructing something, too much of a deviation from the important thing you are now working on. You will feel that spending time on message construction is less valuable than the code you write, and you may just skip it. In the PC software I worked on, I came up with a very convenient error reporting API, something like this (C++):
reporterror << "Cannot open file " << filename << " for stage no. " << hex << stage;
Those of you familiar with C++ will recognize the stream-like syntax. Explaining it will deviate from the subject of this article, but I’ll go for it anyway, because it’s cool, and you might want to try it. You may feel free to skip the following description, but you must do it carefully so you won’t miss the part right after it :-). Here is the definition:
#include <iostream>
#include <sstream>
using namespace std;
class ReportError : public ostringstream {
public:
    virtual ~ReportError() {
        cout << str() << endl;
    }
};
So you can go:
ReportError() << "Cannot open file " << filename;
and it will be printed to the console. Of course, in my application ~ReportError() was implemented using a message box instead of writing to the console. One problem I encountered with the compiler I worked with and its associated STL headers is that the temporary object introduced by ReportError() thought that strings fed to it using << were address values instead of strings, so I needed to create a macro that applies a const_cast to it:
#define reporterror (const_cast<ReportError &>(ReportError()))
And then I could use it as shown at the beginning of this description (but with the g++ compiler this is unnecessary, and also doesn’t compile).
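For compilers where the macro trick is not available, one possible alternative is a small class that owns its stream instead of inheriting from it. This is a sketch under assumptions, not the code from my application: it assumes C++11 or later, the class name Report and the "ERROR: " prefix are made up for illustration, and the sink defaults to std::cerr where a real application might show a message box.

```cpp
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical one-liner reporter: collects the message while the temporary
// lives and emits it when the temporary is destroyed at the end of the statement.
class Report {
public:
    explicit Report(std::ostream& sink = std::cerr) : sink_(sink) {}
    Report(const Report&) = delete;
    Report& operator=(const Report&) = delete;

    ~Report() { sink_ << "ERROR: " << text_.str() << '\n'; }

    // Forward any streamable value to the internal buffer.
    template <typename T>
    Report& operator<<(const T& value) {
        text_ << value;
        return *this;
    }

    // Forward formatting manipulators such as std::hex.
    Report& operator<<(std::ios_base& (*manip)(std::ios_base&)) {
        text_ << manip;
        return *this;
    }

private:
    std::ostream&      sink_;  // where the finished message goes
    std::ostringstream text_;  // message under construction
};
```

Usage stays a one liner, for example: Report() << "Cannot open file " << filename << " for stage no. " << std::hex << stage; Because operator<< is a member here, string literals are streamed as text even when the object is a temporary.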
Back from the technical stuff. The above reporting mechanism is so easy to use, that it’s almost fun. I now easily and effortlessly report everything suspicious, and don’t let any unwanted situation slip away. There can be sometimes 4 error reports over one function of 30 lines of code! In my application the reporting mechanism works only in development mode (a global flag is on), so it doesn’t disturb the customers with errors about the internals of the application, but all the errors are always logged.
The API is different in an embedded application, where you are usually limited by time and memory. String construction is time consuming, especially if there are a lot of strings (as there should be) and they use dynamic memory allocation. If that’s not a problem in your system, use strings. If it is, however, a different approach should be taken. An embedded application also can’t just pop up messages. As I said earlier, it must send the error over some kind of link to a PC. However, the requirement for a super easy API for reporting errors still holds and is still essential.
An embedded system’s error reporting mechanism indeed requires more creativity. I’ll tell you what I did on one of the real-time embedded applications I worked on, not so long ago. I allocated a small memory buffer for recording errors, a kind of tiny error log. Every entry consisted of two numbers, an error code and the number of times it had been reported. I allocated enough space for just about 20 entries (which makes 40 integers). As it turned out, this was more than enough, for usually the buffer remained clean, and when an error struck, it was listed there, together with a few more errors derived from it. So 20 entries were enough.
Each new report used a fresh new error code. “Wait a minute!” you say. “This makes you maintain a database of error codes. Each time you want to report a new error, you will need to check for the last error code and increment that number. It’s not so easy!” You are absolutely right. I used (hang on tight) some automation that filled in the numbers for me. Each time I wanted to report an error, I merely wrote something like ~err~. Many IDEs support running whatever application you like in the pre-build and post-build steps. So I configured mine to run a script that I wrote in its pre-build step. The script scanned my entire code, and replaced all the occurrences of the ~err~ pattern with a call to the error reporting function, using a fresh new code. It replaced all those placeholders with something similar to report_error(83), where 83 is an example of a fresh new code. The details of that script, although interesting, are not within the scope of this article. Perhaps I’ll publish that script one day, but I must say, it’s not at all difficult to write. It’s the place where the paths of Python and embedded programming cross.
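The core of such a substitution is small. The sketch below is not the script itself (that one was written in Python); it only illustrates the replacement step in C++ on a string already loaded in memory. The function name expand_placeholders is hypothetical, and walking the source tree, rewriting the files, and remembering the last used code between builds are all left out.

```cpp
#include <string>

// Replaces every "~err~" placeholder in `source` with "report_error(N)",
// numbering from `next_code` upward. Returns the first code not yet used,
// which a real tool would have to persist for the next run.
int expand_placeholders(std::string& source, int next_code) {
    const std::string placeholder = "~err~";
    std::string::size_type pos = 0;
    while ((pos = source.find(placeholder, pos)) != std::string::npos) {
        const std::string call =
            "report_error(" + std::to_string(next_code++) + ")";
        source.replace(pos, placeholder.size(), call);
        pos += call.size();  // continue after the text just inserted
    }
    return next_code;
}
```

Since the placeholder is gone after the first run, code that was already expanded keeps its number, and only newly written ~err~ occurrences receive fresh codes.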
So, again, the error reporting API was super easy to use. Just write ~err~ and it’s reported. The function report_error(int code) simply stored the error code in the 20-entry buffer. If the code already existed there, it merely increased its counter. If not, it stored it in the next free place. If no more room remained, the report was discarded. This brings me to the next two components of the error reporting mechanism.
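The behavior just described can be sketched in a few lines. This is a minimal illustration, not the original code: the names ErrorEntry, g_error_log and g_error_log_used are assumptions, and the sketch assumes report_error() is never called concurrently (for example from an interrupt while the main loop is inside it).

```cpp
#include <cstddef>

struct ErrorEntry {
    int      code;   // code filled in by the pre-build substitution
    unsigned count;  // number of times this code has been reported
};

constexpr std::size_t kErrorLogCapacity = 20;  // about 20 entries, as in the text

ErrorEntry  g_error_log[kErrorLogCapacity] = {};
std::size_t g_error_log_used = 0;

void report_error(int code) {
    // Already known: just bump its counter.
    for (std::size_t i = 0; i < g_error_log_used; ++i) {
        if (g_error_log[i].code == code) {
            ++g_error_log[i].count;
            return;
        }
    }
    // New code: store it in the next free place, if there is one.
    if (g_error_log_used < kErrorLogCapacity) {
        g_error_log[g_error_log_used].code  = code;
        g_error_log[g_error_log_used].count = 1;
        ++g_error_log_used;
    }
    // Buffer full: the report is discarded.
}
```

There is no allocation and no string handling, so the cost of a report is a short linear scan over at most 20 entries.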
Viewing the errors
In the PC example the very same function that is used to report the error also shows it as a popup dialog box. So there is nothing interesting to talk about. The situation is usually different for an embedded system. The ~err~ super-simple API I talked about in the previous section just stores the error in a buffer. We must be able to view that buffer in real time. If this can’t be achieved, it’s just as if we had never reported any error. Error reporting doesn’t exist. Error reporting must consist of two things: a super simple API and a real-time error viewer. If one of these components is missing, then the whole thing is worthless.
So there are two things you must now do. The first thing is to create a communication link with a PC. Be it Ethernet, serial, non-standard, whatever. If the system you work on doesn’t have any such link, it’s a problem. This kind of link must be considered at the beginning of every project. The second thing to do is to create a PC application which uses that link to read the error buffer from the embedded system and to display it. It can periodically poll for that buffer. This means that some request-reply command scheme with the embedded system must also be established. This isn’t so hard to do. Sometimes a link with a PC is already a requirement for your project, and the only thing left to do is create a command to read the error buffer. However, if that link doesn’t exist, you must create it. If you don’t, it’s exactly as if you simply don’t have an error reporting mechanism.
I consider this so essential, that I claim that creating a link to a PC and writing an error viewer for the PC is, actually, the first thing to do in every new embedded application. It will loyally serve you from the very beginning of the project.
Back to the embedded system example. Suppose, for example, the PC viewer shows an error code of 83 in the errors buffer. I notice an error showing there, so I search my code for report_error(83) and arrive at the very line of code where the problem originated. No error ever slips away. This mechanism is simply precious.
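On the PC side, the viewer's job after each poll can be as simple as comparing the new snapshot of the buffer with the previous one. The sketch below shows only that comparison and is an assumption throughout: the LogEntry layout, the function name describe_changes and the message format are invented for illustration, and the link and the request-reply command that fetch the snapshot are not shown.

```cpp
#include <string>
#include <vector>

struct LogEntry {
    int      code;   // error code as stored by the embedded side
    unsigned count;  // times it has been reported so far
};

// Returns one line per error code that is new or whose counter grew
// since the previous poll.
std::vector<std::string> describe_changes(const std::vector<LogEntry>& previous,
                                          const std::vector<LogEntry>& current) {
    std::vector<std::string> lines;
    for (const LogEntry& now : current) {
        unsigned before = 0;  // 0 means the code was not seen in the last poll
        for (const LogEntry& old : previous) {
            if (old.code == now.code) {
                before = old.count;
                break;
            }
        }
        if (now.count > before) {
            lines.push_back("error " + std::to_string(now.code) +
                            ": count " + std::to_string(before) +
                            " -> " + std::to_string(now.count) +
                            " (search the code for report_error(" +
                            std::to_string(now.code) + "))");
        }
    }
    return lines;
}
```

Printing only the changes keeps the viewer quiet while nothing happens, so a new line on the screen is itself the notification.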
Start reporting everything, making it a habit. When you make your error reporting mechanism easy to use, you will notice that you use it a lot, and, as a result, the way you code changes. I recall writing in the past:
// x is 1 or 2
if (x == 1) do_this();
else do_that();
Those days are over. Now I write:
// x is 1 or 2
if (x == 1) do_this();
else if (x == 2) do_that();
else ~err~;
Oh, and one last thing. The error viewer that runs on the PC – please make a habit to activate it and use it all the time. It turns out that error reporting can be so easily held back for so many reasons. I sometimes found myself trying to hunt a bug that was actually pinpointed by my error reporting mechanism. It’s just I forgot to turn it on and look at it!
Summary
Here are a few numbers. The last project I worked on was pretty small. It had about 5000 lines of code and slightly above 100 error reports (using the ~err~ substitutions). It was the only project I’ve started from scratch with error reporting in mind, and its ratio of error reports to lines of code was about 1:50. The error reporting was good and helpful in that project. Nevertheless, even a 1:50 ratio seems to me a bit too sparse. But comparing this to two other projects I worked on, to which I added an easy to use error reporting mechanism only after they had already reached maturity, a difference can be clearly seen. The first project was an embedded one, already containing about 40000 lines of code when I started reporting errors using the ~err~ substitutions. On its release it contained 47000 lines of code and about 280 error codes. This made a ratio of 1:167, one error report per 167 lines. A PC application I worked on, consisting of about 150000 lines of code, eventually had 250 error reports, which makes a 1:600 ratio. That project, although developed on a PC using a good IDE, was very difficult to debug due to the lack of error reports. Now, with a 1:600 reports ratio, I can gladly say, the situation has much improved. So these are the numbers.
Do you report errors? What is your error-reports to lines-of-code ratio?
Instead of staring at the screen, scratching your head and wondering what went wrong, build the error reporting mechanism at the beginning of the project, or right now, if you haven’t done so. In order to be effective it must consist of two things:
- A super simple API for reporting errors
- An error viewer, which must be on all the time
You may need to participate in the system design, and require a physical communication link for the system. It’s essential.
Finally, you will have to write a PC application for viewing errors using the link. It’s all easy. Just realize it’s very important.

