Chapter 9: Miscellany 2

9.1. Reading characters

9.2. Arrays of strings

9.3. Identifier conventions

9.4. The use of #define

9.5. Debugging your program

9.6. Quotations

9.1. Reading characters

There are many programming applications in which we have to read text a character at a time to analyse it in some way.

1. A word processor. We are reading text from the user, and laying it out on a page according to certain rules, perhaps padding out lines to a set width using additional spacing.
2. A spelling checker. We are reading text, splitting it into separate words, and comparing the words with those in a "dictionary" to check whether they occur in it. The dictionary is just a list of words; on UNIX it is stored in the file /usr/dict/words, which is publicly readable and should contain about 30000 words.
On UNIX, use the command
to check the words in the named file against British spelling.
3. A compiler. The compiler is reading a program, and converting it into equivalent executable instructions for a particular processor.
4. Analysis of Ceilidh program output under test. The program is run against some test data, and the output is searched to see if expected words and phrases occur.
5. A text editor such as "ed" or "vi" or "emacs".

The simple use of the standard input "cin" to read characters as in

char ch;
while(
    cin >> ch,
    ch != '.'
) { ...

has the logical but inconvenient habit of ignoring all white space. Any spaces, tab characters or newlines in the input are completely invisible to the program. The loop in the above example reads characters from the input until a specific character (a full stop / period) is encountered.

In all of the applications discussed above, we need to be aware of white space characters. We must use the function

cin.get( ch )

to read each character, and we need a new technique to detect the end of the data. To indicate the end of data from a terminal, you type "control-D". This is referred to as "end of file" or "EOF" whether we are reading from the terminal (and encounter a control-D) or from a genuine file and encounter the end of the file. The C++ feature for this purpose is the function

cin.eof()

which delivers TRUE if we are at the end of the data, and FALSE otherwise. This detects end-of-file when reading from "cin" as in

prog99 < data_file

If you are reading from a file which you have opened from within the program, you may need to use "fin.eof()" instead.

We thus have program outlines such as

while ( ! cin.eof() ) {
  cin.get( ch );
  if (
    ch == ' '
    || ch == '\t' // tab
    || ch == '\n' // newline
  ) { ...
    ....;
  } // end if white space
  ....;
} // end while not EOF

while ( ! cin.eof() ) {
  cin.get( ch );
  switch( ch ) {
  case 'p' : ....; break;
  case 'q' : ....; break;
  default  : ....;
  } // end switch
  ....;
} // end while not EOF

while ( ! cin.eof() ) {
  cin.get( ch );
  if ( ch >= 'a' && ch <= 'z' ) {
    ....;
  }
  ....;
} // end while not EOF
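
As a minimal sketch of the outlines above, the following function counts white space, lower-case letters and other characters on any input stream. The names "countChars", "CharCounts" and "countWithExtraction", and the use of an "istream &" parameter (so that the same code works for "cin", an opened file or a string), are assumptions made for illustration and are not part of the course code. One deliberate difference from the outlines: the loop tests the result of "get" itself rather than calling "eof()" first. The "eof()" function normally becomes true only after a read has failed, so a loop that tests it before reading can process a stale character once at the end of the data.

```cpp
#include <istream>
#include <sstream>   // only needed for the usage example below

// Hypothetical helper type: tallies of the character classes seen.
struct CharCounts {
    int white;   // spaces, tabs and newlines
    int lower;   // 'a' to 'z'
    int other;   // everything else
};

// Read every character from "in", including white space, and classify it.
CharCounts countChars( std::istream &in ) {
    CharCounts counts = { 0, 0, 0 };
    char ch;
    while ( in.get( ch ) ) {          // becomes false once a read fails (EOF)
        if ( ch == ' ' || ch == '\t' || ch == '\n' ) {
            counts.white++;
        } else if ( ch >= 'a' && ch <= 'z' ) {
            counts.lower++;
        } else {
            counts.other++;
        }
    }
    return counts;
}

// For comparison: count the characters seen when reading with ">>",
// which silently skips all white space.
int countWithExtraction( std::istream &in ) {
    int n = 0;
    char ch;
    while ( in >> ch ) {
        n++;
    }
    return n;
}
```

To try it without typing data, wrap a string in a "std::istringstream" and pass that instead of "cin"; for the eight characters of "ab c\n\tD." the first function sees three white space characters, while the second sees only the five non-blank ones.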

What is a word?

While reading characters, many of the applications above require us to split the input text into words. If we are word processing (laying out text on a page), each word includes its terminating punctuation. If a word and its punctuation cannot be fitted onto a line, it must all be moved to the start of the next line. We determine the end-of-word by looking for white space.

If we are writing a spelling checker, we split the text into words and compare them with words in a dictionary. In this case, the terminating punctuation is not part of the word. We determine end-of-word by finding a non-alpha character, i.e. one which is not a letter.

The definition of a word thus depends on the application.

Program structure

Note the possible codings for reading a word at a time from the data. One possibility often seen is

while( ! cin.eof() ) {
  while( get char, it isn't a letter ) {
    skip it;
  }
  while( get char, it is a letter ) {
    store it;
  }
  process the word just read;
} // end while ! eof loop

With this structure, both inner loops would also have to contain checks for end of file. The problem simply does not need nested loops.

A much better and safer solution is

while( ! cin.eof() ) {
  cin.get( ch );
  if ( ch is a letter ) {
    store it;
  } else if ( ch is the first non-letter ) {
    process the word now stored;
  } else {
    skip it;
  }
} // end while ! eof loop
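
The single-loop structure can be sketched in real code as follows, using the spelling-checker definition of a word (a run of letters). This is only an illustration under stated assumptions: the names "isLetter" and "splitWords" are hypothetical, "std::string" and "std::vector" are used to store the words (the course itself has used character arrays), only plain ASCII letters count as letters, and "process the word" here simply means appending it to a list. Note the extra test after the loop, which is needed if the data ends in the middle of a word.

```cpp
#include <istream>
#include <sstream>   // only needed for the usage example below
#include <string>
#include <vector>

// Is "ch" a letter?  (Assumption: plain ASCII letters only.)
bool isLetter( char ch ) {
    return ( ch >= 'a' && ch <= 'z' ) || ( ch >= 'A' && ch <= 'Z' );
}

// Split the input into words using one loop and no nested loops.
std::vector<std::string> splitWords( std::istream &in ) {
    std::vector<std::string> words;
    std::string word;                 // letters stored so far
    char ch;
    while ( in.get( ch ) ) {
        if ( isLetter( ch ) ) {
            word += ch;               // store it
        } else if ( !word.empty() ) {
            words.push_back( word );  // first non-letter: process the word
            word.clear();
        }                             // otherwise: skip it
    }
    if ( !word.empty() ) {
        words.push_back( word );      // data ended in the middle of a word
    }
    return words;
}
```

Called on a "std::istringstream" holding "Hello, world!  It's", this sketch yields the four words "Hello", "world", "It" and "s"; whether the apostrophe should really split a word is exactly the kind of decision that depends on the application.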

9.2. Arrays of strings

It is often useful to have a number of strings stored as an array, so that we can print the i-th string of a set, or search for a command name among a set of alternatives.

We declare (at global level, because it initialises an array)

const char *months[] = {	// "const" because string constants must not be modified
    "January",
    "February",
    "March",
    "April",
    ""
};

This gives us

months[0]	is the string	"January"
months[1]	is the string	"February"
months[2]	is the string	"March"
months[3]	is the string	"April"
months[4]	is the string	""

We could thus print the name of the i-th month using

cout << months[i] << "\n";

The first character of the name of the i-th month is

months[i][0]

We could search through the strings for a particular string

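char word[20];
int i;
cin >> setw( 20 ) >> word;	// setw (from <iomanip>) limits the input to 19 characters
for( i = 0; months[i][0]; i++ ) {	// the empty string "" marks the end of the list
    if ( strcmp( word, months[i] ) == 0 ) {	// strcmp is declared in <string.h>
	// found it ...
    }
}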
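
The search can be packaged as a function that reports where the word was found. The following is a minimal sketch; the names "monthNames" and "findMonth", and the convention of returning -1 for "not found", are assumptions chosen for illustration.

```cpp
#include <cstring>

// Table of names, terminated by an empty string as in the text.
const char *monthNames[] = { "January", "February", "March", "April", "" };

// Return the index of "word" in the table, or -1 if it is not there.
int findMonth( const char *word ) {
    for ( int i = 0; monthNames[i][0] != '\0'; i++ ) {
        if ( std::strcmp( word, monthNames[i] ) == 0 ) {
            return i;
        }
    }
    return -1;
}
```

With this table, findMonth( "March" ) gives 2 and findMonth( "May" ) gives -1. Because the empty string is the terminator, it can never itself be found as an entry.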

This topic properly belongs to the study of pointers (covered fully in the next course, and summarised without exercises in an extra unit at the end of this one), but some useful applications are described below.

Arguments to the program.

When you type a UNIX command as in

prog99 this that other

the system generates two arguments which the program can access if it wishes. To access the given parameters, the main program should start

int main( int argc, char *argv[] ) { ....

The program is then supplied by the system on startup with two arguments.

The first is an integer, and is set to the number of arguments plus one (i.e. the number of words on the command line including the command name itself).

The second is a "char *[]" containing as strings the command name and the arguments.

Thus in the above example, the arguments are set up as if we had included

int argc = 4;
char *argv[] = {
  "prog99",
  "this",
  "that",
  "other",
  0
};

The convention is to use "argc" (argument count) and "argv" (argument values) as the argument names, although such names are purely local to your main program.

Thus we now have the value of argv[0] as the string "prog99", of argv[1] as the string "this", etc.

Accessing the arguments

We can now check that there is at least one argument using

if ( argc > 1 ) { ...

(the "cp" command for example always checks that it has at least two arguments) and we can access its value by

cout << argv[1] ...

Each of the strings in the array will have a terminating zero on it; and the array of strings finishes with a zero.

The program can loop through all the parameters in turn with

int argno;
for( argno = 1; argno < argc; argno++ ) {
  cout << argv[ argno ] << " ";
}
cout << "\n";

This will print the arguments on a single line separated by spaces. This corresponds to the "echo" command in UNIX.

Note that these are not global variables; they are parameters to the main program. They can therefore be accessed only from within the main program, or by being passed as parameters to other functions.

Wild-cards in arguments

Note that if you type, for example

prog99 *.C

then the UNIX shell first expands the "*.C" into the names of all files in the current directory ending ".C", and passes all of these over to the program. The program may thus find large numbers of parameters being passed to it. The asterisk generally will not appear in the program's parameters. Thus the command

echo *.C

echoes the names of all files ending ".C".

If there is NO file matching the requested pattern, the Bourne shell and its derivatives pass over the parameter as a string containing the asterisk.

UNIX-type flags

A program can detect flag arguments (arguments starting with a '-') by a construct such as

int argno;
for ( argno = 1; argno < argc && argv[ argno ][ 0 ] == '-'; argno++ ) {
// argument "argno" is a flag
  switch( argv[ argno ][ 1 ] ) {
    case 'l' : ...; break;
    case 't' : ...; break;
    ....;
  } // end switch
} // for all arguments

The arguments starting with a '-' are examined in turn, and actions taken depending on the letter following the '-'. We leave the loop with "argno" indicating the first argument NOT starting with a '-'.
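
A self-contained sketch of this technique is given below. The structure "Options", its two members and the function name "parseFlags" are hypothetical, as are the meanings given to "-l" and "-t". The sketch tests "argno < argc" before looking at an argument, so that a command line consisting only of flags does not run past the end of the array.

```cpp
// Hypothetical settings controlled by the flags -l and -t.
struct Options {
    bool longFormat;   // set by -l
    bool sortByTime;   // set by -t
};

// Examine the leading flag arguments.  Returns the index of the first
// argument that is not a flag (equal to argc if there is none).
int parseFlags( int argc, char *argv[], Options &opts ) {
    opts.longFormat = false;
    opts.sortByTime = false;
    int argno;
    for ( argno = 1; argno < argc && argv[argno][0] == '-'; argno++ ) {
        switch ( argv[argno][1] ) {
        case 'l': opts.longFormat = true; break;
        case 't': opts.sortByTime = true; break;
        default:  break;   // unknown flag: ignored in this sketch
        }
    }
    return argno;
}
```

A main program would call this once at the start, and then treat the arguments from the returned index onwards as, for example, file names.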

Values from arguments

It is possible to read numeric values from arguments. For example, you may wish to give the rate of pay and hours worked as integer arguments, and type

prog32 152 45

instead of typing the values as data. The above arrangements would set "argv[1]" to the string "152", and "argv[2]" to the string "45".

The program could then use the library function "atoi" (ASCII to integer) and write

if ( argc > 2 ) {
  rate  = atoi( argv[1] );
  hours = atoi( argv[2] );
}

There is a similar function, "atof", which delivers floating point values.

Always check that there are enough arguments before trying to access them.
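
Put together, the check and the conversions might look like the sketch below. The function name "readPayArgs" and its reference parameters are assumptions for illustration. Bear in mind that "atoi" gives no indication of an error: an argument that does not start with a number simply yields 0.

```cpp
#include <cstdlib>

// Fetch the rate and hours from the arguments if both are present.
// Returns false, leaving "rate" and "hours" untouched, if there are too few.
bool readPayArgs( int argc, char *argv[], int &rate, int &hours ) {
    if ( argc <= 2 ) {
        return false;              // not enough arguments
    }
    rate  = std::atoi( argv[1] );
    hours = std::atoi( argv[2] );
    return true;
}
```

For the command "prog32 152 45" this sets the rate to 152 and the hours to 45; if it returns false, the program can print a message explaining how it should be called.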

The environment

There is a third argument available if you wish, containing details of the program's running environment.

If you write

int main( int argc, char *argv[], char *envp[] ) {

as the program heading, the additional third parameter to the main program is another array of strings, this time set to a value such as

char *envp[] = {
  "USER=ef",
  "HOME=/staff/ef",
  "TERM=vt100",
  "EDITOR=emacs",
  "SHELL=bash",
  0
};

Each string in the array is of the form

<environment variable>=<value>

You could search this array to find the settings for any variable in which you are interested.
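
Such a search might be sketched as follows. The function name "lookupEnv" and its interface are hypothetical; the sketch relies only on the layout described above, namely strings of the form name=value with a zero pointer marking the end of the array.

```cpp
#include <cstring>

// Search an environment array (terminated by a 0 pointer) for "name".
// Returns a pointer to the text after "name=", or 0 if it is not found.
const char *lookupEnv( const char *const envp[], const char *name ) {
    std::size_t len = std::strlen( name );
    for ( int i = 0; envp[i] != 0; i++ ) {
        // The name must match exactly and be followed at once by '='.
        if ( std::strncmp( envp[i], name, len ) == 0 && envp[i][len] == '=' ) {
            return envp[i] + len + 1;
        }
    }
    return 0;
}
```

The test for '=' immediately after the name matters: without it, a search for "TER" would wrongly match the entry "TERM=vt100".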

To make life easier, you can include the standard header

#include <stdlib.h>

and use the library function "getenv" as in

char *term = getenv( "TERM" );
char *edit = getenv( "EDITOR" );

This is straying into the territory of pointers, which properly belongs to the next course.

9.3. Identifier conventions

Every large company has its conventions for choosing identifiers. We have not enforced any particular convention. Some companies have conventions with which we violently disagree (such as "identifiers for integer variables must start with 'i' or 'j' or 'k' ...").

Identifier names should certainly be meaningful. They will therefore often consist of several words. Some users write them with a capitalised initial letter for each word (such as TotalCostPerHour for example), while others prefer total_cost_per_hour with underscores separating the components.

Older compilers sometimes limited the length of identifiers to eight significant characters (if the identifier was longer, only the first eight characters were significant) but there is no known C++ compiler with such a restriction. That is an advantage of a modern language.

9.4. The use of #define

There are occasions where the "#define" facility of the C++ preprocessor can be useful.

If you write

#define MAX 150

near the top of a program, then everywhere that the identifier "MAX" occurs in the text, it is substituted by the string "150" before the program is passed to the compiler.

The string which is substituted can be absolutely any string of characters. Thus if you write

#define NU "The University of Nottingham"

the given string (including the quote symbols) will be substituted at every occurrence. This will result in many occurrences of the actual string. For a specific value such as this, you would normally use a global constant in C++.

If you define

#define TOTAL (n_small + n_medium + n_large)

then every occurrence of "TOTAL" will be substituted by the given expression, which will be compiled and re-evaluated at every run-time encounter in the program.

Consider

#define EVER ;;

and

for ( EVER ) { ...
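
The three definitions above can be seen at work in the following sketch. The function names "countToMax" and "totalDifference" are hypothetical and exist only to exercise the macros; the comments show the text the compiler receives after substitution.

```cpp
#define MAX 150
#define TOTAL (n_small + n_medium + n_large)
#define EVER ;;

// Count upwards until MAX is reached, using the "for ( EVER )" idiom.
int countToMax() {
    int n = 0;
    for ( EVER ) {            // becomes: for ( ;; )
        if ( n >= MAX ) {     // becomes: if ( n >= 150 )
            break;
        }
        n++;
    }
    return n;
}

// TOTAL is re-evaluated at every encounter, using whatever variables
// called n_small, n_medium and n_large are in scope at that point.
int totalDifference() {
    int n_small = 1;
    int n_medium = 2;
    int n_large = 3;
    int first = TOTAL;        // becomes: (n_small + n_medium + n_large)
    n_large = 10;
    int second = TOTAL;       // same text again, now with the new n_large
    return second - first;
}
```

Because the substitution is purely textual, "TOTAL" only compiles where all three variables are visible, which is one reason to prefer constants and functions where they will do the job.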

9.5. Debugging your program

What can I say? This is a skill you must develop yourself.

Develop your program in small steps. Don't write 200 lines of code and expect it all to work and be easy to debug. Create the program in stages, testing each stage as you develop it. You will learn with experience how much it is wise to add at a time.

Put additional printing statements into the program until you are sure that it works. This way you can check intermediate results, and convince yourself that results are correct.
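
As a small illustration of such printing statements, the sketch below reports the intermediate results of a loop. The function "sumTo" and the "DEBUG:" prefix are assumptions made for this example. The trace is written to a stream passed as a parameter; in practice one might pass "std::cerr", so that debugging output stays separate from the program's normal output on "cout".

```cpp
#include <iostream>  // for std::cerr in the usage note
#include <ostream>
#include <sstream>   // only needed to capture the trace in a string

// Sum the values 1..n, reporting each intermediate result on "trace".
int sumTo( int n, std::ostream &trace ) {
    int total = 0;
    for ( int i = 1; i <= n; i++ ) {
        total += i;
        trace << "DEBUG: i=" << i << " total=" << total << "\n";
    }
    return total;
}
```

Calling sumTo( 3, std::cerr ) returns 6 and prints one "DEBUG:" line for each pass round the loop. Once the function is trusted, the trace line can be removed or commented out.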

Re-use previously developed and tested code wherever possible. This may mean your own code, but more often means the use of library code. It is usually worth the effort of looking up library functions for many operations.

9.6. Quotations

Your programs are intended to be useful, and to be economic to develop and run. In this context

Perfection is the enemy of quality. [ref 1] M Tvrdikova, J Tvrdik, Human Factors and the Design of Interactive Applications, Proceedings of the International Conference on Computer Based Learning in Science, Vienna (December 1993)

Appendix: Programming standards

I have appended for interest a summary of the C++ programming standards used by one industrial company, "Ellemtel". The actual description of the standards occupies 82 pages; this is just a one-line summary of each of them. We hope that one day Ceilidh will enforce ALL of them!

They are divided into "Rules" (general application), "Recommendations" (recommended, not mandatory) and "Portability Rules" (only needed for applications which need to be portable, which for any large organisation would mean all applications).

Summary of Rules

Rule 0 Every time a rule is broken, this must be clearly documented.

Rule 1 Include files in C++ always have the file name extension ".h".

Rule 2 Implementation files in C++ always have the file name extension ".cc".

Rule 3 Inline definition files always have the file name extension ".icc".

Rule 4 Every file that contains source code must be documented with an introductory comment that provides information on the file name and its contents.

Rule 5 All files must include copyright information.

Rule 6 All comments are to be written in English.

Rule 7 Every include file must contain a mechanism that prevents multiple inclusions of the file.

Rule 8 When the following kinds of definitions are used (in implementation files or in other include files), they must be included as separate include files:

classes that are used as base classes,

classes that are used as member variables,

classes that appear as return types or as argument types in function/member function prototypes.

function prototypes for functions/member functions used in inline member functions that are defined in the file.

Rule 9 Definitions of classes that are only accessed via pointers (*) or references (&) shall not be included as include files.

Rule 10 Never specify relative UNIX names in #include directives.

Rule 11 Every implementation file is to include the relevant files that contain:

declarations of types and functions used in the functions that are implemented in the file.

declarations of variables and member functions used in the functions that are implemented in the file.

Rule 12 The identifier of every globally visible class, enumeration type, type definition, function, constant, and variable in a class library is to begin with a prefix that is unique for the library.

Rule 13 The names of variables, constants, and functions are to begin with a lowercase letter.

Rule 14 The names of abstract data types, structures, typedefs, and enumerated types are to begin with an uppercase letter.

Rule 15 In names which consist of more than one word, the words are written together and each word that follows the first is begun with an uppercase letter.

Rule 16 Do not use identifiers which begin with one or two underscores (`_' or `__').

Rule 17 A name that begins with an uppercase letter is to appear directly after its prefix.

Rule 18 A name that begins with a lowercase letter is to be separated from its prefix using an underscore (`_').

Rule 19 A name is to be separated from its suffix using an underscore (`_').

Rule 20 The public, protected, and private sections of a class are to be declared in that order (the public section is declared before the protected section which is declared before the private section).

Rule 21 No member functions are to be defined within the class definition.

Rule 22 Never specify public or protected member data in a class.

Rule 23 A member function that does not affect the state of an object (its instance variables) is to be declared const.

Rule 24 If the behaviour of an object is dependent on data outside the object, this data is not to be modified by const member functions.

Rule 25 A class which uses "new" to allocate instances managed by the class, must define a copy constructor.

Rule 26 All classes which are used as base classes and which have virtual functions, must define a virtual destructor.

Rule 27 A class which uses "new" to allocate instances managed by the class, must define an assignment operator.

Rule 28 An assignment operator which performs a destructive action must be protected from performing this action on the object upon which it is operating.

Rule 29 A public member function must never return a non-const reference or pointer to member data.

Rule 30 A public member function must never return a non-const reference or pointer to data outside an object, unless the object shares the data with other objects.

Rule 31 Do not use unspecified function arguments (ellipsis notation).

Rule 32 The names of formal arguments to functions are to be specified and are to be the same both in the function declaration and in the function definition.

Rule 33 Always specify the return type of a function explicitly.

Rule 34 A public function must never return a reference or a pointer to a local variable.

Rule 35 Do not use the preprocessor directive #define to obtain more efficient code; instead, use inline functions.

Rule 36 Constants are to be defined using const or enum; never using #define.

Rule 37 Avoid the use of numeric values in code; use symbolic values instead.

Rule 38 Variables are to be declared with the smallest possible scope.

Rule 39 Each variable is to be declared in a separate declaration statement.

Rule 40 Every variable that is declared is to be given a value before it is used.

Rule 41 If possible, always use initialization instead of assignment.

Rule 42 Do not compare a pointer to NULL or assign NULL to a pointer; use 0 instead.

Rule 43 Never use explicit type conversions (casts).

Rule 44 Do not write code which depends on functions that use implicit type conversions.

Rule 45 Never convert pointers to objects of a derived class to pointers to objects of a virtual base class.

Rule 46 Never convert a const to a not-const.

Rule 47 The code following a case label must always be terminated by a break statement.

Rule 48 A switch statement must always contain a default branch which handles unexpected cases.

Rule 49 Never use goto.

Rule 50 Do not use malloc, realloc or free.

Rule 51 Always provide empty brackets ("[]") for delete when deallocating arrays.

Summary of Recommendations

Rec. 1 Optimize code only if you know that you have a performance problem. Think twice before you begin.

Rec. 2 If you use a C++ compiler that is based on Cfront, always compile with the +w flag set to eliminate as many warnings as possible.

Rec. 3 An include file should not contain more than one class definition.

Rec. 4 Divide up the definitions of member functions or functions into as many files as possible.

Rec. 5 Place machine-dependent code in a special file so that it may be easily located when porting code from one machine to another.

Rec. 6 Always give a file a name that is unique in as large a context as possible.

Rec. 7 An include file for a class should have a file name of the form <class name> + extension. Use uppercase and lowercase letters in the same way as in the source code.

Rec. 8 Write some descriptive comments before every function.

Rec. 9 Use // for comments.

Rec. 10 Use the directive #include "filename.h" for user-prepared include files.

Rec. 11 Use the directive #include <filename.h> for include files from libraries.

Rec. 12 Every implementation file should declare a local constant string that describes the file so the UNIX command "what" can be used to obtain information on the file revision.

Rec. 13 Never include other files in an ".icc" file.

Rec. 14 Do not use typenames that differ only by the use of uppercase and lowercase letters.

Rec. 15 Names should not include abbreviations that are not generally accepted.

Rec. 16 A variable with a large scope should have a long name.

Rec. 17 Choose variable names that suggest the usage.

Rec. 18 Write code in a way that makes it easy to change the prefix for global identifiers.

Rec. 19 Encapsulate global variables and constants, enumerated types, and typedefs in a class.

Rec. 20 Always provide the return type of a function explicitly.

Rec. 21 When declaring functions, the leading parenthesis and the first argument (if any) are to be written on the same line as the function name. If space permits, other arguments and the closing parenthesis may also be written on the same line as the function name. Otherwise, each additional argument is to be written on a separate line (with the closing parenthesis directly after the last argument).

Rec. 22 In a function definition, the return type of the function should be written on a separate line directly above the function name.

Rec. 23 Always write the left parenthesis directly after a function name.

Rec. 24 Braces ("{}") which enclose a block are to be placed in the same column, on separate lines directly before and after the block.

Rec. 25 The flow control primitives if, else, while, for and do should be followed by a block, even if it is an empty block.

Rec. 26 The dereference operator `*' and the address-of operator `&' should be directly connected with the type names in declarations and definitions.

Rec. 27 Do not use spaces around `.' or `->', nor between unary operators and operands.

Rec. 28 Use the c++ mode in GNU Emacs to format code.

Rec. 29 Access functions are to be inline.

Rec. 30 Forwarding functions are to be inline.

Rec. 31 Constructors and destructors must not be inline.

Rec. 32 Friends of a class should be used to provide additional functions that are best kept outside of the class.

Rec. 33 Avoid the use of global objects in constructors and destructors.

Rec. 34 An assignment operator ought to return a const reference to the assigning object.

Rec. 35 Use operator overloading sparingly and in a uniform manner.

Rec. 36 When two operators are opposites (such as == and !=), it is appropriate to define both.

Rec. 37 Avoid inheritance for parts-of relations.

Rec. 38 Give derived classes access to class type member data by declaring protected access functions.

Rec. 39 Do not attempt to create an instance of a class template using a type that does not define the member functions which the class template, according to its documentation, requires.

Rec. 40 Take care to avoid multiple definition of overloaded functions in conjunction with the instantiation of a class template.

Rec. 41 Avoid functions with many arguments.

Rec. 42 If a function stores a pointer to an object which is accessed via an argument, let the argument have the type pointer. Use reference arguments in other cases.

Rec. 43 Use constant references (const &) instead of call-by-value, unless using a pre-defined data type or a pointer.

Rec. 44 When overloading functions, all variations should have the same semantics (be used for the same purpose).

Rec. 45 Use inline functions when they are really needed.

Rec. 46 Minimize the number of temporary objects that are created as return values from functions or as arguments to functions.

Rec. 47 Avoid long and complex functions.

Rec. 48 Pointers to pointers should whenever possible be avoided.

Rec. 49 Use a typedef to simplify program syntax when declaring function pointers.

Rec. 50 The choice of loop construct (for, while or do-while) should depend on the specific use of the loop.

Rec. 51 Always use unsigned for variables which cannot reasonably have negative values.

Rec. 52 Always use inclusive lower limits and exclusive upper limits.

Rec. 53 Avoid the use of continue.

Rec. 54 Use break to exit a loop if this avoids the use of flags.

Rec. 55 Do not write logical expressions of the type if (test) or if (!test) when test is a pointer.

Rec. 56 Use parentheses to clarify the order of evaluation for operators in expressions.

Rec. 57 Avoid global data if at all possible.

Rec. 58 Do not allocate memory and expect that someone else will deallocate it later.

Rec. 59 Always assign a new value to a pointer that points to deallocated memory.

Rec. 60 Make sure that fault handling is done so that the transfer to exception handling (when this is available in C++) may be easily made.

Rec. 61 Check the fault codes which may be received from library functions even if these functions seem foolproof.

Summary of Portability Recommendations

Port.Rec 1 Avoid the direct use of pre-defined data types in declarations.

Port.Rec 2 Do not assume that an int and a long have the same size.

Port.Rec 3 Do not assume that an int is 32 bits long (it may be only 16 bits long).

Port.Rec 4 Do not assume that a char is signed or unsigned.

Port.Rec 5 Always set char to unsigned if 8-bit ASCII is used.

Port.Rec 6 Be careful not to make type conversions from a "shorter" type to a "longer" one.

Port.Rec 7 Do not assume that pointers and integers have the same size.

Port.Rec 8 Use explicit type conversions for arithmetic using signed and unsigned values.

Port.Rec 9 Do not assume that you know how an instance of a data type is represented in memory.

Port.Rec 10 Do not assume that longs, floats, doubles or long doubles may begin at arbitrary addresses.

Port.Rec 11 Do not depend on underflow or overflow functioning in any special way.

Port.Rec 12 Do not assume that the operands in an expression are evaluated in a definite order.

Port.Rec 13 Do not assume that you know how the invocation mechanism for a function is implemented.

Port.Rec 14 Do not assume that an object is initialized in any special order in constructors.

Port.Rec 15 Do not assume that static objects are initialized in any special order.

Port.Rec 16 Do not write code which is dependent on the lifetime of a temporary object.

Port.Rec 17 Avoid using shift operations instead of arithmetic operations.

Port.Rec 18 Avoid pointer arithmetic.

References

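[ref 1] M Tvrdikova, J Tvrdik, Human Factors and the Design of Interactive Applications, Proceedings of the International Conference on Computer Based Learning in Science, Vienna (December 1993)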

Notes converted from troff to HTML by an Eric Foxley shell script, email errors to me!
