
Garbage collection in general and in Smalltalk

This document is meant for people who are not too familiar with the concept of garbage collection (GC), and to prevent too much discussion with people who think that garbage collection is a slow, useless feature which is not needed (e.g. radical C/C++ fans :-)

I have collected some arguments (not my own) from the literature and from discussions - both on the net and private ones. I am not naive enough to believe that this religious war will end soon, but this text may at least give some common ground on which to base discussions.

Also, it should spare me useless questions, flames etc. in the future.

For those of you who know Smalltalk/GC, it may not give you any new information - just skip & forget about it.

Some of the stuff below may sound a bit drastic, cynical or whatever - think of a big smiley :-) after every such statement ...

Introduction: what is GC

Almost every programming system offers the programmer some functionality to dynamically allocate storage in amounts not foreseen at coding time; for example, Pascal offers a "new" operation, the C libraries provide the "malloc" family of functions, C++ provides "new", and Lisp provides CONS, to name just a few.

Except in the rare cases where those objects (not in the OO sense) are used and needed for the whole running time of the program, sooner or later the storage used by them must be freed and given back to some management facility - otherwise the program would keep growing, eating up tons of unused storage space.

Although this is a bit of a simplification, programming systems can be roughly divided into two classes:

- those that require the programmer to explicitly request the freeing of an unused storage area (e.g. by calling a "dispose", "free" or "destroy" operation);
- those that find unused storage automatically and return it to some free-memory management facility; this is what is called "garbage collection".
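
To make the distinction concrete, here is a small C++ sketch of the first kind of system (the names and sizes are made up); every allocation carries the obligation to free it again by hand:

    #include <cstdlib>
    #include <cstring>

    struct Point { double x, y; };

    void explicitManagement()
    {
        // C style: storage obtained from malloc() must later be
        // given back with free() - nobody else will do it for us.
        char *buffer = (char *)malloc(1024);
        strcpy(buffer, "hello");
        free(buffer);                   // forget this and the 1024 bytes leak

        // C++ style: the same obligation holds for new/delete.
        Point *p = new Point{1.0, 2.0};
        delete p;                       // forget this and the Point leaks
    }

In a system of the second kind (Smalltalk, Lisp, ...), the free() and delete lines simply do not exist; the collector reclaims the storage once it finds that no reference to it is left.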

What are the arguments of GC opponents

GC is too slow

What are the dangers of NOT doing GC

Put simply:
many believe that in a modest-sized programming project it is almost impossible to tell for sure when an object is no longer needed.

Some even state that it is impossible to create a large system which does not have memory leaks (without GC).

("large" meaning something on which many people work for years, creating code megabytes in size)

Therefore, in practice, one of three situations arises:

- objects are freed too late or not at all, and the program slowly leaks memory;
- objects are freed too early, while other parts of the program still hold references to them, leading to dangling pointers and hard-to-find crashes;
- a large amount of development time is spent on the bookkeeping needed to avoid the first two problems, or on hunting down the resulting bugs.

My personal experience supports these arguments - I have heard of (and seen) systems where more than 100 people created a program which was almost impossible to debug, and where reference errors were almost impossible to find in the end. The way out was finally to throw away big parts, to invent some kind of home-grown GC and to restrict the use of the language by forbidding everything which originally made the chosen language "so efficient": overloading all assignment and pointer operations, wrapping pointers into envelope classes, turning each pointer dereference into a virtual function call and thereby slowing execution down to that of interpreted BASIC :-)

To make it clear:

I REALLY do NOT think (and do NOT want to give the impression) that all these programmers are incapable of good programming - every one of us has made (and still makes) errors of this kind; even the very best gurus make them!!!
It is simply the complexity of big systems (especially when created by a big group) which makes these errors appear again and again.

When separate subsystems (libraries, especially binary libraries) are involved, things tend to become even harder, since it is often unclear (from the specification and documentation) who is responsible for the freeing of objects. (Just think of a container class which keeps references to some other objects; the question is whether the container should free its referred-to objects when it is itself freed, or leave this to the user of the container. If it is left to the user, how does he know that other parts of the program do not still have references to those objects?)
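
Here is a tiny, made-up C++ illustration of that container dilemma (DocumentList and Document are invented names, not from any real library):

    #include <vector>

    struct Document { /* ... */ };

    class DocumentList {
        std::vector<Document*> items;   // holds references, not copies
    public:
        void add(Document* d) { items.push_back(d); }

        ~DocumentList() {
            // Dilemma: if the list owns its elements, they must be deleted
            // here; if some other subsystem also keeps pointers to them,
            // deleting them here produces dangling references there, and
            // NOT deleting them (when nobody else does) produces a leak.
            // The answer cannot be seen from this code alone - it is a
            // convention every user of the library has to know and obey.
            for (Document* d : items)
                delete d;               // correct? depends on the caller!
        }
    };

With a garbage collector, the destructor (and the whole question) simply disappears: the elements are reclaimed whenever the last reference to them - wherever it is - goes away.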

How can GC be implemented

There are many strategies for doing GC, which will not be covered here in much detail - there is a lot of literature (*) available in this area.

The most well-known and most often used strategies are:

- reference counting: every object carries a count of the references to it; whenever the count drops to zero, the object is freed;
- mark & sweep: starting from a set of root references, all reachable objects are marked; afterwards everything unmarked is swept (freed), possibly followed by a compaction phase;
- copying collectors (for example Baker's semispace algorithm): all reachable objects are copied into a fresh memory area, leaving the garbage behind;
- generation scavenging: a copying scheme which exploits the fact that most objects die young, and therefore concentrates its effort on the newest objects.
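
As a rough illustration of the mark & sweep idea, here is a very condensed, purely hypothetical C++ sketch; a real collector works on the raw object memory and avoids deep recursion, but the two phases are the same:

    #include <vector>

    struct Object {
        bool marked = false;
        std::vector<Object*> refs;      // references to other objects
    };

    std::vector<Object*> heap;          // every object ever allocated
    std::vector<Object*> roots;         // globals, stack slots, registers ...

    // mark phase: follow references, starting at the roots
    void mark(Object* o) {
        if (o == nullptr || o->marked) return;
        o->marked = true;
        for (Object* ref : o->refs)
            mark(ref);
    }

    // sweep phase: everything that was not reached is garbage
    void sweep() {
        std::vector<Object*> live;
        for (Object* o : heap) {
            if (o->marked) { o->marked = false; live.push_back(o); }
            else delete o;              // reclaim unreachable objects
        }
        heap.swap(live);
    }

    void garbageCollect() {
        for (Object* r : roots) mark(r);
        sweep();
    }

    int main() {
        Object* a = new Object;         // reachable via a root
        Object* b = new Object;         // reachable only through a
        heap.push_back(new Object);     // garbage right away
        a->refs.push_back(b);
        heap.push_back(a);
        heap.push_back(b);
        roots.push_back(a);
        garbageCollect();               // frees the anonymous object, keeps a and b
        return 0;
    }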

Literature:
For more information on GC, read "Garbage Collection Techniques" by Paul R. Wilson, in ACM Computing Surveys, also found in the proceedings of the 1992 International Workshop on Memory Management, Springer Lecture Notes in Computer Science.

Pros & cons of the techniques

What GC enemies suggest

Interestingly, most GC opponents suggest a device called a "smart pointer" (or some similar buzzword), which is actually nothing other than a reference counting garbage collector - which, as we have seen above, is NOT the best solution to garbage collection (*).
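
To see why such a "smart pointer" is just reference counting in disguise, here is a stripped-down sketch in C++ (a hypothetical class, but std::shared_ptr and most home-grown smart pointers do essentially the same):

    template <class T>
    class SmartPtr {
        T*   object;                    // the referenced object
        int* count;                     // shared reference counter
    public:
        explicit SmartPtr(T* o) : object(o), count(new int(1)) {}

        SmartPtr(const SmartPtr& other) : object(other.object), count(other.count) {
            ++*count;                   // one more reference
        }

        SmartPtr& operator=(const SmartPtr& other) {
            if (this != &other) {
                release();
                object = other.object;
                count  = other.count;
                ++*count;
            }
            return *this;
        }

        ~SmartPtr() { release(); }

        T* operator->() const { return object; }
        T& operator*()  const { return *object; }

    private:
        void release() {
            if (--*count == 0) {        // last reference gone:
                delete object;          // "collect" the object ...
                delete count;           // ... exactly what a reference
            }                           // counting GC does, one object at a time
        }
    };

Every assignment and every copy of such a pointer costs a counter update, and two objects which reference each other are never freed - the well-known weaknesses of reference counting.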

Others use tools which (more or less successfully) find invalid references to already freed objects, multiple freeing of the same object, and so on. To my knowledge, none of these is perfect or can guarantee that all such situations are found:
- a complete analysis of the source code is (theoretically) impossible;
- analysis of the executing process requires that every possible flow of control through the program be executed (at least once) to detect all possible memory bugs.

For many applications, this is simply not possible. Also, strictly speaking, these check runs have to be repeated for every change in the program - however small that change was.

I doubt that these tools are of much help for modest-sized applications involving a team of many programmers (not talking about toy programs here :-)

There is finally a very radical, puristic group of GC enemies which simply suggests the following:

"if you cannot get your memory organized, give up programming - since you are simply too dump"

(this is no joke; this statement really appeared in the comp.lang.c++ newsgroup) Big smiley here :-)

(*) Notes:

Other GCs are possible as well; a relatively popular algorithm is the so-called conservative GC, which scans memory for things which "look like a pointer" in order to track reachable objects.
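
A sketch of what "looks like a pointer" means (names and data structures are invented for the example): the collector walks over root memory - stack, globals, registers - word by word, and treats every word whose value falls into an allocated block as a possible reference to that block:

    #include <cstdint>
    #include <cstddef>
    #include <vector>

    struct Block {                      // one allocated chunk of memory
        void*  start;
        size_t size;
        bool   reachable = false;
    };

    std::vector<Block> allocatedBlocks; // what the allocator handed out

    // Scan a (word-aligned) region of memory, e.g. the machine stack,
    // conservatively: every word that *could* be a pointer into some
    // allocated block marks that block as reachable - even if it is
    // really just an integer with the same bit pattern.
    void scanConservatively(const void* regionStart, size_t regionSize)
    {
        const uintptr_t* words = static_cast<const uintptr_t*>(regionStart);
        size_t nWords = regionSize / sizeof(uintptr_t);

        for (size_t i = 0; i < nWords; ++i) {
            uintptr_t value = words[i];
            for (Block& b : allocatedBlocks) {
                uintptr_t lo = reinterpret_cast<uintptr_t>(b.start);
                if (value >= lo && value < lo + b.size)
                    b.reachable = true; // "looks like a pointer" into b
            }
        }
    }

Because an integer may accidentally look like a pointer, such a collector can retain some garbage, but it never frees a live object; the well-known Boehm-Demers-Weiser collector for C/C++ works along these lines.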

What type of GC do Smalltalks use

Traditionally, the early versions used a reference counting GC (ST-80 - I guess up to V2.x); the newer ST-80 (OWST) versions use a combination of generation scavenging for new objects and an incremental mark & sweep, augmented by a compressor, for the old objects. (I am walking on sandy ground here - I do not know this for sure; it depends on what others have told me ...)

I do not know what ST/V and Enfin use.

ST/X uses a modified generation scavenging scheme for new objects (with adaptive aging) and a Baker algorithm for old objects, which is started on demand. There is also an incremental mark & sweep running over the old objects at idle times or (optionally) as a low-priority background process.
The implementation is prepared for, and it is planned, to add a third intermediate generation and to replace the semispace Baker algorithm by some in-place compressor (the compressor will be somewhat slower, but it reduces the virtual memory needs).
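
For illustration only, here is a much simplified, hypothetical sketch of the new-space part (generation scavenging with aging); the real ST/X collector works on raw object memory with forwarding pointers and a remembered set, but the principle is the same: allocate young objects cheaply, copy the few survivors, and tenure objects which have survived often enough into the old generation:

    #include <vector>
    #include <algorithm>

    struct Obj {
        int age = 0;                    // scavenges survived so far
        Obj* forward = nullptr;         // forwarding pointer during a scavenge
        std::vector<Obj*> refs;         // references to other objects
    };

    struct Heap {
        std::vector<Obj*> newSpace;     // the young generation
        std::vector<Obj*> oldSpace;     // the old (tenured) generation
        int tenureAge = 4;              // promote after surviving this often

        Obj* allocate() {               // all new objects are born young
            Obj* o = new Obj;
            newSpace.push_back(o);
            return o;
        }

        bool isYoung(Obj* o) {
            return std::find(newSpace.begin(), newSpace.end(), o) != newSpace.end();
        }

        // Copy one live young object into the survivor list (or promote it),
        // leaving a forwarding pointer so other references find the copy.
        Obj* evacuate(Obj* o, std::vector<Obj*>& survivors) {
            if (o->forward) return o->forward;  // already copied
            Obj* copy = new Obj(*o);
            copy->age = o->age + 1;
            o->forward = copy;
            if (copy->age >= tenureAge)
                oldSpace.push_back(copy);       // tenure into the old generation
            else
                survivors.push_back(copy);
            return copy;
        }

        // Scavenge the new space: copy everything reachable from the roots
        // (and from old objects), then throw the rest of the new space away.
        void scavenge(std::vector<Obj*>& roots) {
            std::vector<Obj*> survivors;
            std::vector<Obj*> todo;

            for (Obj*& r : roots)
                if (r && isYoung(r)) { r = evacuate(r, survivors); todo.push_back(r); }

            // a real scavenger uses a remembered set here; this sketch
            // simply rescans all old objects for pointers into new space
            for (Obj* o : oldSpace) todo.push_back(o);

            while (!todo.empty()) {
                Obj* o = todo.back(); todo.pop_back();
                for (Obj*& ref : o->refs)
                    if (ref && isYoung(ref)) { ref = evacuate(ref, survivors); todo.push_back(ref); }
            }

            for (Obj* dead : newSpace) delete dead;  // unreachable young objects die
            newSpace.swap(survivors);                // survivors form the new new-space
        }
    };

Calling scavenge() whenever the new space fills up gives the behaviour described above: short pauses proportional to the (usually few) surviving objects, not to the total amount of allocated storage.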

Oldspace collections happen very infrequently - if no special memory needs arise (as when doing image processing/viewing), the system may run forever without one (especially with the incremental background collector running).

Worst-case situations for generational GCs and ST/X's GC in particular

For a fair discussion, some worst-case situations for generational GCs have to be discussed in this paper as well. In ST/X (and probably other systems using similar algorithms), there are two situations in which the GC gets into trouble, possibly leading to longer than the above-mentioned average pause times and/or higher GC overhead. The following discusses these cases and describes how ST/X tries to avoid them:

Conclusion

There is no real argument against GC, except for a small percentage of added runtime overhead - and except for true real-time applications, which need special consideration.

With today's high-performance computers, this is an adequate price to pay for the added security and - not to forget - the time savings for the programmer, who would otherwise spend a lot of his/her time in the debugger instead of doing productive work.

Of course, there are always special situations in which one algorithm performs better than others - memory allocation patterns vary over different applications. Generation scavenging provides the best average performance over different allocation patterns.

Or, to summarize (and estimate/simplify):

if a programmer spends 80% of his/her time debugging code, and 50% of all errors are dangling pointers, bad frees or other GC-avoidable bugs, then getting rid of those bugs rewards you with a development cost saving of 80% * 50% = 40% !!!

Now, those savings will pretty soon amortize the added cost of a 5% to 10% faster CPU (to make up for the GC overhead).

The exact percentages may vary and will not be correct in every situation, but you see the point, don't you?


Copyright © Claus Gittinger Development & Consulting, all rights reserved

(cg@ssw.de)