back next

Collection classes

Collections are containers for other objects. Smalltalk provides a rich set of collection classes ready to be used.

The protocol offered by these collections is very orthogonal, making good use of polymorphism. For example, most collections provide methods like at:, includes: or do:, even though the implementations are totally different and the performance may vary greatly between different collections and access methods (*).
To get a rough idea of the common protocol, have a look at the class Collection. This is the superclass of all collections, where most of the common operations are implemented. (Although, many subclasses redefine these methods.)

The most useful collection classes are:

less frequently used collections:

seldom used:

Use the systemBrowser to have a more detailed look into the implementation and also to see the full set of operations offered by these classes.

Array

Arrays store references to other objects, and use a numeric key (1 .. ) to access these elements. The elements can be of any type. The size of an array is assumed to be constant (it is possible to change the size, but this is a somewhat expensive i.e. slow operation).
When accessing elements in arrays, the index is always checked for being within the bounds - you will get an indexBounds exception if you try to access an element using an illegal index.
The typical uses of arrays are:

The Smalltalk language allows array constants (so called "Array literals"), the syntax is:
    #( element1 element2 ... elementN )
where the individual elements must be literals (numbers, characters, strings or other arrays). Nested literal Arrays are written as:
    #( 
	(element1.1 element1.2 ... element1.M)
	(element2.1 element2.2 ... element2.M)
	...
	(elementN.1 elementN.2 ... elementN.M)
     )
Notice, the above is an Array of 3 Arrays. Not to confuse with an array containing 9 elements (which you get, if the inner parenthesises are ommited).

Smalltalk/X also allows the objects true, false and nil to be treated like constants in literal arrays. This is NOT compatible with some older Smalltalk implementations, so care should be taken if that code is to be ported to other systems later.
For example,

    #( nil 1 'foo' true #true )
represents a 5-element array literal consisting of nil, the integer 1, the string 'foo', the boolean true and the symbol #true.

String

Strings are special arrays, which are tuned for storage of characters. (actually, they only allow storing characters - an error is reported, if you try to store something else into it). They provide the same basic protocol as arrays, but offer many more string specific operations. For example, pattern matching, substring searching, upper/lowercase translation etc. are provided.
Typical operations on strings:

Constants strings (string literals) are entered by enclosing the characters in single quotes ('). Single quotes within string constants are to be doubled. String literals can cross line boundaries; in this case, the newline character is part of the string.
Examples:

    'a normal string constant'

    'a string constant with embedded ''quotes'''

    'a string constant
     with mutliple
     lines'

Symbol

Symbols are much like strings, with the exception that the system keeps track of all symbols in the system and will make certain that two symbols are identical, if they contain the same characters. In other words: if the system is asked to create a symbol from some string, it will search the existing symbols first for an already existing one with the same characters. If its found, that existing object is returned. If not found, a new symbol is created and entered into this table.

Therefore, symbols can be compared using identity compare (==), which is much faster than value compare (=). Identity compare is technically implemented as a pointer compare, while value compare on strings is implementated by comparing each element of the two objects.

Symbols are internally used by the system to assign names to class objects (in the Smalltalk global table), and to assign method names to method objects (in each classes method/selector association). Also, symbols are perfect to use as keys in IdentityDictionaries.

Since symbols are handled just as fast as integers, they are also perfect for flags, states (state machines) or other variables which are represented by enumeration types or define-constants in other programming languages.

Symbols do not allow change of their character elements; the at:put: method will raise an error. Other than that, all read accesses are the same as for strings (actually, class Symbol inherits all those methods from the String class).

Symbols can be created (at runtime) from a string with:

    sym := aString asSymbol
also, symbol constants (i.e. symbol literals) can be written directly by prefixing some identifier with the #-character as in:
    sym := #helloThere.
These symbol literals are created at compile time - at execution time,so the cost of using them is not more than using integers.

Only alphanumeric identifiers or valid message selectors are allowed in the above example. Examples for valid symbol iterals are:

    #helloThere
    #FooBar
    #+
    #at:
    #at:put:
Use quotes around the characters for other symbol literals:
    sym := #'a symbol with spaces and .... other stuff'
The quotes don't hurt; therefore #+ and #'+' represent the same symbol.
Symbol literals can also be present in array literals as in:
    #( #foo #bar #baz )
Within such an array literal, the leading #-character can be ommitted (for your convenience). Thus, the above is equivalent to:
    #( foo bar baz )
but, NOT equivalent to:
    #( 'foo' 'bar' 'baz' )
which is an array of strings.

Since true, false and nil are treated as constant objects, you have to add a #-prefix as in:

    #( #true #false)
if you want those symbols in an array literal (which is unlikely).

Set and IdentitySet

Sets keep references to other objects, but do not allow for dublicates. Thus an individual object will appear at most once in a set.
To check if some object is already contained in the set, the elements are compared using =, which typically compares objects values.
Thus, 1 (the Integer "one") will be considered equal to 1.0 (the Float "one").
The alternative IdentitySet uses an identity compare (==) and will treat 1 and 1.0 as different objects.
The typical use of sets and identitySets is:

Since sets keep their elements unordered, the keyed access methods (at: and at:put:) are not allowed and will trigger an error.

Sets and identitySets use hashing algorithms internally for inclusion tests. Thus, they are usually fast and show a constant access time which is determined by their fill-grade and the quality of the hash key algorithm; not by the absolute size of the set. The hash key generation is implemented by the elements - not by the set.

Sets automatically grow (and shrink) as required when elements are added or removed. To get good hashing performance, they grow larger (by some percentage) than the actual size required for the elements. Expect some 20% storage overhead when using sets.

IdentitySets are usually much faster than Sets, since both hashKey generation and comparing is implemented more efficient in the first. If your keys are small integers or symbols, you should always use IdentitySets.
However, for strings, identitySet is usually a bad choice.

Dictionary and IdentityDictionary

Dictionaries are somewhat like arrays, in that they allow access to the elements by at: and at:put: messages using an access key. However, in contrast to arrays they allow any other object to be used as key. Thus an array can be thought of as a special dictionary using numeric keys only (if you want to do this, its faster to use an array rightaway - except if your array is filled very sparse).
Like sets, dictionaries come in two flavours: Dictionary uses value compare (=) and IdentityDictionary which uses identity compare (==) for key accesses.
Thus:
	|d value|

	d := Dictionary new.
	d at:1 put:'one'.
	 ...
	value := d at:1.0
will leave the expected 'one' in the variable value.
In contrast,
	|d value|

	d := IdentityDictionary new.
	d at:1 put:'one'.
	 ...
	value := d at:1.0
will raise an error, since there is no element stored under that key (remember: 1 = 1.0 returns true, while 1 == 1.0 returns false).

Like with sets, dictionaries use hashing internally; IdentityDictionaries are usually faster than Dictionaries.

Dictionaries are very powerful collections, and worth a second look. You can put associations between any objects into dictionaries. A simple example:

	|nameToAge|

	nameToAge := Dictionary new.
	nameToAge at:'fred' put:17.
	nameToAge at:'peter' put:21.
	 ...
	age := nameToAgeAt:name ifAbsent:['dont know'].
You can also use Dictionaries as a replacement for simple data structures, which are called record or struct in other languages:
	|peter marry|

	peter := IdentityDictionary new.
	peter at:#age put:23.
	peter at:#firstName put:'peter'.
	peter at:#middleName put:'m'.
	peter at:#name put:'smalltalker'.
	peter at:#salary put:51234.
	 ...
	marry := IdentityDictionary new.
	marry at:#age put:21.
	marry at:#firstName put:'marry'.
	marry at:#middleName put:'s'.
	marry at:#name put:'smalltalker'.
	marry at:#husband put:peter.
	...
	age := peter at:#age.
	firstNameOfHusband := (marry at:#husband) at:#firstName.
Of course, real programmers define a class Person and create an instance of it ... this is both more 'object oriented' and provides faster access than using dictionaries (which are not meant to be used for the above 'constant' key access).

OrderedCollection

OrderedCollections are collections which access their elements via a numeric key - in that, they are much like arrays. However, they are specially tuned for adding/removing elements (i.e. they can easily grow and shrink), which is a very expensive operation if applied to an array.

typical use:

Notice, that the above includes: when applied to large collections is usually much slower for OrderedCollection or Array than it is for Sets or Dictionaries. The reason is that the former have to do a sequential search over all elements, while the later can use a fast hashed access.

SortedCollection

SortedCollections are like OrderedCollections, but they keep their elements in sorted order. By default, sortedCollections sort in ascending order, by comparing individual elements using '>'. You can specify your own sorting rule (so called sortBlock).

Sorting is done using quickSort and binary search (when inserting elements). typical use:

TwoByteString

Are like strings, but the elements are characters in the coderange 0..65535. They can be used to represent strings in Japanese, Chinese, Unicode or any other 16bit characterset.

ByteArray

ByteArrays are tuned for space efficient storage of short (byte-valued) integers in the range 0 .. 255. They are used for bitmap storage and executable intermediate (byte-) code.

Bytearrays are also valuable for data interchange with other programs via files, sockets, shared memory or pipes.
Protocolwise, they behave much like Arrays.

FloatArray and DoubleArray

Float- and DoubleArray are specially tuned for space efficient storage of short (typically 32bit) and double float (typically 64bit) Float values.
For normal users, these are less interresting, but offer performance advantages when matrices or 3D graphics are involved, where large float number collections are often needed.
Protocolwise, they behave much like Arrays.

Interval

An interval represents a set of numbers which can be generated from a start value, a final value and a step. For example, the even numbers between 2 and 10 are represented by (2 to:10 by:2).

LinkedList

A linkedList consists of a chained number of Link or ValueLink instances. They provide reasonable fast insertion methods, but show poor performance for deletion and insertion tests.

Queue and SharedQueue

A queue is a collection where elements are added at one end and removed at the other end. An interresting class to look at is SharedQueue (which is a subclass of Queue) and helps in writing producer/consumer applications with separate writer/reader processes.

WeakArray, WeakIdentitySet, WeakIdentityDictionary and Registry

to be documented

CacheDictionary and ResourcePack

A cacheDictionary is a dictionary which only stores a certain number of elements - removing existing entries when full and new elements are to be added. As the name suggests, they can be used to keep a collection of recently used objects.

MappedCollection

A mappedCollection offers indirect indexing; access is via a map which provides the actual access key.

Notes:

(*)
Some collections are tuned for growing, while others are not; some use hashing or other fast access methods to support fast search and/or inclusion test, while others do sequential searches.
Check out and understand the individual collection classes and use the one which best fits your needs to get best performance.


Copyright © Claus Gittinger Development & Consulting, all rights reserved

(cg@ssw.de)