5.4 The String Data Type
In the Prolog community there have been ongoing discussions about the need
to have a special string data type.
The main argument against strings is that everything that can be done
with strings can as well be done with atoms or with lists, depending
on the application.
Nevertheless, in ECLiPSe it was decided to have the string data type, so that
users that are aware of the advantages and disadvantages of the
different data types can always choose the most appropriate one.
The system provides efficient builtins for converting from one data
type to another.
5.4.1 Choosing The Appropriate Data Type
Strings, atoms and character lists differ in space consumption and in
the time needed for performing operations on the data.
Strings vs. Character Lists
Let us first compare strings with character lists.
The space consumption of a string is always less than that of the corresponding
list. For long strings, it is asymptotically 16 times more compact.
Items of both types are allocated on the global stack, which means that
the space is reclaimed on failure and on garbage collection.
For the complexity of operations it must be kept in mind that the string type
is essentially an array representation, ie. every character in the string
can be immediately accessed via its index.
The list representation allows only sequential access.
The time complexity for extracting a substring when the position is given
is therefore only dependent on the size of the substring for strings,
while for lists it is also dependent on the position of the substring.
Comparing two strings is of the same order as comparing two lists, but
faster by a constant factor.
If a string is to be processed character by character, this is easier to
do using the list representation, since using strings involves keeping
index counters and calling the string_code/3 predicate.
The higher memory consumption of lists is sometimes compensated by the
property that when two lists are concatenated, only the first one needs
to be copied, while the list that makes up the tail of the concatenated
list can be shared.
When two string are concatenated, both strings must be copied to form
the new one.
Strings vs. Atoms
At a first glance, an atom does not look too different from a string.
In ECLiPSe, many predicates accept both strings and atoms (e.g. the file name
in open/3) and some predicates are provided in two versions, one for
atoms and one for strings (e.g. concat_atoms/3 and concat_strings/3).
However, internally these data types are quite different.
While a string is simply stored as a character sequence, an atom is mapped
into an internal constant.
This mapping is done via a table called the dictionary.
A consequence of this representation is that copying and comparing atoms
is a unit time operation,
while for strings both is proportional to the string length.
On the other hand, each time an atom is read into the system, it has to
be looked up and possibly entered into the dictionary, which implies
some overhead.
The dictionary is a much less dynamic memory area than the global stack.
That means that once an atom has been entered there, this space will
only be reclaimed by a relatively expensive dictionary garbage collection.
It is therefore in general not a good idea to have a
program creating new atoms dynamically at runtime.
Atoms should always be preferred when they are involved in unification
and matching. As opposed to strings, they can be used for indexing
clauses of predicates.
Consider the following example:
[eclipse 1]: [user].
afather(mary, george).
afather(john, george).
afather(sue, harry).
afather(george, edward).
sfather("mary", "george").
sfather("john", "george").
sfather("sue", "harry").
sfather("george", "edward").
user compiled 676 bytes in 0.00 seconds
yes.
[eclipse 2]: afather(sue,X).
X = harry
yes.
[eclipse 3]: sfather("sue",X).
X = "harry" More? (;)
no (more) solution.
The predicate with atoms is indexed, that means that the matching
clause is directly selected and the determinacy of the call is recognised
(the system does not prompt for more solutions).
When the names are instead written as strings, the system attempts
to unify the call with the first clause, then the second and so on until
a match is found. This is much slower than the indexed access.
Moreover the call leaves a choicepoint behind (as shown by the more-prompt).
Conclusion
Atoms should be used for representing (naming) the items that a
program reasons about, much like enumeration constants in PASCAL.
If used like this, an atom is in fact indivisible and there should
be no need to ever consider the atom name as a sequence of characters.
When a program deals with text processing, it should choose between string
and list representation.
When there is a lot of
manipulation on the single character level, it is probably best to use
the character list representation, since this
makes it very easy to write recursive predicates walking through the text.
The string type can be viewed as being a compromise between atoms and lists.
It should be used when handling large amounts of input, when the extreme
flexibility of lists is not needed, when space is a problem or when
handling very temporary data.
5.4.2 Builtin Support for Strings
Most ECLiPSe builtins that deliver text objects (like
getcwd/1,
read_string/3,4 and many others) return strings.
Strings can be created and their contents may be read using the string
stream feature (cf. section 10.3.1).
By means of the builtins
atom_string/2,
string_list/2,
number_string/2 and
term_string/2,
strings can easily be converted to other data types.
5.4.3 Quoted lists
As already discussed, many Prologs use the double quotes as a notation for
lists of characters. By default, ECLiPSe does not provide any
syntactical support for such quoted lists. However, the user can
manipulate the quotes by means of the set_chtab/2
predicate.
A quote is defined by setting the character class of the chosen character
to string_quote, list_quote or atom_quote respectively.
To create a list quote (which is not available by default)
one may use:
[eclipse 1]: set_chtab(0'`, list_quote).
yes.
[eclipse 2]: X = `text`, Y = "text", type_of(X, TX), type_of(Y, TY).
X = [116, 101, 120, 116]
TX = compound
Y = "text"
TY = string
yes.