The Department of Computer Science & Engineering | STUART C. SHAPIRO: CSE 305
The standard definition of data type is (see text, p. 234):
The major steps in the evolution of data types were:
The rest of this chapter is a survey of data types and their design issues.
Various coding schemes are possible. Most languages now use binary numbers for positive integers, and two's complement for negative integers.
Represented by binary coded decimal (BCD). Each digit is represented by its binary equivalent. For example, 35 in BCD is 0011 0101.
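The BCD encoding just described can be sketched in C. This is only an illustration: the function name to_bcd is hypothetical, and it assumes a packed layout with one decimal digit per 4-bit nibble, which limits it to 8 decimal digits in 32 bits.

```c
#include <stdint.h>

/* Pack a non-negative integer into BCD, one decimal digit per 4-bit nibble.
   Hypothetical helper for illustration; digits beyond the 8th are dropped. */
uint32_t to_bcd(uint32_t n) {
    uint32_t bcd = 0;
    int shift = 0;
    while (n > 0 && shift < 32) {
        bcd |= (n % 10) << shift;  /* lowest decimal digit into the lowest free nibble */
        n /= 10;
        shift += 4;
    }
    return bcd;
}
```

Calling to_bcd(35) yields 0x35, whose bit pattern is 0011 0101, matching the example above.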
Usually represented using IEEE Floating-Point Standard: sign bit, exponent, fraction ("mantissa"). For more details of number representations, see my CSE115 notes on Java arithmetic.
Usually several types, differing on precision (number of bits used for fractional part).
Operations on numbers will be discussed in Chapter 7.
Only some programming languages have an actual Boolean type with two special values, true and false. C uses the int 0 for false, and any other int for true. Lisp uses nil for false and any other value for true.
Often represented in ASCII, a 7-bit code (usually stored in an 8-bit byte) that can encode 128 different characters.
There is a move, started by Java, to use Unicode, which Java represents with 16-bit char values, and which can represent characters from most of the languages in the world.
Many languages have a data type named something like string; others use arrays of characters. However, strings are usually implemented as arrays of characters.
The length of a string may be stored with the value or the variable, or may be indicated by a sentinel. For example, C and C++ terminate strings with the null character, '\0'.
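The two ways of recording a string's length can be compared in a small C sketch. The struct counted_string and both function names are hypothetical, introduced only for illustration; sentinel_length does by hand what the standard strlen does.

```c
#include <stddef.h>

/* Length stored with the value (hypothetical type, for illustration only). */
struct counted_string {
    size_t length;
    const char *chars;   /* need not end in '\0' */
};

/* Length indicated by a sentinel: scan until the null character. */
size_t sentinel_length(const char *s) {
    size_t n = 0;
    while (s[n] != '\0')
        n++;
    return n;
}

/* Length stored with the value: a single field read, no scan. */
size_t counted_length(const struct counted_string *s) {
    return s->length;
}
```

The sentinel version costs time proportional to the length on every call; the stored-length version costs one extra field per string.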
String concatenation is such a common operation that several languages include an operator for it, such as Java's overloaded +. Java uses concatenation to construct output lines. Other languages use format strings with interpolated control characters.
Some other common operations are: string length; substring extraction; character at position; string comparison; and substring search.
A major issue is whether string operations are destructive (change the argument string) or non-destructive (return a string like the argument string, except...). In Java, Strings are immutable (have no destructive operations), whereas StringBuffers are like Strings, but are mutable:
bsh % str1 = "This is a string.";
bsh % str2 = str1.replace('i', 'y');
bsh % print(str1);
This is a string.
bsh % print(str2);
Thys ys a stryng.
bsh % str3 = new StringBuffer("This is a string.");
bsh % print(str3);
This is a string.
bsh % str4 = str3.replace(8,9,"another");
bsh % print(str3);
This is another string.
bsh % print(str4);
This is another string.
Common Lisp has one kind of string, but both destructive and non-destructive operations:
cl-user(1): (setf str1 "This is a string.")
"This is a string."
cl-user(2): (setf str2 (substitute #\y #\i str1))
"Thys ys a stryng."
cl-user(3): str1
"This is a string."
cl-user(4): str2
"Thys ys a stryng."
cl-user(5): (setf str2 (nsubstitute #\y #\i str1))
"Thys ys a stryng."
cl-user(6): str1
"Thys ys a stryng."
cl-user(7): str2
"Thys ys a stryng."
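The same destructive versus non-destructive distinction can be sketched in C, where the programmer must choose explicitly. The function names replace_in_place and replace_copy are hypothetical; the copying version assumes the caller frees the result.

```c
#include <stdlib.h>
#include <string.h>

/* Destructive: overwrite every occurrence of `from` in s itself. */
void replace_in_place(char *s, char from, char to) {
    for (; *s != '\0'; s++)
        if (*s == from)
            *s = to;
}

/* Non-destructive: return a new heap copy with the replacement made.
   The argument is untouched; the caller frees the result.
   Returns NULL if allocation fails. */
char *replace_copy(const char *s, char from, char to) {
    char *copy = malloc(strlen(s) + 1);
    if (copy == NULL)
        return NULL;
    strcpy(copy, s);
    replace_in_place(copy, from, to);
    return copy;
}
```

replace_copy plays the role of Lisp's substitute, and replace_in_place the role of nsubstitute.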
A string's length may be static, as is Java's String, dynamic, as is Java's StringBuffer, or limited dynamic, as the text says C's are. However, the program
#include <stdio.h>
#include <string.h>
#define true 1
int main() {
  char str[10];
  int i = 0;
  /* Deliberately writes past the end of str to show the lack of range checking. */
  while (true) {
    str[i++] = 'a';
    str[i] = '\0';
    printf("str = %s; Its length is %d\n", str, (int) strlen(str));
  }
  return 0;
}
----------------------------------------------
<cirrus:Programs:1:102> gcc -Wall dstrlen.c -o dstrlen.out
<cirrus:Programs:1:103> dstrlen.out
str = a; Its length is 1
str = aa; Its length is 2
str = aaa; Its length is 3
str = aaaa; Its length is 4
str = aaaaa; Its length is 5
str = aaaaaa; Its length is 6
str = aaaaaaa; Its length is 7
str = aaaaaaaa; Its length is 8
str = aaaaaaaaa; Its length is 9
str = aaaaaaaaaa; Its length is 10
str = aaaaaaaaaaa; Its length is 11
str = aaaaaaaaaaaa; Its length is 12
was an infinite loop. When I killed it, str had a length of 1,598. Of course, this is C not doing range checking on arrays, again.
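For contrast, a limited dynamic length string can be enforced by hand in C. The sketch below is an assumption about how one might do it, not something from the text; bounded_append is a hypothetical helper that refuses to write past the declared capacity.

```c
#include <stddef.h>
#include <string.h>

/* Append c to the string in buf, whose total capacity is cap bytes.
   Returns 1 on success, 0 if there is no room for c plus the '\0'. */
int bounded_append(char *buf, size_t cap, char c) {
    size_t len = strlen(buf);
    if (len + 1 >= cap)
        return 0;          /* would overflow: leave buf unchanged */
    buf[len] = c;
    buf[len + 1] = '\0';
    return 1;
}
```

With a 10-byte buffer, repeated appends succeed nine times and then fail, so the string's length can vary but never exceeds 9.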
Pattern matching is a common operation on strings that is a very involved subject. A large part of Perl is devoted to pattern matching. Java has an extensive pattern matching capability in the package java.util.regex. C++ also has a pattern matching library. (X)Emacs supports regular expression pattern matching for searching and replacing strings. For example, the regular expression <[^>]*> will match HTML tags.
In C, the typedef identifier is a synonym for its parent type:
#include <stdio.h>
#define KperM 0.62137
#define MperK 1.60935
typedef float kilometer;
typedef float mile;
kilometer MtoK(mile x) {
  return x * MperK;
}
mile KtoM(kilometer x) {
  return x * KperM;
}
int main() {
  mile m = 100;
  kilometer k = 100;
  printf("%3.0f miles = %5.2f kilometers.\n", m, MtoK(m));
  printf("%3.0f kph = %5.2f mph.\n", k, KtoM(k));
  return 0;
}
----------------------------------------------------------------
<cirrus:Programs:1:114> gcc -Wall conversion.c -o conversion.out
<cirrus:Programs:1:115> conversion.out
100 miles = 160.93 kilometers.
100 kph = 62.14 mph.
However, that is not true in all languages with user-defined types. If the new type identifier is not a synonym, the question is whether name type compatibility or structure type compatibility is used.
In name type compatibility, whether two expressions have compatible types depends on the type identifier, even if the parent types are the same. In structure type compatibility, it depends on the parent types. For example, in the Ada-like type declarations
type array1type is array(1..10) of Integer;
type array2type is array(11..20) of Integer;
A: array1type;
B: array2type;
A and B do not have compatible types under name type compatibility, but do under structure type compatibility.
Some languages use name type compatibility, some use structure type compatibility, and some have facilities for both.
If a variable is declared with a type expression, such as
A: array(1..10) of Integer;
the variable is considered to have an anonymous type.
char. The integer types are also considered ordinal types, although the signed integers also have negatives. The important thing is that, except for the minimal value, every value of an ordinal type is the successor of a value of its type, and, except for the maximal value, every value of an ordinal type is the predecessor of a value of its type. So one should be able to use any ordinal type as an array subscript, or as a for loop index.
#include <stdio.h>
enum months {Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec};
int monLength[12] = {31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31};
char* monName[12] = {"January", "February", "March", "April",
                     "May", "June", "July", "August",
                     "September", "October", "November", "December"};
int main() {
  enum months m;
  for (m = Jan; m <= Dec; m++) {
    printf("%s has %d days.\n", monName[m], monLength[m]);
  }
  return 0;
}
---------------------------------------------------------------
<cirrus:Programs:2:106> gcc -Wall enumtest.c -o enumtest.out
<cirrus:Programs:2:107> enumtest.out
January has 31 days.
February has 28 days.
March has 31 days.
April has 30 days.
May has 31 days.
June has 30 days.
July has 31 days.
August has 31 days.
September has 30 days.
October has 31 days.
November has 30 days.
December has 31 days.
int and its values are treated like int values. C++, though, is more careful:
#include <stdio.h>
enum months {Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec};
enum days {Sun, Mon, Tue, Wed, Thur, Fri, Sat};
int main() {
  enum months m;
  enum days d = Thur;
  m = d;
  printf("It ran.\n");
  return 0;
}
----------------------------------------------------------------
<cirrus:Programs:2:111> g++ -Wall enumtest.cc -o enumtest.out+
enumtest.cc: In function `int main()':
enumtest.cc:18: cannot convert `days' to `months' in assignment
java.awt.Color.blue. Java's named constants achieve much of the effect of enumeration types, but it takes more code to create them:
public class Month {
    private String name;
    private int length;
    private int ordinal;
    private static Month[] calendar = new Month[12];
    public static final Month Jan = new Month(1, "January", 31);
    public static final Month Feb = new Month(2, "February", 28);
    public static final Month Mar = new Month(3, "March", 31);
    public static final Month Apr = new Month(4, "April", 30);
    public static final Month May = new Month(5, "May", 31);
    public static final Month Jun = new Month(6, "June", 30);
    public static final Month Jul = new Month(7, "July", 31);
    public static final Month Aug = new Month(8, "August", 31);
    public static final Month Sep = new Month(9, "September", 30);
    public static final Month Oct = new Month(10, "October", 31);
    public static final Month Nov = new Month(11, "November", 30);
    public static final Month Dec = new Month(12, "December", 31);
    public Month (int ord, String n, int ln){
        ordinal = ord;
        name = n;
        length = ln;
        calendar[ordinal-1] = this;
    }
    public String toString() { return name; }
    public int getLength() { return length; }
    public int getOrdinal() { return ordinal; }
    public Month getNext() { return calendar[ordinal]; }
    public boolean leq(Month m) { return getOrdinal() <= m.getOrdinal(); }
    public static void main (String[] args) {
        for (Month m = Jan; m.leq(Dec); m = m.getNext()) {
            System.out.println(m + " has " + m.getLength() + " days.");
            if (m == Dec) break;
        }
    } // end of main ()
}// Month
-------------------------------------------------
<cirrus:Programs:1:142> javac Month.java
<cirrus:Programs:1:143> java Month
January has 31 days.
February has 28 days.
March has 31 days.
April has 30 days.
May has 31 days.
June has 30 days.
July has 31 days.
August has 31 days.
September has 30 days.
October has 31 days.
November has 30 days.
December has 31 days.
type Days is (Mon, Tue, Wed, Thu, Fri, Sat, Sun);
subtype WeekDays is Days range Mon..Fri;
subtype WeekendDays is Days range Sat..Sun;
Day1: Days;
Day2: WeekDays;
Day3: WeekendDays;
Day1 := Day2 and Day1 := Day3 are legal.
Day2 := Day3 and Day3 := Day2 are illegal.
Day2 := Day1 or Day3 := Day1 are only legal if Day1 has a proper value at run-time.
Subrange types are particularly useful for the indexes of arrays,
such as
subtype arrayIndex is Integer range 1..100;
squares: array(arrayIndex) of Integer;
for i in arrayIndex loop
   squares(i) := i*i;
end loop;
cl-user(1): (type-of 3)
fixnum
cl-user(2): (type-of 3.7)
single-float
cl-user(3): (type-of 'January)
symbol
cl-user(4): (setf monLength '((January 31) (February 28) (March 31)
                              (April 30) (May 31) (June 30)
                              (July 31) (August 31) (September 30)
                              (October 31) (November 30) (December 31)))
((January 31) (February 28) (March 31) (April 30) (May 31) (June 30) (July 31) (August 31) (September 30) (October 31) ...)
cl-user(9): (let (m)
              (loop
                (format t "Enter a month or `bye': ")
                (setf m (read))
                (if (eql m 'bye) (return 'Goodbye))
                (format t "~A has ~D days.~%" m (second (assoc m monLength)))))
Enter a month or `bye': March
March has 31 days.
Enter a month or `bye': June
June has 30 days.
Enter a month or `bye': bye
Goodbye
cl-user(27): (setf Fibonacci 11235)
11235
cl-user(28): (defun Fibonacci (n)
               (if (< n 3) 1
                 (+ (Fibonacci (- n 1)) (Fibonacci (- n 2)))))
Fibonacci
cl-user(29): (symbol-name 'Fibonacci)
"Fibonacci"
cl-user(30): (symbol-value 'Fibonacci)
11235
cl-user(31): Fibonacci
11235
cl-user(32): (symbol-function 'Fibonacci)
#<Interpreted Function Fibonacci>
cl-user(33): (type-of (symbol-function 'Fibonacci))
function
cl-user(34): (Fibonacci 10)
55
The ability to compute subscripts makes a subscripted array like a
variable name that can be computed. More precisely, a subscripted
array is an expression evaluated for its l-value. Compare
these two subroutines for the Fibonacci sequence:
#include <stdio.h>
int fibonacci(int n) {
  if (n<3) return 1;
  int previous = 1,
    current = 1,
    next, i;
  for (i=3; i<=n; i++) {
    next = previous + current;
    previous = current;
    current = next;
  }
  return current;
}
int Fibonacci (int n) {
  if (n<3) return 1;
  int num[3] = {1,1},
    current = 1,
    i;
  for (i=3; i<=n; i++) {
    current = (current + 1) % 3;
    num[current] = num[(current + 1) % 3] + num[(current + 2) % 3];
  }
  return num[current];
}
int main() {
  int i;
  for (i=1; i<8; i++)
    printf("fibonacci(%d) = %d\n", i, fibonacci(i));
  printf("\n");
  for (i=1; i<8; i++)
    printf("Fibonacci(%d) = %d\n", i, Fibonacci(i));
  return 0;
}
------------------------------------------------
<cirrus:Programs:1:140> gcc -Wall indexdemo.c -o indexdemo.out
<cirrus:Programs:1:141> indexdemo.out
fibonacci(1) = 1
fibonacci(2) = 1
fibonacci(3) = 2
fibonacci(4) = 3
fibonacci(5) = 5
fibonacci(6) = 8
fibonacci(7) = 13
Fibonacci(1) = 1
Fibonacci(2) = 1
Fibonacci(3) = 2
Fibonacci(4) = 3
Fibonacci(5) = 5
Fibonacci(6) = 8
Fibonacci(7) = 13
An array can be thought of as a mapping, or even a function. For example, the C array monLength, above, is a mapping from a month's ordinal, 0..11, to its length. The Common Lisp use of monLength is more directly represented as a mapping. An array might also be thought of as a function from a month's ordinal to its length. This may be clearer by comparing the use of monLength in the C program to the use of getLength() in the Java Month class.
Most current programming languages use parentheses around the arguments of a function, e.g. f(x), and brackets around the subscripts of an array, e.g. a[i], but Fortran and Ada use parentheses for arrays also. Thinking of an array as a function justifies this, but most programmers find it confusing.
Common Lisp, as usual, uses a more functional notation:
Some programming languages, including Java and Common Lisp, do
range-checking. That is, they give a run-time error if the program
tries to use an out-of-range subscript. Others, including C, Perl,
and Fortran, do not. A programming language that does range checking
is clearly more reliable.
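Since C does no range checking itself, a program that wants it must do the check by hand. The following is a minimal sketch of that idea; checked_get is a hypothetical helper, and it assumes the caller knows the array's length.

```c
#include <stddef.h>

/* Store a[i] in *out and return 1 if i is a valid subscript (0 <= i < n);
   otherwise return 0 without reading a or writing *out. */
int checked_get(const int *a, size_t n, size_t i, int *out) {
    if (i >= n)
        return 0;      /* out of range: report instead of reading stray memory */
    *out = a[i];
    return 1;
}
```

The price is one comparison per access, which is the run-time cost that languages with built-in range checking pay on every subscript.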
Some programming languages have a fixed lowest subscript: in
C-based languages, it is 0; in Fortran, it is 1. Others allow the
programmer to choose the lowest subscript.
The array subscript range might be statically bound (during
compile-time); dynamically bound (during run-time), but then fixed; or
fully dynamic (might change during run-time).
Array storage binding might be static, dynamic on the stack, or
dynamic on the heap.
Some languages provide a convenient way to initialize arrays, such
as the C-based languages,
Some languages provide array operations, i.e., operations on arrays
themselves. For example, in Fortran:
APL is A Programming Language specially designed to operate on
arrays.
Two-dimensional arrays may be thought of
as solid rectangles (rectangular arrays), or as arrays of arrays
(jagged arrays). Some languages insist the programmer think of
arrays one way, some the other, and some support both. Java supports only jagged arrays:
Let's try Fortran:
The rows of a jagged array needn't all have the same number of columns.
Fortran 95 and Ada allow references to a slice of an array: a more or less regular piece of an array.
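A jagged array with rows of different lengths can be sketched in C as an array of row descriptors. The struct jagged_row and the function jagged_sum are hypothetical names used only for this illustration.

```c
#include <stddef.h>

/* One row of a jagged array: a pointer to its cells plus its own length. */
struct jagged_row {
    const int *cells;
    size_t length;
};

/* Sum every element, visiting each row only up to that row's length. */
long jagged_sum(const struct jagged_row *rows, size_t nrows) {
    long total = 0;
    for (size_t i = 0; i < nrows; i++)
        for (size_t j = 0; j < rows[i].length; j++)
            total += rows[i].cells[j];
    return total;
}
```

A triangular table with rows of 1, 2, and 3 elements fits this structure without any padding.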
The entire discussion of two-dimensional arrays extends to
multi-dimensional arrays.
It is most common for a pointer variable to hold the address of a memory cell in the heap, but C and C++ also allow addresses in RAM or on the stack.
Fortran 77 (and earlier) does not have pointer types, but they can
be simulated by using one array for data and a separate array of
indices into the first array as the pointers.
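The Fortran 77 technique of simulating pointers with array indices can be sketched in C as follows. This is an illustration under assumed names: data, next_index, NIL, and list_sum are all hypothetical.

```c
#define NIL (-1)   /* plays the role of a null pointer */

/* data[i] holds a value; next_index[i] holds the index of the following
   cell, or NIL at the end of the list. head is the index of the first cell. */
int list_sum(const int data[], const int next_index[], int head) {
    int total = 0;
    for (int p = head; p != NIL; p = next_index[p])
        total += data[p];   /* "dereference" the index p */
    return total;
}
```

With data = {30, 10, 20}, next_index = {NIL, 2, 0}, and head = 1, the list is traversed in the order 10, 20, 30 even though the values sit in a different order in the data array.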
How can a pointer variable contain an address in RAM or on the stack? Addresses in RAM or on the stack are allocated when variables are declared.
In statically typed languages, the declaration of a pointer variable must include the type of variable it points to.
Here's a C program using a pointer whose value is an address in the stack:
In C and C++, an array name is a constant pointer to the first
element of the array, so subscripting is done by pointer arithmetic,
and pointer expressions may replace subscripted arrays.
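The equivalence of subscripting and pointer expressions can be seen by writing the same loop both ways. The two function names below are hypothetical; the sketch relies only on the standard C rule that a[i] means *(a + i).

```c
#include <stddef.h>

/* Sum using subscripts. */
long sum_by_subscript(const int a[], size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += a[i];
    return total;
}

/* The same sum using a pointer that walks the array. */
long sum_by_pointer(const int *a, size_t n) {
    long total = 0;
    for (const int *p = a; p < a + n; p++)
        total += *p;
    return total;
}
```

Both functions accept an array name as argument, since the name is converted to a pointer to the first element when passed.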
Anonymous variables on the heap are manipulated via pointers.
Many novice C programmers find pointers to be confusing, but "if everything is a pointer, you don't have to think about pointers," and that is the approach taken by Lisp and Java. In those languages, you can think you are storing an object (or, at worst, a reference to an object) in a variable. You just have to remember that a change made via one reference variable may be seen via another reference variable.
The dangling pointer problem is the problem of a pointer variable
pointing to a memory cell that was already deallocated via another
pointer variable (and possibly even reused).
This C program shows that a pointer may be mistakenly used, even
though the space it points to has been deallocated:
The problem of lost heap-dynamic variables (garbage) is the problem
of memory cells allocated on the heap becoming unreachable when the
pointer variables referring to them end their lifetime or get
reassigned to other heap memory. This problem is solved in Lisp and
Java by automatic garbage collection.
cl-user(33): (setf a (make-array 10))
#(nil nil nil nil nil nil nil nil nil nil)
cl-user(34): (setf days #(Sun Mon Tue Wed Thu Fri Sat))
#(Sun Mon Tue Wed Thu Fri Sat)
cl-user(35): (aref days 3)
Wed
cl-user(36): (setf (aref a 2) 5)
5
cl-user(37): (aref a 2)
5
int[] squares = {0, 1, 4, 9, 16, 25};
However, one must distinguish whether the {...} notation is a general array-valued constructor, allowed on the right-hand side of assignment statements, or only a special syntax for declaration statements.
      Program arrayop
      Integer A1(5), A2(5), A3(5), A4(5)
      Data A1 /1, 2, 3, 4, 5/ A2 /6, 7, 8, 9, 10/
      A3 = A1 + A2
      A4 = A1 * A2
      Print *, A1
      Print *, A2
      Print *, A3
      Print *, A4
      End
------------------------------------
<cirrus:Programs:2:109> f77 -o arrayop.fout arrayop.f
NOTICE: Invoking /opt/SUNWspro/bin/f90 -f77 -ftrap=%none -o arrayop.fout arrayop.f
arrayop.f:
MAIN arrayop:
<cirrus:Programs:2:110> arrayop.fout
1 2 3 4 5
6 7 8 9 10
7 9 11 13 15
6 14 24 36 50
Rectangular arrays are indexed with one pair of brackets, such as a[i, j].
Jagged arrays are indexed with two pairs of brackets, such as a[i][j].
bsh % int[][] a = new int[3][4];
bsh % print(a.length);
3
bsh % print(a[1].length);
4
Note that a is a 3-element array of 4-element arrays. It is usual to also think of this as 3 rows of 4 columns each:
bsh % for (int i=0; i<3; i++) for (int j=0; j<4; j++) a[i][j] = 10*i+j;
bsh % for (int i=0; i<3; i++) {
        for (int j=0; j<4; j++) {System.out.print(a[i][j] + " ");}
        System.out.println();}
0 1 2 3
10 11 12 13
20 21 22 23
An array stored so that all the elements of the first row are stored before all the elements of the second row, etc., is referred to as stored in row major order.
We can see this clearly in C:
#include <stdio.h>
int a[3][4];
int main() {
  int i,j;
  for (i=0; i<3; i++) {
    for (j=0; j<4; j++) {
      a[i][j] = 10*i + j;
    }
  }
  /* Walk the 12 ints in the order they are laid out in memory. */
  for (i=0; i<12; i++) {
    printf("%3d", *(a[0] + i));}
  printf("\n");
  return 0;
}
--------------------------------------
<cirrus:Programs:2:125> gcc -Wall arrayorder.c -o arrayorder.out
<cirrus:Programs:2:126> arrayorder.out
  0  1  2  3 10 11 12 13 20 21 22 23
This shows that C stores arrays in row major order.
Fortran stores arrays in column major order. Since Fortran and C
programs can easily call each other, this is an important difference.
      Program arrayorder
      Integer A(3,4)
      Do 50 i = 1, 3
         Do 50 j = 1, 4
            A(i,j) = 10*i + j
 50   Continue
      Print *, A
      End
-----------------------------------------
<cirrus:Programs:2:127> f77 -o arrayorder.fout arrayorder.f
NOTICE: Invoking /opt/SUNWspro/bin/f90 -f77 -ftrap=%none -o arrayorder.fout arrayorder.f
arrayorder.f:
MAIN arrayorder:
<cirrus:Programs:2:128> arrayorder.fout
11 21 31 12 22 32 13 23 33 14 24 34
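The two storage orders come down to two offset formulas, sketched here in C with 0-based subscripts. The function names are hypothetical; the formulas themselves are the standard ones for a rectangular array of nrows rows and ncols columns.

```c
#include <stddef.h>

/* Row major (C): rows are contiguous, so skip i whole rows, then j cells. */
size_t row_major_offset(size_t i, size_t j, size_t ncols) {
    return i * ncols + j;
}

/* Column major (Fortran): columns are contiguous, so skip j whole columns,
   then i cells. */
size_t column_major_offset(size_t i, size_t j, size_t nrows) {
    return j * nrows + i;
}
```

For the 3 by 4 arrays above, the C element a[1][0] (value 10) is at offset 4, while the Fortran element A(2,1) (value 21, subscripts 1 and 0 when counted from zero) is at offset 1, in agreement with the two printed sequences.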
maps in C++, hash tables in Common Lisp, Maps in Java, and hashes in Perl, are generalizations of arrays for which the "index" can be any type. The "index" is called a key, and the element stored with the key is called the value.
Here is a use of Perl's hashes to print the length of all the months:
#! /util/bin/perl
@months = ("January", "February", "March", "April",
"May", "June", "July", "August",
"September", "October", "November", "December");
%monLength = ("January" => 31, "February" => 28, "March" => 31, "April" => 30,
              "May" => 31, "June" => 30, "July" => 31, "August" => 31,
              "September" => 30, "October" => 31, "November" => 30,
              "December" => 31);
foreach $month (@months) {
    print "$month has $monLength{$month} days.\n";
}
-----------------------------------------------------
<cirrus:Programs:1:176> perl months.perl
January has 31 days.
February has 28 days.
March has 31 days.
April has 30 days.
May has 31 days.
June has 30 days.
July has 31 days.
August has 31 days.
September has 30 days.
October has 31 days.
November has 30 days.
December has 31 days.
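C has no built-in counterpart, but the key and value idea can be sketched with a table of pairs. This is only an illustration: the names month_entry, month_length, and lookup_month_length are hypothetical, and the linear search stands in for the hashing that a real hash table would use.

```c
#include <stddef.h>
#include <string.h>

/* One key/value pair: the key is a string, the value an int. */
struct month_entry {
    const char *key;
    int value;
};

static const struct month_entry month_length[] = {
    {"January", 31},   {"February", 28}, {"March", 31},    {"April", 30},
    {"May", 31},       {"June", 30},     {"July", 31},     {"August", 31},
    {"September", 30}, {"October", 31},  {"November", 30}, {"December", 31}
};

/* Return the value stored under key, or -1 if the key is absent.
   A linear search; a real hash would compute an index from the key. */
int lookup_month_length(const char *key) {
    size_t n = sizeof month_length / sizeof month_length[0];
    for (size_t i = 0; i < n; i++)
        if (strcmp(month_length[i].key, key) == 0)
            return month_length[i].value;
    return -1;
}
```

Unlike an array subscript, a key that is not present is an ordinary outcome that the lookup must report somehow, here with -1.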
C and C++ call them structs. Common Lisp calls them structures.
Note that C++ and Common Lisp have true, modern, objects as well. See
the text for more details.
nil, which is an explicitly invalid address. That is, the value bound to a variable whose type is a pointer type is either a memory address or nil.
If ptr is a pointer variable, we want: ptr := <expression>, but <expression> would be evaluated for its r-value. So we need something that says "evaluate this expression for its l-value." In C and C++ that operator is &, and its operand must be an expression that could be on the left-hand side of an assignment statement.
If x is a variable and ptr is a pointer variable, what is the meaning of x := ptr?
If x is also a pointer variable, it's a simple assignment statement. If x is not a pointer variable, it's either an error or the compiler must know that ptr is to be dereferenced. C and C++ use * as an explicit dereferencing operator. Fortran 95 does implicit dereferencing.
Here is an example in Fortran 95, showing implicit dereferencing:
Program pointerTest
  Integer, Pointer :: ptr
  Integer, Target :: x
  Integer :: y
  x = 3
  ptr => x
  y = ptr
  Print *, "x = ", x, "y = ", y
End
-----------------------------------------------------
<cirrus:Programs:1:237> f95 -o pointerTest.fout pointerTest.f
<cirrus:Programs:1:238> pointerTest.fout
x = 3 y = 3
For comparison, a C program using explicit dereferencing, with a pointer whose value is an address in the stack:
#include <stdio.h>
int* ptr;
void sub1() {
  int x, y;
  x = 3;
  ptr = &x;
  y = *ptr;
  printf("x = %2d; y = %2d.\n", x, y);
}
void sub2() {
  int z = 5;
  printf("z = %2d.\n", z);
}
void sub3() {
  printf("*ptr = %2d.\n", *ptr);
}
int main() {
  sub1();
  sub2();
  sub3();
  return 0;
}
--------------------------------------------
<cirrus:Programs:1:193> gcc -Wall pointerTest.c -o pointerTest.out
<cirrus:Programs:1:194> pointerTest.out
x = 3; y = 3.
z = 5.
*ptr = 5.
Pointer arithmetic is allowed in C and C++. If ptr is a pointer to type typ, and i is of type int, the expression ptr + i evaluates to the address i*sizeof(typ) beyond ptr.
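The scaling by sizeof(typ) can be checked directly by measuring the distance in bytes. In this small sketch typ is taken to be double, and byte_distance is a hypothetical helper name.

```c
#include <stddef.h>

/* Number of bytes between ptr and ptr + i, measured by viewing both
   addresses as char pointers (a char is one byte by definition). */
ptrdiff_t byte_distance(const double *ptr, int i) {
    return (const char *)(ptr + i) - (const char *)ptr;
}
```

For any valid i within the array, byte_distance(ptr, i) equals i * sizeof(double), whatever size double happens to have on the machine.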
The allocation operators new, in Java and C++, and malloc(size), in C, return pointers to the newly allocated heap memory.
The dangling pointer problem is commonly solved by taking explicit deallocation away from the programmer.
#include <stdio.h>
#include <stdlib.h>
int* ptr;
int main() {
  ptr = malloc(sizeof(int));
  *ptr = 3;
  free(ptr);
  /* ptr is now dangling; this use is an error that C does not catch. */
  printf("*ptr = %2d\n", *ptr);
  return 0;
}
---------------------------------------------------
<cirrus:Programs:1:253> gcc -Wall danglingTest.c -o danglingTest.out
<cirrus:Programs:1:254> danglingTest.out
*ptr = 3
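Where explicit deallocation remains, as in C, one common defensive habit is to reset a pointer as soon as its memory is freed. The sketch below is such a convention, not anything the language enforces, and release_int is a hypothetical helper name. It does nothing for other pointers that alias the same cell.

```c
#include <stdlib.h>

/* Free the int that *pp points to, then set *pp to NULL so that a later
   mistaken use can be detected by a NULL test instead of silently reading
   deallocated memory. Safe to call twice: free(NULL) does nothing. */
void release_int(int **pp) {
    free(*pp);
    *pp = NULL;
}
```

After release_int(&ptr), a test such as ptr != NULL before dereferencing would have caught the misuse in the program above.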