Dick Duck's SAS tips

This is the nest where I keep my SAS eggs. It's all here... Arrays, control loops, links to online SAS resources, you name it.

Thursday, February 23, 2006

Recoding answer files

Say you have a file with responses to a test that is a mixture of multiple choice (answers coded as single characters, A through F) and open-ended questions, scoring either 0 or 1 (a numeric variable of a wasteful length 8). So you can treat all the data the same way, you want to convert all the variables to numerics of length 3. We use a variation on the old char to num conversion (and the variables turn up in the right order if they weren't before to boot).

data s1b (drop=n1-n35 i);
set s1a (rename=(q1-q35=n1-n35));

length q1-q35 3.;

/* convert char to num of length 3 */
array qc $ n1-n4 n6 n8-n9 n11-n18 n20 n23 n25-n35;
array qn q1-q4 q6 q8-q9 q11-q18 q20 q23 q25-q35;

do i = 1 to dim(qc);
select (qc[i]);
when ('') qn[i] = 9;
when ('A') qn[i] = 2;
when ('B') qn[i] = 3;
when ('C') qn[i] = 4;
when ('D') qn[i] = 5;
when ('E') qn[i] = 6;
when ('F') qn[i] = 7;
/* 8 = error flag */
otherwise qn[i] = 8;
end;
end;

/* convert num of length 8 to length 3 */
array qn8 n5 n7 n10 n19 n21 n22 n24;
array qn3 q5 q7 q10 q19 q21 q22 q24;
do i = 1 to dim(qn8);
qn3[i] = qn8[i];
end;
run;

Good housekeeping

To get rid of temporary files after they've served their usefulness, use the following:

proc datasets; delete tmp1 tmp2; run;
Apart from sort, this is the only proc that I write all on one line. You can do a lot more stuff with it, such as repair files, make copies and save files.

Tuesday, February 21, 2006

Reordering variables

You've finally got your data the way you want them, but now they're in the wrong order. Let's say we have the answers to 62 multiple-choice questions, and we want them all nicely lined up. The trick is similar to char to num conversion:

data answers (drop = i t1-t62);
set answers (rename=(q1-q62=t1-t62));
array qq1 $ t1-t62;
attrib q1-q62 length = $1;
array qq2 $ q1-q62;
do i = 1 to dim(qq1);
qq2[i] = qq1[i];
end;
run;
We need to set the length attribute if we want the new variables to have the same length as the old ones.

Setting length and other attributes for existing variables

Say you want to change the length of an existing variable. Here's how:

data pond;
attrib location length=$8;
set pond;
run;

Monday, February 20, 2006

Converting a variable from numeric to character

Earlier, we 'converted' a variable from character to numeric. Sometimes, we want to go the other way. Here is how:

data s1b (drop = i n5 n7 n10 n19 n21 n22 n36); 
set s1 (rename=(q5=n5 q7=n7 q10=n10 q19=n19 q21=n21 q22=n22 q36=n36));
array qq $ q5 q7 q10 q19 q21 q22 q36;
array nn n5 n7 n10 n19 n21 n22 n36;
do i = 1 to dim(qq);
qq[i] = put (nn[i], 8.0);
end;
run;
Instead of using 'input', we use 'put'. Also, this example uses arrays to convert multiple variables at the same time. If the items had been consecutive, we could have shortened the code by using ranges (q1-q5).

Comparing if, else if and select

Suppose we have a paper with sixteen item. We want to set the maximum number of marks for each of the items. We could just code:

if item = 1 then maxmark = 4;
if item = 2 then maxmark = 6;
if item = 3 then maxmark = 6;
if item = 4 then maxmark = 9;
if item = 5 then maxmark = 4;
if item = 6 then maxmark = 6;
if item = 7 then maxmark = 6;
if item = 8 then maxmark = 9;
if item = 9 then maxmark = 4;
if item = 10 then maxmark = 6;
if item = 11 then maxmark = 6;
if item = 12 then maxmark = 9;
if item = 13 then maxmark = 4;
if item = 14 then maxmark = 6;
if item = 15 then maxmark = 6;
if item = 16 then maxmark = 9;
Suppose we have 2000 candidates, or 32,000 items. The above code would have to interpret 32,000 if-statements. How about this:

if item in (1, 5, 9, 13) then maxmark = 4;
else if item in (2, 3, 6, 7, 10, 11, 14, 15) then maxmark = 6;
else maxmark = 9;
Apart from being more succinct, by using 'in' and 'else', the code now only needs to interpret 2500 if-statements. We can do even better, by putting the most frequent case first:

if item in (2, 3, 6, 7, 10, 11, 14, 15) then maxmark = 6;
else if item in (1, 5, 9, 13) then maxmark = 4;
else maxmark = 9;
The code is a bit less accessible, but now only 2000 if-statements need to be evaluated. But there is a third way:

select (item);
when (1, 5, 9, 13) maxmark = 4;
when (2, 3, 6, 7, 10, 11, 14, 15) maxmark = 6;
when (4, 8, 12, 16) maxmark = 9; /* or: otherwise maxmark = 9 */
end;
This has the same computational load as the previous code using 'if else if', but it is clearer what is going on. Furthermore, you can be assured that this code is always the most computationally efficient.

Thursday, February 16, 2006

One to Many and Many to One transformations

A common occurrence: you have a file with candidates' marks, with the marks on the separate items as variables. You want to transform this to a file where each mark on an individual item constitutes a record. In other words, you want to go from:

cand item1 item2 item3 item4
1 4 5 6 7
2 2 0 3 4
3 1 4 5 8
to:

cand item mark
1 1 4
1 2 5
1 3 6
1 4 7
2 1 2
...
Here's how to do it:


1. data marks_by_item;
2. set marks_by_candidate;
3. array items item1-item4;
4. do item = 1 to 4;
5. mark = items[item];
6. output;
7. end;
8. drop item1-item4;
9. run;
In line 3, you set up an array with the item scores. Using the do loop in line 4, we assign the item score to the new variable 'mark' and output that to file. Finally, in line 9 we drop the item variables.
Of course, you might want to go the other way:

 1. data marks_by_candidate_again;
2. set marks_by_item;
3. by cand;
4. array items item1-item4;
5. retain items;
6. if first.candid then
7. do i = 1 to 4;
8. items[i] = .;
9. end;
10. items[item] = mark;
11. if last.cand then output;
12. drop item i mark;
13. run;
Line 3 specifies we are doing this by candidate, as we want to convert the four item scores of a candidate back to one record. In line 4, we create an array with the item scores. Notice that we are creating both the array and the four item variables here. Line 5 specifies that we want to retain the items array, otherwise it will be reset for each new record. In line 6 we are using the 'first.' system variable. This evaluates to 'true' for the first record in the 'by' group. We are using it to initialize the array. Then, for the last record in the 'by' group we output the data to file. Last, we delete the temporary variables.

A more concise way of doing the same (thanks NR), exploiting the 'loopiness' of SAS, is:

1. data marks_by_candidate_again (drop= item mark i); 
2. array items item1-item4;
3. do i = 1 to 4;
4. set marks_by_item;
5. by cand;
6. items[i] =mark;
7. if last.cand then return;
8. end;
9. run;

Select from set of elements

Say you want to select a certain set of candidates, for instance, candidates 1, 5, 7, 13 and 20.
Rather than:

if (cand = 1 or cand = 5 or cand = 7 or cand = 13 or cand = 20)
you can use:

if cand in (1, 5, 7, 13, 20)

Wednesday, February 15, 2006

Tranpose for quicker viewing

SAS can read very large data files indeed. But only the records that you are viewing are held in memory. This means that for two files of the same size, it takes much much longer to view the file with a few records but a large number of variables than a file with many records but only a few variables. (This is why you can have a number of records that is only limited by the size of your hard drive, but 'only' 32,768 variables.)
So the first thing to do if you have one of the former (for instance when using a two-dimensional array) is to transpose it into the latter.

Convert variable from character to numeric

Thanks to Paul Duckman for this. Suppose you have a datafile with a character variable (var) that you want to convert to a numeric variable. While this is not possible, the following code replaces the character variable with a numeric variable of the same name.

1. data datfile (drop = tmp);
2. set datfile (rename = (var = tmp));
3. var = input (tmp, best4.);
4. run;
On line 2, the character variable is renamed. One line 3, a new numeric variable is created, with the same name as the original character variable, and its value is set. On line 1, the temporary variable is deleted.

Tuesday, February 14, 2006

Dick Duck says 'Hi'

Keeping all your SAS tips on post-its is not so smart if you're an aquine animal, so I thought I'd gather them all in one place. This is mainly meant for me to remind me of how to do things, but feel free to take advantage.
I am writing this from my home pond, so it'll have to wait until I fly into my work pond tomorrow to start adding entries...