Dick Duck's SAS tips

This is the nest where I keep my SAS eggs. It's all here... Arrays, control loops, links to online SAS resources, you name it.

Friday, March 10, 2006

Freq's a wreck

proc freq is very useful if you want to see a quick frequency distribution of some variables on screen. But often you want to save that distribution, and use it as the basis for some calculations, or merge it with another file. For instance, you might have a file with all the candidates' item marks and you want to derive a file containing the mark distribution for all the items. To view them on screen you would do something like:

proc freq data = marks;
by item;
tables mark mark_adjusted;
run;
You would think that if you did the following, you would save these results to the file 'mark_distribution':

proc freq data = marks noprint;
by item;
tables mark mark_adjusted;
output out = mark_distribution;
run;
But output is only meant to save to file certain statistics, like the mean and so on. Since you didn't specify any, freq will throw a wobble. Here's how to do what you really want:

proc freq data = marks;
by item;
tables mark / noprint out = mark_distribution;
tables mark_adjusted / noprint out = mark_adj_distribution;
run;
Of course, if you want these distributions combined you will have to merge them:

proc sort data = mark_distribution; by item mark; run;
proc sort data = mark_adj_distribution; by item mark; run;
data mark_distribution;
merge mark_distribution mark_adj_distribution;
by item mark;
run;
proc datasets; delete mark_adj_distribution; run;

Tuesday, March 07, 2006

Bind those keys!

When you hit F9, you'll see a window with all the keybindings (shortcut keys to you and me). F12 is usually empty, and I like to put the following in:

F12     output; clear; log; clear; wpgm
This clears your output window and your log window and then returns you to the editor window. I like to hit F12 before I submit any new commands, so I don't have to search and scroll through the log and the output, but start with nice shiny new ones.

Thursday, March 02, 2006

Widget

Install this widget on your dashboard, and you'll only be ever one click away from my sage advice...
dick duck's sas tips widget

Thursday, February 23, 2006

Recoding answer files

Say you have a file with responses to a test that is a mixture of multiple choice (answers coded as single characters, A through F) and open-ended questions, scoring either 0 or 1 (a numeric variable of a wasteful length 8). So you can treat all the data the same way, you want to convert all the variables to numerics of length 3. We use a variation on the old char to num conversion (and the variables turn up in the right order if they weren't before to boot).

data s1b (drop=n1-n35 i);
set s1a (rename=(q1-q35=n1-n35));

length q1-q35 3.;

/* convert char to num of length 3 */
array qc $ n1-n4 n6 n8-n9 n11-n18 n20 n23 n25-n35;
array qn q1-q4 q6 q8-q9 q11-q18 q20 q23 q25-q35;

do i = 1 to dim(qc);
select (qc[i]);
when ('') qn[i] = 9;
when ('A') qn[i] = 2;
when ('B') qn[i] = 3;
when ('C') qn[i] = 4;
when ('D') qn[i] = 5;
when ('E') qn[i] = 6;
when ('F') qn[i] = 7;
/* 8 = error flag */
otherwise qn[i] = 8;
end;
end;

/* convert num of length 8 to length 3 */
array qn8 n5 n7 n10 n19 n21 n22 n24;
array qn3 q5 q7 q10 q19 q21 q22 q24;
do i = 1 to dim(qn8);
qn3[i] = qn8[i];
end;
run;

Good housekeeping

To get rid of temporary files after they've served their usefulness, use the following:

proc datasets; delete tmp1 tmp2; run;
Apart from sort, this is the only proc that I write all on one line. You can do a lot more stuff with it, such as repair files, make copies and save files.

Tuesday, February 21, 2006

Reordering variables

You've finally got your data the way you want them, but now they're in the wrong order. Let's say we have the answers to 62 multiple-choice questions, and we want them all nicely lined up. The trick is similar to char to num conversion:

data answers (drop = i t1-t62);
set answers (rename=(q1-q62=t1-t62));
array qq1 $ t1-t62;
attrib q1-q62 length = $1;
array qq2 $ q1-q62;
do i = 1 to dim(qq1);
qq2[i] = qq1[i];
end;
run;
We need to set the length attribute if we want the new variables to have the same length as the old ones.

Setting length and other attributes for existing variables

Say you want to change the length of an existing variable. Here's how:

data pond;
attrib location length=$8;
set pond;
run;

Monday, February 20, 2006

Converting a variable from numeric to character

Earlier, we 'converted' a variable from character to numeric. Sometimes, we want to go the other way. Here is how:

data s1b (drop = i n5 n7 n10 n19 n21 n22 n36); 
set s1 (rename=(q5=n5 q7=n7 q10=n10 q19=n19 q21=n21 q22=n22 q36=n36));
array qq $ q5 q7 q10 q19 q21 q22 q36;
array nn n5 n7 n10 n19 n21 n22 n36;
do i = 1 to dim(qq);
qq[i] = put (nn[i], 8.0);
end;
run;
Instead of using 'input', we use 'put'. Also, this example uses arrays to convert multiple variables at the same time. If the items had been consecutive, we could have shortened the code by using ranges (q1-q5).

Comparing if, else if and select

Suppose we have a paper with sixteen item. We want to set the maximum number of marks for each of the items. We could just code:

if item = 1 then maxmark = 4;
if item = 2 then maxmark = 6;
if item = 3 then maxmark = 6;
if item = 4 then maxmark = 9;
if item = 5 then maxmark = 4;
if item = 6 then maxmark = 6;
if item = 7 then maxmark = 6;
if item = 8 then maxmark = 9;
if item = 9 then maxmark = 4;
if item = 10 then maxmark = 6;
if item = 11 then maxmark = 6;
if item = 12 then maxmark = 9;
if item = 13 then maxmark = 4;
if item = 14 then maxmark = 6;
if item = 15 then maxmark = 6;
if item = 16 then maxmark = 9;
Suppose we have 2000 candidates, or 32,000 items. The above code would have to interpret 32,000 if-statements. How about this:

if item in (1, 5, 9, 13) then maxmark = 4;
else if item in (2, 3, 6, 7, 10, 11, 14, 15) then maxmark = 6;
else maxmark = 9;
Apart from being more succinct, by using 'in' and 'else', the code now only needs to interpret 2500 if-statements. We can do even better, by putting the most frequent case first:

if item in (2, 3, 6, 7, 10, 11, 14, 15) then maxmark = 6;
else if item in (1, 5, 9, 13) then maxmark = 4;
else maxmark = 9;
The code is a bit less accessible, but now only 2000 if-statements need to be evaluated. But there is a third way:

select (item);
when (1, 5, 9, 13) maxmark = 4;
when (2, 3, 6, 7, 10, 11, 14, 15) maxmark = 6;
when (4, 8, 12, 16) maxmark = 9; /* or: otherwise maxmark = 9 */
end;
This has the same computational load as the previous code using 'if else if', but it is clearer what is going on. Furthermore, you can be assured that this code is always the most computationally efficient.

Thursday, February 16, 2006

One to Many and Many to One transformations

A common occurrence: you have a file with candidates' marks, with the marks on the separate items as variables. You want to transform this to a file where each mark on an individual item constitutes a record. In other words, you want to go from:

cand item1 item2 item3 item4
1 4 5 6 7
2 2 0 3 4
3 1 4 5 8
to:

cand item mark
1 1 4
1 2 5
1 3 6
1 4 7
2 1 2
...
Here's how to do it:


1. data marks_by_item;
2. set marks_by_candidate;
3. array items item1-item4;
4. do item = 1 to 4;
5. mark = items[item];
6. output;
7. end;
8. drop item1-item4;
9. run;
In line 3, you set up an array with the item scores. Using the do loop in line 4, we assign the item score to the new variable 'mark' and output that to file. Finally, in line 9 we drop the item variables.
Of course, you might want to go the other way:

 1. data marks_by_candidate_again;
2. set marks_by_item;
3. by cand;
4. array items item1-item4;
5. retain items;
6. if first.candid then
7. do i = 1 to 4;
8. items[i] = .;
9. end;
10. items[item] = mark;
11. if last.cand then output;
12. drop item i mark;
13. run;
Line 3 specifies we are doing this by candidate, as we want to convert the four item scores of a candidate back to one record. In line 4, we create an array with the item scores. Notice that we are creating both the array and the four item variables here. Line 5 specifies that we want to retain the items array, otherwise it will be reset for each new record. In line 6 we are using the 'first.' system variable. This evaluates to 'true' for the first record in the 'by' group. We are using it to initialize the array. Then, for the last record in the 'by' group we output the data to file. Last, we delete the temporary variables.

A more concise way of doing the same (thanks NR), exploiting the 'loopiness' of SAS, is:

1. data marks_by_candidate_again (drop= item mark i); 
2. array items item1-item4;
3. do i = 1 to 4;
4. set marks_by_item;
5. by cand;
6. items[i] =mark;
7. if last.cand then return;
8. end;
9. run;

Select from set of elements

Say you want to select a certain set of candidates, for instance, candidates 1, 5, 7, 13 and 20.
Rather than:

if (cand = 1 or cand = 5 or cand = 7 or cand = 13 or cand = 20)
you can use:

if cand in (1, 5, 7, 13, 20)