早上在CPlusPlus版看到这么一个帖,“一道面试题,switch的变态语法”,该帖内容如下:
http://www.newsmth.net/bbstcon.php?board=CPlusPlus&gid=122076
发信人: ajaxchelsea (切尔斯基), 信区: CPlusPlus
标 题: 一道面试题,switch的变态语法
发信站: 水木社区 (Tue Aug 23 21:11:37 2005), 站内/*
* 让你猜猜下面这个函数是干嘛的
* 幸亏俺常在水木漂,不受“语法臭蛋”影响
* 一下就猜出这是strcpy,呵呵
*/
void foo(char* a, char* b, int len){
switch(len & 7) {
default:
while (len > 7) {
len -= 8;
*b++ = *a++;
case 7:*b++ = *a++;
case 6:*b++ = *a++;
case 5:*b++ = *a++;
case 4:*b++ = *a++;
case 3:*b++ = *a++;
case 2:*b++ = *a++;
case 1:*b++ = *a++;
}
}
}
后边有人指出是Duff’s device,于是去google了一下,Duff本人原话如下:
http://www.lysator.liu.se/c/duffs-device.html
From research!ucbvax!dagobah!td Sun Nov 13 07:35:46
1983
Received: by ucbvax.ARPA (4.16/4.13) id AA18997; Sun, 13 Nov
83 07:35:46 pst
Received: by dagobah.LFL (4.6/4.6b) id AA01034;
Thu, 10 Nov 83 17:57:56 PST
Date: Thu, 10 Nov 83 17:57:56 PST
From: ucbvax!dagobah!td (Tom Duff)
Message-Id:
<8311110157.AA01034@dagobah.LFL>
To: ucbvax!decvax!hcr!rrg, ucbvax!ihnp4!hcr!rrg, ucbvax!research!dmr, ucbvax!research!rob
Consider the following routine, abstracted from code which copies an array of
shorts into the Programmed IO data register of an Evans & Sutherland Picture
System II:
send(to, from, count) register short *to, *from; register count; { do *to = *from++; while(--count>0); }
(Obviously, this fails if the count is zero.)
The VAX C compiler compiles
the loop into 2 instructions (a movw and a sobleq,
I think.) As it
turns out, this loop was the bottleneck in a real-time animation playback
program which ran too slowly by about 50%. The standard way to get
more speed out of something like this is to unwind the loop a few times,
decreasing the number of sobleqs. When you do that, you wind up with a
leftover partial loop. I usually handle this in C with a switch that
indexes a list of copies of the original loop body. Of course, if I
were writing assembly language code, I’d just jump into the middle of the
unwound loop to deal with the leftovers. Thinking about this yesterday, the following implementation occurred to me:
send(to, from, count) register short *to, *from; register count; { register n=(count+7)/8; switch(count%8){ case 0: do{ *to = *from++; case 7: *to = *from++; case 6: *to = *from++; case 5: *to = *from++; case 4: *to = *from++; case 3: *to = *from++; case 2: *to = *from++; case 1: *to = *from++; }while(--n>0); } }
Disgusting, no? But it compiles and runs just fine. I
feel a combination of pride and revulsion at this discovery. If no
one’s thought of it before, I think I’ll name it after myself.
It amazes me that after 10 years of writing C there are still little corners
that I haven’t explored fully. (Actually, I have another revolting way
to use switches to implement interrupt driven state machines but it’s too horrid
to go into.)
Many people (even bwk?) have said that
the worst feature of C is that switches don’t break automatically before each
case label. This code forms some sort of argument in that debate, but
I’m not sure whether it’s for or against.
yrs trly
Tom
仔细读了一下上面的文字,并参考了CSDN-文档中心-其它编程语言- 泛型<编程>:类型化缓存(II)http://dev.csdn.net/article/28/28875.shtm,得出结论如下:Duff’s Device可以算作一种loop unrolling方法吧,巧妙利用了switch和while的嵌套,不使用break语句,简单漂亮地解决了边界条件的问题,提高了执行效率。因为减少了7/8的比较判断,因此在循环次数不大的情况下,可以节省很多时间。我们通过检查现代PC的架构来寻找这个现象的解释。处理器比主内存总线快5-10倍。为了加速内存访问有二级缓存。一级在处理器内(1级),另一级在处理器边上(在奔腾三中和处理器封装在一起)最好情况是,一个操作的所有需要处理的内存都在1级缓存内。最坏情况是,你进行随机分散的内存存取操作,这样你总是不能命中缓存,最终都命中了主内存。当填充大量数据时,你不能从缓存中得益,所以填充速度会主要被限定在主内存带宽上。你对循环本身作的优化不能带来很大提高,因为瓶颈在内存,而不是处理器操作。不管你写的循环有多聪明,使用寄存器,或不展开循环,处理器总是会等待主内存。这是为什么Duff’s
Device和for循环对大内存块表现一样。 当填充较少量数据时情况起了变化。数据有更多可能被放在缓存内,而缓存和处理器一样快。现在,处理器执行的代码决定了循环的速度。Duff’s
Device比for循环有优势,因为它执行了较少比较。
当然上文也指出,标准库函数memcpy使用了特殊的循环汇编语句(x86中的术语是rep stos)。对于缓存到缓存拷贝,这种汇编语句比建立在跳转(jump),赋值和比较基础上的循环要快。因此连续内存的拷贝显然还是memcpy效率最高。但如果是循环赋值的话,Duff’s Device要比单纯循环速度快。
另外,上文中也给出了各种编译器下的memcpy, 普通循环和Duff’s Device的效率只比。但是自己还是很好奇,于是写了一段代码实际测试了一下。功能只是拷贝一段连续数据,仅比较了普通循环和Duff’s Device。完了之后用Trimedia的tmsim仿真了一下,(本来想用quantify测一下,不知道为啥,一用俺的机子就慢的不行,一动不动,只好放弃),数据如下:
拷贝数据长度 100,000bytes 1000bytes
普通循环 639494cycles 6091cycles
Diff’s Device 464751cycles 4689cycles
比例和在代码中用clock()函数测得的结果差不多,接近提高25%的效率。数据长度增加100倍,好像也没太大影响,不知道为啥。但结果是肯定的,Duff’s Device确实要快不少。
看了一下,Tom Duff发现这个算法的时候是1983年,嗯,那时候俺差不多快两岁了…长见识了。
做完这些,又在joke版看了一个有关“韩国的排骨有多贵”的帖,差点儿没笑翻了,还以为韩剧里主角容易得个白血病啥的,是因为泡菜吃太多,原来他们根本是啥都吃不着,还嘲笑我们中餐不好吃,推荐mms去韩国减肥,哈哈。http://www.newsmth.net/bbscon.php?board=Joke&id=261783