From local Fri Mar 17 16:39 MST 1989 To: local Subject: post.comm Status: RO In article <2131@mister-curious.sw.mcc.com>, loo@mister-curious.sw.mcc.com (Joel Loo) writes: > In article <978@philmds.UUCP>, leo@philmds.UUCP (Leo de Wit) writes: > > And how about: > > puts(" A comment /* in here */"); > > And you can give more examples showing it isn't that trivial; a challenge > > for the sed adept, perhaps ... > > Leo. > [And a lot of previous articles on the same topic] > > The problem is: sed and vi do not understand C syntax. > > Solution: write a lex program to strip comments. The program must > understand C syntax enough to know what is a comment and what is not. > > Encouragement: it should not be too difficult. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ It isn't. Six lines of Lex source (not counting initialization) are enough. A Lex source for ``uncomment'' has been posted in comp.sources.unix: Subject: Volume 16 (Ends January 17, 1989) identlist List identifiers and declarations for C sources Attached is a minimum test for an uncommenting algorithm, including tests for quotes inside and outside comments. John Rupley uucp: ..{uunet | ucbvax | cmcl2 | hao!ncar!noao}!arizona!rupley!local internet: rupley!local@megaron.arizona.edu (O) Dept. Biochemistry, Univ. Arizona, Tucson AZ 85721 - (602) 621-3929 ---------------------------------------------------------------------------- /* * tests for ``uncomment'' * assume C-code conventions: * strings start and end on one line * comments can be multi-line * no tests for varieties of: '"' \'"\' etc * no tests for strings with newline escaped */ string4 "hi /*\"hi there*/there\"" comment1 /*one"*/"*/ comment2 /*\"hi there"*/"*/" comment3 /*\"hi there*/ comment4 /* hello/*hello/*hello/*hello*/ comment5 /*******/ comment6 /*/*/ a /**/ b /***/ c /****/ d /*////*/ comment7 /*/*// a /**// b /***// c /****// d /*////*// 1. /*****//"hello world */" ok /"hello world */" 2. /* hello /* /* world */ ok 3. /* */ hello /* */ ok hello 4. 
/**// /* this should produce "/ \n" for output */ ok / 5. /* */ hello */ ok hello */ 6. /*/*/ hello ok hello 7. /*////*/ ok 8. /*//*/ ok 9. abc = "/* fake comment"; /* got who ? */ ok abc = "STRING"; 10. /* "start quote "then next line end quote, after more characters than on line 1" more more more */ " ok " ---------------------------------------------------------------------------- From jeenglis@nunki.usc.edu Sat Mar 18 19:48:32 1989 Path: arizona!noao!ncar!ames!mailrus!tut.cis.ohio-state.edu!bloom-beacon!oberon!nunki.usc.edu!jeenglis From: jeenglis@nunki.usc.edu (Joe English) Newsgroups: comp.lang.c Subject: Re: Want a way to strip comments from a Message-ID: <3114@nunki.usc.edu> Date: 19 Mar 89 02:48:32 GMT References: <7150@siemens.UUCP> <9900010@bradley> <4896@cbnews.ATT.COM> <978@philmds.UUCP> Reply-To: jeenglis@nunki.usc.edu (Joe English) Organization: N of A Lines: 68 Status: RO leo@philmds.UUCP (Leo de Wit) writes: >In article <4896@cbnews.ATT.COM> smk@cbnews.ATT.COM (Stephen M. Kennedy) writes: >|In article <9900010@bradley> brian@bradley.UUCP writes: >|> The following works in vi: :%s/\/\*.*\*\///g >| >|/* And this */ important_variable = 42 /* doesn't work either! */ > >And how about: > > puts(" A comment /* in here */"); > >And you can give more examples showing it isn't that trivial; a challenge >for the sed adept, perhaps ... Does it *have* to be done in sed/awk/other text processor? 
This problem is fairly difficult to solve using regexp/editor
commands, but it's a piece of cake to do in C:

#include <stdio.h>

void eatcomment(void);

main() {
    int ch;
    int instring = 0;

    ch = getchar();
    while (ch != EOF) {
        switch (ch) {
            case '"' :
                instring = !instring;
                break;
            case '/' :
                if (!instring)
                    if ((ch = getchar()) == '*') { eatcomment(); ch=getchar(); }
                    else putchar('/');
                break;
            case '\\' :           /* in case this is a \" in a string, */
                putchar('\\');    /* pass it through now and don't let */
                ch = getchar();   /* the switch() eat it */
        }
        putchar(ch);
        ch = getchar();
    }
    exit(0);
}

void eatcomment(void) {
    int ch;

    for (;;) {
        ch = getchar();
        while (ch == '*')
            if ((ch = getchar()) == '/')
                return;
        if (ch == EOF)
            exit(1);              /* oops */
    }
}

------------

This hasn't been tested thoroughly; it's mostly from memory.

Joe English

jeenglis@nunki.usc.edu

>From ian@ux.cs.man.ac.uk Sun Mar 19 10:12:54 1989
Path: arizona!noao!ncar!ames!lll-winken!uunet!mcvax!ukc!mucs!ian
From: ian@ux.cs.man.ac.uk (Ian Cottam)
Newsgroups: comp.lang.c
Subject: Re: request for C comment stripper
Message-ID: <5693@ux.cs.man.ac.uk>
Date: 19 Mar 89 17:12:54 GMT
Organization: Computer Science, University of Manchester, UK
Lines: 30

I usually use a lex script for such things. I didn't have one for C,
but the following might do the trick.
N.B. Not tested, not proven, no warranty!
________________________________________________________________________
%{
/***** Lex script to strip comments from C texts ******/
%}
%s COMMENT STRING CHAR
%%
\'                  {BEGIN CHAR; ECHO;}
\"                  {BEGIN STRING; ECHO;}
"/*"                BEGIN COMMENT;
.                   ECHO;
\n                  ECHO;
<CHAR>\\'           ECHO;
<CHAR>\'            {ECHO; BEGIN INITIAL;}
<STRING>\\\"        ECHO;
<STRING>\"          {ECHO; BEGIN INITIAL;}
<COMMENT>"*/"       BEGIN INITIAL;
<COMMENT>.          ;
<COMMENT>\n         ;
%%
-----------------------------------------------------------------
Ian Cottam, Room IT101, Department of Computer Science,
University of Manchester, Oxford Road, Manchester, M13 9PL, U.K.
Tel: (+44) 61-275 6157 FAX: (+44) 61-275-6280 ARPA: ian%ux.cs.man.ac.uk@nss.cs.ucl.ac.uk JANET: ian@uk.ac.man.cs.ux UUCP: ..!mcvax!ukc!mur7!ian ----------------------------------------------------------------- Path: arizona!rupley From: rupley@arizona.edu (John Rupley) Newsgroups: comp.lang.c Subject: Re: Want a way to strip comments from a Summary: Use Lex, if you only want to strip Message-ID: <9797@megaron.arizona.edu> Date: 20 Mar 89 10:34:57 GMT References: <7150@siemens.UUCP> <9900010@bradley> <4896@cbnews.ATT.COM> <3145@nunki.usc.edu> Organization: U of Arizona CS Dept, Tucson Lines: 65 In article <3145@nunki.usc.edu>, jeenglis@nunki.usc.edu (Joe English) writes: > I made a mistake in the comment-eating program I > posted yesterday -- it won't handle > /* something like *//* this. */ > Change the line in the '/' case from: > if ((ch = getchar()) == '*') { eatcomment(); ch=getchar(); } > to: > if ((ch = getchar()) == '*') { eatcomment(); ch=getchar(); continue; } > and it will work. If anyone's interested. It still doesn't work. It won't uncomment itself. Or the following line: '"' /* hi there */ '"' Or distinguish a correct string, with escaped newlines, "hi\ /*\*/ /**/\ there" from an incorrect string without the escapes. The point is not _whether_ one can write an ``uncomment'' in C, but how, and in what language, one can do it most simply. It is certainly right to use C if uncommenting is part of a larger design, as in cpp or ctags. But if the whole aim is to uncomment, then a pattern-handling language, such as Lex, is more appropriate. A few lines of Lex source do the job, and assuming familiarity with regular expression syntax, it is easy to write and understand, and hard to get the logic wrong. It should be doable with sed or awk, but probably not as easily, because they see a file as a stream of lines rather than characters. In C, the proper setting up of the switch and flags is not trivial, as the previous posting witnesses. 
A Lex source for uncommenting is attached (which I hope does not belie
the remark above about hard to get the logic wrong :-).

John Rupley
 uucp: ..{uunet | ucbvax | cmcl2 | hao!ncar!noao}!arizona!rupley!local
 internet: rupley!local@megaron.arizona.edu
--------------------------------------------------------------------
%{
/* UNCOMMENT- */
/* regexp for comment recognition based on usenet posting by: */
/* Chris Thewalt; thewalt@ritz.cive.cmu.edu */
%}
STRING          \"(\\\n|\\\"|[^"\n])*\"
COMMENTBODY     ([^*\n]|"*"+[^*/\n])*
COMMENTEND      ([^*\n]|"*"+[^*/\n])*"*"*"*/"
QUOTECHAR       \'[^\\]\'|\'\\.\'|\'\\[x0-9][0-9]*\'
ESCAPEDCHAR     \\.
%START COMMENT
%%
<COMMENT>{COMMENTBODY}  ;
<COMMENT>{COMMENTEND}   BEGIN 0;
<COMMENT>.|\n           ;
"/*"                    BEGIN COMMENT;
{STRING}                ECHO;
{QUOTECHAR}             ECHO;
{ESCAPEDCHAR}           ECHO;
.|\n                    ECHO;
---------------------------------------------------------------------------

Path: arizona!noao!ncar!unmvax!pprg.unm.edu!hc!lll-winken!uunet!mcvax!hp4nl!botter!star.cs.vu.nl!maart
From: maart@cs.vu.nl (Maarten Litmaath)
Newsgroups: comp.lang.c,comp.unix.wizards
Subject: Sed wins! It IS possible to strip C comments with 1 sed command!
Message-ID: <2186@solo11.cs.vu.nl>
Date: 21 Mar 89 01:22:14 GMT
References: <7150@siemens.UUCP> <9900010@bradley> <4221@omepd.UUCP> <981@philmds.UUCP> <982@philmds.UUCP>
Organization: V.U. Informatica, Amsterdam, the Netherlands
Lines: 120
Xref: arizona comp.lang.c:17795 comp.unix.wizards:15880

leo@philmds.UUCP (Leo de Wit) writes:
\Can it be proven to be impossible (that is, deleting the comments
\with one sed command - multi-line comments not considered) ?

No, because the script below WILL do it. It won't touch "/*...*/" inside
strings. Multi-line comments ARE considered and handled OK.
One can either use "sed -f script" or "sed -n ''".
After the script some test input follows (an awful but valid C program).
Spoiler: the sequence H x s/\n\(.\).*/\1/ x s/.// deletes the first character of the pattern space and appends it to the hold space; this space contains the characters not to be deleted. ----------8<----------8<----------8<----------8<----------8<---------- #n : loop /^$/{ x p n b loop } /^"/{ : double /^$/{ x p n b double } H x s/\n\(.\).*/\1/ x s/.// /^"/b break /^\\/{ H x s/\n\(.\).*/\1/ x s/.// } b double } /^'/{ : single /^$/{ x p n b single } H x s/\n\(.\).*/\1/ x s/.// /^'/b break /^\\/{ H x s/\n\(.\).*/\1/ x s/.// } b single } /^\\/{ H x s/\n\(.\).*/\1/ x b break } /^\/\*/{ s/.// : comment s/.// /^$/n /^*\//{ s/..// b loop } b comment } : break H x s/\n\(.\).*/\1/ x s/.// b loop ----------8<----------8<----------8<----------8<----------8<---------- main() { /* this * is a comment */ char /* Z /* Z / Z * Z /*/ *s = "/*", /* Z /* Z / Z * Z **/ c = '*', d = '/', f = '\\', g = '\'', *q = "*/", *p = "\ /* these characters are\ inside a string \"\\\ */"; int i = 12 / 2 * 3; exit(0); } -- Modeless editors and strong typing: |Maarten Litmaath @ VU Amsterdam: both for people with weak memories. 
|maart@cs.vu.nl, mcvax!botter!maart From arizona!noao!ncar!ames!lll-winken!uunet!mcvax!kth!osiris!uplog!lynx!pem Tue Mar 21 20:23:45 MST 1989 Status: RO Subject: Re: Want a way to strip comments from a Article 17801 of comp.lang.c: Path: arizona!noao!ncar!ames!lll-winken!uunet!mcvax!kth!osiris!uplog!lynx!pem >From: pem@zyx.SE (Per-Erik Martin) Newsgroups: comp.lang.c Subject: Re: Want a way to strip comments from a Message-ID: <852@lynx.zyx.SE> Date: 21 Mar 89 17:06:39 GMT References: <7150@siemens.UUCP> <9900010@bradley> <4896@cbnews.ATT.COM> <978@philmds.UUCP> <3114@nunki.usc.edu> <983@philmds.UUCP> Reply-To: pem@spunk.zyx.SE (Per-Erik Martin) Organization: ZYX Sweden AB, Stockholm, Sweden Lines: 91 In article <983@philmds.UUCP> leo@philmds.UUCP (Leo de Wit) writes: >In article <3114@nunki.usc.edu> jeenglis@nunki.usc.edu (Joe English) writes: >| >|Does it *have* to be done in sed/awk/other text processor? >|This problem is fairly difficult to solve using regexp/editor >|commands, but it's a piece of cake to do in C: > >Piece of cake? Your program can't even strip its own comments (try it)! Here's another example in C. It *is* a piece of cake (15 minutes work). The problem can be described with a simple automata which is easily coded in in C (with goto's, >yech<). I've tested it on most of the pathological examples given in this group and it seems to work. 
----------------------------------------------------------------------------
/* cstrip.c   pem@zyx.SE, 1989 */

#include <stdio.h>

main()
{
  char c, c1;

  goto into_code;

 in_code:
  putchar(c);
 into_code:
  switch (c = (char)getchar()) {
  case EOF:
    exit(0);
  case '\'':
    goto in_char;
  case '"':
    goto in_string;
  case '/':
    c1 = c;
    if ((c = (char)getchar()) == '*')
      goto in_comment;
    putchar(c1);
  default:
    goto in_code;
  }

 in_char:
  putchar(c);
  switch (c = (char)getchar()) {
  case EOF:
    exit(1);
  case '\\':
    putchar(c);
    c = (char)getchar();
  default:
    putchar(c);
    while ((c = (char)getchar()) != '\'')
      putchar(c);
    goto in_code;
  }

 in_string:
  putchar(c);
  switch (c = (char)getchar()) {
  case EOF:
    exit(1);
  case '"':
    goto in_code;
  case '\\':
    putchar(c);
    c = (char)getchar();
  default:
    goto in_string;
  }

 in_comment:
  switch (c = (char)getchar()) {
  case EOF:
    exit(1);
  case '*':
    if ((c = (char)getchar()) == '/')
      goto into_code;
  default:
    goto in_comment;
  }
}
----------------------------------------------------------------------------
--
-------------------------------------------------------------------------------
- Per-Erik Martin, ZYX Sweden AB, Bangardsgatan 13, S-753 20 Uppsala, Sweden -
- Email: pem@zyx.SE                                                           -
-------------------------------------------------------------------------------

In article <852@lynx.zyx.SE>, pem@spunk.zyx.SE (Per-Erik Martin) writes:
> Here's another example in C. It *is* a piece of cake (15 minutes work).
> The problem can be described with a simple automata which is easily coded
> in in C (with goto's, >yech<). I've tested it on most of the pathological
> examples given in this group and it seems to work.

This one fails, too. Try:

    /***/ hi there /**/

Goes to show, for a quick and clean coding of a pattern-matching
automaton, think Lex. The Lex source that was posted is so simple it
would be hard to get the logic wrong. Two out of two C postings suggest
that it may be easier to err in coding the same automaton in C.
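
The failure on /***/ comes down to one step: after a star inside a
comment, the whole run of stars has to be consumed before testing for
the closing slash. A minimal sketch of just that step follows. It is
not taken from any posting; skip_comment is a hypothetical helper, and
it scans a string rather than stdin only to keep the example short.

```c
#include <assert.h>
#include <string.h>

/*
 * skip_comment: hypothetical helper, for illustration only.
 * p points just past the opening slash-star of a comment.
 * Returns a pointer just past the closing star-slash, or NULL
 * if the comment is never closed.
 */
const char *skip_comment(const char *p)
{
    assert(p != NULL);
    while ((p = strchr(p, '*')) != NULL) {
        while (*p == '*')       /* consume the whole run of stars */
            p++;
        if (*p == '/')          /* a run of stars followed by a slash ends it */
            return p + 1;
        /* otherwise keep searching from the character after the run */
    }
    return NULL;
}
```

Given the text that follows the opening delimiter in the test case
above, this should return a pointer to " hi there /**/", and NULL for
a comment that never closes.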
Not to imply that C has no advantages -- following comparison is for size of source and for time of uncommenting main.c of an emacs distribution: timex/real wc -l 13.95 10 eatLex.l Lex 2.53 37 eatC.c C code that works 1:27.13 78 eat.sed Maarten L's recently posted sed script (more lines than the C code :-) :-) As expected, one pays in efficiency for the ease of Lex coding. John Rupley rupley!local@megaron.arizona.edu Path: arizona!noao!ncar!unmvax!tut.cis.ohio-state.edu!mailrus!csd4.milw.wisc.edu!lll-winken!uunet!mcvax!ukc!icdoc!bilpin!jim From: jim@bilpin.UUCP (Jim G) Newsgroups: comp.lang.c Subject: Re: Want a way to strip comments from a C program Message-ID: <1467@bilpin.UUCP> Date: 22 Mar 89 12:35:15 GMT Organization: SRL, London, England Lines: 38 #{ v_langC.1 } Lots of postings on how not do this, and how other peoples suggestions won't work, but something of a shortage of answers. Weep no more ... if( Followup ~ /doesn't work with/ && Followup !~ /solution is/ ) print Followup > "/dev/null" :-) #{ zapcom.sh } # Remove comments from a C program # sed removes comment strings which begin and end on the same line # awk removes comment strings which extend across multiple lines # sed/awk both handle nesting of comments within their context sed -e ':nest' \ -e 's?/\*[^/\*][^/\*]*\*/??g' \ -e 'tnest' yourinput.c \ | awk ' /\/\*/ { if( COML == 0 ) print substr( $0, 1, index( $0, "/*" )-1 ) for( F=1; F<=NF; F++ ) if( $F == "/*" ) COML++ } /\*\// { REST = $0 for( F=1; F<=NF; F++ ) if( $F == "*/" ) { COML-- ; REST = substr( REST, index( REST, "*/" ) ) } if( COML == 0 ) print substr( REST, 3 ) next } COML==0 { print } ' > youroutput.c -- This line has been intentionally left blank. 
Path: arizona!noao!ncar!ames!hc!lll-winken!uunet!mcvax!kth!osiris!uplog!lynx!pem From: pem@zyx.SE (Per-Erik Martin) Newsgroups: comp.lang.c Subject: Re: Want a way to strip comments from a Message-ID: <858@lynx.zyx.SE> Date: 23 Mar 89 16:25:38 GMT References: <7150@siemens.UUCP> <9900010@bradley> <4896@cbnews.ATT.COM> <852@lynx.zyx.SE> <9833@megaron.arizona.edu> Reply-To: pem@lynx.zyx.SE (Per-Erik Martin) Organization: ZYX Sweden AB, Stockholm, Sweden Lines: 33 In article <9833@megaron.arizona.edu> rupley@arizona.edu (John Rupley) writes: > >This one fails, too. Try: > > /***/ hi there /**/ > Oops! Well, if you change the '*'-case in 'in_comment:' to this: do { if ((c = (char)getchar()) == '/') goto into_code; } while (c == '*'); it should work better. (Funny no one found the other bug yet... What do you expect after 15 minutes? ;-) >Goes to show, for a quick and clean coding of a pattern-matching >automaton, think Lex. The Lex source that was posted is so simple it >would be hard to get the logic wrong. Two out of two C postings suggest >that it may be easier to err in coding the same automaton in C. > >Not to imply that C has no advantages -- following comparison is for >size of source and for time of uncommenting main.c of an emacs distribution: > >[...timings...] Another advantage with C is that it's portable outside the Unix universe... -- ------------------------------------------------------------------------------- - Per-Erik Martin, ZYX Sweden AB, Bangardsgatan 13, S-753 20 Uppsala, Sweden - - Email: pem@zyx.SE - ------------------------------------------------------------------------------- In article <1179@masscomp.UUCP>, ftw@masscomp.UUCP (Farrell Woods) writes: >In article <9833@megaron.arizona.edu> rupley@arizona.edu (John Rupley) writes: >>This one fails, too. Try: >> /***/ hi there /**/ >Shouldn't it be a requirement that the program to be stripped at least compile? >This example will generate a syntax error. Aw, c'mon... be imaginative... 
replace "hi there" by a proper statement or whatever: /***/ main() {printf("hi there\n");} /**/ Cpp strips the comments (properly) and passes the program text. The buggy C code, which was being discussed in the previous posting, strips everything. Both of the earlier Lex postings do it right, which probably was the take-home lesson. John Rupley rupley!local@megaron.arizona.edu In article <1453@wpi.wpi.edu>, lfoard@wpi.wpi.edu (Lawrence C Foard) writes: > I just made this C comment stripper, I tried it on it self and it works > ok. If any one finds code it pukes on tell me (there is probably still > something I missed). It pukes ... try: /***/ main() {printf("hi there\n");} /* */ Score, anyone? (recent postings tested on K&RI-type code) sed 1/1 correct Lex 2/2 correct C 2/2 wrong Hmmm. John Rupley rupley!local@megaron.arizona.edu Path: arizona!noao!ncar!mailrus!ulowell!m2c!wpi!lfoard From: lfoard@wpi.wpi.edu (Lawrence C Foard) Newsgroups: comp.lang.c Subject: C comments Message-ID: <1486@wpi.wpi.edu> Date: 24 Mar 89 01:14:44 GMT References: <1167@unisec.usi.com> <5312@turnkey.TCC.COM> <1989Jan30.013936.11995@gpu.utcs.toronto.edu> <13048@steinmetz.ge.com> <1989Jan31.021121.13816@gpu.utcs.toronto.edu> Reply-To: lfoard@wpi.wpi.edu (Lawrence C Foard) Organization: Worcester Polytechnic Institute, Worcester, MA. USA Lines: 52 oops. It appears that the original posting had a bug /** **/ blew up here is a fixed version. Why is this program better than LEX?? Because it is free there are still some people who use PC and buy there own software. 
--------------Cut here---------------
/* Public domain C comment stripper created by Lawrence Foard */
/* version 2 */
#include <stdio.h>

char *a="/* this is a test 'of the emergency \" comment stripper \\ \'*/";
/* this is a'nasty\' "comment" meant / * to really confuse it"*//*\*/
int no_com()
{
 int c;
 static int quote=0,squote=0,slash=0;
 c=getc(stdin);
 if (slash || (c=='\\'))
  {
   slash=!slash;
   return(c);
  }
 if ((quote^=((c=='"') && !squote)) ||
     (squote^=((c=='\''/*\ and right here two \*/) && !quote))) return(c);
 if (c=='/')
  if ((c=getc(stdin))!='*')
   {
    ungetc(c,stdin);
    return('/');
   }
  else
   {
    c=0;
    do
     {
      ungetc(c,stdin);
      while(getc(stdin)!='*');
     }
    while((c=getc(stdin))!='/');
    return(no_com());
   }
 return(c);
}

main()
{
 int c;
 while((c=no_com())!=EOF)
  fputc(c,stdout);
}
--
Disclaimer: My school does not share my views about FORTRAN.
            FORTRAN does not share my views about my school.

Path: arizona!noao!ncar!unmvax!tut.cis.ohio-state.edu!ucbvax!ucsd!rutgers!att!ulysses!mhuxo!mhuxu!m10ux!mnc
From: mnc@m10ux.UUCP (Michael Condict)
Newsgroups: comp.lang.c
Subject: Re: Want a way to strip comments from a
Summary: lex script to delete comments from C source
Message-ID: <893@m10ux.UUCP>
Date: 24 Mar 89 05:17:17 GMT
References: <7150@siemens.UUCP> <9900010@bradley> <890@m10ux.UUCP>
Sender: netnews@m10ux.UUCP
Organization: AT&T Bell Labs, Murray Hill
Lines: 15

Oops, the previous lex script I posted for deleting comments from
C source code is incorrect -- it doesn't recognize: /***...**/
Here is a better one (simpler, too):

%%
\"([^\\"]*\\(.|\n))*[^\\"]*\"           ECHO;
"/*"([^*]|"*"+[^/*])*"*"*"*/"           ;
.                                       ECHO;

Okay, I promise to stop now. (Unless there is a bug in this one.)
--
Michael Condict         {att|allegra}!m10ux!mnc
AT&T Bell Labs          (201)582-5911    MH 3B-416
Murray Hill, NJ

Path: arizona!rupley
From: rupley@arizona.edu (John Rupley)
Newsgroups: comp.lang.c
Subject: Re: not the way ...
(was Re: Want a way to strip comments from a) Summary: close but no cigar Message-ID: <9881@megaron.arizona.edu> Date: 25 Mar 89 03:45:06 GMT References: <7150@siemens.UUCP> <9900010@bradley> <4221@omepd.UUCP> <1492@wpi.wpi.edu> Organization: U of Arizona CS Dept, Tucson Lines: 30 In article <1492@wpi.wpi.edu>, lfoard@wpi.wpi.edu (Lawrence C Foard) writes: > I tried the comment stripper I poster earlier today on these pathological > cases and it seems to get the right answer. Close, but no cigar. We're talking real pathology, here.... try: (echo '/*';yes '*//*';echo 'cosmetic */') | stripper_name Recursion blows the stack for your program. Previously posted strippers handle the above. If you insist on a compilable file, use a script to produce: /* [stack-blowing number of lines of *//*] */ compilable program text Why strip comments? (1) the original poster had a broken compiler that choked on comments; (2) the start of a cheap way to get a list or inverted index of identifiers (cpp does too much). I suspect all useful points (and more? :-) have been made about comment stripping -- perhaps this thread should die now. John Rupley rupley!local@megaron.arizona.edu From local Sat Mar 25 11:10 MST 1989 To: arizona!uunet!mcvax!ukc!icodc!bilpin!jim Subject: Re: Want a way to strip.... Cc: local Status: R > #{ v_langC.1 } > Lots of postings on how not do this, and how other peoples > suggestions won't work, but something of a shortage of answers. But there have been 5 (at least) correct postings (2 Lex, 1 sed, 2 C). The C solutions were wrong as first posted, but were revised subsequently. > Weep no more ... Ah, but I do (:-) -- your solution is incorrect, as well as being, perhaps, a bit inelegant. The attached shar file has some test text, in uncomment.tst2, that minimally checks out an ``uncomment''. Also attached are 6 correct solutions, i.e, the five noted above, plus my own C code. I prefer Lex, for its simplicity. 
Awk, by the way, is not optimal for this problem, owing to multiline matches not being native to awk. One can use it, but why bother, when there are better ways? In this case, even C. Regards, John Rupley uucp: ..{uunet | ucbvax | cmcl2 | hao!ncar!noao}!arizona!rupley!local internet: rupley!local@megaron.arizona.edu (H) 30 Calle Belleza, Tucson AZ 85716 - (602) 325-4533 (O) Dept. Biochemistry, Univ. Arizona, Tucson AZ 85721 - (602) 621-3929 Path: arizona!noao!ncar!unmvax!pprg.unm.edu!hc!lll-winken!uunet!mcvax!hp4nl!botter!star.cs.vu.nl!maart From: maart@cs.vu.nl (Maarten Litmaath) Newsgroups: comp.lang.c Subject: C comment stripper shell script? -> use sed pipeline Message-ID: <2216@solo8.cs.vu.nl> Date: 25 Mar 89 14:34:38 GMT References: <1467@bilpin.UUCP> Organization: V.U. Informatica, Amsterdam, the Netherlands Lines: 120 jim@bilpin.UUCP (Jim G) writes: \#{ zapcom.sh } \# Remove comments from a C program \# sed removes comment strings which begin and end on the same line \# awk removes comment strings which extend across multiple lines \# sed/awk both handle nesting of comments within their context Aha! You're using a SHELL script! Well, in that case there's another word for my `sed approach' :-) No awk necessary. This pipeline is reasonably fast too! Usage: sed -f Cstrip.1.sed foo.c | sed -f Cstrip.2.sed | sed -f Cstrip.3.sed : This is a shar archive. Extract with sh, not csh. : This archive ends with exit, so do not worry about trailing junk. 
: --------------------------- cut here -------------------------- PATH=/bin:/usr/bin:/usr/ucb echo Extracting 'Cstrip.1.sed' sed 's/^X//' > 'Cstrip.1.sed' << '+ END-OF-FILE ''Cstrip.1.sed' X#n Xs/\(.\)/\1\ X/g Xs/$/==/p + END-OF-FILE Cstrip.1.sed chmod 'u=rw,g=r,o=r' 'Cstrip.1.sed' set `wc -c 'Cstrip.1.sed'` count=$1 case $count in 27) :;; *) echo 'Bad character count in ''Cstrip.1.sed' >&2 echo 'Count should be 27' >&2 esac echo Extracting 'Cstrip.2.sed' sed 's/^X//' > 'Cstrip.2.sed' << '+ END-OF-FILE ''Cstrip.2.sed' X#n X/"/{ X : L0 X p X n X /"/{ X p X b X } X /\\/{ X p X n X } X b L0 X} X/'/{ X : L1 X p X n X /'/{ X p X b X } X /\\/{ X p X n X } X b L1 X} X/\\/{ X p X n X p X b X} X/\//{ X h X n X /*/{ X : L2 X n X : L3 X /*/{ X n X /\//b X b L3 X } X b L2 X } X H X g X} Xp + END-OF-FILE Cstrip.2.sed chmod 'u=rw,g=r,o=r' 'Cstrip.2.sed' set `wc -c 'Cstrip.2.sed'` count=$1 case $count in 232) :;; *) echo 'Bad character count in ''Cstrip.2.sed' >&2 echo 'Count should be 232' >&2 esac echo Extracting 'Cstrip.3.sed' sed 's/^X//' > 'Cstrip.3.sed' << '+ END-OF-FILE ''Cstrip.3.sed' X#n X/==/{ X g X s/\n//gp X s/.*// X x X b X} XH + END-OF-FILE Cstrip.3.sed chmod 'u=rw,g=r,o=r' 'Cstrip.3.sed' set `wc -c 'Cstrip.3.sed'` count=$1 case $count in 40) :;; *) echo 'Bad character count in ''Cstrip.3.sed' >&2 echo 'Count should be 40' >&2 esac exit 0 -- Modeless editors and strong typing: |Maarten Litmaath @ VU Amsterdam: both for people with weak memories. |maart@cs.vu.nl, mcvax!botter!maart Path: arizona!noao!ncar!mailrus!uflorida!novavax!twwells!bill From: bill@twwells.uucp (T. William Wells) Newsgroups: comp.lang.c Subject: Re: Want a way to strip comments from a Message-ID: <795@twwells.uucp> Date: 25 Mar 89 19:31:10 GMT References: <7150@siemens.UUCP> <9900010@bradley> <4896@cbnews.ATT.COM> <3145@nunki.usc.edu> <9797@megaron.arizona.edu> Reply-To: bill@twwells.UUCP (T. William Wells) Organization: None, Ft. 
Lauderdale Lines: 13 Summary: Expires: Sender: Followup-To: Distribution: Keywords: In article <9797@megaron.arizona.edu> rupley@arizona.edu (John Rupley) writes: : A Lex source for uncommenting is attached (which I hope does not belie : the remark above about hard to get the logic wrong :-). Try it on a very long comment. You might discover an overflowed lex buffer. On the other hand, this shouldn't be too hard to fix. Just do for the comment what you did for the noncommented text. --- Bill { uunet | novavax } !twwells!bill (BTW, I'm may be looking for a new job sometime in the next few months. If you know of a good one where I can be based in South Florida do send me e-mail.) Path: arizona!rupley From: rupley@arizona.edu (John Rupley) Newsgroups: comp.lang.c Subject: Re: Want a way to strip comments from a Summary: Lex script still fails (and crashes) Message-ID: <9888@megaron.arizona.edu> Date: 26 Mar 89 02:51:19 GMT References: <7150@siemens.UUCP> <9900010@bradley> <890@m10ux.UUCP> <893@m10ux.UUCP> Organization: U of Arizona CS Dept, Tucson Lines: 34 In article <893@m10ux.UUCP>, mnc@m10ux.UUCP (Michael Condict) writes: > Oops, the previous lex script I posted for deleting comments from > C source code is incorrect -- it doesn't recognize: /***...**/ > Here is a better one (simpler, too): > > %% > \"([^\\"]*\\(.|\n))*[^\\"]*\" ECHO; > "/*"([^*]|"*"+[^/*])*"*"*"*/" ; > . ECHO; You indeed fixed the /***/ error, but two errors remain. First, no handling of single-quoted double quotes: main() {printf("%c\n", '"');/*gotcha*/printf("%c\n", '"');} Second, your program crashes when uncommenting a real source file, with a sizeable change history or whatever inside a comment. You need at least one state change, so a comment can be matched line-by-line, and so not overflow a Lex buffer. Both previous Lex postings did it right. A third state, to handle quoted strings line-by-line, is perhaps optional, and the previous postings differ here. 
Apparently you missed the previous Lex postings, which I will be
happy to email you on request.

My argument, that it's difficult to make a logical error in coding
this problem in Lex, has now been demonstrated wrong (sob :-). But
at least Lex is still outscoring straight C (faint praise :-?).

John Rupley
rupley!local@megaron.arizona.edu

Path: arizona!noao!ncar!tank!mimsy!chris
From: chris@mimsy.UUCP (Chris Torek)
Newsgroups: comp.lang.c
Subject: Re: C comment stripper
Message-ID: <16539@mimsy.UUCP>
Date: 26 Mar 89 05:43:07 GMT
References: <1842@viper.Lynx.MN.Org> <9543@smoke.BRL.MIL> <1453@wpi.wpi.edu> <9864@megaron.arizona.edu>
Organization: U of Maryland, Dept. of Computer Science, Coll. Pk., MD 20742
Lines: 83

In article <9864@megaron.arizona.edu> rupley@arizona.edu (John Rupley) writes:
>Score, anyone? (recent postings tested on K&R-I-syntax code)
>
>    sed  1/1  correct
>    Lex  2/2  correct
>    C    2/2  wrong

This sounds like a CHALLENGE! :-)

I wrote the following working against the ten-minute spaghetti clock.
It is slightly tested, and probably works, with the exception of
#include (and unclosed comments, etc., in included files). It is more
permissive than real C (allowing newlines in string and character
constants, and allowing infinitely long character constants) but should
not get anything wrong that cpp gets right.

Of course, there are no comments in it. :-)

#include <stdio.h>

enum states { none, slash, quote, qquote, comment, cstar };

main()
{
    register int c, q = 0;
    register enum states state = none;

    while ((c = getchar()) != EOF) {
        switch (state) {
        case none:
            if (c == '"' || c == '\'') {
                state = quote;
                q = c;
            } else if (c == '/') {
                state = slash;
                continue;
            }
            break;
        case slash:
            if (c == '*') {
                state = comment;
                continue;
            }
            state = none;
            (void) putchar('/');
            break;
        case quote:
            if (c == '\\')
                state = qquote;
            else if (c == q)
                state = none;
            break;
        case qquote:
            state = quote;
            break;
        case comment:
            if (c == '*')
                state = cstar;
            continue;
        case cstar:
            if (c != '*')
                state = c == '/' ? none : comment;
            continue;
        default:
            fprintf(stderr, "impossible state %d\n", state);
            exit(1);
        }
        (void) putchar(c);
    }
    if (state != none)
        fprintf(stderr, "warning: file ended with unterminated %s\n",
            state == quote || state == qquote ?
                (q=='"' ? "string" : "character constant") :
                "comment");
    exit(0);
}
--
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain: chris@mimsy.umd.edu    Path: uunet!mimsy!chris

Path: arizona!rupley
From: rupley@arizona.edu (John Rupley)
Newsgroups: comp.lang.c
Subject: Re: C comment stripper
Summary: It's neat and works
Message-ID: <9893@megaron.arizona.edu>
Date: 26 Mar 89 20:59:35 GMT
References: <1842@viper.Lynx.MN.Org> <9543@smoke.BRL.MIL> <1453@wpi.wpi.edu> <16539@mimsy.UUCP>
Organization: U of Arizona CS Dept, Tucson
Lines: 38

> In article <16539@mimsy.UUCP>, chris@mimsy.UUCP (Chris Torek) writes:
> In article <9864@megaron.arizona.edu> rupley@arizona.edu (John Rupley) writes:
> >Score, anyone? (recent postings tested on K&R-I-syntax code)
> >
> >    sed  1/1  correct
> >    Lex  2/2  correct
> >    C    2/2  wrong
>
> This sounds like a CHALLENGE! :-)

Unclear again (sob :-). Meant it as a comment, implying the VIRTUES
of Lex (and even sed :-) for pattern matching.

Contest-wise, your C code is the first correct as initially posted,
it runs faster than the previous postings (after correction of the
latter), and one can follow the neat state-machine implementation
at first reading.

> I wrote the following working against the ten-minute spaghetti clock.

Wow! It took me longer to test it.

For what it's worth (as COMMENTARY -- please, no contest), counting
new postings, too:

    sed, awk  2/3  correct as first posted (test vs K&R-I-type code)
    Lex       2/3
    C         1/3

Hmmm. Conclusion? The probability of any particular piece of code
being correct is independent of language and is a toss-up (:-)?

But I still like Lex for this particular type of problem.
John Rupley rupley!local@megaron.arizona.edu Path: arizona!rupley From: rupley@arizona.edu (John Rupley) Newsgroups: comp.lang.c Subject: Re: Want a way to strip comments from a Summary: comments do not overflow yytext[] Message-ID: <9919@megaron.arizona.edu> Date: 28 Mar 89 10:04:16 GMT References: <7150@siemens.UUCP> <9900010@bradley> <4896@cbnews.ATT.COM> <795@twwells.uucp> Organization: U of Arizona CS Dept, Tucson Lines: 29 In article <795@twwells.uucp>, bill@twwells.uucp (T. William Wells) writes: > In article <9797@megaron.arizona.edu> rupley@arizona.edu (John Rupley) writes: > : A Lex source for uncommenting is attached (which I hope does not belie > : the remark above about hard to get the logic wrong :-). > > Try it on a very long comment. You might discover an overflowed lex > buffer. On the other hand, this shouldn't be too hard to fix. Just do > for the comment what you did for the noncommented text. Nope.... no problem.... comments are thrown away line-by-line, by design, so that very long comments indeed do not blow the buffer. A very long string, however, will overflow the buffer, but clearly this is understood, and it can be viewed as a feature, although idiosyncratic, as noted in <9888@megaron.arizona.edu>. If you want to handle strings differently, add another start condition (state) begun by '"' and make explicit start condition 0 = , or change the size of the match buffer (yytext[]) by including in the definitions: %{ #define YYLMAX 5000 /* or whatever */ %} John Rupley rupley!local@megaron.arizona.edu Path: arizona!noao!ncar!ames!mailrus!tut.cis.ohio-state.edu!rutgers!gatech!gt-cmmsr!auc!maw From: maw@auc.UUCP (Michael A. Walker) Newsgroups: comp.lang.c Subject: Re: Want a way to strip comments from a Summary: Another solution to the comment craze. 
Message-ID: <32248@auc.UUCP> Date: 30 Mar 89 15:08:19 GMT References: <7150@siemens.UUCP> <9900010@bradley> <4896@cbnews.ATT.COM> <9887@megaron.arizona.edu> Distribution: na Organization: Atlanta University, Atlanta, GA Lines: 58 In article <9887@megaron.arizona.edu>, rupley@arizona.edu (John Rupley) writes: > > > In article <620@gonzo.UUCP>, daveb@gonzo.UUCP (Dave Brower) writes: > > So, I offer this week's challenge: Smallest program that will take > > "blank line" style cpp output on stdin and send to stdout a scrunched > > version with appropriate #line directives. [f]lex, Yacc, [na]awk, sed, > > perl, c, c++ are all acceptable. This will be an amusing excercise in > > typical text massaging that can be enlightening for many people. > > "Scrunching" is probably a matter of taste, with regard to the format > of the ouput. I don't know what is ment by the term scrunching, but here is my entry to the problem of removing comments in a C program. YACCR (Yet Another C Comment Remover :-) is a crazy looking lex specification that removes C comments from a source file. It also does not put out a lot of extra blank lines that cpp does. I have tested on most styles of C comments that I have seen and it seems to work, but PLEASE no flames if it doesn't!!!! In an earlier message, someone address the problem of a yytext overflow. YACCR redefines the YYLMAX constant as 500, but you can test it with other values. To use: 1. Save message in file called yaccr.l and edit this file to unwanted text. 2. Type: lex yaccr.l 3. Type: cc lex.yy.c -ll -lyaccr It should then be ready to go. Good luck. ---mike EMAIL: ...!gatech!auc!rambro!maw --------------------------------cut here-------------------------- %{ /* ** Specification: YACCR ** Description : YACCR removes comments from C programs. */ #define CR 0x0d #ifdef YYLMAX #undef YYLMAX #define YYLMAX 500 #endif %} %% "/*""*"*("/*"*|[^*/]|[^*]"/"|"*"[^/])*"*"*"*/" putchar(CR); . 
printf("%s",yytext); --------------------------------cut here--------------------------
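
For anyone who wants to replay the pathological inputs from this
thread without lex, here is a minimal sketch of a comment-stripping
state machine that works string-to-string. It is an illustration
written for this purpose, not any poster's code: strip_comments and
the state names are hypothetical, and it assumes K&R-style comments
only, with no trigraphs and no handling of #include lines. A slash
that turns out not to open a comment is emitted and the following
character is then examined again as ordinary code, so a quote right
after a lone slash still opens a string.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum state { CODE, SLASH, QUOTED, ESCAPED, COMMENT, STAR };

/*
 * strip_comments: hypothetical name, for illustration only.
 * Returns a malloc'd copy of src with comments removed, or NULL
 * if malloc fails.  The caller frees the result.
 */
char *strip_comments(const char *src)
{
    enum state st = CODE;
    char quote = 0;             /* which quote character opened the constant */
    size_t n = 0;
    char *dst;

    assert(src != NULL);
    /* The output is never longer than the input. */
    dst = (char *)malloc(strlen(src) + 1);
    if (dst == NULL)
        return NULL;

    for (; *src != '\0'; src++) {
        char c = *src;

        switch (st) {
        case CODE:
            if (c == '"' || c == '\'') {
                st = QUOTED;
                quote = c;
            } else if (c == '/') {
                st = SLASH;     /* hold the slash until the next character */
                continue;
            }
            break;
        case SLASH:
            if (c == '*') {
                st = COMMENT;
                continue;
            }
            dst[n++] = '/';     /* lone slash: emit it ... */
            st = CODE;
            src--;              /* ... and look at c again as ordinary code */
            continue;
        case QUOTED:
            if (c == '\\')
                st = ESCAPED;
            else if (c == quote)
                st = CODE;
            break;
        case ESCAPED:
            st = QUOTED;        /* the escaped character is copied as-is */
            break;
        case COMMENT:
            if (c == '*')
                st = STAR;
            continue;
        case STAR:
            if (c == '/')
                st = CODE;
            else if (c != '*')
                st = COMMENT;   /* a further star keeps us in STAR */
            continue;
        }
        dst[n++] = c;
    }
    if (st == SLASH)
        dst[n++] = '/';         /* input ended on a held slash */
    dst[n] = '\0';
    return dst;
}
```

Inputs ending inside a comment or a string are simply truncated or
copied through; a real tool would probably want to warn about them.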