string manipulation - How to write a function to remove comments from a .m source file preserving formatting such as line wrapping reasonably?
This means if I have
foo;
(* comment 1 *)
bar[
baz (* comment 2 *)
(* comment 3 *)
];
I'd ideally end up getting
foo;
bar[
baz
];
Answer
Here is an alternative "first principles" approach that does not use string patterns as its main tool. Instead, it relies on the fact that comments have a simple structure and that comment delimiters lose their meaning only when they appear inside strings. We can therefore write a very simple parser that recognizes only strings and comments. Here is the tokenizer:
ClearAll[expr, parse, string, comments,tokenize, $commentPattern, $stringPattern];
tokenize[s_String] := StringSplit[s, t : "\\\"" | "\"" | "(*" | "*)" :> t];
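As a quick check of what the tokenizer produces, here is a small sketch on a made-up input (the sample string is my own illustration, not taken from the question). It assumes the definition of tokenize above has been evaluated:

```mathematica
(* hypothetical sample input, chosen only for illustration *)
sample = "x = 1; (* note *) y";

(* the rule t : ... :> t puts every matched delimiter back into the result *)
tokenize[sample]
```

This should return {"x = 1; ", "(*", " note ", "*)", " y"}: the comment delimiters come out as separate tokens and everything between them is left untouched.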
Here are auxiliary patterns we will need:
$stringPattern =
  PatternSequence["\"", middle___, "\""] /; ! MemberQ[{middle}, "\""];
$commentPattern =
  PatternSequence["(*", middle___, "*)"] /;
    Count[{middle}, "(*"] == Count[{middle}, "*)"];
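To illustrate how the balance condition in $commentPattern handles nesting, the pattern can be tried directly on hand-written token lists. This is only a sketch: the token lists are made up, and wrapping the pattern in a list for MatchQ is my own way of testing it, not part of the answer's code.

```mathematica
(* balanced: one inner opener and one inner closer *)
MatchQ[{"(*", " a ", "(*", " b ", "*)", " c ", "*)"}, {$commentPattern}]

(* unbalanced: the inner opener has no matching closer *)
MatchQ[{"(*", " a ", "(*", " b ", "*)"}, {$commentPattern}]
```

The first test should give True and the second False, because in the second case the tokens between the outer delimiters contain one "(*" but no "*)".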
Here is the parser:
parse[left___, s : $stringPattern, right___] :=
  expr[parse[left], string[s], parse[right]];
parse[left___, c : $commentPattern, right___] :=
  expr[parse[left], comments[c], parse[right]];
parse[{tokens___}] := parse[tokens];
parse[tokens___] := expr[tokens];
The heads expr, string and comments are inert heads.
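To make the resulting tree concrete, here is a sketch of what the parser returns for a short made-up input (the input string is hypothetical). It assumes tokenize, the two patterns and parse have been evaluated as above:

```mathematica
(* hypothetical sample input, chosen only for illustration *)
parse[tokenize["x = 1; (* note *) y"]]
```

This should evaluate to expr[expr["x = 1; "], comments["(*", " note ", "*)"], expr[" y"]]. The comment ends up wrapped in comments, which is what makes it easy to delete in the next step.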
Finally, here is the function to remove comments from a string of code:
ClearAll[removeComments];
removeComments[s_String] :=
  StringJoin[
    DeleteCases[parse[tokenize@s], _comments, Infinity] /.
      expr | string -> Sequence
  ]
Applying this to the initial string of code str as removeComments[str] returns the expected answer.
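For completeness, here is one way str could be set up. The string below is my transcription of the example from the question, so the exact whitespace is an assumption; the definitions above are assumed to be evaluated:

```mathematica
(* the question's example, written with explicit newlines (whitespace is assumed) *)
str = "foo;\n(* comment 1 *)\nbar[\nbaz (* comment 2 *)\n(* comment 3 *)\n];";

removeComments[str]
```

Note that only the comment tokens themselves are dropped. Whitespace and newlines around them are passed through unchanged, so if empty lines left behind by comment-only lines matter, a further cleanup pass over the result would be needed.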
This parser won't be particularly fast. The reason I like this approach is that it does not rely on external functionality such as specific forms of Import, so it will only be wrong if the principles are wrong (e.g. if I missed some other context in which comments can be escaped). I consider string manipulations rather fragile for parsing purposes in general. Interestingly, this seems to be one of the simplest problems I know of that illustrates that regexps are not sufficient to parse code representing recursive (nested) expressions or statements.
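The point about nesting can be seen on a small example. The sketch below compares removeComments with a naive non-greedy string pattern, which I am using here as a stand-in for a typical regexp attempt; the input string is made up, and the definitions above are assumed to be evaluated:

```mathematica
(* hypothetical input with one comment nested inside another *)
nested = "a (* x (* y *) z *) b";

(* naive approach: delete from each opener to the nearest closer *)
StringReplace[nested, "(*" ~~ Shortest[___] ~~ "*)" -> ""]

(* parser-based approach: openers and closers inside the comment must balance *)
removeComments[nested]
```

The naive pattern stops at the first "*)" and should leave "a  z *) b", with part of the outer comment still in place, while removeComments should return "a  b".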