Can anyone give me some idea of how to extract information from a given C++ or Java program (source code)? The information may be the names of classes or methods, inheritance relations, the class hierarchy, etc. The extractor itself has to be written in C++ or Java. I have tried and was able to do it, but the result is not totally correct. Right now I read the given program line by line and check for the "class" keyword; if I find it, I take the word right after it as the name of that class (to extract class names). Are there any built-in libraries in C or Java that can do this work more efficiently? Please suggest some simple ideas (not external libraries or plugins).
If all you want is the names of classes and methods within classes, you can rig a set of regular expressions to pick off various tokens (identifiers, "{", "}", operator, number, string), and a crummy parser (called an "island parser") to recognize the sequence of tokens that make up class declarations and method declarations. (Hint: for Java and C++, make sure you somehow match corresponding { ... }.)
This stunt works for classes and methods because in essence this is how real compilers work: they break the input stream into tokens (usually using the compiler-generalization of regexps called "lexer generators"), and then use a parser to determine the actual code structure, and classes and methods are pretty easy to spot in the syntax. (This solution is a kind of clean version of what OP posted.)
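A minimal sketch of such an island parser in plain Java (standard library only). Everything named here (IslandParser, tokenize, findDeclarations, the skipRest flag) is made up for illustration, and the heuristics are assumptions, not a complete grammar: a class is "class/struct/interface NAME" that reaches a "{", and a method is the first "identifier (" seen in a member at class-body depth. Constructors are reported as methods, and C++ oddities such as macros or "noexcept(...)" can still fool it.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class IslandParser {
    // One alternative per token kind. Comments and literals come first so that
    // keywords or braces inside them never reach the parser.
    private static final Pattern TOKEN = Pattern.compile(
        "//[^\\n]*"                      // line comment
        + "|/\\*[\\s\\S]*?\\*/"          // block comment
        + "|\"(?:\\\\.|[^\"\\\\])*\""    // string literal
        + "|'(?:\\\\.|[^'\\\\])*'"       // character literal
        + "|[A-Za-z_][A-Za-z_0-9]*"      // identifier or keyword
        + "|\\S");                       // any other single character

    // If one of these follows "class NAME", it is a forward declaration,
    // a template parameter or similar, not a class definition.
    private static final Set<String> NOT_A_DEFINITION = Set.of(";", ",", ">", ")", "=");

    static List<String> tokenize(String source) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(source);
        while (m.find()) {
            String t = m.group();
            if (t.startsWith("//") || t.startsWith("/*")) continue; // drop comments
            tokens.add(t);
        }
        return tokens;
    }

    private static boolean isIdentifier(String t) {
        return t.matches("[A-Za-z_][A-Za-z_0-9]*");
    }

    static List<String> findDeclarations(String source) {
        List<String> tokens = tokenize(source);
        List<String> found = new ArrayList<>();
        Deque<String> classNames = new ArrayDeque<>();   // currently open classes
        Deque<Integer> classDepths = new ArrayDeque<>(); // brace depth of each class body
        String pendingClass = null; // class name seen, "{" not yet reached
        boolean skipRest = false;   // ignore "name(" until the member ends
        int depth = 0;

        for (int i = 0; i < tokens.size(); i++) {
            String t = tokens.get(i);
            String prev = i > 0 ? tokens.get(i - 1) : "";
            String next = i + 1 < tokens.size() ? tokens.get(i + 1) : "";
            String afterNext = i + 2 < tokens.size() ? tokens.get(i + 2) : "";

            if ((t.equals("class") || t.equals("struct") || t.equals("interface"))
                    && isIdentifier(next) && !NOT_A_DEFINITION.contains(afterNext)) {
                pendingClass = next;
                i++; // skip the name itself
            } else if (t.equals("{")) {
                depth++;
                skipRest = false;
                if (pendingClass != null) { // this brace opens the class body
                    classNames.push(pendingClass);
                    classDepths.push(depth);
                    found.add("class " + pendingClass);
                    pendingClass = null;
                }
            } else if (t.equals("}")) {
                if (!classDepths.isEmpty() && classDepths.peek() == depth) {
                    classNames.pop(); // matching brace closes the class body
                    classDepths.pop();
                }
                depth--;
                skipRest = false;
            } else if (t.equals(";")) {
                pendingClass = null;
                skipRest = false;
            } else if (t.equals("=")) {
                skipRest = true; // calls inside a field initializer are not methods
            } else if (isIdentifier(t) && next.equals("(") && !prev.equals("@") && !skipRest
                    && !classDepths.isEmpty() && classDepths.peek() == depth) {
                found.add("method " + classNames.peek() + "." + t);
                skipRest = true; // at most one method name per member
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String src = "class Shape {\n  double area() { return helper(2); }\n}\n";
        findDeclarations(src).forEach(System.out::println);
    }
}
```

The brace counter is what the hint about matching { ... } buys you: a call like helper(2) sits one level deeper than the class body, so it is not mistaken for a method declaration.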
If you want any other information from Java or C++ source code (e.g., types of method arguments, etc.), you probably need a tool that actually parses the languages and builds symbol tables, so you have a chance of knowing what the identifiers found in various locations mean.
(EDIT: OP indicated he wants to find out what function calls what other function. He can't do this sensibly without a full language front end (parser + symbol table as a minimum).)
You can get various tools to parse C++ (GCC, Clang, Elsa, ...) and various other tools to parse Java (ANTLR, javacc, ...). You will find that GCC is pretty hard to bend to general tasks, Clang and Elsa less problematic. ANTLR and Javacc will parse Java code but don't AFAIK build symbol tables, so they fall a little flat for general purpose tasks. What you will find is that dealing with a C++ tool will turn out to be completely different than dealing with a Java tool since none of these tools have any common compiler infrastructure.
How you extract class and method names from each of these will vary in detail, but most of them offer some kind of way to climb over a parse tree (and you code some ad hoc match for what you want to find, e.g., class declaration syntax) and/or navigate symbol tables (and spit out symbols marked as "class" or "method" names). Finding the right syntax requires you to know the structure of the tree in intimate detail and to code lots of tests that match the proper tree structures.
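To make "climbing over a parse tree" concrete for Java, here is a sketch that uses the Compiler Tree API shipped with the JDK (javax.tools plus com.sun.source). This is a swapped-in tool, not one of those named above, and it assumes the program runs on a full JDK rather than a bare JRE. The names TreeWalkDemo, StringSource and declarations are made up for the example. It only parses; no symbol tables are built.

```java
import com.sun.source.tree.ClassTree;
import com.sun.source.tree.CompilationUnitTree;
import com.sun.source.tree.MethodTree;
import com.sun.source.util.JavacTask;
import com.sun.source.util.TreeScanner;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import javax.tools.JavaCompiler;
import javax.tools.SimpleJavaFileObject;
import javax.tools.ToolProvider;

final class TreeWalkDemo {
    // Hypothetical helper: presents a String to javac as if it were a source file.
    static final class StringSource extends SimpleJavaFileObject {
        private final String code;

        StringSource(String name, String code) {
            super(URI.create("string:///" + name + ".java"), Kind.SOURCE);
            this.code = code;
        }

        @Override
        public CharSequence getCharContent(boolean ignoreEncodingErrors) {
            return code;
        }
    }

    static List<String> declarations(String code) {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            throw new IllegalStateException("No system compiler; run on a JDK");
        }
        JavacTask task = (JavacTask) compiler.getTask(
            null, null, null, null, null, List.of(new StringSource("Input", code)));

        List<String> found = new ArrayList<>();
        // The scanner visits every node; we only override the two we care about.
        TreeScanner<Void, Void> scanner = new TreeScanner<Void, Void>() {
            @Override
            public Void visitClass(ClassTree node, Void unused) {
                String line = "class " + node.getSimpleName();
                if (node.getExtendsClause() != null) {
                    line += " extends " + node.getExtendsClause();
                }
                found.add(line);
                return super.visitClass(node, unused); // descend into members
            }

            @Override
            public Void visitMethod(MethodTree node, Void unused) {
                found.add("method " + node.getName());
                return super.visitMethod(node, unused);
            }
        };

        try {
            for (CompilationUnitTree unit : task.parse()) { // parse only, no analysis
                scanner.scan(unit, null);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return found;
    }

    public static void main(String[] args) {
        String code = "class Shape { double area() { return 0; } }\n"
            + "class Circle extends Shape { double radius() { return 1; } }\n";
        declarations(code).forEach(System.out::println);
    }
}
```

Note how the matching is still ad hoc: you have to know that classes arrive as ClassTree nodes and methods as MethodTree nodes, which is the kind of tree knowledge described above.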
If you really want to process both languages, and use a single infrastructure to do it, you could consider our DMS Software Reengineering Toolkit. DMS is language agnostic but can be tuned to arbitrary languages, and then parse those languages, build abstract symbol tables and various kinds of flow analysis. DMS has both a full C++ front end (with a built-in preprocessor and handling C++ in its various forms including the new standard C++11) and a Java front end handling all dialects of Java up through 1.6 (with 1.7 happening momentarily).
To do OP's (originally stated) task of finding classes and methods, you'd tell DMS to parse the file and then climb over trees or symbol tables, much as for the other tools. You can code an ad hoc tree matcher in DMS, but it is easier to write patterns:
pattern match_class_declaration(i: identifier, b: statements): class_declaration
= " class \i { \b } ";
can be used with DMS to match those trees that happen to be class declarations, and will return "i" (and "b", which we don't care about) bound to the corresponding subtrees. "i" of course contains the class name you want. Other patterns can be used to recognize other constructs, such as class names that inherit, or implement interfaces, or methods that return some type or methods that return void. The point is you don't have to know the tree structure in any great detail to use such patterns.
To go further, as OP seems to want to do (e.g build caller/callee information), you'd need to construct control flow graphs, do points-to analysis, etc. DMS provides support for that.
The good news is one infrastructure handles both languages; you can even mix C++ and Java in DMS without it getting anything confused. The more difficult news is that DMS is a fairly complex beast, but that's because it has to handle all the complexities of C++ and Java (as well as many other languages). Still beats working with two different language parsers with two radically different implementations and thus two complete sets of learning curves.
The question sounds too vague to answer. Please elaborate.
From what I could gauge: use Reflection when you are working with Java classes to figure out almost everything about a class and its methods. There are other (static) APIs that you could use on the Class object (if you have one to hand). Refer to the Javadocs for more.
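A small sketch of what Reflection gives you, using only java.lang.reflect. Keep in mind that it works on compiled classes loaded into the JVM, not on source text, so the code to inspect has to be compiled first. ReflectionDemo, Sample and the two helper methods are made-up names for the example.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class ReflectionDemo {
    // Tiny class to inspect; stands in for whatever class you have loaded.
    static class Sample {
        void alpha() { }
        int beta(int x) { return x; }
    }

    // Walks getSuperclass() upward until it runs out, i.e. the inheritance chain.
    static List<String> superclassChain(Class<?> type) {
        List<String> chain = new ArrayList<>();
        for (Class<?> c = type; c != null; c = c.getSuperclass()) {
            chain.add(c.getName());
        }
        return chain;
    }

    // Names of the methods declared directly in the class, sorted because
    // getDeclaredMethods() returns them in no particular order.
    static List<String> declaredMethodNames(Class<?> type) {
        List<String> names = new ArrayList<>();
        for (Method m : type.getDeclaredMethods()) {
            if (!m.isSynthetic()) { // skip compiler-generated helpers
                names.add(m.getName());
            }
        }
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) {
        System.out.println(superclassChain(java.util.ArrayList.class));
        System.out.println(declaredMethodNames(Sample.class));
    }
}
```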
You could try to use some source from compilers, like gcc. They already have all the syntax parsing and preprocessing there, so you could save tons of time.
For compiled Java you could also use bytecode manipulation libraries (like asm).
As you're trying to parse a text file, a shell script based on awk and/or sed would be sufficient. You'll have to define some simple regular expressions based on the languages' keywords and syntax to extract the information you need.
For instance, this regular expression would match most of the class declarations of a C++ source file:
class *([A-Za-z_][A-Za-z_0-9]*) *\{?$
The parentheses allow you to extract the identifier you're looking for; this is called a capturing group.
If you really want to do it in C/C++/Java, you'll have to find a library that provides regular expression facilities (the Java standard library already provides some). Maybe Boost Regex for a C++ program.
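For the Java route, a sketch with java.util.regex from the standard library, applying the same regular expression to a whole file's text. ClassNameRegex and classNames are made-up names; the MULTILINE flag is an addition so that "$" matches at the end of each line instead of only at the end of the input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ClassNameRegex {
    // Same expression as above; the backslash is doubled for the Java string.
    private static final Pattern CLASS_DECL =
        Pattern.compile("class *([A-Za-z_][A-Za-z_0-9]*) *\\{?$", Pattern.MULTILINE);

    static List<String> classNames(String source) {
        List<String> names = new ArrayList<>();
        Matcher m = CLASS_DECL.matcher(source);
        while (m.find()) {
            names.add(m.group(1)); // group 1 is the capturing group: the identifier
        }
        return names;
    }

    public static void main(String[] args) {
        String source = "class Foo {\nclass Bar\n  int x;\nclass Baz : public Foo {\n";
        System.out.println(classNames(source));
    }
}
```

On this input the derived class Baz is not reported, because the expression expects the line to end right after the name or the opening brace. That is the "most" in "most of the class declarations": inheritance lists, comments and string literals all need extra handling.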
Here's an example building up how to parse a C++ file using the clang (llvm) libraries. It's long and pretty detailed, but you should be able to adapt it to do what you need (for C and C++ anyway; I don't know if llvm is any good at handling Java, or if it's easy to adapt that approach for Java).
Not sure about current Java, but C++ is a true nightmare to parse if you want to fully extract semantic information (consider that it took YEARS for the industry to agree 100% on how and whether certain constructs should be parsed).
Note that while the class name in C++ is easy enough (just remember, however, that the word class or struct can also be present before a template parameter instead of typename, that you can have "nested classes" and that you can have class "forward declarations"), for members things are much harder, because the member name comes after the type, and even understanding what is a type, where the type ends or what the member name is is not trivial... consider
int (*foo)(int x, int y);
Node<Bar, Baz, Allocator<Foo, &Q::operator > >, 12> (*rex)(int);
in the first case the member name is foo, and in the second case the member name is rex (note that I'm not sure if the second example is valid C++ code or, supposing it's valid, if common C++ compilers would accept it).
Note that even just understanding where the class member list begins after the class name is not trivial (you have to skip the inheritance list that can include templated classes with parameters that are generic types).
So, giving up on regular expressions (which clearly cannot parse a type, a type being a complex recursive entity), the only solution is to use code written by someone else.
For this job (for C++) you can try for example GCC-XML that has been written exactly for this reason (it generates an XML result from parsing C++ source code).