In the interpreter for my experimental programming language I have a symbol table. Each symbol consists of a name and a value (the value can be of type string, int, function, and so on).
At first I represented the table with a vector and iterated through the symbols, checking whether the given symbol name matched.
Then I thought that using a map, in my case map<string,symbol>, would be better than iterating through the vector all the time, but:
It's a bit hard to explain this part, but I'll try.
When a variable is retrieved for the first time in a program in my language, its position in the symbol table of course has to be found (using a vector now). If I iterated through the vector every time the line gets executed (think of a loop), it would be terribly slow (as it currently is: nearly as slow as Microsoft's batch).
So I could use a map to retrieve the variable: SymbolTable[ myVar.Name ]
But think of the following: if the variable, still using a vector, is found the first time, I can store its exact integer position in the vector with it. That means that the next time it is needed, my interpreter knows that it has been "cached" and doesn't search the symbol table for it but does something like SymbolTable.at( myVar.CachedPosition ).
Now my (rather hard?) question:
Should I use a vector for the symbol table together with caching the position of the variable in the vector?
Should I rather use a map? Why? How fast is the [] operator?
Should I use something completely different?
A map is a good thing to use for a symbol table, but operator[] for maps is not. In general, unless you are writing some trivial code, you should use the map's member functions insert() and find() instead of operator[]. The semantics of operator[] are somewhat complicated, and almost certainly don't do what you want if the symbol you are looking for is not in the map.
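To illustrate the point about operator[], here is a minimal sketch of lookup and definition built only on find() and insert(). The Symbol struct and the lookup and define helper names are assumptions made for the example, not part of the question's code.

```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for the interpreter's symbol type.
struct Symbol {
    int value;
};

using SymbolTable = std::map<std::string, Symbol>;

// Returns a pointer to the symbol, or nullptr if the name is unknown.
// Unlike operator[], find() never adds an entry to the table.
inline Symbol* lookup(SymbolTable& table, const std::string& name) {
    auto it = table.find(name);
    return it == table.end() ? nullptr : &it->second;
}

// Defines a new symbol. Returns false and leaves the table unchanged
// if the name is already defined.
inline bool define(SymbolTable& table, const std::string& name, int value) {
    return table.insert(std::make_pair(name, Symbol{value})).second;
}
```

With operator[], looking up an undefined name would silently create a default-constructed entry; with the helpers above, a failed lookup leaves the table empty and the interpreter can report an "undefined variable" error.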
As for the choice between map and unordered_map, the difference in performance is highly unlikely to be significant when implementing a simple interpretive language. If you use map, you are guaranteed it will be supported by all current Standard C++ implementations.
You effectively have a number of alternatives.
Libraries exist:
- Loki::AssocVector: the interface of a map implemented over a vector of pairs, faster than a map for small or frozen sets because of cache locality.
- Boost.MultiIndex: provides both a List with fast lookup and an example of implementing an MRU List (Most Recently Used), which caches the last accessed elements.
Critique:
- Map lookup and retrieval take O(log N), but the items may be scattered throughout memory, thus not playing well with caching strategies.
- Vectors are more cache friendly; however, unless you sort the vector you'll have O(N) performance on find. Is that acceptable?
- Why not use an unordered_map? It provides O(1) lookup and retrieval on average (though the constant may be high) and is certainly suited to this task. If you have a look at Wikipedia's article on Hash Tables you'll realize that there are many strategies available, and you can certainly pick one that will suit your particular usage pattern.
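The sorted-vector option from the list above can be sketched roughly as follows. This is not Loki's actual AssocVector, only an illustration of the same idea; the FlatSymbolTable class, its member names, and the int value type are assumptions made for the example.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Hypothetical flat symbol table: a vector of (name, value) pairs
// that is kept sorted by name at all times.
class FlatSymbolTable {
public:
    using Entry = std::pair<std::string, int>;

    // Inserts or overwrites a symbol. O(N) in the worst case,
    // because later entries may have to shift to keep the order.
    void put(const std::string& name, int value) {
        auto it = std::lower_bound(entries_.begin(), entries_.end(), name, byName);
        if (it != entries_.end() && it->first == name) {
            it->second = value;
        } else {
            entries_.insert(it, Entry(name, value));
        }
    }

    // Binary search, O(log N). Returns nullptr when the name is unknown.
    const int* find(const std::string& name) const {
        auto it = std::lower_bound(entries_.begin(), entries_.end(), name, byName);
        return (it != entries_.end() && it->first == name) ? &it->second : nullptr;
    }

private:
    // Comparison used by lower_bound: entry name versus searched name.
    static bool byName(const Entry& entry, const std::string& name) {
        return entry.first < name;
    }

    std::vector<Entry> entries_;
};
```

Lookups are O(log N) over contiguous memory, while insertions pay for keeping the order. Whether that trade-off beats a map or a hash table for a given interpreter is something only profiling can tell.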
Normally you'd use a symbol table to look up the variable given its name as it appears in the source. In this case, you only have the name to work with, so there's nowhere to store the cached position of the variable in the symbol table. So I'd say a map is a good choice. The [] operator takes time proportional to the log of the number of elements in the map - if it turns out to be slow, you could use a hash map like std::tr1::unordered_map.
std::map's operator[] takes O(log(n)) time. This means that it is quite efficient, but you should still avoid doing the lookups over and over again. Instead of storing an index, perhaps you can store a reference to the value, or an iterator into the container? This avoids having to do the lookup at all.
When most interpreters interpret code, they compile it into an intermediate language first. These intermediate languages often refer to variables by index or by pointer, instead of by name.
For example, Python (the C implementation) changes local variables into references by index, but global variables and class variables get referenced by name using a hash table.
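A minimal sketch of the "refer to variables by index" idea, assuming a single flat scope: names are resolved to slot numbers once, while the program is translated, and execution then works with plain vector indices. SlotResolver and Frame are hypothetical names for this example and do not reflect how CPython is actually implemented.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical resolver used while translating the program:
// each distinct variable name is assigned a fixed slot index once.
class SlotResolver {
public:
    // Returns the slot for `name`, allocating a new one on first sight.
    std::size_t slotFor(const std::string& name) {
        auto it = slots_.find(name);
        if (it != slots_.end()) {
            return it->second;
        }
        std::size_t slot = slots_.size();
        slots_.emplace(name, slot);
        return slot;
    }

    std::size_t slotCount() const { return slots_.size(); }

private:
    std::unordered_map<std::string, std::size_t> slots_;
};

// Run-time storage: one value per slot, accessed by index with no
// string comparison or hashing involved.
struct Frame {
    explicit Frame(std::size_t slotCount) : values(slotCount, 0) {}
    std::vector<int> values;
};
```

The intermediate code would then carry the slot number instead of the name, so a loop body that touches a variable a million times performs the name lookup only once, during translation.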
I suggest looking at an introductory text on compilers.
A std::map (O(log(n))) or a hashtable ("amortized" O(1)) would be the first choice - use custom mechanisms only if you determine it's a bottleneck. Generally, using a hash or tokenizing the input is the first optimization.
Before you have profiled it, it's most important that you isolate lookup, so you can easily replace and profile it.
std::map is likely a tad slower for a small number of elements (but then, it doesn't really matter).
Map is O(log N), so not as fast as positional lookup in an array. But the exact results will depend on a lot of factors, so the best approach is to interface with the container in a way that allows you to swap between implementations later on. That is, write a "lookup" function that can be efficiently implemented by any suitable container, so that you can switch and compare the speeds of different implementations.
Map's operator[] is O(log(n)), see Wikipedia: http://en.wikipedia.org/wiki/Map_(C%2B%2B)
I think as you're looking often for symbols, using a map is certainly right. Maybe a hash map (std::unordered_map) could make your performance better.
If you're going to use a vector and go to the trouble of caching the most recent symbol look up result, you could do the same (cache the most recent look-up result) if your symbol table were implemented as a map (but there probably wouldn't be a whole lot of benefit to the cache in the case of using a map). With a map you'd have the additional advantage that any non-cached symbol look ups would be much more performant than searching in a vector (assuming that the vector isn't sorted - and keeping a vector sorted can be expensive if you have to do the sort more than once).
Take Neil's advice; map is generally a good data structure for a symbol table, but you need to make sure you're using it correctly (and not adding symbols accidentally).
You say: "If the variable, still using vector, is found the first time, I can store its exact integer position in the vector with it.".
You can do the same with the map: search the variable using find and store the iterator pointing to it instead of the position.
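A rough sketch of caching the iterator in the variable node, assuming the interpreter has some node type for variable references. VarRef and resolve are hypothetical names, and the value type is simplified to int.

```cpp
#include <map>
#include <string>

// Value type simplified to int for the example.
using SymbolTable = std::map<std::string, int>;

// Hypothetical AST node for a variable reference.
struct VarRef {
    std::string name;
    SymbolTable::iterator cached;  // only meaningful when resolved is true
    bool resolved;

    explicit VarRef(const std::string& n) : name(n), cached(), resolved(false) {}
};

// Returns a pointer to the variable's value, or nullptr if it is undefined.
// The map is searched only until the first successful lookup for this node;
// after that the cached iterator is reused.
inline int* resolve(VarRef& ref, SymbolTable& table) {
    if (!ref.resolved) {
        auto it = table.find(ref.name);
        if (it == table.end()) {
            return nullptr;
        }
        ref.cached = it;
        ref.resolved = true;
    }
    return &ref.cached->second;
}
```

This works because std::map iterators stay valid when other elements are inserted or erased. If the interpreter can remove the symbol itself (for example when a scope ends), the cached iterator must be reset at that point.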
For looking up values by a string key, the map data type is the appropriate one, as mentioned by other users.
STL map implementations are usually built on self-balancing trees, such as the red-black tree data structure, and their operations take O(log n) time.
My advice is to wrap the table manipulation code in functions, like table_has(name), table_put(name) and table_get(name).
That way you can change the inner symbol table representation easily if you experience slow run time performance, plus you can embed cache functionality in those routines later.
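One possible shape for such wrappers is sketched below. The Table struct, the explicit table and value parameters, and the choice to throw on an undefined symbol are assumptions for the example; only the three function names come from the advice above.

```cpp
#include <map>
#include <stdexcept>
#include <string>

// The representation is hidden behind this struct, so it can later be
// swapped for an unordered_map, a sorted vector, or something else.
struct Table {
    std::map<std::string, int> entries;  // value type simplified to int
};

// True if the symbol is defined. Never modifies the table.
inline bool table_has(const Table& table, const std::string& name) {
    return table.entries.find(name) != table.entries.end();
}

// Defines the symbol or overwrites its value. operator[] is acceptable
// here because creating the entry is exactly what is intended.
inline void table_put(Table& table, const std::string& name, int value) {
    table.entries[name] = value;
}

// Returns the symbol's value; throws std::out_of_range if it is undefined.
inline int table_get(const Table& table, const std::string& name) {
    auto it = table.entries.find(name);
    if (it == table.entries.end()) {
        throw std::out_of_range("undefined symbol: " + name);
    }
    return it->second;
}
```

Because the rest of the interpreter only calls these three functions, replacing std::map with another container means changing this one place, which also makes it easy to profile the alternatives against each other.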
A map will scale much better, which will be an important feature. However, don't forget that when using a map, you can (unlike a vector) take pointers and references. In this case, you could easily "cache" variables with a map just as validly as a vector. A map is almost certainly the right choice here.