How are arrays and hash maps constant time in their access?

Specifically: given a hash (or an array index), how does the machine get to the data in constant time?

It seems to me that even passing by all the other memory locations (or whatever) would take an amount of time equal to the number of locations passed (so linear time). A coworker tried valiantly to explain this to me but had to give up when we got down to circuits.

Example:

my_array = Array.new(20)
my_array[19] = "foo"
my_array[19]  # => "foo"

Access of "foo" in position 20 is constant because we know which bucket "foo" is in. How did we magically get to that bucket without passing all the others on the way? To get to house #20 on a block you would still have to pass by the other 19...


How did we magically get to that bucket without passing all the others on the way?

"We" don't "go" to the bucket at all. The way RAM physically works is more like broadcasting the bucket's number on a channel on which all buckets listen, and the one whose number was called will send you its contents.

Calculations happen in the CPU. In theory, the CPU is the same "distance" from all memory locations (in practice it's not, because of caching, which can have a huge impact on performance).
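
That caching caveat is easy to observe with a rough C sketch (the array size, the stride, and the 64-byte cache line it assumes are just typical-hardware guesses): both loops read the same number of elements, and each individual read is still a constant-time operation, but the strided loop touches a new cache line on every read and usually runs noticeably slower.

/* Both loops perform N / STRIDE reads; only the memory access pattern differs. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)    /* 16M ints (64 MB), much larger than the CPU caches */
#define STRIDE 16      /* 16 * sizeof(int) = 64 bytes: one cache line per read */

int main(void) {
    int *a = malloc((size_t)N * sizeof *a);
    if (!a) return 1;
    for (int i = 0; i < N; i++) a[i] = i;

    clock_t t0 = clock();
    long long seq = 0;
    for (int i = 0; i < N / STRIDE; i++) seq += a[i];     /* sequential reads */
    clock_t t1 = clock();
    long long strided = 0;
    for (int i = 0; i < N; i += STRIDE) strided += a[i];  /* one cache line per read */
    clock_t t2 = clock();

    printf("sequential: %.3f s, strided: %.3f s (sums %lld %lld)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, seq, strided);
    free(a);
    return 0;
}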

If you want the gritty details, read "What Every Programmer Should Know About Memory".


To understand this, you have to look at how memory is organized and accessed, and in particular at how an address decoder works. The point is that you do NOT have to pass by all the other addresses to get to the one you want in memory; you can jump directly to it. Otherwise our computers would be really, really slow.


Unlike a Turing machine, which would have to access memory sequentially, computers use random access memory, or RAM, which means that if they know where the array starts and which element they want (say the 20th), they know exactly what part of memory to look at.

It is less like driving down a street and more like picking the correct mail slot for your apartment in a shared mailbox.
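
To make the contrast concrete, here is a small C sketch (the linked list is purely illustrative, a stand-in for "driving down the street"): reaching the 20th element of an array is a single address calculation, while reaching the 20th node of a linked list really does mean walking past the other 19.

#include <stdio.h>
#include <stdlib.h>

struct node { int value; struct node *next; };

int main(void) {
    int arr[20];
    struct node *head = NULL;

    /* Fill the array and build a 20-node list (node i holds value i). */
    for (int i = 19; i >= 0; i--) {
        arr[i] = i;
        struct node *n = malloc(sizeof *n);
        n->value = i;
        n->next = head;
        head = n;
    }

    /* Array: the 20th element is at base address + 19 * sizeof(int). */
    printf("arr[19] = %d\n", arr[19]);

    /* Linked list: we must pass the other 19 nodes to reach it. */
    struct node *p = head;
    for (int i = 0; i < 19; i++) p = p->next;
    printf("20th list node = %d\n", p->value);

    while (head) { struct node *next = head->next; free(head); head = next; }
    return 0;
}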


Two things are important:

  1. my_array stores the memory address of the start of the array, so the computer knows exactly where to jump.
  2. index * sizeof(type) gives the offset from the beginning of the array.

Adding the offset (2) to the base address (1) gives the location of the data, so it can be found in O(1).
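
A minimal C sketch of those two facts (the element type and index are arbitrary examples): the element's address is just the base address plus index * sizeof(type), so no other element is ever visited.

#include <stdio.h>

int main(void) {
    double my_array[20];
    int index = 13;

    /* 1. my_array decays to the address where the array starts.       */
    /* 2. index * sizeof(double) is the offset from that base address. */
    double *by_offset = (double *)((char *)my_array + index * sizeof(double));
    double *by_index  = &my_array[index];

    printf("base %p, base + offset %p, &my_array[13] %p\n",
           (void *)my_array, (void *)by_offset, (void *)by_index);
    /* The last two addresses are identical: one multiply and one add, O(1). */
    return 0;
}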


Big O doesn't work like that. It measures how much computational work a particular algorithm does as its input grows, not how much memory it touches, and in any case indexing into an array doesn't involve traversing that memory: if I need to find the second slot of an array, it's just a matter of adding an offset to a pointer, which is constant time. If instead I have a tree structure and want to find a particular node, I'm now talking about O(log n), because each step only narrows the search rather than landing on the node directly; on average it takes O(log n) steps to find that node.
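
A rough C sketch of that contrast (the tiny hand-built tree is purely illustrative): the array slot is one pointer-plus-offset step, while finding a key in a balanced binary search tree takes roughly log2(n) steps because each comparison only narrows the search.

#include <stdio.h>
#include <stddef.h>

struct tree { int key; struct tree *left, *right; };

/* Walk down the tree, counting how many nodes we pass before the match. */
const struct tree *find(const struct tree *t, int key, int *steps) {
    while (t && t->key != key) {
        (*steps)++;
        t = key < t->key ? t->left : t->right;
    }
    return t;
}

int main(void) {
    int slots[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    printf("slots[1] = %d (one offset calculation, O(1))\n", slots[1]);

    struct tree n10 = {10, NULL, NULL}, n30 = {30, NULL, NULL};
    struct tree n50 = {50, NULL, NULL}, n70 = {70, NULL, NULL};
    struct tree n20 = {20, &n10, &n30}, n60 = {60, &n50, &n70};
    struct tree root = {40, &n20, &n60};

    int steps = 0;
    const struct tree *hit = find(&root, 70, &steps);
    printf("found %d after %d steps down the tree (O(log n))\n",
           hit ? hit->key : -1, steps);
    return 0;
}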


Let's discuss this in C/C++ terms; there's some additional stuff to know about C# arrays but it's not really relevant to the point.

Given an array of 16-bit integer values:

short myArray[5] = {1, 2, 3, 4, 5};

What's really happened is that the computer has allocated a block of space in memory. This memory block is reserved for that array, is exactly the size needed to hold the full array (in our case 16*5 == 80 bits == 10 bytes), and is contiguous. These facts are givens; if any of them fails to hold in your environment at any given time, you're generally at risk of your program crashing due to an access violation.

So, given this structure, what the variable myArray really is, behind the scenes, is the memory address of the start of the block of memory. This is also, conveniently, the start of the first element. Each additional element is lined up in memory right after the first, in order. The memory block allocated for myArray might look like this:

00000000000000010000000000000010000000000000001100000000000001000000000000000101
^               ^               ^               ^               ^
myArray[0]      myArray[1]      myArray[2]      myArray[3]      myArray[4]

It is considered a constant-time operation to access a memory address and read a constant number of bytes. As the figure above shows, you can get the memory address of any element if you know three things: the start of the memory block, the memory size of each element, and the index of the element you want. So, when you ask for myArray[3] in your code, that request is turned into a memory address by the following equation:

&myArray[3] == (char *)myArray + sizeof(short) * 3;

Thus, with a constant-time calculation you have found the memory address of the fourth element (index 3), and with another constant-time operation (or at least one considered constant; actual access cost is a hardware detail, and fast enough that you shouldn't care) you can read that memory. This is, if you've ever wondered, why indexes of collections in most C-style languages start at zero: the first element of the array starts at the location of the array itself, with no offset (sizeof(anything) * 0 == 0).
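
For what it's worth, the equation above can be checked directly. A quick sketch in plain C using the same five-element short array: the address of myArray[3] and the base address plus sizeof(short) * 3 come out identical.

#include <stdio.h>

int main(void) {
    short myArray[5] = {1, 2, 3, 4, 5};

    void *direct   = &myArray[3];                          /* where myArray[3] lives   */
    void *computed = (char *)myArray + sizeof(short) * 3;  /* base + element size * 3  */

    printf("&myArray[3] = %p, base + 3 * sizeof(short) = %p, value = %d\n",
           direct, computed, myArray[3]);
    /* Both pointers print the same address; reading it yields 4. */
    return 0;
}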

In C#, there are two notable differences. C# arrays have some header information that is of use to the CLR. The header comes first in the memory block, and the size of this header is constant and known, so the addressing equation has just one key difference:

&myArray[3] == (char *)myArray + headerSize + sizeof(short) * 3;

C# doesn't allow you to directly reference memory in its managed environment, but the runtime itself will use something like this to perform memory access off the heap.

The second thing, which is common to most flavors of C/C++ as well, is that certain types are always dealt with "by reference". Anything you have to use the new keyword to create is a reference type (and some objects, like strings, are also reference types even though they look like value types in code). A reference type, when instantiated, is placed in memory, doesn't move, and is usually not copied. Any variable that represents that object is therefore, behind the scenes, just the memory address of the object.

Arrays are reference types (remember that myArray was just a memory address). Arrays of reference types are arrays of these memory addresses, so accessing an object that is an element in such an array is a two-step process: first you calculate the memory address of the element within the array and read it; what you get back is another memory address, the location of the actual object (or at least its mutable data; how compound types are laid out in memory is a whole other can o' worms). This is still a constant-time operation: just two steps instead of one.
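
Safe C# won't let you spell this out with pointers, so here is a loose C analogue of that two-step access (the struct and its fields are invented for illustration): the array holds addresses, and reading an element's data is one constant-time index into the array followed by one constant-time dereference.

#include <stdio.h>
#include <stdlib.h>

struct widget { char name[16]; int value; };   /* stand-in for a reference-type object */

int main(void) {
    /* The objects themselves live elsewhere on the heap... */
    struct widget *objects[3];
    for (int i = 0; i < 3; i++) {
        objects[i] = malloc(sizeof *objects[i]);
        snprintf(objects[i]->name, sizeof objects[i]->name, "widget-%d", i);
        objects[i]->value = i * 10;
    }

    /* ...so each array element is just an address.                 */
    /* Step 1: base + 2 * sizeof(pointer) gives the stored address. */
    /* Step 2: dereference that address to reach the object's data. */
    struct widget *w = objects[2];
    printf("%s = %d\n", w->name, w->value);

    for (int i = 0; i < 3; i++) free(objects[i]);
    return 0;
}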
