Data structures
A data structure is a collection of data items held in the computer’s memory. Each data structure is given a unique identifier(name) and separate items are accessed using an index. There are many, many that exist, but we only need to know about:
-
One, two and three dimensional arrays
-
Records
-
Stacks
-
Queues
-
Binary trees
-
Linked lists and Hash tables
Types of Data structure
Static – Have a fixed size set at the start of the program, which can’t be changed. This can be very limiting as you don’t always know the size needed at the start. However, they are easier to program, and you will always know the space they take up.
Dynamic – Their size can expand or truncate when needed, helping save unneeded space, as well as avoiding potential loss of data and this is limited by the computers memory space. The Operating System is responsible for allocating the memory space for each data structure.
Static:
Arrays – Can be 1,2 or 3 dimensional.
-
One dimensional – This is the simplest of data structures. To access a variable within the array, you refer to the index location. For example, Print(Names [5]) would return the 6th name In the array. This is because index counting starts at 0. To change a variable, you simply state the index location to be changed, and what to change it to. For example, Names [5] = ‘Toby’ would set the 6th name in the array to be Toby, and therefore return ‘Toby’ when called upon. As the size is declared beforehand, requesting a value above the size would cause an error message.
-
Two dimensional - You simply place another array within the already made array by separating with a comma. In example we have: names = [['Jay, 'Ben', 'Bert'], ['Sam' , 'Fred' , 'Smith']]. You can imagine this as creating a table, with array location [0] along the x being the first array and [1] being the following array of names, Sam, Fred and smith. To refer to the piece of data you want: Print( names[0][2] ) would print the name ' Bert' (Remember index counting starts at 0!) and writing (Names[1][0] ) = 'Albert' would replace data item 'Sam' with the name 'Albert'
-
Three dimensional -
This is the declaration of a 3D array:
int arr[3][3][3]=
{
{
{11, 12, 13},
{14, 15, 16},
{17, 18, 19}
},
{
{21, 22, 23},
{24, 25, 26},
{27, 28, 29}
},
{
{31, 32, 33},
{34, 35, 36},
{37, 38, 39}
},
There are 3 arrays, each with 3 arrays within them, and 3 data items within that array. This allows for 27 (3x3x3) data items. Print (arr[0] [2] [1] ), would give the output ‘18’, as this is the second data item, in the third array of the first array. Confusing. The disadvantage of 3d arrays is that they are more complex to code.
Records
Are a set of data items all related to one subject. It differs from an array, in that a record can contain more than one data type. For example, you could have a record called ‘Memberdetails’ that contains a Member ID as a field name, followed with the data type Integer as well as having a field called First Name, that contains data type String. To access each part, simply put Memberdetails. Fieldname, with Fieldname changing according to what field should be accessed.
Dynamic:
Stacks
-
Are Last in, First out structures (LIFO), meaning the most recently added piece of data is the first piece taken off. Think of it as a ‘stack’ of objects. The object on top was the last item added, but will be the first taken off.
-
Adding data is called pushing and removing data is called popping, which are prebuilt functions in most languages.
-
As items are added, the stack grows in size and shrinks when items are removed.
-
The stack pointer refers to the current position of the top of the stack. This is incremented when items are added (therefore an empty stack is set at -1) and decremented when items are removed.
-
They can be implemented using an array and a pointer. This requires variables to store the max and min size of the stack, so that there are no overflow or underflow errors.
-
They are used to check syntax, store content of special purpose registers and storing values of variables when recursive programs call themselves
Queues
-
Are first in, first out structures (FIFO), meaning the most recently added piece of data is the last piece taken off, as new items are added to the back of the queue. This can be thought of as a literal queue, with the first person in the queue being the first person to leave.
-
Adding data is called enqueuing and removing data is called dequeuing
-
In a linear queue, there is a problem in that the data items move along the data structure, leaving fee space behind. This can generate overflow errors, even though there actually is space before the data items.
-
To fix this, a circular queue uses a check to see when a data item reaches the end of the structure. It will then loop back round to the start of the data structure.
-
They are used to store print jobs on a network and to store jobs in a multi-processor system as it needs a first come first served order.
Toby[5]
Linked Lists
-
It is a linear collection of data items, where each item points to the next item in a specific order, such as alphabetically, instead of in their physical position. Together they represent a sequence.
-
This allows for efficient addition or removal of items in a sequence, as we don’t have to go through every data item. It also avoids having to constantly sort the list.
-
Each item in a list is called a node (Data and a link) and the last item in the list has a null pointer (-1) to indicate the lists end.
-
It has 3 pointers: One called start that indicates the first item in the list. Another called previous that points to the previous item checked during the running of the algorithm and another called current that points to the currently checked item in the list.
-
They can store available memory locations and are Used for things like arrival times, exit times etc.
-
A new item is added to the next free space, with the algorithm starting at the head of the list, then checking each node until the position is found. Then, the pointer ‘previous’ of the previous node is copied to the link of the new item. The pointer of the previous node is then updated to point to the current item.
-
To remove an item, follow the pointers to find the item to remove. Copy the link from the item to be removed to the previous items link and add the current to the start of the free list.
Tree
Can remove half of the remaining tree at each check during a search. Names can easily be added onto the end as well, putting it into order. The links determine the order, rather than the physical location, so they don’t have to be put into order each time.
There is a route node, which is at the top of the hierarchy. From here, each element will have a left and right pointer. A new item is added by comparing against each element along the tree. If an item is larger, then we follow the right pointer, therefore removing the half of the list that was on the left. If the compared item is less, then we follow the left pointer. We do this until we reach a null pointer (where there is no pointer on the left/right route, it is instead a -1), where the new item will be placed. It’s called a tree, as each element has a branch coming off of itself.
It is a hierarchy structure, with each element being a child to another element, and a parent to another. An element can have multiple children (a left and right pointer). If there are no children (2 null pointers), then the element is called a leaf node, indicating an end of the tree.
When implementing via an array, the current pointer also exists in order to identify the position in the array of the current element.
Here is the algorithm to create a tree via arrays:
If tree = empty then
Add new_item as root node
Else
Repeat
If new_item < current_item Then
Move left
Else
Move right
End if
Until null pointer is found
Create new node
Insert new_item
End if
Increment size_of_tree
Hash Tables
A dynamic data structure in which a key value is passed into a hashing function which converts the key into an integer. The integer is used as n index, referring to the position in the hash table in which the value will be stored. They have a problem in which different keys can generate the same index and therefore point to the same location for more than one piece of data. They use something called buckets (linked lists) to store more than one piece of data in the same location and overcome the problem.
The hash function is used to distribute the entries across an array of buckets. The algorithm creates an index based on a key index:
f(key , array_size)
To separate a hash function from the array size the hashing function takes an attribute of a data element (e.g. the length of a string entered) then performs a modular division on it by the size of the table. The remainder generate the string’s index number:
Hash = hash_function(key)
Index = hash MOD array_size
The load factor is the ratio of the number of entries in the table, n , divided by the total number of buckets , k.
Load factor = n/k
As the load factor increases, the access rate slows
Collisions occur when the hash function generates the same index for more than one key. As stated beforehand, they are then stored in the same bucket to fix this problem. This slows access times as linked lists are only accessed linearly and a bucket is a linked list.
They are fairly fast for storing and retrieving data items for larger amounts of data when compared to other structures. The performance is heavily dependant on the load factor and collision rate.
They are used for memory mapping, like storing the locations of files in memory. When searching for the location of a file, a search is performed on a hash table. They are also used for database indexes and implementation of caches.
Thanks for reading, if you enjoyed or learnt something new, please consider visiting our sponsor @lifes_sweet2303
'I am a fully registered and insured cake business with level 2 food hygiene - making cakes to order with all your requirements!'