Databases &
Distributed Systems
Take yourself back before the creation of computers and try to imagine how difficult it would be to store important information. All data was held in paper format, often stored in large buildings dedicated to just holding these documents. This was hugely inconvenient; it would take a lot of time and money to find the right file in the building, let alone finding one in another building in a location far away. Luckily, computers have the ability to hold huge amounts of data, with each record being available to view in a matter of clicks.
Computers keep data accessible and searchable, with the equivalent of thousands of pages being able to be stored in a single modern day USB. Databases are data structures that hold vast amounts of data in such a way that makes it easy to modify and search through. The majority of apps and websites you use are just interfaces to databases.
This topic covers:
-
Database consistency, redundancy & independence
-
Benefits & drawbacks of contemporary database systems
-
The use of primary keys, foreign keys & indexes
-
Techniques for data validation & verification
-
Database queries & SQL
-
Database management system & dictionaries
-
Key database terms
-
How data is manipulated in a useful way
-
Distributed databases
-
Database structures
Types of Database
Flat File
A flat file database is in which all of the information is held within a single table, with each column being called an attribute and each row being a record.
The table displayed (taken from Slideplayer.com) represents how a flat file database looks like.
They are mainly used for smaller amounts of data and are not usually suitable for larger projects. This is due to a problem that arises called data redundancy. This is effectively where data is replicated multiple times, which causes problems as it not only wastes crucial storage space, but also creates inconsistent data and failed searchers. This unwanted replication can easily be seen, even within the small database displayed.
You can see that Jim smith's name, gender and DOB have been recorded twice, as well as his teacher's name and ID being recorded for every student. The only thing that differs for Jim Smith's records are the course name and ID. This is exactly what data redundancy is and you can imagine how much larger this problem becomes as a database grows, even with a class of 30 children with the same teacher.
There are many problems with this:
-
If a teacher was changed, every single record stored with his or her name and ID would have to be changed
-
Having a large database with lots of repeated data means some important information can easily be accidentally missed
-
This would cause legal issues due to the data protection act, especially if some data kept is inaccurate or out of date
-
If a new attribute was wanted to be added, such as a students Home address, then this information would also need to be entered and repeated multiple times
It is because of these issues, that this kind of database is has an extremely limited use and why the majority of databases are, instead, a type called Relational databases.
Relational Databases
Relational databases aim to combat the problem of data redundancy that occurs within flat file databases. They manage to avoid repetition via linking many smaller tables together. Although this is greatly beneficial in multiple ways, it does result in a database that requires more maintenance and one that has a higher complexity when being created.
The image displayed represents how a relational database works. (also taken from slideplayer.com) Essentially, rather than repeating the course ID, course name and teacher ID for every record, the course ID instead links to a much smaller database which will provide the rest of the information. This way, each time a course number is used, the database will fetch the data from the suitable smaller database.
In this example, the course ID is a primary key, which is essential in creating relational databases. This is only true however, in the students database. It becomes a foreign key in the course database, as it links back to the main database.
Advantages & Key Terms
-
Increased data consistency - making sure all changes in a database meet the required rules defined within that same database
-
Reduced data redundancy - the problem of data replication
-
Higher data integrity - the accuracy and consistency of the data within the database
-
The ability to combine data from different sources in order to provide solutions to problems
Disadvantages
-
This type of database is far more complex, meaning it requires more time to create and maintain
-
The database can become so heavily relied on, that if it were to fail for any reason, all work based around it would fail also
-
Due to the complexity, database administrators are required to avoid these failures, and this can often lead to a large expense
Database Design
Normalisation
Normalisation is database design techniques that are used in order to best optimise the structure of a database. It does this by covering a flat file database into a relational database. As previously discusses, this helps reduce data duplication. There are 3 different forms, with each one having a greater or lesser effect on this reduction.
First normal form
-
The tables have a primary key, with all the attributes being dependent upon this.
-
Each field contains only one piece of data
-
Tables have no repeating groups of attributes
Second normal form
-
This is when a table is in first normal form and has no partial dependencies, meaning all the attributes in the table depend on the whole of the primary key
-
There are no non-key dependencies
Third normal form
-
When a table is already in second normal form
-
Has no transitive dependencies. This can be explained by: If A is dependent on B and B is dependent on C, then A is transitively dependent on C. This means we can identify C from A, but not A from C.
Indexing
An index is a type of data structure that is used to help reduce the time taken to search through a database. It does this via displaying all information based off of one element. For example, when searching for a students surname you can find their first name, subject, teacher and so on.
Database Views
Database views allow for certain records and tables to be kept restricted, so that only certain users can access the. This is mainly used for privacy reasons, and a common example is something like a workers salary being hidden from non-management members.
Each table has the possibility to be restricted to some users in certain ways. What a user is allowed to do with the table is called a database privilege. Some rights that may be given are:
-
Deletion of records
-
Amendments of records
-
Searching/viewing of records
-
Dropping the table
-
Updating records
Database restrictions are also use in order to help hide complexity. An example is abstraction, showing only the key information in order to simplify the format of data presented to the user. Furthermore, using database views encapsulates complicated SQL queries, simplifies access and provides a higher performance.
Data Validation & Verification
Data validation can be defined within a database, which helps to remove the need for programmers to have to code all of the necessary validation rules into each area of their intended application. This also helps boost the data integrity, as these rules are applied consistently throughout the entire database. Some validation & verification examples can include:
-
Defining a data type within a column in a table so that other data types are rejected
-
Setting up columns to reject null values, so that nothing is left empty when required
-
Length checks for each record so that an overload of information is truncated (not rejected though)
-
Referential integrity is a validation method specific to databases. Essentially, if a foreign key doesn't point to a primary key, then that record is rejected.
Database management Systems
Databases are simply stores of information, whose job is to accept requests in the form of SQL commands and return tables via result sets. As a programmer, you can use a library software in order to connect the database and manipulate the result sets, but to connect, you must have:
-
The IP address
-
The schema name
-
Username
-
Password
To maintain the database a DBMS (Database Management System) is required, which is either GUI or command line based. Command line allows faster access to SQL queries, whilst GUI allow simpler access to management tools. Overall, it is a suite of software that controls all access to the database by:
-
Providing security - restricting access to the data via password protection
-
Allowing for data privacy - access rights that can be granted or revoked
-
Gives data integrity - ensures validation rules are followed
-
Ensures locking takes place - whilst records are updated, they are locked so no data corruption occurs
-
Provides data recovery - allows for recovery of mistakenly deleted records
Big Data
Big data is a data based system whose set is too large to be processed by a single computer. If you look back through the ages, it took us 24,000 years to create 5 exabytes of data. In the modern day, we create that every 2 days. Big data is the title given to the level of growth of this data and they are typified by the 3 V's:
-
Volume of data
-
Velocity of which the data volumes are increasing
-
Variability of the type of data coming into the system
Data can take various forms, like video, audio, animation, email and others. Big data can be analysed for various reasons, but has caused controversy due to privacy issues and vulnerability to hackers.
Data Warehousing
Due to the fact that computers can store huge amounts of data, businesses and governments, especially in recent years, have began to collect and store data. However, a lot of this data stored can often be meaningless, unless a process called Data mining takes place. This is essentially where patterns are found within large data sets and taken advantage of. After data mining has taken place and the patterns are found, Modelling can begin, where upon predictions are made about the impacts of changing different variables.
An example could involve something like the way a supermarket competes with another, by using past data to see the effects of lowering their price by 10 p, as opposed to matching, their competitors price. One may result in more money being made, in which case that price would be chosen. Another use of data mining is fraud detection. Banks can mine your purchase history, and if you have many constant purchases around one location, then suddenly a £3000 brand new TV is brought in another country, the payment may be declined as this acts as suspicious activity. This can sometimes cause problems when on holiday, but is often for the better good.
Data mining requires large amounts of storage, complex algorithms and many high performance processors. The final thing to remember, is that the data used must be brought together from multiple systems and stored in a Data warehouse, where it is changed regularly, to keep its purpose.
Distributed Systems
Network services , like printing, take the form of servers, with each service provided on a network being carried out by one or more or these servers. Distributed servers help separate these services across multiple servers/computers.
Databases need a server to run on and in most cases, will be stored on a separate server to the other main applications, with the application server communicating with the databases server when it requires data from it for processing.
The separation from processing (application) and data (database) is used for a multitude of reasons.
-
Application servers requires much less storage space than database servers
-
If a database server goes down, it can be replaced without affecting the application server. This means the backup server could be switched to fairly easily, without affecting the users too much. This increases data recovery and server redundancy.
-
Data security is also increased, as even if a hacker was to gain access to the application server, they still would have no access to the database server.
Distributed Databases
A large database may be split into multiple servers, with each one being used by a certain part of an organisation. For example, a normal business that trades may have a distributed database containing one server for stock, one for human resources and another for the sales. You can begin to see how easily these databases/servers may need to access one another.
If you consider the company making a sale, then not only does the sales database need to be updated, but so does the stock. As this means the transaction is updating more than one database at a time, it becomes more complicated. This is because that if one server fails to update, then data corruption would occur. To help combat this, transactions occurring across distributed databases take an all or nothing approach, where upon either all servers update, or none do. To do this, they use a two-phase commit algorithm:
-
First Phase - each server/database is asked if they are ready to update the data and then are updated up to the point where they are finalising it. Here, the databases will respond as to whether everything is okay or not. If any respond negatively, then the update is rolled back and the changes cancelled. Otherwise, the second phase begins.
-
Second Phase - this is known as the commit stage, where the updates are finalised. After this, another message is sent, verifying that everything went as it should have. If any reply negatively, then again the whole update is rolled back and cancelled.
There are some advantages to distributed databases:
-
Security - access to one database doesn't mean access to another
-
Speed - the load on a single database is heavily reduced
-
Growth - new databases can easily be added to the network of databases
-
Reliability - if one database goes down, other areas can continue to work