File Organization and its types
The File is
a collection of records. Using the primary key, we can access the records. The
type and frequency of access can be determined by the type of file organization
which was used for a given set of records.
Objectives of file organization are:
- It
contains an optimal selection of records, i.e., records can be selected as
fast as possible.
- To perform
insert, delete or update transaction on the records should be quick and
easy.
- The
duplicate records cannot be induced as a result of insert, update or
delete.
- For the
minimal cost of storage, records should be stored efficiently.
Types of File Organizations –
·
Sequential File
Organization
·
Heap File
Organization
·
Hash File
Organization
·
B+ Tree File
Organization
·
Clustered File
Organization and Multi key file Organization
Sequential File Organization
In
this method, files are stored sequentially. This method is the easiest method
for file organization. This method can be implemented in two ways:
1.
Pile File Method
2.
Sorted File Method.
1. Pile File Method:
- In this
method, records are stored in a sequence one after another. The records
will be inserted in the order in which they are inserted into tables.
- If updating
or deleting of any record has to be done, the record will be searched in
the memory blocks. When it is found, then it will be marked for deleting,
and the new record is inserted.
To insert a
new record R2 in the sequence, then it will be placed at the end of the file.
2. Sorted File Method:
- In this
method, the new record is always inserted at the file's end, and then it
will sort the sequence in ascending or descending order. Sorting of
records is based on any primary key or any other key.
- In the
case of modification of any record, it will update the record and then
sort the file, and lastly, the updated record is placed in the right
place.
Suppose a new record R2 has to be inserted in the
sequence, then it will be inserted at the end of the file, and then it will
sort the sequence.
Advantages of sequential file organization
- It
contains a fast and efficient method for the huge amount of data.
- In this
method, files can be easily stored in cheaper storage mechanism like
magnetic tapes.
- It is
simple in design. It requires no much effort to store the data.
- This
method is used when most of the records have to be accessed like grade
calculation of a student, generating the salary slip, etc.
- This
method is used for report generation or statistical calculations.
Disadvantages of sequential file organization
- It will
waste time as we cannot jump on a particular record that is required but
we have to move sequentially which takes our time.
- Sorted
file method takes more time and space for sorting the records.
Heap file organization
- In heap
file organization, the records are inserted at the file's end. When the
records are inserted, it doesn't require the sorting and ordering of
records. It works with data blocks. It is the simplest and most basic type
of organization.
- The heap
file is also known as an unordered file. When the data block is full, the
new record is stored in some other block. This new data block need not to
be the very next data block, but it can select any data block in the
memory to store new records.
- In the
file, every record has a unique id, and every page in a file is of the
same size. It is the DBMS responsibility to store and manage the new
records.
- If we want
to search, update or delete the data in heap file organization, then we
need to traverse the data from staring of the file till we get the
requested record.
- If the
database is very large then searching, updating or deleting of record will
be time-consuming because there is no sorting or ordering of records. In
the heap file organization, we need to check all the data until we get the
requested record.
Advantages of Heap file
organization
- It is a
very good method of file organization for bulk insertion. If there is a
large number of data which needs to load into the database at a time, then
this method is best suited.
- In case of
a small database, fetching and retrieving of records is faster than the
sequential record.
Disadvantages of Heap file
organization
- This
method is inefficient for the large database because it takes time to
search or modify the record.
- This
method is inefficient for large databases.
Hash
File Organization
Hash File Organization uses the computation of hash
function on some fields of the records. The hash function's output determines
the location of disk block where the records are to be placed.
If a record has to be received using the hash key, then
the address is generated, and the whole record is retrieved using that address.
If a new record has to be inserted, then the address is generated using the
hash key and record is directly inserted. The same process is applied in the
case of delete and update.
In this method, there is no effort for searching and
sorting the entire file. In this method, each record will be stored randomly in
the memory.
B+
Tree File Organization
A B+ tree is a balanced binary search tree that
follows a multi-level index format. The leaf nodes of a B+ tree denote
actual data pointers. B+ tree ensures that all leaf nodes remain at the
same height, thus balanced. Additionally, the leaf nodes are linked using a
link list; therefore, a B+ tree can support random access as well as
sequential access.
- B+ tree
file organization is the advanced method of an indexed sequential access
method. It uses a tree-like structure to store records in File.
- It uses
the same concept of key-index where the primary key is used to sort the
records. For each primary key, the value of the index is generated and
mapped with the record.
- The B+
tree is similar to a binary search tree (BST), but it can have more than
two children. In this method, all the records are stored only at the leaf
node. Intermediate nodes act as a pointer to the leaf nodes. They do not
contain any records.
The above B+ tree shows
that:
- There is
one root node of the tree, i.e., 25.
- There is
an intermediary layer with nodes. They do not store the actual record.
They have only pointers to the leaf node.
- The nodes
to the left of the root node contain the prior value of the root and nodes
to the right contain next value of the root, i.e., 15 and 30 respectively.
- There is
only one leaf node which has only values, i.e., 10, 12, 17, 20, 24, 27 and
29.
- Searching
for any record is easier as all the leaf nodes are balanced.
- In this
method, searching any record can be traversed through the single path and
accessed easily.
Disadvantages of B+ tree
file organization
- In this
method, searching becomes very easy as all the records are stored only in
the leaf nodes and sorted the sequential linked list.
- Traversing
through the tree structure is easier and faster.
- The size
of the B+ tree has no restrictions, so the number of records can increase
or decrease and the B+ tree structure can also grow or shrink.
- It is a
balanced tree structure, and any insert/update/delete does not affect the
performance of tree.
This method is inefficient for the static method.
Indexed sequential access method (ISAM)
ISAM method is an advanced sequential file organization.
In this method, records are stored in the file using the primary key. An index
value is generated for each primary key and mapped with the record. This index
contains the address of the record in the file. If any record has to be
retrieved based on its index value, then the address of the data block is
fetched and the record is retrieved from the memory.
Pros of ISAM:
- In this
method, each record has the address of its data block, searching a record
in a huge database is quick and easy.
- This
method supports range retrieval and partial retrieval of records. Since
the index is based on the primary key values, we can retrieve the data for
the given range of value. In the same way, the partial value can also be
easily searched, i.e., the student name starting with 'JA' can be easily
searched.
Cons of ISAM
- This
method requires extra space in the disk to store the index value.
- When the
new records are inserted, then these files have to be reconstructed to
maintain the sequence.
- When the
record is deleted, then the space used by it needs to be released.
Otherwise, the performance of the database will slow down.
Types of Indexes
An index is a table which stores the key values and
the corresponding addresses of the records in a file. Given a key value, its
address is located in the index and the corresponding record can be accessed
using this address.
When an index is used, the index rather than the
file is used for locating a record. An index is usually defined on a single
field of a file, called an Indexing Field. The index typically stores each
value of the index field along with a list of pointers to all disk blocks that
contain a record with that field value. The values in the index are ordered and
the index file is much smaller than the data file.
There are several types of indexes. These include
primary index, clustering index, secondary index etc.
A primary index is an index specified on the primary
key of a data file. Primary key is a field which contains unique values and
uniquely identifies each record. By knowing a primary key field’s value we can
access its record. The primary key is also known as the ordering key field and
is used to physically order the file records on the disk.
A clustering index stores data similar to a phone
directory where all people with the same last name are grouped together. A
clustering index is specified on a field that does not have a distinct value,
for each record. These records are then stored in ascending or descending order
according to the data values in this field.
A third type of index, called a secondary index, can
be specified on any field other than the primary key of the file. A secondary
key is any field other than the primary key that is used to uniquely identify a
record in a table.
Cluster
file organization
- When the
two or more records are stored in the same file, it is known as clusters.
These files will have two or more tables in the same data block, and key
attributes which are used to map these tables together are stored only
once.
- This
method reduces the cost of searching for various records in different
files.
- The
cluster file organization is used when there is a frequent need for
joining the tables with the same condition. These joins will give only a
few records from both tables. In the given example, we are retrieving the
record for only particular departments. This method can't be used to
retrieve the record for the entire department.
In this method, we can directly insert, update or delete
any record. Data is sorted based on the key with which searching is done.
Cluster key is a type of key with which joining of the table is performed.
Types of Cluster file organization:
Cluster file organization is of two types:
1. Indexed Clusters:
In indexed cluster, records are grouped based on the cluster
key and stored together. The above EMPLOYEE and DEPARTMENT relationship is an
example of an indexed cluster. Here, all the records are grouped based on the
cluster key- DEP_ID and all the records are grouped.
2. Hash Clusters:
It is similar to the indexed cluster. In hash cluster,
instead of storing the records based on the cluster key, we generate the value
of the hash key for the cluster key and store the records with the same hash
key value.
Advantages of Cluster file
organization
- The
cluster file organization is used when there is a frequent request for
joining the tables with same joining condition.
- It
provides the efficient result when there is a 1:M mapping between the
tables.
Disadvantages of Cluster
file organization
- This
method has the low performance for the very large database.
- If there
is any change in joining condition, then this method cannot use. If we
change the condition of joining then traversing the file takes a lot of
time.
MULTIKEY
FILE ORGANISATION
The
ability to search on many keys is enabled by building multiple index files
(multikey file organisation) “on top of” the data file. The physical database
then consists of one or more data files and many index files, and each data
file contains either one or several record types. Each index file supports
access by a particular field or group of fields.
two
approaches for providing additional access paths into a file of data records.
1.
Multilist file organisation
2.
Inverted file organization
1. Multilist File Organisation
The
basic approach to providing the linkage between an index and the file of data
records is called multilist organisation. A multilist file maintains an index
for each secondary key. The index for secondary key contains, instead of a list
of primary keys related to that secondary key, only one primary key value
related to that secondary key. That record will be linked to other records
containing the same secondary key in the data file.
To
facilitate searching on the primary key as well as on secondary keys, it is
customary to maintain several indexes, one for each key. Using an index in this
way reduces the length of the lists and thus the search time.
2. Inverted File Organisation
In
inverted file organisation, a linkage is provided between an index and the file
of data records. A key’s inverted index contains all of the values that the key
presently has in the records of the data file. Each key-value entry in the
inverted index points to all of the data records that have the corresponding
value. Inverted files represent one extreme of file organisation in which only
the index structures are important. The records themselves may be stored in any
way (sequentially ordered by primary key, random, linked ordered by primary key
etc.).
Both
inverted files and multilist files have:
- An index for each
secondary key.
- An index entry for each
distinct value of the secondary key.
- The index may be tabular
or tree-structured.
- The entries in an index
may or may not be sorted.
- The pointers to data
records may be direct or indirect.
The
indexes differ in that
·
An entry in an inverted
index has a pointer to each data record with that value.
·
An entry in a multilist
index has a pointer to the first data record with that value.
Thus
an inverted index may have variable-length entries whereas a multilist index
has fixed-length entries.
No comments:
Post a Comment