Database Optimization - First 3 Forms of Normalization [top]
Now that we have a basic understanding of how to create a simple table, we need to develop a better understanding of how to separate our data into a particular configuration of tables. Database design has evolved to the point where there are specific steps you can take to optimize any database configuration. This process is called normalization and it has three distinct forms.
Here is a set of raw data about several known musicians:
First Normal Form
In order for a set of data to be in first normal form, the following must be true:
The second condition says that if you have similar types of data in different columns, move them to one single column, or create a new table for them if you have to.
So, what happens when we apply these rules to our data set?
As you can see, the column "birth_information" violated our first rule, since it contained both a place and a date. Thus, we had to separate them into their own columns.
Additionally, for our instrument fields, we had the same type of data in three different columns. The only way for us to put this data in first normal form is to put it in its own table, with a foreign pointing back to the musician, in order to keep track of which musician plays which instrument. This necessitated us to add an ID field to the musician table in order to uniquely identify them.
This is first normal form.
Second Normal Form
In order for a set of data to be in second normal form, the following must be true:
So, what happens when we apply these rules to our current data set?
Our old "birthplace" actually contained two entities: a city, and a country. Thus, we needed to split this values into two values: birth_city and birth_country
This is second normal form.
Third Normal Form
Third normal form is all about removing dependencies in our table design. In order for a set of data to be in third normal form, the following must be true:
So, what happens when we apply these rules to our current data set?
As you can see, we were able to eliminate the "age" field because it was dependent upon the "birthdate" field. There is no need for us to store age because we can always just derive it by subtracting it from the current date.
This is second normal form.
Removing Further Redundency
Just because something is normalized doesn't mean you can't streamline things a bit more. There is one other thing you can investigate to make sure your design is as compact as possible. For this, I pay attention to two things:
So, what happens when we apply these rules to our current data set?
We found that the number of countries and the number of instruments to be finite in their range of possible values. Thus, we created two new tables: "instrument" and "country". For instrument, we actually replaced the old "instrument" table with a new one with its own primary key. We then renamed the old "instrument" table as "musician_x_instrument" and replaced the "instrument" field with "instrument_id", which is a foreign key to the new "instrument" table. By doing this, we created a many-to-many relationship between musicians and the instruments they play. For the country of birth for each musician, we created a "country" table with a primary key, and then replaced the old "birth_country" column in the "musician" table with "birth_country_id", which is a foreign key to the "country" table.
Now we have a design that is optimized and that we can really run with.
Multiple Table Joins [top]
The whole point of having a relational database is to be able to connect two or more different tables such that we can extract various sets of data. The process of pulling data from two or more related tables is called a join. The most common is the inner join which has two different syntaxes. For our purposes, we will be using the older syntax, which is a bit easier to read. This format is:
SELECT [DISTINCT] column1[,column2] FROM table1,table2[,table3] [WHERE "conditions"] [GROUP BY "column-list" [ASC | DESC] ] [HAVING "conditions] [ORDER BY "column-list" [ASC | DESC] ] [LIMIT [offset,] number-rows]
For example, if we want to get the names of our musicians and their respective birth countries from our example above, we would say:
SELECT m.firstname, m.lastname, c.name FROM musician m, country c WHERE m.birth_country_id=c.id ORDER BY m.lastname, m.firstname
which would return the following results:
Note that in order to identify columns, it is necessary to give each table its own abbreviation. In the above case, we use "m" for the musician table, and "c" for the country table. Thus, if we want to refer to a column name in the musician table, we use the syntax "m.[column name]".
Also note that in order to equate the two tables, we are using the foreign key in the musician table (m.birth_country_id), that points to the primary key in the country table (c.id), and making it a necessary condition that they must be equal to other. If we don't we get this:
SELECT m.firstname, m.lastname, c.name FROM musician m, country c ORDER BY m.lastname, m.firstname
which would return the following results:
This redundency occurs because we never included our country-related reference in our musician table (i.e. birth_country_id) in our query. Thus, the query returned every musician in combination with every country. It's only when we make use of the "WHERE m.birth_country_id=c.id" clause that we look for the birth country for each individual musician.
If we wanted to see which of our musicians plays guitar, we would need to use three tables: musician, instrument, and musician_x_instrument. Note that this is because the relationship of musicians to instruments is a many-to-many relationship. If it's a one-to-many relationship, such as a musician and the country in which they were born, then it only involves two tables, which is what we performed above. Because it is three tables, we need to make two equations:
SELECT m.firstname, m.lastname, i.name FROM musician m, instrument i, musician_x_instrument x WHERE x.musician_id=m.id AND x.instrument_id=i.id AND i.name='guitar' ORDER BY m.lastname, m.firstname
This returns the following recordset:
You can also use the OR clause if you want to know more general things, like who plays guitar OR bass:
SELECT m.firstname, m.lastname, i.name FROM musician m, instrument i, musician_x_instrument x WHERE x.musician_id=m.id AND x.instrument_id=i.id AND (i.name='guitar' OR i.name='bass') ORDER BY m.lastname, m.firstname
This returns the following recordset:
Note that we had to put parentheses around the instrument part of our query, because the OR part is only applicable to instrument name.
If you want to get really serious, you can find out who plays guitar AND sings. When you need to match two separate values of a field using the same table, you need to equate two different instances of the related tables. This means one musician table, mapping to two musician_x_instrument tables, which each map to a separate instance of the instrument table. Then, each instrument table reference maps to its respective desired value (i.e. "guitar" and "vocals"):
SELECT m.firstname, m.lastname, i1.name, i2.name FROM musician m, musician_x_instrument x1, instrument i1, musician_x_instrument x2, instrument i2 WHERE x1.musician_id=m.id AND x2.musician_id=m.id AND x1.instrument_id=i1.id AND x2.instrument_id=i2.id AND (i1.name='guitar' AND i2.name='vocals') ORDER BY m.lastname, m.firstname
This returns all the musicians who sing AND play guitar:
Advanced SELECT Queries [top]
One of the main reasons you store data in a database is so that you can extract very targeted sets of data and manipulate this set in a desired fashion. Thus, the SELECT statement can utilize many functions and clauses which provide this kind of flexibility:
COUNT()
Use the COUNT() function to count the number of records returned by a SELECT statement.
SELECT COUNT(m.id) FROM musician m, instrument i, musician_x_instrument x WHERE x.musician_id=m.id AND x.instrument_id=i.id AND i.name='guitar'As we saw earlier, there were 5 musicians that played guitar, so this returns the number 5, which corresponds to the number of records that match this criteria.
SUM()
Use the SUM() function to add up all the values of a particular field in the records returned by a SELECT statement. Want to know how much money your store took in today?
SELECT SUM(total_price) FROM purchase WHERE purchase_date=CURRENT_DATE()By the way, the function "CURRENT_DATE()" returns the current system date in format "YYYY-MM-DD".
AVG()
Use the AVG() function computes the average of all the values of a particular field in the records returned by a SELECT statement. If you're going to store the result back into a field in your table (say, the average user rating of a recording), make sure you put it in a field of type DOUBLE.
SELECT "rating"=AVG(user_rating) FROM recording_review WHERE title="Hello, Nasty" UPDATE recording SET avg_rating=$rating, last_modified=NOW() WHERE title="Hello, Nasty"
MAX() and MIN()
Use the MAX() function returns the row with the maximum value of a particular field. What is the most expensive product in yout store?
SELECT MAX(total_price), title FROM productor the cheapest:
SELECT MIN(total_price), title FROM product
ORDER BY
The "ORDER BY" clause is great for sorting a result set. You can also specify whether you want the results sorted "low to high", specified using "ASC" (the default), or "high to low" which is "DESC":
SELECT firstname, lastname, birthdate FROM musician ORDER BY birthdate ASCThis shows all the musician in the "musician" table sorted by their birthdate with oldest birthdate first.
SELECT id, title, price FROM recording ORDER BY price DESCThis shows the most expensive recordings first and in order of descending price.
GROUP BY
The "GROUP BY" clause, at its heart, has an aggregating behavior, much like COUNT. Use it in conjunction with COUNT to get the number of instances of a particular attribute value:
SELECT count(id), release_year FROM recording GROUP BY release_yearThis shows all the unique release years that exist in the recording table and how many recordings there are for each year. You can also use the ASC or DESC option with GROUP BY to sort your results accordingly.
LIMIT
Use the LIMIT keyword to control the maximum amount of records that are obtained by your query. If there's fewer, then that's what you get. If you add a second parameter, the second number is the number of records you want to skip before starting to retrieve your result set:
SELECT id, title FROM recording ORDER BY release_year LIMIT 10This returns the 10 oldest recordings in your table, sorted oldest to newest.
SELECT id, title FROM recording ORDER BY release_year LIMIT 10, 5This returns the 5th to 14th oldest recordings in your table.
IN
NOTE: The IN keyword is only available in MySQL version 4.1 and above. The IN keyword is great for obtaining a result set that uses another result set as a condition in the WHERE clause. For example:
SELECT id, title FROM recording WHERE id IN (SELECT recording_id FROM recording_review) ORDER BY titleThis gives you the names of all the recordings that have been reviewed. It compares the primary key in the "recording" table (id) to its foreign key in the "recording_review" table (recording_id), and if it exists, then that record from the "recording" table is returned. Similarly:
SELECT id, title FROM recording WHERE id NOT IN (SELECT recording_id FROM recording_review) ORDER BY titlegives you all the recordings that have not been reviewed.
MySQL Date and Time Functions [top]
MySQL allows you to perform many arithmetic and logical operations using the SELECT. The functions which obtain, format, and manipulate date and time are particularly useful:
NOW() SYSDATE() CURRENT_TIMESTAMP CURRENT_TIMESTAMP() LOCALTIME LOCALTIME() LOCALTIMESTAMP LOCALTIMESTAMP()Returns the current date and time as a value in 'YYYY-MM-DD HH:MM:SS' or YYYYMMDDHHMMSS format, depending on whether the function is used in a string or numeric context:
mysql> SELECT NOW();
-> '1997-12-15 23:50:26'
mysql> SELECT NOW() + 0;
-> 19971215235026
DATE_FORMAT(date,format)Formats the date value according to the format string. The following specifiers may be used in the format string:
| Specifier | Description |
| %M | Month name (January..December) |
| %W | Weekday name (Sunday..Saturday) |
| %D | Day of the month with English suffix (0th, 1st, 2nd, 3rd, etc.) |
| %Y | Year, numeric, 4 digits |
| %y | Year, numeric, 2 digits |
| %X | Year for the week where Sunday is the first day of the week, numeric, 4 digits; used with %V |
| %x | Year for the week, where Monday is the first day of the week, numeric, 4 digits; used with %v |
| %a | Abbreviated weekday name (Sun..Sat) |
| %d | Day of the month, numeric (00..31) |
| %e | Day of the month, numeric (0..31) |
| %m | Month, numeric (00..12) |
| %c | Month, numeric (0..12) |
| %b | Abbreviated month name (Jan..Dec) |
| %j | Day of year (001..366) |
| %H | Hour (00..23) |
| %k | Hour (0..23) |
| %h | Hour (01..12) |
| %I | Hour (01..12) |
| %l | Hour (1..12) |
| %i | Minutes, numeric (00..59) |
| %r | Time, 12-hour (hh:mm:ss followed by AM or PM) |
| %T | Time, 24-hour (hh:mm:ss) |
| %S | Seconds (00..59) |
| %s | Seconds (00..59) |
| %f | Microseconds (000000..999999) |
| %p | AM or PM |
| %w | Day of the week (0=Sunday..6=Saturday) |
| %U | Week (00..53), where Sunday is the first day of the week |
| %u | Week (00..53), where Monday is the first day of the week |
| %V | Week (01..53), where Sunday is the first day of the week; used with %X |
| %v | Week (01..53), where Monday is the first day of the week; used with %x |
| %% | A literal `%'. |
All other characters are just copied to the result without interpretation. The %f format specifier is available as of MySQL 4.1.1. As of MySQL Version 3.23, the `%' character is required before format specifier characters. In earlier versions of MySQL, `%' was optional. The reason the ranges for the month and day specifiers begin with zero is that MySQL allows incomplete dates such as '2004-00-00' to be stored as of MySQL 3.23.
mysql> SELECT DATE_FORMAT('1997-10-04 22:23:00', '%W %M %Y');
-> 'Saturday October 1997'
mysql> SELECT DATE_FORMAT('1997-10-04 22:23:00', '%H:%i:%s');
-> '22:23:00'
mysql> SELECT DATE_FORMAT('1997-10-04 22:23:00',
'%D %y %a %d %m %b %j');
-> '4th 97 Sat 04 10 Oct 277'
mysql> SELECT DATE_FORMAT('1997-10-04 22:23:00',
'%H %k %I %r %T %S %w');
-> '22 22 10 10:23:00 PM 22:23:00 00 6'
mysql> SELECT DATE_FORMAT('1999-01-01', '%X %V');
-> '1998 52'
TIME_FORMAT(time,format)This is used like the DATE_FORMAT() function above, but the format string may contain only those format specifiers that handle hours, minutes, and seconds. Other specifiers produce a NULL value or 0. If the time value contains an hour part that is greater than 23, the %H and %k hour format specifiers produce a value larger than the usual range of 0..23. The other hour format specifiers produce the hour value modulo 12:
mysql> SELECT TIME_FORMAT(NOW(), '%H:%i:%s');
-> '21:30:15'
Now that we know how to obtain and format our dates, let's look how to add or subtract time to them:
DATE_ADD(date,INTERVAL expr type) DATE_SUB(date,INTERVAL expr type)These functions perform date arithmetic.
As of MySQL Version 3.23, INTERVAL expr type is allowed on either side of the + operator if the expression on the other side is a date or datetime value. For the - operator, INTERVAL expr type is allowed only on the right side, because it makes no sense to subtract a date or datetime value from an interval. (See examples below.)
date is a DATETIME or DATE value specifying the starting date. expr is an expression specifying the interval value to be added or subtracted from the starting date. expr is a string; it may start with a `-' for negative intervals. type is a keyword indicating how the expression should be interpreted.
The following table shows how the type and expr arguments are related:
| type Value | Expected expr Format |
| SECOND | SECONDS |
| MINUTE | MINUTES |
| HOUR | HOURS |
| DAY | DAYS |
| MONTH | MONTHS |
| YEAR | YEARS |
| MINUTE_SECOND | 'MINUTES:SECONDS' |
| HOUR_MINUTE | 'HOURS:MINUTES' |
| DAY_HOUR | 'DAYS HOURS' |
| YEAR_MONTH | 'YEARS-MONTHS' |
| HOUR_SECOND | 'HOURS:MINUTES:SECONDS' |
| DAY_MINUTE | 'DAYS HOURS:MINUTES' |
| DAY_SECOND | 'DAYS HOURS:MINUTES:SECONDS' |
| DAY_MICROSECOND | 'DAYS.MICROSECONDS' |
| HOUR_MICROSECOND | 'HOURS.MICROSECONDS' |
| MINUTE_MICROSECOND | 'MINUTES.MICROSECONDS' |
| SECOND_MICROSECOND | 'SECONDS.MICROSECONDS' |
| MICROSECOND | 'MICROSECONDS' |
The type values DAY_MICROSECOND, HOUR_MICROSECOND, MINUTE_MICROSECOND, SECOND_MICROSECOND, and MICROSECOND are allowed as of MySQL 4.1.1.
MySQL allows any punctuation delimiter in the expr format. Those shown in the table are the suggested delimiters. If the date argument is a DATE value and your calculations involve only YEAR, MONTH, and DAY parts (that is, no time parts), the result is a DATE value. Otherwise, the result is a DATETIME value:
mysql> SELECT '1997-12-31 23:59:59' + INTERVAL 1 SECOND;
-> '1998-01-01 00:00:00'
mysql> SELECT INTERVAL 1 DAY + '1997-12-31';
-> '1998-01-01'
mysql> SELECT '1998-01-01' - INTERVAL 1 SECOND;
-> '1997-12-31 23:59:59'
mysql> SELECT DATE_ADD('1997-12-31 23:59:59',
-> INTERVAL 1 SECOND);
-> '1998-01-01 00:00:00'
mysql> SELECT DATE_ADD('1997-12-31 23:59:59',
-> INTERVAL 1 DAY);
-> '1998-01-01 23:59:59'
mysql> SELECT DATE_ADD('1997-12-31 23:59:59',
-> INTERVAL '1:1' MINUTE_SECOND);
-> '1998-01-01 00:01:00'
mysql> SELECT DATE_SUB('1998-01-01 00:00:00',
-> INTERVAL '1 1:1:1' DAY_SECOND);
-> '1997-12-30 22:58:59'
mysql> SELECT DATE_ADD('1998-01-01 00:00:00',
-> INTERVAL '-1 10' DAY_HOUR);
-> '1997-12-30 14:00:00'
mysql> SELECT DATE_SUB('1998-01-02', INTERVAL 31 DAY);
-> '1997-12-02'
mysql> SELECT DATE_ADD('1992-12-31 23:59:59.000002',
-> INTERVAL '1.999999' SECOND_MICROSECOND);
-> '1993-01-01 00:00:01.000001'
If you specify an interval value that is too short (does not include all the
interval parts that would be expected from the type keyword),
MySQL assumes you have left out the leftmost parts of the interval
value. For example, if you specify a type of DAY_SECOND, the
value of expr is expected to have days, hours, minutes, and seconds
parts. If you specify a value like '1:10', MySQL assumes
that the days and hours parts are missing and the value represents minutes
and seconds. In other words, '1:10' DAY_SECOND is interpreted in such
a way that it is equivalent to '1:10' MINUTE_SECOND. This is
analogous to the way that MySQL interprets TIME values
as representing elapsed time rather than as time of day.
Note that if you add to or subtract from a date value something that contains a time part, the result is automatically converted to a datetime value:
mysql> SELECT DATE_ADD('1999-01-01', INTERVAL 1 DAY);
-> '1999-01-02'
mysql> SELECT DATE_ADD('1999-01-01', INTERVAL 1 HOUR);
-> '1999-01-01 01:00:00'
If you use really malformed dates, the result is NULL. If you add
MONTH, YEAR_MONTH, or YEAR and the resulting date
has a day that is larger than the maximum day for the new month, the day is
adjusted to the maximum days in the new month:
mysql> SELECT DATE_ADD('1998-01-30', interval 1 month);
-> '1998-02-28'
For a more real-world example of using this, we can re-do our SELECT query to obtain the total revenue for our store today:
SELECT SUM(total_price) FROM purchase WHERE purchase_date>=CURDATE() AND purchase_date<(CURDATE() + INTERVAL 1 DAY);Or how many new messages have been posted in the last hour:
SELECT count(id) FROM message WHERE date_posted>=(NOW() - INTERVAL 1 HOUR);
Related Resources [top]