Using the Data Warehouse

In order to utilize the data warehouse fully, it's important to understand, generally, how it is structured and why. The aim of this page is to offer a brief explanation of these things.

If all you want is an an overview of what's in the data warehouse, look here.

There are also separate pages on how to get connected and where to find data documentation.

The short explanation of how the data warehouse is structured is that it's a big database that's divided up into spreadsheet-like tables into which regularly used data is loaded from our major systems of record, Advance, Slate, Colleague, etc.  The spreadsheet-like tables are of two kinds: dimension tables (aka dimensions) and fact tables.  These tables join together into so-called star schemas, which all together form the relational part of the data warehouse. 

The relational part of the data warehouse may be accessed directly (e.g., using a query tool such as Excel 2013 or later's PowerQuery).  More typically, however, the data in the relational part of the data warehouse is loaded into OLAP cubes and then accessed, by users, from an Excel pivot table or via one or another reporting tool, such as Tableau.

When do you use OLAP cubes?  When you want to look at summarized data (total x over y years; average z for the current year by age and gender, and so on).  If you are looking for a list, however, query the relational part of the data warehouse, or probably better yet, get your list from the system where the data originates.  For example, if you want a names and contact information for individual people, or lists of courses, or specific financial transactions, you should not be using a cube.  For such situations, a relational-database query is better.  Or perhaps a conversation with someone who administers the data in Colleague, Advance, Slate, etc.  


Dimensions (in the relational part of the data warehouse) hold information about 1) real-world entities like people and places, and 2) conventional entites like dates, times, terms, years, and other concept and labels we use to divide up or compare numbers and facts in reports and analyses (e.g., "Do we have more applicants from New England this year than last?").  Dimensions' names all begin with the prefix MD_ (e.g., MD_Date, and MD_DegreeStudent).

Fact Tables

Fact tables hold numeric information about the entities in the dimensions - not the labels and classifiers that we house in dimensional tables, but rather things we summarize and aggregate (average, median, sum, distinct count, etc.).

Take, for example, the dimension table, MD_Prospect, which contains a long list of prospective students, along with their names, home geographical regions, genders, etc.  Alongside that dimension we also have a series of fact tables, MF_TopSATScore, MF_ReaderRating, etc., that contain exact numeric values like SAT scores, numeric ratings given to applications by Admissions staff members, and counts of activities we have performed for prospective students.

These particular fact tables and dimensions, when linked together into related groups called star schemas, allow us (using appropriate reporting tools) to answer questions, quickly and easily, like:

  1. What activities are most closely associated with a student applying? Having a family member who is an alum? Coming to visit campus?
  2. What are the average SAT scores of applicants, by race, by gender, by region (or other things)? Is this changing over time?
  3. How has our yield (matriculants / admittees) changed over time?

OLAP Cubes

As noted above, though, most data warehouse users don't access the relational part of the data warehouse directly, because it may require them to know a query language (SQL) or be able to join relational fact and dimensional tables.

Instead, most data warehouse users access the data warehouse through online analytical processing (OLAP) cubes.

OLAP cubes are special data structures that facilitate slicing and dicing of data into charts and interactive pivot tables, drill-down reports, and visually pleasing collections of these (typically called dashboards).  OLAP cubes may be accessed with a variety of software, from plain old Excel, to higher-end data visualization and display tools like Tableau.

Data Mining

Not all questions are best answered using OLAP, however. Some questions, especially in the initial stages of analysis and exploration, are best answered using data mining techniques, which allow users to visualize how some parts of a data set are related to other parts.  Data mining techniques, e.g., would be used to make a first pass at determining what factors have the most negative influence on student retention, or what demographic or personal characteristics seem most predictive of a high GPA.

In the past, data mining was the exclusive province of highly trained statisticians, actuaries, and other analytically/quantitatively trained professionals, and although many data mining tasks are still best done by such professionals, quite a bit can actually be done by non-experts using simple tools (data mining add-ins) available to all users of Microsoft Excel.

Data Cleansing

It is also worth noting that the data in the data warehouse is extensively cleaned, normalized, and massaged to fit our institutional standards, so that those using it don't have to worry about inconsistencies, missing chunks of data, and obscure codes.

The data is loaded overnight, to free the systems up during the day for OLAP and data mining tasks. Data in the warehouse is therefore at most a day old, i.e., fresh enough for most OLAP and data mining tasks, but not suitable for operational reporting ("As of lunch time today, how many admitted students have paid their deposit?") or for data "dumps" for mailing labels, or for tracking day-to-day transactions.

Note that Mac users cannot currently access the data warehouse directly.  On the technical reasons for this, see AD Notes