Using the Data Warehouse
To use the data warehouse fully, it helps to understand, in general terms, how it is structured and why. This page offers a brief explanation of both, along with basic information on how to get connected and where to find documentation.
The short explanation is that the data warehouse is a large database divided into spreadsheet-like tables, into which regularly used data is loaded from our major systems of record: Advance, Recruitment Plus, and Colleague.
The spreadsheet-like tables are of two kinds: dimension tables and fact tables.
Dimension tables hold information about 1) real-world entities like people and places, and 2) conventional entities like dates, times, terms, years, etc., which we use as ways of dividing up or comparing numbers and facts in reports and analyses (e.g., "Do we have more applicants from New England this year than last?"). Dimension tables' names all begin with the prefix MD_ (e.g., MD_Date and MD_DegreeStudent).
Fact tables hold numeric information about the entities in the dimension tables. Take, for example, the dimension table MD_Prospect, which contains a long list of prospective students, along with their names, home geographical regions, genders, etc. Alongside that dimension we also have a series of fact tables, MF_TopSATScore, MF_ReaderRating, MF_ProspectActivity, etc., that contain exact numeric values like SAT scores, numeric ratings given to applications by Admissions staff members, and counts of activities we have performed for prospective students.
These particular fact and dimension tables, when linked together into related groups called star schemas, allow us (using appropriate software) to answer questions like:
- What activities are most closely associated with a student applying? Having a family member who is an alum? Coming to visit campus?
- What are the average SAT scores of applicants, by race, by gender, by region (or other things)? Is this changing over time?
- What demographic characteristics do our application readers seem to rely most heavily on to come up with their ratings?
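The fact-plus-dimension pattern above can be sketched in a few lines of code. The following is a toy, in-memory star schema using hypothetical, simplified versions of MD_Prospect and MF_TopSATScore (the real warehouse tables have many more columns, and the data here is invented), showing how joining a fact table to its dimension answers a question like "average SAT score by region":

```python
import sqlite3

# Build a tiny in-memory star schema: one dimension table and one
# fact table keyed to it. Column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE MD_Prospect (
        prospect_key INTEGER PRIMARY KEY,
        name   TEXT,
        region TEXT,
        gender TEXT
    );
    CREATE TABLE MF_TopSATScore (
        prospect_key INTEGER REFERENCES MD_Prospect(prospect_key),
        sat_score    INTEGER
    );
""")
conn.executemany("INSERT INTO MD_Prospect VALUES (?, ?, ?, ?)", [
    (1, "Alice", "New England", "F"),
    (2, "Bob",   "Midwest",     "M"),
    (3, "Carol", "New England", "F"),
])
conn.executemany("INSERT INTO MF_TopSATScore VALUES (?, ?)", [
    (1, 1400), (2, 1250), (3, 1300),
])

# A typical star-schema query: join the fact table to its dimension
# and aggregate -- "average SAT score by region."
rows = conn.execute("""
    SELECT p.region, AVG(f.sat_score)
    FROM MF_TopSATScore f
    JOIN MD_Prospect p ON p.prospect_key = f.prospect_key
    GROUP BY p.region
    ORDER BY p.region
""").fetchall()
for region, avg_score in rows:
    print(region, avg_score)
```

The dimension supplies the "by what" (region), the fact table supplies the number (SAT score); linking more fact tables to the same dimension is what turns the group into a star schema.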
Some of these questions are answered using software designed for online analytical processing (OLAP). OLAP is what you are doing if you are slicing and dicing data with pivot tables, or constructing drill-down reports—or just exploring sets of data by this or that attribute (e.g., age, gender, salary) over time, geographical region, etc. OLAP can be done with Microsoft Excel, Microsoft Access, Cognos BI 8, SPSS, and a number of other software tools available on campus.
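Conceptually, a pivot table is just grouping rows by two or more attributes and aggregating a measure in each cell. This is a minimal illustration of that "slice and dice" idea in plain Python rather than Excel or Cognos, using made-up applicant records:

```python
from collections import defaultdict

# Invented applicant rows; in practice these would come from the
# warehouse's dimension and fact tables.
applicants = [
    {"gender": "F", "region": "New England", "sat": 1400},
    {"gender": "M", "region": "Midwest",     "sat": 1250},
    {"gender": "F", "region": "Midwest",     "sat": 1320},
    {"gender": "M", "region": "New England", "sat": 1280},
]

# Pivot: average SAT with gender down the side and region across
# the top. Each (gender, region) pair is one cell of the pivot.
cells = defaultdict(list)
for a in applicants:
    cells[(a["gender"], a["region"])].append(a["sat"])
pivot = {key: sum(v) / len(v) for key, v in cells.items()}

print(pivot[("F", "Midwest")])  # the value of one slice
```

Swapping "region" for "year" or "age" is all it takes to re-slice the same data, which is exactly what OLAP tools let you do interactively.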
Not all questions are best answered using OLAP, however. Some questions, especially in the initial stages of analysis and exploration, are best answered using data mining techniques, which allow you to visualize how some parts of a data set are related to other parts. Data mining techniques would be used, for example, to make a first pass at determining what factors have the most negative influence on student retention, or what demographic or personal characteristics seem most predictive of a high GPA.
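A toy version of that "first pass" can be done by hand: compute the correlation between each candidate factor and the outcome of interest to see which factors deserve a closer look. The data, factor names, and retention flag below are all invented for illustration; real data mining tools automate and greatly extend this idea.

```python
import statistics

# Invented student records with a made-up retention flag (1 = retained).
students = [
    {"hs_gpa": 3.9, "distance_from_home": 50,  "retained": 1},
    {"hs_gpa": 2.8, "distance_from_home": 900, "retained": 0},
    {"hs_gpa": 3.5, "distance_from_home": 200, "retained": 1},
    {"hs_gpa": 2.5, "distance_from_home": 700, "retained": 0},
    {"hs_gpa": 3.2, "distance_from_home": 400, "retained": 1},
]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

outcome = [s["retained"] for s in students]
for factor in ("hs_gpa", "distance_from_home"):
    r = pearson([s[factor] for s in students], outcome)
    print(factor, round(r, 2))
```

In this made-up data, high school GPA correlates positively with retention and distance from home negatively; a real analysis would then examine the strongest factors with proper statistical tools.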
In the past, data mining was the exclusive province of highly trained statisticians, actuaries, and other analytically and quantitatively trained professionals. Although many data mining tasks are still best done by such professionals, quite a bit can now be done by non-experts using simple tools (data mining add-ins) available to all users of Microsoft Excel.
Beyond OLAP tables and data mining structures, Microsoft SSAS cubes are another valuable addition to the reporting and business intelligence strategy at Carleton. We have found that they further simplify reporting for those who require a simple and direct approach. These cubes pre-aggregate large quantities of data and allow the user to create quick and easy Excel pivot tables for reporting and analysis. Much work goes into designing the cubes so that this simple end-user tool distributes data with little risk of misinterpretation. Areas on campus that have used SSAS cubes include Student Financial Services, Budgeting, Financial Reporting, and Curriculum and Registration assessment.
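The core idea behind pre-aggregation can be shown in a few lines. This is only a conceptual sketch of what a cube does, with invented detail rows and field names: the detail is summed once, up front, so end-user queries become simple lookups rather than scans of the full data.

```python
from collections import defaultdict

# Invented detail rows of the kind a cube might be built over.
detail_rows = [
    {"year": 2011, "department": "Admissions", "amount": 120.0},
    {"year": 2011, "department": "Admissions", "amount": 80.0},
    {"year": 2011, "department": "Registrar",  "amount": 50.0},
    {"year": 2012, "department": "Admissions", "amount": 200.0},
]

# Pre-aggregate in one pass, keyed by (year, department), with a
# rollup cell ("ALL") across departments for each year.
cube = defaultdict(float)
for row in detail_rows:
    cube[(row["year"], row["department"])] += row["amount"]
    cube[(row["year"], "ALL")] += row["amount"]

# A "query" against the cube is now just a dictionary lookup.
print(cube[(2011, "Admissions")])
print(cube[(2011, "ALL")])
```

A real SSAS cube does this at scale, across many dimensions and rollup levels at once, which is why the resulting Excel pivot tables feel instantaneous.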
The point of the data warehouse is to facilitate these sorts of activities, particularly for non-experts. And its dimension and fact tables are all structured and optimized for this purpose.
The data in the data warehouse is also extensively cleaned, normalized, and massaged to fit our institutional standards, so that those using it don't have to worry about inconsistencies, missing chunks of data, and obscure codes. The data is loaded overnight, to free the systems up during the day for OLAP and data mining tasks. Data in the warehouse is therefore at most a day old, i.e., fresh enough for most OLAP and data mining tasks, but not suitable for operational reporting ("As of lunch time today, how many admitted students have paid their deposit?") or for data "dumps" for mailing labels, or for tracking day-to-day transactions.
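The kind of cleanup the nightly load performs can be sketched as a small normalization step. The codes and mapping below are invented for illustration; the point is that obscure source-system codes are translated into readable standard values and missing entries get an explicit placeholder, so report writers never see the raw inconsistencies.

```python
# Hypothetical mapping from raw source-system codes to standard values.
STATE_CODES = {"MN": "Minnesota", "WI": "Wisconsin", "MA": "Massachusetts"}

def clean_state(raw):
    """Normalize a raw state code: trim, uppercase, map, and give
    missing or unrecognized values an explicit placeholder."""
    if raw is None or not raw.strip():
        return "Unknown"
    return STATE_CODES.get(raw.strip().upper(), "Unknown")

print(clean_state(" mn "))  # normalized to the standard value
print(clean_state(None))    # missing data becomes an explicit placeholder
```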