May 16, 2016

Baby Bump redux

by Administrator

First published March 13, 2014

Today we analyze the enrollment projections presented by the Arlington Public Schools (APS) as one rationale for spending upwards of $100,000,000 to rebuild the public high school. We find several mistakes in the data and inconsistencies in the model of projected enrollments, and we perform some rudimentary statistical analysis showing that no conclusions about future enrollment increases can be drawn from the administration’s analysis. Specifically, we find that the entire 10% projected 5-year increase in enrollments is due to assumptions of drastically lowered attrition rates (-12% projected instead of the -24% historic rate). We detail three mistakes in the projected enrollment data, including no attrition between 8th and 9th grade (twice) and unrealistic growth in high school enrollments. We also observe a bias when comparing Power School enrollment totals to the official DESE enrollment statistics, with the Power School data consistently higher.

For those interested in consolidated data tables open this image for the APS projections and this image for the DESE official statistics.

Enrollment Projections

The Arlington Public Schools are showing projections of increased enrollment in the middle school (+21% in 5 years) and high school (+32% in 10 years) as one reason that Arlington should rebuild the aging high school sooner rather than later. The data and projection sheet can be found here and is reproduced below:


8 Year Enrollment History and Projected Enrollment 2014 to 2028 – by Grade Levels



This is a dense and difficult table to read, so we converted the pdf into an excel spreadsheet and added some color; see below. The yellow cells represent projected enrollments. The green and blue cells each represent one class, identified by its graduating year, followed through time. The green cells start at K (kindergarten) in the 2006-2007 school year, progress to 1st grade in 2007-2008 and are projected to graduate in 2018-2019: the class of 2019. The blue cells start at the 5th grade in 2006-2007 and represent the class of 2014, the current graduating class, also shown in bold. These ‘cohorts’ will become important later in our analysis. For the moment, ignore the numbers colored in red.


8 Year Enrollment History and Projected Enrollment 2014 to 2028 – by Grade Levels
Grade/Year Births preK K 1 2 3 4 5 6 7 8 9 10 11 12
2006-2007 545 84 442 391 386 394 385 357 356 339 347 302 309 301 323
2007-2008 537 79 409 439 399 384 381 382 337 354 317 316 271 299 292
2008-2009 496 82 456 405 439 387 376 374 369 344 354 296 308 266 300
2009-2010 558 64 457 451 411 423 387 366 365 373 343 320 295 323 272
2010-2011 545 60 450 442 435 399 427 367 349 350 365 306 325 296 311
2011-2012 537 47 434 455 421 426 390 412 355 335 348 308 304 342 299
2012-2013 496 57 453 472 446 420 429 395 379 337 337 322 313 309 354
2013-2014 558 60 477 478 483 464 434 429 357 393 328 299 320 321 314
2014-2015 517 60 442 496 473 484 469 429 400 352 388 292 300 329 325
2015-2016 563 60 481 459 490 474 489 463 400 394 348 346 293 308 333
2016-2017 545 60 466 500 454 491 479 483 431 394 390 310 347 301 312
2017-2018 597 60 510 484 495 455 496 473 450 425 390 347 311 356 305
2018-2019 525 60 449 530 479 496 460 490 441 444 420 347 348 319 361
2019-2020       466 524 480 501 454 457 435 439 374 348 358 323
2020-2021         461 526 485 494 423 450 430 391 375 385 352
2021-2022           462 531 479 461 417 445 430 392 386 390
2022-2023             467 524 446 454 413 397 431 402 391
2023-2024               461 489 440 449 413 398 442 408
2024-2025                 430 482 435 400 414 424 403
2025-2026                   424 476 387 401 411 419
2026-2027                     419 424 388 398 407
2027-2028                       373 425 436 394


We believe we have faithfully reproduced the projection “data” provided by the school administration. To cross-check the dataset, we computed totals, compared them to the source document and found exact agreement; see the table below.


Summary of Enrollments 2007 – 2028
Year B-K K-5 Tot 6-8 Tot 9-12 Tot Total Chg
2006-2007 -19% 2,355 1,042 1,235 4,716
2007-2008 -24% 2,394 1,008 1,178 4,659 (57)
2008-2009 -8% 2,437 1,067 1,170 4,756 97
2009-2010 -18% 2,495 1,081 1,210 4,850 94
2010-2011 -17% 2,520 1,064 1,238 4,882 32
2011-2012 -19% 2,538 1,038 1,253 4,876 (6)
2012-2013 -9% 2,615 1,053 1,298 5,023 147
2013-2014 -15% 2,765 1,078 1,254 5,157 134
2014-2015 -15% 2,793 1,140 1,246 5,238 81
2015-2016 -15% 2,856 1,142 1,280 5,338 100
2016-2017 -14% 2,873 1,215 1,270 5,418 80
2017-2018 -15% 2,913 1,265 1,319 5,557 139
2018-2019 -14% 2,904 1,305 1,375 5,643 86
2019-2020 1,331 1,403
2020-2021 1,303 1,503
2021-2022 1,323 1,598
2022-2023 1,313 1,621
2023-2024 1,378 1,661
2024-2025 1,347 1,641
2025-2026 1,618
2026-2027 1,617
2027-2028 1,628


The summary totals tell the administration’s whole story. Elementary school enrollment is up, which the APS projects will produce 21% more middle school students in the next five years (1,078 in 2014 to 1,305 in 2019); see the red-highlighted rows in the “6-8 Tot” column. This baby bump will then move into the high school, with a projected 10-year increase of 32% (1,254 in 2014 to 1,661 in 2024). In other words, the APS is projecting that middle school enrollment, which has held steady for the past 8 years, is about to jump 21% in five years.

Overall, the APS is projecting a 10% increase in total enrollments in the next 5 years (5,157 in 2014 to 5,643 in 2019); an accelerating growth rate (60% higher) compared to the 9% increase over the past 8 years. The graph below encapsulates all of the summary data showing the projected increases.
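
These percentage increases can be checked directly against the summary table (a quick sketch in Python; the enrollment totals are copied from the table above):

```python
def pct_change(start, end):
    """Percent change from start to end, to one decimal place."""
    return round(100.0 * (end - start) / start, 1)

print(pct_change(1078, 1305))  # middle school, 2014 -> 2019: 21.1
print(pct_change(1254, 1661))  # high school, 2014 -> 2024: 32.5
print(pct_change(5157, 5643))  # total, 2014 -> 2019: 9.4
```

The 9.4% total increase is the figure rounded to 10% in the text.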


APS Enrollment History and Projected Enrollment



The question becomes: do you believe that the enrollment increases of the last three years will continue at the same rate for the next five? Put another way, is the projected enrollment model predictive?

Analysis

First, let’s look at the birth numbers provided by the APS, found in the first table above, which record the number of children born five years before each kindergarten class enters. The first question to ask is whether the number of children entering the school system is statistically different from the number who entered over the past 8 years. This is a simple question that can be answered by comparing the averages – 534 over the past 8 years versus 549 over the next 5 years – and the spreads (standard deviations) – 25 historically versus 32 over the next 5 years. Intuition says that a difference of 15 in the means is hidden within a spread of 25-32. A Student’s t-test (0.35) confirms that there is no statistical difference between the births feeding the projected enrollments and those of the last eight years. The number of births over the past five years does not indicate any increase in future enrollments.
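
The averages, spreads and test statistic can be reproduced with Python’s standard library (a sketch; the birth counts are read off the first table above, and Welch’s form of the t statistic is used):

```python
from statistics import mean, stdev

# Births feeding the past 8 kindergarten classes (2006-07 .. 2013-14)
hist = [545, 537, 496, 558, 545, 537, 496, 558]
# Births feeding the next 5 projected classes (2014-15 .. 2018-19)
proj = [517, 563, 545, 597, 525]

print(round(mean(hist)), round(stdev(hist)))  # 534 25
print(round(mean(proj)), round(stdev(proj)))  # 549 32

# Welch's t statistic for the difference in means
t = (mean(proj) - mean(hist)) / (
    stdev(hist) ** 2 / len(hist) + stdev(proj) ** 2 / len(proj)
) ** 0.5
print(round(t, 2))  # 0.92 -- nowhere near statistical significance
```

With roughly 7 degrees of freedom, a |t| below 1 is entirely consistent with the post’s conclusion of no difference.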

Next, let’s look at the data provided by the APS in a slightly altered format. Instead of looking at fiscal years, let’s follow each graduating class as it progresses from kindergarten through high school. The table below contains exactly the same data as above, but ‘tilted’ to let us easily follow each class year.


APS Enrollment History and Projected Enrollment – by Graduation Year
Class/Grade K 1 2 3 4 5 6 7 8 9 10 11 12
2006-2007 323
2007-2008 301 292
2008-2009 309 299 300
2009-2010 302 271 266 272
2010-2011 347 316 308 323 311
2011-2012 339 317 296 295 296 299
2012-2013 356 354 354 320 325 342 354
2013-2014 357 337 344 343 306 304 309 314
2014-2015 385 382 369 373 365 308 313 321 325
2015-2016 394 381 374 365 350 348 322 320 329 333
2016-2017 386 384 376 366 349 335 337 299 300 308 312
2017-2018 391 399 387 387 367 355 337 328 292 293 301 305
2018-2019 442 439 439 423 427 412 379 393 388 346 347 356 361
2019-2020 409 405 411 399 390 395 357 352 348 310 311 319 323
2020-2021 456 451 435 426 429 429 400 394 390 347 348 358 352
2021-2022 457 442 421 420 434 429 400 394 390 347 348 385 390
2022-2023 450 455 446 464 469 463 431 425 420 374 375 386 391
2023-2024 434 472 483 484 489 483 450 444 439 391 392 402 408
2024-2025 453 478 473 474 479 473 441 435 430 430 431 442 403
2025-2026 477 496 490 491 496 490 457 450 445 397 398 424 419
2026-2027 442 459 454 455 460 454 423 417 413 413 414 411 407
2027-2028 481 500 495 496 501 494 461 454 449 400 401 398 394
2028-2029 466 484 479 480 485 479 446 440 435 387 388 436
2029-2030 510 530 524 526 531 524 489 482 476 424 425
2030-2031 449 466 461 462 467 461 430 424 419 373


As before, the yellow cells are projected, and the green and blue cells show the 2019 and 2014 graduating classes respectively. Viewing the data by graduation year allows for some simple retention/attrition calculations. Below is a table showing the retention rate (the percentage change in cohort size, students out over students in) for the relevant periods: kindergarten to 1st grade, 1st-5th grade (elementary school), 5th-6th grade (first dropoff), 6th-8th grade (middle school), 8th-9th grade (second dropoff), 9th-12th grade (high school), 1st-12th grade, and a compounded retention rate. The green-coded row and the differences are detailed later in the post and are shown here for reference.


Retention Rates – Projected
Year K-1 1-5 5-6 6-8 8-9 9-12 1-12 Compounded
2014-2015 6%
2015-2016 -7% 3%
2016-2017 -11% 4%
2017-2018 -8% -11% 4%
2018-2019 -1% -6% -8% 2% -11% 4% -18%
2019-2020 -1% -2% -10% -3% -11% 4% -20%
2020-2021 -1% -5% -7% -3% -11% 1% -22%
2021-2022 -3% -3% -7% -3% -11% 12% -12%
2022-2023 1% 2% -7% -3% -11% 5% -14%
2023-2024 9% 2% -7% -2% -11% 4% -14%
2024-2025 6% -1% -7% -2% 0% -6% -16%
2025-2026 4% -1% -7% -3% -11% 6% -16%
2026-2027 4% -1% -7% -2% 0% -1% -11%
2027-2028 4% -1% -7% -3% -11% -2% -21%
2028-2029 4% -1% -7% -2% -11%
2029-2030 4% -1% -7% -3% -11%
2030-2031 4% -1% -7% -3% -11%
Projected 4% -1% -7% -2% -9% 3% -16% -12%
Actual 1% -7% -4% -3% -13% 0% -23% -24%
Diff 3% 6% -3% 1% 3% 3% 7% 12%
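
Each entry in the table above is a simple cohort ratio. As an illustration (a sketch; enrollments are the class of 2019 – the green cohort – from the tilted table), the 2018-2019 row can be reproduced as follows:

```python
# Class of 2019 enrollment by grade: index 0 = K, index 12 = 12th grade
cohort = [442, 439, 439, 423, 427, 412, 379, 393, 388, 346, 347, 356, 361]

def retention(cohort, g_in, g_out):
    """Percent change in cohort size between two grades."""
    return round(100 * (cohort[g_out] - cohort[g_in]) / cohort[g_in])

for label, a, b in [("K-1", 0, 1), ("1-5", 1, 5), ("5-6", 5, 6),
                    ("6-8", 6, 8), ("8-9", 8, 9), ("9-12", 9, 12),
                    ("1-12", 1, 12)]:
    print(label, retention(cohort, a, b))
# K-1 -1, 1-5 -6, 5-6 -8, 6-8 2, 8-9 -11, 9-12 4, 1-12 -18
```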


First, let’s note some errors in the projected enrollments. As can be seen, for the classes of 2025 and 2027, the attrition rate between 8th and 9th grade is 0% instead of the -11% projected for every other year. As well, the 2021 and 2022 graduating classes start off with exactly the same number of 9th graders, but the class of 2022 then shows a net projected increase of 12% through high school instead of the 1% increase projected for the class of 2021. Since all of these are ‘yellow’ cells, the enrollment numbers are projected and presumably computed using the same model. These three data errors account for a 3% higher retention than a consistent projection model would produce.

In addition, there are two bigger errors earlier in the projection model that overestimate the retention rate by 9%. Before we show this, we need more historical data and a little more work.

Power School Vs DESE Enrollment Statistics

In the source document, the historic data for fiscal years 2006 – 2014 comes from Power School, not a public data source. To ascertain the predictive power of the APS’ enrollment projections, we need a longer historical record. Moreover, since the Power School data source is not available to the public, there is no way to know exactly what these enrollment numbers mean. To rectify this, we turned to the Massachusetts Department of Elementary and Secondary Education (DESE) and its databank. DESE collects and maintains enrollment statistics for all public school districts in Massachusetts back to the 1990s. Arlington’s official enrollment numbers for the years 1997 – 2014 are reproduced in the table below.


DESE Official Enrollment by Grade 1997 – 2014
FY PK K 1 2 3 4 5 6 7 8 9 10 11 12 Total
1997 360 402 381 373 357 319 335 306 276 256 275 243 202 4,116
1998 344 385 374 372 370 344 326 311 295 268 255 274 228 4,197
1999 376 351 373 372 366 364 341 326 299 259 260 253 265 4,222
2000 343 371 311 349 364 354 341 330 317 299 269 263 254 4,178
2001 0 388 354 378 322 340 354 345 345 320 284 280 261 244 4,215
2002 7 385 407 344 372 312 329 347 324 332 289 291 270 256 4,265
2003 97 434 392 394 366 357 310 335 346 331 291 275 283 270 4,481
2004 87 393 411 379 374 354 353 294 338 344 275 287 263 273 4,425
2005 78 406 405 406 374 364 362 346 292 334 287 280 300 252 4,486
2006 74 381 411 399 399 368 353 338 342 292 293 285 294 293 4,522
2007 77 446 382 379 388 379 348 346 329 333 250 304 287 300 4,548
2008 62 411 432 388 376 375 378 328 347 309 296 255 293 282 4,532
2009 67 451 401 433 377 372 373 365 336 347 285 293 257 297 4,654
2010 59 434 446 403 418 372 359 359 367 334 312 280 304 266 4,713
2011 57 448 441 433 395 427 360 344 347 360 297 318 286 295 4,808
2012 48 450 455 427 429 390 415 349 331 346 300 297 331 290 4,858
2013 54 454 460 446 418 424 386 374 328 326 313 298 296 326 4,903
2014 55 471 472 474 458 428 423 352 385 317 280 313 303 289 5,020


First, note that, as in many government data sets, there are errors. From 1997 – 2000 the detailed grade enrollments do not equal the reported totals, so we highlight these in red. As well, Arlington’s pre-K program started in 2003, so total enrollment statistics before 2003 are not comparable to later years and are inappropriate for year-on-year changes. Most disturbing is the comparison between the DESE total district-wide enrollments and the Power School totals used in the projected enrollment analysis. In the table below, we compare the two and see a persistent bias: the Power School numbers are consistently higher than the official DESE statistics, by 137 students in the current 2014 fiscal year. We believe the DESE enrollment numbers to be the ‘truth’ and await an explanation from the APS administration for the difference.


Comparison of DESE and Power School Total Enrollment Statistics 2007 – 2014
FY DESE PowerSchool DIFF
2007 4,548 4,716 (168)
2008 4,532 4,659 (127)
2009 4,654 4,756 (102)
2010 4,713 4,850 (137)
2011 4,808 4,882 (74)
2012 4,858 4,876 (18)
2013 4,903 5,023 (120)
2014 5,020 5,157 (137)
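
A short sketch quantifies the bias (totals copied from the DESE table and the APS projection sheet above):

```python
dese  = [4548, 4532, 4654, 4713, 4808, 4858, 4903, 5020]  # FY2007-FY2014
power = [4716, 4659, 4756, 4850, 4882, 4876, 5023, 5157]  # Power School totals

diffs = [d - p for d, p in zip(dese, power)]
print(diffs)  # negative in every year: Power School is always higher
print(round(sum(diffs) / len(diffs)))  # average bias: -110 students
```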


Historic Retention/Attrition Rates

Now that we have a deeper history of actual enrollments, let’s compute historic retention rates to compare against the rates the APS has modeled. Since we are looking at rates of change, the difference in total enrollments should not be an issue. In the table below, we repeat the relevant enrollment numbers and show the retention rate for the same periods as above (K-1, 1-5, 5-6, 6-8, 8-9, 9-12). We average these retention rates to obtain a robust set of expected values from actual enrollments. We also computed compounded and simple 1st – 12th grade retention (attrition) rates of -24%, being careful not to overlap any of the statistics when compounding.


Retention Rates for Selected Periods, 2001- 2014
Year K 1 K-1 5 1-5 6 5-6 8 6-8 9 8-9 12 9-12
2001 388 354 354 345 320 284 244
2002 385 407 5% 329 347 -2% 332 289 -10% 256
2003 434 392 2% 310 335 2% 331 -4% 291 -12% 270
2004 393 411 -5% 353 294 -5% 344 -1% 275 -17% 273 -4%
2005 406 405 3% 362 2% 346 -2% 334 0% 287 -17% 252 -13%
2006 381 411 1% 353 -13% 338 -7% 292 -1% 293 -12% 293 1%
2007 446 382 0% 348 -11% 346 -2% 333 -4% 250 -14% 300 9%
2008 411 432 -3% 378 -8% 328 -6% 309 -9% 296 -11% 282 -2%
2009 451 401 -2% 373 -8% 365 -3% 347 0% 285 -8% 297 1%
2010 434 446 -1% 359 -13% 359 -4% 334 2% 312 -10% 266 6%
2011 448 441 2% 360 -6% 344 -4% 360 -1% 297 -11% 295 0%
2012 450 455 2% 415 -4% 349 -3% 346 -4% 300 -17% 290 2%
2013 454 460 2% 386 -4% 374 -10% 326 -5% 313 -10% 326 4%
2014 471 472 4% 423 -5% 352 -9% 317 -9% 280 -14% 289 -3%
Avg 1% -7% -4% -3% -13% 0%


Now let’s compare the actual, observed retention rates with those predicted in the APS projected enrollments analysis.


Comparison of Retention Rates. Actual Vs Projected
Year K-1 1-5 5-6 6-8 8-9 9-12 1-12 Compounded
Projected 4% -1% -7% -2% -9% 3% -16% -12%
Actual 1% -7% -4% -3% -13% 0% -23% -24%
Diff 3% 6% -3% 1% 3% 3% 7% 12%


As one can observe, the APS is projecting a retention rate of -12%, far milder than the -24% attrition observed historically. The difference lies in three areas. First, the K-1 growth rate has been observed to be about 1% but is projected at 4%; in other words, the APS expects 4% more 1st graders to enter the system than attended full-day, free kindergarten. Second, in the elementary school years, the actual retention rate is -7% while the APS predicts -1%. Third, we showed earlier that data errors in the modeling overstate retention by 3% in the projected high school enrollments. These three differences account for the entire gap between the historic -24% retention rate and the projected -12% rate.
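
The compounded rates in the comparison table are simply the per-segment rates chained together; compounding the rounded segment rates reproduces the -12% and -24% figures (a sketch in Python):

```python
def compounded(segment_rates):
    """Chain per-segment percent changes into one overall percent change."""
    total = 1.0
    for r in segment_rates:
        total *= 1 + r / 100
    return round(100 * (total - 1))

#            K-1  1-5  5-6  6-8  8-9  9-12
projected = [  4,  -1,  -7,  -2,  -9,   3]
actual    = [  1,  -7,  -4,  -3, -13,   0]

print(compounded(projected))  # -12
print(compounded(actual))     # -24
```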

Conclusion

The so-called baby bump used, in part, to justify an expensive rebuild of the Arlington High School is not supported by the data behind the APS’ projected enrollment analysis. The entire 10% projected increase in future enrollments is driven by assumptions of far higher retention rates (or equivalently, far lower attrition rates) than the historic record suggests, especially in the earlier grades (K-5). There are many reasons to rebuild the AHS, but the APS’ projected enrollment analysis is not one of them.

May 31, 2015

Data Science

by Stephen Harrington

So you want to be a data scientist. I’ve compiled some resources for the budding, and perhaps the experienced, data scientist. There are online academic courses hosted as MOOCs on edX, Coursera and Udacity; see the list below. Some of the necessary skills are covered in the Skills section, and a compilation of classic problems in data science follows in the final section.

Online Academic Courses

Below is a list of online resources for those interested in data science. Much of the course material including lectures, assignments, exams, solutions, slides, readings, notes and discussions can be found by clicking on the relevant link.


University Platform Name
Berkeley edX Big Data with Apache Spark
Berkeley edX Scalable Machine Learning
Berkeley edX Introduction to Statistics
Berkeley edX Artificial Intelligence
Berkeley Applied Machine Learning
Berkeley Data Science Curriculum
Berkeley iSchool Analyzing Big Data with Twitter
Brown Coursera Exploring Neural Data
CalTech edX Learning From Data
CalTech-JPL Coursera Summer School on Big Data Analytics
Carnegie Mellon Online Machine Learning
Carnegie Mellon Online Statistical Machine Learning
Carnegie Mellon Online Probabilistic Graphical Models
Carnegie Mellon Online Multi-media Databases and Datamining
Carnegie Mellon Online Machine Learning with Large Datasets
Carnegie Mellon Online MS ML Courses Fall 2015
Carnegie Mellon Online Natural Language Processing
Columbia Coursera Big Data in Education
Duke Coursera Data Analysis and Statistical Inference
Eindhoven Coursera Process Mining: Data science in Action
Georgia Inst Tech Coursera Healthcare Informatics & Data Analytics
IIT – Delhi Coursera Web Intelligence and Big Data
Johns Hopkins Coursera Data Sciences
Johns Hopkins Coursera The Data Scientist’s Toolbox
Johns Hopkins Coursera R Programming
Johns Hopkins Coursera Getting and Cleaning Data
Johns Hopkins Coursera Exploratory Data Analysis
Johns Hopkins Coursera Reproducible Research
Johns Hopkins Coursera Statistical Inference
Johns Hopkins Coursera Regression Models
Johns Hopkins Coursera Practical Machine Learning
Johns Hopkins Coursera Developing Data Products
Johns Hopkins Coursera Data Science Capstone
Johns Hopkins Coursera Genomic Data Science
MIT edX The Analytics Edge
MIT OCW Machine Learning
MIT OCW Statistical Learning Theory
MIT OCW Prediction Machine Learning and Statistics
MIT OCW Statistical Learning Theory and Applications
MIT OCW Data Mining
MIT OCW Communicating with Data
MIT IAP OCW How to Process, Analyze and Visualize Data
MIT IAP OCW Statistics and Visualization for Data Analysis and Inference
Princeton Coursera Statistics One
Stanford edX Databases
Stanford Coursera Machine Learning – Ng
Stanford Coursera Algorithms
Stanford Coursera Mining Massive Datasets
Stanford Coursera Natural Language Processing
Stanford edX Statistical Learning
Stanford Online Catalog
UIUC Coursera Cloud Computing
UIUC Coursera Data Mining
Univ of Toronto Coursera Neural Networks for Machine Learning
Univ of Toronto Coursera Statistics: Making Sense of Data
U. Washington Coursera Introduction to Data Science
U. Washington Coursera Machine Learning
Udacity Nanodegree Data Scientist
Datacamp Cheap R & Data Science
SlideRule Cheap Data Analysis



I completed some or all of the work for the universities shown in bold. For example, I recently completed MITx’s The Analytics Edge on the edX platform, while Berkeley’s Apache Spark class has just started.

Skills

The data scientist needs skills in statistics, programming, databases, visualization/graphics and computer science. Below is a non-comprehensive list of some of these skills, with which I have some familiarity. Of course, as time goes on, the list of necessary tools and skills will only grow, since DS is by its very nature an interdisciplinary field.



DS Venn Diagram

Programming Languages

Python

Python is the most popular scripting language, and its many open source modules are proof of widespread and growing adoption as the language of choice in academia and industry. You may meet a perl guy someday and wonder how he missed the python boat. Below is an appended version of the list in this github repository, which gives a much broader view of useful python libraries for data science:

  1. Fundamental Libraries for Scientific Computing: IPython Notebook, NumPy, pandas, SciPy, pySpark
  2. Math and Statistics: SymPy, Statsmodels
  3. Machine Learning: Scikit-learn, Shogun, PyBrain, PyLearn2, PyMC
  4. Plotting and Visualization: Bokeh, d3py, ggplot, matplotlib, plotly, prettyplotlib, seaborn
  5. Data formatting and storage: csvkit, mrjob, PyTables, sqlite3, lxml, BeautifulSoup, wget, curl

R

The R programming language is used by many academics and data science practitioners for exploratory data analysis. R has enormous overhead and can be very slow, especially for memory-intensive use, so it is typically not used as a production environment. However, there are hundreds of useful libraries, built on one of the oldest statistical platforms, that implement machine learning algorithms, data visualization graphics and data manipulation. As such, it is one of the best-known languages for data science. A small sample of R libraries that a DS might find useful is below. Before you start using R, download R-Studio, an IDE of the best kind: free, fast, full of features and fantastic.

  1. Machine Learning and Natural Language Processing: arm, caret, caretEnsemble, caTools, chron, cvAUC, cvTools, doParallel, dynamicTreeCut, e1071, flexclust, gbm, glmnet, kernlab, lpSolveAPI, mice, neuralnet, randomForest, rattle, ROCR, rpart, RWeka, SnowballC, tm
  2. Statistics and Time Series: digest, aod, lsr, methods, multilevel, psych, zoo, quantmod, Quandl, QuantPsyc, sm, stats, UsingR
  3. Data formatting, manipulation and storage: RCurl, RODBC, RMySQL, RPostgreSQL, RSQLite, sqldf, xlsx, lubridate, dplyr, plyr, base64enc, data.table, downloader, jsonlite, manipulate, methods, multilevel, reshape2, XLConnect, foreign, XML
  4. Plotting and Visualization: ggmap, ggplot2, googleVis, gclus, jpeg, lattice, maps, pROC, RColorBrewer, rgl, rpart.plot, shiny, shinyapps, vcd
Matlab

Industry makes extensive use of matlab; it is not an inexpensive piece of software, and add-ons are additional expenses. However, many engineering students were taught matlab, and everything that you can do for free in python and R, you can do in matlab for a price. The open source version of matlab is Octave, which is not as widely supported or used as the other programming tools in DS. That said, matlab is a wonderful vectorized programming language that handles much of linear algebra in a natural and uncluttered way. Andrew Ng’s Machine Learning course made extensive use of matlab (or Octave) and is probably the best advertisement of this language for the data scientist.

Julia

Real programmers use C/C++ and real (old) scientists use Fortran, but this is a post about data scientists, and good luck getting the useful modules, libraries, packages, data visualization tools, example algorithms, sample code and the large communities of all three camps to convert everything to C. And for what? A thousand-fold increase in speed.

Oh yeah: C provides an almost thousand-fold increase in execution speed over R and matlab, and an order of magnitude over even the best-written python.

That’s a pretty compelling reason to use C. R, python and matlab all have C interfaces for those who need the speed, which includes anyone who writes industrial-strength code for actual use. Don’t get me wrong, I learned C from Kernighan and Ritchie, but I have no interest in shoe-horning C into three other quirky languages.

Julia to the rescue. Julia is a rather new language that caters to the pythonistas, the matlabbers and the R-cran bloggers with a simple language structure but the speed of C; check out the benchmarks pictured below. I’m still in the process of learning Julia, but so far, so good.

Julia Benchmarks

SAS/SPSS

Everything you can do in R, matlab or (almost) python, you can do with SAS or SPSS. I learned SAS many years ago on a company-paid educational junket to Chicago, where I also learned that some bars in Chicago stay open until 4am or even 6am, which made SAS a blur a few hours later. Enough said. If anyone wants to add SPSS information, please let me know.

Tools

Industry and industrious fellows have created many tools for big data manipulation, analysis and deployment. This section will be expanded as time permits.

o map reduce
o Hadoop components
o HDFS
o Cloudera/HortonWorks
o MIR programming
o Sqoop: loading data in HDFS
o Flume, scribe – unstructured data
o SQL w/PIG
o DWH with Hive
o Scribe, Chukwa for Weblog
o Mahout
o Zookeeper, Avro
o Storm: Hadoop realtime
o Rhadoop, RHIPE
o rmr
o Cassandra
o MongoDB, Neo4j
o Weka, Knime, RapidMiner
o Spark
o Nutch, Talend, Scraperwiki
o Webscraper, Flume, Sqoop
o NLTK
o IBM Languageware
o IBM ManyEyes
o Tableau


Traditional enterprise stack:
• ETL
» Informatica, IBM DataStage, Ab Initio, Talend
• Data Warehouse
» Teradata, Oracle, IBM DB2, Microsoft SQL Server
• Business Intelligence and Analytics
» SAP Business Objects, IBM Cognos, Microstrategy, SAS, SPSS, R

Open source big data stack:
• ETL
» Apache Flume, Apache Sqoop, Apache Pig, Apache Oozie, Apache Crunch
• Data Warehouse
» Apache Hadoop/Apache Hive, Apache Spark/Spark SQL
• Business Intelligence and Analytics
» Custom dashboards: Oracle Argus, Razorflow

Classic Examples and Algorithms

This section needs work….

o testing vs. training, validation, regularization, cross validation
o regression examples: linear, logistic, decision and classification trees
o perceptron
o polls and election predictions
o handwriting recognition – zipcodes
o image processing
o text analytics: google n-grams, bag of words – corpus, sentiment via tweets, natural language processing
o recommendation systems
o clustering
o learning algorithms: supervised, unsupervised, support vector machines
o kernel methods
o neural networks
o optimizations: linear, integer, convex

From Berkeley CS100.1x:
o The difference between descriptive and inferential statistics
o Supervised learning: kNN (k Nearest Neighbors), Naive Bayes, Logistic Regression, Support Vector Machines, Random Forests
o Unsupervised learning: Clustering, Factor Analysis, Latent Dirichlet Allocation
o US National Institute of Standards and Technology primer on Exploratory Data Analysis: the five-number summary (sample minimum, lower quartile, median, upper quartile, sample maximum), descriptive statistics, percentiles, box plots
o SparkR
o Introduction to Probability and Statistics
o Big Data XSeries
o Spark’s mllib library

Here is a good description of the difference between descriptive and inferential statistics.

May 15, 2015

Article 32 – Public Art Consultant Implies Democrat

by Stephen Harrington

This post is substantially copied from here, but I was asked to make a stand-alone post. Did you know that the way Town Meeting members voted on local issues identifies their political party designation? Or more succinctly, voting ‘Yes’ for a public art consultant implies you are a Democrat.

e-Vote Analysis

During the 2015 annual Town Meeting, there were 42 electronic votes taken. After discarding the five poll questions at the start of each of the five sessions, these e-votes provide the best record of member participation. The electronic votes were taken on 23 of the 46 warrant articles, with about 15 amendments and substitute motions, most notably on the CPA vote (article 11) and the outdoor sign bylaw (article 7). The other half of the warrant articles were decided on voice votes, almost all in the affirmative on the recommended vote. There were 5 e-votes and an equal number of voice votes to terminate debate. There were 4 substitute motions on citizen-initiated articles that had been voted ‘No Action’ by the Selectmen/FinCom; two of these I submitted, and the other two were submitted on behalf of Chris Loreti.

Of the 250 seats, 16 members failed to cast a single e-vote, and one member cast only one vote – to terminate debate! The pattern of e-voting does not track the attendance records, with some interesting discrepancies. Another 22 members e-voted less than half the time. Between vacancies, no-shows and less-than-half-timers, about 16% of the members do not participate. There were 3 vacant seats, with one filled on night 4.

Democrats outnumber Unenrolled 2:1 at TM

Cross-referencing the Town Meeting member list with the list of registered voters allowed me to find each member’s stated party affiliation. The party affiliations of Town Meeting members are somewhat skewed compared to the Town as a whole. The overall town party affiliations are listed in the table below along with the corresponding percentages for TMMs. Note there are other party affiliations besides Democrat, Republican and Unenrolled, and two TMMs are none of the above, but these affiliations in aggregate account for less than 1% of all town residents.


Party Residents Residents % Town Meeting %
Democrat 13,629 46% 59%
Republican 1,935 7% 6%
Unenrolled 13,888 47% 33%


Residents are nearly equally split between Democrats and Unenrolled; however, Town Meeting members are more likely to be Democrats by an almost 2:1 margin over those registered as Unenrolled. This leads to some interesting results.
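
The “almost 2:1” figure follows directly from the table (a trivial sketch using the percentages above):

```python
residents    = {"Democrat": 46, "Republican": 7, "Unenrolled": 47}
town_meeting = {"Democrat": 59, "Republican": 6, "Unenrolled": 33}

# Democrat-to-Unenrolled ratio, town-wide and at Town Meeting
print(round(residents["Democrat"] / residents["Unenrolled"], 2))        # 0.98
print(round(town_meeting["Democrat"] / town_meeting["Unenrolled"], 2))  # 1.79
```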

Putting the CART Before the Horse

I found 25 key votes among the recorded e-votes, discarding the motions to terminate debate, unanimous votes on housekeeping articles, etc. I winnowed the member list down to 202 names, removing anyone who had not voted in at least 15 of the 25 votes, and identified the party affiliation of each Town Meeting member. For the analysis below, only Democrat (‘D’) versus not-Democrat (‘U’) is considered.

Next, I modeled the data using the R programming language: a logistic regression model, a Classification and Regression Tree (CART) analysis and a more sophisticated Random Forest algorithm to identify patterns. The Random Forest model had the most predictive ability, but I chose to highlight the simple CART model for some useful insights. The CART model had 76% accuracy in identifying party affiliation, compared to a baseline accuracy (call everyone a Democrat) of 59%. Below is a copy of the decision tree that was produced.

CART decision tree e-votes 2015 ATM

These trees are easy to read once you know what to look for. Starting at the top of the tree, the first ‘node’ is labeled ‘A32m’, and the left branch of the tree is followed if a TMM voted ‘Yes’, abstained or was missing from the vote for Article 32, main motion, Public Art Consultant. Traveling down the decision tree, each node represents a vote on a specific article. The left branch is taken when the displayed voting criterion is met, which is not always a ‘Yes’ vote, as noted in the A33m node at the bottom of the left branch. The round bubbles are the predictions, with ‘D’ being a prediction for a Democrat and ‘U’ for not a Democrat.

First, note that out of our 25 key votes, CART identified just 6 votes that mattered for identifying whether a TMM was a Democrat or unenrolled. These key articles are on the left branch:

  1. A32m – $12,000 for Public Art Consultant
  2. A46s1 – Rehrig amendment of Mugar open space to Master Plan
  3. A11a1 – Rehrig amendment to CPA committee membership
  4. A33m – Human Rights Commission Executive Director



What the decision tree illustrates is that the most predictive vote of being a Democrat out of the 25 key votes is whether you voted ‘Yes’ to the $12,000 appropriation for a public art consultant. If you voted ‘Yes’ to a public art consultant and ‘Yes’ to Rehrig’s amendment adding the Mugar open space resolution to the Master Plan, then you are a Democrat. This identifies half of the 202 voters as Democrats. Now I stated earlier that this particular CART model had a 76% accuracy in the training/testing set, so there are some false positives. Perusing the list, it would appear that some registered Republicans are actually closeted Democrats or at best RINOs.

If you voted ‘Yes’ to a public art consultant, ‘No’ to Rehrig’s Mugar amendment and ‘Yes’ to Rehrig’s amendment adding 3 Selectmen selected members to the CPA committee, then you are a Democrat. Finally, if you voted ‘Yes’ to a public art consultant, ‘No’ to both of Rehrig’s amendments and ‘No’ to an Executive Director for the Human Rights Commission, then you are a Democrat.

Not a Democrat

Below are the three nodes that comprise the right hand branch of the CART decision tree.

  1. A32m – $12,000 for Public Art Consultant
  2. A22s1 – $1M decrease to School Budget
  3. A7a2 – Bayer amendment for Public Signage



If you voted ‘No’ for a public art consultant and ‘Yes’ to cutting $1M out of the school budget, then guess what, you are not a Democrat! In fact, you are in the lonely ranks in Town Meeting of the unenrolled or similar to the average Arlington resident. Finally, I am not sure how to interpret the final branch (‘No’ to public art, ‘Yes’ to Bayer’s amendment to signage = Democrat), but at this point the decision tree is predicting a handful (<5) of Town Meeting members and carries less weight than previous predictions.

Conclusion

Overall, the CART analysis does a great job of identifying how Town Meeting members vote, how TM votes can predict your party affiliation, how a small subset of articles can determine the voting blocs in TM, how outnumbered the unenrolled are at TM and how little TM represents Arlington’s voters while overrepresenting the Democratic party. I found the CART decision tree nodes intuitive; a Democrat would not vote to take money out of a school budget and very few non-Democrats would vote to fund a public art consultant.

This analysis is certainly not comprehensive and many liberties were taken in preparing the data so please don’t take umbrage at being misidentified. Anyone interested can obtain the full data set by asking nicely and proving their ability to perform a statistical analysis using CART.

Mar 16 15

How Arlington’s Property Tax Assessments Were Determined for FY2015

by Stephen Harrington

Today, I want to share with you some observations and some questions about how FY2015 property tax bills were determined in Arlington. A video presentation of part of this blog post can be found here or watched below.





Conclusions

Let us start with the overall conclusions. Changes in the 2015 property tax bills in Arlington were driven by four factors accounting for 95% of the $3.6M increase in property taxes collected.

1. Owners of commercial properties and apartment buildings saw an overall decrease of about 1.7% in their tax bills, driven by no change in their 2015 property assessments combined with a 1.7% decrease in Arlington’s tax rate.

2. Condominium owners saw an across the board increase of about 5% in the value of their buildings accounting for 27% of the total tax increase for 2015.

3. 65% of the increase in property taxes was achieved by arbitrary changes in land valuations, based on precinct boundaries, with east Arlington seeing a 22% increase in their land values, 9x higher than land values in and around Jason Heights. This land value increase is not supported by recorded sales.

4. Sales of commercial properties and some large apartment complexes were not considered in the assessment process resulting in more than $30M in decreased tax assessments; most notably at the Mill Street Apartment complex.

5. Arlington’s changes in assessments appear to disproportionately affect residents in east Arlington and do not compare favorably with some surrounding towns.

Commercial, Apartment Building, Condos and Land Values

All of the snapshots and discussion below can be found in the video here or by using this interactive GIS map that lets you seamlessly toggle between commercial properties, condos and residential land values.

Commercial Properties – 1.7% Decrease in Tax Bills

Commercial Properties - Outlined in Red. White Fill - No Change in Assessment

First, let us look at Arlington’s commercial properties. Whenever I use the term ‘commercial’, I’ll also be including the much smaller industrial properties. Commercial properties make up about 4-5% of Arlington’s total tax base, pretty steady over the past decade and down from about 8% 20 years ago. I’ve outlined the commercial properties in Arlington in bright red, while apartment buildings are outlined in green. A few things pop out. First, notice that commercial properties follow the Mass Ave corridor west (left) to east (right), forking in Arlington center along both Mass Ave and Broadway.

Commercial Properties - Outlined in Red. White Fill - No Change in Assessment

Let’s zoom in a bit around Arlington center. I’ve filled in the commercial properties with three different colors. White properties had no change in their assessments. Properties filled in with a reddish/pink color saw an increase in their assessments, while parcels colored light blue saw decreases in their assessed values.

Note the two large parcels filled in with a reddish color on the top and left sides of the map. At the left side of Arlington, near Arlmont, is a small portion of the Belmont Country Club, while on the top, or North side of Arlington are the 45 acres or so of the Winchester Country Club that is in Arlington.

Now, for our first observation. All but a handful of commercial properties, filled in with white, saw their assessed values stay the same for 2015. In fact, out of the 400 or so commercial properties in Arlington, 12 parcels saw an increase in their assessed values including the two golf clubs (4 parcels), a spit of land on the Mystic Lakes owned by the Medford boat club, the newest Housing Corp purchase by Downing Square and half a dozen, random properties in and around Arlington center.

Almost every commercial property in Arlington saw no change in its assessment. Combined with a 1.7% decrease in the tax rate, this means commercial property owners saw a decrease in their overall tax bills, reducing the amount of property tax collected by $70,000.
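The arithmetic behind that observation is simple: an unchanged assessment multiplied by a rate cut of 1.7% yields a bill exactly 1.7% lower. A minimal sketch, using a hypothetical assessment and tax rate (placeholder round numbers, not actual Arlington figures; only the 1.7% rate cut comes from the post):

```python
# Flat assessment + rate cut = lower bill. Assessment and rate are hypothetical.
assessment = 1_000_000               # assessed value, unchanged year over year
rate_2014 = 13.00                    # hypothetical rate, $ per $1,000 of value
rate_2015 = rate_2014 * (1 - 0.017)  # the 1.7% rate decrease from the post

bill_2014 = assessment / 1000 * rate_2014
bill_2015 = assessment / 1000 * rate_2015

change = (bill_2015 - bill_2014) / bill_2014
print(f"{change:+.1%}")  # -1.7%: the bill falls by exactly the rate cut
```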

This brings up some questions. Why did commercial properties see no increase in their market values? Why did the town decrease their tax bills?

Commercial Properties - 659-671 Mass Ave - 11% Decrease in Assessment

Two commercial properties saw a decrease in their assessed values this year. 659-671 Mass Ave is a handsome commercial block in Arlington center, across from the Robbins Library, owned by Charles Blumsack and home to Domino’s Pizza, Thai Moon restaurant and Involution Studios – designer of the Town’s budget visualization. The assessment on this commercial property went down almost 11%.

Commercial Properties - 22 Sunnyside Ave - 22% Decrease in Assessment

The second property that saw a decrease was way down in east Arlington, along Sunnyside Avenue. The property is owned by Harry Allen and occupied by the Arlmont Fuel Company, which saw a 22% decrease in its property assessment.

This brings up some other questions. Why did these properties see a decrease in their assessments? What was the process by which these property owners got their tax bills decreased? Did these property owners go through the regular abatement process?

Apartment Buildings – 1.7% Decrease in Tax Bills

Apartment Buildings - No Change in Assessment

Now let’s look at the apartment buildings in Arlington. I’ve outlined in green all 8+ unit apartment buildings in Arlington and filled in the parcels as before. White means no change in the assessed values, pink an increase and light blue a decrease. Let’s zoom in and take a look. Notice that all but two apartment buildings in Arlington saw no increase in their 2015 assessed values. The two exceptions are a 33% increase in the Arlington 360 complex built on the Symmes Hospital site and a 2.8% increase at the Mill St. apartments on the former Brigham’s site.

Apartment Buildings - No Change in Assessment

This brings up the second observation. 71 of the 73 large apartment buildings in Arlington saw no increase in their assessed values for 2015. Combined with the 1.7% decrease in the tax rate, this means that apartment building owners saw a decrease in their tax bills, reducing the amount of property tax collected by Arlington by about $75,000.

This raises several questions. How did Arlington’s assessors decide that there was no increase in the market values of apartment buildings in Arlington? Are apartment building assessments determined by the income method? Were rents in Arlington steady throughout the year?

Recap – Decreases in Tax Bills

Commercial Properties - Outlined in Red. White Fill - No Change in Assessment

To recap, this view shows all commercial properties – outlined in red – and apartment buildings – outlined in green – and their change in assessment, which is almost exclusively no change, accounting for a combined decrease of about $150,000 or 4% of the total increase in taxes for Arlington in 2015. We will gray out these properties for the remainder of this post.

Condominiums

In this view, we look at all condominiums in Arlington. The condo properties are outlined in blue. One important thing to consider is that condos do not have a separate land assessment. This will become important in the next segment. As you can see, most condos can be found in east Arlington and south (towards the bottom of the map) of Mass Ave.

The property parcels are filled with pink if the change in assessment was between 4% and 5.2%; white if the change in assessment was less than 4% and light blue if the change was greater than 5.2%.

Condominiums - Assessments Increased 4.5% - 5% regardless of size, style or any other characteristic

Zooming into east Arlington shows that the great majority of condos saw assessment changes between 4.5% and 5% regardless of their building characteristics. There are a couple of interesting outliers. One was on Hamilton Road, where the end units saw average increases of 12.3%, same as 993 Mass Ave. Another interesting change was at Colonial Village in Arlington Heights, which saw a 24% increase in its assessment across the board.

Why were most condos increased at the same rate, regardless of location or unit sizes? Why did Colonial Village see a large increase in their building values relative to all other condo complexes in Arlington?

The more than 3,300 condo units in Arlington represent about 22% of all properties. The taxes collected on condo properties increased by about $1M overall. Since all taxes collected increased $3.55M and commercial/apartments got a $150K tax decrease, condos accounted for $1M/$3.7M, or about 27% of the overall increase in taxes collected – about five percentage points more than their 22% share of properties would suggest.
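That 27% figure can be reproduced from the numbers quoted above; this short sketch simply redoes the arithmetic:

```python
# Condo share of the FY2015 tax increase, from the figures quoted in the post.
total_increase = 3.55            # $M, total increase in taxes collected
commercial_apartment_cut = 0.15  # $M, combined decrease for commercial/apartments
condo_increase = 1.0             # $M, increase attributable to condos

# Residential properties had to cover the total increase plus the cut.
residential_increase = total_increase + commercial_apartment_cut
condo_share = condo_increase / residential_increase
print(f"{condo_share:.0%}")  # 27%
```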

Residential Land Values Increase 22%

Now for the lion’s share of the increase in Arlington’s property taxes collected in 2015. Let’s switch gears a little bit and look at residential properties.

Residential Land Values Increased 2% - 22% Along Precinct Boundaries

First, we look at changes in land values. We grayed out all of the commercial properties and apartment buildings that saw zero increase in their assessments. We also grayed out exempt properties, which includes town owned property, schools, churches and other entities that don’t pay property taxes. We broke the land assessment changes into five groups.

Properties colored in blue saw assessment changes of land values less than 4%, pink 4-8%, green 8-15% and yellow 15-30%. Properties colored in white saw no change, teal a decrease and brown a greater than 30% increase. This color scheme shows all condominium properties colored in white since condos do not have a separate land assessment.

We used a range, but typical values were 8% for green, 5.4% for pink, 2.4% and 3.5% for blue and a whopping 21.6% increase in land assessments for properties colored yellow.

Residential Land Values Increased 2% - 22% Along Precinct Boundaries

What we see right away is that changes in land assessments are clustered; east Arlington saw a 21.6% change in its land values. Residents of Jason Heights saw a 2.4% change. The Morningside area saw an 8% change and those in between saw a 5.4% change. Adding precinct boundaries allows for a stunning observation. Land value changes conform to precinct boundaries.

The change in land assessments was based on political boundaries.

My only question here is: why? I understand that the Mass Department of Revenue can certify neighborhoods during a triennial reassessment. Were precinct boundaries certified as neighborhoods by the state?

Why does the Board of Assessors believe that land in east Arlington increased in value by nine times as much as land in Jason Heights? Why were four distinct values used in setting land values? What sales data supports these changes?

One note: if you use the interactive map found here, zooming in to the fullest extent changes the label to show the house number and a large dollar value, in the millions, which is the price per acre for that property (land assessment / lot size). This allows for easy comparison of different-sized lots.
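The per-acre figure is just the land assessment divided by the lot size; a quick illustration with hypothetical parcel numbers (invented for this example, not taken from the map):

```python
# Price-per-acre label on the interactive map: land assessment / lot size.
# Both numbers below are hypothetical, for illustration only.
land_assessment = 300_000  # $, assessed land value of a parcel
lot_size = 0.15            # acres

price_per_acre = land_assessment / lot_size
print(f"${price_per_acre:,.0f} per acre")  # $2,000,000 per acre
```

Normalizing by lot size is what makes parcels of different sizes directly comparable.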

Interactive GIS maps

Use the map below to explore Arlington’s 2015 Property Assessments. Or go here for a full screen version.






Sales Do Not Support Changes in Assessments

When asked about the changes in land values, one Assessor told me that sales data supported the changes in land values. We could find no evidence to support this claim. Before we present our evidence to the contrary, a little background is necessary.

The FY2015 tax bills are generated from property assessments as of January 1, 2014 based (theoretically) on sales from the calendar year 2013. These sales are submitted to the Mass Department of Revenue, Division of Local Services. Sales are coded for being at an “Arms-length” and included in determining assessments while sales coded for being “non-arms-length” (NAL), such as between a parent and child, are not included in determining assessments since non-arms-length sales are unlikely to be at full and fair market value.

For FY2015, Arlington had 1,039 sales of which 491 were at arms length. There were 243 one-family and two-family sales, 241 condo sales, four commercial property sales and three miscellaneous sales of mixed-use properties. The coding of non-arms-length sales, accounting for more than half of all sales in Arlington, should be the subject of an entire post; see two notable examples in the section below.

The 243 one-family and two-family sales (SF) in FY2015 can be found on this Google map with a snapshot below.

View Arlington FY2015 Assessments and Sales – SF only in a full screen map


Sales Price Increases Over 2014 Assessment

Sales of One and Two Family Homes in 2013 - red markers higher than average, blue lower than average

The default pin category (“>Avg Sales Inc?”) shows whether the sales price increase over the property’s prior (FY2014) assessment is greater than the average percentage change (16%) of all sales in the SF category – red pins are higher than average and blue are lower than average. Other pin options can be found in the drop down menu. Click on any marker at the bottom of the map to show subcategories. Click on any individual marker to see the sales detail information.


Higher Than Average Sales Evenly Distributed Throughout Arlington

Sales of One and Two Family Homes in 2013 with selling price 30% or more above the FY2014 assessment

As the snapshot above shows (click to expand to full view), the volume of sales of one-family and two-family homes was not unusually large in east Arlington compared to the rest of the town. In addition, the sales that were for much greater (>30%) than the average increase relative to the FY2014 assessment are evenly distributed throughout Arlington, with east Arlington seeing fewer than expected as a share of all housing stock. Further, we reach the same conclusion – that sales in east Arlington were no different than sales throughout Arlington – by looking at all sales above the average, as seen in this snapshot.



East Arlington Assessments Higher Than Average

Not to lose track of what we are arguing here, the picture on the right (click to expand) shows all 2013 sales of 1&2 family homes in Arlington. Blue markers are properties that saw assessment increases higher than the average assessment increase of other properties that were sold, while red markers are assessment changes lower than the average. Note the stunning fact: all sales in east Arlington saw assessment changes above the average. This should not be surprising since the analysis above on land value changes also clearly demonstrates this fact.



Condo Sales

To be complete, we included all condominium sales in Arlington in this Google map. Since condos do not have a separate land valuation, the sales data does not have much to tell us.

Observations on Sales

Some observations:

1. Many more SF sales outside of East Arlington

2. Toggle to “>Avg Assess Inc?” on dropdown – clearly shows what we know, East Arlington had a higher than average assessment change compared to all sales.

3. Toggle back to “>Avg Sales Inc?” on dropdown and then click the category marker “>30%” at the bottom of map which shows that the highest sales over assessments are scattered throughout Arlington.

4. Condos sales are evenly distributed in “>Avg Sales Inc?” throughout Arlington where condos had sales.

5. Condos in east Arlington have fewer (only 2) red pins in “>Avg Assess Inc?”

Condo sales are included for completeness although condos have no separate land value. I don’t see how condo sales can be used to justify increased land values in East Arlington.

Notable Non-Arms Length Sales

On 12/30/2013, immediately before the FY2015 assessment date for full and fair valuations of real property, 30 Mill Street, the site of the former Brigham’s ice cream plant, sold for more than $50M. The current assessment is for less than $30M, a $20M difference, or more than $250,000 of taxes shifted from a large corporation onto homeowners in east Arlington. The sale is a non-arms-length transaction coded as “B” which, according to the DOR Classification Handbook, is:

An intra-corporation sale, e.g. between a corporation and its stockholder, subsidiary, affiliate or another corporation whose stock is in the same ownership

The buyer of the property was US REIF BRIGHAM SQUARE while the seller was SP5 WOOD ALTA MILL STREET LLC. Click on either party to view, among other things, their boards and key employees. These appear to be two completely separate entities.
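As a rough sanity check of the “more than $250,000” figure above: applying a hypothetical tax rate in the neighborhood of $13 per $1,000 of value (a placeholder for illustration, not the official FY2015 Arlington rate) to the roughly $20M gap between sale price and assessment gives a number in that range.

```python
# Order-of-magnitude check on the taxes shifted by the 30 Mill Street gap.
# The tax rate below is a hypothetical placeholder, not the official rate.
sale_price = 50_000_000  # $, approximate 12/30/2013 sale price
assessment = 30_000_000  # $, approximate current assessment
tax_rate = 13.00         # hypothetical $ per $1,000 of assessed value

untaxed_value = sale_price - assessment
forgone_taxes = untaxed_value / 1000 * tax_rate
print(f"${forgone_taxes:,.0f}")  # on the order of $260,000
```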

The other bit of sales legerdemain involves the 250 condos sold by the Wilfert Trust in the Brentwood (60 Pleasant St.) and Old Colony condo complexes.

There are several stories about how the treatment of sales affects the valuation process and the role town officials play in the representation of corporate entities.

Example of Valuation Model Over Fit

Colonial Village Assessment Changes; Note the 24% Increase in 2015 and the 25% Decrease in 2013

Another revealing assessment change is the huge year-on-year swing in the Colonial Village condo complexes. The assessments on these properties saw a 24% increase this year and a 25% decrease a few years ago, while sales of condos in these buildings were almost double the assessed values. These large changes seen in the Colonial Village condos are symptomatic of a valuation process that overfits the data, especially on properties, like condos, that don’t include the leading factor, lot size, in the computerized process. This observation is deserving of its own blog post and involves math that most people would find dull.

Red Flag – Distributions

Looking at the distribution in the percentage change in assessments from 2014 – 2015 for different communities can be enlightening.

In the table below are four such distributions showing the change in residential properties. Click to expand the image. The four images represent the towns of Wakefield, Westford, Lexington and Arlington. The horizontal (x) axis shows the percent change in the assessed value, while the vertical (y) axis shows the number of properties (parcels) with that percent change. The vertical line in the center of each chart shows the average residential change. The dark blue columns are condominiums.

Some things to note.

Wakefield - Percent Change in Residential Assessments 2014 - 2015

The chart for Wakefield is very easy to understand. The average assessment change was 2.63%, with most properties seeing between a 2% and a 4% change in their assessments and a comparable number seeing no change. Wakefield’s assessment changes are incremental, uniform and tightly clustered about the average change.


Westford - Percent Change in Residential Assessments 2014 - 2015

Next look at the chart for Westford. Westford is undergoing rapid development, with ongoing new construction that far exceeds Arlington’s. With a 4.73% increase in assessed values for residential properties, the distribution is skewed somewhat to the left with a long tail representing growth in new homes and improvements. Condominiums, shown as blue columns, are somewhat uneven, with a larger fraction of condos seeing a smaller than average increase in their assessments.


Lexington - Percent Change in Residential Assessments 2014 - 2015

Lexington performed its triennial reassessment in 2015 with all properties revalued. The average increase in residential assessments was 10.33%, which combined with a 4.6% decrease in the tax rate meant the average residential tax bill increased about 5.7%. One thing to note about Lexington is the symmetric distribution about the average – a “normal” distribution. This is the sign of a robust revaluation process that treats all properties, including condos, equally.


Arlington - Percent Change in Residential Assessments 2014 - 2015

Finally, we show Arlington’s distribution of percent changes in residential assessments. The average is about 5.7%, which combined with a 1.7% decrease in the tax rate, and zero change in commercial and apartment properties, resulted in tax bill increases of about 3.8%. The first thing to note is the disparity between condos and single/multi-family dwellings. Most condos saw a below average assessment change (4.8% vs 5.7%), in agreement with our observations above. But the real difference is the bump, or extra peak, in the distribution around the 10% change. These are the east Arlington residential properties that saw such an outsized increase in their land values for 2015.


Apr 6 14

Diversity

by Stephen Harrington

When I was a graduate student in Physics at Boston University during the 1990s, I had the opportunity to work in a research group with about 20 other students. For a time, I was the only American in the group. Many of the students were from Eastern Europe: Bulgaria, Slovakia, Hungary and Russia. As well, most of the students in this graduate program at Boston University were from India, China and South America. At that time, Boston University had the second largest foreign student population of any US university, and the graduate program in physics is truly an international group at most US universities.

On one month-long trip to China, my travel mates were from Russia, Iran, Portugal, Argentina and South Korea. We were like a little United Nations with the attendant problems crossing borders; the Russian was hassled by the US embassy trying to obtain his re-entry Visa, the Argentine was not allowed to enter Macau, not by the Portuguese, but for re-entry to Hong Kong by the British – a hangover of the Falklands, the Iranian was welcomed by the Chinese, while my passport was met with scrutiny by the cigarette smoking customs guard with the machine gun who welcomed me to a free country on our entry to Xiamen.

To this day, I count people from Iceland, Venezuela, Israel, Japan and from all over the world as my friends.

Generally, during the good weather, the graduate students would climb out of our cold, subterranean laboratories to share lunch together on the plaza in front of the Science Building at 590 Commonwealth Avenue. Even at that time, I couldn’t help but recognize, and somewhat cringe, at the sight 20 or 30 young men (and some women) would make, dressed alike in jeans and flannel shirts, even on warm spring days, while the fashionably dressed young undergraduates would walk by.

Although we represented the entire gamut of ethnic, religious and national identities, what differentiated us most, in our minds, was whether we studied theoretical or experimental physics. What I realized two decades ago, whether I was sharing tea on a beach in China at 2:00am, scaling a 20 foot high wall after the city gates had closed, packed with seven other people into a Trabant driven by a crazy Russian or making a “pilgrimage” with an Indian friend to the local Walden Pond is that I had a stronger bond with these other physicists than I did with the people I had grown up with in lily white, mostly Irish Catholic Arlington. Our bond was not how different we were in appearance, background, economic experience or beliefs, but in our mutual pursuit of science.

That is the lesson I learned. Diversity is not what separates us, diversity is what brings us together.

Mar 13 14

Baby Bump

by Menotomy Observer

Today we analyze the enrollment projections presented by the Arlington Public Schools as one rationale for spending upwards of $100,000,000 to rebuild the public high school. We find several mistakes in the data, inconsistencies in the model of projected enrollments and perform some rudimentary statistical analysis that shows that no conclusions can be made about future increased enrollments using the administration’s data analysis. Specifically, we find that the entire 10% projected 5 year increase in enrollments is due to assumptions of drastically lowered historic attrition rates (-12% projected instead of -24% historic). We detail three mistakes in the projected enrollment data including no attrition between 8th and 9th grade (twice) and unrealistic growth in high school enrollments. We observe a bias when using Power School enrollment totals compared to the official DESE enrollment statistics, with the Power School data consistently higher than the official enrollment statistics.

For those interested in consolidated data tables open this image for the APS projections and this image for the DESE official statistics.

Enrollment Projections

The Arlington Public Schools are showing projections of increased students in the Middle (a +21% increase in 5 years) and High schools (+32% increase in 10 years) as one reason that Arlington should rebuild the aging high school sooner rather than later. The data and projection sheet can be found here and reproduced below:


8 Year Enrollment History and Projected Enrollment 2014 to 2028 -by Grade Levels



This is a dense and difficult table to read, so we converted the pdf into an excel spreadsheet and added some color, see below. The yellow cells represent enrollments that are projected. The green and blue cells each represent one class, the graduating year, followed through time. The green cells start at K (kindergarten) in the 2006-2007 school year, progress to 1st grade in 2007-2008 and are projected to graduate in 2018-2019; the green cells represent the class of 2019. The blue cells start at the 5th grade in 2006-2007 and represent the class of 2014, the current graduating class, also shown in bold. These ‘cohorts’ will become important later in our analysis. For the moment, ignore the numbers colored in red.


8 Year Enrollment History and Projected Enrollment 2014 to 2028 -by Grade Levels
Grade/Year Births preK K 1 2 3 4 5 6 7 8 9 10 11 12
2006-2007 545 84 442 391 386 394 385 357 356 339 347 302 309 301 323
2007-2008 537 79 409 439 399 384 381 382 337 354 317 316 271 299 292
2008-2009 496 82 456 405 439 387 376 374 369 344 354 296 308 266 300
2009-2010 558 64 457 451 411 423 387 366 365 373 343 320 295 323 272
2010-2011 545 60 450 442 435 399 427 367 349 350 365 306 325 296 311
2011-2012 537 47 434 455 421 426 390 412 355 335 348 308 304 342 299
2012-2013 496 57 453 472 446 420 429 395 379 337 337 322 313 309 354
2013-2014 558 60 477 478 483 464 434 429 357 393 328 299 320 321 314
2014-2015 517 60 442 496 473 484 469 429 400 352 388 292 300 329 325
2015-2016 563 60 481 459 490 474 489 463 400 394 348 346 293 308 333
2016-2017 545 60 466 500 454 491 479 483 431 394 390 310 347 301 312
2017-2018 597 60 510 484 495 455 496 473 450 425 390 347 311 356 305
2018-2019 525 60 449 530 479 496 460 490 441 444 420 347 348 319 361
2019-2020       466 524 480 501 454 457 435 439 374 348 358 323
2020-2021         461 526 485 494 423 450 430 391 375 385 352
2021-2022           462 531 479 461 417 445 430 392 386 390
2022-2023             467 524 446 454 413 397 431 402 391
2023-2024               461 489 440 449 413 398 442 408
2024-2025                 430 482 435 400 414 424 403
2025-2026                   424 476 387 401 411 419
2026-2027                     419 424 388 398 407
2027-2028                       373 425 436 394


We believe we have faithfully reproduced the projection “data” provided by the public school’s administration. To cross check the dataset, we computed totals, compared them to the source document and found exact agreement; see the table below.


Summary of Enrollments 2007 – 2028
Grade/Class B-K K-5 Tot 6-8 Tot 9-12 Tot Total Chg
2006-2007 -19% 2,355 1,042 1,235 4,716
2007-2008 -24% 2,394 1,008 1,178 4,659 (57)
2008-2009 -8% 2,437 1,067 1,170 4,756 97
2009-2010 -18% 2,495 1,081 1,210 4,850 94
2010-2011 -17% 2,520 1,064 1,238 4,882 32
2011-2012 -19% 2,538 1,038 1,253 4,876 (6)
2012-2013 -9% 2,615 1,053 1,298 5,023 147
2013-2014 -15% 2,765 1,078 1,254 5,157 134
2014-2015 -15% 2,793 1,140 1,246 5,238 81
2015-2016 -15% 2,856 1,142 1,280 5,338 100
2016-2017 -14% 2,873 1,215 1,270 5,418 80
2017-2018 -15% 2,913 1,265 1,319 5,557 139
2018-2019 -14% 2,904 1,305 1,375 5,643 86
2019-2020 1,331 1,403
2020-2021 1,303 1,503
2021-2022 1,323 1,598
2022-2023 1,313 1,621
2023-2024 1,378 1,661
2024-2025 1,347 1,641
2025-2026 1,618
2026-2027 1,617
2027-2028 1,628


The summary totals tell the administration’s whole story. Elementary school enrollment is up, which the APS projects will produce 21% more middle school students in the next five years (1,078 in 2014 to 1,305 in 2019); see the red highlighted rows in the “6-8 Tot” column. This baby bump will then move into the high school, with a projected 10-year increase of 32% (1,254 in 2014 to 1,661 in 2024). Note that middle school enrollment, which has held steady for the past 8 years, is projected to jump 21% in just the next five.

Overall, the APS is projecting a 10% increase in total enrollment over the next 5 years (5,157 in 2014 to 5,643 in 2019); a growth rate 60% higher than the 9% increase seen over the past 8 years. The graph below encapsulates all of the summary data, showing the projected increases.


APS Enrollment History and Projected Enrollment



The question becomes: do you believe that the enrollment increases of the last three years will continue for the next five, growing steadily at the rate seen recently? Put another way, is the projected enrollment model predictive?

Analysis

First, let’s look at the birth numbers provided by the APS, found in the first table above, which record the number of children born 5 years previously and about to enter kindergarten. The first question to ask is whether the number of children entering the school system is statistically different from the number who entered over the past 8 years. This is a simple question, answered by comparing the averages (534 over the past 8 years versus 549 over the next 5) and the spreads (standard deviations of 25 historically and 32 over the next 5 years). Intuition tells us that a difference of 15 in the means is hidden within a spread of 25-32. Performing a Student’s T-test (0.35) confirms there is no statistical difference between the births used in the projected enrollments and those of the last eight years that might result in a baby bump. The number of births over the past five years does not indicate any increase in future enrollments.
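
To make the comparison concrete, here is a minimal sketch of a two-sample Student’s t-test in Python. The birth counts are the ones visible in the APS table above; the post’s own test uses all eight historic years, so its means differ slightly from the five-year subset used here.

```python
import math

# Kindergarten-entry birth counts from the APS projection table above.
historic = [558, 545, 537, 496, 558]    # 2009-10 through 2013-14 (observed)
projected = [517, 563, 545, 597, 525]   # 2014-15 through 2018-19 (projected)

def students_t(a, b):
    """Two-sample Student's t statistic with pooled variance."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    sp2 = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    return (mb - ma) / math.sqrt(sp2 * (1 / na + 1 / nb))

t = students_t(historic, projected)
print(f"t = {t:.2f}")  # well below the ~2.3 critical value at p = 0.05, df = 8
```

With a t statistic this small, the difference in mean births is indistinguishable from noise, which is the post’s point.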

Next, let’s look at the data provided by the APS in a slightly altered format. Instead of looking at fiscal years, let’s follow the class years as they progress from kindergarten through high school. The table below contains exactly the same data as above, but ’tilted’ so we can easily follow through each class year.


APS Enrollment History and Projected Enrollment – by Graduation Year
Grade/Class K 1 2 3 4 5 6 7 8 9 10 11 12
2006-2007 323
2007-2008 301 292
2008-2009 309 299 300
2009-2010 302 271 266 272
2010-2011 347 316 308 323 311
2011-2012 339 317 296 295 296 299
2012-2013 356 354 354 320 325 342 354
2013-2014 357 337 344 343 306 304 309 314
2014-2015 385 382 369 373 365 308 313 321 325
2015-2016 394 381 374 365 350 348 322 320 329 333
2016-2017 386 384 376 366 349 335 337 299 300 308 312
2017-2018 391 399 387 387 367 355 337 328 292 293 301 305
2018-2019 442 439 439 423 427 412 379 393 388 346 347 356 361
2019-2020 409 405 411 399 390 395 357 352 348 310 311 319 323
2020-2021 456 451 435 426 429 429 400 394 390 347 348 358 352
2021-2022 457 442 421 420 434 429 400 394 390 347 348 385 390
2022-2023 450 455 446 464 469 463 431 425 420 374 375 386 391
2023-2024 434 472 483 484 489 483 450 444 439 391 392 402 408
2024-2025 453 478 473 474 479 473 441 435 430 430 431 442 403
2025-2026 477 496 490 491 496 490 457 450 445 397 398 424 419
2026-2027 442 459 454 455 460 454 423 417 413 413 414 411 407
2027-2028 481 500 495 496 501 494 461 454 449 400 401 398 394
2028-2029 466 484 479 480 485 479 446 440 435 387 388 436
2029-2030 510 530 524 526 531 524 489 482 476 424 425
2030-2031 449 466 461 462 467 461 430 424 419 373


As before, the yellow cells are projected, and the green and blue cells show the 2019 and 2014 graduating classes respectively. Viewing the data by graduation year allows for some simple retention/attrition calculations. Below is a table showing the retention rate (students out divided by students in, minus one) over a few relevant periods: kindergarten to 1st grade, 1st-5th grade (elementary school), 5th-6th grade (first drop-off), 6th-8th grade (middle school), 8th-9th grade (second drop-off), 9th-12th grade (high school), 1st-12th grade and a compounded retention rate. The green-coded row and difference are detailed later in the post and are shown here for reference.


Retention Rates – Projected
Year K-1 1-5 5-6 6-8 8-9 9-12 1-12 Compounded
2014-2015 6%
2015-2016 -7% 3%
2016-2017 -11% 4%
2017-2018 -8% -11% 4%
2018-2019 -1% -6% -8% 2% -11% 4% -18%
2019-2020 -1% -2% -10% -3% -11% 4% -20%
2020-2021 -1% -5% -7% -3% -11% 1% -22%
2021-2022 -3% -3% -7% -3% -11% 12% -12%
2022-2023 1% 2% -7% -3% -11% 5% -14%
2023-2024 9% 2% -7% -2% -11% 4% -14%
2024-2025 6% -1% -7% -2% 0% -6% -16%
2025-2026 4% -1% -7% -3% -11% 6% -16%
2026-2027 4% -1% -7% -2% 0% -1% -11%
2027-2028 4% -1% -7% -3% -11% -2% -21%
2028-2029 4% -1% -7% -2% -11%
2029-2030 4% -1% -7% -3% -11%
2030-2031 4% -1% -7% -3% -11%
Projected 4% -1% -7% -2% -9% 3% -16% -12%
Actual 1% -7% -4% -3% -13% 0% -23% -24%
Diff 3% 6% -3% 1% 3% 3% 7% 12%
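
The retention arithmetic behind the table is just division between cohort counts. A minimal sketch using the class of 2014 (the blue cohort), whose grade 5 through 12 counts appear in the tilted table above:

```python
# Class of 2014 cohort counts, grades 5 through 12, from the tilted table.
class_2014 = {5: 357, 6: 337, 7: 344, 8: 343, 9: 306, 10: 304, 11: 309, 12: 314}

def retention(cohort, start, end):
    """Net retention between two grades: students out / students in, minus one."""
    return cohort[end] / cohort[start] - 1

print(f"5-6:  {retention(class_2014, 5, 6):+.0%}")    # first drop-off
print(f"8-9:  {retention(class_2014, 8, 9):+.0%}")    # second drop-off
print(f"5-12: {retention(class_2014, 5, 12):+.0%}")   # net over the whole span
```

The same calculation, applied grade pair by grade pair down each cohort diagonal, produces every entry in the retention tables.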


First, let’s note some errors in the projected enrollments. As can be seen, for the classes of 2025 and 2027 the attrition rate between 8th and 9th grade is 0% instead of the -11% projected for every other year. As well, the 2021 and 2022 graduation years start off with exactly the same number of 9th graders, but the class of 2022 then shows a net projected increase of 12% over high school instead of the 1% increase shown for the class of 2021. Since all of these are ‘yellow’ cells, the enrollment numbers are projected and presumably computed using the same model. These three data errors account for a 3% higher retention than a consistent projection model would produce.

In addition, there are two bigger errors earlier in the projection model that overestimate the retention rate by 9%. Before we show this, we need a longer historical record and a little more work.

Power School Vs DESE Enrollment Statistics

In the source document, the historic data for fiscal years 2006 – 2014 comes from Power School and not a public data source. To ascertain the predictive power of the APS’ enrollment projections, we need a longer historical record. As well, since the Power School data source is not available to the public, there is no way to know exactly what these enrollment numbers mean. To rectify this, we turned to the databank of the Massachusetts Department of Elementary and Secondary Education (DESE), which collects and maintains enrollment statistics for all public school districts in Massachusetts back to the 1990s. Arlington’s official enrollment numbers for the years 1997 – 2014 are reproduced in the table below.


DESE Official Enrollment by Grade 1997 – 2014
FY PK K 1 2 3 4 5 6 7 8 9 10 11 12 Total
1997 360 402 381 373 357 319 335 306 276 256 275 243 202 4,116
1998 344 385 374 372 370 344 326 311 295 268 255 274 228 4,197
1999 376 351 373 372 366 364 341 326 299 259 260 253 265 4,222
2000 343 371 311 349 364 354 341 330 317 299 269 263 254 4,178
2001 0 388 354 378 322 340 354 345 345 320 284 280 261 244 4,215
2002 7 385 407 344 372 312 329 347 324 332 289 291 270 256 4,265
2003 97 434 392 394 366 357 310 335 346 331 291 275 283 270 4,481
2004 87 393 411 379 374 354 353 294 338 344 275 287 263 273 4,425
2005 78 406 405 406 374 364 362 346 292 334 287 280 300 252 4,486
2006 74 381 411 399 399 368 353 338 342 292 293 285 294 293 4,522
2007 77 446 382 379 388 379 348 346 329 333 250 304 287 300 4,548
2008 62 411 432 388 376 375 378 328 347 309 296 255 293 282 4,532
2009 67 451 401 433 377 372 373 365 336 347 285 293 257 297 4,654
2010 59 434 446 403 418 372 359 359 367 334 312 280 304 266 4,713
2011 57 448 441 433 395 427 360 344 347 360 297 318 286 295 4,808
2012 48 450 455 427 429 390 415 349 331 346 300 297 331 290 4,858
2013 54 454 460 446 418 424 386 374 328 326 313 298 296 326 4,903
2014 55 471 472 474 458 428 423 352 385 317 280 313 303 289 5,020


First let us note that, as in all government data sets, there are errors. From 1997 – 2000 the detailed grade enrollments do not equal the reported totals, so we highlight these in red. As well, Arlington’s pre-K program started in 2003, so total enrollment statistics before 2003 are not comparable to later years and are inappropriate for year-on-year changes. Most disturbing is the comparison between the DESE total district-wide enrollments and the Power School totals used in the projected enrollment analysis. In the table below we compare the two and see a persistent bias: the Power School numbers are consistently higher than the official DESE statistics, by 137 students in the current 2014 fiscal year. We believe the DESE enrollment numbers to be the ‘truth’ and expect the APS administration to explain the difference.


Comparison of DESE and Power School Total Enrollment Statistics 2007 – 2014
FY DESE PowerSchool DIFF
2007 4,548 4,716 (168)
2008 4,532 4,659 (127)
2009 4,654 4,756 (102)
2010 4,713 4,850 (137)
2011 4,808 4,882 (74)
2012 4,858 4,876 (18)
2013 4,903 5,023 (120)
2014 5,020 5,157 (137)
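
The bias is easy to verify from the table itself: every yearly difference has the same sign. A quick sketch:

```python
# Total enrollment, DESE official vs Power School, FY2007-2014 (table above).
dese        = [4548, 4532, 4654, 4713, 4808, 4858, 4903, 5020]
powerschool = [4716, 4659, 4756, 4850, 4882, 4876, 5023, 5157]

diffs = [d - p for d, p in zip(dese, powerschool)]
assert all(x < 0 for x in diffs)          # Power School is higher every single year
mean_bias = sum(diffs) / len(diffs)
print(f"mean bias: {mean_bias:.0f} students")  # roughly -110 students per year
```

Eight out of eight years in the same direction is not the pattern one expects from random counting differences, which is why an explanation is warranted.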


Historic Retention/Attrition Rates

Now that we have a deeper history of actual enrollments, let’s look at historic retention rates to compare against the projected enrollments the APS has modeled. Since we are looking at rates of change, the difference in total enrollments should not be an issue. In the table below, we repeat the relevant enrollment numbers and show the retention rate for the same periods as above (K-1, 1-5, 5-6, 6-8, 8-9, 9-12). We average these retention rates to come up with a robust set of expected values from actual enrollments. As well, we computed a compounded and a simple 1st – 12th grade retention (attrition) rate of -24%. We were careful not to overlap any of the statistics when computing the compounded values.


Retention Rates for Selected Periods, 2001- 2014
Year K 1 K-1 5 1-5 6 5-6 8 6-8 9 8-9 12 9-12
2001 388 354 354 345 320 284 244
2002 385 407 5% 329 347 -2% 332 289 -10% 256
2003 434 392 2% 310 335 2% 331 -4% 291 -12% 270
2004 393 411 -5% 353 294 -5% 344 -1% 275 -17% 273 -4%
2005 406 405 3% 362 2% 346 -2% 334 0% 287 -17% 252 -13%
2006 381 411 1% 353 -13% 338 -7% 292 -1% 293 -12% 293 1%
2007 446 382 0% 348 -11% 346 -2% 333 -4% 250 -14% 300 9%
2008 411 432 -3% 378 -8% 328 -6% 309 -9% 296 -11% 282 -2%
2009 451 401 -2% 373 -8% 365 -3% 347 0% 285 -8% 297 1%
2010 434 446 -1% 359 -13% 359 -4% 334 2% 312 -10% 266 6%
2011 448 441 2% 360 -6% 344 -4% 360 -1% 297 -11% 295 0%
2012 450 455 2% 415 -4% 349 -3% 346 -4% 300 -17% 290 2%
2013 454 460 2% 386 -4% 374 -10% 326 -5% 313 -10% 326 4%
2014 471 472 4% 423 -5% 352 -9% 317 -9% 280 -14% 289 -3%
Avg 1% -7% -4% -3% -13% 0%


Now let’s compare the actual, observed retention rates with those predicted in the APS projected enrollments analysis.


Comparison of Retention Rates. Actual Vs Projected
Year K-1 1-5 5-6 6-8 8-9 9-12 1-12 Compounded
Projected 4% -1% -7% -2% -9% 3% -16% -12%
Actual 1% -7% -4% -3% -13% 0% -23% -24%
Diff 3% 6% -3% 1% 3% 3% 7% 12%


As one can observe, the APS is projecting a compounded attrition rate of -12%, far milder than the -24% rate observed historically. The difference lies in three areas. First, the K-1 growth rate has been observed to be about 1% but is projected at 4%; in other words, the APS expects 4% more 1st graders to enter the system than attended full-day, free kindergarten. Second, in the elementary school years the actual retention rate is -7% while the APS predicts -1%. Finally, we showed earlier that data errors in the modelling overestimate retention in the projected high school enrollments by 3%. These three differences account for the entire gap between the historic -24% rate and the projected -12% rate.
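
As a sanity check, the per-period rates in the comparison table compound to the overall figures; note that the “Compounded” column includes the K-1 step. A short sketch:

```python
# Per-period retention rates from the comparison table, as decimal growth rates.
projected = {"K-1": 0.04, "1-5": -0.01, "5-6": -0.07, "6-8": -0.02,
             "8-9": -0.09, "9-12": 0.03}
actual    = {"K-1": 0.01, "1-5": -0.07, "5-6": -0.04, "6-8": -0.03,
             "8-9": -0.13, "9-12": 0.00}

def compounded(rates):
    """Multiply the per-period growth factors into one overall rate."""
    factor = 1.0
    for r in rates.values():
        factor *= 1 + r
    return factor - 1

print(f"projected: {compounded(projected):.0%}")  # matches the table's -12%
print(f"actual:    {compounded(actual):.0%}")     # matches the table's -24%
```

Multiplying out the factors reproduces both compounded rates, confirming that the projected-versus-actual gap really does come from the per-period assumptions listed above.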

Conclusion

The so-called baby bump used, in part, to justify an expensive rebuild of Arlington High School is not supported by the APS’ projected enrollment analysis. The entire 10% projected increase in future enrollments is driven by assumptions of far higher retention rates (or, equivalently, far lower attrition rates) than the historic record suggests, especially in the earlier grades (K-5). There are many reasons to rebuild the AHS, but the APS’ projected enrollment analysis is not one of them.

Nov 21 13

Staying the Course

by Wise Guy

Greetings! This is a follow-up post to my experience with online courses (see previous posts Spring 2013, MITx 6.00x, MITx 8.MReV and Fall 2013), also known as MOOCs. Since September, I have completed seven online classes and audited another five bringing my total to 15 completed courses and 11 audits over the past year. I decided to overload the number of courses in the fall to find out the limits of what one, somewhat average, middle aged guy could do with online learning.

I sampled a number of courses, but focused on the best offerings from some of the most prestigious universities, not worrying about dropping out of a class, since I am not looking for anything more than knowledge. As well, I’ve reached some conclusions about what knowledge acquisition might be good for besides a ‘certificate’, ‘diploma’ or other intangible accolade.

Below, I describe in gory detail each course completed and briefly describe the courses I “dropped”, giving pathetic reasons why. Before I do this, I make some general comments about the two most popular platforms for MOOCs, edX and Coursera. I also describe my overall experience in taking what amounts to more than a full load of college-level courses that parallel some of the most popular introductory courses at some of the best universities in the United States and one in India.

Forgive me for the length of this post. I hope those that read my drivel might attain some understanding of these MOOCs. Mostly, though, I write these thoughts down for myself to record my observations.

Of Course

Since late August, 2013 I registered for more than a dozen courses on the edX and Coursera MOOC platforms. At this point, I have completed or will complete seven of the courses after doing all of the work associated with each.

In addition, I audited and/or partially completed the following courses.

As detailed below, the courses took about 30 hours/week of work. First, let me respond to all of those out there who complain about how busy they are. I completed all of this work while also running a business, spending as much time as possible with my family and collaborating with others on two new business ventures. This is in addition to the normal social and professional engagements required of an adult. Several times during the fall, someone or other would complain about how busy they were. Some people put in their time; I like to use my time effectively.

This brings me to my second point. Seven college-level classes, plus audits in a half dozen others, far exceeds the normal load of a college student. Part of the ability to do all of this work lies in the fact that I am much older than your typical college-aged kid, with a broader and deeper education than any undergraduate. Some of that comes from my own formal education, but even more from what life has taught me beyond what a mere 10 years of post-graduate education might provide. One conclusion I have reached is that, to paraphrase George Bernard Shaw, education is wasted on the young. I see the potential of MOOCs, in addition to educating university students, in retraining technology workers, keeping retirees mentally active and opening up new areas of knowledge for the non-traditional student.

More to the point, the unique structure of the MOOC made this course load possible. In a residential (normal) college setting, lectures are at a set time and there is always dead time between classes. Not so with a MOOC. During the term, I could be listening to a lecture and pause it while taking a client phone call. I often listened to lectures while waiting at various sporting venues while my children practiced. I could read materials on my phone while riding the subway, and I often completed assignments, participated in discussions or reviewed online notes very early in the morning or late at night. All this is to say that MOOCs are an efficient education delivery mechanism allowing for incredible productivity gains by the student as well as the teacher.

One last point is the difference between the MOOC platforms. Coursera has a larger selection of courses from many more institutions than edX. However, the courses themselves tend to be easier, the presentation less rigorous and the use of technology (embedded autograders, simulations and interactive problem sets) less impressive than on the edX platform. Coursera’s format is somewhat confusing, with too many clicks to move between courseware components, while edX’s LMS is crisp and relatively clean, with most features just a single level deep. Finally, of all of the courses offered, I have to say, in my still limited experience, that MITx has far and away the best implementations, with MOOCs that compare favorably to the residential courses offered on campus.

Now for an overview of some of the courses.

Princeton Statistics One

Princeton’s Statistics One (“Stats1”) course, taught by Andrew Conway, is hosted on the Coursera platform. Stats1 is a series of 25 lectures, each broken into two segments of 10-25 minutes recorded in HD. Andrew stands next to a large-screen monitor, holding a tablet strapped to his hand that controls the PowerPoint slides he annotates during the lecture. Professor Conway talks directly into the camera; there is no class present. Princeton does not offer a certificate for this course.

The class covers four broad categories of introductory statistics:

      Research Methods and Descriptive Statistics
      Simple and Multiple Regression
      Group Comparisons using T-tests and ANOVA
      Non-normal Distributions and non-Linear Models

There were 10 lab tutorials showing how to use the R-programming language to do statistical analysis. There were 11 homework assignments that reinforced the lecture concepts as well as two exams, a mid-term and a final. I spent approximately 2 hours a week on this class and scored a 90% overall.

The course used examples from IQ testing/memory training studies, concussion studies using the IMPACT dataset that many high schools now use, and other real-world datasets. There were a couple of contrived examples, which I would urge Conway to replace with more meaningful data. The concussion studies were particularly interesting, showing a link between pre- and post-test results for athletes suffering head injuries.

I’ve never taken a dedicated statistics course, and although little of the material was new to me, it was informative to have a clear, concise and detailed exposition of the subject delivered in a comprehensive manner. Overall, if you are interested in understanding statistics, or want to solidify your R statistical programming skills, I can recommend taking Conway’s Statistics One course.

MITx – 7.00x Introduction to Biology – The Secret of Life

MITx 7.00x is an introduction to biology required of all MIT students. The professor is Eric Lander, a well-known, accomplished scientist. I found 7.00x well designed, with an excellent sequence of mind-expanding lectures. The questions throughout the course kept me on track, helped solidify my understanding of the lecture materials and pushed me to learn certain aspects of biology on my own.

The tools in 7.00x (the molecular editor; jsMol, a 3-D macromolecule viewer; geneX; IGV, an interactive gene viewer; and the virtual genomics lab, VGL) were perfectly coordinated with the lectures. I’m one of the lucky participants with high-speed internet, a monster computing device and multiple screens. Having multiple screens really came in handy when answering the problem sets: opening the resource-box PNGs, the instructions and the answer section all on different screens. I’m sure the clever edX developers will figure out a way to make this less important in the future.

The week 7 lectures and problem sets were just brilliant. I showed my middle school children the connection between DNA to RNA to protein using the visual representation made possible by the jsMol tool in the problem sets which allowed them to follow the conclusions presented in the lecture materials. Their response was ‘cool’. This is a testament to the efficacy of the tools used in 7.00x; that portions of the course are entirely accessible to most anyone.

In the lectures, the discussion of how sickle cell anemia changes the morphology of the red blood cell, and how the molecular mechanism of hydrophobic binding forms long chains resulting from a single base-pair change on one chromosome, was the high point of 7.00x for me. Adding in the discussion of all the Greek (beta, delta, gamma, etc.) globins, the connection to thalassemia, fetal oxygenation processes and the introduction to evolutionary genetics was pure brilliance.

The last few lectures covered the molecular biology of both heart disease and certain cancers, employing all of the genetics, biochemistry and other materials presented throughout the course. Again, the discussion was clear, and my understanding of rational medicine improved by many orders of magnitude in 14 short weeks.

7.00x consisted of 27 hours of lectures and help sessions (although at 1.5X speed, it is more like 18 hours). There were 176 or so interstitial quizzes requiring about 3 hours of work (at roughly 1 minute each), 15 hours of reading (I admit, I skimped here!), 7 problem sets at 3-5 hours each for 30 hours and 3 exams at 4 hours each for 12 hours.

There were approximately 530 questions (counting green checks and the dreaded red X’s) in the problem sets and approximately 157 questions in the three exams. Those who completed all of the work in 7.00x answered well over 850 biology questions. In the beginning, I dreaded the single try exam questions, but in the end appreciated the necessity to up my game and not just answer with my usual, lazy, first guess.

7.00x consumed about 100 hours of my time, which over a 14 week period, was about 7 hours per week. An investment that was well worth the effort. I’ve estimated about 10 hours participating/reading the discussion groups, which may be on the low side. I also left out the huge number of tangents, self guided studies prompted by other students’ observations and discussions with my friends and family about the material I was excited about and they patiently listened to. I garnered an 89% in this course.

In conclusion, 7.00x is a rigorous, well designed, fantastic, interesting, mind blowing introduction to biology. I want to thank Professor Lander, all of the TAs and course staff, MIT, edX and the Broad Institute for providing me with this experience. For me, learning is the true secret to a well lived life and Eric Lander’s 7.00x should not be a secret for anyone who wishes to learn.

Some notable 7.00x links

Tree of Life

A virus simulation highlighted by Professor Lander:


JME Molecular Editor

Virtual Genetics Lab

MITx – 8.01x Classical Mechanics

As anyone who has taken a course in physics knows, by far the most difficult course offered, whether online or residential, is a course in physics. While I might have spent a few minutes, or even half an hour, on a particular biology problem, there were times that I struggled for *days* solving a single problem in 8.01x. The instructor, Walter Lewin, is a great lecturer, combining an entertaining pedagogical style with unforgettable in-classroom experiments and a genuine passion for teaching physics. 8.01x is a rigorous and challenging study of introductory physics, and the lectures and problem sets should replace every high school AP physics lecture course.

Below is the introductory video for another class by Walter Lewin (8.02X – Electricity and Magnetism). However, I think it is also a great introduction to this course.




I kept (rough) track of the time required for each component of 8.01x. There were 36 hours of lectures and help sessions, although at 1.5X speed it is more like 24 hours. As an aside, I can no longer watch lecture videos at normal speed, nor tolerate lecturers who lack the stage presence to view well at 1.5X, as Walter Lewin and Eric Lander (who teaches 7.00x) do. There are 253 or so interstitial quizzes for about 4 hours (at roughly 1 minute each), 10 hours of textbook reading (I admit, I skimped here!), 10 problem sets at 3-5 hours each for 40 hours and 4 exams at 5 hours each (assuming a tough final) for 20 hours. 8.01x will consume about 100 hours of my time, which over a 15-week period is about 6-7 hours per week. I achieved a grade of 96% in 8.01x.

These estimates may be on the low side compared with the general student population of 8.01x, and I may be off in my own accounting.

In this accounting of time spent, I’m completely ignoring time spent in the discussion group. I was at the low end of usage with 40 or so comments, while others posted almost 3,000 extremely useful comments. As well, I am not including the “extras” that many students do, such as the 300+ pages of LaTeX-formatted notes compiled by one very diligent TA or the research into tricky homework problems. When one considers that the average tutor charges about $65/hr for help in AP physics, I’d estimate that community TAs provided hundreds of thousands of dollars worth of free tutoring.

So what is essential to the student of physics? You may say all of this and more (labs, in-class discussions, one-on-one office hours, peer study groups, etc.), and you would be correct, but as a minimum the lectures and quizzes are, as Professor Lewin might say, non-negotiable. The lectures are so well done, with in-class experiments, thoroughly entertaining, concise yet rigorous, that I believe this set of 8.01x lectures represents, in some very real sense, a platonic ideal of what an introductory course in classical mechanics should be. Just as important, the lectures themselves are accessible to a wide range of students, including high school students taking AP physics. At 3-4 hours a week, one could imagine a future in which all lectures in classical mechanics are an evolved or derived form of 8.01x. Because these lectures are so accessible, a bright future might include a wider audience learning the basics of elementary physics with a relatively modest time commitment.

But the larger time and commitment sink, which solidifies much of one’s understanding of the material, is the problem sets and the assessment/evaluation process (exams). The commitment required to complete the problem sets is greater than that needed to watch the lectures and try the quizzes, at least in my experience.

One might imagine a hierarchy of problem sets. At the lowest level, the casual student might be asked to simply regurgitate the lessons learned in lectures with simple modifications: various iterations of Atwood machines or capstans. These assignments might allow for infinite guesses until the student displayed mastery (much as in the current system). At this level, and also experienced in this course, small variations on worked examples in the text would also be appropriate.

Imagine a society with a larger fraction of advanced high school students and first year college students displaying a mastery of just the lecture materials. IMHO, this would be beneficial.

At the next level, more difficult twists on the worked examples, perhaps with fewer allowed attempts, would allow an assessment of students able to go beyond and generalize the presented materials. Ratcheting up the assessments to the level seen in the current 8.01x, with related or new material in problem sets and exams and limited attempts on exams, allows for differentiated assessments. Finally, the most excellent problems, such as the leaning ruler, intended to provide the ‘ah ha’ moments that signal an increase in intuition, a firm grasp of the concepts and the teaching of the “elegant” solution, with perhaps a single try, represent the highest level of traditional instruction.

That is what I love about the MITx courses, the limited try, classic problems that you know have an elegant solution and that ‘ah ha’ moment of solving. This is how budding scientists are identified and trained today, at least in the classroom setting.

Now imagine a MOOC with dozens or even hundreds of examples and variations on each of these problems found in the homework sets and exams that allow for the student to be led to increasingly difficult assessments for each concept covered in the lectures. Imagine a tailored courseware that leads the student to push their own ability. I believe this will be how you can produce budding scientists.

All of this is to say, the current edX implementation of 8.01x is surely evolving. What we see now is just one small slice of a continuum of assessment tools (i.e. problems sets/exams) with a rather elementary way of setting a reasonable threshold, number of attempts, to gradually explore and refine the education system of the future. I’m happy that the course designers are tinkering with one lever, the allowed attempts, hoping that their vision of edX leads future students to better outcomes.

I read an interesting comment about MOOCs and education in general in a news article recently that is germane, although you may find my analogy long and torturous to follow.

One reader described the ‘military’ model of training, in which (after the boot camp winnowing process) no one is allowed to fail certain tests: enlistees are required to continue a course of study, and are continually tested, until they pass, usually with a very high proficiency rating.

What fascinates me about this insight is that it is the ideal elementary and secondary school education model, where no student progresses until showing proficiency in a subject. Unfortunately, the current labor-intensive education model makes this ideal almost impossible to achieve in your average US public school.

Stanford – Machine Learning

One of the first modern MOOCs, taught by Stanford Professor Andrew Ng, who does a great job with a difficult subject. Ng is a cofounder of Coursera, and his Machine Learning course was offered for the fourth time. There were a couple of innovations this time around. Previous students offered tutoring services, with prices ranging from *free* to $30 for half an hour, through Google Helpouts, a service which provides video chat. This is an interesting experiment toward producing some kind of revenue stream for a MOOC.

Andrew Ng’s Machine Learning is a well designed course covering many of the commonly used algorithms used in, well, machine learning. The course consisted of 18 lectures in ten weeks broken into 5-10 segments, each about 10 minutes long, for a total of just under 20 hours of lectures, or 16 hours at 1.25X speed. Between each segment there was a multiple choice question covering the material just learned, for a total of 112 questions (2 hours @1 minute each). There were 90 review questions for 3 hours @2 minutes each. As well, there were 9 programming assignments, each taking 2-3 hours, for a total of about 25 hours. The total time commitment for the course was 46 hours, or about 5 hours per week over the 10 week course.

I did not participate in the online discussion groups, maybe reading a total of 10 posts (looking for resolution to an octave installation issue) and posting nothing myself. I did no outside reading, nor did I follow many tangents from the class. I achieved a 100% overall for the course.

The course covered a number of algorithms in machine learning, starting with linear and logistic regression trained by minimizing cost functions via gradient descent. Next we covered neural networks, training one to identify handwritten digits, a classic problem in recognizing zip codes for post office sorting. This assignment was pretty cool, using real data. The algorithm was about 99% accurate, and the handwriting samples it failed to classify were difficult even for a human operator, see below.
[image: handwriting samples the classifier failed to identify]

After neural networks, we covered support vector machines (SVMs), clustering with K-means and principal components analysis (PCA). Examples and problem sets covered topics in autonomous driving, image compression, email/spam processing and machine vision. The course wrapped up with anomaly detection systems, recommender algorithms (Amazon, Netflix) and large scale datasets.

I actually registered for three similar courses: this one, Caltech’s Learning from Data and the Indian Institute of Technology’s Web Intelligence and Big Data. The Caltech one was over my head, while the IIT one was too easy, like a survey course. The Stanford course was the Goldilocks choice, just right. I stopped doing the Caltech course, but completed the IIT one. Perhaps now I’ll revisit the Caltech offering armed with a greater understanding of the material.

I think that is one of the huge advantages of online learning: multiple options for lectures, assignments and resources. Some of my MOOC buddies loved the Caltech course, so there is no accounting for taste, but there are enormous advantages to the smorgasbord approach to mastering a subject. During the course, I bookmarked a dozen different ML courses, from MIT’s OCW archive to complete programs at other universities.

I took Stanford’s ML course and the other MOOCs with an eye toward the barriers to learning a young student might face. In particular, the three most difficult hurdles I recognized in the ML course were mathematical formalism, linear algebra and vectorized programming.

Andrew Ng frequently acknowledged the potential difficulties of the formalism throughout the course, but to his credit he maintained a good balance between rigor, mathematical sophistication and accessibility. I observed a similar, acknowledged difficulty when I took the Berkeley computer graphics course. For a physicist used to four-vectors, alternative coordinate frames and Einstein indices/Kronecker deltas, the formalism is trivial, but it is easily daunting to those less familiar. Ng’s course was well done, with excellent consistency in his summation indices – ‘i’ and ‘m’ for training examples, ‘j’ and ‘n’ for features. That said, a few auxiliary slides with visual representations of the components of the resulting cost functions would be a nice addition.

The second hurdle is linear algebra. I might be mistaken, but most students are not exposed to it until after the first year of college. I think linear algebra should be a foundational subject for budding computer science students, and there is no reason a formal course needs to be delayed so long in the typical sequence; Khan Academy has excellent tutorials, and Ng included an introductory lecture on the necessary material. We focus on getting programming courses into high schools when a well done, highly motivational and blessedly short intro to linear algebra, with specific examples from computer science, might be more useful to students.

[image: the regularized cost function J(Theta), a double summation]
Which brings me to vectorized coding. One minor satisfaction in the ML course was taking one of the complicated, double or triple summation cost functions and implementing it as one compact line of octave code. Introducing loops is typical for intro computer science classes; it would be beneficial if students were then shown loop avoidance and other vectorized programming techniques early on. I was unhappy at having to succumb and use octave (matlab) for the course. However, after completing ML, I can appreciate Ng’s insistence on it. The complicated optimization formula above, with its double summations, reduces to one line of octave:


J = 1/2*sum(sum((((X*Theta') .* R)-(Y .* R)) .^ 2)) + ...
    (lambda/2) * (sum(sum((Theta .^ 2))) + sum(sum((X .^ 2))));

Again, I see no reason against, and plenty of benefits in, introducing vectorized coding very early in a computer science curriculum. Back in the day, I did a lot of scientific programming on the early models of the Connection Machine, and there was a certain elegance and intellectual satisfaction in writing code that reduced to a single line directly recognizable as the textbook formula.
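The same collapse from nested loops to a single expression carries over to other array languages. Here is a NumPy sketch of that collaborative filtering cost, loop version versus vectorized version; the shapes, sizes and variable names are my own choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_movies, n_users, n_features = 5, 4, 3
X = rng.normal(size=(n_movies, n_features))       # movie feature matrix
Theta = rng.normal(size=(n_users, n_features))    # user preference matrix
Y = rng.normal(size=(n_movies, n_users))          # ratings
R = rng.integers(0, 2, size=(n_movies, n_users))  # 1 where a rating exists
lam = 1.5                                         # regularization strength

# Loop version: accumulate squared error over every rated (movie, user) pair.
J_loop = 0.0
for i in range(n_movies):
    for j in range(n_users):
        if R[i, j]:
            J_loop += (X[i] @ Theta[j] - Y[i, j]) ** 2
J_loop = J_loop / 2 + (lam / 2) * ((Theta ** 2).sum() + (X ** 2).sum())

# Vectorized version: the double summation collapses to one expression,
# directly recognizable as the textbook formula.
J_vec = 0.5 * (((X @ Theta.T - Y) * R) ** 2).sum() \
        + (lam / 2) * ((Theta ** 2).sum() + (X ** 2).sum())

assert np.isclose(J_loop, J_vec)
```

Because R contains only zeros and ones, masking the difference `(X @ Theta.T - Y) * R` is the same as masking each term separately, which is why this matches the octave one-liner term for term.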

Sorry for the long note, but it boils down to two points: multiple, parallel course materials are useful and have only incremental cost in a MOOC model, and three foundational mathematical courselets (index notation, linear algebra and vectorized programming) deserve earlier introduction, in high school, in a computer science curriculum.

Johns Hopkins – Data Analysis

Very intensive and short; a great way to learn R. Highly recommended. Four weeks long, with four challenging problem sets. Excellent coverage of R’s graphics packages and the visual display of quantitative data. I achieved a 99% in this class.

HarvardX – MCB80.1x Neuroscience

Disappointing.

IIT Delhi – Web Intelligence and Big Data

Survey course and a great intro to all sorts of subjects: MapReduce, Hadoop, noSQL, on and on. I lost interest about 2/3 of the way through, and with the start of Stanford’s class I did not watch the last third of the course lectures, although I did moderately well, all things considered, on the quizzes, homeworks and final exam, earning a 71% in the class and a certificate suitable for framing ;->

CalTechX – CS1156x Learning from Data (Machine Learning)

Too hard for me; I switched to Stanford’s ML course instead.

Rice – Interactive Python Programming

Nothing I can’t learn on my own, and I did not appreciate the twice-weekly assignment due dates.

BerkeleyX – CS-191x Quantum Mechanics and Quantum Computation

Excellent course, challenging and rigorous. I decided to drop it and re-register at a later date.

MITx – 2.03x Dynamics

I had to drop this course since I was pretty much loaded up by the time it started.

Single Try to Answer

I’ve read with interest the discussions about whether exam questions should allow more than one guess; here are my thoughts. It is natural that students should want more than a single try, especially on an exam; I often click the check button and realize my mistake only after the red X is displayed.

I think everyone would agree that True/False questions should allow only a single guess. Likewise, questions with multiple parts, where the answer to part (a) dictates the correct answer to part (b), would become a guessing game with more than a single try, or even if the auto-grader provided check boxes for each part of a problem. On the first exam, I scored less than I had hoped for, almost entirely due to my own lack of reading comprehension. For example, on Exam 1 question 1, I misread part (c) as asking to transform the given molecule into a hydrophobic, not hydrophilic, molecule. Five easy points down the drain. After my performance on exam 1, I decided to take my time on exam 2: read the question thoroughly, re-read it and write down the relevant bits, solve the problem on paper, read the problem again, save my answer, go on to the next part(s) and revisit my answer depending on what I might glean from the rest of the problem. This strategy seems to have worked, since I scored higher on the second exam. I’d estimate that half of my mistakes on 7.00x problem sets and exams come from misunderstanding the problem. For me, a large part of learning science is becoming proficient at using the language in a precise manner.

Too often, scientific terms are used loosely by the novice (me!), resulting in sloppy thinking and miscommunication. When I listen to the lectures, I often marvel at the rapid fire, precise use of the language by Professor Lander. One of my many take-aways from 7.00x is that if it ends in -ase it is an enzyme, and if I need an enzyme, I take the protein or function wanted and add an -ase. I also admire those students taking this course who are not native English speakers; I could not accomplish what they have if this course were in another language, never mind pass one of the exams.

One particularly difficult issue in creating a course must be coming up with questions that are challenging, not easily google-able, fair (at the same level of difficulty across courses), thorough in testing the subject material and able to differentiate students without being too easy or too hard. My humble opinion is that edX, and in particular MITx, has struck a very good balance, making these courses comparable to their offline counterparts in rigor and difficulty while recognizing the inherent differences between the two approaches to learning.

Online courses are in their infancy, the auto-grader will surely evolve, and there are unique challenges and opportunities not found in offline courses. For example, students in the offline course are only given a single chance to answer questions but have a larger support system in place with tutoring sessions, dozens of accessible peers, etc. Students in 7.00x are given multiple chances in homework sets and have hundreds if not thousands of accessible peers with transcripts of dozens of relevant discussions for most every tricky problem. 7.00x students use cool embedded tools to solve problems, like jsMol, GeneX and build-a-molecule, while offline students draw free-hand diagrams in some exam questions and have the opportunity for partial credit in some problems. On balance, I think the advantages and disadvantages even out.

One last point that I’d like to make is that we are the test subjects of a grand experiment. These MOOCs are still a work in progress and researchers are able to collect a tremendous amount of data on student progress, course materials and student outcomes. I, for one, feel that is a small price to pay for these most excellent courses. So enjoy the challenge of the single shot at answering a difficult question knowing that your earnest efforts will not only increase your own subject matter knowledge, but will hopefully improve the entire education process for future students.

6.00x

One feature I really liked about 6.00x was the way the online lectures were implemented. Some courses have video of a professor at the front of a classroom writing on a board (or smart board), which feels impersonal to me. The lectures in 6.00x, on the other hand, were much more intimate. I loved it when one of the 6.00x professors would start a lecture with ‘Welcome back.’ Videotaping the lecturers speaking directly into the camera and overlaying the tablet output or IDLE session was a strong point of this course. After experiencing 6.00x, I now find it hard to watch video lectures that don’t include both a running transcript and a 1.5x speed option.
I appreciated the weekly updates; they made me feel an urgency to maintain whatever momentum I had built up after finishing a problem set, and they were a good reminder that the staff was continuously involved with the course. Other online courses feel “canned”, like there is no one sharing the ups and downs that students experience as a course unfolds. 6.00x felt as though the edX staff and professors were following along with me (and apparently tens of thousands of others) throughout the entire four month long course.
I also enjoyed the “extra” stuff put into lectures, for example the self guided excursions into power sets and the guest lectures at the end of the course. The statistical fallacy lecture was very good as well as entertaining; the example of Anscombe’s quartet and the limits of summary statistics was brilliant.
I thought the auto-grader was very impressive; many courses just have simple multiple choice or True/False questions. Accommodating actual code submissions, allowing multiple attempts and showing test case output really helped me with the learning process. The idea of peer grading is interesting, but it may need some fine tuning to get right. The 6.00x auto-grader has frankly ruined me for many of the Coursera classes, which require a 1990s style download/upload of a run script while you figure out the auto-grader’s foibles, or worse, grade programming submissions by peer evaluation.
Finally, I thought the homework problems themselves were done extremely well. They had the right amount of difficulty (for me, anyway): hard enough to extend the lessons from the readings and lectures, but not impossible. The practical nature of many of the assignments also made them not just informative but fun.
I am also enjoying 7.00x (Intro to Biology with Eric Lander), which is a great course: well made, with excellent courseware tools, thought provoking lectures, rigorous, challenging and exciting. I have a new found respect for biology from taking 7.00x. I’ll probably extend my thoughts once I finish, depending on whether I pass ;->
I found 8.MReV well done for the range and depth of its problem sets. I wish the lectures had been more developed, but as a review course, 8.MReV struck the right balance between time, effort and take-away.
I did not enjoy the BerkeleyX graphics course (184.1x) as much, although I’ve got to say, its auto-grader, which does a pixel by pixel comparison of submitted graphics output, was pretty cool.

Autograder

I got question 7(e) wrong and should have been graded down on the two previous parts as well, but wasn’t, because my submitted answer was within the tolerance of the auto-grader. My bad. As soon as the correct answer was provided, I compared my calculations, realized my boneheaded mistake and moved on. Sure, I can no longer get a perfect score in 8.01x (well, couldn’t anyways…) and prove that I am smarter than a 5th grader, or 13th grader, or even an auto-grader, but I’m not taking this course to prove that. (Humility intended.)
That said, there are several aspects of online courses and evaluations that are fundamentally different from offline courses. One that many experienced on this past exam is how problem sets are graded. I got a higher score than I should have on exam 2 because of implementation limits in the auto-grader; the element of human discretion in grading is not part of an online course.
Take partial credit, for example. In many courses, if you submit an answer that differs by a minus sign, the human grader might give you partial credit. Likewise, showing all of your work might earn partial credit when the final answer is wrong due to an arithmetic mistake. In my own life, I had a problem that ran for two pages of summing a series of -1, 0 and 1 (Clebsch–Gordan coefficients, anyone?). I made a simple mistake and got partial credit (4 out of 25, but hey, better than a big fat red X – in pen).
As a further example of partial credit, think of the concept questions in this exam, and in some of the lecture questions, where you need to mark all statements that are true. If there are seven choices with two true statements, and you mark the two true ones plus one false one, then you judged six of the seven choices correctly, which arguably deserves a score of 6/7, not 0.
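A tiny sketch of such a per-option grader (my own function, not edX’s) makes the arithmetic concrete:

```python
def per_option_score(true_options, marked, n_choices):
    """Score a mark-all-that-apply question per option, not all-or-nothing.

    Each of the n_choices options is judged independently: marking a true
    option, or leaving a false one unmarked, counts as a correct judgment.
    """
    wrong = len(set(true_options) ^ set(marked))  # mismatched judgments
    return (n_choices - wrong) / n_choices

# Seven choices, two true; the student marks the two true ones plus one
# false one: six of the seven per-option judgments were correct (6/7).
score = per_option_score({"A", "B"}, {"A", "B", "C"}, 7)
```

The symmetric difference between the true set and the marked set is exactly the set of options the student judged wrongly, which is what makes the function so short.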
Instead of fine tuning the essentially binary decision of the current auto-grader, I suggest a different approach (for the 2020 8.01x students). Future auto-graders might allow the student to “show all work” and receive partial credit. This may seem far fetched, but 6.00x, Intro to Computer Science, worked something like this: code samples were graded by passing a series of test cases, from the banal to the intentionally devious, and the score was scaled by how many cases you passed.
Similarly, a calculation or derivation that strayed at an arithmetic juncture but displayed an understanding of the physics might receive some partial credit. And those calculations that deviated wildly from the ‘canned’ approach but gave a numerical answer that matched exactly could be incorporated by the ‘smarter’ auto-grader for future comparisons. Think how much nicer it would be for the after exam answers to be generated from the most used correct approach, with an option to display unique calculations as well.
As for today, one answer to numerical precision issues is to require a formulaic answer in graded problems instead of a numerical one; a numerical ‘check’ using given variable values could be an additional part of the problem.

Drop out rate

Many register for a MOOC just to see the material and satisfy their curiosity, some to judge the scope of a course; some ‘audit’ the course with no intent to do all of the work, and others register as a placeholder for when the course starts. I think this accounts for the majority of those who register for any free MOOC. For example, between edX and Coursera, I have registered for 12 classes since late August knowing full well that I did not intend to complete all of the work associated with every class. One Coursera course became silly and I just stopped doing any of the work. One edX course was not helping me understand the material, so I registered for the Coursera class instead; although I am still registered for the edX version, I am not doing any of the work. Of the 12, I’ll complete seven (or maybe 8, we’ll see). Since there is no stigma attached to registering and sampling a course, I believe many others are doing the same thing.
For 8.01x in particular, I convinced a couple of high school AP physics teachers to take a look at the course, which required registration. Both were very happy to see the material, and one told me he used some of the ideas in his own classroom, but neither is taking the exams. Another 8.01x registrant from my own town stopped doing the work early on for lack of preparation.
I think the better statistic is that 4500 people took exam 1, 3200 took exam 2 and 70% of those scored 60 or higher (according to the course updates). In that case, more than 2000 people should end up passing the course (about a 50% pass rate), which means a couple of thousand people got to experience a fraction of what an MIT undergrad would in 8.01. Not bad for free, and what a boon to society, and credit to MITx, that a few thousand more people on earth have a more than rudimentary understanding of elementary mechanics!

Sep 10 13

MOOCs – Fall 2013

by Wise Guy

I got a lot of feedback from the last few posts describing my experience with some of the MOOCs. Today, I do the worst possible thing and discuss my plans for fall courses. At this point, I have completed seven courses – three from edX, three from Coursera and one from Udacity – while ‘auditing’ another four. This fall, I plan on taking/auditing a bunch more.

MITx 7.00x

MITx 7.00x is an Introduction to Biology. The lead instructor is Eric Lander, a well known, accomplished scientist and professor. I am looking forward to this course since I know the incredible strides microbiology, biochemistry and genetics research have made in the past few decades.

The last time I took a biology class was almost 40 years ago, as a high school freshman. I vividly recall the first day of class, when the teacher asked us students to write down one question, any question, no matter how stupid or silly, that related to biology. Having been a regular Fidelity House Day Camp tripper, I asked whether the practice of camp counselors cutting open frogs harmed the frogs, the counselors’ assertions to the contrary notwithstanding. My high school biology teacher was shocked and proceeded to vilify my apparent, incurable stupidity for the rest of the class.

While this experience was only part of why I was turned off to biology, I was able to hold the subject in a disdain later reserved for other pseudo-science endeavors such as economics, sociology and psychology. Over the years, I did come into contact with the biological sciences, but managed to hold my prejudice intact. I was awarded several NIH grants to continue my graduate school research. I did a short gig providing a small hand in setting up a computational bioinformatics lab for a first wave (1980s) biotech company. I was an early user of the PDB, finding convincing images of the floppy hinge of mutated Apolipoprotein E4 as a possible physical mechanism for plaque formation in Alzheimer’s patients, and later presented the work in a seminal Ether Dome talk on the subject.

This is all to say that biology, in the 20th century, was not considered a hard science by the snobby intellectual. Time to shed that prejudice! Although 7.00x does not begin until next week, the reading list was distributed, and all indications are that biology has come a long way in 40 years. The first reading assignment covers the chemical components of a cell, beginning with atoms and building up, block by block, using firmly established chemical and physical principles, to the macromolecules responsible for molecular biology.

I suspect that 7.00x may do more than remove a long held prejudice occasioned by, what I might claim was, rough handling in my formative years. I expect that 7.00x will kick my intellectual ass.

MITx 8.01x

MITx 8.01x is the ‘classic’ Classical Mechanics course taught to MIT freshmen. I have probably ‘cheated’ a bit, first by having already seen all of the course material in my formal education, but also by participating in the 8.MReV review course this summer. That said, I am looking forward to the lectures by Walter Lewin. I explain this by analogy: in the 1960s, one of the greatest physicists of the 20th century, Richard Feynman, taught the introductory freshman physics course at Caltech. Lore has it that the students were utterly and hopelessly lost and many stopped attending, only to be replaced by an ever increasing number of graduate students, physicists and the merely curious. The end result was the compilation of the Feynman Lectures on Physics, an orthogonal, entertaining and unique resource for understanding physics. Cool link here: http://www.feynmanlectures.info/

Web Intelligence and Big Data

This Coursera class is interesting but perhaps not very rigorous. I’ll report more on this later.

A History of the World since 1300

I want to see how Princeton implements an online class. Plus, everyone loves a good story!

Aug 26 13

MITx 8.MReV

by Wise Guy

Hello again! Today I will relate my experience taking my seventh MOOC and my fourth on the edX platform. First, a little background: I had no intention of taking this course, since I have never, ever taken a course during the summer, which I generally reserve for necessary business and vacation time with my family. I’ll admit, this course came in third, or maybe even fourth, place in terms of priorities. That said, it was a worthwhile class and I highly recommend it.

MITx 8.MReV

8.MReV is a college-level introductory mechanics class using a strategic problem-solving approach, designed for teachers of college or Advanced Placement physics. The students ranged in age from 14 to 80, and I heard estimates of about 10,000 participants, though I am not sure how many completed the course. Judging by the introductions in the discussion forum, there were, in addition to teachers of physics, many current students and lifelong learners.

The course ran from about mid-June through the end of August, with optional material running through the middle of September – a condensed schedule covering a semester’s worth of material. Over the graded 11-week course, there were approximately 40 video lectures, a dozen or so interactive Java applets, about 100 pages of notes and almost 1000 graded problems on various topics in classical mechanics. The course followed the SIMs (system – interactions – model) framework developed by Dr. David Pritchard for a uniform approach to problem solving. The course did not attempt to teach the underlying concepts, assuming some familiarity already; instead it was designed to teach proficiency in problem solving.

My decision to take this class was made at the last minute; unfortunately, I was away at the start of the course and twice during the summer. While I do have a formal background in physics, I had not studied this material since my high school AP course more than 35 years ago. I had registered for the usual sequenced classical mechanics course, MITx 8.01x, which starts in September, and rationalized that any work I put into 8.MReV would be helpful. I was in for a surprise.

The coursework consisted of about 400 graded checkpoints scattered as exercises throughout the lectures and notes. Generally these were true/false questions, multiple choice and a fair number of formulaic answers. Unlike many other courses, the number of allowed guesses was limited: one for true/false (doh), two or three chances for multiple choice and between two and ten guesses for questions requiring a formula. Answers to the exercises were not provided until after the due date, which was sub-optimal for learning.

In addition to the 400 checkpoint exercises, there were approximately 300 homework problems, many of them very hard. In my opinion, this is where the course really shined: some of the best problems in classical mechanics were contained in the homework sets, and I had forgotten the beauty of finding an elegant solution to a challenging problem (more on this below). There were also 8 quizzes with an additional 160 problems; generally the quizzes were much easier than the homework sets, although you were given far fewer tries to get the correct answer. All together there were about 1000 problems in this course, with about 100 due every Sunday. I spent an average of 6-8 hours per week on this course, and certainly not enough time on the homework sets.

The material covered Newton’s Laws, equations of motion, kinematics, mechanical energy and work, dynamics, linear momentum, impulse and inertia, torque and rotation, rotational energy, angular momentum, orbital mechanics and harmonic oscillations. Problem sets included examples using systems of blocks and pulleys, Atwood machines, pendulums, springs, inclined planes, friction, projectiles, collisions, ladders, rolling, slipping, skidding balls, yo-yos, tether balls, merry-go-rounds and many other simple systems.

One of the highlights for me was the lecture by Walter Lewin who showed that rolling dynamics were independent of mass and extent and depended solely on the geometric properties of the object. Check it out.

Problem Solving Skills

The most useful components of 8.MReV, in my humble opinion, are the 1000 problems. I know what some of you are thinking: who in their right mind would spend about 100 hours on introductory physics problems during the summer? The more practical among the readers of this post might ask, what possible value could a middle aged guy get from a course like this? Sure, if you are a teacher or student of physics, then maybe one could understand taking a course like 8.MReV (or 8.01x, etc.), but what about the remaining 99.993% of the population?

I believe there are three responses, each revealing a deeper reason why you might consider such a course. First, no matter what your age, using your mind in a constructive way is a good thing. Studies have shown that mental health and quality of life are improved by exercising your mind, and lifelong learners know that understanding brings a kind of happiness not found elsewhere.

Second, 8.MReV covers topics every person should have some grasp of. Recently, I discussed quadcopters with a non-native English speaker; since we both had a basic understanding of classical mechanics, the depth of our discussion was greatly enhanced. More importantly, robust problem solving skills are immediately applicable to practical problems in just about any field. For example, some of us have seen elegant programming implementations, but more often we see convoluted, coded solutions that obscure the logic and are prone to errors. Becoming proficient at finding the elegant solution to a problem that is analytically tractable transfers to real world problems that may have no analytic solution.

Most importantly, problem solving is a key skill of a well educated, 21st century knowledge worker. Anyone can be taught to grind out a solution using a rote method, but many problems faced outside of an academic environment require independent thought. 8.MReV, in the tradition of many physics courses, uses problem sets to nurture this ability. The most effective way to develop this skill is by solving real world problems where you can touch and see the system (a yo-yo, a tether ball, a merry-go-round) and build a sensory intuition along with the analytic ability to describe it accurately.

Sample Problems

I have included a sample of four problems below to illustrate the type found in 8.MReV. One neat aspect of the course is that the level of mathematical sophistication was actually quite low: algebra, basic trigonometry and a little calculus in some of the derivations, but only once or twice in any of the problems. In fact, most of the problems could be solved with simple algebra. The beauty of the problems illustrated below is that it is not the mathematics that is challenging, but understanding the physics. Many of the problems in 8.MReV could be solved without using any math at all.

One way to approach the problem below is to consider the extreme values the variables might take on. What would the distance be if the bike were traveling at an incredibly high speed? Alternatively, what would it be if the car accelerated very slowly? By looking at these two extremes, the solution for the typical case can be found without any math at all.

A car is stopped at an intersection with a red light, and a biker (in a bike lane) with velocity “v” is approaching the car from behind. When the biker is a distance “d” from the intersection, the light turns green, and the car begins to accelerate at a constant acceleration “a”. In terms of only “d” (not “v” and “a”) calculate the distance at which the biker catches the car if he only barely catches it while it is accelerating.

This trick, checking the cases where the variables take on extreme values, is handy in almost any field; I have used it frequently in solving problems in finance, computing and elsewhere.
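For the skeptical, here is a quick numeric check of the ‘barely catches’ condition (my own sketch, not part of the course materials; I place the light at x = 0):

```python
import math

def catch_point(d, a):
    """Distance past the light where the biker just barely catches the car.

    The car starts from rest at the light: x_car(t) = a*t**2/2.
    The biker starts a distance d back at constant speed v: x_bike(t) = v*t - d.
    Catching means v*t - d = a*t**2/2; 'barely' means that quadratic has a
    double root, which forces v**2 = 2*a*d.
    """
    v = math.sqrt(2 * a * d)   # slowest speed that still catches the car
    t = v / a                  # time of the grazing catch (the double root)
    return 0.5 * a * t * t     # car's position at that moment

# The result is d for any choice of a (and hence v): the catch happens a
# distance d past the light, so the biker rides 2*d in total.
```

Algebraically, the return value is v²/(2a) = (2ad)/(2a) = d, which is why the answer can be stated in terms of d alone, just as the problem demands.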

The solution to the problem below is not something you might guess, but again the mathematics is quite simple; an understanding of the words used, “average” and “horizontal”, provides the solution. Frequently, the most challenging part of solving a problem is using the correct language, in a precise way, when describing it.

Two soccer players kick a ball back and forth toward each other. They start off 50 meters apart, and walk toward each other with equal speeds. They kick the ball continuously until they meet in the middle one minute later. What is the magnitude of the ball’s average horizontal velocity in meters per second?
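Once the words are parsed, the arithmetic is a two-liner (my reading: the ball begins at one player’s feet and ends at the midpoint, 25 meters away):

```python
start, end = 0.0, 25.0   # meters: ball begins at one player, ends at the middle
elapsed = 60.0           # seconds: the players meet one minute later

# Average velocity is net displacement over elapsed time, regardless of the
# back-and-forth path the ball actually travels.
avg_horizontal_velocity = abs(end - start) / elapsed   # about 0.42 m/s
```

The back-and-forth kicks are a distraction; only the net displacement enters the average velocity.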

The problem below has a simple result that you can discover this winter on Spy Pond.

You stand at the end of a long board of length L. The board rests on a frictionless frozen surface of a pond. You want to jump to the opposite end of the board. What is the minimum take-off speed v measured with respect to the pond that would allow you to accomplish that? The board and you have the same mass m.

This last problem has an elegant solution that you can discover for yourself this summer, although it might be easier for you, rather than your dog, to be in the boat.

A dog sits on the left end of a boat of length L that is initially adjacent to a dock. The dog then runs toward the dock, but stops at the end of the boat. If the boat is H times heavier than the dog, how close does the dog get to the dock? Ignore any drag force from the water.
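A center-of-mass check of the dog-and-boat problem (H and L arbitrary, and taking the dog to start at the end farther from the dock): with no drag, the center of mass stays put, so as the dog runs a length L toward the dock relative to the boat, the boat slides back by L/(1+H).

```python
# Center-of-mass check (H and L arbitrary). Dock at x = 0; the dog
# starts at the boat's far end, x = L, and runs to the near end.
H, L = 5.0, 4.0            # boat is H times heavier than the dog
m_dog, m_boat = 1.0, H

boat_shift = m_dog * L / (m_dog + m_boat)  # boat recoils away from the dock
dog_final = 0 + boat_shift                 # dog ends at the boat's near end,
                                           # which has slid back by boat_shift

assert abs(dog_final - L / (1 + H)) < 1e-9  # dog gets within L/(1+H) of the dock
```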

Results

I finished the graded portion of 8.MReV on Friday, although I intend to give the optional sections a try. I scored 96% on the checkpoint exercises, 92% on the quizzes and 86% on the homework, for a final score of 88%; the homework accounted for most of the grade. In conclusion, I highly recommend this course for every high school AP physics teacher, for aspiring students of any science and for anyone who teaches introductory mechanics at any level. It would also be useful for graduate students in physics studying for their qualifying exams, as a review of the material and the kinds of challenging problems typically found there. However, I think that those who would benefit most from 8.MReV are people who need to solve problems in their day-to-day lives, which is to say everyone.

May 17 13

MITx 6.00x

by Wise Guy

Not sure if anyone cares, but here is an update on the online courses I started back in February; original post here. Since that time, the BerkeleyX computer graphics course, 184.1x, has ended, and two others, MITx’s Introduction to Computer Science, 6.00x, and the University of Washington’s Computational Finance and Financial Econometrics, are coming to a close. HarvardX’s Greek Hero class is still ongoing but, as in my first post, remains my most neglected course.

Today, I’ll relate some of my experiences with the MIT course, which I found to be the best implemented of all four courses as well as the most valuable learning experience. Hopefully, now that my workload has dropped, I’ll write up my experience with the Berkeley course soon.

MITx:6.00x

The first third of 6.00x covered basic Python programming and some classic topics in computer science: search and sort algorithms, recursion, orders of complexity, topics in debugging, that sort of thing. The lecturer, Eric Grimson, is one of those super smart people who can introduce really neat concepts in a simple pedagogical style. The course material started easily enough with examples drawn from mathematics: factorials and Fibonacci sequences to introduce recursion, root-finding problems for divide-and-conquer algorithms, etc. At one point, we were working through a simple algorithm to find the greatest common divisor (GCD), which my fourth grader had been learning in elementary school, when Grimson made the connection to Euclid, whom some credit with creating the first computer algorithm, and went further to show the connection to discovering primes for data encryption techniques. Although I was familiar with all of these topics (I worked with a friend on developing RSA encryption of bank cards decades ago), Grimson’s genius was in tying them all together in a simple exposition of basic programming.
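Euclid's idea fits in a couple of lines of Python. This is a generic sketch of the algorithm, not the course's actual code:

```python
# Euclid's algorithm: the GCD of a and b is unchanged if we replace
# (a, b) with (b, a % b), and gcd(a, 0) is simply a.
def gcd(a, b):
    """Greatest common divisor via Euclid's algorithm (recursive)."""
    return a if b == 0 else gcd(b, a % b)

print(gcd(48, 36))   # -> 12
```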

Besides the algorithms from math, we had problem sets on a couple of simple text-parsing programs, including word-guessing games and Hangman, as well as programmed solutions to other games like the Towers of Hanoi, plus data encryption. All fun stuff. The course was relatively straightforward, and then came a midterm exam and my first test in years.
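The Towers of Hanoi is a classic showcase for recursion. A sketch in the spirit of that problem set (not the course's actual code): move the top n-1 disks aside, move the largest disk, then move the n-1 disks back on top.

```python
# Towers of Hanoi, recursively: solve the smaller problem twice around
# one move of the largest disk. A tower of n disks takes 2**n - 1 moves.
def hanoi(n, source, spare, target, moves):
    if n == 0:
        return
    hanoi(n - 1, source, target, spare, moves)   # park n-1 disks on the spare
    moves.append((source, target))               # move the largest disk
    hanoi(n - 1, spare, source, target, moves)   # stack n-1 disks back on top

moves = []
hanoi(3, "A", "B", "C", moves)
print(len(moves))   # 2**3 - 1 = 7 moves
```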

The first 6.00x midterm exam was available from Thursday, March 21 to Monday, March 25. The test was structured for 90 minutes, but students were allowed 12 hours to complete it, since many come from regions with slow internet connections and questionable access to power. I took the test on Friday, March 22, starting at 7:00 am. I had to stop after the first half hour to drive to work, and again from 8:30 to 9:30 to take care of some business. I did finish in 4 hours, but I spent significantly less than half that time working on the test. I used only the reference material provided, along with the Python IDE (IDLE) and shell. I got a 92 on the exam, screwing up some True/False questions, which made me feel good, but my happiness was short-lived.

Immediately after the test, I became a bit complacent and worked on some other courses, life issues and business. Frankly, I neglected 6.00x for a week. Big mistake! The lecturer in the videos changed, which was a bit disconcerting, but more importantly, we hit an inflection point in the course with the introduction of object-oriented programming. I mentioned in my last post that I never took a computer science class, but I have some experience in programming. Sure, I have seen and modified plenty of object-oriented code, but I never took the time to actually learn the basics. Well, let me tell you, the second segment of 6.00x was completely different from the first set of lectures. The next problem set, which had a two-week deadline straddling the exam period, was friggin’ hard, testing my basic understanding of classes, methods and inheritance. The problem was to write an RSS news feed parser (RIP Aaron Swartz). I worked hard on this problem set and finished it over three days. Ever experience the feeling when you struggle with a set of new concepts and you just don’t get it, and then one day, bang! you understand it? Well, I guess problem set 6 in 6.00x was the “birthing” pain I had signed up for.
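For flavor, here is a minimal illustration of the class/method/inheritance ideas that problem set exercised. The class names and data are made up for illustration, not the actual problem-set code:

```python
# A tiny class hierarchy in the news-story spirit of that problem set
# (hypothetical names and data, not the 6.00x starter code).
class Story:
    def __init__(self, guid, title):
        self.guid = guid
        self.title = title

    def get_title(self):
        return self.title

class NewsStory(Story):                      # NewsStory inherits from Story
    def __init__(self, guid, title, summary):
        super().__init__(guid, title)        # reuse the parent's initializer
        self.summary = summary

item = NewsStory("42", "Ice out on Spy Pond", "Spring arrives early.")
print(item.get_title())   # inherited method -> "Ice out on Spy Pond"
```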

The next few lectures solidified my newfound understanding of OOP with an introduction to some simple graphing (pylab) and a cool problem set simulating an iRobot Roomba cleaning machine; my kids loved watching the simulation of hundreds of simulated Roombas “cleaning” a 1000×1000 grid.
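The core of that simulation is just a random walk over a grid until every cell has been visited. A stripped-down sketch of the idea, with made-up parameters (the real problem set modeled room dimensions, robot speed and multiple robots):

```python
import random

# Toy Roomba: a single robot random-walks a width x height grid until
# every cell has been cleaned, and we count the steps it took.
def steps_to_clean(width, height, seed=0):
    rng = random.Random(seed)      # fixed seed for a repeatable run
    x, y = 0, 0
    cleaned = {(x, y)}
    steps = 0
    while len(cleaned) < width * height:
        dx, dy = rng.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
        nx, ny = x + dx, y + dy
        if 0 <= nx < width and 0 <= ny < height:   # stay put at the walls
            x, y = nx, ny
        cleaned.add((x, y))
        steps += 1
    return steps

print(steps_to_clean(5, 5))
```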

The course then jumped into some concepts in statistics and probability to set us up for a problem set involving a virus/drug simulation trial, which was pretty cool. At some point, life intervened and I didn’t finish this problem set, but I did enough work that I finally mastered some basic OOP concepts. I was lucky to have a good understanding of stats, so the coin-flipping examples and card-drawing exercises were easy and I could focus on the coding. I finished the ninth week doing curve fitting with stochastic simulations of drug treatment plans, building on the virus simulations of the previous weeks. Fortunately, the course provided robust (executable-only!) solutions to the problem set I had failed to complete. Whew. This is one shortcoming of other online (and offline) courses that build on previous lectures: if you miss a concept, you cannot complete the course. I found that the 6.00x staff and professors handled this issue quite well.
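The coin-flipping warm-ups boil down to Monte Carlo estimation: simulate many trials and watch the sample frequency converge to the true probability. A sketch of the idea, not the course's code:

```python
import random

# Monte Carlo estimate of the probability of heads for a fair coin:
# flip `trials` times and return the observed frequency of heads.
def estimate_heads(trials, seed=0):
    rng = random.Random(seed)                  # fixed seed for repeatability
    heads = sum(rng.random() < 0.5 for _ in range(trials))
    return heads / trials

print(estimate_heads(100_000))   # close to 0.5
```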

This brought us to the second midterm exam. Same rules as before, but I was much better prepared. I took my time and had to deal with a bunch of unrelated stuff, but I finished the exam and scored around the same as before. I looked at the discussion forum after the exam ended. There had been various discussions about the number of students enrolled (I had seen guesstimates from 20,000 to 60,000). Apparently, the number had dropped significantly after the first exam and even more so after the second. I would love to see the actual enrollment and completion statistics.

Generally, I have avoided the discussion groups, which can provide valuable hints for the problem sets but were also weighed down by complaints and demands for staff help in overcoming hurdles. Frankly, problem solving is the whole point of most academic courses of study. Reading some of the discussions well after the fact brought smiles, since a lot of people struggled with the same stuff I had. Misery does love company!

At this point, with one problem set, a couple of lecture series and the final exam left, I have already passed the course. I intend to stick it out and try for a good final grade, but I have already accomplished what I wanted and feel good about 6.00x in particular. I have nothing but praise for the class; the set-up was easily the most impressive of the four courses I have taken so far. I really appreciated the work the 6.00x staff put into the auto-grader, even when it appeared to make mistakes :–()!! But most of all, I thank MIT and edX for making this intellectual exercise possible.

Update

I finished MITx 6.00x on Friday. The last set of lectures was really interesting, covering topics in graph theory and dynamic programming, plus a series of guest lectures by researchers. One in particular, on column-oriented database design, was of great relevance, since that is how I have stored my own data, in what I call case series, for the past 10 years or so. I’ll have to look into the C-Store page to see how a group of smart guys implemented this idea.
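The column-oriented idea is simple enough to sketch in a few lines: store each field as its own list, so a scan over one column never touches the others. A toy illustration with made-up data, not C-Store's actual design:

```python
# A toy column store: one Python list per column, rather than one
# record per row. Scanning a single column touches only its own list.
table = {"year": [], "enrollment": []}

def insert(row):
    """Append one logical row by pushing each field onto its column."""
    for col, value in row.items():
        table[col].append(value)

insert({"year": 2012, "enrollment": 150})
insert({"year": 2013, "enrollment": 165})

# A per-column aggregate: sum one column without reading the other.
print(sum(table["enrollment"]))   # 315
```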

The final exam opened on Thursday. I’m somewhat disappointed in my performance on the final, mostly because I attempted it on Friday while work and other concerns occupied my mind. That said, I’m OK with my overall performance: a final grade of 92 for the course. I’ve already signed up for half a dozen more classes and look forward to finding other courses nearly as good as 6.00x!