Students in a UW summer fellowship program called Data Science for Social Good work to coax valuable information from overlooked data, and one upshot might be improved bus service.
If you’re a regular bus rider, you might think that the area’s transit agencies use the information from your ORCA card to learn which buses are most crowded during rush hour, and to fine-tune the area’s routes.
You would be wrong.
Turns out none of the area’s transit agencies have ever made significant use of the trove of data from ORCA cards — the prepaid, plastic cards used to pay for more than 60 percent of all rides on the area’s nine regional transit systems.
So this summer, a team of Ph.D. students took 21 million ORCA-card readings and wrangled the data into a form that can be used to discover where, and when, we go when we ride the bus.
“They’ve demonstrated how you can create really usable information out of data sets that already exist, that are really big and complicated,” said Mark Hallenbeck, director of the Washington State Transportation Center, a transportation research agency with offices at the UW and Washington State University.
The students ran out of time to explore the data in depth. But they’ll pass their work on to the transit agencies.
The project “can tell the agencies where services can and need to be improved,” said Hallenbeck. And the project demonstrated that the analysis can be done in a way that protects the identities of individual ORCA-card users.
The project was part of a summer fellowship program, Data Science for Social Good, now in its second year at the UW. Run by the UW’s eScience Institute and underwritten in part by Microsoft, the program aims to use data for the benefit of society. It’s modeled after similar programs at the University of Chicago and Georgia Tech. At the UW, more than 200 students applied for the 16 fellowships. (The program was open to students from other universities, and about half the participants were from schools other than the UW.)
The projects were not all about transit. Students on other teams researched whether food-product reviews (such as those on Amazon.com) could help regulators quickly discover cases of contaminated foods. They explored ways to map the region’s sidewalks to help pedestrians with disabilities get around. And they showed how data sets could be used to estimate poverty in different cities.
The ideas for the projects were submitted by UW researchers and area nonprofits. Each team of students ranked those proposals based on how interesting they found them, and the fellowship program’s directors made final decisions about which projects would go forward.
While all of the students on the ORCA project are Ph.D. students, none are computer scientists. Alicia Yiqin Shen is working on a Ph.D. in psychology, Carolina Johnson in political science, Sean Wang in geography and Victoria Sass in sociology.
But they had one thing in common: They’re regular bus riders.
“Transit is just a topic everybody cares about,” said Wang, who rode the bus to school as a kid growing up in Bellevue.
With the help of Bernease Herman and Anthony Arendt, data scientists with the UW’s eScience Institute, the group started by finding a way to take out the information that could identify individual riders. Then they married the ORCA data with another data source that tracks each bus geographically.
Why wasn’t this done already?
When you tap an ORCA card against a card reader, the system stores the time that you got on the bus, along with the bus’s identification number. But the system doesn’t record which route that bus was traveling at the time, since an individual bus may travel a number of different routes throughout the day.
The transit agencies do keep bus-route information in a separate database, though; it’s the same data used to help riders learn when the next bus will arrive using the popular app OneBusAway, Hallenbeck said. (Incidentally, OneBusAway was also created by UW graduate students, who began the project as a way to improve commuting in the region.)
The students used a powerful computer to stitch the data together, a process that took 36 hours in all. But that was only the beginning.
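The matching step can be pictured with a small sketch. This is not the team's code: the record layouts, bus IDs and route assignments below are hypothetical, and times are simplified to minutes since midnight. The idea is to find, for each tap, the most recent route the same bus started before the rider boarded.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical route log: (bus_id, minute the bus began serving a route, route)
assignments = [
    ("bus_7", 360, "44"),
    ("bus_7", 720, "48"),
    ("bus_9", 400, "271"),
]

# Hypothetical ORCA taps: (anonymized card id, bus_id, minute of the tap)
taps = [
    ("card_a", "bus_7", 510),
    ("card_b", "bus_7", 800),
    ("card_c", "bus_9", 390),  # tapped before bus_9 started any logged route
]

def build_index(assignments):
    """Group route assignments by bus and sort each group by start time."""
    index = defaultdict(list)
    for bus_id, start, route in assignments:
        index[bus_id].append((start, route))
    for entries in index.values():
        entries.sort()
    return index

def route_for_tap(index, bus_id, tap_time):
    """Return the latest route the bus started at or before tap_time, or None."""
    entries = index.get(bus_id, [])
    starts = [start for start, _ in entries]
    pos = bisect_right(starts, tap_time)
    if pos == 0:
        return None  # no route on record yet; flag for manual review
    return entries[pos - 1][1]

index = build_index(assignments)
merged = [(card, bus, t, route_for_tap(index, bus, t)) for card, bus, t in taps]
for row in merged:
    print(row)
```

Taps that match no route come back as `None` rather than being dropped, so they can be counted and inspected later.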
At first, Johnson said, the team thought the merged data could be used as-is. “It looked really clean,” she said.
“But it was all an optical illusion,” Sass added.
For example, one data point showed somebody boarding a bus in the middle of the Pacific Ocean. Another bus trip seemed to start on Vancouver Island. One bus appeared to be traveling in two directions at once, on the same side of the street.
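Errors like the ocean and Vancouver Island boardings are the kind a simple location check can catch. The sketch below is illustrative only: the bounding box is a rough guess at the Puget Sound region, not an official service-area boundary, and the sample records are invented.

```python
# Rough, illustrative bounding box; NOT an official service-area boundary.
LAT_MIN, LAT_MAX = 47.0, 48.3
LON_MIN, LON_MAX = -122.8, -121.5

def in_service_area(lat, lon):
    """True if the coordinates fall inside the assumed bounding box."""
    return LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX

# Invented sample boardings
boardings = [
    {"id": 1, "lat": 47.61, "lon": -122.33},  # downtown Seattle
    {"id": 2, "lat": 47.00, "lon": -135.00},  # open ocean
    {"id": 3, "lat": 49.60, "lon": -125.50},  # Vancouver Island
]

plausible = [b for b in boardings if in_service_area(b["lat"], b["lon"])]
suspect = [b for b in boardings if not in_service_area(b["lat"], b["lon"])]

print("plausible:", [b["id"] for b in plausible])
print("suspect:", [b["id"] for b in suspect])
```

A check like this only flags impossible locations. A bus that seems to travel in two directions at once would need a separate test against the route and direction data.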
What followed was an eight-week exercise in data-cleaning as the team spent morning, noon and night obsessing over how to make the data yield its secrets.
At one point, Johnson said, “I was dreaming in SQL query language” — the computer-programming language used to extract data from a database.
The students also had to figure out a way to estimate the ride patterns of people who don’t use ORCA cards, a group that includes a disproportionate number of riders in low-income neighborhoods, who tend to pay with cash.
The data also doesn’t show where a rider got off the bus, because the card is tapped only when a rider boards. But it’s possible to infer the trip’s end by looking at patterns, Hallenbeck said. For example, a rider who gets on an eastbound bus 44 in Ballard at 8:30 a.m., then gets on that same bus going westbound at 5:30 p.m. from the University District likely got off the bus in the morning in the University District.
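Hallenbeck’s example amounts to a simple rule: a rider’s next boarding point is a good guess for where the previous ride ended. Here is a minimal sketch of that rule, assuming hypothetical tap records and times in minutes since midnight. It is a simplification, not the method the students or agencies used, and it leaves the last ride of the day unresolved.

```python
from collections import defaultdict

# Hypothetical taps: (anonymized card id, minute of tap, boarding area, route)
taps = [
    ("card_a", 510, "Ballard", "44 eastbound"),               # 8:30 a.m.
    ("card_a", 1050, "University District", "44 westbound"),  # 5:30 p.m.
    ("card_b", 480, "Bellevue", "271 westbound"),
]

def infer_alightings(taps):
    """Guess each ride's exit point as the same card's next boarding point."""
    by_card = defaultdict(list)
    for card, minute, stop, route in taps:
        by_card[card].append((minute, stop, route))

    trips = []
    for card, rides in by_card.items():
        rides.sort()  # chronological order within each card
        for i, (minute, stop, route) in enumerate(rides):
            # The last ride has no later boarding to anchor it, so leave it unknown.
            next_stop = rides[i + 1][1] if i + 1 < len(rides) else None
            trips.append({
                "card": card,
                "boarded": stop,
                "route": route,
                "inferred_exit": next_stop,
            })
    return trips

trips = infer_alightings(taps)
for trip in trips:
    print(trip)
```

A fuller version would also check that the inferred exit actually lies on the route the rider boarded, and would treat riders with only one tap a day separately.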
By the time they finished cleaning up the data, the 10-week fellowship was almost up. The students finished the project by creating a computer dashboard so the area’s transit agencies can easily explore ridership patterns on their own.
And now that the students have shown that the data can be useful, and revealed where the pitfalls lie, the agencies should be able to update the database with new information.
The information from ORCA taps should also be far more accurate than current ridership estimates, which rely on sensors that detect when people get on and off a bus. Only about 30 percent of buses have those sensors, Hallenbeck said.
When might bus riders see improvements? Hallenbeck hopes the regional transit agencies will make quick use of the information, perhaps incorporating it in the next schedule change this fall. Knowing where people get on and off the bus, and where they transfer, also might lead to other improvements — such as new bus shelters in popular locations.
The students had only one slice of ridership data, from mid-February to mid-April 2015. Incorporating data from 2016 into the system, after the light-rail line to Husky Stadium opened, should reveal how riders’ commutes changed, Hallenbeck said.
Eventually, the agencies might release aggregated data to the public so armchair transit geeks could play with the numbers and see if they can devise ways to improve the system.
When the students presented their information to their peers Thursday, representatives from the city of Seattle, Washington State Department of Transportation and Sound Transit were in the audience, asking questions.
Sound Transit, WSDOT and the Puget Sound Regional Council have also asked the students to come to agency headquarters to make a presentation.
“We’re really just scratching the surface of what can be done here,” Sass said.