
Challenges of checkpointing in distributed systems include security

Transparent checkpointing can help developers, but challenges crop up with security and storage. A computer science professor explains some of the caveats involved.

In an effort to mitigate the effects of both expected and unexpected service shutdowns, developers and application managers use checkpointing routines to save the state of an application, service or conversation between two or more computers. Doing so makes it possible to return an app to the state it was in before a crash.

As an alternative to application-specific checkpointing, which requires developers to write a new checkpointing routine each time they develop an app, transparent checkpointing aims to be application-agnostic. This approach allows developers to simply write their applications and implement a checkpointing program that works without being tailored to the specific application.

However, transparent checkpointing in distributed systems presents a number of interesting challenges, particularly when it comes to dealing with device and application diversity as well as certain security issues. In the second part of a two-part Q&A, Professor Gene Cooperman of the College of Computer and Information Science at Northeastern University in Boston tackles these concerns. Then make sure to check out part one where Cooperman discusses how transparent application checkpointing will impact developers.

Computing paradigms change, applications change and devices change. How can transparent application checkpointing in distributed systems keep up with the pace of technology without having to tweak those checkpointing processes to be application-specific when applications are becoming so diverse?

Gene Cooperman: Ultimately, because applications change, there are always new APIs popping up. Old APIs can change from time to time. So, we look for the most stable API that we can find. Usually, software is built in layers. At the bottom layer is an operating system; because Linux is free and there are many versions, often it will be based on Linux. Then above that, one needs to make calls to the operating system for system services like write to the screen, read from the keyboard, write to the network … but there's a standard for that. This is called the POSIX API, and our approach is unusual in that we decided to base everything as close to the POSIX API as possible.


What we're betting is that things can change above the POSIX API in all sorts of wild ways, but those are the higher software layers. Before [anything] hits the operating system, [it has] to go through this POSIX API. We put wrappers around this POSIX API so that we can see what services are being asked from the kernel. When it's time to restart later, we will set up the program, copy the memory and then ask the kernel for those same system services that were in use at the time of checkpointing. So, we're hoping that by staying close to the POSIX API, our code will be stable even though the layers above that may change a lot.
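The wrapper idea Cooperman describes can be sketched as follows. This is a minimal, hypothetical illustration, not code from his project: the names `ckpt_open` and `ckpt_restore`, the `fd_table` structure and the table size are all invented here. The point is simply that a wrapper around a POSIX call can record which kernel services the application holds, so that a restart can re-request the same services at the same descriptor numbers.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical sketch of POSIX-API interposition for checkpointing.
   Each wrapped call forwards to the kernel and logs the request, so a
   later restart can replay it. */

#define MAX_FDS 64

struct fd_record { int fd; char path[256]; int flags; };
static struct fd_record fd_table[MAX_FDS];
static int fd_count = 0;

/* Wrapper around open(2): performs the real open, then records it. */
int ckpt_open(const char *path, int flags, mode_t mode) {
    int fd = open(path, flags, mode);
    if (fd >= 0 && fd_count < MAX_FDS) {
        fd_table[fd_count].fd = fd;
        snprintf(fd_table[fd_count].path,
                 sizeof fd_table[fd_count].path, "%s", path);
        fd_table[fd_count].flags = flags;
        fd_count++;
    }
    return fd;
}

/* At restart: replay the recorded open() calls so the process again
   holds the descriptors it held at checkpoint time. */
int ckpt_restore(void) {
    for (int i = 0; i < fd_count; i++) {
        int fd = open(fd_table[i].path, fd_table[i].flags, 0644);
        if (fd < 0)
            return -1;
        if (fd != fd_table[i].fd) {   /* move it back to the old number */
            dup2(fd, fd_table[i].fd);
            close(fd);
        }
    }
    return 0;
}
```

In practice, such wrappers are typically injected with library interposition (for example, `LD_PRELOAD` on Linux) so the application itself needs no changes, which is what makes the approach transparent.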

Is there any kind of security implication in terms of either people getting hold of that save state or being able to take a company's application and checkpoint it and use it for their own purposes -- for instance, bypass security protocols?

Cooperman: In the past, when companies developed security, they weren't really thinking that there is a transparent checkpointing program out there. So, they didn't design their security with that in mind. If they design it with transparent checkpointing in mind, there's no problem. But if they're not thinking about transparent checkpointing, then here are the issues that could happen. In a typical case, if a program represents software that has been bought from another company, there's a license associated with that, and often there's a license server. Every time the program starts up, it calls this license server and asks whether it has permission to keep running. And if it does, then it keeps running. So, the well-structured programs will check with that license server every so often to verify that they still have permission to run.

If they only do this once at the beginning, then a user could checkpoint after talking to the license server. And, now, they can restart and they will never talk to a license server again, even if the license has expired. I haven't heard of any bad cases where people were intentionally trying to get around licenses by this means. But as checkpointing becomes better known, I guess this could happen in the future.

The other trap for large companies [occurs if they have], let's say, four seats which give them permission to run four copies of the software at once, but not more than four. … [They might] checkpoint Monday, keep thinking about what was happening, and, the next day, [they can] restart and continue.

That's fine, but then when they restart, they need to make sure that they're not accidentally using a fifth seat. They need to make sure that the software is again reconnecting with the license server and asking it, 'I need an extra seat now -- is there one extra seat available?' This is a case where the employees of the corporation are not intending to cheat at all; they want to be honest. But they might accidentally use up their seats without even realizing it.
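The defense Cooperman describes, re-checking the license periodically instead of only at startup, can be sketched roughly as below. This is an illustrative assumption, not vendor code: `query_license_server`, `license_still_valid` and `RECHECK_INTERVAL` are names invented for this sketch, and the server call is stubbed out.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical sketch: re-validate the license periodically, so that a
   process restored from a checkpoint cannot keep running on a stale
   permission obtained before the checkpoint was taken. */

#define RECHECK_INTERVAL 300   /* seconds between license server calls */

static time_t last_check = 0;

/* Stub for illustration; a real program would contact the server. */
static bool query_license_server(void) { return true; }

bool license_still_valid(time_t now) {
    if (now - last_check >= RECHECK_INTERVAL) {
        if (!query_license_server())
            return false;        /* license expired or no seat available */
        last_check = now;        /* a restarted process re-queries here */
    }
    return true;
}
```

Because a restore resumes the process with an old `last_check` value, the next call to `license_still_valid` forces a fresh round trip to the server, which closes the checkpoint-and-restart loophole and correctly re-acquires a seat.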

You mentioned before that you are saving data to disk. As amounts of data explode, do you at some point have to ditch the disk and move toward cloud or virtual storage when checkpointing in distributed systems?

Cooperman: Yes, we do. And I think that will happen. We're already in the middle of a transition where many computers are switching from hard disk to SSD (solid-state drive). Intel and Micron have announced a new kind of memory, 3D XPoint, which is even faster than SSD but can still save files permanently. And then, of course, if you're in the cloud, the cloud will usually try to use many disks or SSDs in parallel so that you can write pieces to it faster. For us, we just change the back end of our software. So, if there is a new file system based on parallel disks in the cloud, we just change where we write the data.

On the other hand, there is a different challenge facing us. I'm sure you've written a lot about big data; that's one of the other big efforts. In this supercomputing problem, we had 38 terabytes to save, but because there was a very fast file system, we could do it in only 11 minutes. There are groups that work with data larger than 38 terabytes. And so, up until now, we have not been able to, let's say, transparently checkpoint typical big data-style programs. This is something that's of interest to us for future research.
