Google Summer of Code 2011
Welcome to CernVM Google Summer of Code 2011 ideas page
Below you will find the list of GSoC 2011 project ideas which are grouped into three categories (ideas related to CERN Virtual Machine itself, CernVM File System-related ideas, and CernVM Co-Pilot related ideas). We also introduce here so-called 'Blue sky' ideas which must be elaborated further. The list of ideas should not be considered to be complete, so if you have a cool idea regarding CernVM we would certainly be very glad to hear it. The list of our mentors (and their areas of expertise) can be found below.
We encourage students who are planning to apply for a CernVM project to do so (and to contact us) as early as possible, because we have learned from previous GSoC participants that an initial student application often needs to be reworked in close collaboration with the future mentor.
Before submitting an application please consult the official GSoC FAQ where you can find some good advice on writing a successful application. The application should be submitted through the GSoC webapp before the 8th of April (19:00 UTC). The application template can be found here.
Project ideas
CERN Virtual Machine related projects
CernVM is a Virtual Software Appliance designed to provide a complete and portable environment for developing and running the data analysis of the CERN Large Hadron Collider (LHC) on any end-user computer (laptop, desktop) as well as on the Grid and on Cloud resources, independently of host operating systems. This "thin" appliance contains only a minimal operating system required to bootstrap and run the experiment software. The experiment software is delivered using the CernVM File System (CernVM-FS) that decouples the operating system from the experiment software life cycle.
CernVM release testing
Description: CernVM supports the VirtualBox, VMware, Xen, KVM and Microsoft Hyper-V hypervisors. Each new release of a CernVM image needs to be thoroughly tested with all of them. Although there are only about 20 test cases, the large number of hypervisor/host OS/Virtual Machine edition combinations makes the process very time-consuming. The task would be to develop a program which installs and configures CernVM instances, runs the set of tests and reports the results.
Mentors: Predrag Buncic, Artem Harutyunyan
Requirements: Experience with cross-platform software development; experience with libvirt would be a big plus.
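To give a feeling for why the number of combinations grows so quickly, here is a minimal Python sketch of a test matrix. It is only an illustration, not the design we expect: the hypervisor names come from the description above, while the host OS list, the test names and the `runner` callback are hypothetical placeholders. A real tool would also have to skip combinations that cannot exist (for example Hyper-V on a Linux host).

```python
from itertools import product

# Hypervisors are taken from the project description; the host OS list and
# the test names are hypothetical placeholders for illustration only.
HYPERVISORS = ["VirtualBox", "VMware", "Xen", "KVM", "Hyper-V"]
HOST_SYSTEMS = ["Linux", "Windows", "Mac OS X"]
TESTS = ["boot", "network", "cvmfs_mount"]


def build_matrix(hypervisors, hosts, tests):
    """Return every (hypervisor, host, test) combination to be run."""
    return list(product(hypervisors, hosts, tests))


def run_matrix(matrix, runner):
    """Run each combination through `runner` and record pass/fail."""
    results = {}
    for combo in matrix:
        try:
            results[combo] = bool(runner(*combo))
        except Exception:
            # A crashing test counts as a failure instead of aborting the run.
            results[combo] = False
    return results


def summarize(results):
    """Return a (passed, failed) tuple of counts."""
    passed = sum(1 for ok in results.values() if ok)
    return passed, len(results) - passed
```

With the placeholder lists above the matrix already contains 5 x 3 x 3 = 45 entries; with the real set of about 20 tests and several Virtual Machine editions the number is much larger.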
CernVM File System related projects
The CernVM File System is a FUSE file system developed to deliver High Energy Physics (HEP) software stacks onto (virtual) worker nodes. HEP software is quite large, with tens of gigabytes per release and 1-2 releases per week. At the same time, almost all of the individual files are very small, which results in tens of millions of files. CernVM-FS uses content addressable storage (like the Git version control system) and HTTP for distribution. File metadata are organized in trees of SQLite file catalogs. Files and file metadata are downloaded on demand and aggressively cached. The CernVM File System is part of the CernVM appliance but it compiles and runs on physical Linux boxes as well. It is mostly BSD licensed with small GPL parts. CernVM-FS source code can be downloaded from here, the documentation is available here.
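For readers who have not met content addressable storage before, the following Python sketch shows the basic idea: a file is stored under a name derived from the hash of its content, so identical files are stored only once. This is an illustration under our own assumptions and not the CernVM-FS implementation (which is written in C/C++); the function names and the two-character subdirectory layout are hypothetical.

```python
import hashlib
import os
import shutil


def content_hash(path, chunk_size=65536):
    """Return the SHA-1 hex digest of a file's content."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not have to fit into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def store(path, store_dir):
    """Copy `path` into `store_dir` under a name derived from its content."""
    name = content_hash(path)
    # Hypothetical layout: the first two hex characters form a subdirectory.
    target_dir = os.path.join(store_dir, name[:2])
    os.makedirs(target_dir, exist_ok=True)
    target = os.path.join(target_dir, name[2:])
    if not os.path.exists(target):
        # Identical content maps to the same name, so it is copied only once.
        shutil.copyfile(path, target)
    return name
```

Because the name depends only on the content, two software releases that share most of their files also share most of their storage.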
Replication Orchestrator
Description: CernVM-FS clients get new data from an HTTP server (cernvm-webfs.cern.ch). Cluster installations use local Squid proxies to reduce latency and the load on the backend. Still, the backend webserver represents a single point of failure. While there are already institutions willing to host mirror servers, CernVM-FS lacks an orchestrator to consistently replicate new file system snapshots. The project will allow replication to take place in a distributed manner, i.e. the mirror servers will form a peer-to-peer network.
Mentor: Jakob Blomer
Requirements: Good knowledge of C/C++, Linux, bash scripting and HTTP basics; knowledge of Jabber/XMPP is useful; interest in distributed systems.
Pluggable Cryptography
Description: CernVM-FS uses cryptographic routines in many critical places. Foremost, it uses SHA-1 as part of the data format, because for the content addressable storage all files are renamed according to their SHA-1 content hash. It uses RSA public key cryptography to digitally sign the file catalogs, and AES in an experimental encryption extension. The project will make all of those routines replaceable, in particular the SHA-1 algorithm, as SHA-1 is not fully trusted anymore.
Mentor: Jakob Blomer
Requirements: Good knowledge of C/C++; knowledge of practical cryptography (like symmetric / asymmetric encryption, digital signatures, secure hashes); knowledge of OpenSSL is useful; interest in interface design.
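As a rough illustration of what "replaceable" could mean for the hash function, the Python sketch below selects the algorithm by name and records that name together with the digest, so that data hashed with different algorithms can coexist. The registry, the function names and the `<algorithm>:<digest>` format are assumptions made for this example only; they are not the CernVM-FS data format, and the actual interface design is part of the project.

```python
import hashlib

# Hypothetical registry: algorithm name -> factory for a hashlib-style object.
HASH_ALGORITHMS = {
    "sha1": hashlib.sha1,
    "sha256": hashlib.sha256,
}


def register_algorithm(name, factory):
    """Make an additional hash algorithm available under `name`."""
    HASH_ALGORITHMS[name] = factory


def content_address(data, algorithm="sha1"):
    """Return '<algorithm>:<hexdigest>' for the given bytes."""
    try:
        factory = HASH_ALGORITHMS[algorithm]
    except KeyError:
        raise ValueError("unknown hash algorithm: %s" % algorithm)
    return "%s:%s" % (algorithm, factory(data).hexdigest())


def verify(data, address):
    """Check `data` against an address produced by content_address()."""
    # The algorithm is read from the address itself, not assumed by the caller.
    algorithm = address.partition(":")[0]
    return content_address(data, algorithm) == address
```

Storing the algorithm name next to the digest is one possible way to keep old SHA-1 addresses valid while new data is hashed with a stronger function.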
CernVM-FS enabled OpenAFS Kernel Module
Description: Before any data is distributed using CernVM-FS, it first has to be converted into the CernVM-FS content addressable storage format. For large data sets this conversion is done incrementally (for instance, whenever a further software release is published). In order to do the incremental transformation efficiently, a file system change log is created by a kernel module based on redirfs. The change log can then be used to quickly find the places where files have been modified. Currently the kernel module runs correctly with local file systems and NFS, but it breaks with AFS. As CERN has a long history of operating AFS, we want to solve this problem by patching the CernVM-FS change log logic directly into the OpenAFS kernel module code.
Mentor: Jakob Blomer
Requirements: Good C knowledge, Linux kernel knowledge, no fear of occasional system crashes…
CernVM Co-Pilot related projects
CernVM Co-Pilot is a framework for instantiating a distributed computing infrastructure on top of virtualized computing resources. Such resources include enterprise cloud computing infrastructures (e.g. Amazon EC2), scientific computing clouds (e.g. Nimbus), as well as volunteer computing clouds (e.g. powered by the BOINC volunteer computing platform). Until now, Co-Pilot has been used to carry out computational tasks for the CERN ALICE and ATLAS experiments operated at the Large Hadron Collider (LHC), as well as to run an LHC Monte-Carlo data generation application of CERN's Theoretical Physics Group. CernVM Co-Pilot is distributed as part of CernVM. Co-Pilot source code is available here (directories starting with 'copilot'); the documentation is available here.
Extension of Co-Pilot Job Manager
Description: Currently the submission of new jobs to the Co-Pilot Generic Job Manager is possible only from the machine where the Job Manager instance is running. The CernVM Co-Pilot protocol must be extended to allow remote job submission, job status queries, as well as job output retrieval. After an extended protocol has been developed the changes should be implemented in the corresponding Co-Pilot components (Core, Job Manager, Job submission).
Mentor: Artem Harutyunyan
Requirements: Experience with Perl or Python; network programming experience; experience with Jabber/XMPP would be a big plus.
Co-Pilot monitoring
Description: Currently the Co-Pilot framework lacks monitoring features. The first two steps would be to extend Co-Pilot by implementing a monitoring component and by adding monitoring functionality to the Co-Pilot Agent. The system can be improved further by implementing a web-based monitoring frontend, which will gather data from the monitoring components and display it in a clear way.
Mentor: Ben Segal
Requirements: Experience with Perl or Python; network programming experience; web development experience (HTML/CSS/JavaScript); experience with Jabber/XMPP would be a big plus.
'Blue sky' ideas
- Reliable and scalable storage backend for CernVM-FS. Currently CernVM-FS uses ZFS for the storage backend. The idea is to come up with an architecture for implementing the backend storage system for CernVM-FS. The storage should be reliable and scalable, it should support snapshots and rollbacks, and it should be optimized for hosting many (~ 10^8) small (~ 10 kB) files. The storage will be used to transform large directory trees into content addressable storage. The task would also involve distributing the transformation among a couple of (or potentially many) worker and storage nodes.
Mentors: Jakob Blomer, Predrag Buncic, Artem Harutyunyan
Requirements: Good knowledge of existing storage technologies (e.g. distributed file systems), experience with parallel/distributed programming.
- LHCb adapter for Co-Pilot. The idea is to implement a Co-Pilot system Job Manager which will make it possible to get jobs from the DIRAC system.
Mentor: DIRAC expert + Artem Harutyunyan
Requirements: Experience with Perl or Python, network programming experience, experience with LHCb DIRAC.
- CMS adapter for Co-Pilot. The idea is to implement a plugin (aka Adapter) for the Co-Pilot system which will make it possible to get jobs from the CRAB system.
Mentor: CRAB expert + Artem Harutyunyan
Requirements: Experience with Perl or Python, network programming experience, experience with CMS CRAB.
- CernVM installer improvements. The CernVM installer is a cross-platform GUI toolkit (written in C++, using the Qt library) for downloading, installing and configuring CernVM releases with VirtualBox. The toolkit will be released to the public with the next version of CernVM; its source is available here. The task would be to implement libvirt integration (so that other hypervisors can be supported as well) and to write a new frontend based on ncurses.
Mentor: Pere Mato
Requirements: Extensive knowledge of Bash, experience with libvirt virsh.
- Tier 3 Grid site out of the box. Grid middleware and the application software stacks of scientific collaborations are not easy to deploy and maintain. We think that virtualisation technologies can greatly facilitate the job of Grid site administrators. The task would be to implement a workflow manager (e.g. based on Taverna) which can be used to ease the Grid site deployment and maintenance tasks. This is a very important problem for relatively small universities and institutions which need to fulfill their commitments to provide Grid resources with limited manpower.
Mentor: Predrag Buncic
Requirements: Experience with Perl (or Python) and Bash; experience with Taverna and/or another workflow management tool will be a plus.
- Replacing chirp in Co-Pilot. Currently Co-Pilot uses chirp for transferring files to and from the Job Manager. The task will be to replace chirp with another file transfer mechanism (e.g. XMPP-based).
Mentor: Artem Harutyunyan
Requirements: Experience with Perl (or Python); experience with XMPP will be a plus.
Mentors
Here is the list of our mentors and their areas of expertise:
- Jakob Blomer CernVM File System
- Predrag Buncic CERN Virtual Machine
- Artem Harutyunyan CernVM Co-Pilot
- Pere Mato CernVM Installer
- Ben Segal CernVM Co-Pilot (BOINC and volunteer computing)
Contact information
Please do not hesitate to contact us if you are planning to apply for a CernVM GSoC project:
- CernVM GSoC mailing list: cernvm-gsoc-AT-cern-DOT-ch (no subscription needed).
- CernVM GSoC Jabber/XMPP chat room: cernvm-gsoc@conference.jabber.org. We noticed that Gmail/Gtalk XMPP accounts have problems posting messages to the chat room, so please register an XMPP account on some other server (it takes no time to register an account at http://www.jabber.org/create-an-account/).
- IRC is restricted at CERN so please use Jabber instead.
