HEP.TrkX Weekly

US/Pacific
Vidyo

Paolo Calafiura, Steven Farrell
Description
Please join the meeting by clicking this link:
https://vidyoportal.cern.ch/flex.html?roomdirect.html&key=UoHjbOgfcsrBsDfBwOlOsCObR3E 

If you want to join by phone, please use one of the phone numbers listed in the link below: http://information-technology.web.cern.ch/services/fe/howto/users-join-vidyo-meeting-phone and enter the meeting extension 103958871 in order to join.

Present: JimK, Lindsey, Mayur, Jean-Roch

Apologies: Paolo

 

News:

  • Proposed hackathon after CHEP (@LBL) Oct 14, 15, 17

    • Note: C++ summit @LBL Oct 17/18

    • JimK and Lindsey will not be able to travel to LBNL - is remote participation possible?

 

Outstanding Action Items:

  • Pietro → introduce group activities

  • Maria: I want to try putting all the agendas, links, and minutes in Basecamp. Do you people use Basecamp? Pietro, shall we try? Others?

    • PC: Indico+Google seems to work, but I am open to evaluating Basecamp

  • GitHub private repository https://github.com/cmscaltech/HEP.TrkX (the “cms” will disappear)

    • Steve’s ID: sparticlesteve

    • Mayur’s ID: mudigonda

    • Jim K: jbkowalkowski

    • Paolo: pcalafiura

  • File sharing: MS made a 50 GB Box @ Caltech. → Paolo to test

    • Paolo’s initial feedback: look and feel similar to Google Drive, even more so to MS OneDrive.

      • Relies on Microsoft plugins to open/create documents like live minutes.

      • Concurrent editing in Word Online was laggy initially but became seamless after a couple of edits (possibly a network fluke?).

      • The resulting document is saved as .docx directly, rather than having to be exported as in Google Drive

      • Once one understands how to sync folders (not immediately apparent), the OS X sync client works fine

      • Cannot use “offline” MS Word to write to a document being edited online, even after installing the recommended “Box Edit” application

      • The Box app for Android works, but offers neither live editing nor automatic sync of edited documents

      • Next week I’ll try the Linux and Windows clients

  • Data Storage: JBK

    • Talking with Amitoj in SCD about storage. Still working things out. Asked about getting 10 TB of space.

    • 1) Simulation data generated from resources at Caltech, LBNL, FNAL.

    • 2) FNAL hosts ALL the simulation data generated from the 3 resources mentioned in step #1? Yes.

    • 3) Primarily LBNL consumes the simulation data hosted at FNAL for testing algorithms and for training “deep learning” networks (including generation of training samples). Are there any I/O requirements here for remote access to data hosted at FNAL? Answered that it is more that LBNL will be processing the data.

    • 4) FNAL will ALSO store the training samples and any output data produced from step #3? Yes.

  • Web page: MS will be ready this weekend

  • Computing platform:

    • PC submitted a NERSC ERCAP request for 1.2M hours and 5+5 TB of storage

    • MS setting up GPU tower

 
  • Completed:

    • Paolo, Maria → Send (tentative) dates of all meetings - DONE

    • Name of the project (MS): HEP-TrkX or HEP.TrkX - AGREED

    • Dates of kickoff workshop: proposal is Nov 1-3, at FNAL. Accepted by folks on the call. AGREED

    • Mayur LSTM working doc - DONE

 
  • New:

 

Round the Table for Status Reports:

Jim K:

  • Meeting rooms reserved for the Nov 1-3 workshop at FNAL

  • We will be putting together an agenda for the workshop very soon (by Monday)

    • need topics and the schedule filled out

    • need invite list (other than those on this email list)

    • we can reserve a block of rooms at the hotel up the street if wanted

 

Lindsey:

Talked with Paolo and Andy Salzburger about ACTS. He is interested in knowing how pileup is handled. Wants to get the CMS Phase II geometry going. Interested in extending the simulation (independent of this effort) to include extra timing data. He will be at the CPAD meeting, but it sounds like he will not be available to meet outside that meeting.

 

Jean-Roch:

Has part of the MPI-based optimizer running on 8 GPGPUs (one node). He has already sent out information about the deep Kalman filter to the email list.

Mayur:

Will be adding a bit more detail to the document describing the tests they’ve been doing with the RNNs.

 

JBK - storage request information

Here is the request I put in for storage.

 

A new cross-cutting HL-LHC advanced algorithm & machine learning R&D project is now starting with FNAL, LBNL, and Caltech. This is only a one-year project. We have some modest needs for storage of simulation data and training data. I think it would be best for FNAL to host this data. There will be fewer than ten people contributing to this project overall.

Since this is an LHC cross-cutting project, the data will be generated with a generic HL-LHC detector description using a stand-alone detector simulation tool i.e. independent of ATLAS-specific or CMS-specific simulation applications.  This stand-alone simulator is expected to be run at LBNL (NERSC), Caltech, and FNAL.  The data will be migrated from temporary scratch areas to FNAL after it is generated.   Copies will likely be retained in scratch storage areas at NERSC and Caltech as long as possible for additional post-processing.  

The expected size of the main simulation dataset is 1 TB. The dataset will likely be generated over the first quarter of the project. There will be plenty of exploration and configuration testing for the simulator, so intermediate results might be 2-3 TB.

The advanced algorithms include “deep learning” networks, which require large training samples.  It is likely that all three sites (NERSC, Caltech, and FNAL) will be used to post-process the simulation data to produce training samples.   The vast majority of algorithm training requires access to GPGPU and KNL resources and will be done at NERSC and Caltech.  

We would like to have 10 TB of disk available for all aspects of this project. At the end of the year, we expect 1-3 TB of data to be permanently archived.

There are no specific network data transfer requirements for this project.  We anticipate moving data between sites slowly in the background using whatever facilities are available.   Data movement between sites will certainly be modest, and will be better defined after our first workshop in early November.

 
 

Distributed learning (Jean-Roch): we have made progress with Dustin on implementing an MPI-distributed training system. It has run on the 2x Titan box at Caltech and the 8x GTX 1080 box at Caltech, and will soon run on Piz Daint (CSCS) and Cooley (ANL). We are starting to train models (RNN, CNN, and a novel architecture) over a dataset created with Delphes. We have no performance numbers yet, but this could certainly be of use. N.B. it is wired to Keras and both backends are being investigated. We might end up using TF for model distribution on top of the data parallelism. Very exciting!
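For orientation, here is a minimal sketch of the data-parallel pattern described above, written with mpi4py and Keras. It is an illustration only, not the actual system Jean-Roch and Dustin are building: the toy model, the data shapes, and the synchronize-by-averaging scheme are all assumptions.

# Hypothetical sketch: synchronous data-parallel Keras training with MPI.
# Each rank trains on its own data shard; after every epoch the weight
# arrays are averaged across ranks.
# Run with: mpirun -np 4 python mpi_train_sketch.py
import numpy as np
from mpi4py import MPI
from keras.models import Sequential
from keras.layers import Dense

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Toy model standing in for the real RNN/CNN architectures.
model = Sequential([Dense(32, activation='relu', input_dim=10), Dense(1)])
model.compile(optimizer='sgd', loss='mse')

# All ranks start from identical weights, broadcast from rank 0.
model.set_weights(comm.bcast(model.get_weights(), root=0))

# Each rank generates its own (toy) data shard.
rng = np.random.RandomState(rank)
x, y = rng.randn(256, 10), rng.randn(256, 1)

for epoch in range(5):
    model.fit(x, y, epochs=1, verbose=0)
    # Element-wise average of the weight arrays across all ranks.
    averaged = [comm.allreduce(w, op=MPI.SUM) / size
                for w in model.get_weights()]
    model.set_weights(averaged)
    if rank == 0:
        print('epoch %d synchronized' % epoch)

A production system would more likely exchange gradients per batch, or use an asynchronous master/worker scheme, but the broadcast/allreduce skeleton is the same.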

 

Deep Kalman Filter (Jean-Roch): forwarding a message from Uri Shalit, whom we met in NY in July during DS@HEP-16, about a recent update to their software for deep Kalman filters (learning the dynamics with a deep neural net). This might be of relevance to us.
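To make the idea concrete: a classical Kalman filter assumes a linear state transition, while a deep Kalman filter learns the transition (and emission) with neural networks fit by variational inference. The toy sketch below is a schematic illustration only, not Shalit et al.’s code; the dimensions are invented and the variational training that makes the method work is omitted entirely.

# Schematic only: untrained transition/emission networks forming the
# nonlinear state-space skeleton of a deep Kalman filter.
import numpy as np
from keras.models import Model
from keras.layers import Input, Dense

LATENT, OBS = 4, 2  # assumed toy dimensions

# Transition network: z_t from z_{t-1} (replaces the linear A matrix).
z_prev = Input(shape=(LATENT,))
z_next = Dense(LATENT)(Dense(16, activation='tanh')(z_prev))
transition = Model(z_prev, z_next)

# Emission network: maps the latent state to observation space.
z_in = Input(shape=(LATENT,))
x_out = Dense(OBS)(Dense(16, activation='tanh')(z_in))
emission = Model(z_in, x_out)

# Rolling the networks forward simulates the generative model; in the
# real method both are trained with a variational objective over sequences.
z = np.random.randn(1, LATENT)
for t in range(5):
    z = transition.predict(z)
    print('t=%d  obs=%s' % (t, emission.predict(z).ravel()))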

 

Visualizing weights (Mayur): worked on setting up TensorBoard weight visualizations for the model. In addition, played around with (re-did) past experiments to explore choices of model parameters, i.e. the number of hidden layers and LSTM units.
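For reference, a minimal sketch of the TensorBoard setup in Keras (toy LSTM model and random data, assumed sizes): the callback with histogram_freq > 0 is what writes the weight histograms.

# Minimal sketch: TensorBoard weight histograms for a toy LSTM model.
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.callbacks import TensorBoard

# Toy sequences: 128 samples, 20 time steps, 3 features (assumed shapes).
x = np.random.randn(128, 20, 3)
y = np.random.randn(128, 3)

model = Sequential([LSTM(64, input_shape=(20, 3)), Dense(3)])
model.compile(optimizer='adam', loss='mse')

# histogram_freq=1 logs weight histograms every epoch.
tb = TensorBoard(log_dir='./logs', histogram_freq=1)
model.fit(x, y, epochs=3, validation_split=0.2, callbacks=[tb], verbose=0)
# Browse with: tensorboard --logdir ./logs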

 