Cyberinfrastructure-Enabled Collaboration Networks

Jian Qin (Principal Investigator)
Jeff Hemsley (Co-Principal Investigator)

Project period: 2016-2018

Cyberinfrastructure enables collaborative research and significantly impacts scientific capacity and knowledge diffusion. In response to the growing need to quantitatively evaluate the outcomes and impact of federal investment in research, this project deploys new data, tools, metrics, and methods for assessing the impact of cyberinfrastructure and the data services built on it. This research helps researchers and policy makers understand how cyberinfrastructure affects the collaboration dynamics and network structures of researchers. Datasets organized by longitudinal, thematic, topical, geographical, institutional, and author dimensions are provided for researchers, policy makers, and students to access and use in exploring data-intensive research on the science of science and innovation policy.

Metadata from GenBank, patent data from the U.S. Patent and Trademark Office, and funding data from NIH ExPORT are analyzed with descriptive statistics and models from Complex Network Analysis. The project examines not only the topological properties of the data submission and publication networks, but also the temporal ordering of collaborative relationships and the overlap of the sequence submission and publication networks. By slicing, plotting, and visualizing the data, the project develops appropriate sampling strategies and algorithms to explore collaboration networks more deeply, both structurally and temporally. Algorithms used in community detection, machine learning, and visualization serve as the primary computational methods in this research. Data products to be shared with research communities include 1) discovery lifecycle datasets containing sequence submissions, publications, and patents as well as the links between them, and 2) funding factor datasets containing links between U.S. federal funding data and the discovery lifecycle datasets.
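
As a rough illustration of what measuring the overlap of the submission and publication networks could look like, the sketch below builds two small collaboration networks from author lists and computes the share of collaboration ties they have in common (a Jaccard index over edges). The records, the function names `coauthor_edges` and `edge_overlap`, and the choice of the Jaccard index are all assumptions made for illustration; they are not taken from the project's actual data or methods.

```python
from itertools import combinations


def coauthor_edges(records):
    """Return the set of undirected collaborator pairs implied by author lists."""
    edges = set()
    for authors in records:
        # Each pair of distinct names on one record counts as one tie;
        # sorting makes (a, b) and (b, a) the same edge.
        for a, b in combinations(sorted(set(authors)), 2):
            edges.add((a, b))
    return edges


def edge_overlap(edges_a, edges_b):
    """Jaccard overlap of two edge sets: shared ties divided by all ties."""
    union = edges_a | edges_b
    return len(edges_a & edges_b) / len(union) if union else 0.0


# Hypothetical toy records: each inner list is the author list of one
# sequence submission or one publication.
submissions = [["A", "B", "C"], ["C", "D"]]
publications = [["A", "B"], ["C", "D"], ["D", "E"]]

sub_edges = coauthor_edges(submissions)
pub_edges = coauthor_edges(publications)
overlap = edge_overlap(sub_edges, pub_edges)
print(f"{len(sub_edges)} submission ties, {len(pub_edges)} publication ties, "
      f"overlap = {overlap:.2f}")
```

An edge-level overlap is only one possible choice; comparing the sets of authors, or the communities detected in each network, would answer somewhat different questions.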

Community as Collection

Yun Huang (Principal Investigator)
David Lankes (Co-Principal Investigator)
Jian Qin (Co-Principal Investigator)

Project period: 2015-2018

Syracuse University’s School of Information Studies (iSchool) will partner with Coulter Library at Onondaga Community College and Fayetteville Free Library to design a Community Profile System that includes human expertise, particularly in the STEM fields. The system will enable librarians to collect communities’ learning needs, identify relevant community experts, and link these resources to serve the learning needs in a cost-efficient manner. The tangible products include the Community Profile System and its web and mobile applications. As libraries shift from collection-driven to community-driven service models, the Community Profile System will fill a gap in the community-oriented librarianship toolbox. The partners are building a system that will put community-oriented librarianship into practice in a cost-efficient manner. The collaboration and partnerships will ensure that the tool is designed, tested, and assessed to meet its goal of national adoption in diverse settings.

Domain-aware management of heterogeneous workflows: Active data management for gravitational-wave science workflows

Duncan Brown (Principal Investigator)
Ewa Deelman (Co-Principal Investigator)
Peter Couvares (Co-Principal Investigator)
Jian Qin (Co-Principal Investigator)

Project period: 2014-2017

Analysis and management of large data sets are vital for progress in the data-intensive realm of scientific research and education. Scientists are producing, analyzing, storing, and retrieving massive amounts of data. The anticipated growth in the analysis of scientific data raises complex issues of stewardship, curation, and long-term access. Scientific data is tracked and described by metadata. This award will fund the design, development, and deployment of metadata-aware workflows to enable the management of large data sets produced by scientific analysis. Scientific workflows for data analysis are used by a broad community of scientists in fields including astronomy, biology, ecology, and physics. Making workflows metadata-aware is an important step towards making scientific results easier to share, reuse, and reproduce. This project will pilot new workflow tools using data from the Laser Interferometer Gravitational-wave Observatory (LIGO), a data-intensive project at the frontiers of astrophysics. The goal of LIGO is to use gravitational waves (ripples in the fabric of spacetime) to explore the physics of black holes and understand the nature of gravity.

Efficient methods for accessing and mining the large data sets generated by LIGO’s diverse gravitational-wave searches are critical to the overall success of gravitational-wave physics and astronomy. Providing these capabilities will maximize existing NSF investments in LIGO, support new modes of collaboration within the LIGO Scientific Collaboration, and better enable scientists to explain their results to a wider community, including the critical issue of data and analysis provenance for LIGO’s first detections. The interdisciplinary collaboration involved in this project brings together computational and informatics theories and methods to solve data and workflow management problems in gravitational-wave physics. The research generated from this project will make a significant contribution to the theory and methods in identification of science requirements, metadata modeling, eScience workflow management, data provenance, reproducibility, data discovery and analysis. The LIGO scientists participating in this project will ensure that the needs of the community are met. The cyberinfrastructure and data-management scientists will ensure that the software products are well-designed and that the work funded by this award is useful to a broader community.

Discovering Collaboration Network Structures and Dynamics in Big Data

Jian Qin (Principal Investigator)
Jeffrey Stanton (Co-Principal Investigator)
Jun Wang (Co-Principal Investigator)

Project period: 2013-2016

Understanding how individual scientists interact with one another and how such interaction impacts research productivity and knowledge diffusion is important for understanding the dynamics of scientific research collaboration. At the same time, information about patterns of collaboration and their consequences has implications for science policy. In quantitative research on collaboration networks, publication co-authorships and citation linkages have been the primary sources of data. However, as large data repositories, one of the signposts of cyberinfrastructure-enabled, data-driven science, become increasingly prevalent, they offer an alternative source of information about networks of scientific collaboration. This project investigates research collaboration networks emerging around one such international data repository, GenBank, and develops data products to support data-driven science policymaking and research. By utilizing this novel data source, the project provides an unprecedented opportunity to validate and expand the theory of complex networks while generating rich data outputs and products to support science policy research and policymaking. This study fills a number of theoretical and methodological gaps identified by the 2008 roadmap for Science of Science Policy (SoSP), with a specific focus on how scientific collaboration networks form and evolve. The outcomes of this study address the lack of models and tools for network analysis, visual analytics, and science mapping outlined in the 2008 roadmap for SoSP. To accomplish the data collection and processing required for this project, new computational programs will be developed to parse, extract, store, transform, split, merge, and filter the data; these will be applicable to the analysis of other similar data sources for science policy and innovation research.

Development and Dissemination of A Capability Maturity Model for Research Data Management Training and Performance Assessment

Jian Qin (Principal Investigator)
Kevin Crowston (Co-Principal Investigator)
Charlotte Flynn, Doctoral RA
Arden Kirkland, Masters RA

Project period: June 2013 - May 2014. Completed.

The broad goals of this project are to document, foster, and promulgate best practices in research data management (RDM), practices that support research transparency and the replication of scientific results. We do so in order to cultivate a new generation of researchers and data managers who are both beneficiaries of and contributors to these best practices. Furthermore, as more organizations invest in RDM, it has become increasingly important for administrators, researchers, and managers to be able to evaluate RDM processes for sustainability, efficiency, and effectiveness, which requires a baseline for comparison.

eScience Librarianship: Education and Training

Project period: 2009-2012. Completed.

Syracuse University’s School of Information Studies (iSchool) partnered with Cornell University Library to respond to Category Five of the IMLS Laura Bush 21st Century Librarian Program, Building Institutional Capacity. The partnership built upon several prior IMLS-funded programs, such as the ARL Academy, to 1) recruit students with the necessary discipline-based education in the sciences to librarianship, and 2) research and develop a new curriculum that responds to the need to manage new and different types of digital resources, in volumes previously unimagined, for long-term access and use.

Scientific Data Literacy

Project period: 2008-2010. Completed.


The Scientific Data Literacy (SDL) curriculum development project was funded as part of the NSF’s investment to encourage the development and dissemination of effective techniques in undergraduate education in science, technology, engineering, and mathematics (STEM). Over a two-year period, SDL project researchers planned and twice conducted a new undergraduate course that combines both these NSF-targeted areas. The project website offers instructor resources and other useful information for scientific data literacy training.