IT Services at the School of Information Studies (iSchool), Syracuse University, has set up a dedicated data server for the proposed GenBank metadata mining project. The server has the necessary software, including a database server system and a programming language environment, and sufficient storage space that can be expanded as project needs grow.
Raw data obtained from the GenBank repository and the results generated from data processing and analysis will be archived on the data server and backed up regularly on site.
1. Products of the research: types of data, samples, physical collections, software, curriculum materials, and other materials to be produced
This project will produce four types of data:
1. Raw data: the metadata section extracted from GenBank records.
2. Processed data: parsed metadata separated by unit of analysis, together with data drawn from sources external to GenBank.
3. Output data: results from network modeling and analysis as well as scientometric analysis and mashups. These will include derived datasets, calculated datasets, statistical analysis output, graphs and charts.
4. Compiled data: combinations of output data that are compiled for a variety of purposes such as longitudinal, cross-sector, and cross-disciplinary studies and science mapping.
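To illustrate what extracting the raw metadata section (item 1 above) might involve, the following is a minimal Python sketch. It assumes records arrive in the GenBank flat-file layout, where keywords occupy the first 12 columns and the header ends at the FEATURES line. The function name `extract_metadata`, the helper `flat_line`, and the sample record are hypothetical and made up for illustration; this is not the project's actual parser.

```python
from collections import defaultdict


def extract_metadata(record_text):
    """Collect header fields of one flat-file record, stopping at FEATURES.

    Assumption: keywords sit in columns 1-12 and values start at column 13;
    a line with a blank keyword area continues the previous field.
    """
    fields = defaultdict(list)
    current = None
    for line in record_text.splitlines():
        # The metadata (header) section ends where the feature table begins.
        if line.startswith(("FEATURES", "ORIGIN", "//")):
            break
        keyword = line[:12].strip()
        value = line[12:].strip()
        if keyword:
            current = keyword
            fields[current].append(value)
        elif current is not None and value:
            # Continuation line: append to the most recent value.
            fields[current][-1] += " " + value
    return dict(fields)


def flat_line(keyword, value):
    """Hypothetical helper: pad the keyword to 12 columns, then the value."""
    return f"{keyword:<12}{value}"


# Made-up sample record, for illustration only.
sample = "\n".join([
    flat_line("LOCUS", "XX000001 120 bp DNA linear PLN 01-JAN-2000"),
    flat_line("DEFINITION", "Hypothetical example sequence,"),
    flat_line("", "partial cds."),
    flat_line("ACCESSION", "XX000001"),
    flat_line("REFERENCE", "1  (bases 1 to 120)"),
    flat_line("  AUTHORS", "Doe,J. and Roe,R."),
    flat_line("  TITLE", "A made-up title"),
    flat_line("FEATURES", "         Location/Qualifiers"),
    flat_line("ORIGIN", ""),
])

meta = extract_metadata(sample)
print(meta["ACCESSION"], meta["AUTHORS"])
```

Note that this sketch flattens sub-fields such as AUTHORS and TITLE into one list per keyword, so the link between a reference and its authors is lost when a record has several references. A production parser would need to keep that grouping.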
Two types of computer programs will also result from this project. The first is program code for automatic data collection, parsing, named entity disambiguation, duplicate record detection, and visualization, some of which will be released as software tools. The second is code written in R and other network modeling and analysis software.
We will make the processed and compiled data available to the public under Creative Commons licenses. We will provide source code and may also, as appropriate, make available binary distributions for the tools developed during this project. In addition, we will provide documentation for both data and software tools, including manual pages, user guides, and administrator guides.
2. Standards to be used for data and metadata format and content
The raw data will be kept in its original plain text format, while processed data will be stored in Microsoft SQL Server and later converted to formats accepted by analysis software such as R. Output data will be preserved in their original formats, but compiled data will use standard data formats such as Excel, XML, and HTML5 for interoperability.
Datasets and computer program code will be managed under a two-phase scheme. The first phase manages “active files”, which are changed and moved frequently while work is in progress. The active data and code files will follow a naming convention that is as descriptive as possible and will be organized in a carefully designed file directory system. Version control will be enforced by using a version control system such as OpenCVS. Before active data are moved to the next management phase, the data files will go through a retention review guided by a checklist. Intermediary data files may be deleted after review and verification. Verified data files will be moved to the curation workspace for metadata description and indexing.
The second phase of management is for the stable, verified datasets and program code files that will be made available for open access. These datasets and program code will be archived in an open data repository (to be determined) that complies with the Dublin Core metadata standard. The metadata schema in the repository system will be modified according to our data curation needs. Once the metadata is created for the datasets, we will register the datasets in a larger data repository for broader discoverability and dissemination.
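As an illustration of the kind of Dublin Core description that could accompany a verified dataset, here is a minimal Python sketch that serializes a record using the 15 elements of the Dublin Core Metadata Element Set. The function name `dublin_core_record`, the wrapper element `metadata`, and all field values are hypothetical; the actual schema will depend on the repository chosen and on the modifications described above.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# The 15 elements defined by the element set.
DC_ELEMENTS = {
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
}


def dublin_core_record(fields):
    """Serialize a dict of Dublin Core element names and values to XML."""
    root = ET.Element("metadata")  # hypothetical wrapper element
    for name, value in fields.items():
        if name not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {name}")
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


# Placeholder values, for illustration only.
record = dublin_core_record({
    "title": "Parsed GenBank metadata (example dataset)",
    "type": "Dataset",
    "format": "text/xml",
    "rights": "Creative Commons license (specific license to be determined)",
})
print(record)
```

Rejecting element names outside the set keeps records compatible with any Dublin Core compliant repository; local extensions to the schema would need their own namespace.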
3. Access to data and data sharing practices and policies
Our datasets, software tools, documentation, and publications will be made available on a project website. The datasets and software tools will be released for free under a Creative Commons license. They will be shared with any parties who agree to use the data and tools under the terms of the license.
All participants in this proposal will conduct research and publish the results of their work. Papers will be published in peer-reviewed, English-language conferences, journals, or books, or as peer-reviewed data reports. We will make all publications available on our website, except when prohibited by the copyright of the publisher. We will also provide access on the website to the raw and processed data used in publications, and this data will be available for free to any scientists and science policy researchers who want to use it for comparison or analysis.
4. Policies and provisions for reuse, redistribution, and the production of derivatives
Data sets and software provided through the project website may be used as specified under the Creative Commons license. Publications provided on the website may be distributed freely in academic environments and may be cited with appropriate attribution according to standard academic practice. Data provided on the website may be used for analysis and comparison of scientific results.
We intend to conduct similar research using data repositories in other disciplines in the future. This requires that the computer programs and tools remain open to, and interactive with, the user community and the software development community. We will encourage the community to provide feedback through various venues.
5. Plans for archiving and for preservation of access
The proposal team is committed to preserving the project website for at least three years after the completion of funding. Datasets of potential value for future research and policy making will be maintained indefinitely. After consultation with the appropriate NSF program officer to ascertain any exceptions, items will be discarded no sooner than three years after the conclusion of the grant or the public release, whichever is later.