NCBI Minute: New Version of E-utilities Supports Accession.version Identifiers

National Library of Medicine
1 Feb 2017 (28:00)

TLDR
The webinar introduces a significant update to the E-utilities API from the National Center for Biotechnology Information (NCBI), which now fully supports accession.version identifiers for sequence records, both as input and output. It is part of a series on the transition from G.I. numbers to accession.version identifiers, which are becoming the primary identifiers for new sequence data. The presentation covers how the E-utilities, including ESearch, ESummary, EFetch, and ELink, have been adapted to accommodate this change. It also highlights the new ability of EFetch to retrieve GI-less sequences by their accession.version identifiers, which is particularly relevant for Whole Genome Shotgun (WGS) and Transcriptome Shotgun Assembly (TSA) datasets. The webinar provides practical examples and a Perl script for exploring the new functionality, and encourages participants to subscribe to utilities-announce@ncbi for updates and to reach out with questions or concerns.

Takeaways
  • 📈 The E-utilities API now fully supports accession.version identifiers, both as input and output, marking a significant upgrade in functionality.
  • 🔍 G.I. numbers are being phased out: new sequences will not be assigned them, but existing G.I. numbers will remain accessible and unchanged.
  • 🚀 ESearch can now return results as accession.version identifiers when the ID type parameter is set to 'acc'.
  • 🔗 ELink can take accession.version identifiers as input and provide linked records in another database, also outputting accessions.
  • 📚 EFetch can download records for sequences without G.I. numbers using their accession.version identifiers.
  • 🛠 The introduction of the 'idtype' parameter allows for controlling the type of identifiers used in various E-utilities.
  • 📝 ESummary can now accept accession.version as input, whereas it previously could not, enhancing its capabilities.
  • 📉 GI numbers are being removed from public presentations such as flat file formats and FASTA definition lines.
  • 🔍 ESummary can be used to check if a sequence has a G.I. number by returning an error tag for sequences without a GI.
  • 🌐 The scope of E-utilities has expanded to include GI-less sequences, which are part of the NCBI's future direction for sequence identification.
  • ❗ It is recommended that users subscribe to utilities-announce@ncbi for updates on E-utilities to stay informed about upcoming changes.
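
The 'idtype' takeaway above can be illustrated with a minimal sketch that only builds an ESearch request URL and does not contact NCBI. The helper name `esearch_url` and the search term are assumptions made for illustration; the `db`, `term`, `retmax`, and `idtype` parameters are the ones the E-utilities expose.

```python
from urllib.parse import urlencode

# Public E-utilities base URL.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, idtype=None, retmax=20):
    """Build an ESearch URL (hypothetical helper, not part of any NCBI library).

    Passing idtype="acc" asks ESearch for accession.version identifiers;
    leaving it out keeps the default identifier type.
    """
    params = {"db": db, "term": term, "retmax": retmax}
    if idtype is not None:
        params["idtype"] = idtype
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# Illustrative query only; the search term is a made-up example.
print(esearch_url("nuccore", "Picea abies[Organism]", idtype="acc"))
```

Keeping `idtype` optional makes it easy to compare the default output with the accession.version output for the same query.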
Q & A
  • What is the main topic of the webinar?

    -The webinar is about the new functionality within the E-utilities API that supports accession.version identifiers for sequence records.

  • What is the significance of the 'idtype' parameter in the E-utilities?

    -The 'idtype' parameter is crucial as it controls the type of identifiers used by the E-utilities, allowing them to transition from G.I. numbers to accession.version identifiers.

  • Why is NCBI phasing out G.I. numbers?

    -NCBI is making accession.version the primary identifier for sequence records. Because NCBI tools, including the E-utilities, were built around G.I. numbers, those tools have to be updated as part of the transition.

  • How can users access the materials for the webinar?

    -The materials for the webinar, including the PowerPoint, can be accessed through the FTP site which is linked via the coursesandwebinars link on the screen or directly via the go.usa.gov URL.

  • What happens to the GI numbers of sequences that have already been assigned them?

    -The GI numbers are not being deleted; sequences that have been assigned a G.I. number will continue to have that number and will be retrievable by it for the foreseeable future.

  • How does the transition to accession.version identifiers affect the visibility of GI numbers?

    -GI numbers are no longer visible on the flat file formats on the web or in FASTA definition lines, including in the BLAST databases. However, they are still present in the raw data files like ASN.1 or XML.

  • What is the role of ESearch in the context of the new functionality?

    -ESearch can now return a set of accession.version identifiers corresponding to the result set of a text query, offering users the option to receive accession.version identifiers instead of GI numbers.

  • How can users determine if a sequence has a G.I. number or not?

    -Users can use ESummary to check if a sequence has a G.I. number. ESummary will return an error tag for accessions that do not have a G.I. identifier.

  • What is the impact of the transition on ELink and EPost functionalities?

    -Sequences without G.I. numbers are not part of the Entrez system. As a result, they are not indexed for ESearch, have no document summaries for ESummary, are not available to ELink, and cannot be placed on the Entrez history with EPost.

  • What new functionality does EFetch gain with the transition to accession.version identifiers?

    -EFetch can now retrieve any NCBI sequence by its accession.version identifier, whether or not the sequence has a G.I. number.

  • How can users stay updated with the changes and updates regarding the E-utilities?

    -Users can follow NCBI's blog, news, Facebook, and Twitter accounts for updates. Subscribing to utilities-announce@ncbi is also recommended for announcements about upcoming changes.
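
The answers above distinguish integer G.I. numbers from accession.version identifiers. A rough sketch of telling the two apart by shape alone follows. The regular expressions and the function name `classify_identifier` are assumptions for illustration, not an official NCBI grammar, and the check says nothing about whether a given record actually has a G.I. number.

```python
import re

# Assumed syntactic patterns for illustration only.
GI_PATTERN = re.compile(r"^\d+$")                           # all digits
ACCVER_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9_]*\.\d+$")  # accession, dot, version


def classify_identifier(identifier):
    """Return "gi", "accession.version", or "unknown" based on the string's shape."""
    text = identifier.strip()
    if GI_PATTERN.match(text):
        return "gi"
    if ACCVER_PATTERN.match(text):
        return "accession.version"
    return "unknown"


# Placeholder identifiers, chosen only to exercise each branch.
for example in ["12345", "EXAMPLE_000001.2", "EXAMPLE_000001"]:
    print(example, "->", classify_identifier(example))
```

A check like this can help decide which value of 'idtype' to send, but only a query to ESummary can confirm whether a sequence has a G.I. number.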

Outlines
00:00
😀 Introduction to E-utilities API Update

The webinar introduces a new functionality within the E-utilities API that supports accession.version identifiers. The presenter provides access to materials such as a PowerPoint and additional resources via an FTP site. The webinar is part of a series discussing changes in how NCBI handles sequence identifiers. The E-utilities API now fully supports accession.version identifiers, both as input and output. The presenter outlines the transition from using G.I. numbers to accession.version identifiers and introduces the 'idtype' parameter that controls this transition across various e-utilities.

05:01
🔍 E-utilities Functions and Accession.Version Upgrade

The E-utilities are explained as tools that work across the Entrez databases, providing functions such as ESearch, ESummary, EFetch, ELink, and EPost. These tools handle unique identifiers (UIDs) for records. The introduction of accession.version as UIDs is detailed, highlighting that while G.I. numbers are being phased out, they are not deleted and will remain accessible. The new functionality allows accession.version identifiers to be used in place of G.I. numbers in ESummary and EFetch, and allows ESearch to output accession.version identifiers. This segment also covers the impact on UIDs and the upgrade in functionality for the APIs.

10:02
🔗 ELink and ESearch Behavior with Accession.Version

The behavior of ELink and ESearch with the new accession.version identifiers is discussed. ELink can now accept accession.version identifiers as input and return accessions in the output. ESearch, when given the 'idtype=acc' parameter, returns accession.version identifiers instead of GI numbers. This segment also addresses sequences without G.I. numbers, which are not part of the Entrez system and thus not accessible through ESearch, ESummary, or ELink, but can now be downloaded using EFetch.
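
As a rough sketch of the ELink behavior described above, the snippet below builds an ELink URL that passes accession.version identifiers in and requests accessions back through 'idtype=acc'. The helper name `elink_url`, the choice of databases, and the accessions are placeholders assumed for illustration; no request is sent.

```python
from urllib.parse import urlencode

# Public E-utilities base URL.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def elink_url(dbfrom, db, accessions, idtype="acc"):
    """Build an ELink URL from a list of accession.version strings (hypothetical helper)."""
    params = {
        "dbfrom": dbfrom,             # database the input identifiers belong to
        "db": db,                     # database to find linked records in
        "id": ",".join(accessions),   # comma-separated input identifiers
        "idtype": idtype,             # "acc" requests accession.version output
    }
    # safe="," keeps the comma separators readable in the query string.
    return f"{EUTILS_BASE}/elink.fcgi?{urlencode(params, safe=',')}"


# Placeholder accessions, not real records.
print(elink_url("nuccore", "protein", ["EXAMPLE_000001.1", "EXAMPLE_000002.3"]))
```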

15:05
🌐 Access to GI-less Sequences and the Norway Spruce Project

The presenter discusses the new capability of EFetch to access GI-less sequences using their accession.version identifiers. This includes Whole Genome Shotgun (WGS) and Transcriptome Shotgun Assembly (TSA) datasets, many of which do not have GI numbers for individual contigs. The Norway Spruce project is highlighted as an example of a large dataset without GI numbers, where only the master record has a GI number. The presenter also explains how to check if a sequence has a GI number using ESummary and provides a Perl script for exploring the new functionalities.
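
A minimal sketch of the EFetch request described above might look like the following. It only builds the URL for a FASTA download by accession.version. The helper name `efetch_fasta_url` is hypothetical, and the contig accessions are placeholders that merely follow the CBVK project prefix; they are not confirmed records.

```python
from urllib.parse import urlencode

# Public E-utilities base URL.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_fasta_url(accessions, db="nuccore"):
    """Build an EFetch URL that requests FASTA text for accession.version identifiers."""
    params = {
        "db": db,
        "id": ",".join(accessions),  # accession.version identifiers, comma-separated
        "rettype": "fasta",
        "retmode": "text",
    }
    # safe="," keeps the comma separators readable in the query string.
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params, safe=',')}"


# Placeholder contig accessions under the CBVK prefix (assumed, not verified).
print(efetch_fasta_url(["CBVK010000001.1", "CBVK010000002.1"]))
```

Because GI-less sequences are outside the Entrez system, a request like this is the only E-utilities route to them that the webinar describes.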

20:07
🔬 EFetch Functionality and Sequence Identification Checks

The presenter demonstrates the use of EFetch to obtain data for GI-less sequences, such as the Norway Spruce WGS data. They also show how to use ESummary to check if a sequence has a GI number by including it in a mixed list of identifiers. The script provided on the FTP site is mentioned for attendees to experiment with the new functionalities. The presenter encourages questions and further engagement through social media and subscription to the utilities-announce@ncbi mailing list for updates.
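
The mixed-list check described above can be sketched as a small parser that separates identifiers with a document summary from those that came back with an error tag. The XML below is a hand-written sample, not real NCBI output: the element names (`DocumentSummary`, `error`) and the `uid` attribute are assumptions about the response shape and should be checked against an actual ESummary response.

```python
import xml.etree.ElementTree as ET

# Hand-written sample that mimics the assumed response shape; not real NCBI output.
SAMPLE_XML = """<eSummaryResult>
  <DocumentSummarySet status="OK">
    <DocumentSummary uid="EXAMPLE_1.1">
      <Title>placeholder record with a summary</Title>
    </DocumentSummary>
    <DocumentSummary uid="EXAMPLE_2.1">
      <error>placeholder error text</error>
    </DocumentSummary>
  </DocumentSummarySet>
</eSummaryResult>"""


def split_by_error(xml_text):
    """Return (uids_with_summary, uids_with_error) from an ESummary-like XML string."""
    root = ET.fromstring(xml_text)
    with_summary, with_error = [], []
    for doc in root.iter("DocumentSummary"):
        # A direct <error> child is treated as "no summary available".
        bucket = with_error if doc.find("error") is not None else with_summary
        bucket.append(doc.get("uid"))
    return with_summary, with_error


has_summary, has_error = split_by_error(SAMPLE_XML)
print("with summary:", has_summary)
print("with error tag:", has_error)
```

Identifiers that land in the error list would be the candidates for GI-less sequences, which can then be retrieved with EFetch instead.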

25:11
❓ Handling Questions and Final Remarks

The presenter addresses questions from the audience, including the best way to programmatically obtain the FTP location for entire WGS assemblies using BioProject or by searching through the nuccore database. They also discuss the current behavior of ESummary regarding the order of input UIDs and the presence of GI-less accession.versions. The webinar concludes with an invitation for further inquiries and an offer of assistance for any issues or suggestions regarding the new functionalities.

Keywords
💡E-utilities API
The E-utilities API is a set of tools provided by the National Center for Biotechnology Information (NCBI) that allows users to interact with the Entrez system programmatically. It is central to the video's content as it has been updated to support 'accession.version' identifiers, which is a significant change in how sequence identifiers are handled at NCBI.
💡Accession.version Identifiers
Accession.version identifiers are unique identifiers assigned to biological sequences in databases like GenBank. They are replacing the older G.I. numbers and are crucial to the video's narrative as they are now fully supported by the E-utilities API for both input and output, marking a transition in sequence identification practices.
💡G.I. Number
G.I. (GenInfo Identifier) numbers are integer identifiers that have been used by NCBI for many years to identify sequences. The video discusses the phasing out of G.I. numbers in favor of accession.version identifiers, which is a key change being implemented by NCBI.
💡Entrez System
The Entrez system is a search and retrieval system for accessing databases such as PubMed, protein, and nucleotide sequences. It is integral to the video as the E-utilities API, which interfaces with the Entrez system, is being updated to accommodate the new accession.version identifiers.
💡ESearch, ESummary, EFetch, ELink, EPost
These are specific utilities within the E-utilities suite that serve different functions for data retrieval and manipulation within the Entrez system. They are highlighted in the video as they are being updated to work with accession.version identifiers, which affects how users can retrieve and interact with sequence data.
💡idtype Parameter
The idtype parameter is a new feature introduced in the E-utilities API that allows users to control the type of identifier used in their queries. It is central to the transition from G.I. numbers to accession.version identifiers and is demonstrated in the video through various examples.
💡Phasing Out
The term 'phasing out' is used in the video to describe the process by which NCBI is discontinuing the assignment of G.I. numbers to new sequences. This process is significant as it marks a shift in the primary identifier for sequence records within NCBI databases.
💡FTP Site
The FTP site mentioned in the video is a resource where users can access materials related to the webinar, including the PowerPoint presentation. It also alludes to the method by which users can download large datasets, such as Whole Genome Shotgun (WGS) sequences, which are not fully integrated into the Entrez system.
💡Whole Genome Shotgun (WGS) Datasets
WGS datasets are large collections of sequences that represent a complete genome, generated by a sequencing method called 'shotgun sequencing.' The video discusses how many of these datasets, including those without G.I. numbers, are now accessible through the updated E-utilities API using accession.version identifiers.
💡NCBI Learn Page
The NCBI Learn page is a resource for users to find out about events, webinars, and other educational content offered by NCBI. It is mentioned in the video as a place to stay informed about updates and changes to the E-utilities and other NCBI tools.
💡Utilities-Announce@ncbi
Utilities-Announce@ncbi is a mailing list that users can subscribe to in order to receive announcements about updates to the E-utilities. It is highlighted in the video as a way for users to stay informed about changes and new functionalities.
Highlights

The E-utilities API now fully supports accession.version identifiers, both as input and output.

NCBI is phasing out G.I. numbers in favor of accession.version identifiers for sequence records.

The transition from G.I. numbers will not involve deletion of existing G.I. numbers, but new sequences will not be assigned G.I. numbers.

G.I. numbers are no longer visible in flat file formats, FASTA definition lines, or BLAST databases.

After March 15, GenBank and RefSeq releases will not contain G.I. numbers in their flat file or FASTA presentations.

The E-utilities, including ESearch, ESummary, EFetch, ELink, and EPost, now function with accession.version identifiers.

The introduction of the 'idtype' parameter allows control over the type of identifiers used in E-utilities.

EFetch can now download sequences without G.I. numbers using their accession.version identifiers.

ESummary can be used to check if a sequence has a G.I. number by returning an error tag for sequences without G.I. numbers.

The Norway Spruce project (CBVK) serves as an example of a large dataset without G.I. numbers for individual contigs.

E-utilities now have access to a broader scope of sequences, including those without G.I. numbers.

A simple Perl script is provided for exploring the new functionality of the E-utilities API.

ESummary, ESearch, and ELink have been updated to handle accession.version identifiers, expanding their capabilities.

The order of input UIDs in ESummary is maintained, even if some have no G.I. numbers.

The webinar provides a direct link to the FTP site for accessing materials and a go.usa.gov URL for further information.

NCBI encourages users to subscribe to utilities-announce@ncbi for announcements on upcoming changes to the E-utilities.

The webinar concludes with a Q&A session addressing questions about downloading WGS sequences and the order of UIDs in ESummary.
