Sometimes I have to put text on a path

Saturday, June 11, 2011

Example: E-tools, MATLAB, and PubMed: retrieve information from a Web database and read it into a MATLAB structure.



Bioinformatics Toolbox includes several get functions that retrieve information from various Web databases. Additionally, with some basic MATLAB programming skills, you can create your own get function to retrieve information from a specific Web database.
The following procedure illustrates how to create a function to retrieve information from the NCBI PubMed database and read the information into a MATLAB structure. The NCBI PubMed database contains biomedical literature citations and abstracts.

Creating the getpubmed Function

The following procedure shows you how to create a function named getpubmed using the MATLAB Editor. This function will retrieve citation and abstract information from PubMed literature searches and write the data to a MATLAB structure.
Specifically, this function will take one or more search terms, submit them to the PubMed database for a search, then return a MATLAB structure or structure array, with each structure containing information for an article found by the search. The returned information will include a PubMed identifier, publication date, title, abstract, authors, and citation.
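To make the shape of the return value concrete, here is a minimal sketch of such a structure array built by hand. The field names come from the procedure below; every value is a made-up placeholder, not a real PubMed record.

```matlab
% Sketch of the structure array shape that getpubmed is described as returning.
% All field values below are made-up placeholders, not real PubMed records.
pmstruct = struct('PubMedID','','PublicationDate','','Title','',...
                  'Abstract','','Authors','','Citation','');

% First (placeholder) article
pmstruct(1).PubMedID        = '00000001';
pmstruct(1).PublicationDate = '2011 Jan';
pmstruct(1).Title           = 'Placeholder title A';
pmstruct(1).Authors         = {'Author A','Author B'};

% Second (placeholder) article; assigning to index 2 grows the array,
% and fields that are not set are left empty
pmstruct(2).PubMedID = '00000002';
pmstruct(2).Title    = 'Placeholder title B';

% Collect one field across all articles into a cell array
titles = {pmstruct.Title};
numArticles = numel(pmstruct);
```

Indexing with pmstruct(n) gives one article, while {pmstruct.Title} gathers a single field across all articles.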
The function will also include property name/property value pairs that let the user of the function limit the search by publication date and limit the number of records returned.
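As a rough illustration of how such name/value pairs might be supplied and read, the sketch below fills a cell array by hand in place of VARARGIN so that it runs as a script. The search term, the property values, the variable name args, and the two extra checks (incomplete pairs, unknown names) are assumptions added for illustration and are not part of the procedure that follows.

```matlab
% Hypothetical call once getpubmed is finished (needs network access):
%   pmstruct = getpubmed('flu+virus','NumberOfRecords',5,'DateOfPublication','2010');
%
% Inside the function, everything after SEARCHTERM arrives in VARARGIN.
% Here ARGS stands in for VARARGIN so the sketch runs as a script.
args = {'NumberOfRecords',5,'DateOfPublication','2010'};

% Defaults used when a pair is not supplied
maxnum  = 50;
pubdate = '';

% Added check (not in the original procedure): pairs must be complete
if mod(numel(args),2) ~= 0
    error('GETPUBMED:IncompleteNameValuePair', ...
          'Each property name needs a matching value.');
end

for n = 1:2:numel(args)
    switch lower(args{n})
        case 'numberofrecords'
            maxnum = args{n+1};
        case 'dateofpublication'
            pubdate = args{n+1};
        otherwise
            % Added check: reject names the function does not know
            error('GETPUBMED:UnknownProperty', ...
                  'Unknown property name: %s.', args{n});
    end
end
```

Because the names are compared after lower(), callers can write the property names in any letter case.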
  1. From MATLAB, open the MATLAB Editor by selecting File > New > M-File.
  2. Define the getpubmed function, its input arguments, and return values by typing:
    function pmstruct = getpubmed(searchterm,varargin)
    % GETPUBMED Search PubMed database & write results to MATLAB structure
  3. Add code to do some basic error checking for the required input SEARCHTERM.
    % Error checking for required input SEARCHTERM
    if(nargin<1)
        error('GETPUBMED:NotEnoughInputArguments',...
              'SEARCHTERM is missing.');
    end
  4. Create variables for the two property name/property value pairs, and set their default values.
    % Set default settings for property name/value pairs,
    % 'NUMBEROFRECORDS' and 'DATEOFPUBLICATION'
    maxnum = 50; % NUMBEROFRECORDS default is 50
    pubdate = ''; % DATEOFPUBLICATION default is an empty string
  5. Add code to parse the two property name/property value pairs if provided as input.
    % Parsing the property name/value pairs
    num_argin = numel(varargin);
    for n = 1:2:num_argin
        arg = varargin{n};
        switch lower(arg)
    
            % If NUMBEROFRECORDS is passed, set MAXNUM
            case 'numberofrecords'
                maxnum = varargin{n+1};
    
            % If DATEOFPUBLICATION is passed, set PUBDATE
            case 'dateofpublication'
                pubdate = varargin{n+1};          
    
        end
    end
  6. You access the PubMed database through a search URL, which submits a search term and options, and then returns the search results in a specified format. This search URL consists of a base URL and defined parameters. Create a variable containing the base URL of the PubMed database on the NCBI Web site.
    % Create base URL for PubMed db site
    baseSearchURL = 'http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=search';
  7. Create variables to contain five defined parameters that the getpubmed function will use, namely, db (database), term (search term), report (report type, such as MEDLINE®), format (format type, such as text), and dispmax (maximum number of records to display).
    % Set db parameter to pubmed
    dbOpt = '&db=pubmed';
    
    % Set term parameter to SEARCHTERM and PUBDATE
    % (Default PUBDATE is '')
    termOpt = ['&term=',searchterm,'+AND+',pubdate];
    
    % Set report parameter to medline
    reportOpt = '&report=medline';
    
    % Set format parameter to text
    formatOpt = '&format=text';
    
    % Set dispmax to MAXNUM
    % (Default MAXNUM is 50)
    maxOpt = ['&dispmax=',num2str(maxnum)];
  8. Create a variable containing the search URL from the variables created in the previous steps.
    % Create search URL
    searchURL = [baseSearchURL,dbOpt,termOpt,reportOpt,formatOpt,maxOpt];
  9. Use the urlread function to submit the search URL, retrieve the search results, and return the results (as text in the MEDLINE report type) in medlineText, a character array.
    medlineText = urlread(searchURL);
  10. Use the MATLAB regexp function and regular expressions to parse and extract the information in medlineText into hits, a cell array, where each cell contains the MEDLINE-formatted text for one article. The first input is the character array to search. The second input is a search expression, which tells the regexp function to find all records that start with PMID-. The third input, 'match', tells the regexp function to return the actual records, rather than the positions of the records.
    hits = regexp(medlineText,'PMID-.*?(?=PMID|</pre>$)','match');
  11. Instantiate the pmstruct structure returned by getpubmed to contain six fields.
    pmstruct = struct('PubMedID','','PublicationDate','','Title','',...
                 'Abstract','','Authors','','Citation','');
  12. Use the MATLAB regexp function and regular expressions to loop through each article in hits and extract the PubMed ID, publication date, title, abstract, authors, and citation. Place this information in the pmstruct structure array.
    for n = 1:numel(hits)
        pmstruct(n).PubMedID = regexp(hits{n},'(?<=PMID- ).*?(?=\n)','match', 'once');
        pmstruct(n).PublicationDate = regexp(hits{n},'(?<=DP  - ).*?(?=\n)','match', 'once');
        pmstruct(n).Title = regexp(hits{n},'(?<=TI  - ).*?(?=PG  -|AB  -)','match', 'once');
        pmstruct(n).Abstract = regexp(hits{n},'(?<=AB  - ).*?(?=AD  -)','match', 'once');
        pmstruct(n).Authors = regexp(hits{n},'(?<=AU  - ).*?(?=\n)','match');
        pmstruct(n).Citation = regexp(hits{n},'(?<=SO  - ).*?(?=\n)','match', 'once');
    end
  13. Select File > Save As.
    When you are done, your M-file should look similar to the getpubmed.m file included with the Bioinformatics Toolbox software. The sample getpubmed.m file, including help, is located at:
    matlabroot\toolbox\bioinfo\biodemos\getpubmed.m
Note: The notation matlabroot refers to the MATLAB root directory, which is the directory where the MATLAB software is installed on your system.
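The record-splitting and field-extraction steps in the procedure above can be tried without a network connection. The sketch below runs the same kind of regular expressions on a made-up, MEDLINE-style sample. The sample text is a placeholder and not real PubMed output. The split pattern here ends the last record at the end of the text, and the strtrim calls are an addition to drop trailing line breaks; both are assumptions that suit this sample.

```matlab
% Made-up, MEDLINE-style sample text standing in for the urlread output.
% The records are placeholders, not real PubMed entries.
medlineText = sprintf([ ...
    'PMID- 00000001\n' ...
    'DP  - 2011 Jan\n' ...
    'TI  - Placeholder title A\n' ...
    'AB  - Placeholder abstract A.\n' ...
    'AD  - Placeholder affiliation A\n' ...
    'AU  - Author A\n' ...
    'AU  - Author B\n' ...
    'SO  - Placeholder Journal. 2011 Jan;1(1):1-2.\n' ...
    '\n' ...
    'PMID- 00000002\n' ...
    'DP  - 2011 Feb\n' ...
    'TI  - Placeholder title B\n' ...
    'AB  - Placeholder abstract B.\n' ...
    'AD  - Placeholder affiliation B\n' ...
    'AU  - Author C\n' ...
    'SO  - Placeholder Journal. 2011 Feb;1(2):3-4.\n']);

% Split the text into one cell per record: each record starts at PMID-
% and runs up to the next PMID- or the end of the text
hits = regexp(medlineText,'PMID-.*?(?=PMID-|$)','match');

% Structure with the six fields used in the procedure
pmstruct = struct('PubMedID','','PublicationDate','','Title','',...
                  'Abstract','','Authors','','Citation','');

for n = 1:numel(hits)
    rec = hits{n};
    % Single-line fields: text after the tag, up to the line break
    pmstruct(n).PubMedID        = regexp(rec,'(?<=PMID- ).*?(?=\n)','match','once');
    pmstruct(n).PublicationDate = regexp(rec,'(?<=DP  - ).*?(?=\n)','match','once');
    % Fields that may span lines: text up to the next expected tag,
    % trimmed to remove the trailing line break
    pmstruct(n).Title    = strtrim(regexp(rec,'(?<=TI  - ).*?(?=PG  -|AB  -)','match','once'));
    pmstruct(n).Abstract = strtrim(regexp(rec,'(?<=AB  - ).*?(?=AD  -)','match','once'));
    % Without 'once', every AU line is returned, as a cell array
    pmstruct(n).Authors  = regexp(rec,'(?<=AU  - ).*?(?=\n)','match');
    pmstruct(n).Citation = regexp(rec,'(?<=SO  - ).*?(?=\n)','match','once');
end
```

Real MEDLINE records contain more tags and may order them differently, so the lookahead tags used for Title and Abstract may need adjusting for live data.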
