GitHub - chao-ji/cross-link: An open source software for identifying cross-linked peptide from fragment ion spectra

XLSearch, Version 1.1
Copyright of School of Informatics and Computing, Indiana University
Contact: [email protected], [email protected]

I. INTRODUCTION

This software is intended to perform database sequence search for identifying chemically cross-linked peptide pairs from tandem mass spectra. Usage of this software is free of charge for academic purposes.

II. PREREQUISITES

i. Software packages

	This software can be run on Unix/Linux operating systems.
	1. Python version 2.6 or higher is required.
	2. To perform the in-sample training (i.e. 'training mode'), additional
	 Python modules (NumPy 1.6.1 or higher, SciPy 0.9 or higher, scikit-learn
	 0.15 or higher) are required.
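The module requirements for training mode can be checked before running anything. The following is a minimal sketch, not part of XLSearch: the function names are made up for illustration, and it assumes each module exposes a `__version__` string.

```python
# Minimum versions quoted in the prerequisites above.
REQUIRED = {"numpy": "1.6.1", "scipy": "0.9", "sklearn": "0.15"}


def version_tuple(text):
    """Turn '1.6.1' into (1, 6, 1), keeping only the leading digits of each part."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def check_training_modules(required=REQUIRED):
    """Return {module: status}; status is 'ok', 'too old', or 'missing'."""
    report = {}
    for name, minimum in sorted(required.items()):
        try:
            module = __import__(name)
        except ImportError:
            report[name] = "missing"
            continue
        # Assumption: the module reports its version as __version__.
        found = getattr(module, "__version__", "0")
        is_ok = version_tuple(found) >= version_tuple(minimum)
        report[name] = "ok" if is_ok else "too old"
    return report


if __name__ == "__main__":
    for name, status in sorted(check_training_modules().items()):
        print("%s: %s" % (name, status))
```

Searching mode does not need these modules, so a 'missing' entry only matters when re-training the models.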

ii. Data
	1. mzXML files containing tandem mass spectra converted using msconvert
	(http://proteowizard.sourceforge.net/tools.shtml) from RAW files.
	NOTE: Currently only mzXML format is supported.
	2. Fasta file containing the desired protein sequences to be searched
	against.

III. USAGE

XLSearch can be run in 'searching mode' or 'training mode'. Searching
mode performs the database sequence search, in which each peptide
spectrum match (PSM) is assigned a score based on computed features
that describe the matching quality between the spectrum and each individual
peptide, together with the weights of pre-trained logistic models. Training
mode re-trains the logistic models using authentic cross-link PSMs obtained
from the new data.

i. Searching mode
Input:	1) PARAM.txt		Contains parameters for performing the database searching.
		2) database.fasta	Text file containing amino acid sequences 
							in fasta format. Specified in 'PARAM.txt'.
	
	Steps:
	1. Preparation:
		a. Unzip the .zip file to a directory (e.g. '/xlsearch_install_dir/'). It 
		should contain the python modules in '/xlsearch_install_dir/lib/', as well 
		as the pipeline scripts for searching and for training the model ('xlsearch_search.py' 
		and 'xlsearch_train.py').

		b. Create the directory where the search is to be performed (e.g. '/xlsearch_search_dir/').
		c. Copy the files 'xlsearch_search.py' and 'PARAM.txt' to this directory.
		d. Copy the fasta sequence file (e.g. 'database.fasta') to this directory.
		e. Create the directory that holds the mzXML files (e.g. '/xlsearch_search_dir/mzxml/').
		f. Edit the parameter file 'PARAM.txt' as needed.
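Preparation steps b to e can also be scripted. The sketch below is illustrative only: `prepare_search_dir` is a hypothetical helper, and it assumes that 'xlsearch_search.py' and 'PARAM.txt' sit directly in the install directory.

```python
import os
import shutil


def prepare_search_dir(install_dir, search_dir, fasta_path):
    """Create search_dir holding the search script, PARAM.txt, the fasta file and mzxml/."""
    mzxml_dir = os.path.join(search_dir, "mzxml")
    if not os.path.isdir(mzxml_dir):
        os.makedirs(mzxml_dir)  # also creates search_dir itself
    # Assumption: both files are found at the top level of the install directory.
    for name in ("xlsearch_search.py", "PARAM.txt"):
        shutil.copy(os.path.join(install_dir, name), search_dir)
    shutil.copy(fasta_path, search_dir)
    return mzxml_dir
```

For example, `prepare_search_dir('/xlsearch_install_dir/', '/xlsearch_search_dir/', 'database.fasta')` would leave only two things to do: placing the mzXML files in the returned directory and editing 'PARAM.txt'.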

	2. Perform database search
		Under directory '/xlsearch_search_dir/'

			$ python xlsearch_search.py -l /xlsearch_install_dir/ \
										-p PARAM.txt \
										-o output.txt

		where '-l', '-p' and '-o' indicate the path to the library, the parameter file
		and the output file name, respectively. All three arguments are required.

	3. Output file
		A tab-delimited text file containing the top-scoring PSM for each query spectrum,
		sorted by the joint probability score assigned to each PSM.
		The first line contains the headers of the columns:
			a. Rank of PSM
			b. Sequence of alpha peptide
			c. Sequence of beta peptide
			d. Index of cross-linking site on alpha
			e. Index of cross-linking site on beta
			f. Protein header of alpha peptide
			g. Protein header of beta peptide
			h. Charge state
			i. Joint probability score P(alpha = T, beta = T)
			j. Marginal probability P(alpha = T)
			k. Marginal probability P(beta = T)
			l. The title of the query spectrum
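The output file can be loaded for downstream analysis with a few lines of Python. This is a minimal sketch under two assumptions: the columns appear in the order a to l listed above, and the three probability columns are plain decimal numbers. The short column labels are made up for illustration and are not the header text written by XLSearch.

```python
import csv

# Illustrative labels for columns a-l; not the actual header strings.
COLUMNS = ("rank", "alpha_seq", "beta_seq", "alpha_site", "beta_site",
           "alpha_protein", "beta_protein", "charge",
           "joint_prob", "alpha_prob", "beta_prob", "title")


def read_psms(path):
    """Return one dict per PSM row, keyed by COLUMNS, skipping the header line."""
    psms = []
    with open(path) as handle:
        reader = csv.reader(handle, delimiter="\t")
        next(reader, None)  # the first line holds the column headers
        for row in reader:
            if len(row) < len(COLUMNS):
                continue  # ignore blank or truncated lines
            psm = dict(zip(COLUMNS, row))
            for key in ("joint_prob", "alpha_prob", "beta_prob"):
                psm[key] = float(psm[key])
            psms.append(psm)
    return psms
```

Each returned dict keeps the remaining fields as strings, so site indices and charge can be converted as needed.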

	4. Evaluating identified PSMs 
		The output file contains the top-scoring PSMs for each query spectrum sorted in descending
		order of the joint probability score. The percentage of false positive identifications at a
		given score cutoff $S$ is estimated by counting the numbers of true-true (TT), true-false (TF),
		and false-false (FF) PSMs whose scores are greater than $S$. Specifically,

									FDR = (#(TF) - #(FF)) / #(TT)
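As a worked example of the formula, the sketch below computes the estimate from the three counts. The function name is hypothetical, and raising an error when there are no true-true PSMs is a choice made here, not documented XLSearch behavior.

```python
def estimate_fdr(n_tt, n_tf, n_ff):
    """Estimate FDR = (#TF - #FF) / #TT for the PSMs scoring above a cutoff."""
    if n_tt == 0:
        # Assumption: the estimate is treated as undefined without any TT PSMs.
        raise ValueError("FDR is undefined when there are no true-true PSMs")
    # The value is returned as-is; it can be negative when #FF exceeds #TF.
    return float(n_tf - n_ff) / n_tt
```

With 200 true-true, 5 true-false and 1 false-false PSMs above the cutoff, the estimate is (5 - 1) / 200 = 0.02, i.e. 2%.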

		To filter the output PSMs at a given FDR, provide the values of 'cutoff' and 
		'is_unique' in the parameter file, where 'cutoff' indicates the desired FDR cutoff, 
		and 'is_unique' ('True' or 'False') indicates whether unique cross-linked peptides 
		(i.e. distinct combinations of cross-linked peptides and charge) or redundant PSMs are counted 
		in the FDR calculation. For example, to filter the results at a 1% FDR cutoff with 
		redundant PSMs counted, set 'cutoff' to 0.01 and 'is_unique' to False.


		The filtered results will be written to the files 'intra0.01.txt' and 'inter0.01.txt' for
		intra-protein and inter-protein cross-links, respectively.
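The relation between the FDR cutoff and a score threshold can be sketched as follows. This is only an illustration of the idea, not the filtering code in XLSearch: it assumes each PSM already carries a 'TT', 'TF' or 'FF' label, counts redundant PSMs (as with 'is_unique' set to False), and accepts PSMs scoring at or above the returned value.

```python
def score_threshold(psms, cutoff):
    """Return the lowest score whose accepted set has estimated FDR <= cutoff.

    psms is a list of (score, label) pairs with label in {'TT', 'TF', 'FF'}.
    Returns None when no threshold satisfies the cutoff.
    """
    counts = {"TT": 0, "TF": 0, "FF": 0}
    ranked = sorted(psms, key=lambda pair: pair[0], reverse=True)
    best = None
    for i, (score, label) in enumerate(ranked):
        counts[label] += 1
        # Only evaluate once all PSMs sharing this score have been counted.
        last_of_tie = i + 1 == len(ranked) or ranked[i + 1][0] != score
        if last_of_tie and counts["TT"] > 0:
            fdr = float(counts["TF"] - counts["FF"]) / counts["TT"]
            if fdr <= cutoff:
                best = score
    return best
```

Because the estimate is not monotonic in the score, this sketch scans the whole ranked list and keeps the most permissive threshold that still meets the cutoff.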

ii. Training mode
Input:	1) PARAM.txt        Contains parameters for performing the database searching.
		2) target_database.fasta	Contains only the TARGET sequences from which true-true
				PSMs can be identified.
		3) uniprot_database.fasta	Contains the pool of protein sequences from which the
			true-false and false-false PSMs can be generated based on the true-true PSMs.
		4) true_true.psm (Optional)	Contains the authentic true-true PSMs from which 
			the true-false and false-false PSMs can be generated. Check the sample file for
			the format.
		
	Steps:
	1. Preparation:
		a. Same as in searching mode.
		b. Create the directory where training is to be performed (e.g. '/xlsearch_train_dir/')
		c. Copy 'xlsearch_train.py' to the current directory
		d. Copy the fasta sequence files ('target_database.fasta', 'uniprot_database.fasta')
			to the current directory
		e. Create the directory that holds the mzXML files (e.g. '/xlsearch_search_dir/mzxml/').
		f. Edit the parameter file 'PARAM.txt' as needed.

	2. Perform training

		Under directory '/xlsearch_train_dir/'
			$ python xlsearch_train.py -l /xlsearch_install_dir/ \
										-p PARAM.txt \
										-o output.txt

		where '-l', '-p' and '-o' indicate the path to the library, the parameter file, and the output
		 file name, respectively. All three arguments are required.
		
	3. Output file
		The output will be in the following format:

		CI00	...	weight 0 of classifier I
		...
		CI15	...	weight 15 of classifier I

		CII00	...	weight 0 of classifier II
		...
		CII15	...	weight 15 of classifier II
		
		nTT	...	number of true-true PSMs
		nTF	...	number of true-false PSMs
		nFF	...	number of false-false PSMs
		
		These lines correspond to the logistic regression parameters for classifiers I and II 
		('CI' and 'CII'), and the numbers of true-true, true-false and false-false PSMs used 
		 to train them ('nTT', 'nTF', 'nFF'). The corresponding parameters in 'PARAM.txt' can be 
		overwritten with these lines to use the updated model.
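To inspect a trained model before copying it into 'PARAM.txt', the output lines can be collected into two weight lists and a set of counts. The sketch below assumes each line is a key followed by whitespace and a single numeric value, which is a guess, since the exact layout is not spelled out above. `parse_training_output` is a hypothetical helper.

```python
import re


def parse_training_output(lines):
    """Collect classifier I/II weights (in index order) and the nTT/nTF/nFF counts."""
    weights = {"CI": {}, "CII": {}}
    counts = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # blank line or a line without a value
        key, value = fields[0], fields[1]
        # Assumption: keys look like CI00..CI15 and CII00..CII15.
        match = re.match(r"^(CII|CI)(\d+)$", key)
        if match:
            weights[match.group(1)][int(match.group(2))] = float(value)
        elif key in ("nTT", "nTF", "nFF"):
            counts[key] = int(value)
    ordered = {}
    for name, by_index in weights.items():
        ordered[name] = [by_index[i] for i in sorted(by_index)]
    return ordered, counts
```

A complete model should yield 16 weights per classifier (indices 0 to 15), which makes a convenient sanity check before overwriting the parameter file.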
