Bioinformatic Programmer Cheat Sheet

The document provides a comprehensive list of basic filesystem commands, package management commands, and job management commands for Linux, particularly in a bioinformatics context. It includes instructions for working with files, modules, and programs, as well as specific commands for tools like bcftools, samtools, and snakemake. Additionally, it outlines steps for RNA sequencing analysis, including data downloading, pipeline setup, and resource allocation on a server.

Uploaded by

flamingpizzaboy
Copyright
© © All Rights Reserved
Basic filesystem commands
pwd – prints the present working directory
ls <optional path> – lists the files in the current directory or at the given path
rm -rf <path> – deletes the file or directory at the path, recursively and without asking for confirmation
cd <path> – changes directory to path
cd - – goes back to the previous directory
cd .. – goes up one directory
cd ~ – goes to the home directory
cd ../ – goes t
cp <source path> <destination path> – copies the file at the source path to the destination path
touch <filename> – creates an empty file with that filename
nano <filename> – opens filename for editing
cat <filename> – prints the contents of a file
zcat <filename> – prints the contents of a compressed file without decompressing it on disk
head -n <number> <filename> – prints the specified number of lines at the beginning of a file
tail -n <number> <filename> – prints the specified number of lines at the end of a file
less <filename> – displays the contents of a file one screen at a time
ln -s <original> <symlink> – creates a symlink to the original file
ssh <user>@login.tscc.sdsc.edu – logs in to the server
scp <local/path> <user>@login.tscc.sdsc.edu:<destination/path> – copies files from local to server
source ~/.bashrc – reloads changes to the bashrc without logging out
chmod +x <filename> – makes a file executable
sh <filename.sh> – runs a shell script

Learning info about files on linux
ll <optional path> – displays detailed directory listings in path (usually an alias for ls -l)
readlink -f <symlinkName> – prints the full path a symbolic link resolves to
file <broken_symlink> – shows where a broken symbolic link points to
df -h <server_filepath> – checks total/available space on a storage server
du -sh <filepath> – calculates the size of a file or directory
lfs quota -uh <user> <scratch_filepath> – tells you how much space you have left in scratch
which <executable_command> – tells you the filepath of the command that you are executing
grep <word> <filename> – finds lines containing word inside filename
find <directory> -iname "*<word>*" – finds files in the directory whose names contain the word, ignoring case (quote the pattern so the shell does not expand it)
ls -1 | wc -l – gives the number of files in the current directory
md5sum <filename> > <filenamemd5sum.txt> – generates an md5sum file
md5sum -c <filenamemd5sum.txt> – verifies files against the checksums stored in the md5 file
gzip -d <file.gz> – decompresses a gzip file
tar -tf <file.tar> – views the contents of a tar file without extracting it
tar -xvf <file.tar> – extracts a tar file
tar -xvzf <file.tar.gz> – extracts a tar.gz file

Working with packages
pip list – lists the pip packages you have installed and their versions
pip show <package> – gives information for an individual installed pip package
<package> --version – tells the version of a specified package

Working with modules
module spider – lists all available modules
module spider <module> – gives details for a specific module
module load <module> – loads a module

Working with programs

bcftools
bcftools view -s <sample1,sample2> <file.vcf> > <filtered.vcf> – creates a subset of the original vcf file containing only the listed samples
bcftools merge <vcf1.vcf.gz> <vcf2.vcf.gz> -o <combined.vcf.gz> – merges multiple vcfs into one combined vcf

conda
conda create --name <environment> – creates an environment with that name
conda env create -f <environment.yml> – creates an environment from the yml file
conda list – lists all of the packages in the current conda environment
conda activate <environment> – activates the conda environment with that name
conda deactivate – deactivates your current conda environment
conda env export > <environment.yml> – creates a yml file from the current conda environment
conda create --name <environment-name> --clone <environment/path> – clones the environment at path
conda create --override-channels -c defaults -n py27 python=2.7 – creates an environment named py27 with Python 2.7

galyleo
galyleo launch --account csd742 --qos condo --partition condo --cpus 2 --time-limit 168:00:00 --env-modules slurm,cpu/0.17.3 --conda-init ~/.anaconda/etc/profile.d/conda.sh -m 14 – runs a jupyter notebook (edit for personal use and initialize in the conda environment you plan to use)

gatk
picard CreateSequenceDictionary -R <fasta name> – creates a dictionary for a fasta file

github
git clone <url.git> – copies the repository at the url into a new local folder

plink
plink --pca --allow-extra-chr --vcf <vcf_path.vcf.gz> – gets the files necessary for a pca analysis on a combined vcf

samtools
samtools view -H <bam filepath> – views the header of a bam file
samtools fastq <original.bam> > <new.fastq> – converts a bam file to fastq
samtools quickcheck <file.bam> – checks to make sure a bam file is not corrupted
samtools view -h -o <file.sam> <file.bam> – converts a bam file into a sam
samtools view -bS <file.sam> > <file.bam> – converts a sam file into a bam
tabix -p vcf <vcf_filepath.vcf.gz> – creates an index file for a vcf file

snakemake
snakemake -j 20 --cluster "sbatch -N 1 -n 2 --mem 8G -t 24:00:00 -p condo -q condo -A csd742" --rerun-incomplete --latency-wait 120 – runs the program (edit for personal use)
snakemake -j 50 --cluster "sbatch -N 1 -n 1 --mem 1G -t 8:00:00 -p platinum -q hcp-csd742 -A csd742" --rerun-incomplete --latency-wait 120 --use-singularity --singularity-args "-B /tscc/projects/ps-gleesonlab5/,/tscc/projects/ps-gleesonlab7/,/tscc/projects/ps-gleesonlab8/,/tscc/lustre/ddn/scratch/ton011/" – runs a snakemake file that uses singularity and conda (edit for personal use)
Ctrl+C – cancels snakemake

vcftools
vcftools --gzvcf <input_file.vcf.gz> --chr <chr# or #> --to-bp <end_pos> --out <output_prefix> --recode --remove-filtered-all --from-bp <start_pos> --recode-INFO-all – makes a smaller vcf from a larger vcf at a specific position

Dealing with tmux
tmux ls – lists current tmux sessions
tmux attach -d -t <session id> – reattaches a previous tmux session based on id
Ctrl+B then [ – scroll (q to quit)
Ctrl+B then D – detach tmux session
Ctrl+B C – create a new window
Ctrl+B X – kill active pane
Ctrl+B N or P – move to the next or previous window
tmux kill-session -t <targetSession> – kills a specific session

Running a job
srun -N 1 -n 4 --mem 16G -t 168:00:00 -p platinum -q hcp-csd742 -A csd742 --pty bash – (example) edit for individual use
squeue -u <user> – lists jobs from user
scancel -u <user> – kills all jobs from user
scancel <job_id> – kills the job with that id
exit – logs out of the server
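The md5sum commands listed above are typically used as a pair: write a checksum file when data arrives, then verify it after copying. The sketch below shows one way to wrap that pair. The function names and the `.md5` suffix are illustrative assumptions, not part of the original cheat sheet.

```shell
# Sketch: store a checksum next to a file, then verify it later.
# make_checksum / verify_checksum and the ".md5" suffix are hypothetical names.

make_checksum() {
    # $1 = file to checksum; writes <file>.md5 in the same directory
    local file="$1"
    ( cd "$(dirname "$file")" && md5sum "$(basename "$file")" > "$(basename "$file").md5" )
}

verify_checksum() {
    # $1 = file whose <file>.md5 should be checked; prints OK or FAILED
    local file="$1"
    if ( cd "$(dirname "$file")" && md5sum -c "$(basename "$file").md5" ) >/dev/null 2>&1; then
        echo "OK"
    else
        echo "FAILED"
    fi
}
```

Running `make_checksum sample.fastq.gz` before an scp and `verify_checksum sample.fastq.gz` on the destination is one way to confirm the transfer did not corrupt the file.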
Toan’s Notes

Activating jupyter notebook
conda activate Jupyter
ll -a (list hidden files)
squeue -A csd742 (look at resources people are using)
seff <job ID> (look in specific nodes)
snakefile, check for threads (small n)
RNA seq, N=1, n=16

mkdir analysis – creates a new folder for analysis in the WGS_DNM_pipeline folder
ls

Quitting:
Ctrl+B then D – detach tmux session
Exit on home q=window

Pipelines are on ps-gleesonlab7
RNA seq: incoming raw data folder, gleeson8
You can derive a bam from fastq, and a fastq from a bam

Login to Gleeson lab server:
cd /tscc/projects/ps-gleesonlab8/User/toan

Resource allocation:
srun -N 1 -n 4 --mem 16G -t 168:00:00 -p condo -q condo -A csd742 --pty bash

Scratch Space:
cd /tscc/lustre/ddn/scratch/ton011

Checking for scratch space availability:
lfs quota -uh ton011 /tscc/lustre/ddn/scratch/ton011

Resource allocation methylseq:
srun -N 1 -n 32 --mem 100G -t 168:00:00 -p condo -q condo -A csd742 --pty bash

Terminate run
Ctrl + C

Kill Job
scancel <job ID>

How to check job ID?
squeue -A csd742
seff <job ID>
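To go from the squeue listing to the job IDs that scancel and seff expect, the first column of the output can be pulled out with awk. This is a minimal sketch that assumes the default squeue layout, where the first line is a header and JOBID is the first column. The function name job_ids is hypothetical.

```shell
# Sketch: extract job IDs from squeue-style text read on stdin.
# Assumes a header line followed by rows whose first column is the job ID.

job_ids() {
    awk 'NR > 1 && NF > 0 { print $1 }'
}

# Possible usage on the cluster (not run here):
#   squeue -u <user> | job_ids
#   squeue -u <user> | job_ids | while read -r id; do seff "$id"; done
```

If the listing is produced without a header, the `NR > 1` condition would skip the first job, so it should be dropped in that case.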


RNA Seq Pipeline:

Follow the directions on GitHub!

Download the pipeline into your user folder:
cookiecutter git+ssh://[email protected]/Gleeson-Lab/rna_seq_pipeline.git
Name it with the date, because you will need to install this every time you run RNA seq. After running the cookiecutter line, it will prompt you on how to do it.

Incoming raw data folder:
Ps-gleesonlab8/incoming_raw_data/20241105_yuji
cd (keep tapping to the end because it's nested)
This will show all the fastq files
You need to run the pipeline on all the fastq files (use a for loop)

Nextflow

Downloading raw data
Access the “Incoming_raw_data” folder on Gleeson8
mkdir a new folder to put the files in
cd to that folder
Download data using wget -r

Example:
PACS2_RNA seq 5/15/25
Your sequencing data is available on the FTP server:
ftp://igm-storage.ucsd.edu/250512_LH00444_0340_B22W5G5LT4
Username: gleeson
Password: tiJHtKP4y1
RNA_seq_PACS2
wget -r -nH --cut-dirs=1 --no-parent ftp://gleeson:tiJHtKP4y1@igm-storage.ucsd.edu/250512_LH00444_0340_B22W5G5LT4/ -P /tscc/projects/ps-gleesonlab8/Incoming_raw_data/20250515_Toan_ASOSFARI_PACS2_RNASeq_IGM
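The note about running the pipeline on every fastq file with a for loop could look roughly like the sketch below. It only lists sample names. The `.fastq.gz` suffix and the function name list_samples are assumptions, and the actual pipeline command is left as a placeholder because the notes do not give it.

```shell
# Sketch: loop over the fastq files in a folder and print one sample name each.
# Assumes files end in ".fastq.gz"; adjust the pattern to match your data.

list_samples() {
    # $1 = directory containing the downloaded fastq files
    local dir="$1" fq
    for fq in "$dir"/*.fastq.gz; do
        [ -e "$fq" ] || continue      # the pattern matched nothing
        basename "$fq" .fastq.gz
    done
}

# Possible usage (the echo stands in for the real pipeline call):
#   for sample in $(list_samples <raw_data_folder>); do
#       echo "would run the pipeline on $sample"
#   done
```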
Putting in a new reference genome:
Change the snake_conf.yaml, custom_fasta
Edit the last part in custom_fasta to put in your reference of interest (.fa file)

NOTE: Need to index the fasta before you run
Edit the last part of
/tscc/nfs/home/xiy010/miniconda3/bwa index [custom fasta]

Dry run using changed input: snakemake -n
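Since the fasta has to be indexed before the run, a quick check for the index files can save a failed job. The sketch below assumes the classic bwa index output, which writes five files next to the fasta with the suffixes .amb, .ann, .bwt, .pac and .sa. The function name has_bwa_index is hypothetical.

```shell
# Sketch: succeed only if all five bwa index files exist next to the fasta.
# Assumes the index was (or will be) built with "bwa index <fasta>".

has_bwa_index() {
    # $1 = path to the reference fasta
    local fasta="$1" ext
    for ext in amb ann bwt pac sa; do
        [ -e "$fasta.$ext" ] || return 1
    done
    return 0
}

# Possible usage:
#   has_bwa_index <custom.fa> || echo "index the fasta before running snakemake"
```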
More helpful commands
seff <logID> – tells you the resources a job is using
seff 5847641
bash script for checking the available cores and memory on platinum and gold nodes:
bash /tscc/projects/ps-gleesonlab9/user/yix/scripts_templates/check_avail.sh

Condo: max resources = 7 days, 7 GB RAM per GPU
Platinum: 14 days, 15 GB RAM per GPU
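The day limits noted above (7 days and 14 days) can be compared against the -t value passed to srun or sbatch, which is written as HH:MM:SS. A minimal sketch of that check follows. The function name fits_time_limit is hypothetical, and the limits are passed in as arguments rather than looked up from the scheduler.

```shell
# Sketch: check whether a walltime request fits within a limit given in days.

fits_time_limit() {
    # $1 = requested walltime as HH:MM:SS (for example 168:00:00)
    # $2 = limit in whole days (for example 7)
    local h m s rest
    h=${1%%:*}
    rest=${1#*:}
    m=${rest%%:*}
    s=${rest#*:}
    # Strip one leading zero so values such as "08" are not read as octal.
    h=${h#0}; m=${m#0}; s=${s#0}
    [ $(( ${h:-0} * 3600 + ${m:-0} * 60 + ${s:-0} )) -le $(( $2 * 86400 )) ]
}

# Possible usage:
#   fits_time_limit 168:00:00 7 && echo "request fits a 7 day limit"
```

With these numbers, 168:00:00 is exactly 7 days, so the srun examples in this sheet sit right at the shorter of the two limits.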
