CRCBuilder_Python3 - Institute of Pediatric Research


PURPOSE:
To build Core Regulatory Circuitry from H3K27ac ChIP-seq data

INSTALLATION:
1) Install Miniconda:
      wget https://mirrors.tuna.tsinghua.edu.cn/anaconda/miniconda/Miniconda3-latest-Linux-x86_64.sh
      bash Miniconda3-latest-Linux-x86_64.sh
      source ~/.bashrc
      conda config --add channels http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free
      conda config --add channels http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
      conda config --add channels http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/bioconda
      conda config --set show_channel_urls yes
2) Create the conda environment and install the dependencies:
      conda create -n crcbuilder python=3.8
      source activate crcbuilder
      conda install -c pwwang bwtool
      conda install -c bioconda meme
      conda install -c bioconda pyfasta
      conda install -c conda-forge networkx
      conda install -c conda-forge matplotlib-base
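
After installation, a quick check that the environment is complete can save a failed run later. The following is a minimal sketch, not part of CRCBuilder itself. It assumes the Python import names are networkx, pyfasta and matplotlib, and that the conda packages provide the executables bwtool and fimo (FIMO is part of the MEME suite). The function name find_missing is hypothetical.

```python
import importlib.util
import shutil

# Assumed names: import names of the Python packages installed above and the
# executables shipped by the bwtool and meme conda packages.
PYTHON_MODULES = ["networkx", "pyfasta", "matplotlib"]
EXECUTABLES = ["bwtool", "fimo"]


def find_missing(modules=PYTHON_MODULES, executables=EXECUTABLES):
    """Return (missing modules, missing executables) for the active environment."""
    missing_modules = [m for m in modules if importlib.util.find_spec(m) is None]
    missing_tools = [e for e in executables if shutil.which(e) is None]
    return missing_modules, missing_tools


if __name__ == "__main__":
    mods, tools = find_missing()
    if mods or tools:
        print("Missing Python modules:", ", ".join(mods) or "none")
        print("Missing executables:", ", ".join(tools) or "none")
    else:
        print("All dependencies found.")
```

Run it inside the activated crcbuilder environment; anything reported as missing can be reinstalled with the corresponding conda install command above.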

REQUIREMENTS:
FASTA files for the genome used (e.g. hg38.fa) must be placed in a location that is specified when running the program (-f option). They can be downloaded from ftp://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/ and must be decompressed before use.
A bigWig file (.bw or .bigwig) of H3K27ac sequencing reads and its super-enhancer table (_peaks_SuperEnhancers.table.xls) generated by the ROSE software are also required. Both filenames must begin with the same prefix, e.g.: Hela_1-H3K27ac.bw and Hela_1-H3K27ac_peaks_SuperEnhancers.table.xls
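
The naming convention above can be checked before a run. The sketch below is an illustration only, not CRCBuilder's own code: it scans a directory for bigWig files and looks for a ROSE table that shares each file's prefix. The function name pair_inputs is hypothetical, and the assumption is that the prefix is the bigWig filename without its extension.

```python
from pathlib import Path

BIGWIG_SUFFIXES = (".bw", ".bigwig")
TABLE_SUFFIX = "_peaks_SuperEnhancers.table.xls"


def pair_inputs(bw_dir):
    """Map each sample prefix to (bigWig path, ROSE table path or None)."""
    bw_dir = Path(bw_dir)
    pairs = {}
    for path in sorted(bw_dir.iterdir()):
        if path.suffix.lower() not in BIGWIG_SUFFIXES:
            continue
        # Assumption: the shared prefix is the filename minus its extension,
        # e.g. "Hela_1-H3K27ac" for "Hela_1-H3K27ac.bw".
        prefix = path.stem
        table = bw_dir / (prefix + TABLE_SUFFIX)
        pairs[prefix] = (path, table if table.is_file() else None)
    return pairs


if __name__ == "__main__":
    import sys

    for prefix, (bigwig, table) in pair_inputs(sys.argv[1]).items():
        status = "ok" if table is not None else "missing super-enhancer table"
        print(f"{prefix}: {status}")
```

A sample whose table is reported as missing most likely has a filename that does not start with the same prefix as its bigWig file.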

CONTENTS:
      CRCBuilder.py: main program
      utils.py: utility methods
      TFlist_NMid_hg.txt: TFs used and their human NMIDs
      source/CIS_BP_HOCOMOCOv11_motif.meme: motif library
      source/MotifDictionary.txt: TFs used and their associated motif names

USAGE:
      Run the program by calling CRCBuilder.py from the directory that contains all the files listed above:
      python CRCBuilder.py -s [--step] -b [--bw_dir] -f [--fasta]
      -s [--step]
        Select the step to start with (CalculatePromoterActivity(CPA) / findCanidateTFs(FCT) / findMotifs(FM) / buildCRCs(BC)).
      -b [--bw_dir]
        The directory containing the bigWig files of H3K27ac sequencing reads.
      -f [--fasta]
        The path to the FASTA file for the genome version used; the suffix must be '.fa' or '.fasta'.
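
For readers who want to see how these options fit together, here is a minimal argparse sketch of the same command-line interface. It is an illustration under stated assumptions, not the parser used in CRCBuilder.py: it assumes that -s and -b are always required, that the chosen step and all later steps are run in the order CPA, FCT, FM, BC, and that -f is needed only for CPA and FCT. The helper names build_parser, steps_to_run and parse_args are hypothetical.

```python
import argparse

# Step abbreviations from the usage text, in assumed pipeline order.
STEPS = ["CPA", "FCT", "FM", "BC"]


def build_parser():
    parser = argparse.ArgumentParser(prog="CRCBuilder.py")
    parser.add_argument("-s", "--step", choices=STEPS, required=True,
                        help="step to start with")
    parser.add_argument("-b", "--bw_dir", required=True,
                        help="directory containing the H3K27ac bigWig files")
    parser.add_argument("-f", "--fasta", default=None,
                        help="genome FASTA file ending in .fa or .fasta")
    return parser


def steps_to_run(start):
    """Return the start step and every step after it (assumed behavior)."""
    return STEPS[STEPS.index(start):]


def parse_args(argv=None):
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.fasta is not None and not args.fasta.endswith((".fa", ".fasta")):
        parser.error("--fasta must end in '.fa' or '.fasta'")
    # Assumption: only the first two steps read the genome sequence.
    if args.fasta is None and args.step in ("CPA", "FCT"):
        parser.error("--fasta is required when starting from CPA or FCT")
    return args


if __name__ == "__main__":
    args = parse_args()
    print("Steps to run:", ", ".join(steps_to_run(args.step)))
```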

EXAMPLE:
      python CRCBuilder.py -s CPA -b /mnt/data/Hela-H3K27ac/ -f /mnt/genome/hg38.fa
      python CRCBuilder.py -s FM -b /mnt/data/Hela-H3K27ac/
        (the -f option can be omitted for the findMotifs (FM) and buildCRCs (BC) steps)

OUTPUT FILES:
      SAMPLE_*_ASSIGNMENT_GENES.txt: list of names of genes assigned to SEs.
      SAMPLE_*_ASSIGNMENT_TRANSCRIPTS.txt: NMIDs of transcripts assigned to SEs.
      SAMPLE_*_bg.meme: DNA background sequence file used with FIMO.
      SAMPLE_*_CANDIDATE_TF_AND_SUPER_TABLE.txt: table containing the candidate TFs and the locations of their associated SEs.
      SAMPLE_*_connections.txt: table containing TF-TF interconnections.
      SAMPLE_*_EXPRESSED_GENES.txt: list of genes considered expressed (top 2/3).
      SAMPLE_*_EXPRESSED_TRANSCRIPTS.txt: list of transcripts considered expressed.
      SAMPLE_*_SUBPEAKS.fa: FASTA file of SE constituent sequences used with FIMO.
      mergeAUTOREG_*.txt: list of gene names of TFs predicted to bind their own SE.
      mergeCRC_SCORES_*.txt: all possible CRCs, ranked by the average frequency of occurrence of the TFs they contain across all possible interconnected autoregulatory loops.
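
The ranking described for mergeCRC_SCORES_*.txt can be illustrated with a small worked example. The sketch below is one plausible reading of that description, not code taken from CRCBuilder.py: each TF's frequency is taken to be the fraction of candidate loops that contain it, and a loop's score is the mean frequency of its members. The function name score_crcs and the TF names TF_A, TF_B and TF_C are hypothetical toy data.

```python
from collections import Counter


def score_crcs(loops):
    """Rank candidate circuits by the mean occurrence frequency of their TFs."""
    n_loops = len(loops)
    # Count in how many candidate loops each TF appears.
    counts = Counter(tf for loop in loops for tf in set(loop))
    scored = []
    for loop in loops:
        members = sorted(set(loop))
        if not members:
            continue
        # Assumed normalization: frequency = occurrences / number of loops.
        score = sum(counts[tf] / n_loops for tf in members) / len(members)
        scored.append((members, score))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored


if __name__ == "__main__":
    toy_loops = [("TF_A", "TF_B"), ("TF_A", "TF_C"), ("TF_A", "TF_B", "TF_C")]
    for members, score in score_crcs(toy_loops):
        print(f"{score:.3f}\t{','.join(members)}")
```

In this toy input TF_A occurs in all three loops and the other two TFs in two each, so both two-member loops score 5/6 and the three-member loop scores 7/9.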

CRC GRAPH CONVERT:
Submit the CRC members extracted from mergeCRC_SCORES_*.txt: