
# Genomic CNV and SV Detection with GPU Acceleration

This project detects copy number variations (CNVs) and structural variants (SVs) in genomic data, using GPU
acceleration to speed up processing of large datasets. It computes mappability, GC content, read depth, and
normalized depth, then detects variants and writes the results in several formats, including VCF.

## Features

- **Copy Number Variation (CNV) Analysis**: Analyzes depth of coverage across genomic windows to detect CNVs.
- **Structural Variant (SV) Detection**: Identifies SVs (e.g., deletions, inversions, translocations) using paired-end 
  and split-read alignments.
- **GPU Acceleration**: Utilizes CUDA-enabled GPU processing to improve the efficiency of mappability, GC content, 
  depth, and normalization calculations.
- **Customizable Parameters**: Adjustable settings for window size, step size, and z-score thresholds.
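
To illustrate how a z-score threshold can turn per-window depth into CNV candidates, here is a minimal CPU-only sketch. It is not the project's GPU implementation: the function name `flag_windows`, the use of a global mean and standard deviation, and the sample depths are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def flag_windows(depths, z_threshold=1.5):
    """Return (index, z_score, label) for windows whose depth deviates strongly.

    Hypothetical helper: the real tool may normalize depth differently
    (e.g. for GC content and mappability) before computing z-scores.
    """
    mu = mean(depths)
    sigma = pstdev(depths)
    if sigma == 0:
        # All windows have identical depth, so nothing stands out.
        return []
    flagged = []
    for index, depth in enumerate(depths):
        z = (depth - mu) / sigma
        if z >= z_threshold:
            flagged.append((index, z, "gain"))
        elif z <= -z_threshold:
            flagged.append((index, z, "loss"))
    return flagged


# Made-up depths: window 4 is unusually deep, window 7 unusually shallow.
depths = [30, 31, 29, 30, 60, 30, 28, 5, 30, 31]
for index, z, label in flag_windows(depths):
    print(f"window {index}: z={z:.2f} ({label})")
```

With the default threshold of 1.5 used here, only the two outlying windows are reported; raising the threshold makes the call more conservative.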

### Author:
**SERRALTA Theo**

### Collaborators:
**DUFFOURD Yannis**

### Laboratory:
**GAD** 

### Date:
**28/09/2023**

## Installation

1. Ensure that Python 3 and CUDA are installed.
2. Install the necessary Python packages:

   ```bash
   pip install numpy pysam pycuda pandas
   ```

3. Clone this repository.

## Usage

### Platform
This software currently runs only on the CCUB cluster (Computing Center of the University of Burgundy).

### Directory
Navigate to the directory:

```bash
cd /work/gad/shared/analyse/test/cnvGPU/test_scalability/
```

### Recommended Execution with qsub

Submit the job with `qsub`, passing the inputs and outputs as environment variables through `-v`:

```bash
qsub -v NUM_CHR=<ALL_or_num_chr>,INPUTFILE=</path/to/the/input/bam/file>,LOGFILE=</path/to/the/log/file>,OUTPUT=</path/to/the/output/file>,OUTPUT_PAIRS=</path/to/the/output_pairs/file>,OUTPUT_SPLITS=</path/to/the/output_splits/file> ./wrapper_cnvGPU.sh
```

Example:

```bash
qsub -pe smp 1 -v NUM_CHR=ALL,INPUTFILE=/work/gad/shared/analyse/test/cnvGPU/test_scalability/dijen1000.bam,OUTPUT=exemple.out.tsv,OUTPUT_PAIRS=exemple.out_pairs.tsv,OUTPUT_SPLITS=exemple.out_splits.tsv,LOGFILE=exemple.log ./wrappers/wrapper_cnvGPU.sh
```
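
When submitting many BAM files, it can help to assemble the `qsub` arguments programmatically. The sketch below is an assumption-laden helper, not part of this repository: `build_qsub_command` is a hypothetical name, and the wrapper path is a parameter because it depends on where `wrapper_cnvGPU.sh` sits relative to the working directory. Only the variable names and the `-pe smp` option come from the commands above.

```python
import shlex

# Variable names taken from the qsub command shown above.
REQUIRED_VARIABLES = [
    "NUM_CHR", "INPUTFILE", "LOGFILE", "OUTPUT", "OUTPUT_PAIRS", "OUTPUT_SPLITS",
]


def build_qsub_command(variables, wrapper="./wrappers/wrapper_cnvGPU.sh", slots=1):
    """Return the qsub argument list for one job (hypothetical helper)."""
    missing = [name for name in REQUIRED_VARIABLES if name not in variables]
    if missing:
        raise ValueError(f"missing variables: {', '.join(missing)}")
    # qsub -v expects a single comma-separated list of NAME=value pairs,
    # so values containing commas cannot be passed this way.
    for name in REQUIRED_VARIABLES:
        if "," in str(variables[name]):
            raise ValueError(f"{name} must not contain a comma")
    var_string = ",".join(f"{name}={variables[name]}" for name in REQUIRED_VARIABLES)
    return ["qsub", "-pe", "smp", str(slots), "-v", var_string, wrapper]


command = build_qsub_command({
    "NUM_CHR": "ALL",
    "INPUTFILE": "example.bam",
    "LOGFILE": "example.log",
    "OUTPUT": "example.out.tsv",
    "OUTPUT_PAIRS": "example.out_pairs.tsv",
    "OUTPUT_SPLITS": "example.out_splits.tsv",
})
print(shlex.join(command))
```

The returned list can be passed to `subprocess.run(command, check=True)` on a host where `qsub` is available.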

### Modifying Parameters

Certain parameters can be customized within the wrapper script:

- `window_size` (w): Default is `-w 100`
- `step_size` (s): Default is `-s 10`
- `zscore_threshold` (z): Default is `-z 1.5`
- `lengthFilter` (l): Default is `-l 200`
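
Because the default step size (10) is smaller than the default window size (100), consecutive windows overlap. The sketch below shows one way such windows could be enumerated. The function name `sliding_windows`, the half-open coordinates, and the handling of the final window are assumptions; the tool itself may treat boundaries differently.

```python
def sliding_windows(length, window_size=100, step_size=10):
    """Yield half-open (start, end) windows over a region of the given length.

    Hypothetical helper for illustration; defaults mirror -w 100 and -s 10.
    """
    if window_size <= 0 or step_size <= 0:
        raise ValueError("window_size and step_size must be positive")
    last_start = max(length - window_size, 0)
    for start in range(0, last_start + 1, step_size):
        # Clip the end so a region shorter than one window still yields a window.
        yield start, min(start + window_size, length)


for start, end in sliding_windows(130):
    print(start, end)
```

A smaller step size gives finer positional resolution at the cost of more windows to process; a step size equal to the window size yields non-overlapping windows.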

### Direct Execution without Wrapper

Alternatively, you can execute the program directly with Singularity:

```bash
singularity exec --nv -e /work/gad/shared/bin/singularity_images/pycuda/pycuda_sam.1.1.sif python3 /work/gad/shared/analyse/test/cnvGPU/test_scalability/cnv_sv_caller_gpu.py -b <input_bamfile> -c <int or "ALL"> -w <int> -s <int> -z <float> -l <int> -o <output_cnv_file_vcf> -p <output_pairs_file> -m <output_splits_file> -e <logfile>
```

Example:

```bash
singularity exec --nv -e /work/gad/shared/bin/singularity_images/pycuda/pycuda_sam.1.1.sif python3 /work/gad/shared/analyse/test/cnvGPU/test_scalability/cnv_sv_caller_gpu.py -b example.bam -c ALL -w 100 -s 10 -z 1.5 -l 200 -o example_cnv.vcf -p example_pairs.tsv -m example_splits.tsv -e example.log
```

## Output Files

- **VCF File**: Contains structural variant calls with relevant information on chromosome, position, variant type, 
  copy number, etc.
- **Paired-Read Events**: Details abnormal paired-end read alignments indicating possible structural variations.
- **Split-Read Events**: Lists split-read alignments for further variant investigation.
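
For downstream inspection, the VCF output can be read with a few lines of standard-library Python. This is a minimal sketch that relies only on the fixed VCF columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). The sample record is made up, and the INFO keys `SVTYPE` and `END` are assumed here for illustration; check the header of a real output file for the keys this tool actually writes.

```python
def parse_vcf_line(line):
    """Split one VCF data line into its fixed columns and an INFO dictionary."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, variant_id, ref, alt, qual, filt, info = fields[:8]
    info_dict = {}
    if info != ".":
        for entry in info.split(";"):
            key, sep, value = entry.partition("=")
            # Keys without "=" are flags and are stored as True.
            info_dict[key] = value if sep else True
    return {
        "chrom": chrom, "pos": int(pos), "id": variant_id, "ref": ref,
        "alt": alt, "qual": qual, "filter": filt, "info": info_dict,
    }


def read_vcf(lines):
    """Parse all data lines, skipping blank lines and '#' header lines."""
    return [parse_vcf_line(l) for l in lines if l.strip() and not l.startswith("#")]


# Hypothetical content; in practice pass an open file handle instead of a list.
sample = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t10000\t.\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=12000;IMPRECISE",
]
for record in read_vcf(sample):
    print(record["chrom"], record["pos"], record["alt"], record["info"])
```

For production use, `pysam.VariantFile` (already among the dependencies) is a more robust way to read VCF files.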

## Dependencies

- Python 3.x
- CUDA-compatible GPU
- [NumPy](https://numpy.org/), [pysam](https://pysam.readthedocs.io/), [PyCUDA](https://documen.tician.de/pycuda/),
  [pandas](https://pandas.pydata.org/)

## License

## Acknowledgments

This tool was developed to support high-performance genomic analyses, using GPU acceleration to make CNV and SV
detection feasible on large datasets.