PyFeat-2.x: A Python-based Effective Feature Generation Tool from DNA, RNA, and Protein/Peptide Sequences
Background / Motivation / Abstract / Gist:
In bioinformatics research, we often see the same biological sequences, and even the same algorithms, yield widely varying performance. The cause usually turns out to be the feature representation process: a well-chosen feature representation can substantially improve performance. We therefore believe that rational feature representation techniques are a significant contribution to algorithmic bioinformatics and to machine learning research in the biological sciences. PyFeat-2.x is a comprehensive, deep-learning-friendly, Python-based tool for generating various numerical feature representations from DNA, RNA, and protein primary-structure sequences.
1. How does it work? Or, what is the primary goal of this package?
We incorporate many state-of-the-art feature groups for DNA, RNA, and protein/peptide sequences. PyFeat-2.x takes a FASTA file (anyName.fasta, anyName.fa, anyName.txt) as input, along with some parameters specific to the sequence type, and writes the resulting features in NumPy array format (~.npy).
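To make the input side concrete, here is a minimal sketch of reading FASTA records in plain Python. It is not PyFeat-2.x's own parser; the function name `read_fasta` and the choice to upper-case residues are assumptions made for illustration only.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream.

    Illustrative only: this is not the parser used inside PyFeat-2.x.
    """
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts; emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            # Sequence lines may be wrapped, so collect and join them.
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


if __name__ == "__main__":
    demo = io.StringIO(">seq1\nMKV\nLAA\n>seq2\nacgt\n")
    for name, seq in read_fasta(demo):
        print(name, seq)
```

With a real file, the same function could be called as `read_fasta(open("anyName.fasta"))`.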
[Figures: Feature Generate Process | Feature Merge Process]
2. Download Package
2.1. Direct Download
We can directly download the package by clicking the link.
Note: The package will download in zip format (.zip), named PyFeat-2.x-master.zip.
2.2. Clone a GitHub Repository (Optional)
Cloning a repository syncs it to our local machine. After cloning, we can add and edit files, then push and pull updates. (The examples below are for Linux-based operating systems.)
- Clone over HTTPS:
user@machine:~$ git clone https://github.com/mrzResearchArena/PyFeat-2.x.git
- Clone over SSH:
user@machine:~$ git clone git@github.com:mrzResearchArena/PyFeat-2.x.git
Note-1: If the clone is successful, a new sub-directory appears on our local drive. This directory has the same name (PyFeat-2.x) as the GitHub repository that we cloned.
Note-2: We can run any Linux command from any valid location or path; by default, a command generally runs from /home/user/.
Note-3: user is the user name on our computer; yours may differ (e.g., /home/bioinformatics/, /home/LinusTorvalds/).
3. Required Python Packages:
- Install: python (version >= 3.6)
- Install: numpy (version >= 1.15.0)
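A quick way to confirm these minimums before running the tool is sketched below. The helper names `version_tuple` and `meets_minimum` are hypothetical and only compare the leading numeric parts of a version string.

```python
import sys


def version_tuple(text):
    """Turn '1.15.0' (or '1.26.0rc1') into a tuple of leading integers."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(found, minimum):
    """Return True if version string `found` is at least `minimum`."""
    return version_tuple(found) >= version_tuple(minimum)


if __name__ == "__main__":
    print("python ok:", sys.version_info >= (3, 6))
    try:
        import numpy
        print("numpy ok:", meets_minimum(numpy.__version__, "1.15.0"))
    except ImportError:
        print("numpy is not installed")
```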
Table 1: Parameters/arguments for feature generation:
Long Argument | Short Argument | Variable Type | Default | Choices | Is it a feature? | Applicable | Argument Help |
---|---|---|---|---|---|---|---|
--sequenceType | -seq | string | -seq=PROT | {DNA, dna, RNA, rna, PROT, prot} | ❌ | ⛔ | Please use {DNA, dna} for DNA sequences, {RNA, rna} for RNA sequences, {PROT, prot} for protein/peptide sequences. |
--fasta | -fa | string | -fa=samplePROT.fa | None | ❌ | ⛔ | Please enter a UNIX-like path. Example: -fa=/home/user/anyFASTA.fa |
--terminusLength | -t | integer | -t=50 | None | ❌ | ⛔ | A terminusLength of 30 to 100 performed well. |
--gGap | -g | integer | -g=5 | None | ❌ | ⛔ | A -g value between 1 and 5 performed well. |
--kTuple | -k | integer | -k=3 | None | ❌ | ⛔ | A -k value between 1 and 3 performed well. |
--FkMers | -fkmer | integer | -fkmer=0 | {1, 0} | ✔️ | | 1 and 0 denote On (active) and Off (inactive), respectively. |
--FgGaps11 | -fg11 | integer | -fg11=0 | {1, 0} | ✔️ | | 1 and 0 denote On (active) and Off (inactive), respectively. |
--FgGaps12 | -fg12 | integer | -fg12=0 | {1, 0} | ✔️ | | 1 and 0 denote On (active) and Off (inactive), respectively. |
--FgGaps21 | -fg21 | integer | -fg21=0 | {1, 0} | ✔️ | | 1 and 0 denote On (active) and Off (inactive), respectively. |
--PgGaps11 | -pg11 | integer | -pg11=0 | {1, 0} | ✔️ | | 1 and 0 denote On (active) and Off (inactive), respectively. |
--PgGaps12 | -pg12 | integer | -pg12=0 | {1, 0} | ✔️ | | 1 and 0 denote On (active) and Off (inactive), respectively. |
--PgGaps21 | -pg21 | integer | -pg21=0 | {1, 0} | ✔️ | | 1 and 0 denote On (active) and Off (inactive), respectively. |
--PkMers | -pkmer | integer | -pkmer=0 | {1, 0} | ✔️ | | 1 and 0 denote On (active) and Off (inactive), respectively. |
--binaryProfileFeature | -bpf | integer | -bpf=0 | {1, 0} | ✔️ | | 1 and 0 denote On (active) and Off (inactive), respectively. |
--BLOSUM62 | -blosum62 | integer | -blosum62=0 | {1, 0} | ✔️ | | 1 and 0 denote On (active) and Off (inactive), respectively. |
--PAM250 | -pam250 | integer | -pam250=0 | {1, 0} | ✔️ | | 1 and 0 denote On (active) and Off (inactive), respectively. |
--physicochemicalPropertiesP1 | -pcpP1 | integer | -pcpP1=0 | {1, 0} | ✔️ | | 1 and 0 denote On (active) and Off (inactive), respectively. |
--physicochemicalPropertiesP2 | -pcpP2 | integer | -pcpP2=0 | {1, 0} | ✔️ | | 1 and 0 denote On (active) and Off (inactive), respectively. |
--physicochemicalPropertiesP3 | -pcpP3 | integer | -pcpP3=0 | {1, 0} | ✔️ | | 1 and 0 denote On (active) and Off (inactive), respectively. |
--physicochemicalPropertiesP4 | -pcpP4 | integer | -pcpP4=0 | {1, 0} | ✔️ | | 1 and 0 denote On (active) and Off (inactive), respectively. |
--physicochemicalPropertiesP5 | -pcpP5 | integer | -pcpP5=0 | {1, 0} | ✔️ | | 1 and 0 denote On (active) and Off (inactive), respectively. |
--physicochemicalPropertiesD1 | -pcpD1 | integer | -pcpD1=0 | {1, 0} | ✔️ | | 1 and 0 denote On (active) and Off (inactive), respectively. |
--physicochemicalPropertiesD2 | -pcpD2 | integer | -pcpD2=0 | {1, 0} | ✔️ | | 1 and 0 denote On (active) and Off (inactive), respectively. |
--physicochemicalPropertiesR1 | -pcpR1 | integer | -pcpR1=0 | {1, 0} | ✔️ | | 1 and 0 denote On (active) and Off (inactive), respectively. |
--BLASTn | -blastn | integer | -blastn=0 | {1, 0} | ✔️ | | 1 and 0 denote On (active) and Off (inactive), respectively. |
--transitionTransversion | -tt | integer | -tt=0 | {1, 0} | ✔️ | | 1 and 0 denote On (active) and Off (inactive), respectively. |
Note: The = sign is optional for a parameter argument, but if you use the = sign, please make sure there is no space on either side of it. (For more explanation, please see Table 2.)
Note: We can use both long arguments and short arguments.
Note: Helpful Tips: You can always use a short argument to reduce typing and time.
Table 2: How to pass a parameter
Argument Segment | Is it valid? | Issue |
---|---|---|
-seq=PROT | ✔️ | None |
-seq = PROT | ❌ | Please remove the blank spaces on both sides of the = sign. |
-seq PROT | ✔️ | None |
-seq     PROT | ✔️ | None; multiple spaces are allowed. |
Note-1: We can also use --sequenceType (long argument, as listed in Table 1) instead of -seq (short argument/shortcut argument).
Note-2: Helpful Tips: You can omit the = sign in arguments.
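The rules in Table 2 match how Python's standard `argparse` module treats the = sign. The sketch below reproduces them with a stand-in parser; whether main.py actually uses `argparse` is an assumption, and `build_parser` and `try_parse` are hypothetical names that cover only two of the options from Table 1.

```python
import argparse
import shlex


def build_parser():
    # Stand-in parser (assumption): mirrors two options from Table 1 only.
    p = argparse.ArgumentParser(prog="main.py")
    p.add_argument("-seq", "--sequenceType", type=str, default="PROT")
    p.add_argument("-t", "--terminusLength", type=int, default=50)
    return p


def try_parse(command_tail):
    """Return the parsed namespace, or None if argparse rejects the input."""
    try:
        # shlex.split mimics the shell: runs of spaces collapse into one split.
        return build_parser().parse_args(shlex.split(command_tail))
    except SystemExit:
        # argparse prints a usage message to stderr and exits on bad input.
        return None


if __name__ == "__main__":
    for tail in ["-seq=PROT", "-seq = PROT", "-seq PROT", "-seq     PROT"]:
        result = try_parse(tail)
        print(repr(tail), "->", "valid" if result is not None else "invalid")
```

In the rejected case, the shell hands `=` to `-seq` as its value and leaves `PROT` over as an unrecognized argument, which is why the spaces around the = sign matter.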
3. Generate Feature:
Example-1:
- Generate only the Binary Profile Feature:
user@machine:~$ python main.py -fa anyFASTA.fasta -bpf 1 # default: -seq PROT.
- Generate only the Binary Profile Feature with a terminus length of 40:
user@machine:~$ python main.py -fa anyFASTA.fasta -bpf 1 -t 40 # default: -t 50.
Example-2:
- Generate only the Position-wise g-Gaps & monoMono Style:
user@machine:~$ python main.py -fa anyFASTA.fasta -fg11 1 # default: -t 50, -g 5, -k 3.
- Generate only the Position-wise g-Gaps & monoMono Style with 3-gaps and 4-mers:
user@machine:~$ python main.py -fa anyFASTA.fasta -fg11 1 -g 3 -k 4 # default: -t 50, -g 5, -k 3.
Example-3:
- Generate only the transition/transversion feature:
user@machine:~$ python main.py -fa anyFASTA.fasta -tt 1 -seq DNA # default: -seq PROT.
Example-4:
- Generate multiple datasets in a single command:
user@machine:~$ python main.py -fa anyFASTA.fasta -bpf 1 -blosum62 1 -pam250 1 -fg11 1
Note: The PyFeat-2.x tool can generate multiple datasets with different parameters.
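When many feature combinations need to be generated, it can help to assemble the command line programmatically. The following is a minimal sketch that builds an argument list from the short flags in Table 1; the helper `build_command` is hypothetical and is not part of PyFeat-2.x.

```python
def build_command(fasta, seq_type="PROT", features=(), **params):
    """Assemble an argv list for main.py from short flag names in Table 1.

    `features` lists feature flags to switch on (e.g. "bpf", "fg11");
    `params` carries numeric settings (e.g. t=40, g=3, k=4).
    """
    argv = ["python", "main.py", "-fa", fasta, "-seq", seq_type]
    for name, value in params.items():
        argv += ["-" + name, str(value)]
    for flag in features:
        argv += ["-" + flag, "1"]  # 1 means On, per Table 1
    return argv


if __name__ == "__main__":
    cmd = build_command("anyFASTA.fasta", features=["bpf", "blosum62"], t=40)
    print(" ".join(cmd))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` from the directory that contains main.py.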
4. Merge Feature:
Input Format:
user@machine:~$ python merge.py <anyFileName> <anyFileName.npy> <anyFileName.npy>
user@machine:~$ python merge.py <anyFileName> <anyFileName.npy> <anyFileName.npy> <anyFileName.npy>
Example-1:
- Merge two files:
user@machine:~$ python merge.py merge fg11-16.npy fg11-32.npy
Example-2:
- Merge three files, or merge multiple files:
user@machine:~$ python merge.py merge fg11-16.npy fg11-32.npy fg11-32.npy
Note: The PyFeat-2.x tool can merge multiple datasets.
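For readers who want to combine feature files directly in Python, one plausible reading of "merge" is column-wise concatenation of feature matrices that describe the same sequences in the same order. That reading is an assumption about merge.py, not something stated above, and `merge_features` is a hypothetical helper.

```python
import os
import tempfile

import numpy as np


def merge_features(arrays):
    """Concatenate 2-D feature matrices column-wise.

    Assumes one row per sequence, so every matrix must have the same
    number of rows. This may differ from what merge.py actually does.
    """
    rows = {a.shape[0] for a in arrays}
    if len(rows) != 1:
        raise ValueError("all feature matrices must have the same number of rows")
    return np.concatenate(arrays, axis=1)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Stand-in files so the sketch runs without real PyFeat output.
        first = os.path.join(tmp, "first.npy")
        second = os.path.join(tmp, "second.npy")
        np.save(first, np.ones((3, 16)))
        np.save(second, np.zeros((3, 32)))
        merged = merge_features([np.load(first), np.load(second)])
        print(merged.shape)
```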