Headers suggestions #96

Open
ghuls opened this issue Sep 5, 2017 · 7 comments

Comments

@ghuls

ghuls commented Sep 5, 2017

It would be nice to have some additional header related options.

In the files I work with, there is very often a header line which starts with a comment.

#chrom  txStart txEnd   name2
chr1    66999638        67216822        SGIP1
chr1    16767166        16786584        NECAP2
chr1    48998526        50489626        AGBL4

# chrom  txStart txEnd   name2
chr1    66999638        67216822        SGIP1
chr1    16767166        16786584        NECAP2
chr1    48998526        50489626        AGBL4

It would be nice to have an option to strip off the leading '#' or '# ' from the header name.

xsv input --header-uncomment
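
A minimal Python sketch of the requested behaviour, for illustration only. It is not xsv code: `uncomment_header` is a hypothetical name, and the sketch assumes tab-delimited input and that only the header line carries the marker.

```python
def uncomment_header(line: str, comment: str = "#") -> str:
    """Strip one leading comment marker, and any spaces after it, from a header line."""
    if line.startswith(comment):
        return line[len(comment):].lstrip(" ")
    return line  # no marker: leave the line untouched


# Both '#chrom' and '# chrom' style headers give the same column names.
header = uncomment_header("# chrom\ttxStart\ttxEnd\tname2")
print(header.split("\t"))
```

Only the first field is affected, because the marker sits at the start of the line.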

Some other files have multiple commented lines:

##fileformat=VCFv4.0
##FILTER=<ID=NOT_POLY_IN_1000G,Description="Alternate allele count = 0">
##FILTER=<ID=amb,Description="Ambiguous SNP.  Could not determine true forward strand. May have ref/alt mismatches">
##FILTER=<ID=dup,Description="Duplicate assay at same position with worse Gentrain Score">
##FILTER=<ID=id10,Description="Within 10 bp of an known indel">
##FILTER=<ID=id20,Description="Within 20 bp of an known indel">
##FILTER=<ID=id5,Description="Within 5 bp of an known indel">
##FILTER=<ID=id50,Description="Within 50 bp of an known indel">
##FILTER=<ID=refN,Description="Reference base is N. Assay is designed for 2 alt alleles">
##FORMAT=<ID=GC,Number=.,Type=Float,Description="Gencall Score">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FilterLiftedVariants="analysis_type=FilterLiftedVariants input_file=[] sample_metadata=[] read_buffer_size=null phone_home=STANDARD read_filter=[] intervals=null excludeIntervals=null reference_sequence=/humgen/1kg/reference/human_g1k_v37.fasta rodBind=[/gap/birdsuite/1kg/0.928975161471502.sorted.vcf] rodToIntervalTrackName=null BTI_merge_rule=UNION DBSNP=null downsampling_type=null downsample_to_fraction=null downsample_to_coverage=null baq=OFF baqGapOpenPenalty=40.0 performanceLog=null useOriginalQualities=false defaultBaseQualities=-1 validation_strictness=SILENT unsafe=null num_threads=1 interval_merging=ALL read_group_black_list=null processingTracker=null restartProcessingTracker=false processingTrackerStatusFile=null processingTrackerID=-1 allow_intervals_with_unindexed_bam=false enable_experimental_low_memory_sharding=false logging_level=INFO log_to_file=null quiet_output_mode=false debug_mode=false help=false out=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub NO_HEADER=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub sites_only=org.broadinstitute.sting.gatk.io.stubs.VCFWriterStub"
##INFO=<ID=CR,Number=.,Type=Float,Description="SNP Callrate">
##INFO=<ID=GentrainScore,Number=.,Type=Float,Description="Gentrain Score">
##INFO=<ID=HW,Number=.,Type=Float,Description="Hardy-Weinberg Equilibrium">
##reference=human_g1k_v37.fasta
##source=infiniumFinalReportConverterV1.0
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
chr1    82154   rs4477212       A       G       .       PASS    CR=100.0;GentrainScore=0.7826;HW=1.0
chr1    534247  SNP1-524110     C       T       .       PASS    CR=99.93414;GentrainScore=0.7423;HW=1.0
chr1    565286  SNP1-555149     C       T       .       PASS    CR=98.8266;GentrainScore=0.7029;HW=1.0
chr1    569624  SNP1-559487     T       C       .       PASS    CR=97.8022;GentrainScore=0.8070;HW=1.0
chr1    689186  rs4000335       G       A       .       NOT_POLY_IN_1000G       CR=99.86885;GentrainScore=0.7934;HW=1.0
chr1    723918  SNP1-713781     G       A       .       PASS    CR=99.933174;GentrainScore=0.4541;HW=0.30050507
chr1    729632  SNP1-719495     C       T       .       PASS    CR=99.409645;GentrainScore=0.6870;HW=1.0
chr1    752566  rs3094315       G       A       .       PASS    CR=99.896645;GentrainScore=0.8141;HW=2.0501487E-8
chr1    752721  rs3131972       A       G       .       PASS    CR=99.90196;GentrainScore=0.8578;HW=7.131948E-9
chr1    754063  SNP1-743926     G       T       .       PASS    CR=99.933556;GentrainScore=0.5893;HW=0.7085589
chr1    756652  SNP1-746515     T       G       .       NOT_POLY_IN_1000G       CR=100.0;GentrainScore=0.6899;HW=1.0
chr1    757691  SNP1-747554     T       C       .       PASS    CR=99.865814;GentrainScore=0.5544;HW=0.01883173

In this case it would be useful to be able to skip the first 17 lines and use line 18 as header:

xsv input --skip-lines 1-17 --header-uncomment
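
The effect of this command can be sketched in Python as follows. This is an illustration under assumptions, not xsv's implementation: `read_after_skipping` is a hypothetical helper, the input is assumed to be tab-delimited, and the sample is shortened to two metadata lines.

```python
import csv
import io
from itertools import islice


def read_after_skipping(text, skip, delimiter="\t"):
    """Drop the first `skip` lines; treat the first remaining row as the header."""
    remaining = islice(io.StringIO(text), skip, None)
    rows = list(csv.reader(remaining, delimiter=delimiter))
    header, records = rows[0], rows[1:]
    # Strip leading '#' characters and spaces from the first column name.
    header[0] = header[0].lstrip("# ")
    return header, records


sample = (
    "##fileformat=VCFv4.0\n"
    "##reference=human_g1k_v37.fasta\n"
    "#CHROM\tPOS\tID\n"
    "chr1\t82154\trs4477212\n"
)
header, records = read_after_skipping(sample, skip=2)
print(header, records)
```

Skipping by line count requires knowing how many metadata lines a file has, which varies between files.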

An option to skip commented lines (with a --comment argument to specify the comment character or string) would be nice too:

xsv input --skip-commented-lines --comment '#'
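
A rough Python sketch of comment skipping, assuming a line counts as a comment when it starts with the comment string. `skip_commented` is a hypothetical name and the code does not reflect how xsv would implement the option.

```python
import csv


def skip_commented(lines, comment="#"):
    """Yield only the lines that do not start with the comment string."""
    for line in lines:
        if not line.startswith(comment):
            yield line


sample = [
    "##fileformat=VCFv4.0\n",
    "#CHROM\tPOS\tID\n",
    "chr1\t82154\trs4477212\n",
]
rows = list(csv.reader(skip_commented(sample), delimiter="\t"))
print(rows)
```

With '#' as the comment string, the '#CHROM' header line is dropped along with the '##' metadata lines, so the column names are lost. Passing comment="##" instead would keep that line.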

Also, an option to set a header would be nice. It could be used to set headers on files that have no header or, combined with the --skip-commented-lines option, to set a header when the header line itself is commented out.

xsv input --no-headers --set-headers 'CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO' --skip-commented-lines --comment '#'
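
What this combination would do can be sketched with Python's standard csv module. This is an illustration, not xsv code: `with_headers` is a hypothetical helper and the input is assumed to be tab-delimited.

```python
import csv


def with_headers(lines, headers, comment="#", delimiter="\t"):
    """Drop commented lines, then label every remaining row with the given headers."""
    data = (line for line in lines if not line.startswith(comment))
    # Passing fieldnames makes DictReader treat the first row as data, not as a header.
    return list(csv.DictReader(data, fieldnames=headers, delimiter=delimiter))


sample = [
    "##fileformat=VCFv4.0\n",
    "#CHROM\tPOS\tID\n",
    "chr1\t82154\trs4477212\n",
]
records = with_headers(sample, "CHROM,POS,ID".split(","))
print(records[0]["POS"])
```

Supplying the names explicitly also covers files that have no header line at all.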

--set-headers, --skip-commented-lines and --comment are probably the most generally applicable options to implement.

The following option seems to be missing in input, which seems odd to me:

xsv input --no-headers

@BurntSushi
Owner

The underlying CSV reader supports commented lines, so I think this should be straightforward to add.

@Sue9104

Sue9104 commented Sep 14, 2017

I have the same question. Preprocessing with commands like "sed" or "grep -v" before using xsv can solve the problem, but it's a little annoying, especially with xsv headers and xsv join.

@dufferzafar

Another vote for --set-headers!

@nlauchande

Another vote for --set-headers.

@onetom

onetom commented Sep 20, 2019

I was just re-evaluating various CSV libraries, so this was the first time I actually tried using xsv.

Intuitively I was expecting to find header rename functionality under the headers command.

After gathering the general thinking behind the command line arguments, we could argue for putting it under input, format, cat or even select (for people with an SQL background).

Is there a design document for the structure of the command line options somewhere?

Knowing the logic behind the design would help with learning this command line interface, and would definitely help with extending it.

In spirit, I feel xsv is trying to follow the Unix philosophy of "do one thing and do it well", but it is starting to suffer from the cat -v problem.

What I was expecting to find are the following:

  1. common options come before the sub-command name
  2. sub-commands are orthogonal

It's okay if it doesn't work this way, but it would be helpful to state that somewhere.

@BurntSushi
Owner

BurntSushi commented Sep 20, 2019

There is no design document. The Unix philosophy is a means to an end. Please focus on specific and actionable issues and avoid abstract discussion of philosophies that are so broad that they can be interpreted in any one of a number of ways. Linking to catv, for example, is almost certainly unhelpful. (I empathize with a lot of the opinions expressed on catv, but I mostly see the web site as a place where a bunch of people go to whine. Sorry.)

@62mkv

62mkv commented Jun 7, 2023

I would also love to have an option to "add headers" (or, actually, --set-headers as suggested above, which would be less ambiguous). Maybe I'd even give it a shot, if @BurntSushi would approve. This is such a useful little thing, this xsv... ❤️
