Вопрос по – Code Golf: быстро создать список ключевых слов из текста, включая количество экземпляров

12

Я уже разработал это решение для себя с помощью PHP, но мне любопытно, как это можно сделать по-другому - даже лучше. Два языка, которые меня в первую очередь интересуют, это PHP и Javascript, но мне было бы интересно посмотреть, как быстро это можно сделать и на любом другом основном языке сегодня (в основном C #, Java и т. Д.).

Return only words with an occurrence greater than X Return only words with a length greater than Y Ignore common terms like "and, is, the, etc" Feel free to strip punctuation prior to processing (ie. "John's" becomes "John") Return results in a collection/array

Дополнительный кредит

Keep Quoted Statements together, (ie. "They were 'too good to be true' apparently")
Where 'too good to be true' would be the actual statement

Экстра-Экстра Кредит

Can your script determine words that should be kept together based upon their frequency of being found together? This being done without knowing the words beforehand. Example: *"The fruit fly is a great thing when it comes to medical research. Much study has been done on the fruit fly in the past, and has lead to many breakthroughs. In the future, the fruit fly will continue to be studied, but our methods may change."* Clearly the word here is "fruit fly," which is easy for us to find. Can your search'n'scrape script determine this too?

Исходный текст:http://sampsonresume.com/labs/c.txt

Формат ответа

It would be great to see the results of your code, output, in addition to how long the operation lasted.
Мы, ребята из НЛП, делаем это постоянно. Это называется очисткой данных, построением языковой модели и последующим декодированием языковой модели (возможно, с использованием метрики недоумения). Для чистых исходных данных, конечно, shell-скрипты. bmargulies
Я вижу, вы удалили немного о цитируемых высказываниях. Следует ли тогда игнорировать кавычки, а как насчет слов внутри них? Noldorin
Error: User Rate Limit Exceeded Sampson
Error: User Rate Limit Exceededflag all code golf questions as spam or offensive день. Tim Post♦

Ваш Ответ

13   ответов
1

REBOL

Error: User Rate Limit Exceeded

min-length: 0
min-count: 0

common-words: [ "a" "an" "as" "and" "are" "by" "for" "from" "in" "is" "it" "its" "the" "of" "or" "to" "until" ]

add-word: func [
    word [string!]
    /local
        count
        letter
        non-letter
        temp
        rules
        match
][    
    ; Strip out punctuation
    temp: copy {}
    letter: charset [ #"a" - #"z" #"A" - #"Z" #" " ]
    non-letter: complement letter
    rules: [
        some [
            copy match letter (append temp match)
            |
            non-letter
        ]
    ]
    parse/all word rules
    word: temp

    ; If we end up with nothing, bail
    if 0 == length? word [
        exit
    ]

    ; Check length
    if min-length > length? word [
        exit
    ]

    ; Ignore common words
    ignore: 
    if find common-words word [
        exit
    ]

    ; OK, its good. Add it.
    either found? count: select words word [
        words/(word): count + 1
    ][
        repend words [word 1]
    ]
]

rules: [
    some [
        {"}
        copy word to {"} (add-word word)
        {"}
        |
        copy word to { } (add-word word)
        { }
    ]
    end
]

words: copy []
parse/all read %c.txt rules

result: copy []
foreach word words [
    if string? word [
        count: words/:word
        if count >= min-count [
            append result word
        ]
    ]
]

sort result
foreach word result [ print word ]

Error: User Rate Limit Exceeded

act
actions
all
allows
also
any
appear
arbitrary
arguments
assign
assigned
based
be
because
been
before
below
between
braces
branches
break
builtin
but
C
C like any other language has its blemishes Some of the operators have the wrong precedence some parts of the syntax could be better
call
called
calls
can
care
case
char
code
columnbased
comma
Comments
common
compiler
conditional
consisting
contain
contains
continue
control
controlflow
criticized
Cs
curly brackets
declarations
define
definitions
degree
delimiters
designated
directly
dowhile
each
effect
effects
either
enclosed
enclosing
end
entry
enum
evaluated
evaluation
evaluations
even
example
executed
execution
exert
expression
expressionExpressions
expressions
familiarity
file
followed
following
format
FORTRAN
freeform
function
functions
goto
has
high
However
identified
ifelse
imperative
include
including
initialization
innermost
int
integer
interleaved
Introduction
iterative
Kernighan
keywords
label
language
languages
languagesAlthough
leave
limit
lineEach
loop
looping
many
may
mimicked
modify
more
most
name
needed
new
next
nonstructured
normal
object
obtain
occur
often
omitted
on
operands
operator
operators
optimization
order
other
perhaps
permits
points
programmers
programming
provides
rather
reinitialization
reliable
requires
reserve
reserved
restrictions
results
return
Ritchie
say
scope
Sections
see
selects
semicolon
separate
sequence
sequence point
sequential
several
side
single
skip
sometimes
source
specify
statement
statements
storage
struct
Structured
structuresAs
such
supported
switch
syntax
testing
textlinebased
than
There
This
turn
type
types
union
Unlike
unspecified
use
used
uses
using
usually
value
values
variable
variables
variety
which
while
whitespace
widespread
will
within
writing
Error: User Rate Limit Exceeded
Error: User Rate Limit Exceeded
Error: User Rate Limit Exceeded
0

Error: User Rate Limit Exceeded

$str = implode(file('c.txt'));
$tok = strtok($str, " .,;()\r\n\t");

$splitters = '\s.,\(\);?:'; // string splitters
$array = preg_split( "/[" . $splitters . "]*\\\"([^\\\"]+)\\\"[" . $splitters . "]*|[" . $splitters . "]+/", $str, 0, PREG_SPLIT_DELIM_CAPTURE );

foreach($array as $key) {
    $res[$key] = $res[$key]+1;
}

$splitters = '\s.,\(\)\{\};?:'; // string splitters
$array = preg_split( "/[" . $splitters . "]*\\\"([^\\\"]+)\\\"[" . $splitters . "]*|[" . $splitters . "]+/", $str, 0, PREG_SPLIT_DELIM_CAPTURE );

foreach($array as $key) {
    $res[$key] = $res[$key]+1;
}

unset($res['the']);
unset($res['and']);
unset($res['to']);
unset($res['of']);
unset($res['by']);
unset($res['a']);
unset($res['as']);
unset($res['is']);
unset($res['in']);
unset($res['']);

arsort($res);
//var_dump($res); // concordance
foreach ($res AS $word => $rarity)
    echo $word . ' <b>x</b> ' . $rarity . '<br/>';

foreach ($array as $word) { // words longer than n (=5)
//    if(strlen($word) > 5)echo $word.'<br/>';
}

Error: User Rate Limit Exceeded

statement x 7
be x 7
C x 5
may x 5
for x 5
or x 5
The x 5
as x 5
expression x 4
statements x 4
code x 4
function x 4
which x 4
an x 4
declarations x 3
new x 3
execution x 3
types x 3
such x 3
variables x 3
can x 3
languages x 3
operators x 3
end x 2
programming x 2
evaluated x 2
functions x 2
definitions x 2
keywords x 2
followed x 2
contain x 2
several x 2
side x 2
most x 2
has x 2
its x 2
called x 2
specify x 2
reinitialization x 2
use x 2
either x 2
each x 2
all x 2
built-in x 2
source x 2
are x 2
storage x 2
than x 2
effects x 1
including x 1
arguments x 1
order x 1
even x 1
unspecified x 1
evaluations x 1
operands x 1
interleaved x 1
However x 1
value x 1
branches x 1
goto x 1
directly x 1
designated x 1
label x 1
non-structured x 1
also x 1
enclosing x 1
innermost x 1
loop x 1
skip x 1
There x 1
within x 1
switch x 1
Expressions x 1
integer x 1
variety x 1
see x 1
below x 1
will x 1
on x 1
selects x 1
case x 1
executed x 1
based x 1
calls x 1
from x 1
because x 1
many x 1
widespread x 1
familiarity x 1
C's x 1
mimicked x 1
Although x 1
reliable x 1
obtain x 1
results x 1
needed x 1
other x 1
syntax x 1
often x 1
Introduction x 1
say x 1
Programming x 1
Language x 1
C, like any other language, has its blemishes. Some of the operators have the wrong precedence; some parts of the syntax could be better. x 1
Ritchie x 1
Kernighan x 1
been x 1
criticized x 1
For x 1
example x 1
care x 1
more x 1
leave x 1
return x 1
call x 1
&& x 1
|| x 1
entry x 1
include x 1
next x 1
before x 1
sequence point x 1
sequence x 1
points x 1
comma x 1
operator x 1
but x 1
compiler x 1
requires x 1
programmers x 1
exert x 1
optimization x 1
object x 1
This x 1
permits x 1
high x 1
degree x 1
occur x 1
Structured x 1
using x 1
struct x 1
union x 1
enum x 1
define x 1
Declarations x 1
file x 1
contains x 1
Function x 1
turn x 1
assign x 1
perhaps x 1
Keywords x 1
char x 1
int x 1
Sections x 1
name x 1
variable x 1
reserve x 1
usually x 1
writing x 1
type x 1
Each x 1
line x 1
format x 1
rather x 1
column-based x 1
text-line-based x 1
whitespace x 1
arbitrary x 1
FORTRAN x 1
77 x 1
free-form x 1
allows x 1
restrictions x 1
Comments x 1
C99 x 1
following x 1
// x 1
until x 1
*/ x 1,
/* x 1
appear x 1
between x 1
delimiters x 1
enclosed x 1
braces x 1
supported x 1
if x 1
-else x 1
conditional x 1
Unlike x 1
reserved x 1
sequential x 1
provides x 1
control-flow x 1
identified x 1
do-while x 1
while x 1
any x 1
omitted x 1
break x 1
continue x 1
expressions x 1
testing x 1
iterative x 1
looping x 1
separate x 1
initialization x 1
normal x 1
modify x 1
control x 1
structures x 1
As x 1
imperative x 1
single x 1
act x 1
sometimes x 1
curly brackets x 1
limit x 1
scope x 1
language x 1
uses x 1
evaluation x 1
assigned x 1
values x 1
To x 1
effect x 1
semicolon x 1
actions x 1
common x 1
consisting x 1
used x 1

var_dumpError: User Rate Limit Exceeded

Error: User Rate Limit Exceeded0.047Error: User Rate Limit ExceededfileError: User Rate Limit Exceeded

Error: User Rate Limit Exceeded Sampson
Error: User Rate Limit Exceeded
3

C# 3.0 (with LINQ)

Error: User Rate Limit Exceeded

public static Dictionary<string, int> GetKeywords(string text, int minCount, int minLength)
{
    var commonWords = new string[] { "and", "is", "the", "as", "of", "to", "or", "in",
        "for", "by", "an", "be", "may", "has", "can", "its"};
    var words = Regex.Replace(text.ToLower(), @"[,.?\/;:\(\)]", string.Empty).Split(' ');
    var occurrences = words.Distinct().Except(commonWords).Select(w =>
        new { Word = w, Count = words.Count(s => s == w) });
    return occurrences.Where(wo => wo.Count >= minCount && wo.Word.Length >= minLength)
        .ToDictionary(wo => wo.Word, wo => wo.Count);
}

Error: User Rate Limit ExceededO(n^2)Error: User Rate Limit ExceededO(n)Error: User Rate Limit Exceeded

Error: User Rate Limit Exceeded

  3 x such
  4 x code
  4 x which
  4 x declarations
  5 x function
  4 x statements
  3 x new
  3 x types
  3 x keywords
  7 x statement
  3 x language
  3 x expression
  3 x execution
  3 x programming
  4 x operators
  3 x variables

Error: User Rate Limit Exceeded

static void Main(string[] args)
{
    string sampleText;
    using (var client = new WebClient())
        sampleText = client.DownloadString("http://sampsonresume.com/labs/c.txt");
    var keywords = GetKeywords(sampleText, 3, 2);
    foreach (var entry in keywords)
        Console.WriteLine("{0} x {1}", entry.Value.ToString().PadLeft(3), entry.Key);
    Console.ReadKey(true);
}
Error: User Rate Limit Exceeded
Error: User Rate Limit Exceeded
Error: User Rate Limit Exceeded
Error: User Rate Limit Exceeded
Error: User Rate Limit Exceeded Sampson
2

Error: User Rate Limit Exceeded

x=3;y=2;W="and is the as of to or in for by an be may has can its".split()
from itertools import groupby as gb
d=dict((w,l)for w,l in((w,len(list(g)))for w,g in
    gb(sorted(open("c.txt").read().lower().split())))
    if l>x and len(w)>y and w not in W)

Error: User Rate Limit Exceeded

# High and low count boundaries.
x = 3
y = 2

# Common words string split into a list by spaces.
Words = "and is the as of to or in for by an be may has can its".split()

# A special function that groups similar strings in a list into a 
# (string, grouper) pairs. Grouper is a generator of occurences (see below).
from itertools import groupby

# Reads the entire file, converts it to lower case and splits on whitespace 
# to create a list of words
sortedWords = sorted(open("c.txt").read().lower().split())

# Using the groupby function, groups similar words together.
# Since grouper is a generator of occurences we need to use len(list(grouper)) 
# to get the word count by first converting the generator to a list and then
# getting the length of the list.
wordCounts = ((word, len(list(grouper))) for word, grouper in groupby(sortedWords))

# Filters the words by number of occurences and common words using yet another 
# list comprehension.
filteredWordCounts = ((word, count) for word, count in wordCounts if word not in Words and count > x and len(word) > y)

# Creates a dictionary from the list of tuples.
result = dict(filteredWordCounts)

print result

Error: User Rate Limit Exceeded

Error: User Rate Limit Exceeded

{'function': 4, 'operators': 4, 'declarations': 4, 'which': 4, 'statement': 5}
Error: User Rate Limit Exceededfrom itertools import groupby as gbError: User Rate Limit Exceededfrom itertools import*Error: User Rate Limit Exceeded
Error: User Rate Limit Exceeded'may'Error: User Rate Limit Exceeded
Error: User Rate Limit Exceeded Sampson
2

Error: User Rate Limit Exceeded

IEnumerable<KeyValuePair<String, Int32>> ProcessText(String text, int X, int Y)
{
    // common words, that will be ignored
    var exclude = new string[] { "and", "is", "the", "as", "of", "to", "or", "in", "for", "by", "an", "be", "may", "has", "can", "its" }.ToDictionary(word => word);
    // regular expression to find quoted text
    var regex = new Regex("\"[^\"]\"", RegexOptions.Compiled);

    return
        // remove quoted text (it will be processed later)
        regex.Replace(text, "")
        // remove case dependency
        .ToLower()
        // split text by all these chars
        .Split(".,'\\/[]{}()`[email protected]#$%^&*-=+?!;:<>| \n\r".ToCharArray())
        // add quoted text
        .Concat(regex.Matches(text).Cast<Match>().Select(match => match.Value))
        // group words by the word and count them
        .GroupBy(word => word, (word, words) => new KeyValuePair<String, Int32>(word, words.Count()))
        // apply filter(min word count and word length) and remove common words 
        .Where(pair => pair.Value >= X && pair.Key.Length >= Y && !exclude.ContainsKey(pair.Key));
}

Error: User Rate Limit Exceeded

3 x languages
3 x such
4 x code
4 x which
3 x based
3 x each
4 x declarations
5 x function
4 x statements
3 x new
3 x types
3 x keywords
3 x variables
7 x statement
4 x expression
3 x execution
3 x programming
3 x operators
1

PythonError: User Rate Limit Exceeded

W="and is the as of to or in for by an be may has can its".split()
x=3;y=2;d={}
for l in open('c.txt') :
    for w in l.lower().translate(None,',.;\'"!()[]{}').split() :
        if w not in W: d[w]=d.get(w,0)+1
for w,n in d.items() :
    if n>y and len(w)>x : print n,w

Error: User Rate Limit Exceeded

4 code
3 keywords
3 languages
3 execution
3 each
3 language
4 expression
4 statements
3 variables
7 statement
5 function
4 operators
4 declarations
3 programming
4 which
3 such
3 types
Error: User Rate Limit Exceeded Sampson
1

Error: User Rate Limit Exceeded

  1. Use LINQ, specifically groupby, then filter by group count, and return a flattened (selectmany) list.

  2. Use LINQ, filter by length.

  3. Use LINQ, filter with 'badwords'.Contains.

Error: User Rate Limit Exceeded
0

Error: User Rate Limit ExceededError: User Rate Limit ExceededError: User Rate Limit ExceededError: User Rate Limit ExceededError: User Rate Limit ExceededError: User Rate Limit Exceeded).

Error: User Rate Limit Exceededto_SError: User Rate Limit ExceededError: User Rate Limit ExceededError: User Rate Limit Exceeded

Error: User Rate Limit ExceededError: User Rate Limit Exceeded.

#!/usr/bin/perl

use strict; use warnings;

use Lingua::EN::Inflect::Number qw( to_S );
use Lingua::StopWords qw( getStopWords );
use Text::ParseWords;

my $stop = getStopWords('en');

my %words;

while ( my $line = <> ) {
    chomp $line;
    next unless $line =~ /\S/;
    next unless my @words = parse_line(' ', 1, $line);

    ++ $words{to_S $_} for
        grep { length and not $stop->{$_} }
        map { s!^[[:punct:]]+!!; s![[:punct:]]+\z!!; lc }
        @words;
}

print "=== only words appearing 4 or more times ===\n";
print "$_ : $words{$_}\n" for sort {
    $words{$b} <=> $words{$a}
} grep { $words{$_} > 3 } keys %words;

print "=== only words that are 12 characters or longer ===\n";
print "$_ : $words{$_}\n" for sort {
    $words{$b} <=> $words{$a}
} grep { 11 < length } keys %words;

Error: User Rate Limit Exceeded

=== only words appearing 4 or more times ===
statement : 11
function : 7
expression : 6
may : 5
code : 4
variable : 4
operator : 4
declaration : 4
c : 4
type : 4
=== only words that are 12 characters or longer ===
reinitialization : 2
control-flow : 1
sequence point : 1
optimization : 1
curly brackets : 1
text-line-based : 1
non-structured : 1
column-based : 1
initialization : 1
4

F#Error: User Rate Limit Exceeded

let f =
    let bad = Set.of_seq ["and";"is";"the";"of";"are";"by";"it"]
    fun length occurrence msg ->
        System.Text.RegularExpressions.Regex.Split(msg, @"[^\w-']+")
        |> Seq.countBy (fun a -> a)
        |> Seq.choose (fun (a, b) -> if a.Length > length && b > occurrence && (not <| bad.Contains a) then Some a else None)
Error: User Rate Limit Exceeded Sampson
Error: User Rate Limit Exceeded
Error: User Rate Limit ExceededcountByError: User Rate Limit Exceeded
Error: User Rate Limit Exceeded
Error: User Rate Limit Exceeded
3
#! perl
use strict;
use warnings;

while (<>) {
  for my $word (split) {
    $words{$word}++;
  }
}
for my $word (keys %words) {
  print "$word occurred $words{$word} times.";
}

Error: User Rate Limit Exceeded

while (<>) {
  for my $word (split) {
    $words{$word}++;
  }
}
for my $word (keys %words) {
  if ((length($word) >= $MINLEN) && ($words{$word) >= $MIN_OCCURRENCE) {
    print "$word occurred $words{$word} times.";
  }
}

Вы также можете довольно легко отсортировать вывод:

...
for my $word (keys %words) {
  if ((length($word) >= $MINLEN) && ($words{$word) >= $MIN_OCCURRENCE) {
    push @output, "$word occurred $words{$word} times.";
  }
}
$re = qr/occurred (\d+) /;
print sort {
  $a = $a =~ $re;
  $b = $b =~ $re;
  $a <=> $b
} @output;

Error: User Rate Limit Exceeded



Brad

Edit: this is how I would rewrite this last example

...
for my $word (
  sort { $words{$a} <=> $words{$b} } keys %words
){
  next unless length($word) >= $MINLEN;
  last unless $words{$word) >= $MIN_OCCURRENCE;

  print "$word occurred $words{$word} times.";
}

Or if I needed it to run faster I might even write it like this:

for my $word_data (
  sort {
    $a->[1] <=> $b->[1] # numerical sort on count
  } grep {
    # remove values that are out of bounds
    length($_->[0]) >= $MINLEN &&      # word length
    $_->[1] >= $MIN_OCCURRENCE # count
  } map {
    # [ word, count ]
    [ $_, $words{$_} ]
  } keys %words
){
  my( $word, $count ) = @$word_data;
  print "$word occurred $count times.";
}

Error: User Rate Limit Exceeded( it does so it in that order )

This is a slight variant of the Error: User Rate Limit Exceeded.

Error: User Rate Limit Exceeded
Error: User Rate Limit Exceeded
Error: User Rate Limit Exceeded Sampson
Error: User Rate Limit Exceeded
Error: User Rate Limit Exceeded
3

Ruby

Error: User Rate Limit Exceededarray#injectError: User Rate Limit Exceeded

Error: User Rate Limit Exceeded

Error: User Rate Limit Exceeded

Implementation

CommonWords = %w(the a an but and is not or as of to in for by be may has can its it's)
def get_keywords(text, minFreq=0, minLen=2)
  text.scan(/(?:\b)[a-z'-]{#{minLen},}(?=\b)/i).
    inject(Hash.new(0)) do |result,w|
      w.downcase!
      result[w] += 1 unless CommonWords.include?(w)
      result
    end.select { |k,n| n >= minFreq }
end

Test Rig

require 'net/http'

keywords = get_keywords(Net::HTTP.get('www.sampsonresume.,com','/labs/c.txt'), 3)
keywords.sort.each { |name,count| puts "#{name} x #{count} times" }

Test Results

code x 4 times
declarations x 4 times
each x 3 times
execution x 3 times
expression x 4 times
function x 5 times
keywords x 3 times
language x 3 times
languages x 3 times
new x 3 times
operators x 4 times
programming x 3 times
statement x 7 times
statements x 4 times
such x 3 times
types x 3 times
variables x 3 times
which x 4 times
11

GNU scripting

sed -e 's/ /\n/g' | grep -v '^ *$' | sort | uniq -c | sort -nr

Результаты:

  7 be
  6 to
[...]
  1 2.
  1 -

С появлением больше чем X:

sed -e 's/ /\n/g' | grep -v '^ *$' | sort | uniq -c | awk '$1>X'

Error: User Rate Limit Exceeded

sed -e 's/ /\n/g' | grep -v '^ *$' | grep .... | sort | uniq -c

Не обращайте внимания на общие термины, такие как & quot ;, и т. Д. & Quot; (при условии, что общие термины находятся в файле "игнорируются")

sed -e 's/ /\n/g' | grep -v '^ *$' | grep -vf ignored | sort | uniq -c

Не стесняйтесь удалять пунктуацию перед обработкой (то есть. «Джон» становится «Джоном»):

sed -e 's/[,.:"\']//g;s/ /\n/g' | grep -v '^ *$' | sort | uniq -c

Error: User Rate Limit Exceeded

Error: User Rate Limit Exceeded Sampson
Error: User Rate Limit Exceeded
Error: User Rate Limit Exceeded
6

Perl in only 43 characters.

perl -MYAML -anE'$_{$_}[email protected];say Dump\%_'

Вот пример его использования:

echo a a a b b c  d e aa | perl -MYAML -anE'$_{$_}[email protected];say Dump \%_'

---
a: 3
aa: 1
b: 2
c: 1
d: 1
e: 1

Error: User Rate Limit Exceeded

perl -MYAML -anE'$_{lc$_}[email protected];say Dump\%_'

For it to work on the specified text requires 58 characters.

curl http://sampsonresume.com/labs/c.txt |
perl -MYAML -F'\W+' -anE'$_{lc$_}[email protected];END{say Dump\%_}'
real    0m0.679s
user    0m0.304s
sys     0m0.084s

Вот последний пример, немного расширенный.

#! perl
use 5.010;
use YAML;

while( my $line = <> ){
  for my $elem ( split '\W+', $line ){
    $_{ lc $elem }++
  }
  END{
    say Dump \%_;
  }
}
Error: User Rate Limit Exceeded Sampson
Error: User Rate Limit Exceeded
Error: User Rate Limit Exceeded
Error: User Rate Limit Exceeded
Error: User Rate Limit Exceeded

Похожие вопросы